Training AI models on the suttas

Definitely a use case I encounter repeatedly.

For me, one of the potentially exciting parts of training AI models on the suttas is that it allows new, more natural queries that don’t need to hit the exact keywords of a search.

E.g. “Which layperson said to a monk that, if the monk doesn’t teach the Dhamma, then he the layperson will teach the monk the Dhamma?” (Answer: Ugga the Householder) doesn’t yield good results in a Google search on SuttaCentral, but such queries might be a good fit for ChatGPT-like interfaces to AI models trained on the suttas!
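A minimal sketch of how such fuzzy queries can work: instead of matching keywords, embed the passages and the question into the same vector space and rank by similarity. This assumes the `sentence-transformers` package; the two passages are illustrative stand-ins, not a real SuttaCentral corpus.

```python
# Semantic search sketch: rank passages by meaning, not keywords.
# Assumes `pip install sentence-transformers`; passages are made-up stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Ugga the householder said: if the monk does not teach me the Dhamma, "
    "I will teach the Dhamma to him.",
    "The Buddha praised renunciation as the escape from sensual pleasures.",
]

query = ("Which layperson said that if the monk doesn't teach the Dhamma, "
         "the layperson will teach the monk the Dhamma?")

# Embed corpus and query, then rank by cosine similarity.
corpus_emb = model.encode(passages, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
best = util.semantic_search(query_emb, corpus_emb, top_k=1)[0][0]
print(passages[best["corpus_id"]], best["score"])
```

None of the query’s words need to appear verbatim in the passage; the embeddings capture the paraphrase, which is exactly what a keyword search misses.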

I think the AI models really haven’t been trained on large enough data sets. Besides ChatGPT, recently I’ve also been playing with DALL-E, feeding it specific phrases from the suttas, e.g. “Four Noble Truths”.

It’s quite telling that the generated images are very clichéd Buddhist images, often with very strange alien gibberish, which is symptomatic of a model trained on insufficient data.
E.g. the prompt “There is what is given and what is offered and what is sacrificed; there is fruit and result of good and bad actions” yielded this alien masterpiece:

Very exciting!
The other potential benefit is the use of more natural query language. In addition to searching “renunciation”, you could ask “What did the Buddha teach about renunciation? Provide references from the Pali Canon and Chinese parallels.”
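One way this could look in practice is retrieval-augmented generation: fetch candidate passages (e.g. with the embedding search sketched above), then ask a chat model to answer with citations. The passage IDs and texts below are placeholders, not fetched from SuttaCentral, and the assembled prompt could be sent to any chat-completion API.

```python
# Retrieval-augmented query sketch. The retrieved passages are
# hypothetical placeholders; a real system would pull them from a
# sutta index via semantic search.
retrieved = [
    ("MN 19", "… placeholder Pali Canon passage on renunciation …"),
    ("MA 102", "… placeholder Chinese parallel passage …"),
]

question = ("What did the Buddha teach about renunciation? "
            "Provide references from the Pali Canon and Chinese parallels.")

context = "\n".join(f"[{ref}] {text}" for ref, text in retrieved)
prompt = (
    "Answer using only the passages below, and cite their IDs.\n\n"
    f"{context}\n\nQuestion: {question}"
)
print(prompt)  # pass this to the chat model of your choice
```

Grounding answers in retrieved passages also limits the model’s tendency to invent references, which matters more than usual when the sources are canonical texts.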

TIL that, even though Buddhist texts form one of the largest religious corpora in the world, that still isn’t big enough for training models… sigh.

Which probably points to where the underlying technology can improve. Recently, I was very surprised to learn that the ‘neurons’ in neural networks are actually a very oversimplified version of their biological counterparts… maybe improving the ‘neurons’ is where future AI gains will happen, like Moore’s Law did for semiconductor chips. :man_shrugging:
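For a sense of just how simplified: the standard artificial ‘neuron’ is only a weighted sum pushed through a fixed nonlinearity, as in the sketch below. There is no dendritic structure, spike timing, or chemistry.

```python
import math

def artificial_neuron(inputs, weights, bias):
    """The entire 'neuron' of a standard neural network."""
    # Weighted sum of inputs ...
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ... squashed through a sigmoid activation.
    return 1.0 / (1.0 + math.exp(-z))

# Example: two inputs, two weights, one bias -- that's all the state there is.
print(artificial_neuron([0.5, 0.1], [0.8, -0.3], bias=0.1))
```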
