Training AI models on the suttas

There are some interesting projects going on, and what you’ve tried here is a good example. It seems great, though not quite ready for prime time.

Yesterday I had a long discussion with a bunch of folks on the future of Buddhist texts with AI. This is under the burgeoning auspices of Linguae Dharmae (their website is not ready yet).

This was founded by Ayya @vimala and @SebastianN Nehrdich, who formerly worked on BuddhaNexus (which is still going under the U of Hamburg).

Sebastian is headed to Berkeley, where he will work with Kurt Keutzer, a Tibetan Buddhist and AI expert (who among other things developed DeepScale, which is used in Tesla’s self-driving).

Also at the meeting was Alex Wynne, whom many of you know from the Oxford Centre for Buddhist Studies.

These discussions are all very preliminary, but our goal is to establish a corpus including the entire Buddhist canon in all languages, and even the Sanskritic Brahmanical corpus, using the data model of SC’s Bilara. This would then provide training data for translation, semantic search, graphing, and the like.

One of the key challenges is working with languages that have modest-sized corpora. Buddhist texts run to millions or tens of millions of words, whereas the corpora used to train modern language models are typically orders of magnitude larger.

Ultimately you might be able to, say, search for the idea of “renunciation” and find it across Pali, Sanskrit, Tibetan, and Chinese texts. You could do statistical analysis, contextual analysis, and the like. Another application is stemming and grammatical analysis. But the possibilities are endless, so it’s a creative field.
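To make the cross-lingual search idea concrete, here is a toy sketch (not from the project itself): terms from different languages are embedded as vectors, and a query finds its nearest neighbours by cosine similarity. The tiny hand-made vectors below are stand-ins; a real system would use a multilingual embedding model trained so that words for the same concept land near each other.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical stand-in embeddings for illustration only.
embeddings = {
    "renunciation": [0.90, 0.10, 0.00],  # English
    "nekkhamma":    [0.85, 0.15, 0.10],  # Pali term for renunciation
    "craving":      [0.10, 0.90, 0.20],  # contrasting concept
}

def search(query, corpus, top_k=2):
    # Rank corpus terms by similarity to the query's vector.
    q = embeddings[query]
    ranked = sorted(corpus, key=lambda w: cosine(embeddings[w], q), reverse=True)
    return ranked[:top_k]

print(search("renunciation", ["nekkhamma", "craving"]))
# → ['nekkhamma', 'craving']: the Pali term ranks closest.
```

The same nearest-neighbour machinery scales up to whole segments of text, which is roughly how semantic search over a multilingual canon would work.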

@michaelh and @Charith are working on another AI project that’s interesting, too!
