Possible NLP (Natural Language Processing) projects or developments

For grammatical analysis, I put my hope in Hindi NLP. There are almost a billion people who can speak Hindi, including Sundar Pichai himself, so I expect some progress in the future. Some of it will be applicable to Pali due to their syntactic similarity, I believe.

Recent advances include Parsey’s Cousins:

They have applied TensorFlow to analyse the syntax of 40 languages, including Hindi. I wish I knew Hindi, so I could play with it more and see how to convert/map it to Pali.

Maybe, but I have my doubts. Hindi is probably no more closely related to Pali than, I don’t know, Latin is to English. If any lessons can be learned, they will need a great deal of tweaking. Still, given the vast numbers of Indians in IT these days, maybe something will be developed. Maybe we should reach out to some Hindi speakers and see if they have any thoughts on this.

The only thing I can contribute is that, last time I spoke with an Indian Pali professor, he said that if they wanted to make new Hindi translations, they would translate from English rather than from Pali. :disappointed:

Sanskrit, on the other hand, is very close to Pali, so anything there would be readily adaptable. I just checked Parsey’s Cousins, and they have developed models for ancient Greek and Latin, so Sanskrit/Pali is not out of the question. I wonder if it’s possible to put in a request?

It’s interesting that TensorFlow is still at a ceiling of about 85% accuracy.

Most recent advances in NLP rely on Big Data. TensorFlow and SyntaxNet need a lot of data to train, and I am afraid that doesn’t fit well with the Pali data we have. Maybe traditional NLP would suffice. If the Hindi model cannot be used, we may need to develop our own parser (which should be doable).

I will need some months to brush up my NLP knowledge after many years away from the research. I am now taking an NLP course on Coursera (in case someone is interested).

May I ask, what kind of scale are we talking about? In the Pali texts, we have maybe 2 million words in the canon, several times that if we include later literature (the later texts have a somewhat different vocabulary, content, and style, but the grammar is pretty much the same). Sanskrit literature is much bigger.

I’m not really sure how it compares, but I don’t think there’s a huge corpus in ancient Greek, for example.

I don’t have an exact answer. Google must have trained those syntax models with terabytes of data. I installed TensorFlow only a few weeks ago and haven’t done any Pali experiments with it. Once I do, I will report what I find here. For now my focus is to learn the traditional approach in an online course, and to do Pali experiments along the way.

You can expect some easy stuff (e.g. character/word frequency) from me in a few days. Should I then report it here? Or as a new topic?

Dear Ajahn @sujato,

I just mentioned this to your Metta teacher, Ajahn Chatchai, saying that if I had plenty of time, I would like to work with him on translating the whole Tipitaka! Perhaps you yourself could persuade him to do it during his birthday celebration next year? :slight_smile:


Great, please post it as a new thread.

What a fantastic birthday present for him. He’s my metta teacher, I should treat him with kindness, not cruelty!


Actually, I forgot one important detail. We’ve been talking about the Pali texts, but forgetting the role of SC in aligning parallels. Now, to be able to do meaningful NLP on Pali, Chinese, Sanskrit, and Tibetan would be great. The reality is, though, that applications for any one of these are likely to be quite primitive. (To be clear, ancient Chinese, while using mostly the same characters as modern Chinese, has a very different idiom and usage, and it’s not clear how modern language processing might apply.) Moreover, the ability to do this across these languages, treating them as a single corpus, is even more remote.

What would be much closer to the realm of the possible, though, is to do NLP on English translations. Now, clearly, as I mentioned before, this will not apply to many kinds of analysis. However, for higher level semantic analysis it might be possible. For example, one might want to research a particular topic, and see the distribution across the various collections; or analyze sentiment; or do statistical analysis of structure of suttas, and so on.

Currently we have a somewhat limited range of translations from languages other than Pali. However this is changing rapidly and I expect that within the next few years we will have a fairly good coverage of the main texts.

As an example, see the translations for Up 6.005 (from Tibetan), SA 9 (from Chinese) and SN 22.15 (from Pali).

This may be a main use case. A scholar or practitioner wants to research ‘sati’ in the Tipitaka. They enter the search term ‘sati’, and the system, having pre-computed the mapping from all word forms to their stems (e.g. ‘sati’, ‘sati.m’, ‘sati~nca’), matches every form.
(BTW, how do I type Pali here?) It finds a big samyutta on sati, another group in the Patisambhidamagga, etc. It displays a blurb on each of those big groups with the number of suttas and tokens found, along with a few interesting short proverbs from the gathas. The suttas and proverbs are shown with popularity scores (by favourites, page-ranking, or some other scoring method). I want to design the ideal experience for a user. NLP and ML can play a part in pre-computing the metadata needed for the display.
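
To make the pre-computation idea concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: `FORM_TO_STEM` is a hand-written, hypothetical lookup table (a real one would have to come from a Pali morphological analyser or dictionary data), the document IDs and texts are invented, and tokens are simply split on whitespace.

```python
from collections import defaultdict

# Hypothetical form-to-stem table in the ASCII notation used above.
# A real table would come from a morphological analyser or dictionary.
FORM_TO_STEM = {
    "sati": "sati",
    "sati.m": "sati",
    "sati~nca": "sati",
}

def build_index(docs):
    """Map each stem to {doc_id: token count} in one pass over the corpus."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Forms missing from the table are indexed as themselves.
            stem = FORM_TO_STEM.get(token, token)
            index[stem][doc_id] += 1
    return index

def search(index, term):
    """Return (doc_id, count) pairs for a query term, most hits first."""
    stem = FORM_TO_STEM.get(term.lower(), term.lower())
    hits = index.get(stem, {})
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented toy documents, not real sutta text.
docs = {
    "doc-a": "sati sati.m upekkhaa",
    "doc-b": "sati~nca viriya.m",
    "doc-c": "viriya.m",
}
index = build_index(docs)
results = search(index, "sati.m")
print(results)
```

Because the index is built once, a query for any inflected form costs only a dictionary lookup, and the per-document counts are the kind of metadata a result blurb could show.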

My NLP course is now on the topic of stemming and synonyms. There are a few ideas I might pursue. Probably mining some data from PTS dictionary. I will share my findings later.

There’s a thread elsewhere on diacriticals. The short version is, you have to do it based on your own operating system. We created a widget for inserting them, but it broke with an upgrade. Discourse is in the process of updating the editor, and we are waiting for this to be complete before adding the widget again.

This is all good stuff. Like on Google these days, we won’t just get a list of hits, but organized information.

FYI, we use elasticsearch (based on Lucene) for our search engine on SuttaCentral, and it was configured and maintained by @blake. He’ll be very happy to respond to any questions.

[quote=“sujato, post:14, topic:3238, full:true”]
we use elasticsearch (based on Lucene) for our search engine on SuttaCentral, and it was configured and maintained by @blake.[/quote]
I will look more into ElasticSearch. I hope the JSON interface will be flexible enough to adapt the display into what we want.
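
For anyone curious what that JSON looks like, here is a rough Python sketch that only builds a search request body and prints it. The `match` query and `terms` aggregation are standard Elasticsearch query DSL, but the field names `text` and `collection` are my own assumptions; I have not seen the actual SuttaCentral mapping.

```python
import json

def build_query(term, size=10):
    """Build a search body: full-text match plus hit counts per collection."""
    return {
        "size": size,
        # "text" is an assumed name for the full-text field.
        "query": {"match": {"text": term}},
        "aggs": {
            # "by_collection" is just a label for the returned buckets;
            # "collection" is an assumed keyword field.
            "by_collection": {"terms": {"field": "collection"}}
        },
    }

body = build_query("sati")
payload = json.dumps(body, indent=2)
print(payload)
```

A body like this would be sent to an index’s `_search` endpoint. The aggregation part is what could feed a grouped display (counts per collection) rather than a flat list of hits.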

:grinning: :grin: :laughing:

My question: running NLP tasks on any kind of corpus means that most of the words will be thrown out and treated as noise (for instance, stop words that are frequent but carry little information, like “the”, “and”, “a”).

Still, this approach just does not feel right when processing the Suttas. How can I say that certain words are unimportant in these texts?

Am I just overthinking this or are there some guidelines one can follow when running text mining on the Suttas?

What does this acronym stand for? In psychology it stands for Neuro-Linguistic-Programming. I’m guessing it is something quite different in “tech-speak”.


Interestingly, it’s like the polar opposite. [Psychological] NLP is a failed theory that tried to treat humans like machines; [computational AI] NLP is a growing field of AI that treats machines like people, so that machines can “understand” our language.

An example for the future that Alphabet is working on is being able to ask Google a question the way people talk to computers in sci-fi movies, that is, in a natural way, instead of as an ungrammatical string of keywords.


What does this acronym stand for? In psychology it stands for Neuro-Linguistic-Programming. I’m guessing it is something quite different in “tech-speak”.

Natural language processing. Refers to techniques to let computers extract data or meaning from text or speech that was produced by humans for humans, as opposed to being specially structured for a computer- though in some cases there may be markup or pre-processing that adds structure to make the task easier for the computer.


@SCMatt and @ccollier are correct, it stands for natural language processing, but it’s not as if machines actually understand language. Surprisingly, creating AI that can beat people at Go, walk like dogs, somewhat drive cars, etc. turned out to be easier than solving language. The reason is that language is always changing, there are exceptions to every grammatical rule, words can change meaning based on context, and a single word can change the meaning of an entire sentence. Language is a creative process where we can invent new words and sentences and others will still be able to understand us.

The term “natural” is actually misleading as well. In theory, it is possible to create language tools that work well on all languages; in reality, you have to tweak them heavily or develop new ones based on the complexity and grammar of the language (hence the Pali NLP ideas).

Now this might sound like rocket science, but in reality all these tools can do (since we have no better idea of how to tackle language with computers) is treat letters/words/sentences as numerical values and work with those values in mathematical models. If you write an essay in Word, there’s an indicator at the bottom that shows how many key presses or letters the essay has. As a human, you would have a really hard time counting all the letters, but this is super easy for a computer. For this reason, NLP tools rely on methods that are easy for computers, even if humans would never analyze a text based on, for instance, letter frequency. However, these seemingly strange methods can still yield a lot of information: for example, by examining word frequencies and their distribution, it’s often possible to tell whether two texts were written by the same person. That’s how J.K. Rowling was revealed to be the author of a novel she had published under a pseudonym.
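
To show what “treating text as numbers” can look like, here is a toy Python sketch that counts letters and compares word-frequency profiles with cosine similarity. This is only an illustration: the sentences are invented, and real authorship analysis works on far larger samples with more careful statistics than this.

```python
import math
import re
from collections import Counter

def word_profile(text):
    """Relative frequency of each lowercase word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}

def cosine(p, q):
    """Cosine similarity of two sparse frequency vectors (1.0 = same direction)."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Invented sample sentences.
a = "the monk sat and the monk breathed"
b = "the monk walked and the monk sat"
c = "profit rose sharply in the third quarter"

# Counting letters is trivial for a computer.
letters = Counter(ch for ch in a if ch.isalpha())

# Texts with similar vocabulary habits get a higher similarity score.
sim_ab = cosine(word_profile(a), word_profile(b))
sim_ac = cosine(word_profile(a), word_profile(c))
print(letters.most_common(3), round(sim_ab, 3), round(sim_ac, 3))
```

The first two sentences share most of their words and score much closer to each other than either does to the third, which is the basic intuition behind frequency-based comparison.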

Running NLP tasks on Suttas could boost the efficiency of online search. However, it doesn’t feel quite right to treat the Suttas as mere numbers.


Still, this approach just does not feel right when processing the Suttas. How can I say that certain words are unimportant in these texts?

I think you’re overthinking it a little. Things are important according to their context, and human language is full of redundancy because it has to cope with communicating via noisy channels- either literally in the sense of spoken speech, the root of all language, or figuratively in terms of overcoming the noise and bias in the human mind. Certain patterns and markers might be useful in keeping a big group on the same page if you’re chanting for an hour, but aren’t that useful for silent reading or creating a summary. That doesn’t mean the monks and nuns wasted their time by preserving them or that they ought to be discarded in the future, just that they emerge from a different context.

Also, rarely in NLP would you be preemptively throwing things out as unimportant- instead, the software is tweaking numbers under the hood to reflect the fact that some words contribute minimally to extracting the features that you are interested in. The only content that I would think about preemptively discarding would be things like section headings or visual formatting that were added later to make human navigation easier. Those are likely to vary between different recensions of the suttas anyway.
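
One common example of this kind of down-weighting is TF-IDF, sketched below in plain Python. I’m naming it as an illustration, not as what any particular tool does with the suttas, and the sentences are invented. Note that nothing is deleted: a word that appears in every document simply ends up with a weight of zero.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            w: (count / len(tokens)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        })
    return weights

# Invented toy sentences.
docs = [
    "the monk develops mindfulness",
    "the nun develops energy",
    "the king asks the monk",
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

Here “the” occurs in all three sentences and gets weight zero, while “mindfulness” occurs in only one and gets the highest weight in its sentence. The text itself is left untouched; only the numbers used for ranking change.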

I’ve been looking into potential NLP libraries and just came across this huge structured database of Chinese terms:

Also, these examples using seq2seq in TensorFlow look quite useful:


Thanks, I’ll call @vimala’s attention to this.