Possible NLP (Natural Language Processing) projects or developments

kora · September 15, 2016, 1:21am

I have checked out the SC database from GitHub. It’ll probably be my main data source to work with.

Many ideas have been collected elsewhere; here is just what comes to mind now:

Translation Alignment: people want to know which Thai/English word corresponds to which Pali word. Some words are only added in translation to make things clearer.
Translation Variety: the same Pali words/phrases are translated differently. What are the proper translations? Should we standardize?
Many Thais find the English translation easier to understand than the Thai. I want to make it easy to jump between the 3 languages.
Analysing word distribution and grammatical function. Standard analyses already exist for English and other common languages. These could be applied to Pali to gain some interesting insight (not Vipassana, though).

Which do you think is most useful? Most practical? Or is there a different/better approach?

sujato · September 15, 2016, 2:32am

We’re working on that. Currently I am translating the four nikayas into English on a segmented basis. Check out the discussions here on “pootle” and “segment”. Let me know if you want to check out our translation software at https://pootle.suttacentral.net, we can give you an account.

I’ll be translating consistently across the four nikayas, at least.

I don’t think there’s anything wrong with some variety of interpretation, and the line between interpretation and error is not always easy to draw.

The biggest issue, to my mind, is the poor quality of translations generally. While we have seen a very large expansion of the numbers of translations available, in many cases they could do with improvement.

In terms of the resources offered by SC, such as dictionaries, it would be good to correct known errors.

This will be possible with the segmented texts. Once it is working, all texts will be aligned segment by segment. Because aligned segments share a common ID, it will be easy to view the same segment in various languages. This will, however, only be possible for texts translated on our Pootle software, or those that have been adapted for it (as Vimala is currently doing for the new Vinaya translation).
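As a rough illustration of what a shared segment ID makes possible, here is a minimal Python sketch. The ID format (`x1:1.1`), the sample texts, and the function name `align_segments` are all invented for the example; this is not the actual SuttaCentral or Pootle data format.

```python
def align_segments(*translations):
    """Join several {segment_id: text} dicts on their shared segment IDs.

    Returns {segment_id: [text_or_None, ...]} with one slot per input,
    in the order the IDs are first seen. A missing segment stays None.
    """
    aligned = {}
    for index, translation in enumerate(translations):
        for segment_id, text in translation.items():
            row = aligned.setdefault(segment_id, [None] * len(translations))
            row[index] = text
    return aligned


# Hypothetical sample data: IDs and texts are placeholders only.
pali = {"x1:1.1": "evaṃ me sutaṃ", "x1:1.2": "ekaṃ samayaṃ bhagavā"}
english = {"x1:1.1": "So I have heard.", "x1:1.2": "At one time the Buddha"}
thai = {"x1:1.1": "(Thai text for this segment)"}  # second segment not yet translated

aligned = align_segments(pali, english, thai)
for segment_id, row in aligned.items():
    print(segment_id, row)
```

A segment that has not been translated in one language simply shows up as `None`, so an interface could fall back to another language for that segment.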

The bigger problem here, in my view, is that the Thai translations need a serious update. Perhaps @Dheerayupa would like to weigh in here. But I would love to see a modern, “plain Thai” translation, dispensing with the artificial Pali and Sanskritic forms and the excessive formalizations, and dare to have the Buddha speak as he does in the Pali: like a human being.

I agree, I would love to see something like this. For example, we might develop an interface so that you could select a text or a range of texts, and analyze features such as word length, sentence length, vocabulary, word proximity, sentiment, grammatical forms, and so on.
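Some of these features are purely surface-level and can be computed without any grammar analysis. A minimal sketch in Python, assuming plain text with sentences ending in `.`, `!` or `?`; the function name `text_features` and the sample sentence are made up for illustration:

```python
import re
import unicodedata


def text_features(text):
    """Compute a few surface features of a text (no grammar involved)."""
    # NFC keeps letters such as "ṃ" as single precomposed characters,
    # so that \w treats them as part of a word.
    text = unicodedata.normalize("NFC", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return {
        "sentences": len(sentences),
        "words": len(words),
        "vocabulary": len(set(words)),
        "mean_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


features = text_features(
    "So I have heard. At one time the Buddha was staying near Savatthi."
)
print(features)
```

Word proximity, sentiment, and grammatical forms are a different matter and would need real models behind them.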

The problem is that to do this well requires the hard work: an accurate Pali grammar parser. Such a thing doesn’t exist, so far as I’m aware. This is a hard piece of NLP programming. Currently we use a series of hacks to identify different forms of a Pali term, but it falls a long way short of a proper grammar parser.

Sad but true!

It really depends on your area of interest. Having said which, most of the things you’ve mentioned are already in our pipeline. The holdup here is not the developers, but the translator. That’s me!

And by the way, this conversation is getting too technical for the watercooler, so let me take it over to meta/dev.

kora · September 15, 2016, 3:11am

Thank you for your guidance.
I have looked into meta/dev for a while, hoping someday to contribute.
I just found your opening ‘watercooler’ post a good chance to introduce myself.

From now on, I will post my progress here.

My academic interest is probably in replicating some standard NLP studies on Pali itself (starting with character counts, word counts, and finding the 100 most frequent words, as an example).
My Buddhist interest is in making the Tipitaka more accessible/readable (starting with translation alignment and 3-language switching).
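The first step (character count, word count, most frequent words) could look roughly like this in Python. This is only a sketch: `frequency_report` is a made-up name, the sample string is a placeholder, and tokenising on runs of letters is an assumption that ignores sandhi and compounds.

```python
import re
import unicodedata
from collections import Counter


def frequency_report(text, top_n=100):
    """Count letters and words and list the most frequent of each."""
    # Normalise so that diacritics are single precomposed letters.
    text = unicodedata.normalize("NFC", text).lower()
    # A "word" here is any run of letters (no digits, no underscore).
    words = re.findall(r"[^\W\d_]+", text)
    chars = Counter(ch for ch in text if ch.isalpha())
    return {
        "char_count": sum(chars.values()),
        "word_count": len(words),
        "top_chars": chars.most_common(top_n),
        "top_words": Counter(words).most_common(top_n),
    }


report = frequency_report("evaṃ me sutaṃ. evaṃ me", top_n=3)
print(report["word_count"], report["top_words"])
```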

I also want to support Pali scholars. What would you want if we could apply all the new technologies (e.g. Machine Learning, AI) to the Tipitaka?

sujato · September 15, 2016, 3:21am

Well, this all sounds great. Please stay in touch with us here, and our friendly developers @vimala and @blake will be happy to coordinate with you.

Again, I would love to do this. Since Google has open-sourced TensorFlow, there are now many first-class AI applications available freely. The thing is to imagine a really useful application.

One detail that might be worth bearing in mind: the vast majority of NLP work is, for better or for worse, in English. The reality is that the English tools will always be leagues ahead of anything available in Pali. Once consistent English translations are available for the four nikayas, this opens up the possibility of doing NLP on them and applying the results to Pali. Now, there are many kinds of situations where that will not work, such as grammar analysis and so on. But in terms of higher level semantic analysis, linking of topics, sentiment analysis, and so on, it might (might!) be a useful approach.

kora · September 15, 2016, 3:37am

For grammatical analysis, I put my hope in Hindi NLP. Hundreds of millions of people speak Hindi, including Sundar Pichai himself, so I expect some progress in the future. Some of it will be applicable to Pali because of the syntactic similarity, I believe.

Recent advances include Parsey’s Cousins:

They have applied TensorFlow to analyse the syntax of 40 languages, including Hindi. I wish I knew Hindi, so I could play with it more and see how to map it to Pali.

sujato · September 15, 2016, 3:49am

Maybe, but I have my doubts. Hindi is probably no more closely related to Pali than, I don’t know, Latin is to English. If any lessons can be learned, they will need a great deal of tweaking. Still, given the vast numbers of Indians in IT these days, maybe something will be developed. Maybe we should reach out to some Hindi speakers and see if they have any thoughts on this.

The only thing I can contribute is that, last time I spoke with an Indian Pali professor, he said that if they wanted to make new Hindi translations, they would translate from English rather than from Pali.

Sanskrit, on the other hand, is very close to Pali, so anything there would be readily adaptable. I just checked Parsey’s Cousins, and they have developed models for ancient Greek and Latin, so Sanskrit/Pali is not out of the question. I wonder if it’s possible to put in a request?

It’s interesting that TensorFlow is still at a ceiling of about 85% accuracy.

kora · September 15, 2016, 4:04am

Most recent advances in NLP rely on Big Data. TensorFlow and SyntaxNet need a lot of data to train. I am afraid that doesn’t fit well with the Pali data we have. Maybe traditional NLP would suffice. If the Hindi model cannot be used, we may need to develop our own parser (which should be doable).

I will need some months to brush up my NLP knowledge after many years away from the research. I am now taking an NLP course on Coursera (in case someone is interested):
https://www.coursera.org/learn/natural-language-processing

sujato · September 15, 2016, 4:08am

May I ask, what kind of scale are we talking about? In the Pali texts, we have maybe 2 million words in the canon, several times that if we include later literature (the later texts have a somewhat different vocabulary, content, and style, but the grammar is pretty much the same). Sanskrit literature is much bigger.

I’m not really sure how it compares, but I don’t think there’s a huge corpus in ancient Greek, for example.

kora · September 15, 2016, 4:20am

I don’t have an exact answer. Google must have trained those syntax models with terabytes of data. I installed TensorFlow a few weeks ago and haven’t done any Pali experiments with it yet. Once I do, I will report what I find here. For now my focus is to learn the traditional approach in an online course, and to do Pali experiments along the way.

You can expect some easy stuff (e.g. character/word frequency) from me in a few days. Should I then report it here? Or as a new topic?

Dheerayupa · September 15, 2016, 6:59am

Dear Ajahn @sujato,

I just mentioned this to your Metta teacher, Ajahn Chatchai, saying that if I had plenty of time, I would like to work with him on translating the whole Tipitaka! Perhaps you yourself could persuade him to do it during his birthday celebration next year?

sujato · September 15, 2016, 7:49am

Great, please post it as a new thread.

What a fantastic birthday present for him. He’s my metta teacher, I should treat him with kindness, not cruelty!

sujato · September 15, 2016, 10:59pm

Actually, I forgot one important detail. We’ve been talking about the Pali texts, but forgetting the role of SC in aligning parallels. Now, to be able to do meaningful NLP on Pali, Chinese, Sanskrit, and Tibetan would be great. The reality is, though, that applications for any one of these are likely to be quite primitive. (To be clear, ancient Chinese, while using mostly the same characters as modern Chinese, has a very different idiom and usage, and it’s not clear how modern language processing might apply.) Moreover, the ability to do this across these languages, treating them as a single corpus, is even more remote.

What would be much closer to the realm of the possible, though, is to do NLP on English translations. Now, clearly, as I mentioned before, this will not apply to many kinds of analysis. However, for higher level semantic analysis it might be possible. For example, one might want to research a particular topic, and see the distribution across the various collections; or analyze sentiment; or do statistical analysis of structure of suttas, and so on.

Currently we have a somewhat limited range of translations from languages other than Pali. However, this is changing rapidly, and I expect that within the next few years we will have fairly good coverage of the main texts.

As an example, see the translations for Up 6.005 (from Tibetan), SA 9 (from Chinese) and SN 22.15 (from Pali).

kora · September 16, 2016, 9:06am

This may be a main use case. A scholar or a practitioner wants to research ‘sati’ in the Tipitaka. He inputs the search term ‘sati’; the system has pre-computed the mapping of all word forms to their stems, e.g. ‘sati’, ‘sati.m’, ‘sati~nca’, etc. (BTW, how do I type Pali here?) It finds a big samyutta on sati, another group in the Patisambhidamagga, etc. It displays a blurb on those big groups with the number of suttas and tokens found, plus a few interesting short proverbs in gatha, etc. Those suttas and popular proverbs are shown with their popularity scores (by favorites, page-ranking, or some other scoring method). I want to design the ideal experience for a user. NLP and ML can play a part in pre-computing the metadata needed for the display.
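To make the pre-computation idea concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `.m` and `~n` spellings above are treated as Velthuis-style ASCII input, the form-to-stem table is written by hand (a real system would need a proper Pali analyser), and the text IDs and token strings are placeholders rather than real suttas.

```python
import re
from collections import Counter, defaultdict

# A few Velthuis-style ASCII spellings mapped to Unicode letters.
# Naive string replacement: good enough for single search terms only.
VELTHUIS = [
    ("aa", "ā"), ("ii", "ī"), ("uu", "ū"),
    (".m", "ṃ"), (".t", "ṭ"), (".d", "ḍ"), (".n", "ṇ"), (".l", "ḷ"),
    ("~n", "ñ"), ('"n', "ṅ"),
]

# Hand-written for the example; not derived from any dictionary.
FORM_TO_STEM = {"sati": "sati", "satiṃ": "sati", "satiñca": "sati"}


def velthuis_to_unicode(text):
    """Convert ASCII input such as 'sati.m' to 'satiṃ'."""
    for ascii_form, unicode_form in VELTHUIS:
        text = text.replace(ascii_form, unicode_form)
    return text


def build_index(texts):
    """Pre-compute {stem: Counter({text_id: token_count})}."""
    index = defaultdict(Counter)
    for text_id, text in texts.items():
        for token in re.findall(r"[^\W\d_]+", text.lower()):
            stem = FORM_TO_STEM.get(token)
            if stem is not None:
                index[stem][text_id] += 1
    return index


def search(index, query):
    """Return (text_id, token_count) pairs, most hits first."""
    stem = FORM_TO_STEM.get(velthuis_to_unicode(query.lower()))
    return index[stem].most_common() if stem in index else []


# Placeholder token strings, not quotations from real texts.
texts = {"text-a": "satiṃ upaṭṭhapeti satiñca", "text-b": "sati"}
index = build_index(texts)
print(search(index, "sati.m"))
```

The token counts per text are the kind of metadata that could feed the blurb; the popularity scores would have to come from somewhere else.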

My NLP course is now on the topic of stemming and synonyms. There are a few ideas I might pursue, probably mining some data from the PTS dictionary. I will share my findings later.

sujato · September 16, 2016, 9:32am

There’s a thread elsewhere on diacriticals. The short version is, you have to do it based on your own operating system. We created a widget for inserting them, but it broke with an upgrade. Discourse is in the process of updating the editor, and we are waiting for this to be complete before adding the widget again.

This is all good stuff. Like on Google these days, we won’t just get a list of hits, but organized information.

FYI, we use elasticsearch (based on Lucene) for our search engine on SuttaCentral, and it was configured and maintained by @blake. He’ll be very happy to respond to any questions.

kora · September 17, 2016, 4:30am

[quote=“sujato, post:14, topic:3238, full:true”]
we use elasticsearch (based on Lucene) for our search engine on SuttaCentral, and it was configured and maintained by @blake.[/quote]
I will look more into ElasticSearch. I hope the JSON interface will be flexible enough to adapt the display into what we want.
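For reference, a search request body of the kind I have in mind, built as a plain Python dict and printed as JSON. The `match` query, `terms` aggregation, and `highlight` section are standard parts of the Elasticsearch query DSL, but the field names `content` and `collection` are my guesses; I have not seen the actual SuttaCentral index mapping.

```python
import json


def build_search_body(term, size=10):
    """Build a search body: full-text match plus hit counts per collection."""
    return {
        "size": size,
        # Full-text match on a hypothetical "content" field.
        "query": {"match": {"content": term}},
        # Count hits per collection (hypothetical keyword field "collection").
        "aggs": {"by_collection": {"terms": {"field": "collection"}}},
        # Ask for highlighted snippets from the matched text.
        "highlight": {"fields": {"content": {}}},
    }


body = build_search_body("sati")
print(json.dumps(body, indent=2))
```

The aggregation is what would give organized information (how many hits in each collection) rather than just a flat list of results.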

Dheerayupa · September 21, 2016, 8:23am

richard.nagyfi · August 28, 2018, 3:14pm

My concern is that running NLP tasks on any kind of corpus usually means that most of the words are thrown out as noise (for instance, stop words that are frequent but carry little information, like “the”, “and”, “a”).

Still, this approach just does not feel right when processing the Suttas. How can I say that certain words are unimportant in these texts?

Am I just overthinking this or are there some guidelines one can follow when running text mining on the Suttas?
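One alternative I can think of is to down-weight frequent words instead of deleting them, for example with TF-IDF, so that nothing is declared unimportant up front. A minimal pure-Python sketch; the function name `tf_idf` and the toy documents are made up, and this uses the plain textbook formula, not any particular library's variant:

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: {doc_id: [tokens]} -> {doc_id: {token: weight}}.

    weight = (count in doc / doc length) * log(n_docs / docs containing token)
    A word that occurs in every document gets weight 0 but is still kept.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    weights = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        }
    return weights


# Toy documents, not real sutta text.
docs = {"d1": ["the", "monk", "the"], "d2": ["the", "nun"]}
weights = tf_idf(docs)
print(weights["d1"])
```

Here “the” ends up with weight 0 in both documents because it occurs in both, but no word was removed by hand.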

Viveka · August 29, 2018, 2:45am

What does this acronym stand for? In psychology it stands for Neuro-Linguistic-Programming. I’m guessing it is something quite different in “tech-speak”.

SCMatt · August 29, 2018, 4:24am

Interestingly, it’s like the polar opposite. [Psychological] NLP is a failed theory that tried to treat humans like machines; [computational AI] NLP is a growing field of AI that treats machines like people, so that machines can “understand” our language.

An example for the future that Alphabet is working on is being able to ask Google a question in a natural way, like the computers in sci-fi movies, instead of as an ungrammatical string of keywords.

ccollier · August 29, 2018, 5:44am

[quote=“Viveka”]
What does this acronym stand for? In psychology it stands for Neuro-Linguistic-Programming. I’m guessing it is something quite different in “tech-speak”.[/quote]

Natural language processing. It refers to techniques that let computers extract data or meaning from text or speech that was produced by humans for humans, as opposed to being specially structured for a computer, though in some cases there may be markup or pre-processing that adds structure to make the task easier for the computer.