Possible NLP (Natural Language Processing) projects or developments


@SCMatt and @ccollier are correct, it stands for natural language processing, but it’s not like machines actually understand language. Surprisingly, creating AI that can beat people at Go, walk like a dog, or somewhat drive a car turned out to be easier than solving language. The reason is that language is always changing: there are exceptions to every grammatical rule, words change meaning based on context, and a single word can change the meaning of an entire sentence. Language is a creative process in which we can invent new words and sentences and others will still be able to understand us.

The term “natural” is actually misleading as well: in theory it is possible to create language tools that work well on all languages, but in practice you have to heavily tweak them, or develop new ones, depending on the complexity and grammar of each language (hence the Pali NLP ideas).

Now this might sound like rocket science, but in reality all these tools can do (since we have no better idea of how to tackle language with computers) is treat letters/words/sentences as numerical values and feed those values into mathematical models. If you write an essay in Word, there’s an indicator at the bottom showing how many characters the essay has. As a human, you would have a really hard time counting all the letters, but this is trivial for a computer. For this reason, NLP tools rely on operations that are easy for computers, even if no human would ever analyze a text by letter frequency, for instance. Still, these seemingly strange methods can yield a lot of information: for example, by examining word frequencies and their distribution, it’s possible to tell whether two texts were written by the same person. That’s how J.K. Rowling was revealed to be the author of a book published under a pseudonym.
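To make the word-frequency idea concrete, here is a minimal sketch (illustrative only; real stylometry uses many more features and careful preprocessing). It turns each text into a vector of relative word frequencies and compares two texts with cosine similarity; the sample sentences are made up:

```python
from collections import Counter
import math

def freq_vector(text):
    """Map each word to its relative frequency in the text."""
    words = text.lower().split()
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

text1 = "the monk went to the village and the monk taught"
text2 = "the monk sat in the forest and the monk meditated"

# Texts by the same "author" (similar habits of word use) score
# closer to 1.0 than texts with very different word distributions.
print(cosine_similarity(freq_vector(text1), freq_vector(text2)))
```

In practice this comparison is done over function words (“the”, “and”, “of”, …), precisely because authors use them unconsciously and consistently.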

Running NLP tasks on the Suttas could boost the efficiency of online search. However, it doesn’t feel quite right to treat the Suttas as mere numbers.


Still, this approach just does not feel right when processing the Suttas. How can I say that certain words are unimportant in these texts?

I think you’re overthinking it a little. Things are important according to their context, and human language is full of redundancy because it has to cope with communicating over noisy channels: either literally, in the sense of spoken speech, the root of all language, or figuratively, in terms of overcoming the noise and bias in the human mind. Certain patterns and markers might be useful for keeping a big group on the same page if you’re chanting for an hour, but aren’t that useful for silent reading or creating a summary. That doesn’t mean the monks and nuns wasted their time by preserving them or that they ought to be discarded in the future, just that they emerged from a different context.

Also, in NLP you would rarely throw things out preemptively as unimportant; instead, the software tweaks numbers under the hood to reflect the fact that some words contribute minimally to extracting the features you are interested in. The only content I would think about preemptively discarding would be things like section headings or visual formatting that were added later to make human navigation easier. Those are likely to vary between different recensions of the suttas anyway.
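A tiny illustration of that automatic down-weighting, using hand-rolled TF-IDF (a standard weighting scheme; the toy “documents” below are made up). Nothing is declared unimportant in advance: a word that appears in many documents simply ends up with a small weight, while rarer words score higher.

```python
import math

# Toy corpus of three tiny "documents".
docs = [
    ["suffering", "arises", "from", "craving"],
    ["craving", "leads", "to", "suffering"],
    ["from", "seclusion", "arises", "joy"],
]

def tfidf(word, doc, docs):
    """Term frequency in one document, scaled down by how many
    documents the word appears in (inverse document frequency)."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "from" occurs in 2 of 3 documents -> low idf;
# "seclusion" occurs in only 1 -> higher weight.
print(tfidf("from", docs[2], docs))
print(tfidf("seclusion", docs[2], docs))
```

Note that a word occurring in every document would get an idf of log(1) = 0, i.e. zero weight, without anyone ever labeling it “unimportant”.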


I’ve been looking into potential NLP libraries and just came across this huge structured database of Chinese terms:

Also, these examples using seq2seq in TensorFlow look quite useful:


Thanks, I’ll call @vimala’s attention to this.