@silence kindly offered to help out in making a better Pali compound-breaker for our lookup tool.
Currently it works automatically: it parses the compound, compares the parts against a list of words, and guesses where the compound might be broken. Given the complex and unpredictable nature of Pali compounds, it’s not easy to get such an approach working reliably.
Just a couple of points worth bearing in mind.
The current lookup is based on the New Concise Pali English Dictionary (NCPED), which is Buddhadatta’s old CPED, enhanced and corrected based on Cone’s Dictionary of Pali. We are awaiting the release of her third volume before this can be finalized.
A fair number of extra improvements have been generated during the translation process, but these have not been integrated into the lookup. Essentially I broke compounds by hand while translating, but this was entirely ad hoc; by no means were all of them done.
So to improve the tool, there are essentially three angles of attack:
Improve the underlying resources, especially the dictionary.
Improve the automated analysis and breakup.
Manually insert breakpoints in the text.
@silence, can you describe what your experience has been in doing this in the past, and how you envisage approaching the task?
I have to say I have zero programming skills beyond the most basic html features, so I think it is important to include a knowledgeable programmer in the discussion, but the basic idea is to give a hand to the algorithm by marking breakup points “manually”.
To give more detail: I imagine the current algorithm treats as a word any character string surrounded by blank spaces, so the idea is to insert a user-invisible tag inside the compounds that the algorithm would then treat as a blank space. For example, it could go like this:
The user would see:
We would add tags within the word (I am using [ instead of <):
So this could be done “manually,” which would take some time, but with a regex find-and-replace tool the operation could be extended to the entire canon at once, which would save a tremendous amount of time.
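As a rough illustration of that regex find-and-replace idea, here is a minimal Python sketch. The break table and the “[br]” marker are made up for the example, standing in for whatever user-invisible tag is finally chosen:

```python
import re

# Hypothetical break table: known compounds mapped to their split forms.
# "[br]" stands in for whatever user-invisible tag is finally chosen.
BREAKS = {
    "dhammacakkhu": "dhamma[br]cakkhu",
    "satipatthana": "sati[br]patthana",
}

def insert_breaks(text):
    # One alternation pattern covers every known compound, so the whole
    # canon could be processed in a single pass.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, BREAKS)) + r")\b")
    return pattern.sub(lambda m: BREAKS[m.group(1)], text)
```

The same pattern could be run over every text file in one batch, which is what makes the regex approach so much faster than editing by hand.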
Thanks to both of you. So it seems Nibbanka is suggesting an improved algorithm, while Silence is suggesting hand-curated breaks. It’s not an either/or situation, as in cases like this it’s virtually impossible to get 100% accuracy with an algorithm. If we can get the algorithm to do as much as possible, then do the rest by hand, that would be great.
Forgive me if I’ve forgotten this, but what form is this in? Do you have a JS script that will do this?
Yes, the key would be to figure out how to do it most efficiently.
In terms of what to insert, the simplest way would be to insert an arbitrary character, say - (hyphen), which we can then change at the end to whatever we want.
Currently SC inserts soft hyphens into long words, which allows long Pali terms to break with a hyphen at the end of line. This is necessary since browsers don’t currently have well-supported hyphenation (in general) and none at all for Pali. IIRC the soft hyphens are inserted following the LaTeX hyphenation rules for Sanskrit. In any case, one approach would be to remove these soft hyphens and instead insert them only on the compound breakpoints. That would keep the markup clean, and perform a dual function of both indicating possible line-breaks and indicating breakpoints to a lookup tool.
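To illustrate the dual-purpose idea, here is a minimal sketch that inserts the Unicode soft hyphen (U+00AD) at breakpoints. The offsets here are supplied by hand; in practice they would come from whatever compound analysis is available:

```python
SOFT_HYPHEN = "\u00ad"  # invisible unless the browser breaks the line here

def mark_breakpoints(compound, breaks):
    # Split the compound at the given character offsets, then rejoin
    # with soft hyphens, which double as lookup-tool breakpoints.
    parts = []
    prev = 0
    for i in breaks:
        parts.append(compound[prev:i])
        prev = i
    parts.append(compound[prev:])
    return SOFT_HYPHEN.join(parts)
```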
I am wondering whether we could do something like this:
Make a list of all Pali words in the Tipitaka.
Filter out all those that are in the dictionary, as well as variant spellings, etc. The remainder should be (mostly) a list of compounds.
Apply the best compound-breaking algorithm we have, inserting some character at the breakpoints.
Go over it by hand, correcting the mistakes.
Integrate the corrected versions back into the Pali text.
Once that has been done, the compound breaks are hard-coded into the text files, and there’s no need for a browser widget to analyze the compound and break up the words. So that makes the job of the lookup tool a lot simpler.
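The first two steps of the outline above amount to a set difference; a minimal sketch, with illustrative word lists:

```python
def candidate_compounds(corpus_words, dictionary):
    # Every distinct word in the corpus that is not a plain dictionary
    # headword is a likely compound (or a variant spelling or typo,
    # which the hand-correction pass would catch later).
    return sorted(set(corpus_words) - set(dictionary))
```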
The author of the above tool, Gérard Huet, apparently has scripts for Sanskrit.
In my experience, it’s essential to take into account the fact that the Pali language underwent significant changes over the centuries. The Pali of the Majjhima Nikaya is not quite the same as the Pali of the Apadana. Sandhi rules became more and more random over time.
Therefore, I would suggest paying attention first of all to the early strata (1-4 in the list of Dr. Bimala Churn Law), especially since existing dictionaries focus on the words from these strata.
And among the compounds from the early strata, to analyze first of all those that occur frequently. Here’s a word list based on the first four Nikayas, sorted by frequency:
The basic process of AI is to create an algorithm, train it, make corrections, and repeat. Now, the Pali canon is a finite corpus, so it may not be necessary to go as far as a highly trained AI application, but the work will need to go along those lines.
I think this is a good approach, but I would leave the Pali text as it is and create a secondary consultation source, so that the original text is preserved as is, or at least kept somewhere.
Do you have this NCPED available? I could do some tests and see if I can offer some assistance.
I have been working on NLP for the Canon, and this specific topic has given me a lot of trouble, especially for lack of an appropriate dictionary as a foundation.
The proposal I had in mind is basically what you proposed, but I ran into some problems:
Make a list of all Pali words in the Tipitaka: I have done some work on this, but my main problem is cleaning out all the HTML. My last test gave 175,613 distinct words, but these include (many) mistakes such as misspelled words, titles, numbers, variant spellings, and general ‘garbage’ coming from the HTML.
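For the HTML-cleaning step, the standard-library HTMLParser can strip tags before tokenizing. A rough sketch; the token pattern is only a guess at what counts as a Pali word and would need tuning:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects only the text nodes, dropping all tags.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def words_from_html(html):
    parser = TextExtractor()
    parser.feed(html)
    # Keep only alphabetic tokens (including Latin letters with
    # diacritics), lowercased, to cut down on numbers and "garbage".
    pattern = r"[A-Za-z\u00c0-\u024f\u1e00-\u1eff]+"
    return [w.lower() for w in re.findall(pattern, " ".join(parser.parts))]
```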
This is a CSV dump of what I got: http://lasangha.org/files/tmp/words.csv.zip
Is there a ‘clean’ Canon, without HTML?
I think this is a good idea: start with a subset of the text and move on from there.
Finally, there would be two processes:
One to do the heavy initial work (C/C++, PHP, or some other scripting language).
We (at BDRC) want to build a Pali analyzer for Lucene and are thus quite interested in this problem too. We are almost done (!) building a Sanskrit analyzer that splits compounds and lemmatizes words, and we were thinking of applying the same strategy to Pali. I see that you are using Elasticsearch; are you planning to implement this strategy in a Lucene analyzer for Elasticsearch, or would the compounds be split offline?
Very interesting! Just so you know, I have also been in contact with a group in Myanmar who want to do something similar.
I’ll get @Blake in on this, as he is the one who has worked in this area. But as far as I can say, we have a Pali lookup tool, written in JS, which does basic stemming and compound breaking. Then for Elasticsearch, I think we just do simple stemming. There’s certainly nothing in terms of more advanced lexical analysis.
For Sanskrit, I’m wondering how well that would work with the early Buddhist texts in Sanskrit, as they tend to be somewhat variable in their spelling.
For Pali, I am not confident I could say how useful a Sanskrit-based tool would be. In general terms, you lose information when shifting to Pali, as the morphology and so on are less complex. That might make it harder to disambiguate terms and forms in Pali.
The problem with a corpus the size of the Pali canon is that you are right on the edge of what is reasonable to do. It’s a sizeable corpus, but finite. It seems to me that, by the time you’ve coded an automatic parser, you could probably just go through the entire vocabulary by hand and define each word and compound. Of course, extending it to include the commentaries and so on would change this.
But maybe if a parser can be adapted from Sanskrit without too much trouble, it could work. Perhaps the best approach is to automate what can be automated, and focus on generating quality error messages. Then we can do the remainder by hand.
Our analyzer only works with Sanskrit following “orthodox” grammar rules; BHS (Buddhist Hybrid Sanskrit) will be much more difficult. For Pali, at least, the rules are more stable than in BHS, although, as you say, there will be many more ambiguities.
I think our goal is really to have something automatic, as I’m quite confident we will have a chance to run OCR on our Pali collection (see here and here) at some point, and that would be far too big to handle manually.
Because even for Sanskrit we have ambiguities, at some point there will need to be an additional pass to assess the probability of the various possibilities, based on old-school techniques (word frequencies) or machine learning. Both of these require a curated corpus, so having the Canon manually split will absolutely not be a waste of time!
Is the code of the JS tool on GitHub?
Adapting our Sanskrit code to Pali should be fairly easy, as only the data will change. Currently we use the dictionary from Gérard Huet, with some of his code to compute all possible inflectional endings plus sandhis; if we had just that for Pali, we would be able to create a Lucene analyzer. I think that would be worth a try. Even with just 90% accuracy (we are aiming at 95% with Sanskrit in our first step), that would still be quite useful, I believe.
The Elasticsearch analyzer I hacked together is pretty basic. All it does is a little normalization (by -> vy, that kind of thing) and then stemming by slicing off known endings. No attempt is made to break down compounds, or even to recognize the appropriate declension.
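A minimal sketch of that kind of stemmer in Python; the normalization map and the ending list here are tiny illustrative samples, not the real data:

```python
# Illustrative sketch only: normalize a few known variants, then slice
# off the longest matching ending. The real analyzer's data is larger.
NORMALIZE = {"by": "vy"}
ENDINGS = ["assa", "ehi", "ena", "ani", "aya", "am", "a", "o", "e", "i", "u"]

def stem(word):
    # Normalization pass (by -> vy, that kind of thing).
    for old, new in NORMALIZE.items():
        if word.startswith(old):
            word = new + word[len(old):]
    # Slice off the longest known ending, leaving a non-trivial stem.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending) + 1:
            return word[: -len(ending)]
    return word
```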
For the Pali lookup tools we have two implementations for splitting compounds. The more advanced one generates all possible combinations that could form the compound and presents them all to the user. The more basic one starts at the beginning of the word, finds the longest stem that matches, then repeats on the unmatched portion of the compound.
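The more basic longest-match strategy could be sketched like this (the stem set is illustrative, not the tool’s actual data):

```python
def greedy_split(compound, stems):
    # From the start of the word, take the longest dictionary stem
    # that matches, then repeat on the unmatched remainder.
    parts = []
    i = 0
    while i < len(compound):
        for j in range(len(compound), i, -1):
            if compound[i:j] in stems:
                parts.append(compound[i:j])
                i = j
                break
        else:
            # No stem matches: emit the remainder unanalyzed.
            parts.append(compound[i:])
            break
    return parts
```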
For a search engine, I don’t see why the first approach couldn’t be used: generate all possible splits and index all of them. If a compound really does contain a stem, that gives the highest chance of correctly identifying it. Of course, it will also generate more false positives, but a human can filter those out.
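Generating every possible split for indexing could look roughly like this; a recursive sketch with a made-up stem set:

```python
def all_splits(compound, stems):
    # Enumerate every way the compound can be covered by dictionary
    # stems; a search index could then index all of the candidates.
    if compound == "":
        return [[]]
    results = []
    for j in range(1, len(compound) + 1):
        head = compound[:j]
        if head in stems:
            for rest in all_splits(compound[j:], stems):
                results.append([head] + rest)
    return results
```

Memoizing on the remainder string would be worthwhile on long compounds, since the same suffix gets re-analyzed many times.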
I see! Our Sanskrit analyzer currently uses the second technique (max match). We implemented it with what I think is the most performant approach (walking the string and a trie of possibilities at the same time), but the code became more complex than I imagined. Our goal is to have it fully working by the end of this month, though. If you already have some data for your second approach, we could use it to build a Pali analyzer at the same time, and you could use it in Elasticsearch. Is the data/code of the Pali lookup on GitHub? BTW, I don’t know if you’re indexing Tibetan or Chinese in Elasticsearch? If so, you may be interested in our analyzers (for Tibetan, for Chinese).
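The simultaneous string-and-trie walk could be sketched roughly as follows; a minimal illustration, not BDRC’s actual code:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def max_match(text, root):
    # Walk the string and the trie in step, remembering the last node
    # that completed a word (the maximal match), then restart there.
    parts, i = [], 0
    while i < len(text):
        node, j, last = root, i, None
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                last = j
        if last is None:
            parts.append(text[i:])
            break
        parts.append(text[i:last])
        i = last
    return parts
```

The trie avoids re-scanning the dictionary for every prefix, which is what makes this faster than the naive longest-match loop on large stem lists.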
Thanks for the pointer! I understand better now; the strategy is indeed a little different from ours, as we’re building a giant trie first with all morphology and sandhi, then analyzing the string. But the dictionaries will be quite useful, thanks! I notice some fields that look interesting in pi2en-entries.json: “priority” and “boost”. Are they based on frequency analysis?
Priority is just for the dictionaries: some definitions are considered better than others. I can’t remember exactly how boost is calculated, but it’s probably a combination of frequency analysis and the length of the definitions (on the logic that something worth writing a lot about is probably more significant).
Just to clarify for everyone here about the dictionaries.
CPED is the old Concise dictionary by Buddhadatta.
NCPED is an updated version of that, made by @Russell under our guidance for SC. It corrects the entries in CPED following Cone’s Dictionary of Pali. It does not contain the full details of Cone’s dictionary, but keeps to the CPED approach of making a useful student’s dictionary.
This has been completed as far as the two published volumes of Cone’s dictionary allow, and we will update it further as subsequent volumes are issued. Going forward, the NCPED will be the primary dictionary on SC for search results and Pali lookup. We would recommend that future dictionary work be based on the NCPED, allowing for the fact that it will be updated.