There seems to have been a lot of discussion of Pali stemmers over the years. If someone is interested in helping with the algorithm, I’d be interested in converting it to Snowball.
Take this example for the English language:
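To illustrate what an English stemmer does, here is a toy Python sketch that reduces inflected forms to a shared stem. This is a simplified illustration only, not the actual Snowball English algorithm, which applies ordered rule sets with region conditions:

```python
# Toy suffix-stripping stemmer, illustrating the basic idea behind stemming.
# NOT the real Snowball/Porter algorithm; suffixes here are a small sample.

SUFFIXES = ["ational", "ization", "fulness", "ness", "ing", "ed", "s"]

def toy_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least 3 characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("connected"))  # -> connect
print(toy_stem("walking"))    # -> walk
```

The real algorithm is much more careful about when a suffix may be removed, but the net effect is the same: "connected", "connecting" and "connection" all index under one stem.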
Having a Snowball stemmer for Pali would give us the option of indexing our texts in Arangosearch, ElasticSearch, Tantivy and so forth.
I’m happy to tackle the technical aspects if someone has an approach for the grammar.
Having a look here gives us a starting point:
They require:
This must include an algorithmic description of the stemmer, an implementation in Snowball, and a representative language vocabulary of about 30,000 words that can be used as part of a standard test.
If there is some interest, I might share it with the Snowball mailing list.
I asked Bhante Bodhirasa for his help on the grammar part and got back a dismissive response that the DPD DB has “solved” Pāḷi term matching and that it was (iirc) a bad idea to build a stemmer at all because then someone might be tempted to use it!
That, along with various theoretical questions Bhante raised about the handling of compounds and sandhi, bogged me down, so I abandoned the project, though I did end up building an extremely hacky Pali stemmer for my own use in parsing the Vinaya Vibhaṅga.
I think the first step here would be to use the DPD DB to craft a large test set of terms that should match (and ideally those that should not) and to hammer out the theoretical questions around how we want to handle combined words. Once we have some idea of what we want the behavior to be, we can start to try out algorithms against the test set and see how many tests we can get to pass.
I agree. Start with some good tests and then refine the algorithm.
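That workflow could be sketched in Python as a test set of (surface form, expected stem) pairs scored against a candidate stemmer. The pairs and the naive stemmer below are purely hypothetical placeholders, just to show the shape of the harness:

```python
# Hypothetical test set: (surface form, expected stem) pairs.
# In practice this would be generated from the DPD DB, ~30,000 words.
TEST_SET = [
    ("dhammena", "dhamm"),
    ("dhammassa", "dhamm"),
    ("buddho", "buddh"),
]

def score(stemmer):
    """Count how many test pairs a candidate stemmer gets right."""
    passed = sum(1 for word, want in TEST_SET if stemmer(word) == want)
    return passed, len(TEST_SET)

def naive_stem(word):
    # Candidate algorithm: strip a few common case endings (illustrative only).
    for ending in ("ena", "assa", "o"):
        if word.endswith(ending):
            return word[: -len(ending)]
    return word

print(score(naive_stem))  # -> (3, 3)
```

Different candidate algorithms can then be swapped in and compared on the same test set.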
As for compound words, perhaps there would be more than one step. First would be to produce the stream of tokens with the stemmer acting as a filter on the tokens as they’re produced.
Say we’re working with Elasticsearch. For a Pali language analyzer, you’d have a tokenizer:
Then you would take each token and apply a stemmer filter.
The stemmed tokens would then form the basis of the index.
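The pipeline above might be sketched in Python like this. The tokenizer stands in for Elasticsearch's standard tokenizer, and the Pali stemmer is a hypothetical stub, not a real algorithm:

```python
import re

def tokenize(text):
    # Simple word tokenizer, standing in for Elasticsearch's
    # "standard" tokenizer.
    return re.findall(r"\w+", text)

def stem_filter(tokens, stem):
    # Token filter: apply the stemmer to each token as it streams past.
    return [stem(t) for t in tokens]

def pali_stem(token):
    # Hypothetical stub: strips a couple of common case endings.
    # A real Snowball stemmer would replace this function.
    for ending in ("ena", "assa", "aṁ", "o"):
        if token.endswith(ending) and len(token) > len(ending) + 2:
            return token[: -len(ending)]
    return token

tokens = tokenize("dhammena saddhiṁ")
print(stem_filter(tokens, pali_stem))  # -> ['dhamm', 'saddhiṁ']
```

In Elasticsearch terms, `tokenize` is the tokenizer and `stem_filter` is a token filter in the analyzer chain; the filter's output is what actually gets indexed.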
When analyzing text we first need to produce tokens. Say we come across this text:
uddhaccakukkuccaṁ
Should we break this into two tokens and stem both? I'm not sure how that would work, but it seems quite similar to tokenizing German. For example:
Straßenwaschmaschine
A real example: my dad was in Europe and was wondering what the truck in front was. Suffice it to say that a tokenizer will need to do more than look for whitespace.
Once we have tokens and have stemmed them, we can apply other filters such as lowercase and diacritic removal.
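Those last two filters are straightforward with the Python standard library; a minimal sketch:

```python
import unicodedata

def lowercase(tokens):
    return [t.lower() for t in tokens]

def strip_diacritics(tokens):
    # NFD-decompose, then drop combining marks: "ṁ" -> "m", "ā" -> "a".
    out = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFD", t)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return out

tokens = ["Uddhacca", "kukkuccaṁ"]
print(strip_diacritics(lowercase(tokens)))  # -> ['uddhacca', 'kukkuccam']
```

Running the diacritic filter after stemming means users can search with plain ASCII ("kukkuccam") and still hit the marked-up forms in the texts.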
Have you used DPD before? Look up almost any word. If it has a root, it gives you the root. It also gives the other words that this root can form, including the prefixes, suffixes, various declensions, conjugations etc.
And for other usage, any generic AI out there is a very good Pāli translator already. I don’t see the benefit of this project.
The point is for searching SuttaCentral. Right now you have to search SC for the exact form of the Pāḷi as it is in our version. Any slight deviation and search can’t find the text you’re looking for.
The point is not to build an automated translator.
I noted this a few years ago. That's why I became reliant on DPD for the roots and compound info. Plus Bhante Sujato tries to include this info in some of his notations – a laborious effort tantamount to a walk for peace.
And not everyone has the resources or ability to obtain and use Margaret Cone’s three volumes!
@NgXinZhao - DPR’s search is, iirc, just a straight-up Regex search, no? I think we are targeting something a little more “user friendly” for SC since our users tend to be less tech-savvy. Plus there’s only so much you can do with a regex to match forms before you start catching other terms in your net.
As far as compound words are concerned, they can certainly be taken care of by another token filter. It could be done algorithmically or with a dictionary. For example, if we come across one token, we produce two:
basketball => basket ball
This might not be necessary, but as far as the stemmer is concerned we can assume this happens elsewhere.
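A minimal dictionary-based splitter could look like the sketch below. The vocabulary is a hypothetical stand-in, and a real filter would also have to cope with sandhi at the join:

```python
# Dictionary-based compound splitter: if a token splits cleanly into two
# known words, emit both parts. VOCAB is an illustrative stand-in; a real
# filter would use a full dictionary and handle sandhi at the boundary.

VOCAB = {"basket", "ball", "uddhacca", "kukkucca"}

def split_compound(token):
    for i in range(2, len(token) - 1):
        head, tail = token[:i], token[i:]
        if head in VOCAB and tail in VOCAB:
            return [head, tail]
    return [token]

print(split_compound("basketball"))  # -> ['basket', 'ball']
```

Whether the original compound token should also be kept alongside the two parts is a behavior question the test set would need to settle.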
I am not sure whether this person has already solved this issue. It could be worth checking out his work as well.
Not sure why people flagged my last post as off topic when I am suggesting we check out possible existing work first, rather than wasting time and effort replicating what others have already done.
BTW, scVoice/iOS uses lemmatized search with on-device SQLite for supported authors in contemporary languages. It works great and doesn't require Elasticsearch, etc. The contemporary translations from SC have all been lemmatized into a lemmas column for each segment. Very fast, offline, on device. Works better than sc-voice.net. We haven't yet got to Pali, but I just downloaded Pali Practice, so possibilities blossom here.
I had to find out what Snowball is, and laughed when I learned it is a new language paying homage to SNOBOL.
MS-DPD has a CLI as well, so I think your team could glue something together for server and browser use. For lemma search, scVoice/iOS has the SCPali in a SQLite segments table. That table has a lemma column and a text column; the lemma column is a lemmatized version of the text column. In use, just lemmatize a Pali query and use normal SQL with LIKE '%lemmaquery%'. Very simple, very fast. I use Claude so cannot help you directly, but MS-DPD is AI-free.
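That lemma-column approach can be sketched in Python like this. The table name, column names, sample row, and the stand-in lemmatizer are all illustrative, not scVoice's actual schema:

```python
import sqlite3

# Sketch of lemma search: a segments table with the raw text and a
# pre-lemmatized lemma column, queried with LIKE. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (scid TEXT, text TEXT, lemma TEXT)")
conn.execute(
    "INSERT INTO segments VALUES ('mn1:1.1', 'evaṁ me sutaṁ', 'eva me suta')"
)

def lemmatize(query):
    # Stand-in lemmatizer; in practice the same lemmatizer that filled
    # the lemma column would be applied to the query.
    return query.rstrip("ṁ")

q = lemmatize("sutaṁ")
rows = conn.execute(
    "SELECT scid FROM segments WHERE lemma LIKE ?", (f"%{q}%",)
).fetchall()
print(rows)  # -> [('mn1:1.1',)]
```

The key point is that the expensive work (lemmatization) happens once at build time; at query time it is a plain substring match, which is why it is fast enough to run offline on a phone.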