Building a Pali language analyzer

Hey folks!

There has been a lot of discussion of Pali stemmers over the years. If someone is interested in helping with the algorithm, I’d be happy to convert it to Snowball.

Take this example for the English language:
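An English stemmer is essentially an ordered set of suffix-stripping rules. A toy sketch of the idea in Python (the suffix list here is made up for illustration, not the real Porter/Snowball algorithm):

```python
# Toy suffix-stripping stemmer in the spirit of Snowball's English
# stemmer. The rules are illustrative only, not the Porter algorithm.
SUFFIXES = ["ational", "ization", "fulness", "edly", "ing", "ed", "ly", "s"]

def toy_stem(word: str) -> str:
    # Try the longest suffixes first; keep at least a 3-letter stem.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("connected"))   # connect
print(toy_stem("connecting"))  # connect
print(toy_stem("cats"))        # cat
```

A real Snowball stemmer compiles rules like these down to C, Java, and other targets, which is what makes it attractive for search-engine integration.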

Having a Snowball stemmer for Pali would give us the option of indexing our texts in ArangoSearch, Elasticsearch, Tantivy, and so forth.

I’m happy to tackle the technical aspects if someone has an approach for the grammar.

Having a look here gives us a starting point:

They require:

This must include an algorithmic description of the stemmer, an implementation in Snowball, and a representative language vocabulary of about 30,000 words that can be used as part of a standard test.

If there is some interest, I might share it with the Snowball mailing list.

Cheers,

Ajahn J.R.

2 Likes

In addition to being used in arangosearch, this is how you’d integrate it with Elastic:
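A rough sketch of the index settings, assuming a hypothetical "pali" stemmer were available (Elasticsearch has no built-in one; it would need to be shipped as a plugin, and the `pali_stemmer` / `pali_analyzer` names here are made up):

```python
import json

# Sketch of Elasticsearch index settings wiring a hypothetical Pali
# stemmer into a custom analyzer. "pali" is not a language that
# Elasticsearch's stemmer filter supports out of the box.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "pali_stemmer": {"type": "stemmer", "language": "pali"}
            },
            "analyzer": {
                "pali_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "pali_stemmer"],
                }
            },
        }
    }
}

# This JSON body would be sent when creating the index.
print(json.dumps(settings, indent=2))
```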

1 Like

I put up a very very first draft for one up here:

I asked Bhante Bodhirasa for his help on the grammar part and got back a dismissive response that the DPD DB has “solved” Pāḷi term matching and that it was (iirc) a bad idea to build a stemmer at all because then someone might be tempted to use it!

That, along with various theoretical questions about the handling of compounds and sandhi that Bhante brought up, bogged me down, so I abandoned the project, though I did end up building an extremely hacky Pali stemmer for my own use in parsing the Vinaya Vibhaṅga.

4 Likes

I think the first step here would be to use the DPD DB to craft a large test set of terms that should match (and ideally those that should not) and to hammer out the theoretical questions around how we want to handle combined words. Once we have some idea of what we want the behavior to be, we can start to try out algorithms against the test set and see how many tests we can get to pass.
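As a sketch, the test set could simply be a table of surface forms and expected stems run against whatever candidate stemmer we have. The stand-in stemmer below is purely illustrative (a real test set would be extracted from the DPD), though dhammo/dhammassa/dhammena are genuine declensions of dhamma:

```python
# Sketch of a test harness: (surface form -> expected stem) pairs run
# against a candidate stemmer. The stemmer and the tiny suffix list
# are placeholders; a real test set would come from the DPD.
def candidate_stem(word: str) -> str:
    # Stand-in: strip a few common nominal endings.
    for suffix in ("assa", "ena", "ssa", "o"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

TEST_SET = {
    "dhammo": "dhamm",     # nominative singular
    "dhammassa": "dhamm",  # genitive singular
    "dhammena": "dhamm",   # instrumental singular
}

failures = [(w, s) for w, s in TEST_SET.items() if candidate_stem(w) != s]
print(f"{len(TEST_SET) - len(failures)}/{len(TEST_SET)} passed")
```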

2 Likes

I agree. Start with some good tests and then refine the algorithm.

As for compound words, there would perhaps be more than one step. The first would be to produce the stream of tokens, with the stemmer acting as a filter on the tokens as they’re produced.

Say we’re working with Elasticsearch. For a Pali language analyzer, you’d have a tokenizer:

Then you would take each token and apply a stemmer filter.

The stemmed tokens would then form the basis of the index.
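In pure-Python terms, the chain might look like this; the whitespace tokenizer and the vowel-stripping "stemmer" are placeholders for the real components:

```python
import re

# Sketch of the analysis chain: tokenizer -> stemmer filter -> index
# terms. Both stages are placeholders for the real components.
def tokenize(text: str):
    # Naive tokenizer: runs of word characters.
    return re.findall(r"\w+", text)

def stem_filter(tokens):
    # Placeholder "stemmer": strip a trailing niggahita or long vowel.
    for token in tokens:
        yield token.rstrip("ṁoāī")

def analyze(text: str):
    return list(stem_filter(tokenize(text)))

print(analyze("evaṁ me sutaṁ"))  # ['eva', 'me', 'suta']
```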

Dunno. We’ll see.

1 Like

Yeah I think that’s right. Not really the stemmer’s job. :+1:

1 Like

We can broaden our scope a bit. I’ve updated the topic heading.

A stemmer is one of the building blocks for a language analyzer. You can read about Arangosearch analyzers here:

https://docs.arango.ai/arangodb/3.11/indexes-and-search/analyzers/

When analyzing text we first need to produce tokens. Say we come across this text:

uddhaccakukkuccaṁ

Should we break this into two tokens and stem both? Not sure how that would work but it seems quite similar to tokenising German. For example:

Straßenwaschmaschine

A real example: my dad was in Europe and was wondering what the truck in front of him was. Suffice it to say that a tokenizer will need to do more than look for white space.

Once we have tokens and have stemmed them, we can apply other filters such as lowercase and diacritic removal.
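Diacritic removal, at least, is straightforward with Unicode decomposition. A sketch of a lowercase-plus-fold filter:

```python
import unicodedata

# Sketch of two post-stemming token filters: lowercasing, then
# diacritic removal by decomposing to NFD and dropping combining marks.
def fold(token: str) -> str:
    token = token.lower()
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)

print(fold("Uddhaccakukkuccaṁ"))  # uddhaccakukkuccam
print(fold("Pāḷi"))               # pali
```

This way a user can type plain ASCII ("palibodha") and still match the diacritic form in the index.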

This can all be tweaked and tuned.

EDIT: Oh, and we might want a list of stopwords.
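A stopword filter is the simplest link in the chain. Which Pali words should count as stopwords is an open question; the particles below are only plausible candidates:

```python
# Sketch of a stopword filter. The stopword list is a guess at some
# common Pali particles, not a vetted set.
STOPWORDS = {"ca", "va", "pi", "hi", "kho", "pana"}

def drop_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(drop_stopwords(["evaṁ", "me", "kho", "sutaṁ"]))
```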

1 Like

Apparently the term is “decompounding”:

A dictionary decompounder is also an option:

2 Likes

Have you used DPD before? Look up almost any word. If it has a root, it gives you the root. It also gives the other words that this root can form, including the prefixes, suffixes, various declensions, conjugations etc.

And for other uses, any generic AI out there is already a very good Pāli translator. I don’t see the benefit of this project.

Also, DPD already can break down long compounds.

1 Like

The point is for searching SuttaCentral. Right now you have to search SC for the exact form of the Pāḷi as it is in our version. Any slight deviation and search can’t find the text you’re looking for.

The point is not to build an automated translator.

4 Likes

Correct. The Pali analyzer will be used in the context of indexing and querying the search engine.

2 Likes

Thank you @Jhanarato and @Khemarato.bhikkhu for taking up this project :folded_hands:.

I noted this a few years ago. That’s why I became reliant on DPD for the roots and compound info. Plus Bhante Sujato tries to include this info in some of his notations – a laborious effort tantamount to a walk for peace :peace_symbol:

And not everyone has the resources or ability to obtain and use Margaret Cone’s three volumes!

Thank you again :beating_heart:

DPR has a very good search engine; why not just ask its creators how they did it?

Of all the search engines for the suttas, that’s the best. I have wondered why other sites don’t just copy it.

@NgXinZhao - DPR’s search is, iirc, just a straight-up regex search, no? I think we are targeting something a little more “user friendly” for SC since our users tend to be less tech-savvy. Plus there’s only so much you can do with a regex to match forms before you start catching other terms in your net.

2 Likes

As far as compound words are concerned, they can certainly be taken care of by another token filter. It could be done algorithmically or with a dictionary. For example, if we come across one token, we produce two:

basketball => basket ball

This might not be necessary, but as far as the stemmer is concerned we can assume this happens elsewhere.
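A sketch of what such a dictionary-based filter might do, using a greedy longest-match split against a tiny, illustrative word list:

```python
# Sketch of a dictionary decompounding filter: greedily cover the
# token with the longest dictionary words available, emitting the
# parts if the whole token can be covered. The dictionary is a toy.
DICTIONARY = {"basket", "ball", "basketball"}

def decompound(token: str, min_part: int = 3):
    parts, i = [], 0
    while i < len(token):
        # Longest dictionary word starting at position i, but never
        # the whole token itself (otherwise nothing is split).
        for j in range(len(token), i + min_part - 1, -1):
            if token[i:j] in DICTIONARY and (j - i) < len(token):
                parts.append(token[i:j])
                i = j
                break
        else:
            return [token]  # no full cover; keep the token whole
    return parts

print(decompound("basketball"))  # ['basket', 'ball']
print(decompound("basket"))     # ['basket']
```

Whether the index should keep the original compound alongside the parts is one of the design decisions the test set should settle.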

2 Likes

I am not sure whether this person has already solved this issue. He might be worth checking out too.

Not sure why people flagged my last post as off topic when I was suggesting we check out possible existing work first rather than waste time and effort replicating what others have already done.

For scVoice/iOS I was just going to use the DPD. We have the sc-voice/ms-dpd Pali JavaScript library on GitHub that we use for sc-voice.net. It’s a compact DPD. And yes, the DPD knows about lemmas. (Many thanks to Ven. Bodhirasa. :folded_hands: )

BTW, scVoice/iOS uses lemmatized search with on-device SQLite for supported authors in contemporary languages. It works great and doesn’t require Elasticsearch, etc. The contemporary translations from SC have all been lemmatized in a lemmas column for each segment. Very fast. Offline on device. Works better than sc-voice.net. We haven’t yet got to Pali, but I just downloaded Pali Practice, so possibilities blossom here. :seedling:

2 Likes

Thanks @karl_lew !

@sabbamitta has filled me in on your prior art.

I’m agnostic as to how we solve the problem. If a big old lookup table works, then good.

Anyone who can turn a 2 GB SQLite database into a 30-line Snowball algorithm gets a tip of the hat from me.

3 Likes

I had to find out what Snowball is and laughed when I learned it is a new language paying homage to SNOBOL. :joy:

MS-DPD has a CLI as well, so I think your team could glue something together for server and browser use. For lemma search, scVoice/iOS has the SCPali in a SQLite segments table. That table has a lemma column and a text column; the lemma column is a lemmatized version of the text column. In use, just lemmatize a Pali query and use normal SQL with LIKE '%lemmaquery%'. Very simple, very fast. I use Claude so I cannot help you directly, but MS-DPD is AI-free.
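A sketch of that approach with Python’s built-in sqlite3; the schema and the placeholder lemmatizer are illustrative, not the actual scVoice/iOS code:

```python
import sqlite3

# Sketch of lemmatized search: a segments table with text and lemma
# columns, queried with LIKE on the lemma column. The lemmatizer is a
# toy stand-in (lowercase, strip a trailing "s").
def lemmatize(text: str) -> str:
    return " ".join(w.lower().rstrip("s") for w in text.split())

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE segments (text TEXT, lemma TEXT)")
for text in ["Thus have I heard", "The teachings of the Buddha"]:
    db.execute("INSERT INTO segments VALUES (?, ?)",
               (text, lemmatize(text)))

# Lemmatize the query the same way the stored text was lemmatized.
query = lemmatize("teachings")
rows = db.execute("SELECT text FROM segments WHERE lemma LIKE ?",
                  (f"%{query}%",)).fetchall()
print(rows)  # [('The teachings of the Buddha',)]
```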

:folded_hands:

1 Like