Pali-specific search tweaks

sujato · October 22, 2015, 8:17am

Using the search on SC, I regularly find that there are certain quirks in Pali that a clever search engine should know about, but doesn’t. As @blake evolves the search, it would be good to include these when possible. They may well be useful for the Pali lookup as well. Sometimes these matches might be best done only if regular matching doesn’t work.

I’ll update this list as I come across new cases.

Meanwhile, if anyone has any other cases, please put them in the comments. Note that this topic is specifically for what I have said here. If you have other suggestions for search, best open a new topic.

Initial vy- = by-. These commonly differ in different editions. Our dicts usually have vy- while the text has by-.
ṅ = ṁ = ṃ.
Final -ti. Obviously there’s a problem differentiating it from verbal endings. Still:
“long vowel-ti” —> “short vowel iti” or “long vowel iti”
Final -nti —> " iti" (i.e. drop sandhi)
ñeva and ññeva —> yeva
Often our text includes quote marks, which makes this trivial: bhiyyo’ti, nirodhetun”ti
Final -pi (= api) is similar to -ti, except less readily mistaken for a verbal ending. Examples:
cepi —> ce api
-mpi —> -ṃ api
Final -ce and -ca. These commonly create a sandhi ñ, eg. puthujjanānañcepi = puthujjanānaṃ ce api. So:
-ñce —> -ṃ ce
-ñca —> -ṃ ca
Final -va can stand for iva or eva, and it’s usually not possible to differentiate, so match both. The usual sandhi issues apply, eg mayañceva = mayaṃ ce eva:
-āva —> a iva or a eva or ā iva or ā eva.
Initial a: try stripping it and finding the positive term. Sometimes there’s a sandhi.
asekha —> a-sekha
aneka —> a-(n)-eka
Initial n: see if it’s an na-:
neva —> na eva (although as it happens this example is found in our dicts already)
nāsaññā —> na asaññā
ṃ before s is inconsistent in some words: mahisa vs. mahiṃsa.
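
As a rough illustration of how a few of the rules above could be applied as a fallback, here is a minimal Python sketch. The function names (`normalize_nasals`, `desandhi`, `candidates`) and the rule table are hypothetical and are not SC's actual search code. Only the more mechanical rules are covered; the ambiguous ones (-ti, -va, initial a- and n-) are left out.

```python
import re
import unicodedata

# Hypothetical rule table: (pattern, replacement), applied in order.
SANDHI_RULES = [
    (re.compile(r"cepi$"), "ce api"),       # cepi -> ce api
    (re.compile(r"mpi$"), "ṃ api"),         # -mpi -> -ṃ api
    (re.compile(r"ñce(?= |$)"), "ṃ ce"),    # -ñce -> -ṃ ce
    (re.compile(r"ñca(?= |$)"), "ṃ ca"),    # -ñca -> -ṃ ca
]


def normalize_nasals(word: str) -> str:
    """Treat ṅ, ṁ and ṃ as the same character (written here as ṃ)."""
    word = unicodedata.normalize("NFC", word)
    return word.replace("ṅ", "ṃ").replace("ṁ", "ṃ")


def desandhi(word: str) -> str:
    """Undo a few final sandhi forms by applying each rule in turn."""
    for pattern, replacement in SANDHI_RULES:
        word = pattern.sub(replacement, word)
    return word


def candidates(word: str) -> list[str]:
    """Return search candidates, with the plain form first."""
    base = normalize_nasals(word)
    out = [base]
    split = desandhi(base)
    if split != base:
        out.append(split)
    # Initial vy- and by- differ between editions, so try both.
    for form in list(out):
        if form.startswith("by"):
            out.append("vy" + form[2:])
        elif form.startswith("vy"):
            out.append("by" + form[2:])
    return out


print(candidates("puthujjanānañcepi"))
print(candidates("byāpāda"))
```

Note that a rule like -ñca would wrongly split pañca into paṃ ca, which is one reason to try these candidates only when regular matching finds nothing.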

blake · October 23, 2015, 12:02pm

These are both straightforward; the easy way is to just always make the substitution.

To be clear, do you want to be able to search for “iti” and “yeva”, or just strip these forms off the back? The quote mark case is easy, but I’ve updated it to turn n” into ṃ.

At the moment the processing is done in Elasticsearch using regular expressions. It’s very crude, but it’s easy. In order to do fuzzy matching it simply slices anything off the end which looks remotely like a conjugation or declension suffix.
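
To make the idea concrete, a crude suffix slicer might look like the sketch below, written in Python rather than as an Elasticsearch configuration. The suffix list and the `crude_stem` helper are made up for illustration; the actual patterns used on SC are not given here.

```python
import re

# Hypothetical and deliberately incomplete list of endings.
SUFFIXES = [
    "ssāmi", "ssati", "assa", "anti", "ānaṃ",
    "ena", "ehi", "esu", "aṃ", "ti", "mi", "si",
    "o", "ā", "e", "a", "i",
]

# One alternation anchored at the end of the word. Because re.search
# returns the leftmost match, the longest matching ending is removed.
SUFFIX_RE = re.compile("(?:" + "|".join(SUFFIXES) + ")$")


def crude_stem(word: str, min_stem: int = 3) -> str:
    """Slice off anything at the end that looks like an inflection."""
    match = SUFFIX_RE.search(word)
    if match and match.start() >= min_stem:
        return word[:match.start()]
    return word  # too short to cut safely, or no known ending


for w in ["bhavissāmi", "bhavissati", "dhammo", "dhammassa", "ca"]:
    print(w, "->", crude_stem(w))
```

Indexing and querying with the same slicer makes different inflections of a word meet at a shared stem, at the cost of false matches between unrelated words that happen to share that stem.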

By the way, you may not be aware of this since it’s not particularly documented, but you can use wildcards. For example, try searching for bhavis*; a wildcard query works as you would expect.
You can also use wildcards at the start of a word or even inside a word, like bhavis*ma (in this case, you can get a fuzzy match on a word like bhavissāmi since the fuzzy matcher still slices’n’dices the end of the term).

Because of the less than surgical precision of the fuzzy matcher you might often get better results from wildcards, and it’s also a good way of searching in compounds.
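
For anyone unsure what the wildcards do, the matching behaviour can be imitated offline with Python's `fnmatch` module. This is only an approximation of the semantics, not how the search is implemented; the word list and the `wildcard_search` helper are invented for the example.

```python
from fnmatch import fnmatchcase

# Invented sample vocabulary, standing in for the indexed terms.
WORDS = [
    "bhavissati",
    "bhavissāmi",
    "bhavissāma",
    "dhammacakka",
    "saddhammassa",
    "saṅgha",
]


def wildcard_search(pattern: str, words: list[str] = WORDS) -> list[str]:
    """Return the words matching a pattern where * stands for any run of characters."""
    return [w for w in words if fnmatchcase(w, pattern)]


print(wildcard_search("bhavis*"))     # wildcard at the end
print(wildcard_search("bhavis*ma"))   # wildcard inside a word
print(wildcard_search("*dhamma*"))    # finding a term inside compounds
```

The last pattern shows why wildcards help with compounds: `*dhamma*` finds both dhammacakka and saddhammassa, which an exact match on dhamma would miss.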

sujato · October 23, 2015, 11:23pm

I’m not quite sure what you’re getting at here. I wasn’t suggesting any change to the way iti or yeva was handled, but that final -nti and (ñ)ñeva be replaced for search purposes with -n (or ṃ as you suggest) iti and yeva respectively. Although, in fact, searching for these terms is about as useful as searching for “and” and “is” in English, so maybe just stripping them would be a better idea.

I do use the wildcards occasionally.

I’ll keep adding cases to this list as I go along.