Statistical analysis of Early Buddhist Texts

Vimala · October 14, 2022, 5:59am

You can very easily filter these things out with regex inside your code.
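
As a rough sketch of what that could look like in Python: I'm assuming here that "these things" means punctuation, typographic quotes, dashes and digits, and the name `clean_pali` is just made up for illustration (it is not taken from the SC code). Adjust the pattern to whatever you actually want to drop.

```python
import re
import unicodedata

# Assumption: everything that is not a letter or whitespace is unwanted
# (punctuation, curly quotes, dashes, ellipses, digits, underscores).
_UNWANTED = re.compile(r"[^\w\s]|\d|_")
_WHITESPACE = re.compile(r"\s+")


def clean_pali(text: str) -> str:
    """Return lowercased text with punctuation and digits replaced by spaces."""
    # Compose diacritics first so that letters like "ṁ" or "ā" are single
    # code points and are matched as word characters by \w.
    text = unicodedata.normalize("NFC", text)
    text = _UNWANTED.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return _WHITESPACE.sub(" ", text).strip().lower()


if __name__ == "__main__":
    print(clean_pali("“Evaṁ me sutaṁ—ekaṁ samayaṁ bhagavā …”"))
```

Note that this also splits words at apostrophes (elisions), which may or may not be what you want for word counts.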

Indeed, we are currently also working on a Pali stemmer. We have a Sanskrit stemmer here: BuddhaNexus, and SuttaCentral has a Pali stemmer somewhere in the depths of the code … I think the stemmer is used within Elasticsearch, so you would need to look into the SC backend code in Python. The frontend also uses a stemmer (I suspect) for breaking down compounds. In any Pali text, turn on the Pali->English lookup tool and click on a long word to see it work.
BuddhaNexus takes the compounds into account when comparing texts to find parallels.

This file might be of help. It has both the stemmer and the replacement of all typography: suttacentral/client/elements/lookups/sc-lookup-pli.js (master branch of suttacentral/suttacentral on GitHub)