Statistical analysis of Early Buddhist Texts

Regarding the statistical analysis of EBTs (such as the four principal Nikayas/Agamas), I think it may be useful to look at three aspects: speaker (who speaks), message (what is spoken), and audience (to whom it is spoken). One then identifies and discusses the differences and similarities between the EBTs on these three aspects, including the relevant counts.
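To make that concrete, here is a rough Python sketch of the tallying step. The records are made-up placeholders; real speaker/audience data would first have to be extracted from the texts, which are not marked up uniformly for this.

```python
from collections import Counter

# Hypothetical per-sutta records; real data would need to be
# extracted from the texts first.
records = [
    {"sutta": "SN 1.1", "speaker": "the Buddha", "audience": "a deity"},
    {"sutta": "SN 1.2", "speaker": "the Buddha", "audience": "a deity"},
    {"sutta": "MN 44", "speaker": "Dhammadinnā", "audience": "Visākha"},
]

speaker_counts = Counter(r["speaker"] for r in records)
audience_counts = Counter(r["audience"] for r in records)

print(speaker_counts.most_common())   # who speaks, and how often
print(audience_counts.most_common())  # to whom, and how often
```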

1 Like

Thanks @trusolo!! I assume you have permission to share this with me? I know that the moderators are pretty strict when it comes to copyrighted material :slight_smile:

1 Like

Oops! I didn't check. I had it through my academic institution. I can delete it and send it to you via PM or DM or whatever it is called.

1 Like

No need @trusolo, I made very sure to copy it, save it to my Buddhist library, and back it up to a second disk before I replied to you :slight_smile: I am reading it right now, and the first thing to say about it is that it opens with word counts and unique word counts of the Nikayas!! (He got different numbers from my clumsy attempts, but it's great to see someone in print putting their numerical money where their mouth is.)

1 Like

For future reference, if you want any article in a somewhat obscure journal or behind a paywall, do let me know. I have access to journals I didn't even know existed! :grin:

1 Like

I know I'm coming quite late to the party, but if anybody is interested in doing statistical analysis on all the Pali Buddhist texts (SC + VRI), there is a repo here:

If you want to do the same with the Chinese, Sanskrit, or Tibetan files, there are other repos on BuddhaNexus for that. Note, however, that the Sanskrit files are incomplete, while the Pali, Chinese, and Tibetan collections are complete.

3 Likes

It isn't, but two things jump out as candidates to me:

  1. The amount of repetition inside individual suttas (the same phrase being repeated twice or thrice, etc.).
  2. The number of nearly identical variant suttas found in SN and AN.

DN and MN don't have (many or any) variant suttas in them, so that may be the larger issue affecting the ratios. Overall, the shift from repetitive oral-tradition styles to more literary narrative styles in DN might be part of the difference, too.
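One crude way to put a number on the repetition in point 1 would be to count repeated word n-grams within a single sutta. A minimal sketch, where the sample text and the window size are arbitrary placeholders:

```python
from collections import Counter

def repeated_ngrams(text, n=3):
    """Count word n-grams that occur more than once within one text."""
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {g: c for g, c in grams.items() if c > 1}

# Placeholder text standing in for a whole sutta:
sample = "so evaṁ pajānāti idaṁ dukkhaṁ so evaṁ pajānāti ayaṁ dukkhasamudayo"
for gram, count in repeated_ngrams(sample).items():
    print(count, " ".join(gram))   # e.g. "2 so evaṁ pajānāti"
```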

I know from experience that these factors would be much bigger if we did the same thing with the Chinese Agamas because they have less abbreviation overall. [<- On second thought, I'm not sure that's true. I spend a lot of time reading Sujato's translations, which skip large swaths of Pali, whereas I translate the Chinese literally without abbreviating it. So my experience, I think, creates an impression that's maybe even the reverse of reality: Pali suttas in DN and MN are more developed than their Chinese parallels when looked at more closely.] But then, actually counting unique "words" would be a huge undertaking, since there are no word boundaries beyond punctuation, which is somewhat arbitrary …

But that also raises the question of what to do about the abbreviations in present-day Buddhist texts. If they were expanded, the ratios would be quite different, I imagine, since the removed material is sometimes pretty large. This is largely why I've abandoned a lot of this mechanical analysis myself. Chinese texts make it really difficult anyway, and the results tell us what we already know in a general way.

2 Likes

I think there's still much to do and potentially much to be gained. My impression is that the state of our knowledge is very much in its infancy. For example, the Bilara files are littered with full stops, commas, various quotation marks, colons, and semicolons, all of which are applied in an ad hoc and inconsistent way, and all of which break string searches in the Digital Pali Reader unless you make extensive use of regular expressions and wildcards. Even pe is done in about six different ways (as pe, as pe…, as …pe…, etc.).

Just having a clean file for each of the Nikayas would be a good thing for machine processing: no punctuation marks, no capitals, only single spaces, and so on.
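As a sketch of what such a cleaning pass might look like in Python (the punctuation rules and the pe pattern here are my assumptions, not SuttaCentral's actual conventions):

```python
import re

def clean(text):
    """Normalise a Pali text for string matching (assumed rules, see above)."""
    text = text.lower()
    # unify pe, pe…, …pe… etc. into a single plain token
    text = re.sub(r"[.…]*\s*\bpe\b\s*[.…]*", " pe ", text)
    # drop digits and anything that is not a letter (diacritics survive \w)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # single spaces only
    return re.sub(r"\s+", " ", text).strip()

print(clean('Evaṃ me sutaṃ: ekaṃ samayaṃ …pe… "Bhagavā"'))
# -> evaṃ me sutaṃ ekaṃ samayaṃ pe bhagavā
```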

Down the track, it should be possible to do things like determine the number of case endings in a given sutta/vagga/nikaya, at least in non-compound words… It should probably also be feasible to identify compounds in many cases where they occur elsewhere in the canon in non-compounded form, by wildcard-searching for string matches with and without spaces (a sketch of both ideas follows).
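Here is a toy Python sketch of both ideas. The vocabulary, the split rule, and the ending list are illustrative assumptions only; a real case-ending inventory would need a proper Pali grammar, and these endings cover only a-stems.

```python
from collections import Counter

def compound_splits(token, vocabulary, min_part=3):
    """Two-part splits of token whose halves both occur as standalone words."""
    return [
        (token[:i], token[i:])
        for i in range(min_part, len(token) - min_part + 1)
        if token[:i] in vocabulary and token[i:] in vocabulary
    ]

# Toy a-stem case endings; a real inventory needs a proper Pali grammar.
CASE_ENDINGS = ("ena", "assa", "asmā", "asmiṃ", "ehi", "ānaṃ", "esu")

def ending_counts(tokens):
    """Tally tokens by the first matching ending (order matters in this toy)."""
    counts = Counter()
    for tok in tokens:
        for end in CASE_ENDINGS:
            if tok.endswith(end):
                counts[end] += 1
                break
    return counts

vocab = {"dhamma", "vinaya", "sati"}            # toy standalone-word list
print(compound_splits("dhammavinaya", vocab))   # [('dhamma', 'vinaya')]
print(ending_counts(["dhammena", "dhammassa", "dhammesu"]))
```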

Obviously all these things have their limitations and caveats, but I think there's plenty of opportunity, at least with the Pali.

Metta.

You can very easily filter these things out with regex inside your code.

Indeed, we are currently also working on a Pali stemmer. We have a Sanskrit stemmer here: BuddhaNexus. SuttaCentral has a Pali stemmer somewhere in the depths of the code … :grinning: I think the stemmer is used within Elasticsearch, so you would need to look into the SC backend code in Python. It also uses a stemmer at the frontend (I suspect) for breaking down compounds. In any Pali text, turn on the Pali->English lookup tool and click on a long word to see it work.
BuddhaNexus takes the compounds into account when comparing texts to find parallels.

This file might be of help. It has both the stemmer and replacement of all typography: suttacentral/client/elements/lookups/sc-lookup-pli.js at master · suttacentral/suttacentral · GitHub
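For anyone curious what such a stemmer does at its core, a toy Python illustration of the general suffix-stripping idea follows. The ending list is a tiny made-up subset; the real tables in sc-lookup-pli.js and similar tools are far larger and more careful.

```python
# Illustrative ending subset only; real stemmers handle many more forms.
SUFFIXES = sorted(["ena", "assa", "ehi", "ānaṃ", "esu", "aṃ", "o", "ā", "e"],
                  key=len, reverse=True)

def stem(word, min_stem=2):
    """Strip the longest matching ending, leaving at least min_stem chars."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["dhammena", "dhammassa", "dhammesu", "dhammo"]:
    print(w, "->", stem(w))   # all reduce to the stem "dhamm"
```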

3 Likes

Another Pali stemmer/inflection splitter under active development (many commits from a month or so ago) is at: GitHub - digitalpalidictionary/inflection-generator: generate all inflections from scratch.

2 Likes