Statistical analysis of Early Buddhist Texts

Yes, I believe this was all discussed in a previous thread.


Yes @Snowbird and @stephen, people did ask the “what do you hope to gain” question in the other thread. I am not sure that really constitutes “discussion”.

I want to know how long the 4 Nikayas are relative to each other and how large the vocabulary is in each work.

I am hopeful that as I move forward with learning Linux text processing I can learn more numerical and statistical facts about the Nikayas.

I am not sure what you mean by “useful”. Useful for what? It’s of interest to me for its own sake, I like knowing numerical things, it helps to frame things, it also provides context to arguments about frequency because I will have denominators for my numerators.

How many words are there?
How many of them are different?
Do some occur more frequently in some places?
Do some specific parts of words differ in different places?

What’s wrong with asking these questions?

People seemed fine with @Vimala asking numerical questions, is it that monastics can like numbers but the “laity” are not permitted?

I know I am starting about as simple as you can get but you have to start somewhere and simple is where I like to start.

For one thing, when I write my “big book of British Buddhism” I will be able to say that the 4 primary Nikayas have blah words and a vocabulary of blah, instead of what I see in dozens of books and articles in the literature, which say things like “millions” and “vast” and “volumes” and “enough to fill several shelves”. All of these, when I have seen them (and I have seen them all), have irritated me and prompted the question in my head: “well, how many words exactly!?” And yet I have never seen this answered in print.

I get that there are complications like compound words and … pe… and all that, but why give up on even beginning to ask the questions just because you can envision problems?

I am having fun, that is “useful” to me.

I thought I’d try dividing unique words by total words to arrive at … a proportion of, I’m not sure what, maybe “linguistic freshness”?

  • DN = .14
  • MN = .11
  • SN = .09
  • AN = .09
  • total = .104

I don’t know if this is really measuring anything, but the results seem to match expectations. DN is the most linguistically innovative, with the greatest proportion of unique words. Then MN, then the two collections of shorter suttas. That seems about right, as the longer the suttas are, the more room there is for linguistic playfulness. The total is, as expected, an average.
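For anyone wanting to reproduce the ratios above, a minimal Python sketch of the calculation (the `\w+` tokenizer is a crude stand-in; it does not split compounds or handle the … pe … elisions):

```python
import re

def freshness(text: str) -> float:
    """Type-token ratio: unique words divided by total words."""
    # \w+ is a crude tokenizer; Python's re module is Unicode-aware,
    # so Pali diacritics (ā, ṁ, ñ, etc.) count as word characters.
    words = re.findall(r"\w+", text.lower())
    return len(set(words)) / len(words)
```

For example, `freshness("gacchati gacchati buddho")` gives 2 unique words out of 3, i.e. about 0.67.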

Based on this I’d predict that the 6 verse collections in KN would show a higher degree of “freshness”, but that this will vary between collections. The Atthakavagga poems share a lot of vocabulary, so maybe that would be down?

Another metric to look at would be how much vocabulary is shared between collections.
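One simple way that shared-vocabulary metric could be computed is as a Jaccard overlap of the two vocabularies; a rough sketch (whitespace tokenization here is a placeholder for real Pali word segmentation):

```python
def shared_vocab(text_a: str, text_b: str) -> float:
    """Jaccard overlap of two vocabularies: |A & B| / |A | B|."""
    # Whitespace splitting is a simplification; compounds and sandhi
    # would need proper handling in a real analysis.
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    return len(a & b) / len(a | b)
```

So two texts sharing 2 of 4 distinct words overall would score 0.5.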


yes @sujato! this is the sort of thing I love! ratios!! I am actually a little surprised DN beats out MN, which, especially towards the back, I always found a bit… ragged? But this is the sort of thing I am fascinated by; it provides a great counter to my sense that DN preserves the older material compared to SN, which is exciting. I think now I will try a more “fine grained” approach, and do “freshness” ratios for each of the DN suttas, and a “freshness ratio” for SN split between the Sagathavagga and the prose part. Might think about a way to dice up MN too.

I know this in no way rises to the standard of “statistical linguistics” or anything, but there might be suggestive ratios to be found nonetheless.

I have recently forked over a ridiculous amount of money for Paul Kingsbury’s PhD thesis, “The Case of the Aorists: The Chronology of the Pali Canon”, in PDF format, and am reading it with interest. I have to say that my first impression is that the arguments put forward are not particularly strong.

The other paper I am hoping to get my hands on is Dan Zigmond’s “Toward a Computational Analysis of the Pali Canon” at JOCBS, but they haven’t made it accessible yet and I am not sure I want to fork over another fistful of cash for membership.

Apart from that (and @Vimala 's work) I can’t really find much in terms of concrete publicly available work in this area (statistical and numerical arguments about the Nikayas).

If anyone has any suggestions about academic literature or even enthusiastic amateurs like myself feel free to post here or pm me!

How does linguistic “freshness” prove that the text is earlier? Being more innovative could also mean it’s later, couldn’t it? :thinking:

Also, as Bhante @sujato already mentioned, it’s in the nature of longer texts to be more elaborate. The longer texts also have more narrative sections, for which I would expect a more diverse vocabulary than for doctrinal texts.

I’d guess if you focus your research on doctrinal passages, the differences across Nikayas would perhaps tend to level out, except for verse passages.


that’s what I mean @sabbamitta, “counter” as in counter-argument :slight_smile: I am one of those weirdos who gets excited when there is an actual argument given, even against my own positions, because it stimulates my thinking!



Oh I understood “counter”, from “counting”, “to count” … 1, 2, 3, 4, …



IMO, what is needed is a theory that connects the age of the text to corpus-wide (population) statistics like word proportions.

I would argue that it is not obvious what the difference in proportion between, e.g., DN and SN, means.

Or in other words, how large are the effect sizes here? The historical processes that lead to differences in unique words in different collections of EBTs – what differences should we expect them to produce? Are these differences large or small compared to other corpuses (corporae? corpi?).

It’s actually a surprisingly difficult task to explain why some causal process should result in some difference in average, median, correlation, proportion etc. Even more difficult to say something about what magnitude we should expect the differences to be (e.g. to falsify our hypothesis).

Bhante @sujato, I know critical text scholars have formulated principles, like about how texts tend to be standardized over time etc., but have these ideas been translated into statistical models anywhere? (:nerd_face: )

Edit: Another approach could be to attempt to classify words or themes in the EBTs as earlier and later, and look for systematic differences between the baskets.

I think with the Nikayas it is fairly certain that there has been a lot of editorial copy-pasting of formulas from one place to another, for example the insertion of the aggregates into DN14, but it is of course rarely so obvious as in that case, so that muddies the waters for a start. Then there is the fact that some suttas with very early or archaic features are also suttas that were “open” to additions for a long time: for example, the sekkha patipada that we see in the silavagga of DN seems “early”, but it is only the “nucleus” around which those suttas have added material, some of which might be quite late. Similarly with the parayanavagga, for example: the core of the poem is obviously “early”, the framing introduction obviously late.

Another issue is that the prose is very different in its focus from the poetry, and it’s quite hard to compare apples with oranges, as it were.

Again narratives would tend to have more descriptive words than purely doctrinal exposition.

Finally as you say, doctrinal standardisation would reduce the unique vocabulary rather than increase it, so while more archaic material might have been less wordy so too would later standardised material.

As I have said above, for me, numbers are fun! And I think having a more quantitative sense of things can be helpful to the intuition. For example, it may be more interesting to do a unique word count within Nikayas from sutta to sutta; in DN this MAY turn out to be a good proxy for chronology (of course it also may not).

All this is to say that, as far as I can tell, apart from a few tentative and ambiguous things involving aorists, there is currently no established and rigorous quantitative science of the Nikayas. So we are still in the pre-scientific “fun with numbers” stage of looking around and seeing what, if anything, jumps out.


We have to throw around a bunch of useless questions before we stumble on to anything meaningful.

If you have the chance, a summary would be interesting.

You can usually email the author for these things. It’s a hassle not having institutional membership, but academic scholars are usually thrilled to find that someone—anyone!—is actually interested in their work.

But more than that, we need a robust means of excluding historical development. If our default hypothesis is that “difference implies evolution” then we’ll just end up reading it into everything. There are plenty of differences that have nothing to do with time. The poetry of Vangisa, just to pick an example off the top of my head, is full of sophisticated poetic techniques that could be regarded as “late”. But we know he was a poet: it’s just an expression of who he was as a person. Otherwise it’s topic, or audience, or geography, or whatever. So we need to do more than just pick up patterns; we need to test them against falsifying hypotheses. This is, of course, why the method of using multiple, independent criteria is so important.

It could easily be explained as a different style for a different purpose for a different audience; namely, converting brahmins (who were used to sophisticated literary tracts).

Or else, yes, I have no idea how this works out statistically. I guess we’d have to frame a set of hypotheses and test them. How well, for example, do they match unrelated information, such as geography, etc.?

There’s a huge problem currently in climate science, where the latest models (CMIP6) are vastly more sophisticated than previous ones, yet they map less well onto historical data. No one knows why. So sometimes it isn’t the case that more data and better modelling leads to better outcomes.

I’m not sure. But in the case of Buddhism, no. I think we’re still in the phase of poking the data and going, “ooh”.

This, and all the differences you point to, are relevant data points. What we should do, I believe, is enrich the text data with a detailed markup to identify these different styles. We already have this to some extent, for example, verses are marked. And in the Vinaya, there is quite extensive semantic markup.

Once this has been done, we could isolate, say:

  • narrative
  • doctrinal formulas
  • verse
  • analytical exposition (vyakarana)
  • conversation
  • other

The markup can’t be over-detailed, lest you lose any statistical significance.

Then all these kinds of texts can be run orthogonally. Compare DN with MN. Then compare narrative in DN and MN with doctrinal passages in DN and MN. That would allow a far greater degree of precision.

It’s easy to do this in SC with our bilara system, someone just needs to get their head down and do the work.
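The “orthogonal” comparison could be sketched roughly like this, assuming hypothetical (collection, passage-type, text) triples of the kind such markup would yield (the field names and sample data are illustrative, not SC’s actual format):

```python
from collections import defaultdict

# Hypothetical segments: (collection, passage_type, text) triples
# as might be extracted from semantic markup. Sample data only.
segments = [
    ("dn", "narrative", "atha kho bhagavā"),
    ("dn", "doctrine",  "idaṁ dukkhaṁ ariyasaccaṁ"),
    ("mn", "doctrine",  "idaṁ dukkhaṁ ariyasaccaṁ"),
]

def vocab_by_group(segments):
    """Collect the vocabulary of each (collection, type) cell separately."""
    groups = defaultdict(set)
    for collection, ptype, text in segments:
        groups[(collection, ptype)].update(text.lower().split())
    return groups

groups = vocab_by_group(segments)
# Now, e.g., doctrinal vocabulary can be compared across collections
# independently of the narrative material:
shared = groups[("dn", "doctrine")] & groups[("mn", "doctrine")]
```

Once the cells exist, any of the ratio or overlap measures discussed above can be run within a cell rather than across a whole Nikaya.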


Regarding the statistical analysis of EBTs (such as the four principal Nikayas/Agamas), I think it may be good to look at these three aspects: speaker (who speaks), message (what is spoken), and audience (to whom it is spoken). One identifies and discusses the differences and similarities, including the numbers (representing a particular quantity or amount), on the three aspects between the EBTs.


thanks @trusolo !! I assume you have permission to share this with me? I know that the moderators are pretty strict when it comes to copyright material :slight_smile:


oops! I didn’t check. I had it through my academic institution. I can delete it and send you via PM or DM or whatever it is called.


No need @trusolo, I made very sure to copy it and save it to my Buddhist library and back it up to a second disk before I replied to you :slight_smile: and I am reading it right now, and the first thing to say about it is that it opens with word counts and unique word counts of the Nikayas!! (he got different numbers to my clumsy attempts but it’s great to see someone in print putting their numerical money where their mouth is.)


For future reference, if you want any article in somewhat obscure journals or behind a paywall, do let me know. I have access to journals I didn’t even know existed! :grin:


I know I’m coming quite late to the party, but if anybody is interested in doing statistical analysis on all the Pali Buddhist texts (SC + VRI) then there is a repo here:

If you want to do it on the Chinese, Sanskrit or Tibetan files, there are other repos on BuddhaNexus for that. Note however that the Sanskrit files are not complete while the collections in Pali, Chinese and Tibetan are complete collections.


It isn’t, but two things jump out as candidates to me:

  1. The amount of repetition inside individual suttas (the same phrase being repeated twice or thrice, etc.).
  2. The number of nearly identical variant suttas found in SN and AN.

DN and MN don’t have (many or any) variant suttas in them, so that may be the larger issue that affects the ratios. Overall, the shift from repetitive oral tradition styles to more literary narrative styles in DN might be part of the difference, too.

I know from experience that these factors would be much bigger if we did the same thing with Chinese Agamas because they have less abbreviation overall. [<- On second thought, I’m not sure if that’s true. I spend a lot of time reading Sujato’s translations, which skip large swaths of Pali, whereas I translate the Chinese literally without abbreviating it. So, my experience I think creates this impression that’s maybe even the reverse of reality. Pali suttas in DN and MN are more developed than Chinese parallels when looked at more closely.] But, then, actually counting unique “words” would be a huge undertaking since there are no word boundaries beyond punctuation, which is somewhat arbitrary …

But that also raises the question of what to do about the abbreviations in present day Buddhist texts. If they were expanded, the ratios would be quite different, I imagine, since the removed material is pretty large sometimes. This is largely why I’ve abandoned a lot of this mechanical analysis, myself. Chinese texts make it really difficult, anyway, and the results tell us what we already know in a general way.


I think there’s still much to do and potentially much to be gained. My impression is that the state of our knowledge is very much in its infancy. For example, the Bilara files are littered with full stops, commas, various quotation marks, colons and semicolons, all of which are applied in an ad hoc and inconsistent way, and all of which break string searches in the Digital Pali Reader unless you make extensive use of regular expressions and wildcards. Even pe is done in about six different ways (as pe, as pe…, as …pe…, etc.).

Just having a clean file for each of the Nikayas would be a good thing for machine processing: no punctuation marks, no capitals, only single spaces, etc.
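A sketch of the kind of cleanup I mean (the regexes here are simplified assumptions, not an exhaustive inventory of the typography actually in the files):

```python
import re

def normalize(text: str) -> str:
    """Flatten typography so string searches behave predictably."""
    text = text.lower()
    # Unify the various spellings of the elision marker (pe, pe…, …pe…, ...pe...).
    text = re.sub(r"[.…]{0,3}\s*\bpe\b\s*[.…]{0,3}", " pe ", text)
    # Drop remaining punctuation; \w is Unicode-aware, so Pali
    # diacritics (ā, ṁ, ñ, etc.) survive.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse whitespace runs to single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

So, for instance, `normalize("Evaṁ me sutaṁ ...pe... bhagavā.")` comes out as `"evaṁ me sutaṁ pe bhagavā"`, which a plain string search can then handle.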

Down the track it should be possible to do things like determine the number of case endings in a given sutta/vagga/nikaya, at least in non-compound words… it should probably be feasible to identify where words are compounded, in many cases where they occur as repetitions with non-compounded occurrences elsewhere in the canon, by wildcard searching for string matches with and without spaces.
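A toy sketch of the case-ending idea, using just four a-stem endings (the real declension tables are far richer, and a naive suffix match like this will both miss and miscount forms; it's only to show the shape of the approach):

```python
import re
from collections import Counter

# A handful of (simplified) a-stem case endings; real Pali
# declension has many more forms and ambiguities.
ENDINGS = {
    "ena": "instrumental sg.",
    "assa": "genitive/dative sg.",
    "esu": "locative pl.",
    "ehi": "instrumental/ablative pl.",
}

def count_endings(text: str) -> Counter:
    """Tally words by which of the sample case endings they carry."""
    tally = Counter()
    for word in re.findall(r"\w+", text.lower()):
        for ending in ENDINGS:
            if word.endswith(ending):
                tally[ending] += 1
                break  # count each word at most once
    return tally
```

Run per sutta or per vagga, the resulting tallies could then be compared across collections.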

Obviously all these things have their limitations and caveats, but I think there’s plenty of opportunity, at least with the Pali.


You can very easily filter these things out with regex inside your code.

Indeed, we are currently also working on a Pali stemmer. We have a Sanskrit stemmer here: Buddhanexus. SuttaCentral has a Pali stemmer somewhere in the depths of the code … :grinning: I think the stemmer is used within Elasticsearch, so you would need to look into the SC backend code in Python. It also uses a stemmer at the frontend (I suspect) for breaking down compounds. In any Pali text, turn on the Pali->English lookup tool and click on a long word to see it work.
BuddhaNexus takes the compounds into account when comparing texts to find parallels.

This file might be of help. It has both the stemmer and replacement of all typography: suttacentral/client/elements/lookups/sc-lookup-pli.js at master · suttacentral/suttacentral · GitHub


Another Pali stemmer/inflection splitter being actively worked on presently (many commits from a month or so ago) is at: GitHub - digitalpalidictionary/inflection-generator: generate all inflections from scratch.