Statistical analysis of Early Buddhist Texts

Well, for a start, Sublime is cross-platform, so I can actually use it at all, so there’s that!

It’s apples and oranges, but for what it’s worth, I can trivially search for a segment ID and pull up all the associated information.

Or search for Pali segments that have the words “ca” and “nigaṇṭhe”, separated by any number of characters.

Or anything else that a regex can do.
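The same kind of search works outside the editor, too. As a sketch, with GNU grep over a folder of the root-text JSON files (filenames will vary), something like this finds “ca” followed anywhere later by “nigaṇṭhe”:

grep -E '\bca\b.*nigaṇṭhe' *.json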

But it has limitations. Searching for multiple words in different segments is not trivial.

But it’s not so much about the specific tools, it’s about the clarity of the underlying data. Our JSON files allow you to convert and process the content of the suttas easily and reliably in ways that are just not possible in an XML-based system like CSCD. Here, try it out. I just ran bilara i/o on DN, and generated a CSV file.

dn.zip (816.3 KB)

Unzip it and open in a spreadsheet, or import it to a data-table, or open in a text editor, or run it through a NL program, or play with it in airtable, or whatever. The point is, the data is not bound to the application. I’m hoping that, over time, lots of people will take it and use it in fun ways!
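And if you’d rather generate such a file yourself: the bilara root files are flat JSON objects mapping segment IDs to text, so, assuming you have jq installed, a one-liner along these lines should produce a similar CSV (the filename here is just an example):

jq -r 'to_entries[] | [.key, .value] | @csv' dn1_root-pli-ms.json > dn1.csv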

7 Likes

I think I get a vague idea of the freedom of this approach. Still, I think I speak for many people who are just mildly tech-savvy and can see that it would take time to get into such an application-free data-processing approach. Plus, I know from experience that even if I learned it once, I would probably forget it again if I didn’t use it for a while: I was pretty good with SPSS back in 1997, and in 2003 I had to learn it all over again. Thanks for the examples!

1 Like

Sure. But there are lots of programmers in the world, now they have a new toy to play with!

1 Like

I am quite interested in this topic, which has been discussed a bit here and in a few scattered threads like

and

recently there was

Inspired by this I asked:

and got no on-topic replies, so I thought I would bump this thread. I am going to try to use some of the info gleaned from @sujato’s posts in this thread, and use

to work out a couple of things that I would have thought would be of quite general interest, but about which I can’t find any definitive information, namely:

  1. How many letters long is each of the Nikayas? This would give some frame to questions of how often a given phrase appears in, say, MN vs AN. Is AN much longer than SN? If so, how much longer? I was surprised that this info is not more readily available.

  2. How many different words are in each of the Nikayas? I.e., what is the vocabulary of each Nikaya/Vagga/whatever? This is of similar interest, it seems to me, to @Vimala’s word-length research.

About 20-odd years ago I was a web developer, hacking together things in ASP and ColdFusion and working with SQL, but that was a long time ago. More recently I got it into my head to try to host a local copy of SuttaCentral and see if I could implement the new search engine idea. Well, just getting Docker and the site to actually run as-is took so much out of me that I have avoided anything technical ever since. But I have licked my wounds enough now, and will try to get back to this thread with answers to my two questions and a description of how I did it.

Maybe if other people have pet projects, examples or other threads of interest they could post here and we could get some mutual support going?

Metta

OK, for question one I forked bilara-data, downloaded a zip, extracted the files, and navigated to root/pli/ms/sutta. Then I ran

cat * | wc -w

but sn and an were in subfolders, so for those I first ran

find . -mindepth 2 -type f -print -exec mv {} . \;
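In hindsight the move step was probably unnecessary; something like

find . -type f -name '*.json' -exec cat {} + | wc -w

should give the same count without touching the files, though I haven’t gone back to verify that.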

My results were:

DN: 173,906
MN: 294,643
SN: 357,384
AN: 388,889

Hooray!

for question 2 I did the same thing but with

cat * | sort | uniq -c | wc -l

and got

DN: 16,403
MN: 27,197
SN: 43,464
AN: 41,843

Now, I have no idea if this is in any way accurate; for starters, I haven’t removed any of the JSON segment numbers, so they will all come up as unique, I guess…
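Also, I realise that sort | uniq -c | wc -l counts unique lines (i.e. segments), not unique words. To count unique whitespace-separated tokens it would need to be something like

cat * | tr -s '[:space:]' '\n' | sort -u | wc -l

though I haven’t tried that yet, and the segment IDs would still be counted as tokens until they’re stripped out.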

Keep in mind that the concept of letters and words is not easy to map between Pali and English. Perhaps syllables would be more accurate.

For example, “dh” is one letter. Really, “dha” is one letter. I’m not sure what you are trying to prove by counting letters, though.

BTW, you can edit your post to include more information rather than replying to yourself multiple times.

1 Like

Oh, sorry @Snowbird, I am counting words, not letters; that is, strings separated by spaces in the bilara JSON files.

So I concatenated all the files in DN, MN, SN, and AN respectively to get one big file each, and then removed the JSON segment numbers with

sed 's/[0-9]*//g' mn.json > mnscrubbed.json

Then I did fresh word counts and unique counts on each file and got:

TOTAL WORDS:
DN: 173,872
MN: 294,491
SN: 357,348
AN: 388,781

UNIQUE WORDS:
DN: 24,009
MN: 31,455
SN: 34,115
AN: 37,176
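A cleaner approach might be to use jq to pull out just the segment text, rather than scrubbing digits, since the sed scrub leaves the keys’ remaining letters, quotes, colons, and braces behind to be counted as “words”. Something like this (untested):

jq -r '.[]' mn.json | wc -w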

Even counting words is a bit dodgy. With so many compound words, and the synthetic nature of Pali, I wonder what useful information you hope to glean from word counts.

5 Likes

Yes, I believe this was all discussed in a previous thread.

2 Likes

Yes, @Snowbird and @stephen, people did ask the “what do you hope to gain” question in the other thread. I am not sure that really constitutes “discussion”.

I want to know how long the 4 Nikayas are relative to each other and how large the vocabulary is in each work.

I am hopeful that as I move forward with learning Linux text processing I can learn more numerical and statistical facts about the Nikayas.

I am not sure what you mean by “useful”. Useful for what? It’s of interest to me for its own sake: I like knowing numerical things; it helps to frame things; it also provides context to arguments about frequency, because I will have denominators for my numerators.

How many words are there?
How many of them are different?
Do some occur more frequently in some places?
Do some specific parts of words differ in different places?
Etc
Etc

What’s wrong with asking these questions?

People seemed fine with @Vimala asking numerical questions; is it that monastics can like numbers but the “laity” are not permitted?

I know I am starting about as simple as you can get but you have to start somewhere and simple is where I like to start.

For one thing, when I write my “big book of British Buddhism” I will be able to say that “the 4 primary Nikayas have blah words and a vocabulary of blah”, instead of what I see in dozens of books and articles in the literature, which say things like “millions” and “vast” and “volumes” and “enough to fill several shelves”. All of these, whenever I have seen them (and I have seen them all), have irritated me and caused me to ask in my head, “well, how many words exactly!?”, and yet I have never seen this answered in print.

I get that there are complications like compound words and … pe… and all that, but why give up on even beginning to ask the questions just because you can envision problems?

I am having fun, that is “useful” to me.

I thought I’d try dividing unique words by total words to arrive at… the proportion of what, I’m not sure; maybe “linguistic freshness”?

  • DN = .14
  • MN = .11
  • SN = .09
  • AN = .09
  • total = .104

I don’t know if this is really measuring anything, but the results seem to match expectations. DN is the most linguistically innovative, with the greatest proportion of unique words; then MN, then the two collections of shorter suttas. That seems about right, as the longer the suttas are, the more room there is for linguistic playfulness. The total is, as expected, an average.

Based on this I’d predict that the 6 verse collections in KN would show a higher degree of “freshness”, but that this will vary between collections. The Atthakavagga poems share a lot of vocabulary, so maybe that one would come out lower?

Another metric to look at would be how much vocabulary is shared between collections.
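For anyone who wants to reproduce these ratios in a single pass over the scrubbed files from earlier in the thread, an awk one-liner along these lines should do it (a sketch, untested):

tr -s '[:space:]' '\n' < mnscrubbed.json | awk 'NF { total++; if (!seen[$0]++) uniq++ } END { printf "%d tokens, %d types, ratio %.3f\n", total, uniq, uniq/total }'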

2 Likes

Yes @sujato, this is the sort of thing I love! Ratios!! I am actually a little surprised DN beats out MN, which, especially towards the back, I always found a bit… ragged? But this is the sort of thing I am fascinated by: it provides a great counter to my sense that DN preserves the older material compared to SN, which is exciting. I think now I will try a more “fine-grained” approach, and do “freshness” ratios for each of the DN suttas, and a “freshness” ratio for SN split between the Sagathavagga and the prose part. I might think about a way to dice up MN too.

I know this in no way rises to the standard of “statistical linguistics” or anything, but there might be suggestive ratios to be found nonetheless.

I have recently forked over a ridiculous amount of money for Paul Kingsbury’s PhD thesis, “The Case of the Aorists: The Chronology of the Pali Canon”, in PDF format, and am reading it with interest. I have to say that my first impression is that the arguments put forward are not particularly strong.

The other paper I am hoping to get my hands on is Dan Zigmond’s “Toward a Computational Analysis of the Pali Canon” at the JOCBS, but they haven’t made it accessible yet, and I am not sure I want to fork over another fistful of cash for membership.

Apart from that (and @Vimala’s work) I can’t really find much in terms of concrete, publicly available work in this area (statistical and numerical arguments about the Nikayas).

If anyone has any suggestions about academic literature or even enthusiastic amateurs like myself feel free to post here or pm me!

How does linguistic “freshness” prove that the text is earlier? Being more innovative could also mean it’s later, couldn’t it? :thinking:

Also, as Bhante @sujato already mentioned, it’s in the nature of longer texts to be more elaborated. The longer texts also have more narrative sections, for which I would expect a more diverse vocabulary than for doctrinal texts.

I’d guess if you focus your research on doctrinal passages, the differences across Nikayas would perhaps tend to level out, except for verse passages.

1 Like

That’s what I mean, @sabbamitta: “counter” as in counter-argument :slight_smile: I am one of those weirdos who gets excited when there is an actual argument given, even against my own positions, because it stimulates my thinking!

Metta.

1 Like

Oh I understood “counter”, from “counting”, “to count” … 1, 2, 3, 4, …

:laughing:

1 Like

IMO, what is needed is a theory that connects the age of a text to corpus-wide (population) statistics like word proportions.

I would argue that it is not obvious what the difference in proportion between, e.g., DN and SN, means.

Or in other words, how large are the effect sizes here? The historical processes that lead to differences in unique words in different collections of EBTs: what differences should we expect them to produce? Are these differences large or small compared to other corpuses (corporae? corpi?)?

It’s actually a surprisingly difficult task to explain why some causal process should result in some difference in average, median, correlation, proportion etc. Even more difficult to say something about what magnitude we should expect the differences to be (e.g. to falsify our hypothesis).

Bhante @sujato, I know critical text scholars have formulated principles, like about how texts tend to be standardized over time etc., but have these ideas been translated into statistical models anywhere? (:nerd_face: )

Edit: Another approach could be to attempt to classify words or themes in the EBTs as earlier and later, and look for systematic differences between the baskets.

I think with the Nikayas it is fairly certain that there has been a lot of editorial copy-pasting of formulas from one place to another, for example the insertion of the aggregates into DN 14, but it is of course rarely so obvious as in that case, so that muddies the waters for a start. Then there is the fact that some suttas with very early or archaic features are also suttas that were “open” to additions for a long time. For example, the sekkha patipada that we see in the silavagga of DN seems “early”, but it is only the “nucleus” around which those suttas have added material, some of which might be quite late. Similarly with the Parayanavagga, for example: the core of the poem is obviously “early”, the framing introduction obviously late.

Another issue is that the prose is very different in its focus to the poetry, and it’s quite hard to compare apples with oranges, as it were.

Again narratives would tend to have more descriptive words than purely doctrinal exposition.

Finally, as you say, doctrinal standardisation would reduce the unique vocabulary rather than increase it; so while more archaic material might have been less wordy, so too would later standardised material be.

As I have said above, for me, numbers are fun! And I think having a more quantitative sense of things can be helpful to the intuition. For example, it may be more interesting to do a unique word count within Nikayas, from sutta to sutta; in DN this MAY turn out to be a good proxy for chronology (of course, it also may not).

All this is to say that, as far as I can tell, apart from a few tentative and ambiguous things involving aorists, there is currently no established and rigorous quantitative science of the Nikayas. So we are still in the pre-scientific “fun with numbers” stage of looking around and seeing what, if anything, jumps out.

1 Like

We have to throw around a bunch of useless questions before we stumble on to anything meaningful.

If you have the chance, a summary would be interesting.

You can usually email the author for these things. It’s a hassle not having institutional membership, but academic scholars are usually thrilled to find that someone—anyone!—is actually interested in their work.

But more than that, we need a robust means of excluding historical development. If our default hypothesis is that “difference implies evolution”, then we’ll just end up reading it into everything. There are plenty of differences that have nothing to do with time. The poetry of Vangisa, just to pick an example off the top of my head, is full of sophisticated poetic techniques that could be regarded as “late”. But we know he was a poet: it’s just an expression of who he was as a person. Otherwise it’s topic, or audience, or geography, or whatever. So we need to do more than just pick up patterns; we need to test them against falsifying hypotheses. This is, of course, why the method of using multiple, independent criteria is so important.

It could easily be explained as a different style for a different purpose for a different audience; namely, converting brahmins (who were used to sophisticated literary tracts).

Or else, yes, I have no idea how this works out statistically. I guess we’d have to frame a set of hypotheses and test them. How well, for example, do they match unrelated information, such as geography, etc.?

There’s a huge problem currently in climate science, where the latest models (CMIP6) are vastly more sophisticated than previous ones, yet they map less well onto historical data. No-one knows why. So sometimes it isn’t the case that more data and better modelling leads to better outcomes.

I’m not sure. But in the case of Buddhism, no. I think we’re still in the phase of poking the data and going, “ooh”.

This, and all the differences you point to, are relevant data points. What we should do, I believe, is enrich the text data with a detailed markup to identify these different styles. We already have this to some extent, for example, verses are marked. And in the Vinaya, there is quite extensive semantic markup.

Once this has been done, we could isolate, say:

  • narrative
  • doctrinal formulas
  • verse
  • analytical exposition (vyakarana)
  • conversation
  • other

The markup can’t be over-detailed, lest you lose any statistical significance.

Then all these kinds of texts can be run orthogonally. Compare DN with MN. Then compare narrative in DN and MN with doctrinal passages in DN and MN. That would allow a far greater degree of precision.

It’s easy to do this in SC with our bilara system; someone just needs to get their head down and do the work.
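To illustrate with a purely hypothetical format: if each segment in a tab-separated export carried columns for segment ID, type, and text, then the orthogonal comparisons reduce to one-line filters, e.g. counting words in the narrative segments only:

awk -F'\t' '$2 == "narrative" { print $3 }' dn-tagged.tsv | wc -w

where dn-tagged.tsv and its column layout are invented for the example.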

2 Likes