Statistical analysis of Early Buddhist Texts

Sorry everyone, and @Gillian, I was away. Thank you to @musiko for putting my post into context. Very interesting replies. Thank you all!

2 Likes

Thanks. I’ve used CST4 before. Very interesting tool.

1 Like

Well, for a start, Sublime is cross-platform, so I can actually use it at all, so there’s that!

It’s apples and oranges, but for what it’s worth, I can trivially search for a segment ID and get all the associated information.

Or search for Pali segments that contain the words “ca” and “nigaṇṭhe”, separated by any number of characters.

Or anything else that a regex can do.
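If you prefer the shell, a rough grep equivalent of that second search, assuming you are inside the Pali root folder of bilara-data:

# segments containing "ca" followed, at any distance, by "nigaṇṭhe"
grep -rE '\bca\b.*\bnigaṇṭhe\b' .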

But it has limitations. Searching for multiple words in different segments is not trivial.

But it’s not so much about the specific tools, it’s about the clarity of the underlying data. Our JSON files allow you to convert and process the content of the suttas easily and reliably in ways that are just not possible in an XML-based system like CSCD. Here, try it out. I just ran bilara i/o on DN, and generated a CSV file.

dn.zip (816.3 KB)

Unzip it and open it in a spreadsheet, or import it to a data table, or open it in a text editor, or run it through an NLP program, or play with it in Airtable, or whatever. The point is, the data is not bound to the application. I’m hoping that, over time, lots of people will take it and use it in fun ways!
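If you’d rather generate a CSV like this yourself, here’s a minimal sketch with jq, assuming a single root file such as dn1_root-pli-ms.json (each bilara file is a flat JSON object of segment-id/text pairs):

# turn "segment-id": "text" pairs into two-column CSV rows
jq -r 'to_entries[] | [.key, .value] | @csv' dn1_root-pli-ms.json > dn1.csv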

7 Likes

I think I get a vague idea of the freedom of this approach. Still, I think I speak for many people who are just mildly tech-savvy and can see that it would take time to get into such an application-free data-processing approach. Plus, I know from experience that even if I learned it once, I would probably forget it again if I didn’t use it for a while. I was pretty good with SPSS back in 1997, and in 2003 I had to learn it all over again. Thanks for the examples!

1 Like

Sure. But there are lots of programmers in the world, now they have a new toy to play with!

1 Like

I am quite interested in this topic, discussed a bit here and in a few scattered threads like

and

and recently there was

Inspired by this I asked:

and got no on-topic replies, so I thought I would bump this thread. I am going to try to use some of the info gleaned from @sujato’s posts in this thread, along with

to work out a couple of things that I would have thought would be of quite general interest, but about which I can’t find any definitive information, namely:

  1. How many letters long is each of the Nikayas? This would give some frame to questions of how often a given phrase appears in, say, MN vs AN. Is AN much longer than SN? If so, how much longer? I was surprised that this info is not more easily available.

  2. How many different words are in each of the Nikayas? I.e., what is the vocabulary of each Nikaya/Vagga/whatever? This is of similar interest, it seems to me, to @Vimala’s word-length research.

About 20-odd years ago I was a web developer, hacking together things in ASP and ColdFusion, working with SQL, etc., but that was a long time ago. More recently I got it in my head to try to host a local copy of SuttaCentral and see if I could implement the new search engine idea. Well, just getting Docker and the site to actually run as-is took so much out of me that I have avoided anything technical ever since. But I have licked my wounds enough now, and will try to get back to this thread with answers to my 2 questions and a description of how I did it.

Maybe if other people have pet projects, examples or other threads of interest they could post here and we could get some mutual support going?

Metta

OK, for question one I forked bilara-data and downloaded a zip, extracted the files, and navigated to root/pli/ms/sutta. Then I ran

cat * | wc -w

but SN and AN were in subfolders, so for those I first ran

find . -mindepth 2 -type f -print -exec mv {} . \;
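A single pipeline could have handled the subfolders without moving anything, something like:

# count words across every file, including files in subfolders
find . -type f -name '*.json' -exec cat {} + | wc -w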

My results were:

DN: 173,906
MN: 294,643
SN: 357,384
AN: 388,889

Hooray!

For question 2 I did the same thing, but with

cat * | sort | uniq -c | wc -l

and got

DN: 16403
MN: 27197
SN: 43464
AN: 41843

Now I have no idea if this is in any way accurate. For starters, sort and uniq here are comparing whole lines, not words, and I haven’t removed any of the JSON segment numbers, so pretty much every line will come up as unique, I guess…
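A word-level count needs the text split into one word per line, with the segment IDs dropped first. Here is a sketch, assuming jq is installed, that pulls out only the segment text (punctuation handling is rough, and inflected forms still count as separate words):

# extract only the segment text, one word per line, then count distinct words
jq -r '.[]' *.json | tr -s '[:space:]' '\n' | tr -d '[:punct:]' | sort -u | wc -l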

Keep in mind that the concept of letters and words is not easy to map between Pali and English. Perhaps syllables would be more accurate.

For example, “dh” is one letter. Really, “dha” is one letter. I’m not sure what you are trying to prove by counting letters, though.

BTW, you can edit your post to include more information rather than replying to yourself multiple times.

1 Like

Oh, sorry @Snowbird, I am counting words, not letters; that is, strings separated by spaces in the bilara JSONs.

So I concatenated all the files in DN, MN, SN, and AN respectively to get one big file each, and then removed the JSON segment numbers with

sed 's/[0-9]*//g' mn.json > mnscrubbed.json

Then I did fresh word counts and unique counts on each file, and got:

TOTAL WORDS:
DN: 173872
MN: 294491
SN: 357348
AN: 388781

UNIQUE WORDS:
DN: 24009
MN: 31455
SN: 34115
AN: 37176
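One caveat on the scrub: deleting only the digits leaves the rest of each segment key behind (things like "mn:..") and those residues get counted as words too. A stricter version, assuming every content line has the "key": "text" shape of the bilara files, would drop the whole key:

# strip the entire "segment-id": prefix instead of just the digits
sed 's/^[[:space:]]*"[^"]*"[[:space:]]*:[[:space:]]*//' mn.json > mnscrubbed.json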

Even counting words is a bit dodgy. With so many compound words and the synthetic nature of Pali, I wonder what useful information you hope to glean from word counts.

5 Likes

Yes, I believe this was all discussed in a previous thread.

2 Likes

Yes, @Snowbird and @stephen, people did ask the “what do you hope to gain” question in the other thread. I am not sure that really constitutes “discussion”.

I want to know how long the 4 Nikayas are relative to each other and how large the vocabulary is in each work.

I am hopeful that as I move forward with learning Linux text processing I can learn more numerical and statistical facts about the Nikayas.

I am not sure what you mean by “useful”. Useful for what? It’s of interest to me for its own sake, I like knowing numerical things, it helps to frame things, it also provides context to arguments about frequency because I will have denominators for my numerators.

How many words are there?
How many of them are different?
Do some occur more frequently in some places?
Do some specific parts of words differ in different places?
Etc
Etc

What’s wrong with asking these questions?

People seemed fine with @Vimala asking numerical questions; is it that monastics can like numbers but the “laity” are not permitted?

I know I am starting about as simple as you can get but you have to start somewhere and simple is where I like to start.

For one thing, when I write my “big book of British Buddhism” I will be able to say that “the 4 primary Nikayas have blah words and a vocabulary of blah”, instead of what I see in dozens of books and articles in the literature, which say things like “millions” and “vast” and “volumes” and “enough to fill several shelves”. All of these, when I have seen them (and I have seen them all), have irritated me and made me ask in my head: “Well, how many words exactly!?” And yet I have never seen this answered in print.

I get that there are complications like compound words and … pe… and all that, but why give up on even beginning to ask the questions just because you can envision problems?

I am having fun, that is “useful” to me.

I thought I’d try dividing unique words by total words to arrive at… what? The proportion of, I’m not sure, maybe “linguistic freshness”?

  • DN = .14
  • MN = .11
  • SN = .09
  • AN = .09
  • total = .104
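For the record, these are just unique words divided by total words from the counts above (what corpus linguistics calls the type-token ratio); to three decimals:

# type-token ratio (unique / total) per collection, from the earlier counts
awk 'BEGIN { printf "DN %.3f  MN %.3f  SN %.3f  AN %.3f\n", 24009/173872, 31455/294491, 34115/357348, 37176/388781 }'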

I don’t know if this is really measuring anything, but the results seem to match expectations. DN is the most linguistically innovative, with the greatest proportion of unique words. Then MN, then the two collections of shorter suttas. That seems about right, as the longer the suttas are, the more room there is for linguistic playfulness. The total is, as expected, an average.

Based on this I’d predict that the 6 verse collections in KN would show a higher degree of “freshness”, but that this will vary between collections. The Atthakavagga poems share a lot of vocabulary, so maybe that one would be lower?

Another metric to look at would be how much vocabulary is shared between collections.
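That one is easy to sketch with comm, assuming the per-collection vocabulary lists from earlier are saved as sorted files (the file names here are hypothetical):

# words present in both DN's and MN's vocabularies (inputs must be sorted)
comm -12 dn-words.txt mn-words.txt | wc -l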

2 Likes

Yes @sujato, this is the sort of thing I love: ratios! I am actually a little surprised DN beats out MN, which, especially towards the back, I always found a bit… ragged? But this is the sort of thing I am fascinated by. It provides a great counter to my sense that DN preserves older material compared to SN, which is exciting. I think now I will try a more “fine-grained” approach and do “freshness” ratios for each of the DN suttas, and a “freshness” ratio for SN split between the Sagathavagga and the prose part. I might think about a way to dice up MN too.

I know this in no way rises to the standard of “statistical linguistics” or anything, but there might be suggestive ratios to be found nonetheless.

I have recently forked over a ridiculous amount of money for Paul Kingsbury’s PhD thesis, “The Case of the Aorists: The Chronology of the Pali Canon”, in PDF format, and am reading it with interest. I have to say that my first impression is that the arguments put forward are not particularly strong.

The other paper I am hoping to get my hands on is Dan Zigmond’s Toward a Computational Analysis of the Pali Canon at JOCBS, but they haven’t made it accessible yet, and I am not sure I want to fork over another fistful of cash for membership.

Apart from that (and @Vimala 's work) I can’t really find much in terms of concrete, publicly available work in this area (statistical and numerical arguments about the Nikayas).

If anyone has any suggestions about academic literature or even enthusiastic amateurs like myself feel free to post here or pm me!

How does linguistic “freshness” prove that the text is earlier? Being more innovative could also mean it’s later, couldn’t it? :thinking:

Also, as Bhante @sujato already mentioned, it’s in the nature of longer texts to be more elaborate. The longer texts also have more narrative sections, for which I would expect a more diverse vocabulary than for doctrinal texts.

I’d guess if you focus your research on doctrinal passages, the differences across Nikayas would perhaps tend to level out, except for verse passages.

1 Like

That’s what I mean, @sabbamitta: “counter” as in counter-argument. :slight_smile: I am one of those weirdos who gets excited when there is an actual argument given, even against my own positions, because it stimulates my thinking!

Metta.

1 Like

Oh, I understood “counter” as from “counting”, “to count” … 1, 2, 3, 4, …

:laughing:

1 Like

IMO, what is needed is a theory that connects the age of a text to corpus-wide (population) statistics like word proportions.

I would argue that it is not obvious what the difference in proportion between, e.g., DN and SN, means.

Or in other words, how large are the effect sizes here? The historical processes that led to differences in unique words in different collections of EBTs – what differences should we expect them to produce? Are these differences large or small compared to other corpuses (corporae? corpi?)?

It’s actually a surprisingly difficult task to explain why some causal process should result in some difference in average, median, correlation, proportion, etc. It is even more difficult to say what magnitude we should expect the differences to have (e.g., in order to be able to falsify our hypothesis).
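For instance, one conventional effect size for a difference between two proportions is Cohen’s h = 2·arcsin√p1 - 2·arcsin√p2. Here is a sketch with invented occurrence counts for some word in DN vs SN (awk has no asin, so it is built from atan2):

# Cohen's h for a word occurring 150 times in DN (173,872 words)
# vs 180 times in SN (357,348 words); the occurrence counts are made up
awk 'BEGIN {
  p1 = 150/173872; p2 = 180/357348
  h = 2*atan2(sqrt(p1), sqrt(1-p1)) - 2*atan2(sqrt(p2), sqrt(1-p2))
  printf "h = %.4f\n", h
}'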

Bhante @sujato, I know critical text scholars have formulated principles, such as how texts tend to be standardized over time, but have these ideas been translated into statistical models anywhere? (:nerd_face: )

Edit: Another approach could be to attempt to classify words or themes in the EBTs as earlier or later, and look for systematic differences between the baskets.