Stratification of the Suttas

Does anyone know of a work that continues in the same direction as Govind Chandra Pande’s Studies in the Origins of Buddhism in terms of stratification of the suttas?
Thank you!


I am thinking of creating files from his classification of DN suttas (early/late) and then applying text mining methods (with gensim, for example) to see what comes up. But I am in a monastery with very limited access to the internet. Maybe one day.

It is not easy to master Latent Dirichlet Allocation and other such advanced statistical tools, so I have resorted to a very simple word count.

Here is what I have done:

  1. I created 3 files.
    a) One file contains suttas from DN and MN classified by Pande as early (DN 2, DN 13, MN 17, 24, 26, 29, 61, 63, 71, 108, 144). This file contains 267k characters.
    b) Another file contains suttas from DN classified by Pande as late (DN 17, 18, 22, 24-30, 32-34). This file contains 454k characters.
    c) A third file contains suttas from MN classified by Pande as late (MN 12, 28, 33, 35, 41, 43, 44, 50 etc.). This file contains 518k characters.
  2. I did a word count and then compared the frequency of words in the three files.
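
The two steps above can be sketched roughly as follows. This is only a minimal illustration, not the script actually used: the function names and the `min_count` threshold are assumptions, and the tiny sample strings stand in for the three real files. Because the files differ a lot in size, the sketch compares rates per 1,000 words rather than raw counts.

```python
import re
import unicodedata
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens; NFC keeps letters such as ṃ or ā as single characters."""
    text = unicodedata.normalize("NFC", text).lower()
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscore).
    return Counter(re.findall(r"[^\W\d_]+", text))


def per_thousand(counts):
    """Convert raw counts to occurrences per 1,000 words of the same file."""
    total = sum(counts.values())
    return {word: 1000 * n / total for word, n in counts.items()}


def compare(early_text, late_text, min_count=5):
    """Return (word, early rate, late rate) rows, most 'early-leaning' words first."""
    early, late = word_counts(early_text), word_counts(late_text)
    early_rate, late_rate = per_thousand(early), per_thousand(late)
    rows = [
        (word, early_rate.get(word, 0.0), late_rate.get(word, 0.0))
        for word in set(early) | set(late)
        if early[word] + late[word] >= min_count  # skip very rare words
    ]
    rows.sort(key=lambda row: row[1] - row[2], reverse=True)
    return rows


# Tiny made-up samples standing in for the real files (hypothetical data).
early_sample = "cittaṃ pajānāti cittaṃ viharati"
late_sample = "rājā ahosi rājā deva"
for word, early_r, late_r in compare(early_sample, late_sample, min_count=1):
    print(f"{word:12s} early={early_r:7.1f} late={late_r:7.1f}")
```

With real files, the sample strings would be replaced by the contents read from disk, and inflected forms (cittaṃ, cittassa and so on) would still be counted as separate words unless they are grouped by hand.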

I noticed that there are sometimes striking differences. Often a difference seen between the early DN-MN suttas and the late MN suttas is even larger between the early DN-MN suttas and the late DN suttas.

This is still a work in progress, but here are some early findings:

Late suttas use the words cittaṃ, pajānāti, viharati, natthi, kammaṃ much less. This suggests that, beyond purely stylistic differences, they talk less about the mind, about understanding things and about kamma.

On the other hand, late suttas use the words ahosi, rājā, deva, ānanda, ahesuṃ much more. Again, beyond stylistic differences, this suggests they talk a lot more about kings, devas and Ananda than early suttas do (not much of a surprise so far, but it confirms what would have been expected).

I will continue dissecting the results and will report later on any interesting findings I come across. In the meantime, I welcome comments and criticism of the methodology.


It seems the stratification of the texts is about the formation of strata in the EBTs. If so, it may be connected with the issue of angas (classifications) in EBTs. The following article by Choong Mun-keat may be useful:

“Ācāriya Buddhaghosa and Master Yinshun 印順 on the Three-aṅga Structure of Early Buddhist Texts”, Research on the Saṃyukta-āgama (Dharma Drum Institute of Liberal Arts, Research Series 8; edited by Dhammadinnā), Taiwan: Dharma Drum Corporation, August 2020, pp. 883-932.


Or this site:

Unfortunately, I don’t have easy access to this document. But while looking for it, I came across this one:


It seems Bhikkhu Analayo believes that the Pali and Chinese early Buddhist texts (such as the five Nikayas and four Agamas) originated and were finalised all at once at the first Saṅgha council, complete in form (structure) and content, although he does identify some late and divergent components of the texts.


Bhikkhu Analayo completely ignores the relevant findings of Ven. Yinshun on SA/SN (i.e. the synthesis of the three aṅgas ) and the Ceylonese/Burmese version’s reading in MN 122:
“na kho Ānanda arahati sāvako satthāraṃ anubandhituṃ yadidaṃ suttaṃ geyyaṃ veyyākaraṇaṃ tassa hetu” (“It is not right, Ānanda, that a disciple should seek the Teacher’s company for this reason, namely sutta, geyya, veyyākaraṇa.”).

This Pali version’s reading is clearly supported by the Chinese version in the Madhyama-āgama, MA 191 at T I 739c4–5:
“佛言。阿難。不其正經.歌詠.記說故。信弟子隨世尊行奉事至命盡也” (“The Buddha said: Ānanda, it is not for this reason, namely sūtra, geya, vyākaraṇa, that a disciple follows the World-Honoured One with respect until the end of life.”).

Only the first three aṅgas (sūtra, geya, vyākaraṇa) are mentioned in the Mahāsuññatā-sutta, MN 122 at MN III 115,17 and its Chinese counterpart, the Dakong jing 大空經, MA 191 at T I 739c4. This suggests the possibility that only the three aṅgas existed in the period of Early (or pre-sectarian) Buddhism.

Accordingly, Bhikkhu Analayo is apparently unable to present a clear and precise argument or analysis regarding why only the first three aṅgas are mentioned in MN 122 and its Chinese counterpart, MA 191.

Venerable Anālayo has a forthcoming article (completed in June and now in the hands of publishers) responding to Choong’s concerns that Western scholarship on early Buddhism has ignored Yinshun’s proposal that the three aṅgas served as an early ordering principle of the Buddhist scriptures. In the article, he examines the five premises that Yinshun’s hypothesis rests on.


Very good. Hopefully he does not just repeat/reprint the same five points shown in the following paper, pp. 983-997. If so, he will again completely ignore the relevant findings of Ven. Yinshun on SA/SN (i.e. the synthesis of the three aṅgas):

Travagnin, Stefania and Anālayo, Bhikkhu. 2020. “Assessing the Field of Āgama Studies in Twentieth-century China: With a Focus on Master Yinshun’s 印順 Three-aṅga Theory”. Research on the Saṃyukta-āgama (Dharma Drum Institute of Liberal Arts, Research Series 8), edited by Dhammadinnā, 933-1007. Taiwan: Dharma Drum Corporation.

This is really interesting as an initial trial. May I make a few points?

Generally speaking, Pande’s work is a mixed bag: he was a knowledgeable and careful scholar, but his work is limited in time and place, and in some cases makes theoretical assumptions that are, in my view, not well founded. Sorry, I can’t remember details, this is just my memory of reading him many years ago. Anyway, point being, his findings should not be accepted uncritically—but you knew that!

Methodologically, the crucial thing is that your analysis must be an independent test of Pande’s work, and must not explicitly or implicitly rely on shared assumptions. For example, if Pande has identified legendary narratives involving kings as late, and then in those texts we find an increased incidence of ahosi, rājā, deva, ānanda, ahesuṃ, this merely confirms that they are legendary narratives involving kings, and doesn’t tell us anything about the date.

Are you familiar with Ayya @vimala’s BuddhaNexus project? This does a lot of the analytical crunching, but it needs informed analysis to make sense of the data.


If I recall correctly, he seemed to think it was unlikely that the Buddha taught the 4 Noble Truths. But I appreciated that bias, because it means he wasn’t blinded by positive beliefs, which is sometimes a limitation for some scholars who are deeply invested in Buddhism and can have a hard time taking a step back to question their own beliefs.

Yes, that would be proper research :grinning: I am just playing around and trying to see if anything comes up. Ideally, I would like to come up with a list of words and expressions that indicate either earliness or lateness. Some are easy to figure out; for example, any passage mentioning the number 84 is very likely a late addition to the Canon. But this is certainly difficult ground to make progress on, given the sheer number of unknowns. Still, I think people need to be aware that what is written in some suttas is sometimes to be taken with a grain of salt, or even sometimes thrown in the trash (like when it says that Mount Sineru is a million kilometers high and the oceans are 5,000 km deep, or something along these lines).

I wasn’t. This seems quite interesting, thank you for the link Bhante. There’s still a lot to be achieved in comparative studies too, including to assess the earliness or lateness of a sutta or passage.


Venerable @Vimala also tried an analysis on the age of texts, via the length of words:

All these are just initial steps to check out the terrain, before definitive conclusions can be drawn. But we have to start somewhere!


It’s a good point. Sometimes you learn the most from the people you disagree with.

Also a valid activity!


This is funny, I tried that (I just divided the number of characters by the number of words in each file), but I couldn’t see a striking difference (the scope was much narrower than in Ven. Vimala’s study, though). Also, I found the average word length (AWL) to be around 10, but I didn’t try to remove all the headers, and there can be a lot of them in DN.

I will have a closer look at Ven. Vimala’s work.
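
For what it is worth, here is a minimal sketch of the two ways of getting an average word length. The function names are made up for illustration. One caveat it makes visible: if the character count includes spaces and punctuation, dividing characters by words adds at least one character per word to the figure, which may partly explain a high value.

```python
import re
import unicodedata


def average_word_length(text):
    """Mean number of letters per word, ignoring spaces, digits and punctuation."""
    text = unicodedata.normalize("NFC", text)  # keep ṃ, ā, ñ as single characters
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def chars_per_word(text):
    """The shortcut: all characters (spaces included) divided by the number of words."""
    words = text.split()
    return len(text) / len(words) if words else 0.0


sample = "evaṃ me sutaṃ"
print(average_word_length(sample), chars_per_word(sample))
```

The shortcut always gives the larger number of the two, so values from the two methods should not be compared with each other directly.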


What I found is that the AWL can be used as an indicator, but no more than an indicator, of the relative age of entire collections. It does not work well on individual suttas because other factors are also involved; it needs a considerably large sample of text to be more reliable.

In addition to the PDF and charts that Sabbamitta mentioned above, here is some of the raw output data if you want to have a look at it yourself:


Regarding DN/DA (長阿含), Ven. Yin Shun states it was developed and expanded from the Geya (祇夜) anga portion of SA/SN. The following is the quotation, in Chapter 10, Section 4, from the book The Formation of Early Buddhist Texts by Ven. Yin Shun:

Section 4: Concluding Remarks (第四節 結說)











In this statement, Ven. Yin Shun suggests:

Even if the original form of SA/SN is regarded as the earliest, that does not mean one understands that SA/SN is a synthesis of the three parts/aṅgas (sūtra, geya, vyākaraṇa). Without knowing the characteristics of the three aṅgas and their connection with the formation of the other three Agamas/Nikayas (MA/MN, DA/DN, EA/AN), it is impossible to understand how the four Agamas/Nikayas were formed on the basis of SA/SN.


Hi @silence ! you may be interested in some patterns I have come across in looking for ways to provoke @sujato in relation to his Gist theory :slight_smile:

the Nikayas are approximately

DN: 173,906
MN: 294,643
SN: 357,384
AN: 388,889

words long.

comparing the Nikayas for doctrinal terms in the Digital Pali Dictionary’s frequency tool gives:

arahant vs ariya:

arahant:
DN: 367
MN: 336
SN: 255
AN: 340

ariya:
DN: 109
MN: 282
SN: 589
AN: 265

So as a taster, DN mentions arahant more than 3 times as frequently as it mentions ariya, and mentions arahant more often than SN, which is about twice as long. SN, meanwhile, mentions ariya more than twice as often as it does arahant, and more than five times as often as DN, which is half as long.
In fact SN is the only Nikaya to mention arahants less than 300 times.
SN has over 300 more mentions of ariya than the similarly long AN.
SN is the only Nikaya to mention ariya more often than arahant.
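
Since the Nikayas differ so much in length, the raw counts can be turned into rates per 10,000 words. The sketch below uses only the figures quoted above; it assumes, as the surrounding text implies, that the first block of counts is arahant and the second is ariya. The names `WORDS`, `COUNTS` and `per_10k` are just for illustration.

```python
# Approximate Nikaya lengths in words, as quoted in this post.
WORDS = {"DN": 173_906, "MN": 294_643, "SN": 357_384, "AN": 388_889}

# Raw counts quoted in this post (assumed order: arahant first, then ariya).
COUNTS = {
    "arahant": {"DN": 367, "MN": 336, "SN": 255, "AN": 340},
    "ariya": {"DN": 109, "MN": 282, "SN": 589, "AN": 265},
}


def per_10k(term):
    """Occurrences of a term per 10,000 words in each Nikaya."""
    return {nikaya: 10_000 * n / WORDS[nikaya] for nikaya, n in COUNTS[term].items()}


for term in COUNTS:
    rates = per_10k(term)
    print(term, {nikaya: round(rate, 1) for nikaya, rate in rates.items()})
```

The same dictionary can be extended with the other terms below, which makes it easier to compare like with like than eyeballing raw totals.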

kamma/jhāna vs satipaṭṭhāna/upādānakkhandha/anatta:

kamma:
DN: 142
MN: 279
SN: 73
AN: 442

Here we see a stark absence of kamma from SN compared to any of the other Nikayas: it mentions kamma about half as often as DN, which is half as long, and about one sixth as often as AN.

jhāna:
DN: 119
MN: 235
SN: 180
AN: 307

Here we see SN mentioning jhāna with significantly less frequency than the other 3 Nikayas.

satipaṭṭhāna:
DN: 22
MN: 34
SN: 185
AN: 45

Here we see SN mentioning satipaṭṭhāna much more frequently than the other Nikayas, 4 times as much as the similarly long AN for example. If we control for the nearly identical foundations of mindfulness suttas in MN and DN, and for the list of the 37 aids, which merely enumerates topics, then the contrast is even greater.
SN is the only Nikaya that mentions satipaṭṭhāna more often than jhāna.

upādānakkhandha:
DN: 7
MN: 23
SN: 61
AN: 10

As Pande points out, the occurrences of upādānakkhandha in DN are all pretty palpably late, so the difference here is again even more stark than the raw numbers indicate.

anatta:
DN: 17
MN: 66
SN: 215
AN: 24

Here again, SN mentions anatta about nine times as often as the similarly long AN.

āsava vs avijjā:

āsava:
DN: 49
MN: 202
SN: 138
AN: 432

avijjā:
DN: 4
MN: 42
SN: 157
AN: 44

Here again we see a striking inversion: DN mentions āsava more than ten times as often as it mentions avijjā, and MN and AN also mention āsava far more often than avijjā, but SN does the reverse, mentioning avijjā more often than āsava.

I have given just a few doctrinal terms but I am sure more examples of these kinds of distribution could be given.

I also think that dividing MN into 2 sections, perhaps into a section of the first 30 and a section of the last 120 would make the numbers even more glaring.

My impression is that as DN/MN grew, new suttas were added to MN that began to degrade in quality, like the very messy MN 102 for example, and DN suttas kept getting re-written with more and more supernatural hyperbole, so a need was felt to “get back to the fundamentals” and thus SN was born, eventually providing the source for the later abhidhamma project.

Anyway, I am slowly working on an essay about it entitled “SN is different”, but thought I would share some of my bits and pieces with you as I am partly inspired by Pande and also aspire to do much more sophisticated statistical analysis of the texts one day when I learn how to use a computer properly :slight_smile:



Interesting! It would be great to follow through on some of these correlations, and assess them in context.

It’s probably possible to do something similar with the Agamas, as they tend to translate technical terms reasonably consistently.

Computers are dumb! Something like this should work:

  • MN: 43

Just on kamma: using the Digital Pali Reader (DPR) to separate out the Sagathavagga makes the anti-kamma leaning of the prose portion of the Samyutta even stronger. DPR has a different engine from the Digital Pali Dictionary (DPD), so the numbers look a bit different, but the pattern is the same.

Page ranges for the sections in Bodhi’s translation (first page - last page, then number of pages):
sagathavagga: 59 - 341 (282 pages)
nidanavagga: 505 - 725 (220 pages)
khandavagga: 827 - 1043 (216 pages)
salayatanavagga: 1109 - 1397 (288 pages)
mahavagga: 1461 - 1882 (421 pages)

So the Sagathavagga is 282 pages out of 1427.


47 of 185 occurrences of kamma related words in SN in the Digital Pali Reader occur in the Sagathavagga, so more than a quarter of the occurrences in less than a fifth of the book.
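
The arithmetic behind that comparison, as a small sketch using only the page counts and DPR hit counts quoted above (the variable names are made up for illustration):

```python
# Pages per section, as quoted from Bodhi's translation above.
PAGES = {
    "sagathavagga": 282,
    "nidanavagga": 220,
    "khandavagga": 216,
    "salayatanavagga": 288,
    "mahavagga": 421,
}

KAMMA_TOTAL = 185    # kamma-related words in all of SN (DPR count quoted above)
KAMMA_SAGATHA = 47   # of which in the Sagathavagga

page_share = PAGES["sagathavagga"] / sum(PAGES.values())
hit_share = KAMMA_SAGATHA / KAMMA_TOTAL

# A value above 1 means kamma is over-represented in the Sagathavagga
# relative to the share of the book it occupies.
over_representation = hit_share / page_share

print(f"pages: {page_share:.1%}  kamma hits: {hit_share:.1%}  ratio: {over_representation:.2f}")
```

Page counts are only a rough proxy for word counts, since verse and prose pages are laid out differently, so the ratio should be read as indicative rather than exact.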

So if we break up SN into SSN (for sagatha SN) and PSN (for prose SN) we get:

DN: 195
MN: 365
SSN: 47
PSN: 138
AN: 555

DPR includes all words with kamma in them, whereas DPD tries to give just kamma and its variations rather than any word containing the string, so the numbers are different. It is still a striking disparity, especially considering that it remains very noticeable even when the non-technical uses of kamma, which we might expect to be more or less evenly distributed, are left in.
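
To make the difference between the two kinds of count concrete, here is a rough sketch. It is not how DPR or DPD actually work internally; it only contrasts a plain substring search with a lookup against a short, hand-made and incomplete list of inflected forms (`KAMMA_FORMS` is an assumption for illustration).

```python
import re
import unicodedata

# Illustrative, incomplete list of inflected forms of kamma (an assumption).
KAMMA_FORMS = {
    unicodedata.normalize("NFC", form)
    for form in ("kamma", "kammaṃ", "kammassa", "kammena", "kammāni", "kamme")
}


def tokens(text):
    """Lowercase word tokens, with diacritics kept as single characters."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)


def substring_hits(text, stem="kamma"):
    """Count every word that contains the string anywhere, compounds included."""
    return sum(stem in token for token in tokens(text))


def form_hits(text, forms=KAMMA_FORMS):
    """Count only words that exactly match one of the listed forms."""
    return sum(token in forms for token in tokens(text))


sample = "kammaṃ karoti kammanto kammassa akammaniyaṃ"
print(substring_hits(sample), form_hits(sample))
```

On the sample, the substring search also picks up kammanto and akammaniyaṃ, which is the kind of extra hit that would push a substring-based total above a form-based one.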

Anyway, I hope to find some time to go a bit deeper in the next few months when I have some more substantial time off work.