Average Word Length as indicator of Pali text age

Vimala · March 1, 2020, 9:17am

The Pali language, like every other language, has evolved over time. One of the changes that the language has undergone through the centuries is that words have become increasingly complex and have merged together to form longer words. We would therefore expect that a statistical analysis of average word length could be an indication of the relative “age” of the texts in question.

In this research I analyse the average word length of each text and each book in the Pali canon to find trends in the canon and to test the above hypothesis: that average word length can be used as an indicator of the development of the texts over time. Average word length in itself remains only an indication of “lateness” of texts, among other indications like the existence and quality of parallels, but it can nevertheless prove a valuable resource in determining the development of the texts in the Pali canon and ultimately what the Buddha taught.

The attached PDF shows the entire draft study of 22 pages so here I will only give a short summary of it.

wordlength.zip (3.4 MB)

The first results of analysing the average word length are shown below. Although the chart shows a definite upward trend, there are a few unexpected anomalies in the data, especially within the Dhammapada results, that made me question the causes of these discrepancies.

So I ran the algorithm again on a dataset that had the headers removed. The chart below shows both the old set (in blue) and the new set with headers removed (in red). It clearly shows that the value for the Dhammapada has suddenly decreased a lot. Not only that, but removing the headers seems to have a greater impact on earlier texts while it has hardly any impact on the commentaries. Headers were only added to the texts at the time these were written down and therefore reflect the time and use of language of that time rather than the text they are referring to. In other words, headers are later inclusions and this shows in the data.

And this is the same chart with the 0-axis moved up so all collections that we currently know as “early” give a negative value.

The other thing I noticed is the relatively low values for verse as opposed to prose texts. I expand a little bit more on this in the study.

I also look in more detail at the Jataka collections. Although at first glance you can argue that these are pre-Buddhist texts and therefore even earlier, with a lower average word length, which might indeed be the case, the distribution of the average word length within the collection itself also shows an interesting trend.

The study also shows charts and sankey-graphs for some of the collections and compares them with known parallels data across languages for a better understanding.

In the study I briefly touch upon the ratio calculations to show the impact of individual files on the total average word length of the collections and the spread therein.

Conclusion

Analysing the Pali canon based on average word length (AWL), corrected for abbreviations and without headers, shows that the AWL is a reliable indicator for lateness of collections and the Pali canon shows a clear trend in use of language across time. The clear impact of headers on the values shows that these have been added at a later date. Verse collections as well as mātikā have a relatively lower AWL than prose of roughly the same age.

The following graph shows the Average Word Length sorted by value and color-coded to show that early suttas have an overall lower AWL while commentaries have an overall very high AWL. It also shows that as a general trend the Vinaya, Abhidhamma and later suttas have all developed around the same time.

However, we have to be careful when using the AWL for files within collections because these can show considerable variations based on several factors, especially if files are very short. The AWL is only an indicator, which can be used together with other indicators such as the number and quality of parallels, especially parallels with other canons from other schools.

Feedback

I would very much welcome your feedback on this study, especially on the anomalies that are observed. If needed, I can also try to make different charts or show different sankey-graphs or do different calculations on the data so if you have a good idea on this, please let me know.

Gabriel_L · March 1, 2020, 9:28am

This is amazing.
Well done ayya @Vimala! Sadhu!

Gillian · March 1, 2020, 10:07am

This is very interesting. Sadhu!

The Average Word Length (hereinafter called AWL) of a text is defined as the number of characters in a text divided by the number of words. (p7)

This is presumably Roman characters. It would be interesting to see if computing word length by the number of phonemes, or the number of morphemes per word would strengthen or weaken the results.
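A minimal Python sketch of the definition quoted above (characters divided by words). The tokenisation rule (a word is any run of letters) and the function names `words` and `awl` are assumptions made for illustration; the study's own script may treat punctuation, digits and abbreviations differently.

```python
import re
import unicodedata

def words(text):
    """Split text into words, taken here to be runs of letters.

    NFC normalisation makes letters such as ā or ṃ a single character,
    so a diacritic does not add to the word length.
    """
    text = unicodedata.normalize("NFC", text)
    return re.findall(r"[^\W\d_]+", text)

def awl(text):
    """Average word length: total characters in words / number of words."""
    ws = words(text)
    if not ws:
        return 0.0
    return sum(len(w) for w in ws) / len(ws)

print(awl("evaṃ me sutaṃ"))  # three words of 4, 2 and 5 characters
```

Counting phonemes or morphemes instead would only require swapping `len(w)` for a different per-word measure.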

Gabriel · March 1, 2020, 5:35pm

According to chart 2 the corrected jatakas have a negative chart-AWL. That means that words on average have fewer than seven characters, right?

It would be nice to see the full value chart, because now I think the presentation overvalues the differences. So for example among all suttas most values vary between 7.7 and 8. Do you plan to check if the differences are statistically valid?

I think it’s a very interesting approach, even though several variables might affect AWL. For example, if certain texts were addressing a lay population (in contrast to scholastic monastics), in spite of a later composition I can still imagine the AWL to be low, for ‘entertainment purposes’. Just as today the AWL of ‘The Sun’ is probably lower than that of ‘The Atlantic’.

Also, I’d be very interested in the distribution within the Nikayas. We have so many discussions of this or that sutta being early/late. Sometimes it’s quite a clear case for linguistic reasons. So if the AWL method could reproduce these results that would be a great indication.

For example within the Snp. So far, we are quite sure that Snp 4 and Snp 5 are old - in comparison to Snp 3 for example. Does it show in the AWL? Also we know relatively well that the introduction of Snp 5 was added long after the rest of Snp 5 - does that show in the AWL as well?

For sure fascinating approach

Njeul · March 1, 2020, 6:24pm

can you give a complete list of the abbreviated titles in a simple text, such as
w=
pv=
mnd=
and so on

Vimala · March 2, 2020, 11:15am

Very interesting suggestion. As the short comparison of the Dhammapada in the study shows, this also depends quite a lot on the Tipiṭaka that is used. I used the Mahāsaṅgīti Tipiṭaka Buddhavasse 2500 because that is not only the one used on SuttaCentral and VRI, but we have a complete digitized set of those. If you happen to know of another digitized Tipiṭaka somewhere, that would be great to compare.
But what I can try is to see how this compares to the same Tipitaka in another script. But it will take me a while to get 'round to doing that.

Very good point. I should add that to the study. I’m so used to using the SuttaCentral abbreviations that I don’t think about such things any more! Here it is:

Abbreviations

dn: Dīghanikāya
mn: Majjhimanikāya
sn: Saṃyuttanikāya
an: Aṅguttaranikāya
kp: Khuddakapāṭha
dhp: Dhammapada
ud: Udāna
iti: Itivuttaka
snp: Suttanipāta
vv: Vimānavatthu
pv: Petavatthu
thag: Theragāthā
thig: Therīgāthā
tha-ap: Therāpadāna
thi-ap: Therīapadāna
bv: Buddhavaṃsa
cp: Cariyāpiṭaka
ja: Jātaka
mnd: Mahāniddesa
cnd: Cūḷaniddesa
ps: Paṭisambhidāmagga
ne: Netti
pe: Peṭakopadesa
mil: Milindapañha
pli-tv-bu-pm: Bhikkhu Pātimokkha
pli-tv-bi-pm: Bhikkhunī Pātimokkha
pli-tv-bu-vb: Bhikkhu Vibhaṅga
pli-tv-bi-vb: Bhikkhunī Vibhaṅga
pli-tv-kd: Khandhaka
pli-tv-pvr: Parivāra
ds: Dhammasaṅgaṇī
vb: Vibhaṅga
dt: Dhātukathā
pp: Puggalapaññatti
kv: Kathāvatthu
ya: Yamaka
patthana: Patthana
atk-s: Aṭṭhakathā Suttas
atk-vin: Aṭṭhakathā Vinaya
atk-abh: Aṭṭhakathā Abhidhamma
tika-s: Tīkā Suttas
tika-vin: Tīkā Vinaya
tika-abh: Tīkā Abhidhamma
anya-e: Anya

Yes it does but I will come back to your point in a bit more detail when I have a bit more time. Thank you for your question. You make some good points.

Ravi · March 2, 2020, 5:28pm

Thanks for this great analysis!
I think this reflects the fact that early texts are much closer to spoken language/oral culture, with shorter words and avoiding long compounds. The use of longer words and long compounds must have started to increase with the increase in written works…

Nessie · March 2, 2020, 8:49pm

Very nice idea! It is amazing to see something like this applied to the texts of our deep concern.

I have one methodological question/doubt: for computing a trend, and especially for applying linear regression, it is required that the x-axis represents a numerical entity (like natural or real numbers, not simply unordered codes like telephone or zip codes, or a simple numbering of qualitative properties).
Now in which way did you translate the name of the textgroup into a numerical value? One idea is of course the numerical age, but this would need a priori knowledge of exactly what we want to extract by the analysis… If there is no numerical value assigned and the order from left to right is somehow arbitrary (even if only in the subcategories), the resulting trend line is arbitrary in the same way (and, for instance, different if you re-sort the subentries in the textgroups in alphabetically ascending or descending order).
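The dependence of a trend line on the left-to-right order can be made concrete with a small Python sketch. The five AWL values and the labels A to E are hypothetical and not taken from the study; `ols_slope` is an illustrative helper that fits a least-squares line against the positions 0, 1, 2, ...

```python
def ols_slope(ys):
    """Least-squares slope of ys against their positions 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical AWL values for five text groups (NOT data from the study).
values = {"A": 7.6, "B": 7.9, "C": 8.4, "D": 8.1, "E": 9.0}

as_presented = [values[k] for k in sorted(values)]
reversed_order = [values[k] for k in sorted(values, reverse=True)]
sorted_by_awl = sorted(values.values())

for name, ys in [("as presented", as_presented),
                 ("reversed", reversed_order),
                 ("sorted by AWL", sorted_by_awl)]:
    print(name, round(ols_slope(ys), 3))
```

The same five numbers give a positive, a negative, or a steeper positive slope depending only on how the groups are arranged, which is the arbitrariness described above.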

Your idea of sorting the textgroups by AWL index, as shown in the last picture, is of course one such way of applying meaningful numerical values to each text/textgroup, giving them a linear ordering. The analysis should then consider the mixture of single texts that occurs within the textgroups and whether (or how) this observable mixture can be explained, or whether meaningful/interesting information can be extracted from it. So the last picture somehow becomes the initial state of all further analysis.

Keywords for the problem of the non-numerical nature of the x-axis entries are “nominal scale” and “ordinal scale”, each of which has its own set of procedures. Unfortunately, at the moment I don’t have much of this at the top of my head and cannot go into more detail (since retiring two years ago I have not been involved in statistics at all and have focused my mind in another direction).

Gabriel · March 3, 2020, 6:02am

That’s an important point. I would just add that for a first exploratory phase it’s legitimate to play around with numbers and see in which directions it goes. In a final presentation, a journal paper etc., a trendline or linear regression on an ordinal x-axis is probably too misleading - because it suggests equidistant items of an interval scale.

I would add another exploratory suggestion, viz. to do a cluster analysis on the suttas with the corrected numbers. Depending on how distinct an ‘AWL stamp’ the different time periods left, a cluster analysis could reveal which suttas come from a similar time period. I think it’s reasonable to treat prose and verse sections of the sutta separately for this purpose.

karl_lew · March 3, 2020, 1:42pm

Wow! These charts really brought home to me the value of thoroughly studying the suttas before the vinaya, abhidhamma and commentaries. The best teachers I’ve had always used shorter words. Thank you for this work!

Vimala · March 4, 2020, 10:58am

Thank you so much for your feedback and yes, you are absolutely right. It is indeed a rough, a-priori understanding of age that is at the basis of the order in which the collections are presented. And although I also agree with @Gabriel that it is indeed legitimate to play around with numbers in the first exploration, I feel that the trendline in these first graphs is indeed misleading and probably unnecessary: it does not actually give us any useful additional information. I will take those out.

Further on in the analysis I use trendlines for the Jataka collection, where the numerical x-axis consists of the sutta numbers. I feel it is justified there as it shows a very clear trend throughout the collection.

Yes, it’s been a while for me too and I’ve been racking my brain to try and remember my university statistics. So if you have any tips, I’m very happy to hear them!

I’d be very careful with this. For the analysis of AWL a single sutta is often too small a unit to work with. For instance, a sutta with just one paragraph can have a very high AWL, while a longer one in the same collection has a much lower one. The longer one will have a much higher impact on the total AWL of the entire collection and it does also not mean that the shorter one is actually from a later date; the dataset is just too small to be statistically significant. And there are other reasons why a high or low AWL can be reached for a single sutta.
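A small Python sketch of why a very short sutta barely moves the collection total even when its own AWL is extreme. It assumes the collection AWL is pooled (all characters divided by all words), which matches the description above but is not confirmed as the study's exact procedure; the two (characters, words) pairs are hypothetical.

```python
def pooled_awl(files):
    """Collection AWL: all characters divided by all words.

    files is a list of (total_characters, word_count) pairs.
    """
    total_chars = sum(chars for chars, _ in files)
    total_words = sum(count for _, count in files)
    return total_chars / total_words

def mean_of_file_awls(files):
    """Unweighted mean of the per-file AWLs, for comparison."""
    return sum(chars / count for chars, count in files) / len(files)

# Hypothetical collection: a one-paragraph sutta with long words
# (10 words, AWL 13.0) and a long sutta (1000 words, AWL 8.0).
files = [(130, 10), (8000, 1000)]

print(round(pooled_awl(files), 3))         # dominated by the long sutta
print(round(mean_of_file_awls(files), 3))  # pulled up by the short one
```

The pooled figure stays close to the long sutta's value, while the unweighted mean gives the ten-word sutta the same say as the thousand-word one.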

I analysed for instance MN118, which had a relatively high AWL within the Majjhima Nikaya and I was wondering why.
What I’ve been using as a dataset is the Mahāsaṅgīti Tipiṭaka. There are many abbreviations and one of them in the MN118 is for instance:

Santi, bhikkhave, bhikkhū imasmiṃ bhikkhusaṅghe mettābhāvanānuyogamanuyuttā viharanti … karuṇābhāvanānuyogamanuyuttā viharanti … muditābhāvanānuyogamanuyuttā viharanti … upekkhābhāvanānuyogamanuyuttā viharanti … asubhabhāvanānuyogamanuyuttā viharanti … aniccasaññābhāvanānuyogamanuyuttā viharanti—evarūpāpi, bhikkhave, santi bhikkhū imasmiṃ bhikkhusaṅghe.

The same text in the PTS version reads:

Santi bhikkhave, bhikkhū imasmiṃ bhikkhusaṅghe mettābhāvanānuyogamanuyuttā viharanti. Evarūpāpi bhikkhave, santi bhikkhū imasmiṃ bhikkhusaṅghe. Santi bhikkhave, bhikkhū imasmiṃ bhikkhusaṅghe karuṇābhāvanānuyogamanuyuttā viharanti. Evarūpāpi bhikkhave, santi bhikkhū imasmiṃ bhikkhusaṅghe. Santi bhikkhave, bhikkhū imasmiṃ bhikkhusaṅghe muditābhāvanānuyogamanuyuttā viharanti. Evarūpāpi bhikkhave, santi bhikkhū imasmiṃ bhikkhusaṅghe. Santi bhikkhave, bhikkhū imasmiṃ bhikkhusaṅghe upekkhābhāvanānuyogamanuyuttā viharanti. Evarūpāpi bhikkhave, santi bhikkhū imasmiṃ bhikkhusaṅghe. Santi bhikkhave, bhikkhū imasmiṃ bhikkhusaṅghe asubhabhāvanānuyogamanuyuttā viharanti. Evarūpāpi bhikkhave, santi bhikkhū imasmiṃ bhikkhusaṅghe. Santi bhikkhave, bhikkhū imasmiṃ bhikkhusaṅghe aniccasaññābhāvanānuyogamanuyuttā viharanti. Evarūpāpi bhikkhave, santi bhikkhū imasmiṃ bhikkhusaṅghe.

The AWLs of these two bits of text are 13.8 and 9.9 respectively. Over a whole collection this will balance out more, but it can give a bit of a warped value for just one sutta. Moreover, there are suttas that have parts in them that are from a later date.

One good example is snp5.19 (SuttaCentral). The first part has an AWL of 7.7, the second part (from Suttuddānaṃ onwards, and which I feel is a later part) of 11.2. Over the whole collection this does not have that much of an impact but in analyzing the SNP on a vagga level (see below), I noticed that the impact was considerable.

So coming back to your original post:

indeed.

You can see a chart in the study itself that lists also the ranges within the collections (i.e. maximum and minimum values). Within that you can see that for the Suttas, with the exception of the Khuddakapāṭha, the ranges are fairly consistent. The Khuddakapāṭha I have taken out of later calculations because it only consists of 8 fairly short suttas and is therefore statistically less relevant and also illustrates the point I was trying to raise above.

Indeed and therefore we have to also be careful to evaluate individual suttas. Your point however might be a possible explanation why the Vimānavatthu and Petavatthu appear among the lower range of AWL in text while you would expect them to have a higher AWL based on what we know about them as later texts.

As I stated above, on an individual sutta level this would be difficult to determine but on a collection or even vagga level this might be possible.

I took your example of the SNP and calculated the vagga values. This is the result (but I corrected 5.19 for the above second half). It would appear based on this method that snp4 is the earliest part together with snp1 while snp3 is the latest.

snp_awl

Looking at the range within the vaggas we see more variation, especially in snp5. snp4 is fairly consistent in its low values.

I wonder if snp5 can be a combination of old and new materials? When I look for instance at the parallels of snp5, I’m not entirely convinced that these texts are so early. Most of them only have a link with the Cūḷaniddesa and not with early texts. As I said: the AWL can only be an indication, that can be used together with other indications like parallels.

Dhammanando · March 4, 2020, 7:02pm

In connection with this, may I ask @Vimala did you first reduce the aspirated consonants —kh, gh, ch, jh, etc.— to a single character (as they are in all Asian scripts) before you began counting? It occurs to me that neglect of this might give rise to misleading results.

For example, the syllables khyā and yā are of identical length, each being a bimoraic monosyllable, yet the mere counting of characters would give the unwarranted impression that one was twice the length of the other.

There’s a further problem here in that even when khyā has been reduced to kyā it still remains one character longer than yā. With this in mind, wouldn’t it be better to do the counting either by syllables or (even better) by morae? Syllable-counting would be very easy — since each Pali syllable contains just one vowel one would merely need to do a search-and-replace to get rid of all the consonants. The length of each word would then be equal to the number of vowels in it.
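The two corrections described so far (treating each aspirate as one character, and counting syllables by counting vowels) can be sketched in Python as follows. The list of aspirates and the assumption that an h directly after a stop always marks aspiration in romanized Pali are simplifications made for this illustration; the function names are hypothetical.

```python
import re
import unicodedata

def _nfc(s):
    return unicodedata.normalize("NFC", s)

# kh, gh, ch, jh, ṭh, ḍh, th, dh, ph, bh: a stop followed by h.
ASPIRATE = re.compile(_nfc(r"([kgcjṭḍtdpb])h"), re.IGNORECASE)
VOWELS = set(_nfc("aāiīuūeo"))

def collapse_aspirates(word):
    """Drop the h of each aspirate so the digraph counts as one character."""
    return ASPIRATE.sub(r"\1", _nfc(word))

def syllable_count(word):
    """Number of syllables, taken as the number of vowel letters."""
    return sum(1 for ch in _nfc(word).lower() if ch in VOWELS)

for w in ["khyā", "yā", "bhikkhave"]:
    print(w, len(_nfc(w)), len(collapse_aspirates(w)), syllable_count(w))
```

With vowel counting, khyā and yā both come out at length 1, whereas character counting still separates them even after the aspirate is collapsed.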

Morae-counting would be a little more complicated. One would need to devise a search-and-replace routine that would replace every monomoraic syllable with, say, an asterisk, and every bimoraic one with two asterisks. The trick would be to do things in the right order as it would be quite easy to get it all wrong (e.g., by replacing a, i and u with one asterisk before you’ve replaced aṃ, iṃ and uṃ with two).
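As an alternative to an order-sensitive search-and-replace, the mora count can be sketched in Python by walking through each word once. The weighting rules used here are simplifying assumptions for illustration: long vowels (ā, ī, ū, and also e and o) count as two morae; a short vowel counts as two when followed by a niggahīta or by two consonants (after collapsing aspirates), otherwise one. Words are treated in isolation, and the function name `morae` is hypothetical.

```python
import re
import unicodedata

def _nfc(s):
    return unicodedata.normalize("NFC", s)

ASPIRATE = re.compile(_nfc(r"([kgcjṭḍtdpb])h"))
SHORT = set("aiu")
LONG = set(_nfc("āīūeo"))      # assumption: e and o always treated as long
NIGGAHITA = set(_nfc("ṃṁ"))

def morae(word):
    """Approximate mora count of a single romanized Pali word."""
    w = ASPIRATE.sub(r"\1", _nfc(word).lower())
    total = 0
    for i, ch in enumerate(w):
        if ch in LONG:
            total += 2
        elif ch in SHORT:
            rest = w[i + 1:i + 3]  # up to two characters after the vowel
            closed_by_niggahita = rest[:1] in NIGGAHITA
            closed_by_cluster = len(rest) == 2 and all(
                c not in SHORT and c not in LONG for c in rest)
            total += 2 if (closed_by_niggahita or closed_by_cluster) else 1
    return total

for w in ["yā", "khyā", "na", "sutaṃ", "bhikkhave"]:
    print(w, morae(w))
```

Because each vowel is classified together with what follows it, the aṃ/iṃ/uṃ ordering pitfall does not arise.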

One last thought on methodology…

I believe what’s more likely than anything else to skew the results are the monosyllabic words. Every such word will contribute to a lower AWL, yet the greater or lesser frequency of certain of them has largely to do with the genre of text and isn’t any reflection of earliness or lateness.

Take, for example, the particle na (‘not’). In the Abhidhamma Piṭaka there are 12,526 occurrences of this in the Yamaka alone, which is only about 500 short of the count for the whole of the first four Nikāyas of the Suttanta Piṭaka. Moreover, na makes up 6.6% of the words in the Yamaka, as contrasted with the Saṃyutta Nikāya where it makes up only 1.2% of the words.

Needless to say, the higher frequency of na in the Yamaka is not indicative of its predating the SN. Rather, it reflects the fact that the text consists almost entirely of short sentences affirming or denying various propositions about dhammas. It goes without saying that the use of ‘not’ is going to be uncommonly common in a text of this sort, much as it would be in a treatise on formal logic. For this reason I think there might be a good case for getting rid of all the monosyllabic words and even quite a few of the bisyllabic ones (e.g., eva) before doing the counting.
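A rough Python sketch of the two measurements discussed here: the share of a particle such as na among all words, and the AWL after dropping words below a syllable threshold. Tokenisation (runs of letters) and syllable counting (number of vowel letters) are assumptions for illustration, as are the function names; the sample is the opening of Dhp 5.

```python
import re
import unicodedata

VOWELS = set(unicodedata.normalize("NFC", "aāiīuūeo"))

def words(text):
    """Lower-cased runs of letters."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)

def syllables(word):
    return sum(1 for ch in word if ch in VOWELS)

def awl(word_list):
    if not word_list:
        return 0.0
    return sum(len(w) for w in word_list) / len(word_list)

def awl_min_syllables(text, min_syllables=2):
    """AWL over words with at least min_syllables syllables."""
    kept = [w for w in words(text) if syllables(w) >= min_syllables]
    return awl(kept)

def word_share(text, target):
    """Fraction of all words that are exactly `target`."""
    ws = words(text)
    return ws.count(target) / len(ws) if ws else 0.0

sample = "na hi verena verāni"
print(awl(words(sample)), awl_min_syllables(sample), word_share(sample, "na"))
```

Raising `min_syllables` to 3, or passing an explicit stop list, would cover the suggestion of also removing some bisyllabic particles such as eva.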

Vimala · March 5, 2020, 11:01am

It will only give misleading results if these consonants are not used the same way throughout the texts. I am not looking for an absolute value of the AWL (something like “early texts are texts with an AWL between the values of X and Y”) but only the relative values (i.e. is collection X earlier or later than collection Y).
Absolute values also differ between Tipiṭakas, as I’ve briefly touched upon in my discussion of the Dhammapada.

However, it will be interesting to see if changing to an Asian script will change the relative values because if it does, we have a second indicator of “lateness” namely that these consonants were used more (or less) in the early scriptures. Then we can count the number of these consonants divided by the total number of characters in a text to get another indicator.

As I already said above, I will do this at some point before doing log regression tests, as was suggested by somebody who emailed me. I think these log regression tests will indeed be very valuable.

I will come back to the rest of your post when I have some more time.

Vimala · March 5, 2020, 2:25pm

I did a quick test and this is the result:

With correction of aspirated consonants:

And the old one without the correction:

As you see, the relative values of the AWL are not affected. Indeed, as was expected, the overall value of the AWL is lower this way but the pattern has remained very similar.

This was just a quick test and by no means a full change to an Asian script. I intend to convert the whole thing to Devanagari using SuttaCentral’s script changer and then run it again at some point.

Dhammanando · March 5, 2020, 2:31pm

To count a single consonant as two simply because it happens to be represented by two characters in romanized Pali is almost certain to give misleading results.

Here’s one way it will happen:

The later the Pali, the greater the number of sanskritisms. One consequence of the growing sanskritization of Pali diction was a reduction in the proportion of aspirated consonants.

In the Saṃyutta Nikāya, for example, the aspirates make up 21.2% of the stops, while in the Saṃyutta Atthakathā they make up only 16.7%. That being so, to count each aspirate as two rather than one is sure to give a skewed impression of the AWL of the SN relative to its commentary, by making the former appear longer than it ought to be.
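The proportion of aspirates among the stops can be measured with a short Python sketch. It assumes that the stops are the twenty consonants of the five vaggas (k, g, c, j, ṭ, ḍ, t, d, p, b and their aspirated counterparts) and that an h directly after a stop always marks aspiration; whether this matches the counting behind the percentages quoted above is not known.

```python
import re
import unicodedata

def _nfc(s):
    return unicodedata.normalize("NFC", s)

# A stop, optionally followed by h (the aspirated form).
STOP = re.compile(_nfc(r"[kgcjṭḍtdpb]h?"))

def aspirate_share(text):
    """Fraction of stop consonants in text that are aspirated."""
    stops = STOP.findall(_nfc(text).lower())
    if not stops:
        return 0.0
    aspirated = sum(1 for s in stops if s.endswith("h"))
    return aspirated / len(stops)

print(aspirate_share("bhikkhave"))  # stops found: bh, k, kh
```

Running this per collection would give the second indicator directly, without having to infer it from how much the AWL shifts under the aspirate correction.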

Vimala · March 5, 2020, 2:40pm

So basically what you are saying here is that the use of consonants is indeed a second indicator of “lateness”. That would indeed be interesting but it would also mean that the AWL value of the earlier texts will reduce even more in relation to those of the later texts, therewith strengthening my point that the AWL can be used as an indicator of lateness.
In fact, in the above charts you can vaguely see that the commentaries are less affected by the correction for aspirated consonants.

This would indeed be interesting to try. Thank you for the suggestion. It might have to wait a while as I will be on retreat for a few months.

I touched upon this in the essay where I compared verse, prose and matika. But you are right that this point could be further expanded in the essay. Your suggestion of doing a test without monosyllabic words is also interesting.

I hear that there are many more tests that can be done just to see what the effect is. Thank you so much for your ideas!

Vimala · March 7, 2020, 12:35pm

@Dhammanando: I’ve run a few tests with your suggestions.
I tried converting the text into Devanagari, Sinhala and Thai but ran into a problem: the computer cannot accurately calculate the length of these words, often counting characters as 2 rather than 1 because they are made up of two parts. Of course I could go over all the possible combinations of characters and eliminate the occurrence of this, but I feel that this would be a lot of work for little extra benefit. So I made a full listing of possible double-counted consonants and counted those as 1. The result is this:

Comparing this with the original chart above:

As you see the totals are a lot lower but the pattern remains the same. The correction seems to have more of an effect on the earlier suttas than on the later commentaries.
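The counting problem with Indic scripts mentioned above can be illustrated in Python, where `len()` counts Unicode code points rather than written units. The sketch below uses the standard `unicodedata` module to separate base letters from combining marks; it only demonstrates the discrepancy and is not a proposal for the right unit to count. The helper name `base_letters` is hypothetical.

```python
import unicodedata

def base_letters(word):
    """Count code points that are not combining marks (categories Mn, Mc, Me)."""
    return sum(1 for ch in word
               if not unicodedata.category(ch).startswith("M"))

# khyā in Devanagari: KHA, VIRAMA, YA, VOWEL SIGN AA
khya = "\u0916\u094d\u092f\u093e"

print(len(khya))           # code points, which is what a naive count sees
print(base_letters(khya))  # consonant letters only; virama and vowel sign dropped
for ch in khya:
    print(unicodedata.name(ch), unicodedata.category(ch))
```

Neither number is a phoneme count: the first includes the virama, and the second loses the vowel, which is why a script conversion alone does not settle how word length should be measured.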

Sorting these as I had done above we get:

So the pattern here is still very much the same as we had before. A few small changes where collections switched places but not overall a great difference.

I then did your second suggestion: to count just the vowels, and this is the result. I have moved the axis to the position of 3.22 in order to be able to compare it better with the above values for the Average Word Length.

Comparing these two together you get this:

So it is clear that the vowel count gives some different results. The overall pattern is still the same but there are a few interesting changes, especially in the Abhidhamma and Commentaries.

To illustrate this a bit better I have also sorted these:

The general order is still: verses, early suttas, late suttas/vinaya/abhidhamma moving into commentaries but some Abhidhamma collections now have a much lower value, even lower than the first 4 Nikayas. This is very interesting. Ven. @Dhammanando, I would be interested in hearing your opinion on this.

Dhammanando · March 9, 2020, 5:08pm

I don’t think it’s very surprising that the Abhidhamma Piṭaka would have a lowish AWL. There are only a small number of words in it that are not found in the suttas and most of these coinages have only 2-4 syllables. There’s nothing comparable in length to commentarial behemoths like: pavarasurāsuragaruḍamanujabhujagagandhabbamakuṭakūṭacumbitaselasaṅghaṭṭitacaraṇo

Vimala · March 10, 2020, 12:03pm

It is not the AWL that is low, but the average number of vowels per word that has a low value for the Kathāvatthu only.

Basically what you are saying here is that if you take small words out, the Yamaka will have a much higher AWL than it currently has, while the Suttas will be less affected by the change (and thus relatively much lower, confirming the trend as observed). What I understand your point to be is that there are differences due to the various types of texts and how these are used. I never denied this and briefly touched upon this in the pdf essay, but you are right that such points need to be clarified a bit more. The fact that different genres of text were written at different times is not so surprising, as we know that this has happened. But the AWL calculations pick that up. This happens more noticeably with verses than with the Abhidhamma: the genre automatically acquires a lower AWL because of its specific use of language. However, with the Abhidhamma it seems that every extra correction just creates a greater gap between the AWL of the Abhidhamma texts and the Nikayas.

Like I said, the AWL is only an indication, not an absolute. It has to be used together with other indications like the existence of parallels, etc.

BTW: love your cat

Ceisiwr · July 15, 2020, 12:13am

I might be being dense here but is this much different from Johannes Goropius?

Goropius theorized that Antwerpian Brabantic, spoken in the region between the Scheldt and Meuse Rivers, was the original language spoken in Paradise. Goropius believed that the most ancient language on Earth would be the simplest language, and that the simplest language would contain mostly short words. Since Brabantic has a higher number of short words than do Latin, Greek, and Hebrew, Goropius reasoned that it was the older language.

A corollary of this theory was that all languages derived ultimately from Brabantic. The Latin word for “oak,” quercus, Goropius derived from werd-cou (“keeps out cold”); the Hebrew name “Noah” he derived from nood (“need”). Goropius also believed that Adam and Eve were Brabantic names (from Hath-Dam, or “dam against hate”; and Eu-Vat, “barrel from which people originated,” or from Eet-Vat, “oath-barrel,” respectively). Another corollary involved locating the Garden of Eden itself in the Brabant region. In the book known as Hieroglyphica, Goropius also allegedly proved to his own satisfaction that Egyptian hieroglyphics represented Brabantic.

[…]

Goropius is considered to have given Dutch linguistics, and Gothic philology in general, a bad name. Though Goropius had admirers (among them Abraham Ortelius and Richard Hakluyt), his etymologies have been considered “linguistic chauvinism,” and Leibniz coined the term goropism, meaning absurd etymological theories. Justus Lipsius and Hugo Grotius discounted Goropius’s linguistic theories. “Never have I read greater nonsense,” the scholar Joseph Scaliger wrote of Goropius’s etymologies.