Average Word Length as indicator of Pali text age

He compared different languages, not the changes within one language over time. This is very different because it is a known fact that Pali has changed over time. One of the changes is that words have sort of merged together to form longer words.

This research basically only confirms a lot of what we already know, also from other sources. Of course there are other factors then time that influence the average wordlength like for instance the genre and the above research also shows that for instance verses use relatively much shorter words.

If you want to know more about the various indicators as to which texts are considered early and which are considered later, I suggest you read Bhante Sujato’s and Ajahn Brahmali’s excellent work on this. You can find it here together with all the YouTube videos with the classes they gave in 2013.



Is there a plan to add such information in SuttaCentral?

It would be interested to have some sort of indication about the earlyness/lateness of a sutta (or collection of sutta) one is reading… like a symbol or little text or coloured-text; based on objective criteria (or as objective as possible let’s say) such as the ones mentioned in B. Sujato and Brahmali’s book. Filtering searches or even the website using such criterion could also be very interesting.

Of course, such tool could easily be misinterpreted… so it would have to come with clear warnings about the limitations of such classifications.

1 Like

Thank you for clarifying Ayya.

Although a bit old Pande wrote about which suttas he considered to be old and late, if you are interested: http://www.ahandfulofleaves.org/documents/Studies%20in%20the%20Origins%20of%20Buddhism_Pande.pdf

I’ve attached an example of his findings regarding the MN.

1 Like

No, it is too uncertain. Most of the interesting results here apply to a longer span than just the suttas.

Pande’s findings are really unreliable. Kudos for him for trying, but there is a reason why subsequent scholars have not picked up this task. Maybe at some point it could be done.


Interesting. I suspected his ideas might be out of date by now.

:grin: (a recent discovery that we had to hyphenate for sanity. Smartphone palm leaves are a bit narrow)

MN142:4.3: abhivādanapaccuṭṭhānaañjalikammasāmīcikammacīvarapiṇḍapātasenāsanagilānappaccayabhesajjaparikkhārānuppadānena.


Could this method be applied to suttas where portions appear to be later than others? For example the Pali version of SN 56.11 has the gods exclaiming the triumph of the Buddha yet the Chinese versions don’t mention that at all. If applied to a single sutta, could it show sections with distinct anomalies, thus showing the possibility of a later addition?

1 Like

Anyone wanting to help us out with this, upvote the proposal for hyphenation aliasses in CSS!

Here on D&D, this word hyphenates because the browser has been told that it is English, and it applies English hyphenation rules. Properly, however, the browser should know that it is Pali. But there are no hyphenation rules for Pali built into browsers. Anyway, the above proposal addresses this problem.


You can always try to calculate the average word length. But it is only a very rough indication and it does not work very well with smaller portions of text; it works best to see general trends in collections. There are other indicators that work better for smaller portions of text like the parallels in other schools; so look a the parallel in Chinese and see if the same portion is in there as well or not.

This would be a great solution, also for Sanskrit. On the old site we used to have a hyphenation algorithm that worked reasonably well, but also added a those pesky invisible characters in the text. For instance for Sanskrit:
Screenshot from 2020-07-20 10-31-11

On the new SC site this was scrapped and it now just lets the word wrap:
Screenshot from 2020-07-20 10-32-19

@karl_lew: try reducing your screen on this (from the legacy site) and see what happens.

This comes from running the text though the hypenator: legacy-suttacentral/hyphenate.py at master · suttacentral/legacy-suttacentral · GitHub. But a browser-solution would be a whole lot better of course.

But I think this topic deserves it’s own tread in Meta.


Aha! Voice has much the same in Javascript. Thank you!

Voice supports Chrome and Chrome does not support hyphenation fully. Indeed, looking at all the bug requests about this issue is somewhat sad, since they appear to have thrown up their hands for good, and limit their work to Mac and Android. Not even my Google Chromebook is worthy enough of their attention. :see_no_evil:

So Voice will hyphenate just like the legacy Python. I’ll compare algorithms thanks to your link. My own implementation was a bit haphazard.


Yes, it’s puzzling TBH. Google seems to be a bit all over the place when it comes to international support.

You’d think with an Indian CEO they’d be more aware of long words!


When you are done with it, can you please send it to me? I might be able to use it for BuddhaNexus, although our calculation algorithms are a bit complicated and adding invisible characters in the string might mess it up. But it’s worth a try.

Is it in vanilla JS?


Hmm. What I have right now is pali.js which is a class in a NodeJS library for searching the Pali canon. It’s not directly usable in the browser because pali.js also implements a fuzzy set recognizer for Pali words used in the Pali canon. Voice uses this recognizer to pick out the Pali words spoken in English or German for special pronunciation handling. The hyphenation part of this, however, is indeed file independent. I have not yet translated the python into Javascript, so the existing hyphenation is quite crude but acceptable for simple legibility. I’d be happy to craft a shared library that met other needs.

On a side note, scv-bilara is the Linux command line tool I use to search the Pali canon. It is actually the search engine behind Voice and has more functions than Voice exposes.


:smiley: Yes, that is vanilla JS. That’s great because we are using the same setup as SuttaCentral (lit-elements), which also uses that.

Sofar the only thing we have done is put an overflow-wrap: break-word; on the segments so you don’t get those overly long words messing up your display but it is far from ideal. So please give me a heads-up when you are finished revamping that module and I will see if it can work on our system too.

We are also messing around with a search engine at the moment (using ArangoDB’s inbuild possibilities) so I might come back to you about this later on too. But I notice that you use a romanized version of the pali words, which we have chosen not to do because a diacritical can change the meaning of the word. So we are using another system which you can find here: buddhanexus-utils/paliwords.py at master · BuddhaNexus/buddhanexus-utils · GitHub (so you need to make sure everything is in lowercase first).


Anagarika @Sabbamitta, I’ve captured this as a new issue for the next release.


Thank you. I’ve only just one more request: Could we have instead of Vanilla JS … Chocolate JS?

:yum: :wink:

(Sorry, just couldn’t restrain myself. :grimacing:)


:laughing: :chocolate_bar: :chocolate_bar: :chocolate_bar: :chocolate_bar: :chocolate_bar:


Of course you can!
Here you go


Chocolate is an experimental full stack Node.js webapp framework built using Coffeescript.



My goodness! As the old saying goes, „Es gibt nichts, was es nicht gibt!“

Looks indeed very delicious. @karl_lew, what would Voice do with so much tasty stuff? Sit there all day, drinking coffee and eating chocolate, just too busy to answer your search request?