He compared different languages, not the changes within one language over time. This is very different because it is a known fact that Pali has changed over time. One of the changes is that words have sort of merged together to form longer words.
This research basically only confirms a lot of what we already know, also from other sources. Of course there are other factors then time that influence the average wordlength like for instance the genre and the above research also shows that for instance verses use relatively much shorter words.
If you want to know more about the various indicators as to which texts are considered early and which are considered later, I suggest you read Bhante Sujato’s and Ajahn Brahmali’s excellent work on this. You can find it here together with all the YouTube videos with the classes they gave in 2013.
Is there a plan to add such information in SuttaCentral?
It would be interested to have some sort of indication about the earlyness/lateness of a sutta (or collection of sutta) one is reading… like a symbol or little text or coloured-text; based on objective criteria (or as objective as possible let’s say) such as the ones mentioned in B. Sujato and Brahmali’s book. Filtering searches or even the website using such criterion could also be very interesting.
Of course, such tool could easily be misinterpreted… so it would have to come with clear warnings about the limitations of such classifications.
Could this method be applied to suttas where portions appear to be later than others? For example the Pali version of SN 56.11 has the gods exclaiming the triumph of the Buddha yet the Chinese versions don’t mention that at all. If applied to a single sutta, could it show sections with distinct anomalies, thus showing the possibility of a later addition?
Anyone wanting to help us out with this, upvote the proposal for hyphenation aliasses in CSS!
Here on D&D, this word hyphenates because the browser has been told that it is English, and it applies English hyphenation rules. Properly, however, the browser should know that it is Pali. But there are no hyphenation rules for Pali built into browsers. Anyway, the above proposal addresses this problem.
You can always try to calculate the average word length. But it is only a very rough indication and it does not work very well with smaller portions of text; it works best to see general trends in collections. There are other indicators that work better for smaller portions of text like the parallels in other schools; so look a the parallel in Chinese and see if the same portion is in there as well or not.
This would be a great solution, also for Sanskrit. On the old site we used to have a hyphenation algorithm that worked reasonably well, but also added a those pesky invisible characters in the text. For instance for Sanskrit:
On the new SC site this was scrapped and it now just lets the word wrap:
@karl_lew: try reducing your screen on this (from the legacy site) and see what happens.
Voice supports Chrome and Chrome does not support hyphenation fully. Indeed, looking at all the bug requests about this issue is somewhat sad, since they appear to have thrown up their hands for good, and limit their work to Mac and Android. Not even my Google Chromebook is worthy enough of their attention.
So Voice will hyphenate just like the legacy Python. I’ll compare algorithms thanks to your link. My own implementation was a bit haphazard.
When you are done with it, can you please send it to me? I might be able to use it for BuddhaNexus, although our calculation algorithms are a bit complicated and adding invisible characters in the string might mess it up. But it’s worth a try.
On a side note, scv-bilara is the Linux command line tool I use to search the Pali canon. It is actually the search engine behind Voice and has more functions than Voice exposes.
Yes, that is vanilla JS. That’s great because we are using the same setup as SuttaCentral (lit-elements), which also uses that.
Sofar the only thing we have done is put an overflow-wrap: break-word; on the segments so you don’t get those overly long words messing up your display but it is far from ideal. So please give me a heads-up when you are finished revamping that module and I will see if it can work on our system too.
We are also messing around with a search engine at the moment (using ArangoDB’s inbuild possibilities) so I might come back to you about this later on too. But I notice that you use a romanized version of the pali words, which we have chosen not to do because a diacritical can change the meaning of the word. So we are using another system which you can find here: buddhanexus-utils/paliwords.py at master · BuddhaNexus/buddhanexus-utils · GitHub (so you need to make sure everything is in lowercase first).