Average Word Length as indicator of Pali text age

Vimala · July 16, 2020, 11:40am

He compared different languages, not the changes within one language over time. This is very different because it is a known fact that Pali has changed over time. One of the changes is that words have sort of merged together to form longer words.

This research basically only confirms a lot of what we already know, also from other sources. Of course, there are factors other than time that influence the average word length, such as genre: the above research also shows that verses, for instance, use relatively much shorter words.

If you want to know more about the various indicators as to which texts are considered early and which are considered later, I suggest you read Bhante Sujato’s and Ajahn Brahmali’s excellent work on this. You can find it here together with all the YouTube videos with the classes they gave in 2013.

https://www.samita.be/en/media/authenticity-of-early-buddhist-texts/

Yasoj · July 16, 2020, 11:55am

Is there a plan to add such information in SuttaCentral?

It would be interesting to have some sort of indication of the earliness or lateness of the sutta (or collection of suttas) one is reading, such as a symbol, a short label, or coloured text, based on objective criteria (or as objective as possible, let’s say) such as the ones mentioned in Bhante Sujato and Ajahn Brahmali’s book. Filtering searches, or even the whole website, by such criteria could also be very interesting.

Of course, such a tool could easily be misinterpreted, so it would have to come with clear warnings about the limitations of such classifications.

Ceisiwr · July 16, 2020, 12:49pm

Thank you for clarifying Ayya.

Ceisiwr · July 16, 2020, 12:52pm

Although his work is a bit old, Pande wrote about which suttas he considered to be early and which late, if you are interested: http://www.ahandfulofleaves.org/documents/Studies%20in%20the%20Origins%20of%20Buddhism_Pande.pdf

I’ve attached an example of his findings regarding the MN.

sujato · July 16, 2020, 10:38pm

No, it is too uncertain. Most of the interesting results here apply to a longer span than just the suttas.

Pande’s findings are really unreliable. Kudos to him for trying, but there is a reason why subsequent scholars have not picked up this task. Maybe at some point it could be done.

Ceisiwr · July 16, 2020, 10:55pm

Interesting. I suspected his ideas might be out of date by now.

karl_lew · July 17, 2020, 6:44pm

(a recent discovery that we had to hyphenate for sanity. Smartphone palm leaves are a bit narrow)

MN142:4.3: abhivādanapaccuṭṭhānaañjalikammasāmīcikammacīvarapiṇḍapātasenāsanagilānappaccayabhesajjaparikkhārānuppadānena.

Adutiya · July 18, 2020, 3:14pm

Could this method be applied to suttas where portions appear to be later than others? For example the Pali version of SN 56.11 has the gods exclaiming the triumph of the Buddha yet the Chinese versions don’t mention that at all. If applied to a single sutta, could it show sections with distinct anomalies, thus showing the possibility of a later addition?

sujato · July 19, 2020, 10:11pm

Anyone wanting to help us out with this, upvote the proposal for hyphenation aliases in CSS!

github.com/w3c/csswg-drafts

[css-text] Allow alias for language hyphenation

opened 01:34AM - 30 Jun 20 UTC

sujato

css-text-4 i18n-tracker

The CSS spec provides for hyphenation of text, leaving the choice of language up to the UA: https://www.w3.org/TR/css-text-4/#hyphenation

Currently Firefox offers the best support, but even they only support a fairly small subset of the world's languages. https://developer.mozilla.org/en-US/docs/Web/CSS/hyphens

The thing is, it is sometimes better to have imperfect hyphenation than none at all. No hyphenation can result in a broken UI and unreadable text, whereas imperfect hyphenation might work fine, or at worst be merely inelegant.

I work with texts in Pali and Sanskrit, which can have very long words formed by compounding. There is no browser support for hyphenation for these, nor is there likely to be. Surely these are not the only languages affected.

Here is a typical example, rendered in Firefox:

![Screenshot from 2020-06-30 09-26-16](https://user-images.githubusercontent.com/6112010/86071515-ccc36280-bac2-11ea-8338-b6cee8657cdf.png)

It is possible to hack around this by activating hyphens and setting `lang='la'`:

![Screenshot from 2020-06-30 09-25-58](https://user-images.githubusercontent.com/6112010/86071483-b6b5a200-bac2-11ea-906e-5b711bc018b8.png)

This is identical to the result that a proper Pali hyphenation would produce. Note that in traditional Indic orthography, there is no concept of a correct breakpoint; scribes merely wrote to the end of the line and continued on the next line. Thus the traditional practice would agree with the idea that sometimes *any* breakpoint is better than none.

However, it's obviously not a good idea to deliberately set a false language. Hence my proposal: **Allow the CSS to declare a language alias for hyphenation**. So the text language is unaffected, and the HTML does not change. But the user can declare via CSS something like:

```
hyphenate-alias-languages: pli, la;
```

Meaning: "for the purpose of hyphenation, Latin and Pali may be substituted."

Such substitution would apply only if explicit support for that language is missing. So if `lang='pli'` is set on the HTML, then if a UA has support for Pali hyphens, that is used; if not, it looks for support for Latin.

Here on D&D, this word hyphenates because the browser has been told that it is English, and it applies English hyphenation rules. Properly, however, the browser should know that it is Pali. But there are no hyphenation rules for Pali built into browsers. Anyway, the above proposal addresses this problem.

Vimala · July 20, 2020, 8:41am

You can always try to calculate the average word length. But it is only a very rough indication and it does not work very well with smaller portions of text; it works best for seeing general trends in collections. There are other indicators that work better for smaller portions of text, like the parallels in other schools; so look at the parallel in Chinese and see whether the same portion is in there as well or not.
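For anyone who wants to try this, here is a minimal sketch of the calculation in Python. It is not the method used in the research discussed above, whose details are not given in this thread; the function name `average_word_length` and the choice to count letters only are assumptions made for illustration.

```python
import re
import unicodedata


def average_word_length(text: str) -> float:
    """Mean number of letters per word (hypothetical helper, not from any SC tool)."""
    # Normalize to NFC so a letter with a diacritic (ā, ṭ, ṃ) is one code point.
    text = unicodedata.normalize("NFC", text)
    # A "word" here is an unbroken run of Unicode letters; spaces, digits
    # and punctuation all act as separators.
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


# Compare whole collections rather than single passages: on a short
# sample, one long compound dominates the average.
short_sample = "evaṃ me sutaṃ"
print(average_word_length(short_sample))
```

Because a single compound can swing the result for a short passage, the number is only meaningful when the input is a large body of text, which matches the caveat above about small portions.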

Upvoted!
This would be a great solution, also for Sanskrit. On the old site we used to have a hyphenation algorithm that worked reasonably well, but it also added those pesky invisible characters to the text. For instance for Sanskrit:

On the new SC site this was scrapped and it now just lets the word wrap:

@karl_lew: try reducing your screen on this (from the legacy site) and see what happens.
abhivādanapaccuṭṭhānaañjalikammasāmīcikammacīvarapiṇḍapātasenāsanagilānappaccayabhesajjaparikkhārānuppadānena.

This comes from running the text through the hyphenator: legacy-suttacentral/utility/pali-tools/hyphenate.py at master · suttacentral/legacy-suttacentral · GitHub. But a browser solution would be a whole lot better, of course.
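To make the idea of "invisible characters" concrete, here is a rough Python sketch that inserts soft hyphens (U+00AD) into long words. It is not the algorithm in the legacy `hyphenate.py`; the break rule (after a vowel, before a consonant), the thresholds, and the function names are all assumptions chosen to keep the example short.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible unless the renderer breaks the line here
VOWELS = set("aāiīuūeo")


def soft_hyphenate(word: str, min_len: int = 20, min_chunk: int = 6) -> str:
    """Insert soft hyphens into a long word at crude vowel/consonant boundaries."""
    if len(word) < min_len:
        return word
    out = []
    since_break = 0
    for i, ch in enumerate(word):
        out.append(ch)
        since_break += 1
        nxt = word[i + 1] if i + 1 < len(word) else ""
        remaining = len(word) - i - 1
        # Break only after a vowel and before a consonant, and keep every
        # chunk (including the tail) at least min_chunk characters long.
        if (ch.lower() in VOWELS and nxt.isalpha()
                and nxt.lower() not in VOWELS
                and since_break >= min_chunk and remaining >= min_chunk):
            out.append(SOFT_HYPHEN)
            since_break = 0
    return "".join(out)


def hyphenate_text(text: str) -> str:
    """Apply soft_hyphenate to every whitespace-separated token."""
    return re.sub(r"\S+", lambda m: soft_hyphenate(m.group()), text)


def strip_soft_hyphens(text: str) -> str:
    """Undo the hyphenation, e.g. before searching or comparing strings."""
    return text.replace(SOFT_HYPHEN, "")
```

The last helper is the important one in practice: any code that searches, compares, or measures the text should strip the soft hyphens first, since they are real characters in the string even though they are not displayed.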

But I think this topic deserves its own thread in Meta.

karl_lew · July 20, 2020, 7:29pm

Aha! Voice has much the same in JavaScript. Thank you!

Voice supports Chrome and Chrome does not support hyphenation fully. Indeed, looking at all the bug requests about this issue is somewhat sad, since they appear to have thrown up their hands for good, and limit their work to Mac and Android. Not even my Google Chromebook is worthy enough of their attention.

So Voice will hyphenate just like the legacy Python. I’ll compare algorithms thanks to your link. My own implementation was a bit haphazard.

sujato · July 20, 2020, 11:41pm

Yes, it’s puzzling TBH. Google seems to be a bit all over the place when it comes to international support.

You’d think with an Indian CEO they’d be more aware of long words!

Vimala · July 21, 2020, 5:58am

When you are done with it, can you please send it to me? I might be able to use it for BuddhaNexus, although our calculation algorithms are a bit complicated and adding invisible characters in the string might mess it up. But it’s worth a try.

Is it in vanilla JS?

karl_lew · July 21, 2020, 12:12pm

Hmm. What I have right now is pali.js, which is a class in a NodeJS library for searching the Pali canon. It’s not directly usable in the browser because pali.js also implements a fuzzy set recognizer for Pali words used in the Pali canon. Voice uses this recognizer to pick out the Pali words spoken in English or German for special pronunciation handling. The hyphenation part of this, however, is indeed file independent. I have not yet translated the Python into JavaScript, so the existing hyphenation is quite crude but acceptable for simple legibility. I’d be happy to craft a shared library that met other needs.

On a side note, scv-bilara is the Linux command line tool I use to search the Pali canon. It is actually the search engine behind Voice and has more functions than Voice exposes.

Vimala · July 22, 2020, 6:35am

Yes, that is vanilla JS. That’s great because we are using the same setup as SuttaCentral (lit-elements), which also uses that.

So far the only thing we have done is put an overflow-wrap: break-word; on the segments so you don’t get those overly long words messing up your display, but it is far from ideal. So please give me a heads-up when you are finished revamping that module and I will see if it can work on our system too.

We are also messing around with a search engine at the moment (using ArangoDB’s built-in capabilities), so I might come back to you about this later on too. But I notice that you use a romanized version of the Pali words, which we have chosen not to do because a diacritic can change the meaning of the word. So we are using another system, which you can find here: https://github.com/BuddhaNexus/buddhanexus-utils/blob/master/paliwords.py (so you need to make sure everything is in lowercase first).
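To illustrate the concern about diacritics, here is a small Python sketch contrasting two ways of building a search key. It does not reproduce what `paliwords.py` or scv-bilara actually do; both function names are hypothetical, and the only assumption is that one approach folds diacritics away while the other keeps them.

```python
import unicodedata


def fold_to_ascii(word: str) -> str:
    """Lossy key: lowercase, then drop all combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_key(word: str) -> str:
    """Lossless key: lowercase and normalize to NFC, keeping diacritics."""
    return unicodedata.normalize("NFC", word.lower())


# "mala" and "mālā" are different Pali words, but they collide once the
# diacritics are folded away.
print(fold_to_ascii("mala") == fold_to_ascii("Mālā"))
print(diacritic_key("mala") == diacritic_key("Mālā"))
```

Folding is friendlier for users who cannot easily type diacritics, while the lossless key avoids false matches; which trade-off is right depends on whether the key is used for human search or for automated text comparison.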

karl_lew · July 22, 2020, 9:55am

Anagarika @Sabbamitta, I’ve captured this as a new issue for the next release.

sabbamitta · July 22, 2020, 9:58am

Thank you. I have just one more request: could we have, instead of Vanilla JS … Chocolate JS?

(Sorry, just couldn’t restrain myself.)

karl_lew · July 22, 2020, 10:34am

Vimala · July 22, 2020, 11:18am

Of course you can!
Here you go

https://chocolatejs.org/

Chocolate is an experimental full stack Node.js webapp framework built using CoffeeScript.

sabbamitta · July 22, 2020, 11:49am

My goodness! As the old saying goes, „Es gibt nichts, was es nicht gibt!“ (“There is nothing that doesn’t exist!”)

Looks indeed very delicious. @karl_lew, what would Voice do with so much tasty stuff? Sit there all day, drinking coffee and eating chocolate, just too busy to answer your search request?