Average Word Length as indicator of Pali text age

karl_lew · July 22, 2020, 11:51am

Now that is hysterical.

karl_lew · July 22, 2020, 12:26pm

Ayya @Vimala we have moved the Pali hyphenation work into the current release given that we are awaiting SuttaCentral’s own completion of the “published” branch. Your Python hyphenation code will be an improvement over the current Voice hyphenation. I’ll start work on this tomorrow.

Vimala · July 23, 2020, 6:02am

Well, it is not so crazy if you remember that JavaScript was named after … Java coffee.
Its original name was actually Mocha. (I guess developers really love their coffee!!!)

karl_lew · July 29, 2020, 1:09pm

Ayya @Vimala, I have fixed your khamma.

The new, combined hyphenation module is in src/pali-hyphenator.js. I elected to use a separate file that may be included separately for use in the browser. In this way, hyphenation can become a client view amendment with no server processing penalty.

I’ve combined our algorithms. Rather than use regex for exhaustive hyphenation, I’ve elected to keep the “split in half recursively” algorithm for maintainability given the nature of complex regular expressions.

For testing examples, please see test/pali-hyphenator.

For installation with an existing NPM application:

npm install --save scv-bilara

For use in JavaScript:

const { PaliHyphenator } = require("scv-bilara");

var ph = new PaliHyphenator({maxWord:10});
ph.hyphenate("kammakammakameleon"); // kamma-kamma-kameleon

p.s., Your actual kamma was unaffected.

sujato · July 29, 2020, 10:31pm

Karl, on what basis (roughly) does the algorithm decide to insert a hyphen? Can it be adjusted for word length?

Vimala · July 30, 2020, 10:48am

Thanks @karl_lew! Very much appreciated.
Not sure if we can actually use it for Buddhanexus at the moment (there are problems with the other algorithm, which divides each word of a segment that has a potential match into an array of letters, each with its own code), but I will see what can be done about this. It might not be working before the launch, though!

karl_lew · July 30, 2020, 11:18am

The hyphenator has these parameters:

hyphen: the (soft) hyphenation character (default: U+00AD, the Unicode soft hyphen)
minWord: minimum chunk size of hyphenated word (default 5)
maxWord: maximum word size. Larger words trigger hyphenation (default 20)

Roughly, the algorithm chews on any word chunk longer than maxWord. It recursively chops all larger chunks in half, splitting between consonants (except before “h”) or between a vowel followed by a consonant. Certain words, defined by Ayya Vimala in an array, are atomic and never broken. I’ve had to amend her list to deal with special cases I found (e.g., ekaṁ, ekacce, kacca). Let us know of any special cases we need to address.
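
For readers who want to see the "split in half recursively" idea in code, here is a minimal sketch. It is not the code in src/pali-hyphenator.js: the names hyphenateSketch, splitChunk and canBreakBefore are made up for illustration, the vowel list is an assumption, the list of atomic words is left out entirely, and the break point is simply the legal break nearest the middle. Only the parameter names and defaults (hyphen, minWord, maxWord) come from the description above.

```javascript
// Illustrative sketch only; NOT the implementation in src/pali-hyphenator.js.
// Assumption: these eight letters are the vowels. Everything else, including
// the niggahita, is treated as a consonant. Atomic words are not handled.
const VOWELS = new Set([..."aāiīuūeo".normalize("NFC")]);
const isVowel = (ch) => VOWELS.has(ch.toLowerCase());

// A break before chars[i] is allowed between a vowel and a consonant, or
// between two consonants unless the second one is "h".
function canBreakBefore(chars, i) {
  const prev = chars[i - 1];
  const next = chars[i];
  if (isVowel(next)) return false;
  return isVowel(prev) || next.toLowerCase() !== "h";
}

// Recursively halve any chunk longer than maxWord at the legal break nearest
// its midpoint, keeping both halves at least minWord characters long.
function splitChunk(chars, minWord, maxWord) {
  if (chars.length <= maxWord) return [chars.join("")];
  const min = Math.max(1, minWord); // guard against empty halves
  const mid = Math.floor(chars.length / 2);
  for (let d = 0; d <= mid; d++) {
    for (const i of [mid - d, mid + d]) {
      const fits = i >= min && chars.length - i >= min;
      if (fits && canBreakBefore(chars, i)) {
        return [
          ...splitChunk(chars.slice(0, i), minWord, maxWord),
          ...splitChunk(chars.slice(i), minWord, maxWord),
        ];
      }
    }
  }
  return [chars.join("")]; // no legal break: leave the chunk whole
}

function hyphenateSketch(word, { hyphen = "\u00AD", minWord = 5, maxWord = 20 } = {}) {
  const chars = [...word.normalize("NFC")];
  return splitChunk(chars, minWord, maxWord).join(hyphen);
}
```

Because it has no atomic-word list and picks break points differently, this sketch will not always agree with the real module. For example, hyphenateSketch("kammakammakameleon", { maxWord: 10, hyphen: "-" }) splits the word into two chunks rather than the three shown earlier in the thread, so the tests in test/pali-hyphenator remain the reference for actual behaviour.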

Actually, I can’t fully grasp the algorithm and always consult the tests that show actual splits for sample Pali words. I just write the code in response to the tests. It does what it does.

The hyphenation code should be applied by the browser to each viewed segment for presentation after the search stuff has done its magic in selecting a segment. I’ll be using the new hyphenator in the Voice client. This will shift the hyphenation labor from the server to the client, where it properly belongs. Let me know if you need any hyphenator changes!

sujato · July 30, 2020, 10:11pm

Okay, thanks, it’s good to know it can be tweaked.

I was just thinking, maybe it would minimize the work of the script to increase maxWord as much as possible, perhaps to 40 characters. It’s only a serious problem if the word length is longer than the line length. But anyway, that’s an application-level concern.

karl_lew · July 31, 2020, 1:30pm

Exactly. For Voice, it turns out that 20 works well for iPhone8 mobile without too much “jagginess” on the right. It also provides a nice balance for side-by-side vs line-by-line bilingual viewing.

sujato · July 31, 2020, 10:26pm

Say, you wouldn’t be able to whip up a quick list of word-length vs. frequency for long words?

karl_lew · August 1, 2020, 1:23pm

I think Ayya @Vimala may have the tools already to do such. If not, we can add that issue to the Voice backlog.

Vimala · August 1, 2020, 1:36pm

Yes, indeed I can easily do this. Weekends are very busy for me but I would be happy to do it sometime next week.

I was thinking that it would also be interesting to do the same not just for the total frequency but to show how the frequencies are divided between the various groups of collections. For instance, how often do certain words appear in the suttas, the abhidhamma and the commentaries?

SeriousFun136 · August 1, 2020, 6:21pm

Just a general question: are you open to mentoring or collaborating with others in the future on projects like this, related to the quantitative analysis of Buddhist texts to determine authenticity?

I have thought on this topic from time to time, but this is the first time that I think I came across someone who has actually completed a concrete project in this area.

I am very impressed and happy with your project overall.

Vimala · August 3, 2020, 5:56am

Here is a quick list of words of over 40 characters in length. If you want I can also run it for fewer characters; it only takes a few minutes. Unfortunately, D&D does not allow me to upload .zip files so here it is: https://drive.google.com/file/d/1UtgVekklLV8_TYprB9PQ5HVyURaX6-iF/view?usp=sharing

There are several columns with numbers. The first column is the total word-length in characters, the second is the frequency (i.e. how often the word was found in the texts) and the rest is how this number is divided between Early Suttas, Late Suttas, Vinaya, Abhidhamma and Commentaries.
The spreadsheet is sorted on the last five columns in descending order, so the first words that appear are those that are most used in the Early Suttas. But you can sort it any way you like of course.
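
As a rough illustration of how a table with these columns could be built, here is a small JavaScript sketch. It is not the script that produced the spreadsheet: the function name longWordTable, the input shape (an array of { word, collection } records) and the collection labels are assumptions made for the example, and words are counted exactly as spelled.

```javascript
// Illustrative sketch only; not the script used for the spreadsheet.
// Assumption: the five groups are labelled as in the column description.
const COLLECTIONS = ["Early Suttas", "Late Suttas", "Vinaya", "Abhidhamma", "Commentaries"];

// tokens: array of { word, collection } records (hypothetical input shape).
// Returns one row per word longer than minLength characters.
function longWordTable(tokens, minLength = 40) {
  const rows = new Map();
  for (const { word, collection } of tokens) {
    // Count characters after NFC normalization so that a letter plus a
    // combining diacritic is not counted twice where a precomposed form exists.
    const length = [...word.normalize("NFC")].length;
    if (length <= minLength) continue;
    let row = rows.get(word);
    if (!row) {
      const counts = Object.fromEntries(COLLECTIONS.map((c) => [c, 0]));
      row = { word, length, total: 0, counts };
      rows.set(word, row);
    }
    row.total += 1;
    row.counts[collection] = (row.counts[collection] ?? 0) + 1;
  }
  // Sort on the five per-collection columns, descending, in column order.
  return [...rows.values()].sort((a, b) => {
    for (const c of COLLECTIONS) {
      const diff = b.counts[c] - a.counts[c];
      if (diff !== 0) return diff;
    }
    return 0;
  });
}
```

Each returned row carries the word, its length, the total frequency and the per-collection counts, which matches the column layout described above.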

Hope this helps.

Thank you for your interest. Do you have a specific idea in mind? I can easily run some scripts across the whole Pali corpus of text as well as the Chinese and Tibetan. Sanskrit however is a bit more problematic due to the rather bad quality of the input files.

sujato · August 3, 2020, 6:01am

So, for words longer than 40 characters, there are 85 in the EBTs, and about 7500 in later texts. That’s a big difference!

Vimala · August 3, 2020, 6:03am

Yes, indeed. 83 in the Early Suttas, 330 in the Sutta/Vinaya/Abhidhamma together and the rest of the 7500 is in the commentaries.

But I just noticed a bug: I need to make everything lowercase to really compare.

Vimala · August 3, 2020, 6:15am

Bhante @sujato: new version with slightly different results. I made everything lowercase first now. It was counting capitalized words separately: https://drive.google.com/file/d/1SZp8iCfyAc3JMXHOLZDvcsJ-eAR79dPD/view?usp=sharing
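
The effect of the lowercase fix can be shown with a small JavaScript sketch. The helper name mergeCaseVariants and the Map-based input are assumptions for illustration, not part of the actual script.

```javascript
// Illustrative sketch only: fold word counts so that "Dhamma" and "dhamma"
// are counted as one word. Input and output are Maps of word -> frequency.
function mergeCaseVariants(counts) {
  const merged = new Map();
  for (const [word, n] of counts) {
    const key = word.toLowerCase(); // also lowercases letters such as Ā
    merged.set(key, (merged.get(key) ?? 0) + n);
  }
  return merged;
}
```

Without this step a word that sometimes opens a sentence shows up as two rows, each with a lower frequency than the word really has.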

SeriousFun136 · August 3, 2020, 10:53am

Possible explanation:
People tend to use shorter words when speaking than when writing?

Early discourses might be aiming to represent actual spoken discourse,
whereas later material might be aiming to elaborate on those conceptually, introducing complicated, overly formal and technical words that would otherwise not be spoken in real life?

Thank you for letting me know. Let me think. Can I pm you about this?

karl_lew · August 3, 2020, 6:17pm

@Hongda @Sujato @Blake

The PaliHyphenator used by Voice is now part of a new repository, js-ebt. Voice uses it to hyphenate directly in the browser client. It works for all browsers and incorporates Ayya Vimala’s own hyphenator logic. Hyphenation can be done client or server side.

I had to move the code out of scv-bilara, which has NodeJS dependencies. The new repository has no such dependencies.

sujato · August 3, 2020, 9:28pm

Okay, awesome, thanks so much to both of you. I’ll open a ticket to see whether we want to implement this. I’m guessing we’d probably want to do it client-side, but we’ll see.