Now that is hysterical.
Ayya @Vimala, we have moved the Pali hyphenation work into the current release, given that we are awaiting SuttaCentral's own completion of the "published" branch. Your Python hyphenation code will be an improvement over the current Voice hyphenation. I'll start work on this tomorrow.
Well, it is not so crazy if you remember that JavaScript was named after … Java coffee.
Its original name was actually Mocha. (I guess developers really love their coffee!!!)
Ayya @Vimala, I have fixed your khamma.
The new, combined hyphenation module is in src/pali-hyphenator.js. I elected to put it in a separate file that can be included on its own for use in the browser. In this way, hyphenation can become a client-side view amendment with no server processing penalty.
I've combined our algorithms. Rather than use regex for exhaustive hyphenation, I've elected to keep the "split in half recursively" algorithm for maintainability, given the nature of complex regular expressions.
For testing examples, please see test/pali-hyphenator.
For installation with an existing NPM application:
npm install --save scv-bilara
For use in JavaScript:
const { PaliHyphenator } = require("scv-bilara");
const ph = new PaliHyphenator({ maxWord: 10 });
ph.hyphenate("kammakammakameleon"); // kamma-kamma-kameleon
P.S. Your actual kamma was unaffected.
Karl, on what basis (roughly) does the algorithm decide to insert a hyphen? Can it be adjusted for word length?
Thanks @karl_lew! Very much appreciated.
Not sure if we can actually use it for Buddhanexus at the moment (problems with the other algorithm, which divides each word of a segment that has a potential match into an array of letters, each with its own code), but I will see what can be done about this. It might not be working before the launch though!
The hyphenator has the following parameters:
- hyphen: the hyphenation character (default U+00AD, the Unicode soft hyphen)
- minWord: minimum chunk size of a hyphenated word (default 5)
- maxWord: maximum word size; larger words trigger hyphenation (default 20)
Roughly, the algorithm chews on any word chunk longer than maxWord. It recursively chops all larger chunks in half, splitting between consonants (except before "h") or between a vowel followed by a consonant. Certain words, defined by Ayya Vimala in an array, are atomic and never broken. I've had to amend her list to deal with special cases I found (e.g., ekaṃ, ekacce, kacca). Let us know of any special cases we need to address.
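For readers who want to follow along, the recursive split described above can be sketched roughly like this. This is a simplified illustration, not the actual PaliHyphenator code: the vowel set, the split heuristic, and the atomic-word list here are placeholders.

```javascript
// Simplified sketch of the recursive hyphenation idea described above.
// NOT the actual scv-bilara implementation; character classes and the
// atomic-word list are illustrative placeholders.
const HYPHEN = "\u00AD";           // Unicode soft hyphen (the default)
const VOWELS = "aāiīuūeo";
const ATOMIC = new Set(["kamma"]); // words that are never broken

function isVowel(c) { return VOWELS.includes(c); }

// Find a split point near the middle: between two consonants
// (but never before "h"), or between a vowel and a following consonant.
function splitPoint(word) {
  const mid = Math.floor(word.length / 2);
  for (let d = 0; d < mid; d++) {
    for (const i of [mid - d, mid + d]) {
      if (i <= 0 || i >= word.length) continue;
      const prev = word[i - 1];
      const cur = word[i];
      if (cur === "h") continue; // never split before "h"
      const consonantPair = !isVowel(prev) && !isVowel(cur);
      const vowelThenConsonant = isVowel(prev) && !isVowel(cur);
      if (consonantPair || vowelThenConsonant) return i;
    }
  }
  return mid; // fallback: just cut in half
}

// Recursively split any chunk longer than maxWord; atomic words pass through.
function hyphenate(word, maxWord = 20) {
  if (word.length <= maxWord || ATOMIC.has(word)) return word;
  const i = splitPoint(word);
  return hyphenate(word.slice(0, i), maxWord) + HYPHEN +
         hyphenate(word.slice(i), maxWord);
}
```

The exact split points will differ from the real module; the point is only to show the shape of the "chop in half, respecting consonant/vowel boundaries, skip atomic words" recursion.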
Actually, I can't fully grasp the algorithm and always consult the tests, which show actual splits for sample Pali words. I just write the code in response to the tests. It does what it does.
The hyphenation code should be applied by the browser to each viewed segment for presentation, after the search logic has done its magic in selecting a segment. I'll be using the new hyphenator in the Voice client. This will shift the hyphenation labor from the server to the client, where it properly belongs. Let me know if you need any hyphenator changes!
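Client-side application could look something like the sketch below. This is hypothetical, not actual Voice code: the helper name and the commented-out DOM selector are assumptions, and `hyphenateWord` stands in for `ph.hyphenate()`.

```javascript
// Hypothetical sketch: apply a word-level hyphenator to a segment's text
// before display. `hyphenateWord` stands in for ph.hyphenate() from the
// PaliHyphenator discussed above.
function hyphenateSegmentText(text, hyphenateWord) {
  return text
    .split(/(\s+)/) // capturing group keeps whitespace runs, preserving spacing
    .map(tok => (/\s/.test(tok) ? tok : hyphenateWord(tok)))
    .join("");
}

// In the browser, this would run after search has selected the segments,
// e.g. over each rendered segment element (the selector is an assumption):
//
// for (const el of document.querySelectorAll(".pali-segment")) {
//   el.textContent = hyphenateSegmentText(el.textContent, w => ph.hyphenate(w));
// }
```

Because the default hyphen is the invisible soft hyphen (U+00AD), the browser only shows a hyphen when it actually breaks the line there.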
Okay, thanks, it's good to know it can be tweaked.
I was just thinking: maybe it would minimize the work of the script to increase maxWord as much as possible, perhaps to 40 characters. It's only a serious problem if the word length is longer than the line length. But anyway, that's an application-level concern.
Exactly. For Voice, it turns out that 20 works well for iPhone 8 mobile without too much "jagginess" on the right. It also provides a nice balance for side-by-side vs. line-by-line bilingual viewing.
Say, you wouldn't be able to whip up a quick list of word length vs. frequency for long words?
I think Ayya @Vimala may already have the tools to do this. If not, we can add that issue to the Voice backlog.
Yes, indeed I can easily do this. Weekends are very busy for me, but I would be happy to do it sometime next week.
I was thinking that it would also be interesting to do the same not just for the total frequency but to show the division of frequencies between the various groups of collections. For instance, how often do certain words appear in the suttas, the Abhidhamma, and the commentaries?
Just a general question: are you open to mentoring/collaborating with others in the future on such projects related to the quantitative analysis of Buddhist texts to determine authenticity?
I have thought about this topic from time to time, but this is the first time that I think I have come across someone who has actually completed a concrete project in this area.
I am very impressed and happy with your project overall.
Here is a quick list of words over 40 characters in length. If you want, I can also run it for fewer characters; it only takes a few minutes. Unfortunately, D&D does not allow me to upload .zip files, so here it is: https://drive.google.com/file/d/1UtgVekklLV8_TYprB9PQ5HVyURaX6-iF/view?usp=sharing
There are several columns with numbers. The first column is the total word length in characters, the second is the frequency (i.e., how often the word was found in the texts), and the rest show how this number is divided between Early Suttas, Late Suttas, Vinaya, Abhidhamma, and Commentaries.
The spreadsheet is sorted on the latter five columns in descending order, so the first words that appear are those most used in the Early Suttas. But you can sort it any way you like, of course.
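For illustration, the counting involved is straightforward; here is a rough sketch in JavaScript of this kind of script (Ayya Vimala's actual scripts are not shown in this thread, and reading the corpus files per collection is omitted — the word-boundary regex is an assumption):

```javascript
// Illustrative sketch of a long-word frequency count of the kind
// described above. Corpus loading per collection is omitted.
function longWordFrequencies(text, minLength = 40) {
  const counts = new Map();
  // Split on any run of non-letter characters (\p{L} covers ā, ṃ, etc.);
  // lowercase first so capitalized occurrences are not counted separately.
  for (const raw of text.split(/[^\p{L}]+/u)) {
    const word = raw.toLowerCase();
    if (word.length > minLength) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}
```

Running this per collection (Early Suttas, Late Suttas, Vinaya, Abhidhamma, Commentaries) and merging the maps would give the per-collection columns in the spreadsheet.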
Hope this helps.
Thank you for your interest. Do you have a specific idea in mind? I can easily run some scripts across the whole Pali text corpus, as well as the Chinese and Tibetan. Sanskrit, however, is a bit more problematic due to the rather poor quality of the input files.
So, for words longer than 40 characters, there are 85 in the EBTs and about 7500 in later texts. That's a big difference!
Yes, indeed: 83 in the Early Suttas, 330 in the Suttas/Vinaya/Abhidhamma together, and the rest of the 7500 in the commentaries.
But I just noticed a bug: I need to make everything lowercase to really compare.
Bhante @sujato: a new version with slightly different results. I made everything lowercase first now; it was counting capitalized words as separate entries: https://drive.google.com/file/d/1SZp8iCfyAc3JMXHOLZDvcsJ-eAR79dPD/view?usp=sharing
Possible explanation: people tend to use shorter words when speaking than when writing? Early discourses might be aiming to represent actual spoken discourse, whereas later material might be aiming to conceptually elaborate on those, introducing complicated, overly formal, and technical words that may otherwise not be spoken in real life?
Thank you for letting me know. Let me think. Can I PM you about this?
The PaliHyphenator used by Voice is now part of a new repository, js-ebt. Voice uses it to hyphenate directly in the browser client. It works in all browsers and incorporates Ayya Vimala's own hyphenator logic. Hyphenation can be done client- or server-side.
I had to move the code out of scv-bilara, which has NodeJS dependencies. The new repository has no such dependencies.
Okay, awesome, thanks so much to both of you. I'll open a ticket to see whether we want to implement this. I'm guessing we'd probably want to do it client-side, but we'll see.