Pāli spell checker and hyphenator

Antonio-Costanzo · October 5, 2020, 12:25am

Dear all,

I’m not sure to what extent this topic has been discussed before, so I apologize if I’m reopening a question that has already been resolved.

I’m trying to make a Pāli spell checker and hyphenator for Hunspell. I contacted LibreOffice about this, and someone replied inviting me to collect Pāli words. I began by collecting words from the first part of the Dīgha Nikāya, from which they made a first “draft” of the Pāli spell checker. That draft also works in PagePlus X9 and Affinity Publisher. I then collected more words from tipitaka.org: all the words from the whole Tipiṭaka, Aṭṭhakathā, Ṭīkā and so on. I tried to contact the person at LibreOffice again in order to make a complete Pāli spell checker, but they didn’t answer. Anyway, I managed to make a complete “spell checker” just by renaming my .txt file to a .dic file. This crude spell checker more or less works in PagePlus X9, but it doesn’t work at all in Affinity Publisher. (I have to assign it to an existing language, as X9 doesn’t recognize Pāli (pi).) The discussion at LibreOffice is here: Is it possible to create a Pāli dictionary for Libreoffice? - Ask LibreOffice

Some months later I came across a Pāli hyphenator which Cittānurakkho Bhikkhu made for LaTeX. Cittānurakkho Bhikkhu was kind enough to share his hyphenator with me, and I again contacted the person at LibreOffice in order to create both a complete Pāli spell checker and a hyphenator. Again, no response.

I tried Cittānurakkho bhikkhu’s hyphenator in Affinity Publisher. Somehow it seems to work as you can see from the image below.

Does anyone have the skills to build a Hunspell spell checker from a .txt word list? I have two versions, Pāli ṃ and Pāli ṁ. As for the hyphenator, Cittānurakkho’s seems to work. It might only need editing if someone prefers a different style of hyphenation. For example, I’m aware that some people would rather split a word such as buddha as bu-ddha than as bud-dha. Any comments on this last issue would also be appreciated.

Thank you.

sujato · October 5, 2020, 5:14am

Hi Antonio!

I hope you get some help; it would be great to have a proper hyphenator for Hunspell.

As you note, the example does indeed appear to hyphenate correctly.

One option that sometimes works is to define the language as Sanskrit. Have you tried that? There may already be what you need there.

No, the former (bud-dha) is correct. The syllables are correctly broken at bud/dha, and the hyphenator should reflect that, not what people may or may not want.

Antonio-Costanzo · October 5, 2020, 6:11am

Dear Bhante Sujato,

Thank you for your reply. I have some texts which contain both Sanskrit and Pāli words. Affinity Publisher recognizes Sanskrit but not Pāli. I have had to use another language in order to be able to use the Pāli spell checker and hyphenator. I have placed my .dic files in the folder for fi_FI (I don’t use Finnish) so that I can use them. If I try to make a pi (= Pāli) folder, both Affinity and PagePlus display a message like “unknown local settings”. Maybe I should raise this with the developers.

Anyway, I hope someone can make a .dic file using my Pāli .txt word list(s).
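
For anyone who wants to try this by hand: a Hunspell .dic file is a plain word list whose first line gives the (approximate) number of entries, and it needs an accompanying .aff file that at least declares the encoding. The following is only a minimal sketch under those assumptions, not the method that was eventually used. The function names and the file base name are hypothetical, and no affix rules are generated, so every inflected form has to be present in the list itself. The helper to_dot_above is one possible way to derive the Pāli ṁ list from the Pāli ṃ list.

```python
import unicodedata
from pathlib import Path


def load_words(txt_path):
    """Read a one-word-per-line list, normalise to NFC, drop blanks and duplicates."""
    words = set()
    for line in Path(txt_path).read_text(encoding="utf-8").splitlines():
        word = unicodedata.normalize("NFC", line.strip())
        if word:
            words.add(word)
    return sorted(words)


def to_dot_above(word):
    """Rewrite the niggahita from dot below (U+1E43/U+1E42) to dot above (U+1E41/U+1E40)."""
    return word.replace("\u1e43", "\u1e41").replace("\u1e42", "\u1e40")


def write_hunspell(words, basename):
    """Write <basename>.dic and a minimal <basename>.aff (hypothetical base name)."""
    dic = Path(f"{basename}.dic")
    aff = Path(f"{basename}.aff")
    # First line of a .dic file is the entry count, then one word per line.
    dic.write_text("\n".join([str(len(words)), *words]) + "\n", encoding="utf-8")
    # Minimal affix file: it only declares the encoding and defines no rules.
    aff.write_text("SET UTF-8\n", encoding="utf-8")
    return dic, aff
```

Typical use would be write_hunspell(load_words("pali.txt"), "pi"), where pali.txt stands for your own word list. Whether a given application then accepts the language code is a separate question, as the “unknown local settings” message above shows.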

Antonio-Costanzo · November 5, 2020, 9:26am

Thanks to LibreOffice, we now have a Pāli spell checker and hyphenator based on CST 4.0. Here, you can download the Hunspell dictionaries:

There are two versions, Pāli ṃ and Pāli ṁ.

If you are not able to extract the Hunspell dictionary from the .oxt file, you can download the .dic files directly (only for Pāli ṃ) from here:

https://mega.nz/folder/tLRyUJJL#nNbDBMqtXZQsGwxwZNH6TQ
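
On the extraction step: an .oxt extension is an ordinary ZIP archive, so renaming it to .zip and opening it normally should also work. As a rough sketch, assuming the dictionary files inside end in .dic or .aff (the function name and the paths are hypothetical), the files could be pulled out like this:

```python
import zipfile
from pathlib import Path


def extract_hunspell_files(oxt_path, out_dir):
    """Copy every .dic and .aff member of an .oxt archive into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(oxt_path) as archive:
        for member in archive.namelist():
            # Keep only the file name, ignoring folders inside the archive.
            name = Path(member).name
            if name.lower().endswith((".dic", ".aff")):
                target = out / name
                target.write_bytes(archive.read(member))
                extracted.append(target)
    return sorted(extracted)
```

Note that files with the same name in different folders of the archive would overwrite each other in this flat layout.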

sujato · November 5, 2020, 11:05pm

Thanks so much!

I’m looking at the source for hyph_pali.dic and I’m wondering how to interpret it.

It seems that the numerals define hyphenation points (or lack thereof) with a strength? Is that right?

Do you know of any explanation for how the pattern works?

bksubhuti · November 6, 2020, 1:07am

I was thinking about doing this myself. Let me know if you need words. There are roughly 950k words. I can send you… I might already have a GitHub repo for this as well.

Our algo works on this word list. It is in the Tipitaka Pali Projector repo.
I think this is it… go with the first created one…

Antonio-Costanzo · November 6, 2020, 9:01am

Unfortunately, I have no idea how a hyphenator works. All I did in this regard was to include in my sources the hyphenator made by Cittānurakkho Bhikkhu for LaTeX. Then I sent it to Gabix, the person at LibreOffice who helped me with the spell checker. I don’t know whether Gabix had to edit Cittānurakkho’s hyphenator or used it as is in the Hunspell version. Maybe you could e-mail Cittānurakkho.

Antonio-Costanzo · November 6, 2020, 9:17am

It would be great to add more words, as the current spell checker contains only word forms that actually occur in the canon. As far as I remember, I didn’t include words from Aṭṭhakathayo or Ṭīkayo. So words in thematic (stem) form, such as dhamma or buddha, are not included. You will find only forms actually used in the canonical collection, such as buddho, buddhā, buddhāna, dhammā, dhammo and so on. Maybe you could join our discussion here

I think Gabix could help us in enlarging our spell checker.

mikenz66 · November 6, 2020, 9:43am

The TeX hyphenation algorithm is very sophisticated. One of Donald Knuth’s students wrote a PhD thesis on it…
See, for example:

[LaTeX is a very comprehensive set of TeX macros: LaTeX - Wikipedia. Hardly anyone uses “Plain TeX”, because if you do you have to write quite a lot of layout code, and it’s a lot easier to just use the LaTeX packages. However, the really clever stuff that gives us beautiful typesetting is in the underlying TeX code. Knuth, a Stanford computer scientist, took a break from writing books about algorithms in the 70s to write the tools he needed to typeset the books properly.]
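
On the question of how to read the numerals in a pattern file: in TeX-style (Liang) patterns, a digit between two letters scores that gap. Odd values allow a break, even values forbid one, and when several patterns cover the same gap the highest value wins. A leading or trailing dot in a pattern anchors it to the start or end of the word. Here is a minimal sketch of that matching step. The patterns in the usage note are made up for illustration and are not taken from hyph_pali.dic, and left_min/right_min are assumed defaults for the minimum number of letters kept on each side.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'd1dh' into its letters and inter-letter values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)  # value for the gap before letters[pos]
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Apply Liang-style patterns; odd values allow a break, even values forbid one."""
    padded = f".{word.lower()}."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)
    for letters, values in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                # The highest value seen for a gap wins.
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        # scores[i + 1] is the gap before word[i] (shifted by the leading dot).
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)
```

With the illustrative patterns d1dh (break between d and dh), d2h (never split the digraph dh) and m1m, hyphenate("buddha", ["d1dh", "d2h", "m1m"]) gives bud-dha and hyphenate("dhammo", ["d1dh", "d2h", "m1m"]) gives dham-mo. A real pattern set is much larger and is normally generated from a hyphenated word list rather than written by hand.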

Antonio-Costanzo · November 6, 2020, 10:00am

The question here is how Gabix integrated the LaTeX hyphenator into Hunspell. This may be obvious to experts, but for people like me who have very little know-how in this area, it is puzzling.

bksubhuti · November 7, 2020, 6:17am

If you want a list of words, let me know the format.
I’ll send it to you. There are roughly 950k in total across all the Pāli texts found in TPP M A T.

bksubhuti · November 7, 2020, 6:20am

A hyphenation algo would need a proper sandhi breaker, which doesn’t really exist yet. We hope to make a proper sandhi breakup, based on dictionaries, in the next year or so, breaking up the 950k words. First we start with the DPR algo and then manually correct from there. The file I pointed you to is the actual rough draft. We use it to give us the dictionary lookups.

Antonio-Costanzo · November 7, 2020, 11:30am

Please send me the list as txt. Do you need my e-mail address?

Antonio-Costanzo · November 7, 2020, 11:40am

Do you know Cittānurakkho Bhikkhu? He wrote to me that he wanted to develop hyphenation patterns for Pāḷi compound words, and that for this he would need a very clean, error-free Pāḷi word list. Maybe it would be a good idea to combine our efforts on this. If you agree, I will ask Cittānurakkho Bhikkhu whether he would like to be involved in this task.

sujato · November 7, 2020, 9:18pm

This sounds like a great idea!

bksubhuti · November 9, 2020, 6:39am

You can give him my email. It is bksubhuti at the big G.

bksubhuti · November 9, 2020, 6:47am

You can download directly… I would just modify this file here.
I would probably take the older one…
I think this has been reduced by 100k words.

github.com

bksubhuti/Tipitaka-Pali-Projector/blob/master/tipitaka_projector_data/dictionary/dpr-breakup.js

var dprBreakup = {
'ṭhānūpacārena':'ṭhānu-upacārena (ṭhāna, upacāra)',
'ṭhānāṭṭhānañāṇaṃ':'ṭhāna-aṭṭhānañ-āṇaṃ (ṭhāna, aṭṭhāna, āṇā)',
'ṭhānaparicchedo':'ṭhāna-paricchedo (ṭhāna, pariccheda)',
'ṭhānabhedaṃ':'ṭhāna-bhedaṃ (ṭhāna, bheda)',
'ṭhitānī':'ṭhit-ānī (ṭhiti, ānī)',
'ṭhitesupi':'ṭhitesu-pi (ṭhita, pi)',
'ṭhitatāya':'ṭhitatāya (ṭhitatā)',
'ṭhitapuriso':'ṭhita-puriso (ṭhita, purisa)',
'ṭhapitaupanikkhepato':'ṭhapita-upanikkhepato (ṭhapita, upanikkhepa)',
'ṭhapetabbāti':'ṭhapetabbā-ti (ṭhapeti, ti)',
'ṭhapehīti':'ṭhapehī-ti (ṭhapeti, ti)',
'ṭhapanādivasena':'ṭhapanā-divasena (ṭhapanā, divasa)',
'ṭhapanatthaṃ':'ṭhapana-tthaṃ (ṭhapana, attha, itthaṃ)',
'ṭhapanapaccupaṭṭhānaṃ':'ṭhapana-paccupaṭṭhānaṃ (ṭhapana, paccupaṭṭhāna)',
'ṭhānañhetaṃ':'ṭhānañ-hetaṃ (ṭhāna, hetaṃ)',
'ṭhitāya':'ṭhitāya (ṭhita)',
'ṭhatvāna':'ṭhatvā-na (ṭhatvā, na)',
'ṭhapetuṃ':'ṭhapetuṃ (ṭhapeti)',
'ḍaṃsehi':'ḍaṃsehi (ḍaṃsa)',

This file has been truncated.

bksubhuti · November 9, 2020, 6:48am

This one should have 950k lines

github.com

bksubhuti/Tipitaka-Pali-Projector/blob/0c7a321b53abf2e2a1ee717681a11cadbcf20773/tipitaka_projector_data/dictionary/dpr-breakup.js

var dprBreakup = {
'ṭṭhānaṃ':'ṭṭhānaṃ (ṭṭhānaṃ)',
'ṭhānūpacārena':'ṭhānu-upacārena (ṭhāna, upacāra)',
'ṭhānāṭṭhānañāṇaṃ':'ṭhāna-aṭṭhānañ-āṇaṃ (ṭhāna, aṭṭhāna, aṭṭha, āṇā)',
'ṭhānācāvanassa':'ṭhānā-cāvanassa (ṭhāna, cāvanā)',
'ṭhānuppattikapaṭibhānena':'ṭhānuppattikapaṭibhānena (ṭhānuppattikapaṭibhānena)',
'ṭhānaparicchedo':'ṭhāna-paricchedo (ṭhāna, pariccheda)',
'ṭhānabhedaṃ':'ṭhāna-bhedaṃ (ṭhāna, bheda)',
'ṭhitānī':'ṭhit-ānī (ṭhita, ānī)',
'ṭhitesupi':'ṭhitesu-pi (ṭhita, pi)',
'ṭhitatāya':'ṭhitatāya (ṭhitatā)',
'ṭhitapuriso':'ṭhita-puriso (ṭhita, purisa)',
'ṭhassāmīti':'ṭhassāmīti (ṭhassāmīti)',
'ṭhassāmi':'ṭhassāmi (ṭhassāmi)',
'ṭhapiyati':'ṭhapiyati (ṭhapiyati)',
'ṭhapitaupanikkhepato':'ṭhapita-upanikkhepato (ṭhapita, upanikkhepa)',
'ṭhapetabbāti':'ṭhapetabbā-ti (ṭhapeti, ti)',
'ṭhapehīti':'ṭhapehī-ti (ṭhapeti, ti)',
'ṭhapanādivasena':'ṭhapanā-divasena (ṭhapanā, divasa)',
'ṭhapanatthaṃ':'ṭhapan-atthaṃ (ṭhapana, ṭhapanā, attha)',

This file has been truncated.
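
To turn a file like this into a plain word list, or to reuse the breakups, each entry can be read with a small parser. The sketch below assumes every entry looks like the lines in the excerpts above, that is 'word':'part-part (lemma, lemma)', and simply skips anything else. The function names are hypothetical, and the full file may contain entries this pattern does not cover.

```python
import re

# One entry per line: 'word':'breakup (lemma, lemma)',
ENTRY = re.compile(
    r"^'(?P<word>[^']+)':'(?P<breakup>[^ (]+) \((?P<lemmas>[^)]*)\)',?$"
)


def parse_breakup_line(line):
    """Return word, parts and lemmas for one entry, or None if the line does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    return {
        "word": match["word"],
        "parts": match["breakup"].split("-"),
        "lemmas": [lemma.strip() for lemma in match["lemmas"].split(",")],
    }


def headwords(js_text):
    """Collect the surface words (the keys) from the text of the file."""
    entries = map(parse_breakup_line, js_text.splitlines())
    return [entry["word"] for entry in entries if entry]
```

The list returned by headwords could be written out one word per line as a .txt word list. The parts are compound and sandhi segments, not syllables, so they would be input for a compound-aware hyphenator rather than hyphenation points in themselves.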

Antonio-Costanzo · November 9, 2020, 11:41am

Thank you. I have sent you an e-mail. Hopefully, I worked out the right address.

stu · December 3, 2022, 2:05pm

I’m just looking at using Affinity Publisher (V2) and was wondering if there is a working Pali spell checker and hyphenator for it? Is this project alive @Antonio-Costanzo ? If so, where can I learn more and download the current necessary files?