Pāli spell checker and hyphenator

Dear all,

I’m not sure to what extent this topic has been discussed earlier, so I apologize if I’m reopening a question that has already been settled.

I’m trying to make a Pāli spell checker and hyphenator for Hunspell. I contacted LibreOffice about this, and someone there replied, inviting me to collect Pāli words. I began by collecting words from the first part of the Dīgha Nikāya, from which he/she made a first “draft” of a Pāli spell checker. That draft also works in PagePlus X9 and Affinity Publisher. I then collected more words from tipitaka.org: all the words from the whole Tipiṭaka, Aṭṭhakathā, Ṭīkā and so on. I tried to contact the person at LibreOffice again in order to make a complete Pāli spell checker, but he/she didn’t answer. Anyway, I managed to make a complete “spell checker” just by renaming my .txt file to a .dic file. This badly made spell checker somehow works in PagePlus X9, but it doesn’t work at all in Affinity Publisher. (I have to use an existing language, as X9 doesn’t recognize Pāli (pi).) The discussion at LibreOffice is here: Is it possible to create a Pāli dictionary for Libreoffice? - Ask LibreOffice
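One possible reason a renamed .txt file behaves unreliably is that Hunspell reads the first line of a .dic file as an (approximate) entry count and expects an accompanying .aff file that declares the encoding. Below is a minimal sketch in Python, assuming a UTF-8 word list with one word per line. The function name `build_hunspell_files` and the `pi` file stem are illustrative choices, not part of any existing tool.

```python
import unicodedata
from pathlib import Path


def build_hunspell_files(words, out_dir, lang="pi"):
    """Write <lang>.dic and a minimal <lang>.aff from an iterable of words.

    `lang` is only a file stem chosen for illustration.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Normalize to NFC so that letters such as "ā" are always stored as a
    # single precomposed character, then drop blanks and duplicates.
    cleaned = (unicodedata.normalize("NFC", w.strip()) for w in words)
    unique = sorted({w for w in cleaned if w})

    dic_path = out / f"{lang}.dic"
    aff_path = out / f"{lang}.aff"

    # Hunspell reads the first line of a .dic file as the entry count.
    dic_path.write_text("\n".join([str(len(unique)), *unique]) + "\n",
                        encoding="utf-8")
    # Minimal affix file: declares the encoding and defines no affix rules.
    aff_path.write_text("SET UTF-8\n", encoding="utf-8")
    return dic_path, aff_path
```

To use it, read the word list with `Path("words.txt").read_text(encoding="utf-8").splitlines()` (the file name is a placeholder) and pass the result in. The file stem can be changed to whatever language code the host application accepts.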

Some months later I came across a Pāli hyphenator which Cittānurakkho Bhikkhu made for LaTeX. Cittānurakkho Bhikkhu was kind enough to share his hyphenator with me, and I again contacted the person at LibreOffice in order to create both a complete Pāli spell checker and a hyphenator. Again, no response.

I tried Cittānurakkho bhikkhu’s hyphenator in Affinity Publisher. Somehow it seems to work as you can see from the image below.

Is there anyone who has the skills to make a Hunspell spell checker starting from a .txt word list? I have two versions, Pāli ṃ and Pāli ṁ. As for the hyphenator, Cittānurakkho’s seems to work. Maybe it would just need editing if someone prefers another kind of hyphenation. For example, I’m aware of people who want to split e.g. buddha not as bud-dha but as bu-ddha. Any comments on this last issue would also be appreciated.
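The two versions of the word list differ only in how the niggahīta is written, so one can in principle be derived from the other. Here is a minimal sketch in Python, assuming the only difference is ṃ (U+1E43) versus ṁ (U+1E41) and their capitals. The function name `convert_niggahita` is made up for illustration.

```python
import unicodedata

# Code points: U+1E43 is m with dot below, U+1E41 is m with dot above;
# U+1E42 and U+1E40 are the corresponding capitals.
TO_DOT_ABOVE = str.maketrans({"\u1e43": "\u1e41", "\u1e42": "\u1e40"})
TO_DOT_BELOW = str.maketrans({"\u1e41": "\u1e43", "\u1e40": "\u1e42"})


def convert_niggahita(text, style="above"):
    """Rewrite every niggahita in `text` as dot-above or dot-below."""
    # NFC first, so "m" followed by a combining dot is also caught.
    text = unicodedata.normalize("NFC", text)
    table = TO_DOT_ABOVE if style == "above" else TO_DOT_BELOW
    return text.translate(table)
```

Running every line of one word list through this function gives the other variant, so only one list needs to be maintained by hand.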

Thank you.

Hi Antonio!

I hope you get some help; it would be great to have a proper hyphenator for Hunspell.

As you note, the example does indeed appear to hyphenate correctly.

One option that sometimes works is to define the language as Sanskrit. Have you tried that? There may already be what you need there.

As for buddha, no: the former (bud-dha) is correct. The syllables are correctly broken at bud/dha, and the hyphenator should reflect that, not what people may or may not want.

Dear Bhante Sujato,

Thank you for your reply. I have some texts that contain both Sanskrit and Pāli words. Affinity Publisher recognizes Sanskrit but not Pāli, so I have had to use another language in order to use the Pāli spell checker and hyphenator. I have put my .dic files in the folder for fi_FI (I don’t use Finnish) so that I can use them. If I try to make a pi (= Pāli) folder, both Affinity and PagePlus display a message like “unknown local settings”. Maybe I should raise this with the developers.

Anyway, I hope someone can make a .dic file using my Pāli .txt word list(s).

Thanks to LibreOffice, we now have a Pāli spell checker and hyphenator based on CST 4.0. You can download the Hunspell dictionaries here:

There are two versions, Pāli ṃ and Pāli ṁ.

If anyone is unable to extract the Hunspell dictionary from the .oxt file, the .dic files (Pāli ṃ only) can be downloaded directly from here:

https://mega.nz/folder/tLRyUJJL#nNbDBMqtXZQsGwxwZNH6TQ
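For those who would rather extract the files themselves: an .oxt extension is an ordinary ZIP archive, so the dictionary files can be pulled out with standard tools. Below is a minimal sketch in Python, assuming the archive holds its Hunspell files with .dic and .aff extensions somewhere inside. The function name `extract_dictionaries` is illustrative only.

```python
import zipfile
from pathlib import Path


def extract_dictionaries(oxt_path, out_dir):
    """Copy every .dic and .aff file found in an .oxt archive to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(oxt_path) as archive:
        for info in archive.infolist():
            name = Path(info.filename).name
            if info.is_dir() or not name.lower().endswith((".dic", ".aff")):
                continue
            # Flatten any internal folder structure into out_dir.
            target = out / name
            target.write_bytes(archive.read(info))
            extracted.append(target)
    return extracted
```

The hyphenation patterns are usually shipped as a .dic file too, so they are picked up by the same filter.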

Thanks so much!

I’m looking at the source for hyph_pali.dic and I’m wondering how to interpret it.

It seems that the numerals define hyphenation points (or lack thereof) with a strength? Is that right?

Do you know of any explanation for how the pattern works?
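For readers with the same question: in TeX-style (Liang) hyphenation patterns, a digit between two letters is a weight for that gap. Every pattern that matches somewhere in the word contributes its digits, the highest digit at each gap wins, and an odd result allows a break while an even one forbids it. Here is a minimal sketch of that rule in Python. The patterns in the usage note are toy examples made up for illustration and are not taken from hyph_pali.dic, and the function names are hypothetical.

```python
def parse_pattern(pattern):
    """Split e.g. 'd1d' into its letters 'dd' and gap weights [0, 1, 0]."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Insert '-' at every gap whose winning weight is odd."""
    parsed = dict(parse_pattern(p) for p in patterns)
    padded = f".{word.lower()}."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)  # scores[i] = gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = parsed.get(padded[start:end])
            if weights:
                for k, w in enumerate(weights):
                    scores[start + k] = max(scores[start + k], w)
    pieces, last = [], 0
    # The gap before word[j] is scores[j + 1] because of the leading dot.
    for j in range(left_min, len(word) - right_min + 1):
        if scores[j + 1] % 2 == 1:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)
```

With the toy patterns `["d1d", "d2h"]`, `hyphenate("buddha", ...)` gives `bud-dha`: the 1 allows a break between the two d’s and the 2 forbids one between d and h. Swapping in `["u1d", "d2d", "d2h"]` gives `bu-ddha` instead, which shows that the preferred break is decided entirely by the pattern file.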

I was thinking about doing this myself. Let me know if you need words. There are roughly 950k words. I can send you… I might already have a GitHub repo for this as well.

Our algorithm works on this word list. It is in the Tipitaka Pali Projector repo.
I think this is it… go with the first one created…

Unfortunately, I have no idea how a hyphenator works. All I did was include in my sources the hyphenator made by Cittānurakkho Bhikkhu for LaTeX. Then I sent it to Gabix, the person at LibreOffice who helped me make the spell checker. I don’t know whether Gabix had to edit Cittānurakkho’s hyphenator or used it as it was in the Hunspell version. Maybe you could send Cittānurakkho an e-mail.

It would be great to add more words, as the current spell checker contains only word forms actually found in the canon. As far as I remember, I didn’t include words from Aṭṭhakathayo or Ṭīkayo. So stem forms such as dhamma or buddha are not included. You will find only the inflected forms actually used in the canon, such as buddho, buddhā, buddhāna, dhammā, dhammo and so on. Maybe you could join our discussion here


I think Gabix could help us in enlarging our spell checker.

The TeX hyphenation algorithm is very sophisticated. One of Donald Knuth’s students wrote a PhD thesis on it…
See, for example:

[LaTeX is a very comprehensive set of TeX macros: LaTeX - Wikipedia. Hardly anyone uses “Plain TeX”, because if you do you have to write quite a lot of layout code, and it’s a lot easier to just use the LaTeX packages. However, the really clever stuff that gives us beautiful typesetting is in the underlying TeX code. Knuth, a Stanford computer scientist, took a break from writing books about algorithms in the 70s to write the tools he needed to typeset the books properly.]

Here, the question is how Gabix integrated the LaTeX hyphenator into Hunspell. This may be obvious to experts, but for people like me who have very little know-how in this area, it is puzzling.

If you want a list of words, let me know the format and I’ll send it to you.
There are roughly 950k in total across all the Pāli texts found in TPP M A T.

A hyphenation algorithm would need a proper sandhi breaker, which does not really exist yet. We hope to produce a proper sandhi breakup of the 950k words, based on dictionaries, in the next year or so. First we start with the DPR algorithm and then manually correct from there. The file I pointed you to is the actual rough draft. We use it to give us the dictionary lookups.
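To illustrate why a plain dictionary lookup is not enough here, below is a toy sketch in Python of dictionary-based segmentation that handles simple concatenation only. It is not the DPR algorithm and does no sandhi analysis. The function name `split_compound` and the tiny lexicon in the usage note are assumptions made for illustration.

```python
from functools import lru_cache


def split_compound(word, lexicon, min_len=2):
    """Split `word` into the fewest lexicon entries, or return None.

    Only plain concatenation is handled: no sound changes at junctions.
    """
    known = frozenset(lexicon)

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + min_len, len(word) + 1):
            part = word[start:end]
            if part in known:
                rest = best(end)
                if rest is not None:
                    candidates.append((part,) + rest)
        # Prefer the split with the fewest (hence longest) parts.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None
```

With the lexicon `{"dhamma", "cakka", "pavattana"}`, `split_compound("dhammacakka", ...)` returns `["dhamma", "cakka"]`, but `split_compound("dhammacakkappavattana", ...)` returns `None` because of the doubled p at the junction. Cases like that are where a real sandhi breaker and manual correction come in.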

Please send me the list as txt. Do you need my e-mail address?

Do you know Cittānurakkho Bhikkhu? He wrote to me that he wanted to develop hyphenation patterns for Pāḷi compound words, and that for this he thought he would need a very clean, error-free Pāḷi word list. Maybe it would be a good idea to combine efforts on this. If you agree, I will ask Cittānurakkho Bhikkhu whether he would like to be involved in this task.

This sounds like a great idea!

You can give him my email. It is bksubhuti at the big G.

You can download directly… I would just modify this file here.
I would probably take the older one…
I think this has been reduced by 100k words.

This one should have 950k lines

Thank you. I have sent you an email. Hopefully I understood the address correctly.

I’m just looking at using Affinity Publisher (V2) and was wondering if there is a working Pali spell checker and hyphenator for it? Is this project alive @Antonio-Costanzo ? If so, where can I learn more and download the current necessary files?
