Lack of SC functionality for proper Devanāgarī ligatures

Coemgenu · October 10, 2018, 6:55pm

It seems that there is a problem with the program that converts romanized Pāli text into Devanāgarī.

Let us pick a word like sāvatthinidānaṃ, purely at random.

This should appear as सावत्थिनिदानं, made up of the syllables

सा व त्थि नि दा नं

Instead it appears as the syllable sequence

सा व त् थि नि दा नं

Notice the extra त् letter? This letter is redundant, because the program is not correctly generating the त्थि ligature.

Similarly, a word like cakkhusamphassajāya appears as this lengthy sequence

च क् खु स म् फ स् स जा य
ca k khu sa m pha s sa jā ya

Instead of this shorter sequence

च क्खु स म्फ स्स जा य
ca kkhu sa mpha ssa jā ya

Most of the issues have to do with consonant clusters and gemination.
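For anyone who wants to dig into this, here is a rough Python sketch for inspecting what the converter actually emits. It rests on standard Unicode behaviour: a Devanāgarī conjunct is encoded as consonant + virama (U+094D) + consonant, and the font is expected to draw the ligature. The helper names `describe` and `breaks_after_virama` are made up for illustration, and what the SuttaCentral widget really inserts between the letters is not known from this thread, so treat this as a diagnostic aid rather than a diagnosis.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

def breaks_after_virama(text):
    """Return indices of viramas NOT directly followed by a consonant.

    Devanagari consonants occupy U+0915..U+0939. A space, a zero-width
    character, or the end of the string after a virama is flagged,
    since any of these stops a conjunct from forming.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch == VIRAMA:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if not ("\u0915" <= nxt <= "\u0939"):
                hits.append(i)
    return hits

# The cluster "tthi": ta + virama + tha + vowel sign i.
cluster = "\u0924" + VIRAMA + "\u0925" + "\u093F"

for code, name in describe(cluster):
    print(code, name)

print(breaks_after_virama(cluster))
```

Pasting the converter's output for a word like sāvatthinidānaṃ into `describe` should show whether anything unexpected sits between the virama and the following consonant.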

I understand that Devanāgarī compatibility for an older language not traditionally written in Devanāgarī might not be the most pressing item on SuttaCentral’s agenda, but I figured I would raise this in case no one had noticed. It is likely someone already has.

sujato · October 10, 2018, 11:20pm

Thanks for the heads up, I wasn’t aware of this. We would like to fix it, but at the moment we don’t have the resources. If anyone is able to help out with this, please let us know. (The script-changing widget is in javascript.)

Snowbird · October 11, 2018, 6:52pm

FWIW, Sinhala is also not rendering properly.

බ්‍ර හ්ම
bra hma

is rendering

බ් ර හ්ම
b ra hma

This is a classic rendering problem. While, like in the Devanagari example above, the meaning is conveyed, it is still considered incorrect by any Sinhala speaker.

As I understand it, it has to do with devilish zero-width joiners and the like. In a way, it gives me a little comfort to know that it is also a problem with Devanagari and not just the little cousin Sinhala.

Gabriel_L · October 11, 2018, 10:50pm

Maybe it’s worth checking all abugidas?

Snowbird · October 11, 2018, 10:58pm

Abugida is one of my most favoritest words.

The problem is not just with it being an abugida. I don’t know if there is a more technical term than “abugida from hell,” but that’s what these beautiful alphabets are.

In Sinhala, the problem only arises when there is a dropped vowel before an R or a Y.

Unicode has a way to represent it, but many systems don’t render it correctly. I believe Apple products are notorious for incorrect rendering of Sinhala, which is ironic since Macs were the very first computers to support Sinhala.

sujato · October 11, 2018, 11:02pm

Thanks again. Clearly we need to check all these in detail. We’d also like to implement script changing for unsupported scripts, such as Cyrillic, other Indian scripts, even Brahmi. If there are any linguist/javascript programmers out there, get in touch!

Snowbird · October 11, 2018, 11:27pm

So, I’ll just throw it out there in case anyone is curious what might be involved…

The word බ්‍රහ්ම, properly rendered, is made up of the following Unicode parts:

\u0DB6\u0DCA\u200D\u0DBB\u0DC4\u0DCA\u0DB8

Individually, the visible parts look like this:
බ ් ‍ ර හ ් ම

Except that there should be something called a zero-width joiner (ZWJ) after the first two parts. That’s what the \u200D is. Without the zero-width joiner, it looks like this:
බ්රහ්ම

Same problem happens with Ayya, අය්‍ය. Without the ZWJ, it looks like this: අය්ය.

In Sinhala, this is only a problem when a vowel is dropped before an R or a Y.
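As a rough sketch of what a fix could look like, here is the rule above written out in Python: insert a ZWJ after the al-lakuna (U+0DCA, the sign that drops the vowel) whenever the next letter is ra (U+0DBB) or ya (U+0DBA). The function name `add_zwj` is made up for illustration, and the assumption that every such cluster in Pāli text should get the joiner comes only from the description in this post. The real widget is in javascript, so this only shows the logic.

```python
AL_LAKUNA = "\u0DCA"  # SINHALA SIGN AL-LAKUNA (drops the inherent vowel)
ZWJ = "\u200D"        # ZERO WIDTH JOINER
RA = "\u0DBB"         # SINHALA LETTER RAYANNA
YA = "\u0DBA"         # SINHALA LETTER YAYANNA

def add_zwj(text):
    """Insert a ZWJ after each al-lakuna that directly precedes ra or ya.

    Text that already has the joiner is left unchanged, because the
    character after the al-lakuna is then the ZWJ, not ra or ya.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1:i + 2]  # empty string at the end of the text
        if ch == AL_LAKUNA and nxt in (RA, YA):
            out.append(ZWJ)
    return "".join(out)

# "brahma" and "ayya" as emitted without the joiner.
brahma_plain = "\u0DB6\u0DCA\u0DBB\u0DC4\u0DCA\u0DB8"
ayya_plain = "\u0D85\u0DBA\u0DCA\u0DBA"

print(add_zwj(brahma_plain))
print(add_zwj(ayya_plain))
```

Note that the second al-lakuna in brahma (before ma) is left alone, matching the code point sequence listed above.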

sujato · October 12, 2018, 9:47am

It’s really important to get these details right! There are approximately 1 quintilliazion javascript programmers in the world, hopefully one of them will be able to fix this for us!