SuttaCentral i18n development

blake · November 30, 2015, 4:39pm

Now here are some thoughts/planning/analysis on i18n.

## Text Translation

This is translating Pali into other languages, and potentially also translating from other root languages.

In Pootle (and generally the whole gettext framework) a project or .po file has a clear, definite and non-optional source language. At the moment we use Pootle only for translating Pali, but hypothetically, if we wanted to translate Sanskrit, we would make a project with Sanskrit as the source language.

## Descriptions Translation

Unlike the texts, descriptions have no existing source. Every description will be bound to a uid, for example dn, dn1 and so on. These uids would be ideal for msgctxt, but the PO format is built around translating strings from a source language into other languages. Hence we would generate .pot files along these lines:

msgctxt "an3.1"
msgid "Fools are dangerous, but the wise are safe."

This would be a special project “descriptions” with the language “en”.

Using msgctxt as a unique identifier or address is unconventional. In gettext, it is the combination of msgctxt and msgid that should be unique, so a msgctxt might be something like “noun”, “verb” or “ui”, and the same msgctxt can be reused on different strings. But I don’t see any reason why we can’t use msgctxt as a unique identifier. Just note that the gettext tools offer no special support for this (i.e. they don’t assume that msgctxt uniquely identifies a string).
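Since the gettext tools won't enforce this for us, we would probably want our own check that no uid is used as msgctxt on more than one entry. A minimal sketch, assuming entries are available as (msgctxt, msgid) pairs; the function name and the second and third sample entries are made up for illustration:

```python
from collections import Counter


def duplicate_contexts(entries):
    """Return the msgctxt values that appear on more than one entry.

    `entries` is an iterable of (msgctxt, msgid) pairs. gettext itself only
    requires the *pair* to be unique, so this stricter check is ours.
    """
    counts = Counter(msgctxt for msgctxt, _msgid in entries)
    return sorted(ctx for ctx, n in counts.items() if n > 1)


entries = [
    ("an3.1", "Fools are dangerous, but the wise are safe."),
    ("dn1", "Placeholder description."),                # hypothetical entry
    ("an3.1", "A second string under the same uid."),   # hypothetical clash
]

clashes = duplicate_contexts(entries)
```

Here `clashes` lists the uids that would need fixing before the file goes to translators.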

Anyway, the point is that we would write the descriptions, maybe in CSV or JSON, in the general form uid: description. These would then be converted to .pot files and translated from English into other languages using Pootle. This way the meaning of each description stays consistent across languages, since every translation derives from the English description. (In principle you could do it another way: write a completely independent description of the sutta for each language.)
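To make the conversion step concrete, here is a rough stdlib-only sketch of turning a uid: description JSON object into .pot text. The function names and the JSON layout are assumptions, not a settled format, and in practice Babel's catalog and PO-writing support would likely replace the hand-rolled formatting:

```python
import json


def po_quote(text):
    """Quote a string for a PO file, escaping backslash, quote and newline."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return '"' + escaped + '"'


def descriptions_to_pot(descriptions):
    """Build .pot text from a {uid: description} mapping.

    Each uid becomes the msgctxt and the English description the msgid;
    msgstr is left empty for translators to fill in.
    """
    lines = [
        'msgid ""',
        'msgstr ""',
        '"Content-Type: text/plain; charset=UTF-8\\n"',
        '',
    ]
    for uid, text in sorted(descriptions.items()):
        lines += [
            'msgctxt ' + po_quote(uid),
            'msgid ' + po_quote(text),
            'msgstr ""',
            '',
        ]
    return "\n".join(lines)


raw = '{"an3.1": "Fools are dangerous, but the wise are safe."}'
pot_text = descriptions_to_pot(json.loads(raw))
```

The resulting `pot_text` could be written to something like a descriptions.pot file and uploaded to the “descriptions” project.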

## Template Localization

With templates it seems simplest to use gettext in the traditional way, where only the msgid is translated. It is worth noting that the Jinja2 i18n extension does not support context; a few issues have been raised about the lack of this feature, as recently as 10 days ago.

I don’t think this will be a big deal though: for descriptions we won’t use the i18n extension, and will instead use some custom code that relies mainly on the msgctxt/uid correlation.
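The custom code could be as simple as a lookup keyed on uid with a fallback to the English source. This is only a sketch: the nested-dict layout, the `describe` name and the “xx” language with its placeholder text are all assumptions for illustration, and the real data would be loaded from the translated .po files:

```python
def describe(uid, lang, catalogs, source_lang="en"):
    """Return the description for `uid` in `lang`.

    Falls back to the source-language description when the uid has no
    translation yet (a missing entry or an empty msgstr), and returns
    None when the uid has no description at all.
    """
    translated = catalogs.get(lang, {}).get(uid)
    if translated:
        return translated
    return catalogs.get(source_lang, {}).get(uid)


catalogs = {
    "en": {"an3.1": "Fools are dangerous, but the wise are safe."},
    "de": {"an3.1": ""},  # entry exists but is not translated yet
    "xx": {"an3.1": "(hypothetical translated description)"},
}

fallback = describe("an3.1", "de", catalogs)
translated = describe("an3.1", "xx", catalogs)
```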

Template localization covers strings that appear in the templates, for example “Translation of {sutta_name} by {translator_name}.”, and we shouldn’t need msgctxt for that.

It is important to be clear about which strings get translated and which don’t. As a general rule, parts of the page that are not in English (i.e. because they are Pali) probably don’t get translated, the logic being that if they should be translated in other languages, they should be translated into English too. There might be exceptions, though, particularly when translating into entirely different scripts; translator comments could be helpful here. Take the word “Sutta”: we could leave a comment `Usually okay to keep as "Sutta"`, but for Thai, Sinhala and other languages with a true native rendering of “sutta”, the native rendering should probably be used. In other words, we’d prefer no translation to an imprecise one, but a precise translation should be used if the language has one.
We could also consider marking up Pali and automatically transliterating it into scripts that support it. This obviously falls completely outside the scope of normal i18n tools.

The backend Python library to use for general gettext functionality is Babel (it is already in our requirements).