SuttaCentral i18n development

sujato · November 19, 2015, 7:03am

Continuing the discussion from SuttaCentral localization (suggestion):

I have implemented a first pass at language pages. These live at hitherto unused URLs:

For example, all English translations:
SuttaCentral

When there are many translations, they are broken into divisions:
SuttaCentral

In the case where there is a so-called “division length text”, it continues to work as it presently does: it takes you to the text and doesn’t attempt to generate a division page:
SuttaCentral

It is also possible to put custom content on a language page, as I have here:
SuttaCentral
(Note: This is all from SuttaCentral - SASANA.PL - with one exception, I google translated “Early Buddhist texts, translations, and parallels”!)

Under the hood, making a custom page involves creating a template at `templates/language/<isocode>.html`.
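A minimal sketch of how a lookup like this could fall back to a generic page when nobody has written a custom template for a language. The `language/generic.html` name and both helper functions are assumptions for illustration, not the actual SuttaCentral code. With Jinja2, a candidate list like this could be handed to `Environment.select_template`, which picks the first name that exists.

```python
def language_template_candidates(isocode):
    """Return template names to try, most specific first."""
    return [
        f"language/{isocode}.html",  # custom page, if one has been written
        "language/generic.html",     # hypothetical generic fallback
    ]


def pick_language_template(isocode, available):
    """Pick the first candidate present in `available` (a set of names)."""
    for name in language_template_candidates(isocode):
        if name in available:
            return name
    raise LookupError(f"no language template for {isocode!r}")


# Hypothetical set of templates on disk: only Polish has a custom page.
available = {"language/pl.html", "language/generic.html"}
print(pick_language_template("pl", available))
print(pick_language_template("en", available))
```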

Note that these pages are presently active on suttacentral.net. (I’d considered making them active only on staging, but I don’t see them doing any serious harm.) At present there are no links to them, but if you manually create the URL they will work.

sujato · November 19, 2015, 7:52am

Thanks @blake. I’ve moved this thread here so we can have a dedicated discussion page for this. We have discussed many of these issues earlier, and IIRC I’ve sent a mockup previously, but I’ll repeat it here to make sure we’re on the same page, and anyway my own ideas are evolving.

Okay, this is a start.

How is “many” calculated?

As far as the layout for each sutta goes, I’m after something more like this (allowing for the limits of Discourse markdown!)

DN 1. The All-embracing Net of Views
Brahmajāla Sutta. Translation by Sujato. Alternative translations by Bodhi, Thanissaro, and Rhys Davids.
How to go fishing for all kinds of wrong views, and how mangoes may be plucked. Or not.

Explanation:

The top line gives the SC ID and translated title. This should be the title as found in the “default” translations. The entire line should also link to the same translation.

Normally, for the 4 nikayas at least, the default translation in English should be mine. But there should be a mechanism for people to select a “preferred translator”. If they’d rather read Ven Bodhi’s translations, this becomes the default for them, unless of course there is none, in which case it falls back to the usual default. Probably, like the language selector, the preferred translator is simply remembered from what people select. This is not ideal, though; an explicit chooser might be better. Anyway, all the translations are presented up front so it doesn’t matter too much. Emphasizing one translation is mainly just for simplicity.

The second line gives the Pali title, linked to the Pali text. Then the list of links to different translations, and finally a link to the relevant “details” page. This is, I think, a nice and simple way to make available the more powerful scholarly apparatus of SC, without confusing less experienced users.

Finally, the last line gives the description. We should strive to make hand-crafted descriptions available for all significant texts. Where they are not available, we might perhaps pull them from the text automatically, but the problem is that so many suttas start the same way …

The three lines need to have a nice clear visual distinction, here I use size, but color may also be good.

sirinath · November 19, 2015, 8:46am

Another thing is the raw Pali texts. This should be available at least in the following scripts:

Singhala
Roman Script
Burmese
Thai
Devanagari

blake · November 19, 2015, 10:01am

There is a script change functionality, but it might be a little bit hidden: you pull down the Sidebar, and under the Controls tab there is a row of buttons which let you choose any of the scripts you mentioned. It wouldn’t hurt to make this functionality more obvious.

Maybe we could also consider making hardlinks to other scripts, perhaps with a url like this:
/pi-sinh/dn2

A primary advantage of hard urls is that they can be indexed by search engines.
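As a rough sketch, a URL of that shape could be split into language, script and text uid like this. The `parse_text_url` helper is hypothetical, and treating a URL without a script suffix as "no script requested" is an assumption.

```python
def parse_text_url(path):
    """Split '/pi-sinh/dn2' into (language, script, uid).

    `script` is None when the URL names no script, e.g. '/pi/dn2'.
    """
    lang_part, uid = path.strip("/").split("/", 1)
    lang, _, script = lang_part.partition("-")
    return lang, script or None, uid


print(parse_text_url("/pi-sinh/dn2"))
print(parse_text_url("/pi/dn2"))
```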

sirinath · November 19, 2015, 10:05am

If some functionality is hidden or not apparent to the majority of users, then the effort that went into developing it is wasted. Since the effort has already been put in, we have to think about the best way to expose such a feature to users so they can use it to improve their UX.

sujato · November 19, 2015, 10:06am

Pali texts are already available in these scripts. Also, please keep this thread for issues relating to the internationalization of SC (i18n). Other requests can go on a new thread, thanks.

blake · November 19, 2015, 10:08am

In this case there was little development effort because I simply reused the code from Digital Pali Reader. But certainly the feature is wasted if no-one knows about it.

With respect, I think this is very related to internationalization. It is only logical that an internationalized version for Sinhalese will link to the pali in Sinhala script.

sujato · November 19, 2015, 10:14am

Okay, sure.

sirinath · November 19, 2015, 10:16am

I managed to get there. Though it is mentioned under Controls, it took me a few seconds to recognise this. Maybe you can have Controls, Navigation, Meta as sections stacked vertically rather than in tabs. You can have the last item of each section as “…”, which expands the section if it does not fit the space.

sujato · November 19, 2015, 10:19am

The sidebar is described in the introduction on the Home page, and a popup appears when you first use the site pointing to it and summarizing what it does. This is plenty of exposure.

And, in addition, we created this forum, one of the functions of which is to do exactly what we are doing here, to discuss the features and uses of SC so that users can find things out.

We cannot hold our users’ hands. Any sizeable piece of software has lots of hidden features. Many, many people will quite literally use a piece of software like Word for decades and never learn how to use a basic feature like headings. If you try to expose all features you get a usability nightmare.

sirinath · November 19, 2015, 10:22am

I understand, but some consideration should be given to user experience design.

blake · November 19, 2015, 6:10pm

“Many” is simply defined as 50, a number chosen for being nice and round. If there are fewer than 50 translations they are shown inline.
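Roughly, the rule could be sketched like this. The function name and the `(division_uid, sutta_uid)` input shape are assumptions for illustration, not the real implementation.

```python
from collections import defaultdict

MANY = 50  # the threshold described above


def layout_language_page(translations):
    """translations: list of (division_uid, sutta_uid) pairs.

    Returns ("inline", [sutta_uid, ...]) when there are fewer than MANY
    translations, otherwise ("divisions", {division_uid: [sutta_uid, ...]}).
    """
    if len(translations) < MANY:
        return "inline", [sutta for _, sutta in translations]
    by_division = defaultdict(list)
    for division, sutta in translations:
        by_division[division].append(sutta)
    return "divisions", dict(by_division)


few = [("dn", f"dn{i}") for i in range(1, 4)]
many = [("an3", f"an3.{i}") for i in range(1, 61)]
print(layout_language_page(few)[0])
print(layout_language_page(many)[0])
```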

By the way, I should emphasize here that one of my ideas is having a usable generic fallback for the case where no localization data is available. We have a large and constantly growing number of languages hosted, and since the effort to upload translations is less than the effort to localize for a language, the localized languages will always lag behind. So it’s nice to have something for the languages we don’t have localization data for, and I think that even what I’ve implemented so far helps a ton in making translations visible in places you wouldn’t have thought to look.

[quote]
Normally, for the 4 nikayas at least, the default translation in English should be mine. But there should be a mechanism for people to select a “preferred translator”. [/quote]

Hmmm. As a starting point, let’s go with hard URLs for translations:
/en/dn1/sujato
/en/dn1/bodhi

It is important that the browser is always delivered the same data for a particular URL, regardless of the choices the user has made. That is, when a user goes to /en/dn the server gives the browser the same basic page. This isn’t the only way to do things, but it sure helps with search engines, and it helps with consistency when e-mailing links and such, so that you can roughly predict what a user will see when they go to that URL.

When I think about it, the idea of a preferred translator is tricky, as if the title is from the preferred translator, that substantially changes the content of the page. One solution might be to actually use a query string parameter, so:
/en/dn - normal
/en/dn?prefer=bodhi prefer Bhikkhu Bodhi translations

I’m not a big fan of query string parameters but I think they can be acceptable for handling what is basically a variant of a page, provided that the default/standard URL is query-string free.

Formally, query strings are a way of telling the server to send customized data.
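A small sketch of reading such a parameter on the server side, using only the standard library. The `preferred_translator` helper is hypothetical, and using "sujato" as the fallback is an assumption for this example.

```python
from urllib.parse import parse_qs, urlsplit

DEFAULT_TRANSLATOR = "sujato"  # assumed site default for this sketch


def preferred_translator(url):
    """Return the translator named by ?prefer=..., else the default."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("prefer")
    return values[0] if values else DEFAULT_TRANSLATOR


print(preferred_translator("/en/dn"))
print(preferred_translator("/en/dn?prefer=bodhi"))
```

The plain URL stays query-string free and always yields the standard page; only the variant carries the parameter.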

Another possibility is to do nearly everything client-side, as discussed in the Skype session. In this case, the server sends all the data, and the client can easily render that data in different ways. URLs are generally less important to this model as they merely act as markers. This model is readily converted to an app because the server takes a back seat as a data delivery service, and if the raw data is bundled into the app it’s good to go. Isomorphic javascript can render a stock page for all URLs for search engines and luddites who don’t use JS.

I don’t like the idea of magically remembering the user’s choice; for one thing, people probably sometimes just want to compare translations. I think a better approach would be to have the default translator as Sujato. Then if someone goes to a Bodhi translation, it can create a drop down asking “Prefer Bhikkhu Bodhi translations?”, with the yes/no/goaway options.

Hand crafted is the only way. I’ve investigated at some length the possibility of auto-extraction of significant passages or terms, and it’s really a no go. And even that’s a far cry from generating an “about” for a longer sutta. Computers are just too dumb at this kind of thing.

sujato · November 20, 2015, 7:16am

Okay, sounds fine.

I’m not sure that that is so. Preparing texts, depending on the source, can be very time consuming, while creating at least a basic localization doesn’t require much. But the point remains: in some cases we will have translated texts without a localized site.

I’m not sure what the problem is here? Do you mean that we pull the title from the preferred translation, so this may differ from the title in the chosen translation? If so, I don’t think this is really a problem. If a “preferred translator” is chosen, we could pull their titles where appropriate.

More generally, I don’t quite understand the issue here. Let me know what I am missing, but here’s how I see it.

We have a list of translations by different authors, say

/en/dn1/sujato
/en/dn1/bodhi

We assign one of these as the default translation, so that

/en/dn1/sujato = /en/dn1

When a user clicks on the title of a sutta, this takes them to this text. We could assign either the plain URL for this, or the “translator” URL. Probably it would be best to use the translator URL, so that the translator’s name is a little more exposed (this being one criticism of our current layout, that we don’t make it clear enough who the translator is).

Our nice, handy system for short URLs still works, since /en/dn1 simply redirects to /en/dn1/sujato

If someone selects /en/dn1/bodhi, we ask them if they want to set Bodhi as preferred translator. If they say no, no change. If they say yes, in future the default text for them is the Bodhi translation (where available), and the titles are pulled from there. However, when they click on a title, they go to /en/dn1/bodhi. That is, the URLs and their texts remain the same, only the navigation changes.
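To check I have the logic straight, here is a rough sketch of the scheme as I understand it. The `AVAILABLE` and `DEFAULT` tables and both function names are made up for illustration; the point is only that the short URL always resolves the same way, while the preference changes which link the navigation offers.

```python
# Hypothetical data: which translators exist for each (language, uid).
AVAILABLE = {
    ("en", "dn1"): ["sujato", "bodhi"],
    ("en", "dn2"): ["sujato"],
}
DEFAULT = {"en": "sujato"}


def canonical_url(lang, uid):
    """Target of the short URL /<lang>/<uid>.

    Ignores user preference, so the URL means the same for everyone.
    """
    return f"/{lang}/{uid}/{DEFAULT[lang]}"


def title_link(lang, uid, preferred=None):
    """Link used in navigation: the preferred translator if they have
    translated this text, otherwise the site default."""
    translators = AVAILABLE[(lang, uid)]
    chosen = preferred if preferred in translators else DEFAULT[lang]
    return f"/{lang}/{uid}/{chosen}"


print(canonical_url("en", "dn1"))
print(title_link("en", "dn1", preferred="bodhi"))
print(title_link("en", "dn2", preferred="bodhi"))
```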

Am I on the right track here?

I agree. For all its sophistication, Google still gets this remarkably wrong. I’ve been using Gmail and Google for years, with many searches and everything, all in English. But as I travel, Google just blithely delivers me the local version of its search results, and it is strangely difficult to force it to just give the same results everywhere.

Sounds good.

Okay, fine. It’s not that big of a deal, in fact. Many of the translated versions have such descriptions already. And there’s no need for them to be complete. If we just have descriptions for DN and MN, this already goes a long way towards helping the user find a way in to the suttas. In AN and SN, the usefulness of descriptions declines rapidly due to the shortness and repetitiveness of the texts.

sujato · November 25, 2015, 7:21am

@blake has pushed the most recent changes to staging:

http://staging.suttacentral.net/en/an3

http://staging.suttacentral.net/en

blake · November 30, 2015, 1:58pm

4 posts were split to a new topic: Text Controls and User Experience

blake · November 30, 2015, 4:39pm

Now here are some thoughts/planning/analysis on i18n.

## Text Translation:

This is translating pali-> other languages, and potentially also for translations of other root languages.

In pootle (and generally the whole gettext framework) a project or po file has a clear, definite and non-optional source language. At the moment we use pootle only for translating pali, but hypothetically speaking if we wanted to translate sanskrit, we would make a sanskrit source language project.

## Descriptions translation

Unlike text, there is no existing source for descriptions. Every description will be bound to a uid, for example dn, dn1 and so on. These uids would be ideal for msgctxt but po itself is built around the concept of translating strings in a source language, to other languages. Hence we would generate .pot files of a form along the lines of:

msgctxt "an3.1"
msgid "Fools are dangerous, but the wise are safe."

This would be a special project “descriptions” with the language “en”.

Using msgctxt as a unique identifier or address is unconventional. The way gettext works, the conjunction of msgctxt and msgid should be unique, so msgctxt could be something like “noun” or “verb” or “ui”, and the same msgctxt can be reused on different strings. But I don’t see any reason why we can’t use msgctxt as a unique identifier - just note that the gettext tools don’t offer any special support in this regard (i.e. they don’t assume that msgctxt is a way to uniquely identify a string).

Anyway point is, we would create the descriptions, maybe in CSV or JSON in the general form uid: description, these would then be converted to .pot files and could be translated from english to other languages using pootle. This would mean that the meaning of descriptions would be consistent across languages, as they are translated from the english description (in principle, you could do it another way: create a completely independent description of the sutta for the other language).
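A minimal sketch of that conversion step, assuming the descriptions arrive as a JSON object of `uid: description`. This is hand-rolled with the standard library purely for illustration and omits the header entry a real .pot file carries; in practice a library such as Babel would more likely write the file. The function names are hypothetical.

```python
import json


def po_quote(text):
    """Escape a string for a PO file (backslash, double quote, newline)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def descriptions_to_pot(descriptions):
    """descriptions: dict mapping uid -> English description.

    Emits one entry per uid, with the uid as msgctxt and the English
    description as msgid, separated by blank lines.
    """
    entries = []
    for uid, description in descriptions.items():
        entries.append(
            f"msgctxt {po_quote(uid)}\n"
            f"msgid {po_quote(description)}\n"
            'msgstr ""\n'
        )
    return "\n".join(entries)


raw = '{"an3.1": "Fools are dangerous, but the wise are safe."}'
print(descriptions_to_pot(json.loads(raw)))
```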

## Template Localization:

With templates it seems simplest to use gettext in the traditional way, where simply the msgid is translated. It is worth noting that the Jinja2 i18n extension does not support context, a few issues have been raised about the lack of this feature - as recently as 10 days ago.

I don’t think this will be a big deal though, as for descriptions we won’t use the i18n extension and instead will use some custom code which relies mainly on the msgctxt/uid correlation.

The template localization will be about strings that appear in the templates, for example `Translation of {sutta_name} by {translator_name}.`, and we shouldn’t need msgctxt for that.

It is important to be clear about which strings get translated and which don’t. As a general rule, parts of the page which are not in English (i.e. because they are Pali) probably don’t get translated, the logic being that if they should be translated in other languages, they should be translated into English too. There might be exceptions here, though, particularly when translating to entirely different scripts; perhaps a comment could be helpful. Here’s an example: for the word “Sutta”, we could leave a comment `Usually okay to keep as "Sutta"`, but for Thai, Sinhala and other such languages with a true native rendering of “sutta” the native rendering should probably be used. In such a case we’d prefer no translation to an imprecise translation, but would use a precise translation if the language has one.
We could also consider marking pali, and automatically transliterating pali into scripts which support it. This is obviously something which falls completely outside the scope of normal i18n tools.

The general backend python library to use for general gettext functionality is Babel (it is already in our requirements).

sujato · November 30, 2015, 8:03pm

Thanks, Blake, this all sounds fine, I only have the following comment for now:

Why not just treat the English as the basic source? If we pull in descriptions from existing sites, they can still be treated as “translations” even if they’re really just independent descriptions of the same thing. It seems this would be more in line with the normal way gettext works, and we can keep the UID in the same place as it is in the translations. Currently we don’t have much in the way of English descriptions, but I don’t see how that matters. You can just add a “translation” to an empty field.