Suttacentral/Bilara Localization Workflow and Data

blake · March 17, 2021, 8:46pm

So in the upgrade, a lot of the existing localization workflow got demolished with the removal of Pootle.

So the old workflow was: Elements → PO files → translation → PO files → JSON files

Everything in between has now been removed.

Is localization data in the API desirable?

At the moment the localized files are just that: files. Another option would be to load the localization data into ArangoDB, enabling its use in queries. A query could then ask something like “is there localization data available for a language”; at the moment that’s basically hardcoded.

The localization data is essentially dynamic: it is retrieved by the client as it is required, so there is no reason why it shouldn’t be served by the API.

A rewrite is probably warranted

At the moment a bare key is used, for example in suttaplex.js:

                ${this.localize('inYourLanguage')}

I would propose fully qualifying the localization keys, so it becomes

                ${this.localize('suttaplex:inYourLanguage')}

This would make it identical with the way the data appears in bilara-data, which is like this:

  "suttaplex:inYourLanguage": "in der von Ihnen gewählten Sprache ",

The localization API would generally be used by asking it for all strings from a namespace (i.e. all “suttaplex:” strings), instead of retrieving a file. This would mean that strings are not bound to a specific template, and it would potentially reduce the number of small files that need to be requested. For example, we could consolidate sc-page-search and sc-filter-menu under a “search” namespace, although consolidation might be done even more aggressively: the entirety of the non-static page strings is only about 25 kb uncompressed, so splitting them into separate files doesn’t accomplish that much. The static page strings, on the other hand, come to about 600 kb. This kind of restructuring would be something to be done over time.
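To make the "ask for a namespace" idea concrete, here is a minimal sketch of the selection logic, assuming the strings for one language sit in a flat map keyed like bilara-data. The function name `stringsForNamespace` and the two "search:" keys are made up for illustration; only the suttaplex entry comes from the example above.

```javascript
// Hypothetical helper: select every string in one namespace from a flat
// map whose keys look like "namespace:key" (the bilara-data style).
function stringsForNamespace(allStrings, namespace) {
  const prefix = `${namespace}:`;
  const result = {};
  for (const [key, value] of Object.entries(allStrings)) {
    if (key.startsWith(prefix)) {
      result[key] = value;
    }
  }
  return result;
}

// Example data. The "search:" keys and values are invented placeholders.
const de = {
  'suttaplex:inYourLanguage': 'in der von Ihnen gewählten Sprache ',
  'search:placeholder': '(placeholder value)',
  'search:filter': '(placeholder value)',
};

// Everything a consolidated "search" namespace would return.
const searchStrings = stringsForNamespace(de, 'search');
```

On the server the same selection would more likely be a database query than a loop over an object, but the request and response shape would be the same: a namespace in, its strings out.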

Static Templates

In the old system, the files in elements/static-templates/ were processed into the files in elements/static/.

Somewhere along the line developers stopped touching the static-templates files. In May 2020 some irreverent individual changed one of the lines in home-page.js to:

          <div class="row preamble">
            <p>
              ${_`The farts of the Buddha has been preserved in a vast ocean of ancient texts...

The fact that this line does not appear at the front of the home page is proof positive that static-templates have been ignored for some time.

Nevertheless, in the original system the static-templates were processed into static files, which end up looking like this:

          <section class="plain" style="min-height: 134px">
            <p>${unsafeHTML(this.localize('2797e2ab111cd1d938bd327b38002092'))}</p>

And the string is then in a JSON file like so:

{
"2797e2ab111cd1d938bd327b38002092": "The wisdom of the Buddha has been preserved in a vast ocean of ancient texts...

Of course the hash is supposed to be a hash of the root string, though it isn’t anymore.

As elegant and beautiful as this system is, it is probably time to revise it.

Alternative

One option would be to eliminate the “static-templates” scheme, just nuke it entirely. Every “static” page is just like any other page. We rename 2797e2ab111cd1d938bd327b38002092 to home-page:1 and are done with it. The “source of truth” for the English strings is then the JSON files.

This means that the strings no longer appear in the element, so the element cannot simply be edited inline. Instead the related JSON file containing the string would need to be edited.

Is this indeed the best solution?

An alternative would be allowing the strings to be inline, so it becomes something like:

          <section class="plain" style="min-height: 134px">
            <p>${_("home-page:1", "The wisdom of the Buddha ...")}</p>

And then in the frontend build process the English strings are stripped out, with the above being rewritten to:

          <section class="plain" style="min-height: 134px">
            <p>${unsafeHTML(this.localize("home-page:1"))}</p>

But developers would never see that, it would just end up in the final bundle.
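A rough sketch of what such a build step might look like, assuming the inline form is always `_("id", "text")` with double-quoted arguments. The function name `stripInlineStrings` and the regex approach are illustrative assumptions; a real build step would probably work on a parsed syntax tree rather than a regex.

```javascript
// Matches _("segment-id", "English text") with double-quoted arguments only.
// Escaped characters inside the text are captured verbatim, not decoded.
const INLINE_CALL = /_\(\s*"([^"]+)"\s*,\s*"((?:[^"\\]|\\.)*)"\s*\)/g;

// Hypothetical build step: remove the inline English text from the template
// source and collect it so it can be written out as the root JSON file.
function stripInlineStrings(source) {
  const strings = {};
  const output = source.replace(INLINE_CALL, (match, id, text) => {
    strings[id] = text;
    return `unsafeHTML(this.localize("${id}"))`;
  });
  return { output, strings };
}

// The inline example from above, held as a plain string.
const input = '<p>${_("home-page:1", "The wisdom of the Buddha ...")}</p>';
const { output, strings } = stripInlineStrings(input);
// output is the rewritten template source; strings maps ids to root text.
```

The collected `strings` object is what would become the English root file, so the inline text stays the thing developers edit while the JSON is generated.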

I feel like that could be a good option for the future… but not needed right now, obviously, since the static-templates aren’t being used anyway.

This basically means the static-templates/ folder should be deleted, and the static/ files should be rewritten by a script to give them readable segment ids, unless it’s desired to make more refined segment ids (the code to do this renaming is mostly already written).
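For illustration only, a sketch of what that kind of renaming could look like. This is not the existing script: the function names and the numbering rule (order of appearance in the page's JSON file) are assumptions.

```javascript
// Hypothetical renaming: number the hash keys of one page in the order
// they appear in its JSON file, giving ids like "home-page:1".
function renameKeys(pageName, hashedStrings) {
  const renamed = {};
  const mapping = {};
  Object.keys(hashedStrings).forEach((hash, index) => {
    const segmentId = `${pageName}:${index + 1}`;
    mapping[hash] = segmentId;
    renamed[segmentId] = hashedStrings[hash];
  });
  return { renamed, mapping };
}

// Rewrite localize('<32 hex chars>') calls in a static file using the mapping.
// Hashes that are not in the mapping are left untouched.
function rewriteTemplate(source, mapping) {
  return source.replace(/localize\('([0-9a-f]{32})'\)/g, (match, hash) =>
    mapping[hash] ? `localize('${mapping[hash]}')` : match);
}

const hashed = {
  '2797e2ab111cd1d938bd327b38002092': 'The wisdom of the Buddha ...',
};
const { renamed, mapping } = renameKeys('home-page', hashed);

const before = "<p>${unsafeHTML(this.localize('2797e2ab111cd1d938bd327b38002092'))}</p>";
const after = rewriteTemplate(before, mapping);
```

The same mapping would have to be applied to every language's JSON file for the page, so that root and translations keep matching ids.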

Synchronization

Because of the lag between localization strings entering the translation process and leaving it, it is possible that they get out of sync… the bad case is a root string changing substantially while the segment id stays the same, and renumbering is by far the worst thing.

While one option is to ignore the problem and hope it does not rear up, there is one obvious solution.

Step 1: Take root strings from suttacentral/client
Step 2: Add to Bilara
Step 3: Translate strings
Step 4: Output both root and translated strings from Bilara
Step 5: When adding to ArangoDB, compare root strings from Step 4 with Step 1 and take note of any inconsistency in the root strings.

With this process any inconsistency can be flagged, and it is possible to be confident in correctness. Note that step 4 is very important, to publish both the root and translated strings simultaneously.
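The comparison in Step 5 could be sketched roughly like this, assuming both sets of root strings are available as flat maps from segment id to string. The function name, the problem labels, and the sample data are all invented for the example.

```javascript
// clientRoots: root strings as they currently are in suttacentral/client (Step 1).
// bilaraRoots: root strings published from Bilara with the translations (Step 4).
// Returns a list of segment ids whose root strings no longer agree.
function findRootInconsistencies(clientRoots, bilaraRoots) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const ids = new Set([...Object.keys(clientRoots), ...Object.keys(bilaraRoots)]);
  const issues = [];
  for (const id of ids) {
    if (!has(bilaraRoots, id)) {
      issues.push({ id, problem: 'missing-in-bilara' });
    } else if (!has(clientRoots, id)) {
      issues.push({ id, problem: 'missing-in-client' });
    } else if (clientRoots[id] !== bilaraRoots[id]) {
      issues.push({ id, problem: 'root-changed' });
    }
  }
  return issues;
}

// Invented sample data: segment 2 was edited in the client after export,
// segment 3 is new in the client, segment 4 was removed from the client.
const clientRoots = { 'home-page:1': 'A', 'home-page:2': 'B (edited)', 'home-page:3': 'C' };
const bilaraRoots = { 'home-page:1': 'A', 'home-page:2': 'B', 'home-page:4': 'D' };

const issues = findRootInconsistencies(clientRoots, bilaraRoots);
```

A translation whose id is flagged as root-changed could then be held back or marked stale when loading into ArangoDB, instead of being shown against a root string it was never written for.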

Timeline

In terms of order to do things, there is the choice of shoehorning the bilara-data back into the current format (with the keys like 2797e2ab111cd1d938bd327b38002092), but it would probably make more sense to get rid of that mess ASAP and make the segment ids in the static pages conform with bilara-data. Once that is done, the translations in bilara-data could be published using the existing file system.

Then the server API could be added, and client code that uses it.

karl_lew · March 18, 2021, 1:19pm

Hash keys only make sense if the priority is to track content change. Voice uses hash keys for sounds because we pay for every sound. The sound for any hash code is paid for once only: hashes are computed from the textual content. The downside of hash strings is their semantic opaqueness. Hash keys convey no understanding–they have no meaning on their own.

Yes.

For I18N, the priority should be on human utility. Semantic keys are far more valuable for i18n.

For the EBTs (and commentaries) we should preserve the segment ids as the absolute highest priority. We should NOT change established segment ids.
For UI text, the I18N key should be readable. Ideally it should be the initial English text, grouped by context. Your suggestion of ${this.localize('suttaplex:inYourLanguage')} is the perfect example.

Vuetify uses a very simple and effective organization for I18N keys. It uses a single JSON file for each language. The JSON has two levels. E.g.,

{
    suttaplex: {
        inYourLanguage: "In deiner Sprache",
    },
}

The marvelous thing about these files is that they can be handed to a translator with a simple request: “Please translate the quoted text”. This is exactly what we do in Voice for I18N. We ask folks to translate en.ts.
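For comparison, flat keys in the bilara-data style ("suttaplex:inYourLanguage") could be regrouped into this two-level shape with a few lines. This is only a sketch; the function name `toNested` is made up, and keys without a ":" are simply skipped here as an arbitrary choice.

```javascript
// Hypothetical conversion from flat "namespace:key" entries to a
// two-level object of the form { namespace: { key: value } }.
function toNested(flat) {
  const nested = {};
  for (const [qualifiedKey, value] of Object.entries(flat)) {
    const separator = qualifiedKey.indexOf(':');
    if (separator === -1) continue; // no namespace: skipped in this sketch
    const namespace = qualifiedKey.slice(0, separator);
    const key = qualifiedKey.slice(separator + 1);
    if (!nested[namespace]) {
      nested[namespace] = {};
    }
    nested[namespace][key] = value;
  }
  return nested;
}

const nestedDe = toNested({
  'suttaplex:inYourLanguage': 'In deiner Sprache',
});
// nestedDe.suttaplex.inYourLanguage now holds the German string.
```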

Furthermore, the client localization API should be as simple as possible. Instead of

${this.localize('suttaplex:inYourLanguage')}

Voice relies on a two-letter function:

$t('suttaplex.inYourLanguage')

We also have a default namespace, so $t('inYourLanguage') is equivalent if suttaplex is the default namespace.

There are two things to notice here:

The compactness of a two-letter function (i.e., $t) in JavaScript provides enormous benefit to front-end engineers.
The namespace delimiter is . instead of :. The . perfectly suits a JavaScript client implementation of $t.
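As a rough illustration of a $t with a default namespace, here is a minimal sketch. It is not how Voice or Vuetify implement it: `createTranslator` is a made-up name, and falling back to the key itself when a string is missing is an assumption.

```javascript
// Hypothetical factory for a $t-style lookup over a two-level messages
// object, with "." as the namespace delimiter and a default namespace.
function createTranslator(messages, defaultNamespace) {
  return function translate(key) {
    const dot = key.indexOf('.');
    const namespace = dot === -1 ? defaultNamespace : key.slice(0, dot);
    const name = dot === -1 ? key : key.slice(dot + 1);
    const group = messages[namespace];
    const found = group && Object.prototype.hasOwnProperty.call(group, name);
    // Assumption: an unknown key falls back to the key itself.
    return found ? group[name] : key;
  };
}

const messages = { suttaplex: { inYourLanguage: 'In deiner Sprache' } };
const $t = createTranslator(messages, 'suttaplex');

const qualified = $t('suttaplex.inYourLanguage');
const viaDefault = $t('inYourLanguage'); // uses the default namespace
```

With "." as the delimiter, a qualified key reads like the property path into the messages object, which is part of why it fits a JavaScript client so naturally.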

Another aspect of the API is the transport unit. Since Voice has a tiny UI, it is convenient to send all I18N information in a single HTTPS request. For SC, sending I18N for a locale might make more sense. Even then, I would combine ALL the I18N contexts for a language in the response to a request for I18N information.

Don’t break existing clients. Voice uses the SC suttaplex REST API and uses bilara-data JSON directly. Voice does not use SC I18N user interface information.

Voice relies on Vuetify, which uses HTML templates. Templates are wonderful, especially since Vue templates include Javascript as well as CSS sections.
We’re also using Nuxt, which converts all of that into static HTML which we never look at since it’s compiled gobbledygook.