Thanks, this is a great question.
We will be marking up the texts in a format called PO, which is a standard for translations widely used in the open source world. Here’s an example of a PO file from my (very slowly progressing) translation of the Therigatha.
```
#: thig_lineseg.html+html.body.section.article.blockquote.div:16
msgid "“Sukhaṃ supāhi therike,"
msgstr "“Sleep softly, little nun,"
```
As you can see, there are three parts to this segment.
The first line is a comment, which in this case contains the HTML code. PO is not built to be interoperable with HTML, so what we do is stick the relevant HTML in a comment. That way we preserve SC’s HTML, and when the translation is finished, we just run the script in reverse and output perfect, SC-compliant HTML5. Since this is always clean, well-structured code, it is then (relatively) trivial to convert it to any other encoding format, such as LaTeX, epub, and so on, if that is desired. But the main thing that translators will do is simply press a button and produce a file that can be directly uploaded to SuttaCentral. No more worrying about converting Word documents and other hellspawn into something useful.
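To make the round trip concrete, here is a minimal sketch in Python of what the reverse step might look like: read the PO file back and pair each translated string with the HTML reference stored in its comment. This is an illustration only, not SC’s actual script; the file name is hypothetical, and a real exporter would use a proper PO library (such as polib) to handle multi-line strings and escaping.

```python
# A minimal sketch of the "reverse" step, assuming the PO layout shown
# above. A real exporter would rebuild full, SC-compliant HTML5 from
# the stored references rather than just printing them.

def parse_po(path):
    """Yield (html_ref, pali, english) triples from a simple PO file."""
    html_ref = pali = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("#:"):
                html_ref = line[2:].strip()
            elif line.startswith('msgid "'):
                pali = line[len('msgid "'):-1]
            elif line.startswith('msgstr "'):
                english = line[len('msgstr "'):-1]
                yield html_ref, pali, english
                html_ref = pali = None

# "thig_lineseg.po" is a hypothetical file name based on the example.
for ref, pali, english in parse_po("thig_lineseg.po"):
    print(f"{ref}\n  {pali}\n  {english}")
```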
The second line, “msgid”, contains the original text in Pali. As you can see, this has been segmented. As this is a verse, we have broken it into lines. The segmenting can, however, be based on any feature of the text. Most probably we will break prose on sentences, perhaps on colons and semicolons as well, and verse on lines. That means the text is broken into natural semantic units (assuming that the source text is well proofed and well punctuated, which for our Pali text is the case). Occasionally this doesn’t work; for example, in some verses you need to translate the lines out of order, or mix the lines. That’s okay, it just means that the text and translation don’t match 100%. No natural language process can ever be completely consistent; the point is that it is useful. On the whole, though, translating this way works, as it corresponds to natural semantic structures in the language. Compare word-by-word translation, which would never work, apart from anything else, because the word order is so different. So the segment is basically the smallest unit that is usable for translation.
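As a rough illustration of that segmenting rule, here is how a simple segmenter might look in Python: verse splits on lines, prose on sentence-final punctuation, with colons and semicolons included. The regex is a deliberate simplification; a real segmenter has to cope with abbreviations, quotation marks, and the like.

```python
import re

def segment(text, verse=False):
    """Break text into rough semantic units: lines for verse,
    sentences (plus colons and semicolons) for prose."""
    if verse:
        return [line.strip() for line in text.splitlines() if line.strip()]
    # Split after ., !, ?, :, or ; when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?:;])\s+", text) if s.strip()]

print(segment("Sukhaṃ supāhi therike,\nkatvā coḷena pārutā.", verse=True))
# -> ['Sukhaṃ supāhi therike,', 'katvā coḷena pārutā.']
```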
Finally, the msgstr line contains the translated English line. Once you have translated a segment, it goes into a memory bank, and the next time that segment occurs in the Pali, a suggestion pops up reminding you how you translated it previously. So you just go, click, and the second occurrence is added. As you can imagine, in repetitive text like the Pali canon this saves a lot of time, and greatly helps ensure consistency. Moreover, the matching is fuzzy: it’ll tell you it’s found, say, a 92% match and an 86% match, and let you choose which to use, and modify it if you wish.
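The fuzzy lookup could work along these lines. Here is a sketch using Python’s standard difflib to score previously translated segments against the current one; the 92% and 86% figures above would be exactly such ratios. The threshold and the data structure are my assumptions, not the tool’s actual internals.

```python
from difflib import SequenceMatcher

memory = {}  # Pali segment -> English translation, filled in as you work

def record(pali, english):
    memory[pali] = english

def suggest(pali, threshold=0.75):
    """Return (score, english) pairs for stored segments similar to
    this one, best match first."""
    scored = ((SequenceMatcher(None, pali, old).ratio(), eng)
              for old, eng in memory.items())
    return sorted(((s, e) for s, e in scored if s >= threshold), reverse=True)

record("Sukhaṃ supāhi therike,", "Sleep softly, little nun,")
print(suggest("Sukhaṁ supāhi therike,"))  # near match: one character differs
```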
When you export the finished translation, it is simple to include an ID tag for each segment. So even though the exported files are quite separate, we can recognize the related segments in each file. These can then be displayed in various ways, depending on how you want to read. You can, for example, show the underlying Pali text when hovering over or clicking a segment (Google Translate does this). Or you could arrange the texts line by line, or in parallel columns. Or all of the above, which in fact we plan to do, as there are many different ways people like to read.
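Here is a sketch of what those aligned exports might look like: each segment keyed by a shared ID, so the separate files can be re-joined for any of those displays. The ID scheme and file layout are invented for illustration.

```python
# Hypothetical aligned exports: one file per language, sharing segment
# IDs. The "thig1.1:1" scheme is invented; the real IDs may differ.
pali = {"thig1.1:1": "“Sukhaṃ supāhi therike,"}
english = {"thig1.1:1": "“Sleep softly, little nun,"}

# Re-join by ID for display, e.g. line by line:
for seg_id, source in pali.items():
    print(seg_id)
    print(" ", source)
    print(" ", english.get(seg_id, "[untranslated]"))
```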
You can also use this for checking alternate translations, in the same or different languages, as long as the segmented markup is the same.
Having such detailed hard-coded correspondences at a segment level then opens up a number of other possibilities, many of which we are only beginning to understand:
- High-level semantic text analysis of sentiment or other characteristics based on the English text, which could be an indicator of systematic characteristics in the Pali that we cannot now discern.
- Finely tuned machine translation from the English into minority languages, using a list of technical terms to ensure accuracy of doctrine, while the English provides natural expression.
- On-the-fly creation of alternate translations, by substituting a list of translated technical terms. Don’t like “illumination” for jhāna? Fine, roll your own (there’s a sketch of this after the list).
- Detailed semantic markup of the Pali text can be applied automatically to other texts; for example, we could distinguish between narrative and teachings, things spoken by different people, and so on. This can then be used to enhance search and other natural language processing.
- Annotations can be added to the text (via an interface on SC), which can function like footnotes in books, except richer and more flexible. They could include text, audio, images, and so on. These can appear, if desired, in the appropriate place in any of the texts or translations. So if I add a note to an English translation, someone reading the same sutta in Italian, for example, can read the note.
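As an example of the “roll your own” idea above, substituting your preferred renderings could be as simple as this sketch; the glossary, the term choices, and the function are made up for illustration.

```python
import re

def retranslate(text, glossary):
    """Swap one rendering of a technical term for another,
    matching whole words only (a simplification: real text also
    needs plurals, capitalization, and so on)."""
    for old, new in glossary.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text

# Don't like "illumination" for jhāna? Roll your own:
my_terms = {"illumination": "absorption"}
print(retranslate("He enters and dwells in the first illumination.", my_terms))
# -> He enters and dwells in the first absorption.
```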
And so on. We are creating a native digital text, one which is not a second-rate copy of a book, but which is designed to take advantage of the possibilities that the digital medium offers, now and in the future.