Ideas about segmented translations?

I have been thinking more about segmented translations since playing around with Pootle and Virtaal. Using a PO editor can be very helpful and it is neat to see old translation work automatically make itself useful again with the translation memory.

Segmented translations are a completely different model from the TEI Lite XML files I have been using, though, and also from ordinary HTML files. The text has to be segmented, and that segmentation has to be retained alongside the parallel source text.

The part that bothers me at this point is the workflow. Segmenting plain Chinese text and converting it into PO, then converting the PO back into plain text and formatting it into XML, is easy enough. The problem is that once the texts are in XML, there is no simple way to pull the segments back out without jumping through hoops. Perhaps the problem is that XML and HTML are simply not conducive to translation work like this.

I have seen some example PO files from Bhante Sujato, and these use PO comments to store HTML tags, for easy conversion in and out of PO. I like that approach, but I think the PO format is a bit awkward and antiquated, mostly just something created for software translation teams. Then I read something about JSON being used instead.
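To make the idea concrete, here is a hypothetical sketch of what such a PO entry might look like, with the HTML markup stored in an extracted comment (`#.`) so a converter can reattach it on the way out. The exact comment convention in those files may differ; this is only illustrative:

```po
#. <p>{}
msgctxt "T0099.806.1"
msgid "如是我聞:"
msgstr "Thus have I heard."
```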

I have been playing around with some possible formats, but I wanted to ask what your experience has been with segmented translations, and if you have any advice on the matter?


Well in any case, currently I am looking at keeping most translation metadata in a separate file from the actual text contents. For the text segments, I’m looking at a simple format like the following:

                "_id": "T0099.806.1",
                "en_html": "<p>{} ",
                "en_raw": "Thus have I heard.",
		"zh-Hans_html": "<p>{}",
		"zh-Hans_raw": "如是我闻:",
                "zh-Hant_html": "<p>{}",
                "zh-Hant_raw": "如是我聞:"
                "_id": "T0099.806.2",
                "en_html": "{} ",
                "en_raw": "At one time",
		"zh-Hans_html": "{}",
		"zh-Hans_raw": "一时,",
                "zh-Hant_html": "{}",
                "zh-Hant_raw": "一時,"

The data is still fairly flat (rather than nested), and allows any number of target languages and output file formats. For example, formatting directives could be added for Markdown, TeX, etc.
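As a concrete illustration of such a directive (the `en_md` key is hypothetical, just following the same naming convention as the HTML keys), a Markdown template could sit alongside the HTML one:

```json
{
    "_id": "T0099.806.1",
    "en_html": "<p>{} ",
    "en_md": "{} ",
    "en_raw": "Thus have I heard."
}
```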

Okay, may I ask some questions?

  1. What advantage do you see in this approach rather than using the PO format?
  2. Why is the HTML repeated for each language? Do you anticipate using different markup for different languages?
  3. Given that the quantity of languages may (hopefully!) blossom, won’t this become clumsy? Currently there are 36 languages on SC. Would it not be better to keep each language in separate files, coordinated via ids?

Oh, good questions.

Mainly robustness and flexibility, but also the representation seems a bit cleaner, and JSON is trivial to read and write. Formatting information can also be encoded by just adding key-value pairs.

Possibly! I’ve gone back and forth on this issue, but I tend to think it may be nice to have that type of flexibility. For example, I could keep the source text in its original received formatting, while carefully formatting and organizing the translated text.

For SuttaCentral, stuffing all those translations into one file would definitely be clumsy and messy. But at least for my site, I like the flexibility of being able to store variants in one file (even if I never actually use it). In the example above I used “zh-Hans” and “zh-Hant”: technically the same language, but two different representations.

I didn’t mean to propose that this format is for SC, though. It is something I have been thinking about for my site, and I thought if SC has started going down this route of storing segmented translations, you guys might have some ideas.

I just started with a PO file, and rewrote the contents in JSON, and then took out some of the artificial restrictions, like tying it to two languages, and then added some conventions for storing markup for different formats.

For editing the files, a script could easily extract the relevant language pairs into a temporary PO file, load the file in a PO editor, and then automatically update the language pairs after the editor has closed.
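A minimal sketch of that round trip, assuming the flat JSON format above (the function names are my own, and only simple single-line PO entries are handled):

```python
import json

def escape(s):
    """Escape a string for a PO msgid/msgstr line."""
    return s.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')

def segments_to_po(segments, src, tgt):
    """Write one PO entry per segment, keyed by segment id via msgctxt."""
    lines = []
    for seg in segments:
        lines.append(f'msgctxt "{escape(seg["_id"])}"')
        lines.append(f'msgid "{escape(seg[src + "_raw"])}"')
        lines.append(f'msgstr "{escape(seg.get(tgt + "_raw", ""))}"')
        lines.append('')
    return '\n'.join(lines)

def po_to_updates(po_text):
    """Read id -> translation pairs back out of a simple PO file.
    The escapes produced by escape() are a subset of JSON string
    escapes, so json.loads can decode the quoted strings."""
    updates, ctxt = {}, None
    for line in po_text.splitlines():
        if line.startswith('msgctxt '):
            ctxt = json.loads(line[len('msgctxt '):])
        elif line.startswith('msgstr ') and ctxt is not None:
            updates[ctxt] = json.loads(line[len('msgstr '):])
    return updates
```

In between the two calls, the script would write the PO text to a temporary file, launch the editor on it, and read it back after the editor exits; the updates dict then overwrites the corresponding `*_raw` values in the JSON.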

Fair enough. There is certainly a clarity to keeping markup in a separate form. I suspect, though, that you’ll encounter the “overlapping hierarchies” problem. For example, what to do about inline references? Perhaps your text doesn’t have them, but we have vol/page references at random places in the text. This is the hard problem that standoff properties aim to solve.

Yes, this is something I have thought of too. As I work with the Pali texts, I am aware that often the paragraphing, for example, is not what I’d want to use in an English translation. The most common convention in English is for each speaker in a conversation to start with a new paragraph. Should I edit the Pali text to do this? I can, but what if I want to change it later? Or what if we translate into another language that uses a different convention?

If I understand the idea of standoff properties, they rely on a readable plain text document in one file, and a set of properties in another. Using segmented translations, the plain text could simply be one output format. If the standoff properties are used to generate other file formats, then the extra formatting in the JSON file could simply be removed. For example:

		"_id": "1",
		"en_raw": "Vajracchedikā Prajñāpāramitā",
		"en_txt": "{}\n\n",
		"zh-Hant_raw": "金剛般若波羅蜜經"

Right. Often I just grab a text from CBETA, pipe it through my reformatting program, and then use that directly. I want to format the English translation nicely, but if I apply those same tags to the source text, that would also require changing the source text (punctuation, etc.).