Pootlization of Chinese texts

Vimala · July 1, 2017, 10:16am

I have attached a draft .po file for the Chinese Dharmaguptaka Bhikkhunī Pātimokkha for Ayya Kathrin @vimalanyani’s project.

These are missing the t-linehead and other markup.
I had a look at the code, and @Blake has deliberately taken out certain HTML classes in sc-html2po.py:
> help="CSS selector for stripping tags but leaving text",
> default=".ms, .msdiv, .t, .t-linehead, .tlinehead, .t-byline, .t-juanname, .juannum, .mirror-right, .cross")

So I guess that there is a specific reason why these classes have been taken out. Most of them come from the Chinese texts, but not all. The tool was actually written with the Chinese texts in mind as well.
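For anyone following along, here is a rough sketch of what "stripping tags but leaving text" could look like. It is not the code from sc-html2po.py: it uses only the Python standard library, matches on class names rather than full CSS selectors, assumes well-formed nesting, and the names `STRIP_CLASSES`, `ClassStripper` and `strip_classes`, as well as the placeholder glyphs, are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical subset of the class list quoted above.
STRIP_CLASSES = {"t", "t-linehead", "t-byline"}
# Elements that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassStripper(HTMLParser):
    """Drop the tags of elements with a listed class, but keep their text."""

    def __init__(self):
        super().__init__()
        self.out = []      # pieces of the output document
        self.dropped = []  # one flag per open element: were its tags dropped?

    def _strip(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return any(c in STRIP_CLASSES for c in classes)

    def handle_starttag(self, tag, attrs):
        drop = self._strip(attrs)
        if tag not in VOID_TAGS:
            self.dropped.append(drop)
        if not drop:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags have no separate end tag to match up later.
        if not self._strip(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.dropped and self.dropped.pop():
            return  # the matching start tag was dropped, so drop this too
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept; re-escape it because the parser unescapes it.
        self.out.append(escape(data, quote=False))


def strip_classes(html_text):
    parser = ClassStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = ('<p><a class="t-linehead" id="l1"></a>甲乙'
          '<span class="t-byline">丙</span><em>丁</em></p>')
stripped = strip_classes(sample)
print(stripped)
```

The empty `t-linehead` anchor disappears entirely, while the `t-byline` span loses its tags but keeps its text, which is why the line heads are absent from the .po file.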

Please let me know if those classes need to be included in the pootle-files or not.

lzh-dg-bi-pm.po.zip (19.2 KB)

sujato · July 1, 2017, 10:16am

Hi Ayya,

There’s a lot more than just the t-linehead missing. All the “t” class IDs are stripped, and other markup too, e.g. t-byline. All this is meaningful and should be preserved. And this is why standoff properties are such a good idea!

There are two issues at stake here. First, we need to preserve the integrity of the text metadata. Second, we need to have the text in a form that is useful. Too much metadata makes it basically unreadable.

One of the problems is that of overlapping hierarchies, i.e. the t line numbers don’t correspond with the segments. Again, this is precisely the problem that standoff properties aim to solve.

It seems to me there are two approaches to solving this.

1. Use standoff properties. Take the metadata (or most of it) out of the file, and keep it in a separate file. The location of the tags is kept in JSON, which notes the segment number and glyph number.
2. Make two separate versions, both of which are segmented the same way. In one, the metadata is kept complete. In the second, the metadata is stripped. The second one is used for the translation. Before publishing, the translated segments are merged back into the first one.
The first solution is more elegant, but I could not say which is easier.
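To make the first option a bit more concrete, here is a minimal sketch of pulling markers out of segmented text into standoff records. Everything in it is an assumption for illustration: the marker shape (empty `<a>` anchors), the record fields (`segment`, `glyph`, `class`, `id`), the segment and line IDs, and the placeholder glyphs. The actual JSON layout has not been decided.

```python
import json
import re

# Assumed marker shape: empty anchors such as <a class="t-linehead" id="l1"></a>.
MARKER = re.compile(r'<a class="(?P<cls>[^"]+)" id="(?P<id>[^"]+)"></a>')


def extract_standoff(segments):
    """Split marked-up segments into plain text plus standoff records."""
    plain = {}
    standoff = []
    for seg_id, marked_up in segments.items():
        parts = []
        offset = 0    # glyph offset within the plain text of this segment
        last_end = 0  # end of the previous marker in the marked-up string
        for m in MARKER.finditer(marked_up):
            chunk = marked_up[last_end:m.start()]
            parts.append(chunk)
            offset += len(chunk)
            standoff.append({
                "segment": seg_id,
                "glyph": offset,
                "class": m.group("cls"),
                "id": m.group("id"),
            })
            last_end = m.end()
        parts.append(marked_up[last_end:])
        plain[seg_id] = "".join(parts)
    return plain, standoff


# Placeholder glyphs and IDs, not real text.
segments = {
    "seg1": '<a class="t-linehead" id="l1"></a>甲乙丙'
            '<a class="t-linehead" id="l2"></a>丁戊',
}
plain, standoff = extract_standoff(segments)
print(plain["seg1"])
print(json.dumps(standoff, ensure_ascii=False, indent=2))
```

The plain text is what would go to the translators; the JSON records are what would live in the separate file.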

One problem with the second option is that it keeps the metadata tied to the original Chinese text. There’s no trivial way, so far as I can see, to then associate a particular segment of the translated text with a line number, unless you’re using the two-language view. Ideally, however, the metadata should be independent. You should be able to say, what is the translation of line so and so? Obviously, the tags cannot be inserted precisely per glyph as they are in the original, but they can be done per segment.

Anyway, please have a think about this, it will take some time to get it right.

blake · July 1, 2017, 10:17am

It’s always been my intention that the parts which are stripped out will be re-inserted standoff style, basically by re-parsing the original source text and noting where the stripped out stuff should go.

This is very straightforward to do statically (i.e. in advance of serving the files). It’s a little trickier dynamically because it might cause performance problems: inserting the extra markup standoff style is relatively CPU intensive, especially when using character offsets. Without character offsets (i.e. at segment level only) it’s really quick.
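As a rough illustration of the two re-insertion modes (a sketch under assumptions, not the planned implementation): the record fields `glyph`, `class` and `id`, the marker shape, and both function names are hypothetical.

```python
def marker(rec):
    """Render one standoff record as an empty anchor tag (assumed shape)."""
    return f'<a class="{rec["class"]}" id="{rec["id"]}"></a>'


def reinsert_by_offset(text, records):
    """Character-level: splice each marker back in at its glyph offset."""
    out = []
    last = 0
    for rec in sorted(records, key=lambda r: r["glyph"]):
        out.append(text[last:rec["glyph"]])
        out.append(marker(rec))
        last = rec["glyph"]
    out.append(text[last:])
    return "".join(out)


def reinsert_by_segment(text, records):
    """Segment-level: ignore offsets and put every marker at the segment start."""
    return "".join(marker(rec) for rec in records) + text


records = [
    {"glyph": 0, "class": "t-linehead", "id": "l1"},
    {"glyph": 3, "class": "t-linehead", "id": "l2"},
]
original = reinsert_by_offset("甲乙丙丁戊", records)
translated = reinsert_by_segment("Placeholder translation.", records)
print(original)
print(translated)
```

The offset version has to slice and rebuild each segment's text, while the segment version is a single concatenation, which is where the difference in cost comes from.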

sujato · July 1, 2017, 10:18am

Okay. So can you check Ayya’s files and make sure that’ll work out?

Vimala · July 1, 2017, 10:26am

In principle I agree with this, but I also see a problem with translations. Ideally we would like the .t and .t-linehead to be where they belong, at the end of the line of Chinese characters where they are now. But that is in the middle of a segment, because Bhante Sujato wants to have one rule per segment for these Patimokkha texts. That means there is no way to tell where the original .t has to go within the translation, unless you also record in the JSON file that a specific code goes at position x in the original Chinese, position y in the English translation, and position z in the German translation.

sujato · July 1, 2017, 10:33am

But the T line number has no specific position in the translated segment anyway, as the translation and the original have no word-to-word correspondence, only a phrase-to-phrase one, i.e. segment to segment. Associating line numbers per segment is the most precise that is possible. In any case, it’s plenty good enough: the purpose is so that people can find what they’re looking for.
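A small sketch of what segment-level association would allow, assuming standoff records with hypothetical `segment` and `id` fields and translations keyed by the same segment IDs. A line is attributed to the segment in which its marker sits, which is a simplification.

```python
def segment_of_line(standoff, line_id):
    """Return the segment that contains the marker for this line, if any."""
    for rec in standoff:
        if rec["id"] == line_id:
            return rec["segment"]
    return None


def translation_of_line(standoff, translations, line_id):
    """Answer 'what is the translation of line so and so?' at segment level."""
    seg = segment_of_line(standoff, line_id)
    if seg is None:
        return None
    return translations.get(seg)


# Placeholder records and translations, not real data.
standoff = [
    {"segment": "seg1", "glyph": 0, "id": "l1"},
    {"segment": "seg1", "glyph": 3, "id": "l2"},
    {"segment": "seg2", "glyph": 2, "id": "l3"},
]
english = {
    "seg1": "First rule (placeholder).",
    "seg2": "Second rule (placeholder).",
}
print(translation_of_line(standoff, english, "l2"))
```

The glyph offset is simply ignored here, so the same records work for any translation language without extra positions per language.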

Vimala · July 1, 2017, 10:34am

But the segments have been added together to make one big segment, which creates the problem.

sujato · July 1, 2017, 10:39am

Yeah, don’t worry, it’s precise enough. Associating the line numbers per rule = per segment will be fine. It’s better to be simple, clear, and robust than to add an extra burden of complexity that will be of dubious use.

Of course, this still leaves the rest of the Vinaya to be sorted out, but that’s a problem for another day.

blake · July 1, 2017, 10:44am

Yeah, the idea is to be precise when it is easy to be precise and otherwise just put things in vaguely the right place. I can’t see a compelling reason to have this metadata placed with high precision within translations.