Clean up PO code from variants, etc.

sujato · August 31, 2016, 12:53am

One of the long-standing limitations of the PO code we’re using is that it handles variant readings poorly. Basically, pretty much any time there’s a variant reading, the segment matching fails (because of the HTML). This severely limits the effectiveness of the matching. On the other hand, the variants themselves are only rarely useful.

One solution is simply to scrub the variants from the PO code. This keeps the code clean, but you might miss the occasional useful variant.

Another idea I had was this. When preparing the PO code, do two things:

1. Copy the variant(s) in their entirety into the comments (like the rest of the HTML).
2. Delete the extra code from the main text.

So you’d change this:

#. </p><p><a class="sc" id="6"></a>
msgctxt "sn46.55:9.1"
msgid ""
"Seyyathāpi, brāhmaṇa, udapatto agginā santatto <span class=\\\"var\\\" title="
"\\\"ukkaṭṭhito (bj, pts1) | ukkuṭṭhito (s1-3) | pakkudhito (mr)\\\" id=\\\""
"note111\\\">pakkuthito</span> <span class=\\\"var\\\" title=\\\"ussadakajāto "
"(bj) | usmādakajāto (s1-3)\\\" id=\\\"note112\\\">usmudakajāto</span>."
msgstr ""
"Suppose there was a bowl of water that was heated by fire, boiling and "
"bubbling."

To this:

#. </p><p><a class="sc" id="6"></a>
#. <span class="var" title="ukkaṭṭhito (bj, pts1) | ukkuṭṭhito (s1-3) | pakkudhito (mr)" id="note111">pakkuthito</span>
#. <span class="var" title="ussadakajāto (bj) | usmādakajāto (s1-3)" id="note112">usmudakajāto</span>
msgctxt "sn46.55:9.1"
msgid ""
"Seyyathāpi, brāhmaṇa, udapatto agginā santatto pakkuthito usmudakajāto."
msgstr ""
"Suppose there was a bowl of water that was heated by fire, boiling and "
"bubbling."
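As a rough illustration of the two steps, here is a minimal Python sketch. It assumes the msgid has already been read and unescaped (so the HTML uses plain `"` quotes) and that variants always look like `<span class="var" ...>reading</span>` with no `>` inside the attributes. The names `VAR_SPAN`, `strip_variants` and `to_comment_lines` are made up for this example and are not part of any existing tooling.

```python
import re

# Assumed shape of a variant after unescaping the PO string, e.g.
# <span class="var" title="..." id="note111">pakkuthito</span>
VAR_SPAN = re.compile(r'<span class="var"[^>]*>(.*?)</span>')


def strip_variants(text):
    """Return (clean_text, spans).

    clean_text has every variant span replaced by its inner reading;
    spans holds the removed markup, in order of appearance.
    """
    spans = [m.group(0) for m in VAR_SPAN.finditer(text)]
    clean = VAR_SPAN.sub(lambda m: m.group(1), text)
    return clean, spans


def to_comment_lines(spans):
    """Format the removed spans as PO extracted-comment lines ("#. ...")."""
    return ["#. " + span for span in spans]
```

Each removed span goes onto its own `#. ` line, so the comment stays valid PO syntax and the msgid is left as plain Pali text for the segment matcher.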

Of course we have to ensure that the correct HTML is reconstituted afterwards.

Unless (and maybe this is actually a good idea) we use this as a template for stripping variants from the HTML altogether and retaining them as JSON standoff? Even without explicit standoff glyph-counting, we can match each variant to its correct place with fairly high accuracy. The only time it would fail is if a term appears twice (or more) in the same segment and only one occurrence has a variant. But this would be a rare case, and unlikely to cause problems.
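To make the standoff idea concrete, here is a hedged Python sketch. The record layout (`segment`, `term`, `variants`, `id`) and the function names are assumptions for illustration only, not an agreed format. It matches by term rather than by glyph offset, and it refuses to guess in the ambiguous case described above, where a term occurs more than once in the segment.

```python
import json
import re

# Assumed attribute order: class, title, id (as in the sn46.55:9.1 example).
VAR_SPAN = re.compile(
    r'<span class="var" title="([^"]*)" id="([^"]*)">(.*?)</span>'
)


def extract_standoff(segment_id, text):
    """Strip variant spans; return (clean_text, list of standoff records)."""
    records = [
        {"segment": segment_id, "term": m.group(3),
         "variants": m.group(1), "id": m.group(2)}
        for m in VAR_SPAN.finditer(text)
    ]
    clean = VAR_SPAN.sub(lambda m: m.group(3), text)
    return clean, records


def restore(clean, records):
    """Re-insert the spans by term matching.

    Raises ValueError when a term is missing, occurs more than once
    (counted as a substring, so this errs on the cautious side), or
    overlaps another term.
    """
    pieces = []
    for r in records:
        term = r["term"]
        if clean.count(term) != 1:
            raise ValueError("ambiguous or missing term: " + term)
        start = clean.find(term)
        span = '<span class="var" title="{}" id="{}">{}</span>'.format(
            r["variants"], r["id"], term)
        pieces.append((start, start + len(term), span))

    out, pos = [], 0
    for start, end, span in sorted(pieces):
        if start < pos:
            raise ValueError("overlapping terms in segment")
        out.append(clean[pos:start])
        out.append(span)
        pos = end
    out.append(clean[pos:])
    return "".join(out)


def to_json(records):
    """Serialise records, keeping the Pali diacritics readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Running `extract_standoff` and then `restore` on the same segment should give back the original HTML, which is an easy round-trip check to run over the whole corpus. Segments where `restore` raises are exactly the rare repeated-term cases, so they could be listed and fixed by hand.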

blake · August 31, 2016, 1:22pm

Hummm, variant readings could probably be included as separate comments.

Vimala · August 31, 2016, 6:25pm

I like the JSON standoff idea …