On Pootle we segment texts based on major punctuation. This is very useful as an initial approximation, as it breaks the text into (mostly) semantic segments.
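As a rough illustration of the idea (not Pootle's actual implementation), punctuation-based segmentation can be sketched in a few lines of Python. The set of "major" punctuation marks and the function name `segment` are my own assumptions here.

```python
import re

# Assumed set of "major" punctuation marks; the real rules may differ.
MAJOR_PUNCTUATION = ".;:?!"

# A segment is the shortest run of text that ends in major punctuation
# (plus any trailing whitespace), or whatever is left at the end of the text.
_SEGMENT_RE = re.compile(
    r".+?(?:[" + re.escape(MAJOR_PUNCTUATION) + r"]+\s*|$)",
    flags=re.DOTALL,
)


def segment(text):
    """Split text after each run of major punctuation, dropping no characters."""
    return _SEGMENT_RE.findall(text)


sample = "So I have heard. At one time the Buddha was staying near Savatthi; there he spoke."
for piece in segment(sample):
    print(repr(piece))
```

Because every character ends up in exactly one segment, joining the pieces gives back the original text. The weakness is that the boundaries are only as good as the punctuation.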
It’s far from perfect, though. The punctuation is not consistent, for example. I have been recording such instances, and intend, when my translation is done, to go back over the whole corpus and resegment. Obviously this is not urgent, but I thought I’d record my thoughts here.
This gives us an opportunity to create segments using any means we like. One important criterion is that the segmentation should be reversible; that is, we should be able to resegment the texts later. Here are some options.
- Change the punctuation (in the original text) to make it consistent.
  - Advantages: Simple, improves the text. (I will be doing this to some extent regardless.)
  - Disadvantages: It won’t work every time. In particular, it breaks on ellipses, which sometimes should indicate a segment break and sometimes not. Also, the punctuation might be corrected again at some point, which would ruin everything!
- Just do it by hand in the PO files.
  - Advantages: Pure coding, no changes to the text required.
  - Disadvantages: Loses reversibility.
- Use some other markup. Add an HTML span or something to indicate segments.
  - Advantages: umm …
  - Disadvantages: Makes the code crappy.
- Use standoff markup.
  - Advantages: Super clean, nothing is in the text.
  - Disadvantages: It doesn’t exist.
- Insert some dedicated glyph at segmentation points.
  - Advantages: Simple, unambiguous, robust. We could use the NUL character (U+0000), which, if I am not mistaken, plays a similar delimiting role in C, where it terminates strings.
  - Disadvantages: Might bug out? If the text is reused, people might not know the markers are there. Maybe use the visible ␀ instead?
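
To make the last two options a bit more concrete, here is a minimal Python sketch of both. It assumes the visible ␀ (U+2400) as the marker glyph, and a "standoff" record that is nothing more than a list of boundary offsets kept apart from the text. All the function names are hypothetical, and the offsets count Python string indices (code points), which is just one possible convention.

```python
MARKER = "\u2400"  # the visible "␀" glyph; any character absent from the corpus would do


def mark(segments):
    """Join segments into one text with MARKER at every segment boundary."""
    for seg in segments:
        if MARKER in seg:
            raise ValueError("marker already occurs in the text")
    return MARKER.join(segments)


def split_marked(text):
    """Recover the segments from a marked text."""
    return text.split(MARKER)


def strip_markers(text):
    """Recover the plain, unsegmented text from a marked text."""
    return text.replace(MARKER, "")


def to_standoff(segments):
    """Return (plain_text, offsets), with boundary offsets stored outside the text."""
    offsets = []
    position = 0
    for seg in segments[:-1]:
        position += len(seg)
        offsets.append(position)
    return "".join(segments), offsets


def from_standoff(text, offsets):
    """Cut the plain text at the stored offsets to rebuild the segments."""
    bounds = [0, *offsets, len(text)]
    return [text[start:end] for start, end in zip(bounds, bounds[1:])]


# Hand-chosen segments: the ellipsis does not force a break here.
segments = ["So I have heard. ", "At one time … ", "the Buddha spoke."]

marked = mark(segments)
plain, offsets = to_standoff(segments)
print(split_marked(marked) == segments)          # marker round trip
print(strip_markers(marked) == plain)            # markers can be removed cleanly
print(from_standoff(plain, offsets) == segments)  # standoff round trip
```

Both approaches are reversible in the sense above: the markers can be stripped or moved, and the offsets can be thrown away and recomputed, without touching the punctuation. The catch with offsets is that they go stale as soon as the text itself is edited, whereas a marker travels with the text.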