Here I’m going to describe for @blake and other interested parties (@Jhanarato @vimala) my ideas for how we can hack Virtaal to become an awesome Pali translation tool.
## Virtaal
Virtaal is a Computer Assisted Translation (CAT) app, which is free, open source, and written in Python: all good things. I’ve had a look around the available CATs and this seems to be the best for us, but of course other suggestions are welcome. The project is alive, but not very active.
Some of the things below will be easily possible, some not. Many of them are at least partially supported by extant Virtaal (translate-toolkit) extensions.
http://translate-toolkit.readthedocs.org/en/latest/commands/
Note that I am not at all interested in machine translation at this time. The purpose of Virtaal is to make human translation easier, faster, and more accurate.
## What I want
- Segment control: Virtaal uses the PO standard for segmenting and marking up text. We need to make it segment on sentences by default, with the ability to adjust segmenting for individual cases. There doesn’t seem to be an HTML -> PO tool that gives detailed segment control; html2po segments at the block level. So one solution is just to add a bunch of divs to the text before segmenting (or, better, to use some block-level element that doesn’t exist in the texts, to avoid problems; `<pre>` works fine). Actually, I just came across this, which seems to be what we want:
posegment: Translate Toolkit 3.8.0 documentation
- Match strings: Probably the most important feature of a CAT is the ability to match strings and propagate changes. Virtaal already does this natively, even doing sweet fuzzy matching of Pali! But it needs to ignore inline HTML when doing matching. See:
html2po, po2html: Translate Toolkit 3.8.0 documentation
- Import/export: We need to get text into Virtaal and then out of it. So we should be able to simply load the SC Pali text, which is more or less what it does already, and then export the translated text using exactly parallel markup. Of course, things like variant readings and so on should not be exported, just the text markup. Ideally we could simply press “Export to SC” and have it produce a set of files that can be directly uploaded. Note that this doesn’t have to be ready until the end of the project. See:
pomerge: Translate Toolkit 3.8.0 documentation
http://opensource.bureau-cornavin.com/html2pot-po2html/index.html
- Show/Hide HTML: Virtaal displays HTML by default, which is not ideal. It would be nicer to have it hide things like variant readings and so on, but display them when needed.
html2po: Translate Toolkit 1.12.0 documentation
- Pali lookup: Get a Pali dictionary lookup tool integrated with Virtaal.
- Term markup: It would be a nice extra to be able to simply mark up any technical terms, proper names, and flora and fauna. We could make a list of these (in fact we already do, pretty much) and have Virtaal recognize and suggest them. Then when we export we can give them a tag, which we can use for a glossary. That way we can have convenient info and definitions for important terms everywhere on SC. It seems this is handled by:
poterminology: Translate Toolkit 3.8.0 documentation
- Trilinear: This is for later, but once we have the full English translations it would be nice to make a trilinear version: Pali/English/Target. Then we can open the app for people to start making new translations in other languages. This is somewhat related to this:
poswap: Translate Toolkit 3.8.0 documentation
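
To make the "ignore inline HTML when matching" point a bit more concrete, here is a minimal Python sketch. It is not how Virtaal or the translate-toolkit actually match strings; it just uses the standard library's `difflib` as a stand-in to show the idea of comparing tag-free text. The function names and the `class="var"` markup are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Naive tag pattern: good enough for simple inline markup, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_inline_html(text: str) -> str:
    """Remove tags and collapse whitespace so only the readable text remains."""
    without_tags = TAG_RE.sub("", text)
    return " ".join(without_tags.split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score computed on the tag-free text."""
    return SequenceMatcher(None, strip_inline_html(a), strip_inline_html(b)).ratio()


plain = "Evaṃ me sutaṃ."
# Hypothetical markup; the real SC class names may differ.
marked = 'Evaṃ <span class="var">me</span> sutaṃ.'
print(similarity(plain, marked))
```

Note that this keeps the text inside the tags, so an element whose content should be hidden entirely (like a variant reading) would need to be dropped before comparing, not just unwrapped.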
Anyway, enough for now. I’ve done a little test this morning. I took MN7, stripped out the inline elements, and put divs between sentences, and also at other punctuation just for the hell of it. Then I used html2po to convert it, loaded it into Virtaal, and started on a rough translation, just to see how it works. I was pretty impressed with its ability to do fuzzy matching of Pali. The whole interface is sweet, actually. HTML and PO files attached.
test.zip (8.3 KB)
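
For anyone who wants to repeat the test, the manual preparation step (strip inline elements, wrap each sentence in a block-level element) could be scripted roughly like this. It is only a sketch under some assumptions: the sentence rule is a naive split after `.`, `?` or `!`, the function name is made up, and the wrapper defaults to `<pre>` as suggested above, though `div` works the same way. The output would then go through html2po as in the test.

```python
import re

# Naive tag pattern for simple inline markup; not a full HTML parser.
INLINE_TAG_RE = re.compile(r"<[^>]+>")
# Split on whitespace that follows sentence-final punctuation (assumed rule).
SENTENCE_END_RE = re.compile(r"(?<=[.?!])\s+")


def segment_paragraph(paragraph_html: str, wrapper: str = "pre") -> str:
    """Strip inline tags, then wrap each sentence in its own block element."""
    text = " ".join(INLINE_TAG_RE.sub("", paragraph_html).split())
    sentences = [s for s in SENTENCE_END_RE.split(text) if s]
    return "\n".join(f"<{wrapper}>{s}</{wrapper}>" for s in sentences)


sample = "Evaṃ me sutaṃ. Ekaṃ <i>samayaṃ</i> bhagavā sāvatthiyaṃ viharati."
print(segment_paragraph(sample))
```

Individual cases where the naive split is wrong (quotes, abbreviations) would still need adjusting by hand.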