Wishlist for Virtaal

sujato · April 1, 2015, 1:47am

Here I’m going to describe for @blake and other interested parties (@Jhanarato @vimala) my ideas for how we can hack Virtaal to become an awesome Pali translation tool.

## Virtaal

Virtaal is a Computer Assisted Translation (CAT) app, which is free, open source, and written in Python: all good things. I’ve had a look around the available CATs and this seems to be the best for us, but of course other suggestions are welcome. The project is alive, but not very active.

Some of the things below will be easily possible, some not. Many of them are at least partially supported by extant Virtaal (translate-toolkit) extensions.
http://translate-toolkit.readthedocs.org/en/latest/commands/

Note that I am not at all interested in machine translation at this time. The purpose of Virtaal is to make human translation easier, faster, and more accurate.

## What I want

Segment control: Virtaal uses the PO standard for segmenting and marking up text. We need to make it segment on sentences by default, with the ability to adjust segmenting for individual cases. There doesn’t seem to be an HTML –> PO tool that gives detailed segment control. html2po segments on block level. So one solution is just to add bunches of divs to the text before segmenting. (or better to use some block level element that doesn’t exist in the texts to avoid problems. <pre> works fine.) Actually, I just came across this, which seems to be what we want:
posegment — Translate Toolkit 3.8.0 documentation
Match strings: Probably the most important feature of a CAT is the ability to match strings and propagate changes. Virtaal already does this natively, even doing sweet fuzzy matching of Pali! But it needs to ignore inline HTML when doing matching. See
html2po, po2html — Translate Toolkit 3.8.0 documentation
Import/export: We need to get text into Virtaal and then out of it. So we should be able to simply load the SC Pali text—which is more or less what it does already—and then export the translated text using exactly parallel markup. Of course, things like variant readings and so on should not be exported, just the text markup. Ideally we could simply press “Export to SC” and have it produce a set of files that can be directly uploaded. Note that this doesn’t have to be ready until the end of the project. See:
pomerge — Translate Toolkit 3.8.0 documentation
http://opensource.bureau-cornavin.com/html2pot-po2html/index.html
Show/Hide HTML: Virtaal displays HTML by default, which is not ideal. It would be nicer to have it hide things like variant readings and so on, but display them when needed.
html2po — Translate Toolkit 1.12.0 documentation
Pali lookup: Get a Pali dictionary lookup tool integrated with Virtaal.
Term markup: It would be a nice extra to be able to simply markup any technical terms, proper names, and flora and fauna. We could make a list of these—in fact we already do, pretty much—and have Virtaal recognize and suggest them. Then when we export we can give them a tag, which we can use for a glossary. That way we can have convenient info and definitions for important terms everywhere on SC. It seems this is done by this:
poterminology — Translate Toolkit 3.8.0 documentation
Trilinear: This is for later, but once we have the full English translations it would be nice to make a trilinear version: Pali/English/Target. Then we can open the app for people to start making new translations in other languages. This is somewhat related to this:
poswap — Translate Toolkit 3.8.0 documentation
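
On the "Match strings" point above, here is a minimal sketch of what ignoring inline HTML during matching could look like. It is not how Virtaal matches internally: `difflib` from the Python standard library stands in for whatever fuzzy matcher is actually used, and the helper names `strip_inline_html` and `similarity` are made up for illustration.

```python
import difflib
import re

# Anything between angle brackets is treated as a tag (a rough assumption
# that is good enough for simple inline markup).
TAG_RE = re.compile(r"<[^>]+>")


def strip_inline_html(text):
    """Remove tag-like substrings and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub("", text).split())


def similarity(a, b):
    """Return a 0..1 similarity ratio computed on the tag-free text."""
    matcher = difflib.SequenceMatcher(
        None, strip_inline_html(a), strip_inline_html(b)
    )
    return matcher.ratio()


plain = "Evaṃ me sutaṃ ekaṃ samayaṃ bhagavā"
marked = "Evaṃ me <em>sutaṃ</em> ekaṃ samayaṃ bhagavā"
score = similarity(plain, marked)
```

With the tags stripped first, the two strings compare as identical, so a segment with inline markup can still reuse a translation made for the plain version.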

Anyway, enough for now. I’ve done a little test this morning. I took MN7, stripped out the inline elements, and put divs between sentences, and also other punctuation just for the hell of it. Then I used html2po to convert it, loaded into Virtaal, and started with a rough translation, just to see how it works. I was pretty impressed with its ability to do fuzzy matching of Pali. The whole interface is sweet, actually. HTML and PO files attached.

test.zip (8.3 KB)

blake · April 1, 2015, 2:34pm

Segment control: This should be a straightforward task. PO is such a simple format, and I am very familiar with HTML processing, so I can just write a custom segmenter.
Match strings: html2po uses a clever approach by including the ‘address’ of each sentence as a comment, allowing reconstruction of the text with po2html. I would base my segmenter on this technique. Ignoring inline HTML sounds doable with a plugin to Virtaal.
Import/export: I think we can use a web interface for this, possibly only enabled for development environments, but if later we have translators who aren’t equipped to set up a development environment, we could consider having it available online too.
Show/Hide HTML: Okay, you seem to be saying it would be ideal to hide some HTML. We probably want markup for emphasis and the like to be shown; in short, anything you’d want to write into the translated text should, I’d think, be shown. Perhaps anything which probably won’t get translated could be available only as a comment? For example, you can mouse over an entry in Virtaal to see the associated comment (i.e. the HTML file and path in html2po output). Would it be adequate to just include the ‘raw HTML’ as a PO comment, making it visible through this mechanism?
Pali Lookup: My preliminary research indicates such tools don’t currently exist for any language in Virtaal. Is this correct? It should be doable through a plugin or fork.
Term markup: Sounds like a job for Placeables, if I understand them correctly. Basically things which you want to pass through unchanged.
Trilinear: Are you talking about a significant mod to Virtaal which shows 3 columns instead of 2?

sujato · April 2, 2015, 12:17am

Cool. Generally I’m thinking segment on . ; : ! ? — . Also <br> for the verses. Mostly you want to translate verse line by line.

But it would be nice to be able to adapt segmentation locally. Of course it can be done by editing the PO files directly, but some interface to allow local changes might be useful. But this is not a priority, just something to think about.
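
To make the default rule concrete, here is a rough sketch of segmenting on the punctuation marks listed above. It only handles plain text, not HTML or verse line breaks, and the function name `segment` is just for illustration. It is an assumption about how such a rule might be written, not the segmenter that ends up being used.

```python
import re

# A segment is a run of non-break characters followed by any break marks
# that come directly after it. The break marks are the ones listed above.
SEGMENT_RE = re.compile(r"[^.;:!?—]+[.;:!?—]*")


def segment(text):
    """Split text into segments, keeping the closing punctuation."""
    pieces = (m.strip() for m in SEGMENT_RE.findall(text))
    return [p for p in pieces if p]


segments = segment("Evaṃ me sutaṃ. Ekaṃ samayaṃ; tatra kho!")
```

A rule this simple will also break on things like abbreviations, which is one reason a way of adjusting segments by hand would still be useful.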

I’m not sure what kind of web interface you are thinking of. What I was thinking was that we prepare the whole Pali canon and import it into Virtaal as a project. Then when the translation is finished, we just hit export and produce a bunch of SC-compliant HTML files.

I wasn’t thinking of this, only the HTML in the source files. But yes, you should be able to insert HTML into the translated text. Practically speaking, there shouldn’t be much, really, just the occasional . I’d like to resist the ability to add notes and fancier stuff. Perhaps we could just use markdown emphasis? But then there is the question of “terms”, see below.

Perhaps, but exclude <a> tags, otherwise it’s just cluttered with references. Just the variant readings and other textually relevant material.

Yes, as far as I can see. I’ve also briefly checked for poedit and it also seems to not have dictionaries. I don’t know why this is, it would seem to be an obvious feature. And BTW, poedit is a possible alternative to Virtaal; it is a more active project.

I don’t mean that we retain the Pali terms in the translation, but that we mark the translated terms with metadata so they can be identified later. Then on the site we can have a popup for terms and so on. Of course, some terms will be retained in the Pali, especially proper names, but also a few doctrinal terms.

So in our po file we could have something like:

Not doing what is <span class="term" id="pāpa">wrong</span>
Undertaking what is <span class="term" id="kusala">good</span>,
And purifying the <span class="term" id="citta">mind</span>—
This is the teaching of the Buddhas.
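
Markup of this kind is easy to harvest later. As a sketch only, the standard library `html.parser` can collect every term span as a (Pali id, translated word) pair. The `class="term"` and `id` convention is the one proposed above; the names `TermCollector` and `collect_terms` are hypothetical, and nested spans are not handled.

```python
from html.parser import HTMLParser


class TermCollector(HTMLParser):
    """Collect (pali_id, translated_text) for each <span class="term">."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._current = None   # id of the term span we are inside, if any
        self._buffer = []      # text seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "term":
            self._current = attrs.get("id")
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.terms.append((self._current, "".join(self._buffer)))
            self._current = None


def collect_terms(html):
    parser = TermCollector()
    parser.feed(html)
    parser.close()
    return parser.terms


verse = (
    'Not doing what is <span class="term" id="pāpa">wrong</span>\n'
    'And purifying the <span class="term" id="citta">mind</span>'
)
terms = collect_terms(verse)
```

The resulting list could feed a glossary or the popups mentioned above.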

Yes. Although this is, as I said, not a priority for now. Actually, take a step back first. The issue is that most translations are made using the English as a basis. So let’s make these easier and better. We can release Virtaal with the English translated texts, when they’re ready, so that’s good. But can we do better? The main problem with translating from English is that you get distant from the Pali. So let’s bring the Pali closer. Show the Pali and the English, with the ability to use the dictionary or whatever other cool tools we think of. Then a translator can still work from the English, but with the Pali there to refer to. (The opposite to how I work, actually: I translate from the Pali, and refer to the English.)

What I’d like to do in the next week or so is to start work on a small text, something from the Khuddaka, and start to get a feel for how it works.

sujato · April 9, 2015, 12:41am

I’ve started working on the Therigatha to get a feel for how Virtaal works, and I must say I am very impressed. It’s straightforward, and does most of what I want right out of the box. In case you want to see what I’m doing, see the attached file, which has the beginnings of the translation (just a few verses, draft of course), together with a terminology file.

The terminology file is derived from a table of terms as used by Ven Bodhi in his translations. This was supplied by John K. I have simplified it, changed some entries, and keep only the most recent renderings. Over time it will evolve, but for now this gives a good example of how this works. I’ve also removed the final letter from all entries. NOTE: this is now updated, I have edited each entry and this is closer to what I shall use.

terminology.po.zip (11.3 KB)

And in case it is of interest, here is the unfinished translation of the Therigatha that I am working on.

thig.po.zip (45.7 KB)

blake · April 28, 2015, 10:06am

Inspiration has struck for the custom segmenter and I’ve hit upon what I feel is a pretty nice approach.

A Python script first cleans up the HTML in various ways, for instance removing excess paragraph numbers (i.e. the ms ones), the metadata, variant notes and so on (you can customize what is removed).

To construct the .po file, it then breaks up the HTML. Structural HTML tags (such as div, blockquote, p, most paragraph numbers and so on) are output directly in the form of PO comments.
Text to be translated becomes msgid strings, segmented on sentence breaks.
There may be inline HTML tags.

When you open the file in Virtaal you see the comments as greyed out text bracketing the text being translated.

When it comes time to perform reconstruction, it is a simple matter: the inverse utility (not written yet) just recombines the HTML from the comments and the text from the msgstrs to produce a fully functional HTML document. Unlike po2html, no template is needed.
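
To illustrate the idea (this is not the code in sc-html2po.py, and the exact comment format it writes may well differ), here is a toy version of the round trip. Each unit carries the structural HTML that precedes it, the source text, and the translation. The names `to_po` and `to_html`, the `#.` comment style, and the `trailer` argument for closing tags are all assumptions made for the sketch. A real PO file would also need a header entry.

```python
def po_escape(text):
    """Escape a string for use inside a double-quoted PO field."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_po(units):
    """units: list of (structural_html_list, source_text, translation)."""
    blocks = []
    for html, source, translation in units:
        lines = [f"#. {tag}" for tag in html]   # structure goes in comments
        lines.append(f'msgid "{po_escape(source)}"')
        lines.append(f'msgstr "{po_escape(translation)}"')
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def to_html(units, trailer=""):
    """Inverse step: interleave the structural HTML with the translations."""
    parts = []
    for html, _source, translation in units:
        parts.extend(html)
        parts.append(translation)
    return "".join(parts) + trailer


units = [
    (["<p>"], "Evaṃ me sutaṃ.", "So I have heard."),
    (["</p>", "<p>"], "Ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati.",
     "At one time the Buddha was staying near Sāvatthī."),
]
po_text = to_po(units)
html_text = to_html(units, trailer="</p>")
```

Because the structure travels with the units, the HTML can be rebuilt without a template file.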

Here is sample output sn56.11.po.zip (2.5 KB)

The script itself has been committed to the git repository, as utility/sc-html2po.py

You can run it directly with an input (HTML) and output (PO) filename. It can also take parameters to determine which tags are stripped (keeping the enclosed text, such as variant notes) and which elements are completely removed (eliminating text and children, such as the metadata element), so if, say, you want to keep everything, you can.

Note that at present it doesn’t have special handling for <br>; it will break verses normally based on . ; and so on. I have a suspicion that it might actually be better that way, since presumably verse lines can’t always be translated in isolation, and even in verses a sentence is the minimum translatable unit. Technically it’s not difficult to implement; the only question is what the most correct behavior is.

sujato · April 28, 2015, 11:02am

Cool. So I’ve done a translation already, for you to use in your testing or whatever. Meanwhile I’ll have a closer look at what is going on. You’ll have to walk me through how to actually run the script…

sn56.11.po.zip (4.3 KB)

Regarding verse translation, I would have agreed with you before starting work on the Therigatha. However I’ve found that in 9 out of 10 cases it is fine to translate (more or less) line by line, and only every few verses do you need to rejig the syntax. So you gain a lot by being able to reuse more segments. The downside is that not every line ends up being an actual translation of the line to which it is connected, which I think is okay. It just makes the marked-up text a bit fuzzy, that’s all.

blake · April 28, 2015, 11:46am

You should be able to run the script in the same way as split.py, the requirements should be the same, both use the suttacentral environment.

I think though I’ll update the syntax so it takes a list of files, and generates the output name(s) automatically.

I’ll also update the <br> behavior; if it works well 9 / 10 times, that is good enough :).

sujato · April 28, 2015, 12:52pm

Okay, it works now, I had to add cssselect to my python3.

blake · April 28, 2015, 3:00pm

I have pushed a new version.

It generates prettier output; for example, inline closing tags (like ...) should now belong to the same sentence as the text they enclose. Also the HTML should display better as comments. It now strips out most of the invisible Unicode whitespace control characters.

<br> is now a line break and is inserted as an HTML structure comment rather than an inline tag.

There is a command line change: it now takes a list of files and dumps converted files into an output folder, ./po-out by default. Files are renamed from ‘.html’ to ‘.po’.
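
For anyone curious what the renaming amounts to, it could be sketched with `pathlib` like this. The function name `output_path` is hypothetical and the real script may build the name differently; only the ./po-out default and the ‘.html’ to ‘.po’ change come from the description above.

```python
from pathlib import Path


def output_path(source, out_dir="po-out"):
    """Map an input HTML path to a .po path inside the output folder."""
    return Path(out_dir) / Path(source).with_suffix(".po").name


target = output_path("texts/sn56.11.html")
```

Only the final suffix is swapped, so a name like sn56.11.html keeps its inner dot.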

sujato · April 30, 2015, 8:21am

I’ve had a few more thoughts after using the “terminology” function for a while.

Basically I haven’t found this to be as useful as I thought, because:

It doesn’t recognize terms all that well. Not sure how it works but it often seems to pick the shortest term.
It’s kind of clumsy to use and doesn’t really save time.
It doesn’t help to markup terms.

Currently it highlights terms; you alt+right to select and alt+down to insert, or if there is more than one choice for that term, view it first then choose and insert. Once you’ve done that you have interrupted your flow enough that it would have been easier just to type it, especially once you’ve adapted the term for grammar, put it in the right place, and so on.

I’m wondering whether we could display the terms as a popup instead. When you start a line, the highlighted terms appear, probably above the text entry field, covering where the “comments” are. That way you can glance at them if need be, and it won’t affect the typing flow. You can still select and insert as you can now, there’s just an additional visual aid as well.

As for marking up terms, I go back and forth on this one. The simplest way I can think of would be just use [square brackets] to enclose any term. Then later we can match the bracketed translation terms with the appropriate Pali, which should be simple with some fuzzy matching. To avoid ambiguity, I could add the appropriate Pali term where necessary, something like [kama|desire], [chanda|desire].
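
A sketch of how the bracket notation could be read back out, assuming exactly the two forms given above: [term] and [pali|term]. The function names are made up for illustration, and nested or unbalanced brackets are not handled.

```python
import re

# Optional "pali|" prefix, then the translated term, inside square brackets.
BRACKET_RE = re.compile(r"\[(?:([^\]|]+)\|)?([^\]|]+)\]")


def extract_terms(text):
    """Return (pali_or_None, translated_term) for each bracketed term."""
    return [(m.group(1), m.group(2)) for m in BRACKET_RE.finditer(text)]


def strip_markup(text):
    """Drop brackets and Pali hints, leaving the readable translation."""
    return BRACKET_RE.sub(lambda m: m.group(2), text)


line = "Giving up [kama|desire] for the sake of [freedom]."
found = extract_terms(line)
clean = strip_markup(line)
```

Terms that come back with no Pali attached are the ones that would need fuzzy matching against the source sentence.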

But if we’re going to do that, do we need any kind of markup at all? Just keep a list of translated terms, and when we are ready run a matching program over the whole corpus and identify the terms. Of course there will be problems with ambiguity, but it might be quicker to deal with this all at once at the end rather than try to do it one by one as we go.

Then there is the issue of to what extent the terminology will be useful. If we have the Pali sentence matched with the translated sentence, and we also have the Pali lookup tool, we essentially have the means to fairly quickly check every Pali word in the corpus. Maybe there is little utility in developing a much more limited and specialized terminology glossary.

Regarding which, and this is looking ahead some way, I was thinking that we could implement a system where the user could choose how they wanted to see the Pali and English (or other translated text).

By default, they just read the translation. If they want, they can click to call in the Pali as well. There are three main ways of displaying this:

As a popup; like Google translate, hover and see the original text.
Line by line.
Side by side in two columns.

Each of these is useful in certain cases, and with our matched text and javascript magic we should be able to do them all, amiright?

In any case, once a reader has simple access to the Pali text, together with the Pali lookup, they can, if they wish, identify the Pali behind anything in the translation. So the utility of marking special terms would seem to be minimal. What do you think?

blake · April 30, 2015, 2:28pm

I’ll see if I can come up with something cool for dictionary lookup. Do you think it could work if there was a separate popup window (which you can place anywhere you want to) which displays a breakdown of the current sentence, according to terms and dictionary lookup? One of the things is that hacking in a whole new component is a ton easier than trying to hack the functionality of an existing component.
By the way I was thinking of using Elasticsearch to handle dictionary lookup, as the dictionaries are already indexed into it, and it has a vast range of mechanisms for fuzzy matching, ranking and so on. I could actually just index the terminology data into Elasticsearch, and use it as a high ranking source for lookup.

Now with marking up terms I think you’re right. Since we have a correspondence of pali sentence to english sentence we can use statistical analysis to know what word is what. For example if we know that ‘desire’ might be ‘chanda’ or ‘kama’, and the pali sentence only contains ‘chanda’, we then know that in that context it is ‘chanda’. And I think that you’re right in thinking that for one who cares about such things, just having the ability to reference the pali would be adequate and there’s probably no need for anything more.
However for the more casual user, it might be beneficial to have a specific notation for marking up terminology.
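
The ‘desire’ example can be sketched as a plain lookup. This is much simpler than real statistical analysis, and the candidate table, the stem spellings and the name `resolve_term` are assumptions for illustration. Stems are matched by prefix so that inflected forms still count.

```python
def resolve_term(english_term, pali_sentence, candidates):
    """Pick the one candidate Pali stem that occurs in the sentence.

    candidates maps an English term to a list of possible Pali stems.
    Returns None when no stem, or more than one stem, is present.
    """
    words = pali_sentence.lower().split()
    found = [
        stem
        for stem in candidates.get(english_term, [])
        if any(word.startswith(stem) for word in words)
    ]
    return found[0] if len(found) == 1 else None


candidates = {"desire": ["chand", "kām"]}
resolved = resolve_term("desire", "chando uppajjati", candidates)
ambiguous = resolve_term("desire", "kāmesu chando", candidates)
```

Sentences where both stems occur stay unresolved, and those are the cases where an explicit notation would still earn its keep.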

With display of pali and english, I like the possibilities. Popups, line by line and side by side are all totally doable, as is toggling them simply with some javascript and CSS rules.

blake · April 30, 2015, 2:58pm

I’ve also updated sc-html2po.py with some bugfixes, and added sc-po2html.py which works in essentially the same way, that is it takes a list of .po files, and outputs html files into ./html-out

I don’t know how you run these scripts (and also split.py), but in order to keep them up to date and facilitate addition of extra scripts in the future, here is how it should be done now:

Remove them from anywhere else if you have put them anywhere else.
Add the suttacentral/utility/bin folder to PATH, which means edit .bashrc and add the line:

export PATH="$HOME/suttacentral/utility/bin:$PATH"

Assuming your suttacentral path is just ‘suttacentral’

Once this is done you will be able to run sc-*.py and split.py from anywhere, and they will run directly from the utility/bin folder, which guarantees they’ll be updated whenever you git pull, and ensures they run in the suttacentral python environment.

sujato · May 1, 2015, 12:00am

That would be very helpful, yes. There’s plenty of space in the Virtaal window, I’m sure we can find room for it. Possibly better to have it as part of the display rather than as a popup obscuring other things.

You mean, in Virtaal or on the site, or both? Either way, it sounds good, use a dedicated and sophisticated tool for this.

And plenty of time to implement these.

blake · May 1, 2015, 5:57am

Well I was thinking if you bring up a new OS window, it could be positioned as desired. Think GIMP style. It’s a pretty common approach in the Unix world. But making it a panel in the Virtaal window should also be a possibility.

sujato · May 1, 2015, 6:02am

Either way can work.

sujato · May 1, 2015, 8:44am

Thanks; except there is no /bin in /utility. With bin removed it works fine.

blake · May 1, 2015, 9:13am

Oops, I forgot to push the changes where I rearranged the folder structure. git pull, and then it should work with utility/bin

sujato · May 1, 2015, 9:39am

So that works fine now, thanks. But sc-po2html.py isn’t working properly for me, I just get an empty html file.

blake · May 1, 2015, 11:44am

It is working fine for me with the sn56.11.po you translated earlier. Are you sure you’re using a valid translated po file, not an untranslated one? If it doesn’t work, send me the po file it’s not working for.

sujato · May 1, 2015, 12:19pm

Here’s the file I’ve been using:

an4.92.po.zip (1.1 KB)

I just took a random sutta, ran sc-html2po, translated it in Virtaal, and ran sc-po2html. It produces just a bare HTML file.

I tried with sn56.11 and that one worked, so that’s odd. (There was a glitch I noticed, an extra msgid "" at the bottom of the page. Also a couple of empty paragraphs.)