Pootle for translation

sujato · July 5, 2015, 12:58am

@blake Another detail I’ve noticed. In the PO file, the text is on one line, while any translation of any length is split on separate lines, like so:

msgid "‘Ekāyano ayaṃ, bhikkhave, maggo sattānaṃ visuddhiyā sokaparidevānaṃ samatikkamāya dukkhadomanassānaṃ atthaṅgamāya ñāyassa adhigamāya nibbānassa sacchikiriyāya yadidaṃ cattāro satipaṭṭhānā’ti."
msgstr ""
"‘Monastics, this is the path where all things come together as one, to "
"purify sentient beings, to get past sorrow and lamentation, to make an end "
"of pain and sadness, to reach the way, to witness Nibbāna; that is, the four "
"kinds of mindfulness meditation.’"

This makes some things harder, like manual regex on the translation. It also creates errors: for example, I had a <span> in the text that got broken across lines (<\nspan>), and hence showed up as plain text in the HTML file.

Likewise, I am getting an error in cases like this:

"Here, a monastic who has the awakening factor of mindfulness clearly knows ‘"
"I have the awakening factor of mindfulness’; when they don’t have the "

This produces the following in the HTML output:

Here, a monastic who has the awakening factor of mindfulness clearly knows ‘ I have the awakening factor of mindfulness’

Notice the unwanted space following the ‘.
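As a rough illustration of the behaviour being asked for, here is a minimal Python sketch that joins the quoted fragments of one PO entry without inserting anything between them. The function name `join_po_string` is hypothetical and is not part of Pootle or sc-po2html; the sketch also ignores escape sequences such as \" inside a fragment.

```python
def join_po_string(lines):
    """Concatenate the quoted fragments of one PO entry without adding anything."""
    parts = []
    for line in lines:
        line = line.strip()
        # Each fragment is wrapped in double quotes; drop only that wrapper.
        if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
            parts.append(line[1:-1])
    # Joining with the empty string means no space appears at a line break.
    return "".join(parts)


fragments = [
    '"clearly knows ‘"',
    '"I have the awakening factor of mindfulness’; when they don’t have the "',
]
joined = join_po_string(fragments)
print(joined)
```

Because the fragments are joined with an empty string, any space in the output has to come from the translation itself, so the opening quote stays attached to the following word.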

Can we hack it so both text and translation are on one line?

blake · July 6, 2015, 6:59pm

Okay I’ll get hacking on those things.

In remember.py, at the top, is the line “SYNC_PERIOD = 20”, which means it scans for changes every 20s. You could set it to a lower number. On a fast computer with an SSD, scanning for changes will be so fast as to be instantaneous for all intents and purposes, so you could safely set it to 1.
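The internals of remember.py are not shown in this thread, so the following is only a sketch of how a periodic scan of this kind can work, assuming it compares file modification times. Only the name SYNC_PERIOD comes from the post; `snapshot`, `changed_files` and `watch` are hypothetical.

```python
import os
import time

SYNC_PERIOD = 20  # seconds between scans, as in the post


def snapshot(folder):
    """Map each .po file under folder to its last-modified time."""
    times = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.endswith(".po"):
                path = os.path.join(root, name)
                times[path] = os.stat(path).st_mtime_ns
    return times


def changed_files(before, after):
    """Return paths that are new or whose modification time differs."""
    return sorted(p for p, t in after.items() if before.get(p) != t)


def watch(folder, on_change, period=SYNC_PERIOD):
    """Scan forever, calling on_change(paths) whenever something differs."""
    previous = snapshot(folder)
    while True:
        time.sleep(period)
        current = snapshot(folder)
        changed = changed_files(previous, current)
        if changed:
            on_change(changed)
        previous = current
```

Each scan only reads file metadata, which is why a short period is cheap on a fast disk.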

This should be doable.

I would suggest just editing the po files. If it’s truly a problem in the pali text, edit the pali file at the same time. I think that would be the easiest way as presumably when you discover problems in the pali, you’ll have already been translating the po file for a while.

Global find and replace is pretty dangerous. I mean there is the pretty powerful find feature with lots of options, the replace isn’t there because it’d be dangerous to make it too easy to replace things on a global level. Now that is assuming the standard pootle project, which is low on duplicated strings and multi-user.
For now I’d suggest just editing the .po files, if you want to do a global replace, then use perl or something, then rescan the project files in pootle.
By the way, putting the po files under version control would be a really good idea. Whenever I do global replace, I first git add and commit everything, then do the replace, then run git gui to see what has changed - because git gui is brilliant for seeing changes. If I’m unhappy with what got replaced, then it’s easy to revert the changes within git gui, and re-run the replace.

It’s easy to put a folder under version control, you just cd to the folder and go:

git init 
git add *.po
git commit

(or do the last two lines from git gui)

Anyway, in the long run maybe we could look into a built-in search and replace, but it would need to be designed with care, so as not to be a big red “nuke the project” button.

The easy and best solution here is LibreOffice-style on-the-fly conversion of ascii to unicode punctuation. It is very easy to implement, especially if you don’t want to reserve the option of having ascii quotes in the translation (I just implemented it as proof of concept so it definitely works). Simple regexes will get it right 95% of the time, and if it gets it wrong, it will always leave unicode punctuation alone, so if you put the correct unicode mark in it won’t try to change it to what it thinks is right.
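The proof of concept mentioned here lives in Pootle's javascript and is not shown in the thread. The idea can be sketched in Python as below; the function name `smarten` and the exact rules are assumptions, chosen only to show the principle that ASCII quotes are converted while existing Unicode quotes are left alone.

```python
import re


def smarten(text):
    """Replace ASCII quotes with curly ones; Unicode quotes are never touched."""
    # A double quote at the start, or after whitespace or an opening bracket, opens.
    text = re.sub(r'(?:^|(?<=[\s(\[{]))"', "“", text)
    # Any ASCII double quote still left is treated as closing.
    text = text.replace('"', "”")
    # Same rule for single quotes; an opening double quote may also precede one.
    text = re.sub(r"(?:^|(?<=[\s(\[{“]))'", "‘", text)
    # What remains is a closing quote or an apostrophe, which share a glyph.
    text = text.replace("'", "’")
    return text


print(smarten('"Monastics, this is the path," he said.'))
```

Since only the ASCII characters " and ' are ever matched, a hand-entered curly quote passes through unchanged, which is what makes manual correction stick.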

It should be noted it’s also possible to copy most non-alphabetical things like punctuation from the root text directly by clicking on it, and finally you can add any characters including punctuation to the ‘Special Characters’ of a language under Admin, add them to en and you can then click to insert.

Sure. Everything there seems sensible.

I can have a look at improving po2html, certainly the spacing should be handled correctly and there should be no need to add trailing or leading whitespace to make the translated text work.

We have now. And no, it was far from trivial, as the texts in Buddha Words are somewhat modified. Alphabetically they are similar, although for some reason ṅ has been replaced with n, unicode double quotes with ", and em-dash with en-dash (presumably with subtly sadistic intention). Some of the markup is stripped out, or perhaps wasn’t there; there was the odd case of missing markup, and the even rarer case of small pieces of text going missing. Basically everything you’d expect from the real world. So it involved analyzing where the markers had been added, and locating the matching place.

blake · July 6, 2015, 7:43pm

Hummm, now that’s a tricky one. Hacking Pootle’s javascript is easy, the python is typically trickier.

You could try globbing the lines with a regex something like

pattern: ^(msgstr.*)"\n"
replace: \1

Run it multiple times and it should safely merge all the lines by eliminating the "\n" units.
Hacking it is possible - essentially just doing the same thing but automatically when pootle writes a file.
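For anyone who prefers a script to an editor, the same pattern can be applied repeatedly from Python until nothing changes. This is a minimal sketch; the function name `merge_msgstr_lines` is hypothetical, and the sample text is made up for illustration.

```python
import re

# Same pattern as above: a msgstr line whose closing quote is followed
# directly by a newline and the opening quote of a continuation line.
PATTERN = re.compile(r'^(msgstr.*)"\n"', re.MULTILINE)


def merge_msgstr_lines(po_text):
    """Apply the substitution until the text stops changing."""
    while True:
        merged = PATTERN.sub(r"\1", po_text)
        if merged == po_text:
            return merged
        po_text = merged


sample = 'msgid "x"\nmsgstr ""\n"one "\n"two "\n"three"\n\nmsgid "y"\nmsgstr "z"\n'
print(merge_msgstr_lines(sample))
```

Each pass removes one line break per msgstr entry, which is why the substitution has to be repeated. Lines starting with msgid are never matched, so the source text is left alone.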

sujato · July 7, 2015, 12:22am

It would be nice to be able to edit things in one place. I’m wondering whether it would be possible to make the changes in the PO file, then generate the Pali from the PO file at the end, along with the translation?

Okay, I will look into this. But it won’t solve the problem for less technical translators in the future.

Perhaps we could do a global f&r that required confirmation for each replace? It would be slower but safer; there would not be many cases where the quantity of replaces was so very high.
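To make the idea concrete, here is a small Python sketch of a replace that asks before each substitution. It is not a Pootle feature; `replace_with_confirmation` and its `confirm` callback are hypothetical names, and in a real tool the callback would prompt the translator.

```python
import re


def replace_with_confirmation(text, pattern, replacement, confirm):
    """Replace each match only if confirm(match, new_text) returns True."""

    def decide(match):
        new_text = match.expand(replacement)
        # Keep the original text whenever the replacement is rejected.
        return new_text if confirm(match, new_text) else match.group(0)

    return re.sub(pattern, decide, text)


# Example: accept the first occurrence and reject the second.
answers = iter([True, False])
result = replace_with_confirmation(
    "bhikkhu and bhikkhu", r"bhikkhu", "monastic", lambda m, new: next(answers)
)
print(result)
```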

That would be great. Later we can sort out the various styles for different languages.

Yes, but it’s quicker to use a shortcut!

Good point, yes I’ll do this.

Crikey, thanks for that. I’m seeing these now locally, but not online…

Scratch that, they’re live.

sujato · July 7, 2015, 11:15pm

In fact, now that I think about it, we have to do this anyway, right? Because at the end of the day we want pali/English files with parallel markup, and these files will include the segments with IDs.

sujato · July 7, 2015, 11:28pm

Also, just to let you know I am now back in Sydney and trying to get my local Pootle server working on the Desktop.

After much weeping and wailing, lamentation and beating of breasts, it is now working. I think what I needed to do was:

 sudo -u postgres psql postgres
ALTER ROLE sujato LOGIN;

But frankly I still don’t really know…

sujato · July 12, 2015, 8:51am

I’m trying out the version control thing, and I can’t see how to use git-gui for seeing diffs. Meld would be another option, yes?

On an unrelated topic, I have got cepstral voices working locally, so I will be using that for my TTS. No chance of a PO2SSML script is there?

blake · July 13, 2015, 7:05am

Here is the general process:

Make changes but don’t git add or git commit yet…

Run git gui

Unstaged changes (i.e. those which haven’t been git added) appear in the top-left panel called, appropriately enough, “Unstaged Changes”. Changed files appear in green, new files in white. You can click on a file to see changes. You can revert a file in this panel by clicking on it and using “Revert Changes” from the Commit menu. Revert Changes only works for unstaged changes.

You can add changed files from within git gui, individually by clicking on them or from the commit menu, or all at once by clicking the “Stage Changed” button.

Changes which have been added but not committed, appear in the “Staged Changes (Will Commit)” panel. Again you can click on a file, and see the visual diffs. You can unstage changes from here via the commit menu. You can also commit from here.

Once files have been committed, you can still view the diffs. To do this, go to the Repository Menu and “Visualize All Branch History” or “Visualize History”, it simply gives a list of commits, and you can click on a commit to see what changes in that commit. Very useful.

From this visualize menu (gitk), you can also do a few things. You can reset the repository to an earlier commit, this is useful if you realize the changes you just committed are messed up. To do this, right click on a commit and choose “Reset <branch> to here”, and then the “hard” option.
Never reset changes you have pushed. You should reset no earlier than remotes/origin/<branch>, which is the current state of the remote online repository. Consider anything pushed to be set in stone.

Instead of resetting, you can also right click on a commit and choose “Revert this commit”. The revert option creates a new commit which undoes the old one. You can use it on any commit, but it tends to become harder to use the older a commit is, as git has to try to revert it relative to the current repository state, not how it was when it was committed. Thus sometimes revert will fail and tell you human intervention is required.

Final thing to note, within git-gui, go to “Edit/Options”, there is an option “Additional Diff Parameters”, paste --color-words into it. This will cause git gui to highlight changed words instead of changed lines, making it much easier to see what actually changed. It only has to be done once per repository.

sujato · July 13, 2015, 8:07am

Thanks so much, very clear and useful. Everything works as specified!

How’s other things going? Are we able to test Pootle with the full project yet?

sujato · July 13, 2015, 8:34am

A few things regarding Terminology. Some time ago you mentioned that the terminology was meant to be produced as the translation went on, but I don’t think this is right. On the Pootle site it seems as if you should simply create a Terminology PO file and add it, which is what we are doing.

More important, I am wondering whether we could get the Terminology lookup to play nice with multiple definitions. Currently only the final definition is shown. Could we use the same trick we did on the Pali file to disambiguate identical terms?

blake · July 13, 2015, 9:20am

It seems the recommended way is to use brackets after the term to disambiguate, something like
duck (bird)
duck (action)

However I’m not sure how useful terminology is going to be just because pootle isn’t going to do a very good job of understanding pali declensions and conjugations. It will be something to work on but probably for the public Pootle server. As far as I can tell from the source code, the Pootle terminology is practically hardcoded for the english->other language case - it’s not surprising since that is overwhelmingly the use case for pootle.
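The bracket convention above can be read mechanically. The following Python sketch splits an entry such as "duck (bird)" into a term and a context and groups all definitions under the bare term, so no definition overwrites another. It does not reflect how Pootle stores terminology; `split_term` and `build_glossary` are hypothetical names, and the glosses are placeholders.

```python
import re
from collections import defaultdict

# A term, optionally followed by a context in round brackets.
TERM_RE = re.compile(r"^(?P<term>.*?)\s*(?:\((?P<context>[^()]*)\))?$")


def split_term(entry):
    """Return (term, context); context is None when there are no brackets."""
    match = TERM_RE.match(entry.strip())
    return match.group("term"), match.group("context")


def build_glossary(pairs):
    """Group (entry, gloss) pairs by bare term, keeping every definition."""
    glossary = defaultdict(list)
    for entry, gloss in pairs:
        term, context = split_term(entry)
        glossary[term].append((context, gloss))
    return dict(glossary)


glossary = build_glossary([("duck (bird)", "gloss A"), ("duck (action)", "gloss B")])
print(glossary)
```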

blake · July 14, 2015, 11:48pm

Having thought on it, we already have all the code for understanding pali in
javascript. I have implemented something I long dreamed of, pali popup with
the ability to enter your own definition. You should be able to git pull and try
it out right now.

With that working, I could just completely replace Pootle’s built in terminology, so if you have entered terms into the pali lookup custom glossary, they would appear in the side as appropriate.

sujato · July 15, 2015, 12:17am

Sorry, try it out where: SC or Pootle? I can’t see anything on either staging or in palilookup2.0 locally; in fact the look up isn’t working properly at all. Looking at Github, maybe you’ve forgotten to push the changes?

Also, for the Terminology I am currently using, I essentially took the list of terms used by ven Bodhi in his most recent translations, and changed the ones I wanted. To help increase the hit rate I used the highly sophisticated methodology of removing the last letter from every Pali word. Subtle, right? If we’re going to do something slicker, we’d be better off, I presume, using the full word as a basis, so in case it is useful, here is the file before the endings were chopped off.

BB_terms_last.csv.zip (11.3 KB)

blake · July 17, 2015, 7:55pm

Okay here are the instructions for getting pali lookup and other pali goodies integrated with Pootle. Cross your fingers when running these! Because the techniques are kind of dodgy, but it should work!

First of all, I put the scripts and stuff under version control on github, so you need to update your pootle folder to be under version control; because we’re initiating in a non-empty folder, things are a little different to usual, but just run the following:

cd ~/pootle
git init
git remote add origin git@github.com:suttacentral/pootle.git
git fetch
git checkout -tf origin/master

~/pootle should now be a git repository, and should have grown some new files, such as patch.sh and the folder patches

Now just run

./patch.sh

This should have updated pootle with some changes to the python and javascript. The git repository and patch.sh infrastructure should make it straightforward to make further changes. Any update simply entails a git pull and ./patch.sh. Oh, and run ./start.sh to bring the pootle server back up.

And before I forget, there is one other thing you should run, go to the SuttaCentral folder, git pull both there and in /data, then run this in the suttacentral folder:

python -c "from sc.search import glossary; glossary.load()"

This will load the glossary into elasticsearch from data/dicts/en/bb_glossary.json, which is simply the csv file converted to JSON. You could alter it and re-run the above command to change the glossary, but do note that JSON is absolutely finicky about the double quotes, brackets and commas all being perfect.
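Before re-running the load command after a manual edit, it may help to check that the file still parses. This sketch uses only Python's standard json module; the function name `check_json` is hypothetical, and the sample strings are made up, since the actual structure of bb_glossary.json is not shown in this thread.

```python
import json


def check_json(text):
    """Return None if text is valid JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Single quotes and trailing commas are both rejected by JSON.
print(check_json('{"term": "gloss"}'))
print(check_json("{'term': 'gloss'}"))
print(check_json('{"term": "gloss",}'))
```

To check a file on disk, read it as UTF-8 text and pass the contents to the function.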

The basic additions to pootle are as follows:

Po files now have the msgstr lines globbed into a single line.
Tables can now be sorted using numerical sort, you might need to click the name column once to activate this (unfortunately, this sorting is merely at the user interface level, internally pootle will still consider the files to be in alphabetically sorted order, ultimately this is due to the SQL database)
Ascii quotes will now be automatically smartened as you type, à la LibreOffice. This only happens for " and '. If it ever makes a mistake, just put in the correct unicode mark, and it won’t try to change it. There is no way to enter a " or ’ other than the &dquo; form.
Pali Popup Dictionary now works for pali source.
For pali source, terminology is now powered by Elasticsearch. It understands pali moderately well, and handles duplicates gracefully.
You can add your own terminology via the lookup popups, “gloss” is what gets displayed in the side, along with context if applicable. You can overwrite an existing glossary entry by using the identical term and context. Note you may edit the term into a standardized form or even something completely different. This glossary is stored in elasticsearch, marked separately to that from the bb_glossary.json.

sujato · July 17, 2015, 9:44pm

Fantastic, all goes well up to

python -c "from sc.search import glossary; glossary.load()"

which gives me the following error

sujato@sujato-UX31A:~/suttacentral$ python -c "from sc.search import glossary;glossary.load()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'glossary'

blake · July 18, 2015, 3:41am

That would imply the repository isn’t up to date or set to the right branch, the complete command sequence would be:

cd ~/suttacentral
git pull
git checkout palilookup2.0
python -c "from sc.search import glossary; glossary.load()"

If it complains bb_glossary.json doesn’t exist, git pull the data repository.

sujato · July 18, 2015, 5:47am

Okay, that fixes it, I was on master.

Now I am back at the Buddhist Library, and for the next few days can work on this. It is working well on my Desktop. In fact it is completely awesome; as usual you take my unrealistic expectations and just blow them out of the water! Now I play…

sujato · July 18, 2015, 6:56am

After working with it for a little while, all is going great.

One minor bug: the quote marks inside HTML are getting correctified; not the structural “Locations” HTML, but the inline stuff.

Other than that, the only minor annoyance is that the popup is a little finicky at times. When you expand a dictionary entry, it frequently disappears; and sometimes it stays around for too long. Not easy to fine tune this, I know.

In case it is of use, I attach the latest PO file for my MN10 translation.

mn10.po.zip (16.6 KB)

A few more buggish things, some of which I’ve mentioned before, just keeping them on the radar.

I get garbled inline markup like this:
<span class="brnum">(2)span>
which seems to indicate a bug in sc-html2po.py.
At times Pootle hides some strings. I’m not sure why, but it has to do with the whole “unique string” thing. However it doesn’t hide all repeated strings, just from time to time. The strings are there, you can just click “More” and they are there. So it’s not a critical problem, but it would be easy to overlook this and leave some strings untranslated. So it would be better if there was a setting to say “show all by default”.
Another minor detail: when you simply accept a string with translation, which typically happens when there is inline numbering, Pootle registers this as “needs work”, which is kind of misleading. It would be better if these were simply accepted as translated, or else called “Accepted” or something.
Spacing is still not happening correctly with sc-po2html.
There is a bug in the quote converter. In certain cases the cursor will jump to the end of the passage. For example, if I have:

A few words, like this.

And I want to insert a highly grammatical quote mark here:

A few word’s, like this.

When I put in the quote the cursor immediately leaps to the end of the line.

sujato · July 19, 2015, 11:29pm

So from what I can see we are very close to having a final product. A few remaining things.

I still don’t know an elegant way to add the entire canon to pootle, preserving folder structure. I can only do folder by folder. Ideally we simply deploy the server online in the end, with all the Pali texts pre-loaded.

A little more wishful, but I’m wondering whether we can hack the add-term bar to improve the Pali lookup generally. We’re getting a very high hit rate with the Pali lookup, but there will of course be some terms that are missing or misread. I’m wondering whether it is possible to add such terms directly via the term-bar. What I am imagining is creating an entirely separate new Dictionary for SC, which will contain only such terms as are missed by the ones we have. So, maybe put another “add” button on the term-bar. The existing function stays as it is, you can add terms. But if you want to add a Dictionary entry, the term is there, the gloss is there, then hit the second Add button. That goes into a separate data file, presumably JSON, which is then globbed up by elasticsearch as an extra dictionary.

Incidentally, and this is for further development, now that the hit rate for Pali is so good, the problem for the next gen is to pare down the results to the relevant ones: not an easy task.

I’m getting mn1, mn10, mn11…

Also, just to let you know, Ayya Arannadevi has been editing the DPPN, ensuring that the spelling of names is in line with our Pali text. She’s close to completing, so this should be ready soon. The reason for this is that variation in name spelling is often greater than for words generally, and often unpredictable (eg “Kevaṭṭa” vs. “Kevaddha”). So this should make the DPPN work better.

When she is finished, and if she wants to do some more work, do you have anything that might be good? I was thinking of suggesting she proofread the PTS dict for typos: it seems to be pretty ragged in things like punctuation and so on.

blake · July 23, 2015, 3:15pm

I’m not sure what you mean by preserving folder structure. I don’t think pootle supports it. It’s very easy to regenerate the folder structure as it is completely deterministic.

I have tried loading the whole canon at once into Pootle, and it did work. However it took a long time to process, and my feeling is it was pushing the limits of a single project in Pootle, with high potential for stability/performance problems.

I would probably recommend treating the divisions individually, here are the commands you could run to do that:

cd ~/suttacentral/data
sc-html2po.py --out ./po/dn  ./text/pi/su/dn/*.html
sc-html2po.py --out ./po/mn  ./text/pi/su/mn/*.html
sc-html2po.py --out ./po/sn  ./text/pi/su/sn/*/*.html
sc-html2po.py --out ./po/an  ./text/pi/su/an/*/*.html

That puts each division in a folder under data/po, compress each individual division folder into a zip.
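The compression step can also be scripted. This is a minimal sketch using Python's standard shutil module, assuming the layout described above (one folder per division under data/po); the function name `zip_divisions` is hypothetical.

```python
import os
import shutil


def zip_divisions(po_root):
    """Create <division>.zip next to each division folder under po_root."""
    po_root = os.path.abspath(po_root)
    archives = []
    for name in sorted(os.listdir(po_root)):
        folder = os.path.join(po_root, name)
        if os.path.isdir(folder):
            # base_dir keeps the division folder as the top level inside the zip.
            archives.append(
                shutil.make_archive(folder, "zip", root_dir=po_root, base_dir=name)
            )
    return archives
```

Calling zip_divisions on the po folder would leave dn.zip, mn.zip and so on beside the division folders, each one ready to upload to its own project.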

You can try loading them all into a single project, and it does work, but I’m not sure it’s the best approach in terms of organization, and there would be potential for problems. I would recommend creating a separate project for each division (i.e. dn-suttas, mn-suttas…), so as to not push the limits of what pootle can handle in a single project. This won’t negatively impact translation memory, as translation memory operates independently of project.

I can probably disable the fuzzy results in the case where there is a conjugated match. The fuzzy results are particularly helpful for verses.

It should work if you click on the “name” heading on the table - it is delivered in alphabetically sorted order, but will be sorted numerically if you ask it to sort by name. Pootle seems to remember your choice of sort column.
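The difference between the two orderings can be shown in a few lines of Python. The patched sort itself lives in Pootle's javascript and is not reproduced here; `natural_key` is a hypothetical helper that treats runs of digits as numbers.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so that mn2 sorts before mn10."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


names = ["mn10", "mn1", "mn11", "mn2"]
print(sorted(names))                   # alphabetical order
print(sorted(names, key=natural_key))  # numerical order
```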