Pootle for translation

blake · July 24, 2015, 8:25pm

I have tuned this in the latest version. Basically, for about 1s after the popup is created, it is immune to being dismissed. The same technique is applied to expanding an entry. Seems pretty effective.
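A rough sketch of the grace-period idea, written in Python rather than the JavaScript the lookup actually runs in. The `Popup` class, the `request_dismiss` method and the injectable `clock` are all invented for illustration; this is not the palilookup code, only the general technique of ignoring dismissals shortly after creation.

```python
import time


class Popup:
    """Toy popup that ignores dismissal for a short time after creation."""

    def __init__(self, grace_seconds=1.0, clock=time.monotonic):
        self._clock = clock                # injectable so it can be tested
        self._created_at = clock()         # moment the popup appeared
        self._grace_seconds = grace_seconds
        self.visible = True

    def request_dismiss(self):
        """Hide the popup unless it is still inside its grace period."""
        if self._clock() - self._created_at < self._grace_seconds:
            return False                   # too soon: ignore the dismissal
        self.visible = False
        return True
```

The same guard could presumably wrap the "expand an entry" action as well, since the post says both use the same technique.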

The po files are fine, it’s actually the fault of the palilookup trying (and failing) to mark up the already marked up stuff. Fixed.

Let me know if you can pin down this issue.

You’re right about it being misleading - to add to the confusion, in the source code it’s called “fuzzy”. It seems to be a catch-all for any kind of condition which might make the translation less than optimal, and they didn’t know what to call it. I modded it so that “needs work” is only checked if there are alphabetical characters in the text outside of HTML tags. If there are only numbers, punctuation and HTML tags, it doesn’t check “needs work”.
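The check described above might look roughly like this in Python. The function name `has_translatable_text` and the simple regex for tags are assumptions made for the sketch; the real Pootle modification may well be implemented differently.

```python
import re

# Naive tag matcher: good enough for simple inline markup, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def has_translatable_text(source):
    """Return True if the text outside HTML tags contains at least one letter."""
    text_only = TAG_RE.sub("", source)
    return any(ch.isalpha() for ch in text_only)
```

A string made up only of numbers, punctuation and tags returns False here, so in this sketch it would never be flagged as "needs work".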

Fixed.

Also in the new version there are some more goodies. The matching is more accurate as I fixed some problems with some of the suffixes not being matched. In the case where a strong (conjugated) match is found, the fuzzy matches are hidden by default. The fuzzy matches are normally wrong but can be useful particularly for verses. There is a control to reveal the fuzzy matches. Also, when used in Pootle only, fuzzy matches show which part of the word had to be modified to get a match. It only works in Pootle because I reused the pootle javascript responsible for showing the differences in suggestions.

Updating should be a simple matter:

cd ~/pootle
git pull
./patch.sh
./start.py

sujato · July 25, 2015, 12:56am

Okay thanks, I’ve updated and everything seems to go smoothly. However I’m not seeing most of the improvements you’ve spoken of, either they haven’t been pushed properly or not updated properly at this end—but all seems good here.

More specifically:

I can see the fuzzy matching on the Pali lookup on SC at palilookup2.0 branch, but not on Pootle.
The quote correction bug is fixed.
The “needs work” correction doesn’t seem to have done anything, it behaves as before.
I still get <span class="brnum">(1)span>. I just noticed that it displays correctly in the original text, then changes to the buggy form on hover.
popup still disappears on Pootle, on SC it is fine.

I have had some more success in uploading projects. For some reason pootle doesn’t seem to like it if you just have “mn” or whatever as the project name, but “mn-suttas” works better.

It’s still churning over things a lot, and maybe we’d be better off dividing them into subdivisions; but then, I’m on my lappie, so not as powerful. Ultimately, though, we are still displaying a list of over a thousand items at once, and that is going to stretch the capabilities of any older or less powerful system. So maybe you could write a script that would keep the PO files in their subdivisions?

No, it’s still sorting mn1, mn10, mn11…
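For reference, the ordering problem here is the usual plain string sort, where "mn10" comes before "mn2". A common remedy is a "natural" sort key that compares digit runs as numbers. The sketch below is generic Python, not Pootle's own sorting code, and `natural_key` is a made-up name.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so mn2 sorts before mn10."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd positions.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["mn10", "mn1", "mn11", "mn2"]
print(sorted(names, key=natural_key))
```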

sujato · July 27, 2015, 12:14am

A couple of further thoughts, these are not bugs but feature ideas.

We discussed earlier the idea of maintaining the HTML in parallel for Pali and English in cases where the HTML needs editing. It occurs to me that we were maybe making that harder than it needs to be. Can we not simply mod sc-html2po to output both Pali and English? Then I can edit by hand the PO files, no changes to the interface required (it won’t happen very often, I think). At the end of the project we spit out the two languages, confident that they are exactly in sync, and that I haven’t klutzily edited the Pali and English differently. Of course, this would apply only to the English version, and it would not have to be functional until we are ready with the final text.

Here’s a second, unrelated idea, connected with the notion of improving our hit ratio for the Pali lookup. There are quite a few cases where the lookup doesn’t work initially, but the “click and choose how to break up the word” produces correct results. Would it be possible to create an extra JSON terminology file that was populated by a click on that widget?

So, just now, it didn’t recognize “lobhadhammopi”. I click on the term, opening the selection widget. When I hover at the start, it correctly gives me “lobha”. Click! Moving the cursor down, it correctly gives me “dhamma”. Click! Then at the end it correctly gives me “pi”. Click! Those three clicks generate the JSON data for that term, and the next time it comes up, I get the entries for those three words.
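As a sketch of the data this idea implies: each click records one accepted part against the compound, and a later lookup of the same compound returns the stored parts. The `CompoundMemory` class and its method names are hypothetical, and the JSON layout is just one plausible shape, not a format the lookup is known to use.

```python
import json


class CompoundMemory:
    """Remembers how a compound was broken up by the user's clicks."""

    def __init__(self):
        self._breaks = {}  # compound -> list of accepted parts, in click order

    def accept(self, compound, part):
        """Record one clicked part for a compound (duplicates are ignored)."""
        parts = self._breaks.setdefault(compound, [])
        if part not in parts:
            parts.append(part)

    def lookup(self, compound):
        """Return the remembered parts, or an empty list if none were stored."""
        return list(self._breaks.get(compound, []))

    def to_json(self):
        """Serialize the memory so it could be saved as a terminology file."""
        return json.dumps(self._breaks, ensure_ascii=False, sort_keys=True)


memory = CompoundMemory()
for part in ["lobha", "dhamma", "pi"]:
    memory.accept("lobhadhammopi", part)
print(memory.to_json())
```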

And a third idea. As time goes on the number of translation suggestions increases, and sometimes it’s a lot of noise. For example, if you misspell a word, then correct it later, the misspelled version will keep coming back as a suggestion. I wonder if there’s a way that these can be filtered, either by gradually dropping off unused suggestions, or by having a X on the suggestion field, click it and the suggestion is deleted.

blake · July 31, 2015, 8:10pm

With performance, on my desktop I have no problems whatsoever with division length projects. So I think for the desktop it will be fine.

I’ve pushed a fix for the “needs work” bug - it should work properly now. Please pull and run patch.py
There is also a fix in the suttacentral repository for the whitespace problems in sc-po2html

Can we not simply mod sc-html2po to output both Pali and English?

In principle, we could. Note that at the moment sc-po2html returns somewhat abbreviated HTML: it strips out some (most) of the paragraph numbers and most of the notes. If it kept everything, then it could be done exactly as you suggest. Alternatively, as long as the alphabetical characters of the pali don’t change (much), it’s not hard to reconcile the full pali text with the msgids, even if the msgids break into different sentences than they used to.

Note it is possible to modify what is stripped out, sc-html2po.py --help shows the options, but in brief the defaults are:

--strip-tags  ".var, .cross, .ms, .msdiv, q.open, q.close"
--strip-trees "#metaarea"

The metadata should be the same for all texts so is trivial to reintroduce. The paragraph markers can add quite a bit of comment noise, and aren’t really desirable to keep in the translation. They are also trivial to reintroduce to the pali as it’s simple enough to make a correlation mapping.
The variant notes are the trickiest to reintroduce. They probably aren’t meaningful to keep in the translation but could be useful during translation. So if you wanted to generate english and pali from the po files, it’d make sense to keep the variant notes and cross references, and just not keep them in the translation.

Those three clicks generate the JSON data for that term, and the next time it comes up, I get the entries for those three words.

It’s a clever idea. I’ll see if I can turn it into something workable.

And a third idea. As time goes on the number of translation suggestions increases, and sometimes it’s a lot of noise. For example, if you misspell a word, then correct it later, the misspelled version will keep coming back as a suggestion. I wonder if there’s a way that these can be filtered, either by gradually dropping off unused suggestions, or by having a X on the suggestion field, click it and the suggestion is deleted.

If you’re talking about the translation memory suggestions, you can obliterate the unused ones by running reset.py, the translation memory will then be rebuilt from the existing po files. It could even be set to run automatically from time to time. The way the translation memory is built, deletion wouldn’t stick unless the memory is truly unused. It might be interesting to have some kind of way of tracking how “popular” a memory is, like you might knowingly use an exceptional rendering due to very unusual context.
For now though, reset.py will wipe out unused suggestions, and modifications would be best left for pootle2.7, where the translation memory is built into pootle and uses elasticsearch, rather than being done externally with amagama/postgresql.

sujato · August 1, 2015, 9:16am

Thanks for the fixes, I’ll test them over the next few days.

And we could filter out unnecessary tags on export to English, yes? Unless this creates inordinately long comments, it seems to me this would be the best way. Then I can be confident that the Pali and English are exactly parallel. But if you prefer to reconcile the texts later, I’ll leave that up to you. I must say, though, that I feel more confident doing it by hand, especially when there is so much repetition and so on in the texts.

Let’s go with this, unless you have any great objections. Oh, and

Cannot die too soon.

sujato · August 10, 2015, 6:59am

Okay, just checking the latest changes, still testing.

Just to note, some of the unfixed things I noted earlier are still unfixed: they work on palilookup2.0 for SC, but not on Pootle. These are the “fuzzy” thing for lookup matches, and the “disappearing popup for expanded entries” bug. Another little inconsistency is that the lookup on SC uses serif for expanded entries, but sans on Pootle. In this case, the sans is correct.

Another minor bug with the palilookup2.0: the popup is now too sticky. It seems to remain until you hover over another word. It should remain for a second or two after leaving hover, then fade. The exception is when you’re moving to another word, in which case it should disappear immediately, as it does now.

blake · August 12, 2015, 8:20pm

Okay, new version is up, both of the palilookup2.0 branch and of pootle. Upgrade pootle as usual:

cd ~/pootle
git pull
./patch.sh

This version features the requested compound break memory:

Click on a term, and the terms will have a little tick next to them. Click the tick and it will remember that term belongs to that compound.
In future, the checked terms will show as normal results for that compound.
You can uncheck by clicking on the tick again.
It understands conjugations, but if you want to modify an entry by unticking, it must be done to exactly the same compound form as it was originally accepted on.
The data is stored in elasticsearch. There is no way to edit it manually, but you can get a dump of it at http://localhost:9200/pali-lookup/compounds/localuser

I was skeptical of this feature when asked to implement it, but I like it a lot.

All features now should be operational in Pootle (I had it requesting an obsolete script file), and I also tuned the popup stickiness somewhat, so they should more reliably disappear when you don’t have the mouse inside them.

sujato · August 14, 2015, 1:06am

Thanks so much. I have tested it briefly, and all features work as advertised.

We’ve created a Pali-translating monster, which I hope will chomp up Pali and spit out excellent translations for a long time!

At the moment I’m stuck in Kaohsiung. Travel to Qimei is by plane, which flies every day but is booked months in advance; by freighter, which island-hops to eventually get there; or by ferry, which normally goes once a week but was cancelled due to the typhoon. So I’m waiting for the next ferry, on the 18th. When I get there, Dustin will be with me and he’ll have the internet turned on for a few weeks, after that I’ll be in isolation. But because of these delays, it’s not very long before I’m in Europe anyway!

sujato · August 19, 2015, 7:08am

Hey @blake I’m now at Qimei, it’s amazing, perfect! A very quiet location, nice small house. Stinking hot, but air-con so no worries. Dustin and family are here for a while to set things up.

My desktop and all is fine, everything is going, and is all up-to-date. One bug with the system: Amagama doesn’t seem to be adding new TM strings. Past translations are showing up, but not anything I newly do.

I don’t know if it’s related, but when using ./start.sh I get:

File "/usr/lib/python2.7/dist-packages/translate/misc/wsgiserver/wsgiserver2.py", line 1844, in start
raise socket.error(msg)
socket.error: No socket could be created

Neither this error nor the bug is showing on my laptop.

blake · August 19, 2015, 10:04am

That’s a simple one, it means the server is already running. Or technically, it can mean that any service is running on port 8000, if you point your browser at localhost:8000 you’ll see what is running.
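If you would rather check from a script than from the browser, something like the following could tell you whether anything is listening on the port. It is a generic Python sketch and not part of the Pootle scripts; `port_in_use` is a made-up helper name.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8000 busy:", port_in_use(8000))
```

This only says that some service holds the port; it cannot tell you whether that service is Pootle.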

sujato · August 19, 2015, 12:31pm

Not so simple: looks like the two things are unrelated. Yes, there was already an instance of Pootle running, I must have set it to run on startup. But that doesn’t affect the TM bug. I have tried killing all Pootle, git pulling, running patch.sh, then start.sh, still nothing…

Incidentally, there’s a mistake in the elasticsearch github wiki: the instructions for running as a service are out of date, due to Ubuntu’s switch to systemd. This works.

And another bug I just noticed: most of the navigation shortcuts don’t work in Pootle. I’ve tested FF and Chrome, and the only ones that work as advertised are ctrl+up and ctrl+down.

blake · August 20, 2015, 10:18am

Okay, there was a bug preventing discovery of .po files in subdirectories. So git pull, also run reset-tm.sh for good measure - it clears and reinitializes the amagama database. It should work then (assuming you are using subdirectories).
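To illustrate the kind of bug this was: a search that only looks at the top level of a project directory misses files in subdivisions, whereas a recursive search finds them. The sketch below uses Python's `pathlib` and is only an illustration; `find_po_files` is a hypothetical name, not the function that was actually fixed.

```python
from pathlib import Path


def find_po_files(root):
    """Return every .po file under root, including nested subdirectories."""
    # rglob descends into subdirectories; glob("*.po") would only see the top level.
    return sorted(Path(root).rglob("*.po"))
```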
The cleanest way to restart the server and all associated services is:

killall start.sh
./start.sh

Incidentally, there’s a mistake in the elasticsearch github wiki: the instructions for running as a service are out of date, due to Ubuntu’s switch to systemd.

Technically the instructions are not exclusive to the latest release of Ubuntu, certainly they should also be applicable to the latest Ubuntu LTS and Debian 7. Looks like Ubuntu 15.04 and Debian 8 use systemd.

(edit: I just checked, the instructions for Debian/Ubuntu should be valid regardless. update-rc.d is init system agnostic - it wasn’t easy to google the information, but reading the update-rc.d script itself makes it clear that it updates systemd on Ubuntu 15.04)

Note: The Elasticsearch ppa repository seems to be primarily targeted at Debian which makes sense, because if you’re going to run a server which only runs elasticsearch (the normal setup), you’d run it under Debian for maximum stability, or else Ubuntu server LTS. Some versions seem to have problems under Ubuntu. My experience is Elasticsearch 1.7 doesn’t have any problems with Ubuntu, but that 1.5 and 1.6 potentially could.

My testing in Chrome:

Works:
ctrl+enter (submit and advance)
ctrl+space (toggles “needs work”)
ctrl+up/down (moves to next/prev entry)
ctrl+shift+s (search box)

Does nothing:
ctrl+shift+home/end
ctrl+shift+,/.

Intercepted by Chrome UI:
ctrl+shift+pgup/pgdown (moves chrome tab)
ctrl+shift+n (creates incognito window)

Are there some particular shortcuts or key bindings you want? They should be very easy to hack in, particularly if it’s just binding an existing function to a key.

sujato · August 20, 2015, 10:49am

Thanks, I’m starting it now, will let you know in a few if it works.

Re the shortcuts, basically the “does nothing” and “intercepted” are the ones I want.

sujato · August 20, 2015, 11:12am

Okay, everything seems to be good now, thanks so much.

There was a new problem, though. I encountered a mistake in the Pali. yonisomanasikāra was spelt as yoniso—manasikāra, so sc-html2po put the second part on a new string. I fixed it in the Pali, and also corrected it by hand in the po file, by deleting the extra string. I didn’t change any of the msgctxt numbers.

Well, this caused the whole thing to freak out, which coincidentally happened at the same time I got your fix. After that, the database wouldn’t connect, the whole thing hung, restarting, rebooting, nothing helped. Only a hard reset—actually a couple of them—eventually sorted it out.

Anyway, that was bad. But what I need is a safe way to make changes to the pali text. We’ve talked about it before, but I don’t think we ever settled on a foolproof method.

O, and another thing: I can’t figure out what’s up with the Terminology file. It works fine, but I can’t figure out where it lives.

blake · August 20, 2015, 11:17am

The main terminology file is suttacentral/data/dicts/en/bb_glossary.json, each line consists of three things, pali, english and context - the context string is always empty because of the source data, but you could put something in it.

If you want custom terminology, it currently lives only in elasticsearch.

sujato · August 20, 2015, 11:18am

Okay, so how to edit? Just write in the JSON file? I’m nervous after my recent experience! But I have tried it, and it doesn’t update the server: is there a way to do that without restarting everything?

blake · August 20, 2015, 12:16pm

Do a git pull in suttacentral

I’ve added a task

invoke update_glossary

It does the normal things the server will do, but reading bb_glossary.json and loading it into elasticsearch is the only thing it does, so errors won’t get buried. If there is a syntax error in the JSON, it will barf and tell you the line number - not necessarily the line where the problem is, but the line where it ceased to be parsable as JSON.

Note that JSON is very particular about the commas.
In bb_glossary.json, the structure is a long array of entries, each line is an array of 3 strings. In an array, every element must be separated by a comma, and there must not be a trailing comma after the last element in an array:
This is valid JSON:
[1, 2, 3] (An array of numbers)
This is invalid JSON:
[1, 2, 3,]
Commas are used strictly to separate two things. Because 3 is the last element in the array, the comma isn’t separating anything, so it is invalid syntax. (Note that [1, 2, 3,] is valid in both Python and Javascript, because it’s generally considered that this level of strictness is excessively anal, but JSON is that strict, in part to keep it very simple and unambiguous.)

So every entry/line should have exactly this form:
["pali", "english", "context"],
Except the very last entry, which shouldn’t have a trailing comma, instead the closing bracket for the array of entries follows:

```
["pali", "english", "context"]
]
```

Suffice to say that quoting is also strict. All text strings must be double quoted. A literal double quote within a string must be escaped `\"` , most likely your text editor's syntax highlighting will expose quoting errors.

Running `invoke update_glossary` will point out syntax errors, but there are also non-syntax errors. For example, if you did something like this:
```
["pali", "english", "context",
"pali", "english", "context",
"pali", "english", "context"],
```
then what it will see is a single array of 9 entries. This is valid JSON, but when my program processes the array, it only expects 3 entries. So you will probably get an exception saying something like
`ValueError: too many values to unpack (expected 3)`

This should be enough to keep the format correct. As emphasized, the most surprising gotcha will be the trailing comma.

If there are no errors, it should be loaded into elasticsearch just fine.
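If you want to sanity-check an edited file before loading it, a small standalone check along these lines would catch both kinds of error described above. This is a sketch using Python's standard `json` module; `check_glossary` is a made-up helper, not the code behind `invoke update_glossary`, and the sample entries are placeholders.

```python
import json


def check_glossary(text):
    """Parse glossary JSON and verify each entry is a list of three strings."""
    entries = json.loads(text)  # raises json.JSONDecodeError on bad syntax
    for index, entry in enumerate(entries):
        ok = (isinstance(entry, list) and len(entry) == 3
              and all(isinstance(item, str) for item in entry))
        if not ok:
            raise ValueError(f"entry {index} is not a list of 3 strings: {entry!r}")
    return entries


good = '[["pali", "english", "context"], ["pali", "english", "context"]]'
print(len(check_glossary(good)), "entries look fine")

try:
    # Trailing comma after the last entry: rejected by the JSON parser.
    check_glossary('[["pali", "english", "context"],]')
except json.JSONDecodeError as err:
    print("syntax error at line", err.lineno)
```

As noted above, the reported line number is where parsing stopped, which is not always the line containing the actual mistake.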

sujato · August 20, 2015, 12:30pm

Okay, great thanks. Except I’m still not seeing the changes… What exactly do I need to do?

Also, and I don’t know if this is relevant to anything, in my local SC I am getting the following:

[2015-08-20 12:33:40] ERROR sc.search.indexer Oops pali-lookup, {'aliases': {}}
[2015-08-20 12:33:40] ERROR sc.search.indexer Oops pi2en-glossary, {'aliases': {}}

blake · August 20, 2015, 1:34pm

With further investigating of the shortcuts, apparently the pagedown and pageup ones aren’t actually implemented in the code so it’s kind of irrelevant that chrome intercepts them. The home/end ones also seem to be suffering some kind of existential crisis. Presumably all the missing shortcuts are missing for good reason.

Many of Chrome’s keyboard shortcuts can be wrested away by javascript and suppressed, including ‘F5’ or ‘ctrl-r’ for reloading a page. However the ones bound to manipulating/changing/creating tabs and windows cannot be intercepted at all and are never even seen by the javascript.

Basic pageup/pagedown and in conjunction with SHIFT can be captured. Any home/end combination can be captured, although they do have useful functions within a textbox.

Considering these constraints and sticking to seemingly unused key combos I have implemented:
Go down by 10: ctrl-shift-down (parallels ctrl-down to advance by 1)
Go up by 10: ctrl-shift-up
Go to first entry: shift-pageup
Go to last entry: shift-pagedown
Select goto entry box: ctrl-shift-g (rebound from ctrl-shift-n)

Let me know if there are more shortcuts you want implemented.

blake · August 20, 2015, 1:43pm

It should be a matter of editing the bb_glossary.json, adding entries according to the existing form.
Then run invoke update_glossary, if that runs with no errors, then the entries should appear.

The mentioned errors are spurious and should be harmless, but I’ll fix the code so they go away.

edit: I fixed another possible issue with the glossary, and just pushed a fix.