Pootle for translation

blake · June 8, 2015, 1:26pm

I have been investigating Pootle. It is a tool both for performing translation, and for managing translation projects.

This is the Pootle thread. It will mainly be of interest for @sujato, but will also serve as a record on getting things working.

In the future we will probably set up a pootle server for ease of doing translations and coordinating translation efforts.

blake · June 8, 2015, 2:43pm

Pootle Install Guide for Ubuntu/Debian (Local server)

Some dependencies are listed as ‘optional’ only because they are difficult to install, yet they are essential for full functionality. We are therefore going to use the system Python and install those dependencies using apt-get.

sudo apt-get install python-virtualenv python-lxml python-levenshtein python-lucene

Next we create a virtual environment using virtualenv, configured so that it can access the system packages:

mkdir ~/pootle
cd ~/pootle
virtualenv --system-site-packages -p /usr/bin/python2.7 env
source ./env/bin/activate

Finally we can install and configure pootle

pip install Pootle==2.5.1.3
pootle init
pootle setup
pootle createsuperuser

It is best to run the above commands individually as they might ask questions. Enter a username and password of your choice for the createsuperuser step.

Run the server like this:

cd ~/pootle
source ./env/bin/activate
pootle start

Point your browser to http://localhost:8000. Expect the first page load to take a long time (~1 minute) as it builds assets. If it dumps loads of error messages into the terminal, then something went terribly wrong.

If everything went well you can now log in with your admin account.

Running the server automatically

Run the following code in the terminal to create a ‘daemonize’ script which can start the server automatically.

echo '#!/bin/bash

HOME='$HOME'
cd $HOME/pootle
source env/bin/activate
pootle start' > daemonize-pootle.sh
chmod +x daemonize-pootle.sh

This script can be added to startup applications so it starts when you log in.

Overall this installation should be adequate for a single user. None of the performance optimizations are installed but this is highly unlikely to matter when only one user is using it.

Upgrade to MySQL

To improve performance, it is recommended to use MySQL. Although in places Pootle claims to support PostgreSQL, it does not!

The procedure to use MySQL is quite simple.

Save your data (optional)

Assuming you already have pootle set up the way you like it with SQLite, you can save the data in this way:

cd ~/pootle
source env/bin/activate
pootle dumpdata -n > ./data.json

We’ll make use of data.json after installing and configuring MySQL.

Install MySQL-server

sudo apt-get install mysql-server

I recommend reading and following the instructions here for installing and securing MySQL, especially under the “Initial Setup” section - the rest doesn’t really matter.

Create account and database

First get a MySQL prompt:
mysql -u root -p
Enter the root password you set earlier when prompted.

Now enter the following commands:

CREATE USER pootle IDENTIFIED BY 'pootle';
CREATE DATABASE pootledb CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL PRIVILEGES ON pootledb.* TO pootle@localhost IDENTIFIED BY 'pootle';
FLUSH PRIVILEGES;
exit

Configure Pootle

Edit ~/.pootle/pootle.conf (consider making a backup copy first!). The DATABASES section should look something like this:

# Database backend settings
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        # Database name
        'NAME': 'pootledb',
        'USER': 'pootle',
        'PASSWORD': 'pootle',
        'HOST': 'localhost',
        'PORT': '3306',
    }
}

Now that Pootle is configured, install the MySQL driver and stop the running server:

cd ~/pootle
source env/bin/activate
pip install MySQL-python
killall start.sh

Initialize and load the data:

pootle syncdb
pootle migrate
pootle loaddata ./data.json

Loading the data might take several minutes.

Once done, run:
./start.sh

Pootle should now be running the same as before, except backed by MySQL instead of SQLite.

sujato · June 8, 2015, 9:42pm

If we can get this going that would be terrific. An online interface would solve any problems with installation, and it means that we can use the Pootle “alternative language” suggestions and other things. I’ll get this set up and we’ll see how we go.

sujato · June 8, 2015, 11:44pm

Okay, so I have the server up and running on my laptop. I didn’t have to do the whole “pip regression” thing. Next try to use it, not sure how to add projects. Could you segment the whole Pali text and add it as a project?

blake · June 9, 2015, 4:20pm

Adding texts is very easy. It’s in the admin panel. You first go to languages and add a language, then you go to projects and add a project. Then you go to the project you just created and add a translation language (e.g. English).

At that point you can go to the project (the normal, non-admin overview) and upload a zip file containing po files.
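
If the po files are spread over a directory tree, the zip for upload can be built with a few lines of Python. This is only a sketch: the function name and paths are made up for illustration, and whether Pootle expects a particular folder layout inside the archive is not something this thread confirms.

```python
import os
import zipfile


def zip_po_files(src_dir, zip_path):
    """Write every .po file under src_dir into zip_path, keeping relative paths.

    Returns the sorted list of relative paths that were added.
    """
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                if name.endswith(".po"):
                    full = os.path.join(dirpath, name)
                    rel = os.path.relpath(full, src_dir)
                    archive.write(full, rel)
                    added.append(rel)
    return sorted(added)
```

Calling zip_po_files with a source folder and an output filename produces the archive and returns what went into it, which makes it easy to check nothing was missed before uploading.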

One of the tricky things here is getting the translation memory working, as it relies on some external dependencies. In 2.5.x there is truly a proliferation of possible dependencies. This is eased somewhat in 2.7, which simply uses redis for caching and elasticsearch for translation memory. However, 2.7 is still in alpha, and it is difficult to set up because there are no release versions of it.
A further challenge is that the documentation is haphazard at best, and the documentation you find online is often for the wrong version of Pootle. Some of the most easily found configuration documentation is for 2.7, even though 2.7 is not released.
So I’m still working through finding the best way to get everything working.

blake · June 11, 2015, 3:51pm

I have updated the installation instructions to use a different method, which should result in a Pootle server which is 100% functional, including terms suggestion and translation memory.

If you followed the previous installation guide using pyenv, please run the following to undo it:

pyenv uninstall pootle
rm -rf ~/pootle

And then follow the new instructions.

sujato · June 11, 2015, 10:18pm

Great, I will get to this in the next day or two.

sujato · June 17, 2015, 11:23pm

Okay, now Pootle is installed on my desktop, all went smoothly. I’ve uploaded the Thig PO file, and the Pali terms, all works, except no TM.

sujato · June 19, 2015, 5:35am

Any assistance with the TM? It’s useless without it.

blake · June 19, 2015, 1:10pm

I was talking to a dev on the translatehouse IRC channel. Unfortunately, automatic local TM only works in Pootle 2.7. It is possible to set up local TM in Pootle 2.5 using the TM server amagama. One thing to decide with local TM is when a translation should be committed to memory. One complaint I’ve heard of Virtaal is that it has a memory like an elephant. A nice thing about amagama is you can easily tell it to forget everything it has ever learned, and then rebuild its memory from “clean” po files. So while installing and running amagama is a minor hassle, it does offer some impressive power.

Installing Amagama

These instructions assume Pootle has been installed in accordance with the previous instructions for installing Pootle.

Install postgresql

sudo apt-get install postgresql

Install amagama (into existing pootle python environment)

cd ~/pootle
source ./env/bin/activate
git clone https://github.com/translate/amagama.git
cd amagama/
pip install -r requirements/recommended.txt
pip install pathlib

Add an amagama database to postgresql

createdb -E UTF-8 amagama

Now edit amagama/settings.py and change the entry DB_USER to your username.

Because we are using Pali, we need to make amagama at least minimally Pali-aware. There is a list of language codes near the start of amagama/tmdb.py called CODE_CONFIG_MAP. Edit this list, adding the entry:

'pi': 'simple',

(We use ‘simple’ because this tells postgresql how to handle the language)

Things which can be done with amagama-manage

These steps are optional; I’ve attached some scripts which will do it all for you.

Before running amagama we need to export some paths:

cd ~/pootle/amagama
export PATH=$(pwd)/bin:$PATH
export PYTHONPATH=$(pwd):$PYTHONPATH

We can now initialize an amagama pali database:

amagama-manage initdb -s pi

At this point, amagama should be ready to remember translations. A po file (or files) can be added manually like this:

amagama-manage build_tmdb -s pi -t en -i sn56.11.po

If a directory is passed to ‘-i’ then the directory will be processed recursively and all po files loaded into tm.

If amagama is suffering from lots of bad memories, it is possible to nuke the database and return it to a clean slate:

amagama-manage dropdb -s pi
amagama-manage initdb -s pi

Getting Pootle to talk to Amagama

Edit ~/.pootle/pootle.conf and add the following two lines:

AMAGAMA_URL = 'http://localhost:8888/tmserver/'
AUTOSYNC = True

This tells Pootle where to look for TM suggestions, and also tells it to write .po files to disk immediately upon modification which is shortly going to come in useful.

Getting Amagama to function as Local TM

There is no built-in way for Pootle 2.5 to automatically send translations to amagama. Amagama is designed to be a vetted TM rather than indiscriminately remembering everything ever submitted.

As such I have written a python script which will scan the Pootle project folders for modified .po files, and load them into amagama.

Find in the attached archive pootle_scripts.tar.gz (1.2 KB) the following files

start.sh: Run this to start the amagama server, the pootle server, and remember.py. When closed (i.e. by ctrl-c) it will terminate all the servers it started.
reset.sh: Run this to restore the translation memory to a blank slate. It is fine to run this while the pootle/amagama servers are running.
remember.py : this is a script which synchronizes the .po files from Pootle with the amagama server. You don’t need to run this manually.

In brief, just extract those scripts to the ~/pootle directory, run reset.sh first to make sure the TM is in a good state, and then run start.sh. You can then navigate to http://localhost:8000 and should have a working Pootle server, with working translation memory.

start.sh supersedes daemonize-pootle.sh, which should be deleted. If pootle is running, kill it by running killall pootle.

To make pootle (and friends) start automatically, simply add start.sh to startup applications.
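
For anyone reading without the attachment, the polling idea behind remember.py can be sketched roughly as below. This is not the attached script, just an illustration under assumptions: the function names are hypothetical, `root` is whichever directory your Pootle writes its .po files to (check your own pootle.conf), and amagama-manage must be on the PATH as set up earlier.

```python
import os
import subprocess
import time


def find_modified_po_files(root, seen_mtimes):
    """Return .po files under root whose mtime changed since the last scan.

    seen_mtimes maps path -> last observed modification time and is
    updated in place, so repeated calls only report new changes.
    """
    modified = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".po"):
                continue
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if seen_mtimes.get(path) != mtime:
                seen_mtimes[path] = mtime
                modified.append(path)
    return sorted(modified)


def load_into_amagama(path, source="pi", target="en"):
    """Feed one .po file to amagama with the build_tmdb command shown earlier."""
    subprocess.check_call(
        ["amagama-manage", "build_tmdb", "-s", source, "-t", target, "-i", path]
    )


def watch(root, interval=30):
    """Poll forever; the 30 second interval is an arbitrary choice."""
    seen = {}
    while True:
        for path in find_modified_po_files(root, seen):
            load_into_amagama(path)
        time.sleep(interval)
```

Note that the first scan treats every file as modified, so everything already on disk gets loaded once at startup. A shorter interval makes new translations show up as suggestions sooner, at the cost of more frequent scanning.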

sujato · June 19, 2015, 11:47pm

I find myself rather astonished that this all actually worked. Congratulations, this can’t have been easy to figure out. But I have a TM working with Pootle! I’ve installed it on my laptop, and this evening will do the same on the desktop.

The only hitch was with creating the DB: first it said there was no role sujato, then that I didn’t have permissions. I stack-exchanged it and tried some things and it works, but I don’t really know how.

I’m guessing that the way forward will be for me to use the 2.5 for now. When we deploy our own dedicated SC translator online (which, correct me if I’m wrong, looks like it’s going to be a thing), 2.7 will be ready, complete with auto-TM and elasticsearch goodness. It will be seriously awesome to be able to outsource translation for SC simply in the browser: no installation!

As far as timing goes, I have only a few hours left to set up my desktop on this visit. I will have a couple of days in July, and that’s it, then we’re packing it up and sending to Qimei. So we’ll have to have a stable system ready by then. However I will be going to Europe in Nov/Dec so maybe we can update to 2.7 then.

blake · June 20, 2015, 7:57am

It wasn’t! I had to talk to a dev on the IRC channel, and after he confirmed that there truly was no built in local TM, I knew I just had to bite the bullet and figure out how to integrate pootle and amagama (it actually wasn’t that hard once I knew it was the only way). I also spent quite a bit of time trying to get 2.7 to work (again bothering the dev on the IRC channel, who said it should work very well - I could get the tutorial project working fine, but when I tried to add a new project the workage kind of ended, so it definitely deserves to be in alpha still).

I agree with everything here.

I can look into integrating the existing pali dictionary lookup we have, perhaps in a similar way to what we do on SuttaCentral - I’m not sure how much pootle will object to javascript messing with its original translation strings, as it already does some stuff like wrapping urls so you can click them to copy them to the translated string, but I imagine it should tolerate what I would need to do.

My understanding of terminology is that it should be set by you, the translator, using the words you want to use consistently in translations, so that in time the terminology also becomes a definite guide to which English word maps to which Pali word. So terminology can’t (shouldn’t) be pre-generated from existing sources. However, Pootle will be oblivious to the rules of Pali, and I might be able to educate it so that its terminology function is more accurate at identifying stems. But since I haven’t looked into how it works or how it’s coded yet, this is highly uncertain.

sujato · June 20, 2015, 9:19am

I’m on my desktop now, cannot get past the database creation. The DB is there, but it says: FATAL: role "sujato" is not permitted to log in

I can’t figure out what I did before that worked, please help!

sujato · June 21, 2015, 3:41am

An additional thought to keep in mind. I’ve arranged with Piya Tan to help out with proofing and so on for my translations. If he could do this on online Pootle that would be great. Probably he’ll get the first batch of texts when I come to Europe, so early Nov. Anyway, if we could have Pootle deployed online by then, but only for private use, that could be very useful; it has the tools of suggestions and so on built right in.

blake · June 22, 2015, 1:03pm

Unfortunately I don’t know, but I’ll take a wild guess that the following might help:

try this:

psql

ALTER ROLE sujato WITH password '12345';
\q

Then set the DB_PASSWORD in amagama/settings.py to ‘12345’

(This might help because depending on system settings it might not be happy about a passwordless login)

If this doesn’t work run these commands:

psql
\l amagama

And paste the output to give some clue about what might be going on.

blake · June 22, 2015, 1:04pm

Adding a working Pootle server will be no problem whatsoever. It even has all the access control you could possibly desire built right in.

sujato · June 25, 2015, 2:24am

Okay I’m away from Sydney and can’t do any work on this for the next couple of weeks. What we’ll have to do is arrange the timing of my next visit to Syd, I have only a couple of days and I need to get a working Pootle set up. Meanwhile I will keep testing on my laptop.

sujato · July 3, 2015, 10:16am

Over the past few days I have been translating, when I get a moment, the Satipatthana Sutta (MN10), which on SC is unfortunately identical with DN22.

I’ve finished the translation, and here I will post a few Pootle-related issues that came up. Generally it was a nice experience, the local server works well, and once the TM and terminology are set up they are reliable. Here are some things for thought:

1. The TM sometimes takes a little while to register things; I’m not sure if this is tweakable. I’m wondering if, on my more powerful desktop, we can set it at more aggressive settings. It’s quite common that you want to reuse nearly identical strings in subsequent segments, but the memory hasn’t registered it yet.
2. MN10 numbers the lists of meditations with <span class="brnum">. These are obviously inline so don’t get swept up in the HTML parser. It would be nice if these were automagically preserved. Probably there will be other inline HTML tags that are like this. Perhaps these could be listed nicely in the Python code somewhere so I can add any new examples I come across? Or else we just check the Pali text in advance to ensure we don’t miss anything? It should be possible, should it not, to automatically include any inline styles, where the tags open and shut at the beginning and end of the segment.
3. Despite the consistency of the Mahasangiti text, there are a few places where the punctuation is irregular, and hence the segmenting. Sometimes this is just a mistake, as parallel passages are different; sometimes it is an editorial choice, but it might mean the sentence break is not ideal in English. I could fix the PO file by hand, but then the script won’t produce the correct result in future. Or I could just edit the Pali text and run the script again. Ideas?
4. I can’t find a find and replace in Pootle! We should be able to handle this on a global level, i.e. change everything in the project. Some online advice says to do this outside Pootle in your text editor. That’s okay for me, if not ideal, but if we want to make a truly simple universal online service it will need proper find and replace.
5. Do you have any thoughts re punctuation? It is a little simpler and quicker for me to enter "'''" than “‘’”, but I am happy to do the right ones. However, if we are to create our online service it would be best not to expect users to do this. This would mean incorporating punctuation transformation in the scpo2html script, I guess.
6. There are some things that I want to leave untranslated, such as uddanas and the like. However, there doesn’t seem to be any way to enter an empty translation, so it still registers as having untranslated sections.
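
On the punctuation question above, a transformation step of the kind that could live in scpo2html might look something like this. It is a rough sketch with a deliberately simple rule, not anything scpo2html currently does; the function name is made up, and the rule will get some cases wrong (for example a leading apostrophe as in 'tis).

```python
OPENERS = {'"': u"\u201c", "'": u"\u2018"}
CLOSERS = {'"': u"\u201d", "'": u"\u2019"}

# Characters after which a quote mark is treated as an opening quote:
# opening brackets, dashes and other opening quotes.
OPENING_CONTEXT = u"([{\u2014\u2013-\u201c\u2018"


def curl_quotes(text):
    """Replace straight quotes with curly ones using a simple context rule."""
    out = []
    for ch in text:
        if ch in OPENERS:
            # Look at the previous output character; treat start of text as space.
            prev = out[-1] if out else u" "
            if prev.isspace() or prev in OPENING_CONTEXT:
                out.append(OPENERS[ch])
            else:
                # After a letter or punctuation it is a closing quote,
                # which also covers apostrophes inside words.
                out.append(CLOSERS[ch])
        else:
            out.append(ch)
    return u"".join(out)
```

Doing the conversion at export time would let translators type plain keyboard quotes in Pootle while the published HTML still gets the typographic ones.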

Regarding point 3 above, I was chatting about this with Ven Kassapa, and he made an interesting point. The problem is that if we change the segmenting we mess up the numbering in the PO file; we either have unnumbered segments, or add numbers in between or whatever. However, there is no need for the PO msgctxt numbers to be the same as the final outputted numbers. So when we run scpo2html we can simply create another set of numbers to mark up the HTML file. In fact, it may be a good idea to rewrite the PO numbers at the same time, so we end up with consistency.

To illustrate: let us assume we have a PO file with msgctxt 1, 2, 3, 4, 5. While translating I discover it would be better to split the third string in two. So I do that, let’s say manually in the PO file, and assign an arbitrary number to the new segment: 1, 2, 3a, 3b, 4, 5. When we later run scpo2html this does two things: it rewrites the msgctxt numbers in the PO file to 1, 2, 3, 4, 5, 6; and it also assigns <span id="1">xyz</span> or similar to the segments in the HTML file, with matching ids of 1 thru 6. Does this make sense?
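
The renumbering in that illustration could be sketched like this. The function names are hypothetical, msgctxt values are assumed to be unique within a file, and the text is inserted unescaped on the assumption that segments may carry inline HTML; none of this is taken from the real scpo2html.

```python
def renumber(contexts):
    """Map each existing msgctxt label to a fresh sequential number (as a string)."""
    return {old: str(new) for new, old in enumerate(contexts, start=1)}


def to_html(segments):
    """Render (msgctxt, translated text) pairs, in file order, as numbered spans."""
    mapping = renumber([ctx for ctx, _text in segments])
    return "".join(
        '<span id="{}">{}</span>'.format(mapping[ctx], text)
        for ctx, text in segments
    )
```

With the labels 1, 2, 3a, 3b, 4, 5 the mapping sends 3a to 3, 3b to 4, and so on up to 6, and the same mapping can be used both to rewrite the PO file and to assign the span ids.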

I’ve used scpo2html and generally it works great, but:

There are some problems with spacing. If no space is left at the end of a segment, the generated HTML has no spacing either, so there is no space after full stops and the like. Rather than expecting users to get this right (it’s not obvious at all in the editor), it would be better to simply ensure that every relevant segment finishes with a single space. Note that dashes do not have trailing spaces.

Conversely, an extra space is sometimes (not sure why not always) inserted after the id tags, so that a paragraph begins with a space.
A later detail, but we should have a metadata entry section in Pootle, which would automatically add the relevant metaarea section… The translator should also be added to the head. Perhaps other metadata from the PO file could be included also.
The generated HTML mistakenly has the language set as “pi” in the metadata.
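
The spacing rule described above (one trailing space per segment, none after a dash, none at the start of a paragraph) is simple enough to sketch. The function names are invented for illustration and this is not how scpo2html currently behaves.

```python
# Segments ending in one of these get no trailing space.
DASHES = (u"\u2014", u"\u2013", u"-")


def normalize_segment(text):
    """Strip stray whitespace, then add exactly one trailing space
    unless the segment is empty or ends with a dash."""
    text = text.strip()
    if not text or text.endswith(DASHES):
        return text
    return text + u" "


def join_paragraph(segments):
    """Join the segments of one paragraph and drop the final trailing space."""
    return u"".join(normalize_segment(s) for s in segments).rstrip()
```

Stripping each segment first also removes any leading space, which covers the case of a paragraph that begins with a stray space.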

Okay, that’s all I can think of for now.

Just for the fun of it, I have added my new MN10 to production: it’s live! Which raises a more systematic problem for later: how are we to handle multiple translations?

O, and one more thing: don’t forget, we still have not implemented the Pali text with section numbers for DN and MN (https://github.com/suttacentral/suttacentral/issues/113). It would surely be good to do this before segmenting the Pali text, to ensure the numbering is also included in the translations. The text was sent some time ago by Khinabija. As far as I know, it should be trivial to update it, it should merely have the new numbers included. But since you were the one who mainly worked with the Pali, I have left it for you. Let me know if you’d like me to look at this.

mikenz66 · July 4, 2015, 5:13am

Interesting translation. I guess it is a bug that on this page: https://suttacentral.net/mn10 the author is still listed as Bhikkhu Bodhi?

sujato · July 4, 2015, 8:41am

Thanks for pointing this out. Our system periodically creates something called a “Textual Information Model” (TIM), which is a digital representation of all the texts we have. I’m guessing that this hasn’t updated from the old TIM.

… And just checking again, it is updated now.