Pootle for translation

I have been investigating Pootle. It is a tool both for performing translation, and for managing translation projects.

This is the Pootle thread. It will mainly be of interest to @sujato, but will also serve as a record of getting things working.

In the future we will probably set up a pootle server for ease of doing translations and coordinating translation efforts.

Pootle Install Guide for Ubuntu/Debian (Local server)

Some dependencies are labelled ‘optional’ only because they are difficult to install, yet they are essential for functionality. So we are going to use the system Python, and install those difficult dependencies using apt-get.

sudo apt-get install python-virtualenv python-lxml python-levenshtein python-lucene

Next we create a virtual environment using virtualenv, which allows accessing system packages:

mkdir ~/pootle
cd ~/pootle
virtualenv --system-site-packages -p /usr/bin/python2.7 env
source ./env/bin/activate
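A quick way to confirm that the virtual environment really sees the system packages is to try importing them from inside it. This is only a minimal sketch: the import names `lxml`, `Levenshtein` and `lucene` are my assumption of what the apt packages above provide, and `missing_modules` is a made-up helper name.

```python
def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names for python-lxml, python-levenshtein, python-lucene.
    print(missing_modules(["lxml", "Levenshtein", "lucene"]))
```

An empty list means everything was found. Anything listed is probably not visible from the environment.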

Finally we can install and configure pootle

pip install Pootle==2.5.1.3
pootle init
pootle setup
pootle createsuperuser

It is best to run the above commands individually as they might ask questions. Enter a username and password which pleases you for the createsuperuser step.

Run the server like this:

cd ~/pootle
source ./env/bin/activate
pootle start

Point your browser to http://localhost:8000. Expect the first page load to take a long time (~1 minute) as it builds assets and so on. But if it dumps loads of error messages into the terminal, then something went terribly wrong.

If everything went well you can now log in with your admin account.

Running the server automatically

Run the following code in the terminal to create a ‘daemonize’ script which can start the server automatically.

echo '#!/bin/bash

HOME='$HOME'
cd $HOME/pootle
source env/bin/activate
pootle start' > daemonize-pootle.sh
chmod +x daemonize-pootle.sh

This script can be added to startup applications so it starts when you log in.

Overall this installation should be adequate for a single user. None of the performance optimizations are installed but this is highly unlikely to matter when only one user is using it.

Upgrade to MySQL

To improve performance, it is recommended to use MySQL. Although in places Pootle claims to support PostgreSQL, it does not!

The procedure to use MySQL is quite simple.

Save your data (optional)

Assuming you already have pootle set up the way you like it with SQLite, you can save the data in this way:

cd ~/pootle
source env/bin/activate
pootle dumpdata -n > ./data.json

We’ll make use of data.json after installing and configuring MySQL.

Install MySQL-server

sudo apt-get install mysql-server

I recommend reading and following the instructions here for installing and securing MySQL, especially under the “Initial Setup” section - the rest doesn’t really matter.

Create account and database

First get a MySQL prompt:
mysql -u root -p
Enter the root password you set earlier when prompted.

Now enter the following commands:

CREATE USER pootle IDENTIFIED BY 'pootle';
CREATE DATABASE pootledb CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL PRIVILEGES ON pootledb.* TO pootle@localhost IDENTIFIED BY 'pootle';
FLUSH PRIVILEGES;
exit

Configure Pootle

Edit ~/.pootle/pootle.conf (consider making a backup copy first!). The DATABASES section should look something like this:

# Database backend settings
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        # Database name
        'NAME': 'pootledb',
        'USER': 'pootle',
        'PASSWORD': 'pootle',
        'HOST': 'localhost',
        'PORT': '3306',
    }
}

Now that pootle is configured, restart:

cd ~/pootle
source env/bin/activate
pip install MySQL-python
killall start.sh

Initialize and load the data:

pootle syncdb
pootle migrate
pootle loaddata ./data.json

Loading the data might take several minutes.

Once done, run:
./start.sh

Pootle should now be running the same as before, except backed by MySQL instead of SQLite.

If we can get this going that would be terrific. An online interface would solve any problems with installation, and it means that we can use the Pootle “alternative language” suggestions and other things. I’ll get this set up and we’ll see how we go.

Okay, so I have the server up and running on my laptop. I didn’t have to do the whole “pip regression” thing. Next try to use it, not sure how to add projects. Could you segment the whole Pali text and add it as a project?

Adding texts is very easy. It’s in the admin panel. You first go to languages and add a language, then you go to projects and add a project. Then you go to the project you just created and add a translation language (i.e. English).

At that point you can go to the project (the normal, non-admin overview) and upload a zip file containing po files.
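If the po files live in a folder on disk, the zip for upload can be built with a few lines of Python. This is just a sketch: `zip_po_files` is a name I made up, and keeping the relative folder layout inside the archive is an assumption about what is convenient, not something Pootle is known to require.

```python
import os
import zipfile


def zip_po_files(src_dir, zip_path):
    """Pack every .po file under src_dir into zip_path, keeping relative paths."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(src_dir):
            for name in sorted(files):
                if name.endswith(".po"):
                    full = os.path.join(folder, name)
                    rel = os.path.relpath(full, src_dir)
                    archive.write(full, rel)
                    added.append(rel)
    return added
```

The returned list shows which files went into the archive, which is handy for a quick sanity check before uploading.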

One of the tricky things here is getting the translation memory working, as it relies on some external dependencies. In 2.5.x there is truly a proliferation of possible dependencies. This is eased somewhat in 2.7, which simply uses redis for caching and elasticsearch for translation memory. But 2.7 is still in alpha, and it is difficult to set up because there are no release versions of it.
A further challenge is that the documentation is haphazard at best, and often the documentation you find online is for the wrong version of Pootle. Some of the most easily found configuration documentation is for 2.7, even though 2.7 is not released.
So I’m still working through finding the best way to get everything working.

I have updated the installation instructions to use a different method, which should result in a Pootle server which is 100% functional, including terms suggestion and translation memory.

If you followed the previous installation guide using pyenv, please run the following to undo it:

pyenv uninstall pootle
rm -rf ~/pootle

And then follow the new instructions.

Great, I will get to this in the next day or two.

Okay, now Pootle is installed on my desktop, all went smoothly. I’ve uploaded the Thig PO file, and the Pali terms, all works, except no TM.

Any assistance with the TM? It’s useless without it.

I was talking to a dev on the translatehouse IRC channel. Unfortunately, automatic local TM only works in Pootle 2.7. It is possible to set up local TM in Pootle 2.5 using the TM server amagama. Now, one thing with local TM is deciding when a translation should be committed to memory. One complaint I’ve heard of Virtaal is that it has a memory like an elephant. A nice thing about amagama is you can easily tell it to forget everything it has ever learned, and then rebuild its memory from “clean” po files. So while installing and running amagama is a minor hassle, it does offer some impressive power.

Installing Amagama

These instructions assume Pootle has been installed in accordance with the previous instructions for installing Pootle.

Install postgresql

sudo apt-get install postgresql

Install amagama (into existing pootle python environment)

cd ~/pootle
source ./env/bin/activate
git clone https://github.com/translate/amagama.git
cd amagama/
pip install -r requirements/recommended.txt
pip install pathlib

Add an amagama database to postgresql

createdb -E UTF-8 amagama

Now edit amagama/settings.py and change the entry DB_USER to your username.

Because we are using Pali, we need to make amagama at least minimally Pali-aware. There is a list of language codes near the start of amagama/tmdb.py called CODE_CONFIG_MAP. Edit this list, adding the entry:

'pi': 'simple',

(We use ‘simple’ because this tells postgresql how to handle the language)

Things which can be done with amagama-manage

These are optional; I’ve attached some scripts which will do it all for you.

Before running amagama we need to export some paths:

cd ~/pootle/amagama
export PATH=$(pwd)/bin:$PATH
export PYTHONPATH=$(pwd):$PYTHONPATH

We can now initialize an amagama pali database:

amagama-manage initdb -s pi

At this point, amagama should be ready to remember translations. A po file (or files) can be added manually like this:

amagama-manage build_tmdb -s pi -t en -i sn56.11.po

If a directory is passed to ‘-i’ then the directory will be processed recursively and all po files loaded into the TM.

If amagama is suffering from lots of bad memories, it is possible to nuke the database and return it to a clean slate:

amagama-manage dropdb -s pi
amagama-manage initdb -s pi

Getting Pootle to talk to Amagama

Edit ~/.pootle/pootle.conf and add the following two lines:

AMAGAMA_URL = 'http://localhost:8888/tmserver/'
AUTOSYNC = True

This tells Pootle where to look for TM suggestions, and also tells it to write .po files to disk immediately upon modification which is shortly going to come in useful.
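To check that the TM server answers before blaming Pootle, it can be queried directly. The sketch below (Python 3) assumes the amagama lookup path has the shape `<source>/<target>/unit/<text>` under the `AMAGAMA_URL` above and that the reply is JSON. Please treat both as assumptions to verify against the amagama docs. The helper names are made up.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

AMAGAMA_URL = "http://localhost:8888/tmserver/"


def unit_url(source_lang, target_lang, text, base=AMAGAMA_URL):
    """Build the (assumed) lookup URL for one source string."""
    return "%s%s/%s/unit/%s" % (base, source_lang, target_lang, quote(text, safe=""))


def lookup(source_lang, target_lang, text):
    """Ask the TM server for suggestions; needs a running amagama server."""
    with urlopen(unit_url(source_lang, target_lang, text)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `lookup("pi", "en", ...)` with a string that has already been translated should return at least one suggestion if the TM has been populated.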

Getting Amagama to function as Local TM

There is no built-in way for Pootle 2.5 to automatically send translations to amagama. Amagama is designed to be a vetted TM rather than one that indiscriminately remembers everything ever submitted.

As such I have written a python script which will scan the Pootle project folders for modified .po files, and load them into amagama.
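For anyone curious about the general idea, here is a rough sketch of such a sync step. It is not the attached script itself, just an illustration under some assumptions: the function names are invented, modification is detected by file mtime, and loading is done by calling the `amagama-manage build_tmdb` command shown earlier.

```python
import os
import subprocess


def modified_po_files(root, since):
    """Return .po files under root whose mtime is newer than `since` (epoch seconds)."""
    found = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if name.endswith(".po"):
                path = os.path.join(folder, name)
                if os.path.getmtime(path) > since:
                    found.append(path)
    return sorted(found)


def build_command(po_path, source_lang="pi", target_lang="en"):
    """Command line that loads one po file into the TM."""
    return ["amagama-manage", "build_tmdb",
            "-s", source_lang, "-t", target_lang, "-i", po_path]


def sync_once(root, since):
    """Load every recently modified .po file; returns the files processed."""
    changed = modified_po_files(root, since)
    for path in changed:
        subprocess.check_call(build_command(path))
    return changed
```

A loop that calls `sync_once` every few seconds, remembering the time of the previous pass, would give roughly the "local TM" behaviour described here.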

Find in the attached archive pootle_scripts.tar.gz (1.2 KB) the following files:

  • start.sh: Run this to start the amagama server, the pootle server, and remember.py. When closed (i.e. by ctrl-c) it will terminate all the servers it started.
  • reset.sh: Run this to restore the translation memory to a blank slate. It is fine to run this while the pootle/amagama servers are running.
  • remember.py: This is a script which synchronizes the .po files from Pootle with the amagama server. You don’t need to run this manually.

In brief, just extract those scripts to the ~/pootle directory, first run reset.sh to make sure it’s in a good state, and then run start.sh. You can then navigate to http://localhost:8000 and should have a working pootle server, with working translation memory.

start.sh supersedes daemonize-pootle.sh, which should be deleted. If pootle is running, kill it by running killall pootle.

To make pootle (and friends) start automatically, simply add start.sh to startup applications.

I find myself rather astonished that this all actually worked. Congratulations, this can’t have been easy to figure out. But I have a TM working with Pootle! I’ve installed it on my laptop, and this evening will do the same on the desktop.

The only hitch was with creating the DB: first it said there was no role sujato, then that I didn’t have permissions. I stack-exchanged it and tried some things and it works, but I don’t really know how.

I’m guessing that the way forward will be for me to use the 2.5 for now. When we deploy our own dedicated SC translator online - which, correct me if I’m wrong, but it looks like it’s going to be a thing - 2.7 will be ready, complete with auto-TM and elasticsearch goodness. It will be seriously awesome to be able to outsource translation for SC simply in the browser: no installation!

As far as timing goes, I have only a few hours left to set up my desktop on this visit. I will have a couple of days in July, and that’s it, then we’re packing it up and sending to Qimei. So we’ll have to have a stable system ready by then. However I will be going to Europe in Nov/Dec so maybe we can update to 2.7 then.

It wasn’t! I had to talk to a dev on the IRC channel, and after he confirmed that there truly was no built-in local TM, I knew I just had to bite the bullet and figure out how to integrate pootle and amagama (it actually wasn’t that hard once I knew it was the only way). I also spent quite a bit of time trying to get 2.7 to work (again bothering the dev on the IRC channel, who said it should work very well - I could get the tutorial project working fine, but when I tried to add a new project the workage kind of ended, so it definitely deserves to be in alpha still).

I agree with everything here.

I can look into integrating the existing Pali dictionary lookup we have, perhaps in a similar way to what we do on SuttaCentral - I’m not sure how much pootle will object to javascript messing with its original translation strings, as it already does some stuff like wrapping urls so you can click them to copy them to the translated string, but I imagine it should tolerate what I would need to do.

My understanding of terminology is it should be set by you, the translator, using the words you want to use consistently in translations, and so in time the terminology also becomes a definite guide to what English word maps to what Pali word. So terminology can’t (shouldn’t) be pre-generated from existing sources. However pootle will be oblivious to the rules of Pali and I might be able to educate it, making its terminology function more accurate at identifying stems. But since I haven’t looked into how it works or how it’s coded yet, this is highly uncertain.

I’m on my desktop now, cannot get past the database creation. The DB is there, but it says: FATAL: role "sujato" is not permitted to log in

I can’t figure out what I did before that worked, please help!

An additional thought to keep in mind. I’ve arranged with Piya Tan to help out with proofing and so on for my translations. If he could do this on online Pootle that would be great. Probably he’ll get the first batch of texts when I come to Europe, so early Nov. Anyway, if we could have Pootle deployed online by then, but only for private use, that could be very useful; it has the tools of suggestions and so on built right in.

Unfortunately I don’t know, but I’ll take a wild guess that the following might help:

Try this:

sudo -u postgres psql

ALTER ROLE sujato WITH LOGIN PASSWORD '12345';
\q

Then set the DB_PASSWORD in amagama/settings.py to ‘12345’

(This might help because depending on system settings it might not be happy about a passwordless login)

If this doesn’t work run these commands:

psql
\l amagama

And paste the output to give some clue about what might be going on.

Adding a working Pootle server will be no problem whatsoever. It even has all the access control you could possibly desire built right in.

Okay I’m away from Sydney and can’t do any work on this for the next couple of weeks. What we’ll have to do is arrange the timing of my next visit to Syd, I have only a couple of days and I need to get a working Pootle set up. Meanwhile I will keep testing on my laptop.

Over the past few days I have been translating, when I get a moment, the Satipatthana Sutta (MN10), which on SC is unfortunately identical with DN22.

I’ve finished the translation, and here I will post a few Pootle-related issues that came up. Generally it was a nice experience, the local server works well, and once the TM and terminology are set up they are reliable. Here are some things for thought:

  1. The TM sometimes takes a little while to register things, I’m not sure if this is tweakable. I’m wondering if, on my more powerful desktop, we can set it at more aggressive settings. It’s quite common that you want to reuse nearly identical strings in subsequent segments, but the memory hasn’t registered it yet.
  2. MN10 numbers the lists of meditations with <span class="brnum">. These are obviously inline so don’t get swept up in the HTML parser. It would be nice if these were automagically preserved. Probably there will be other inline HTML tags that are like this. Perhaps if these could be listed nicely in the Python code somewhere so I can add any new examples I come across? Or else we just check the Pali text in advance to ensure we don’t miss anything? It should be possible, should it not, to automatically include any inline styles, where the tags open and shut at the beginning and end of the segment.
  3. Despite the consistency of the Mahasangiti text, there are a few places where the punctuation is irregular, and hence the segmenting. This is sometimes just a mistake, as parallel passages are different, but sometimes it is an editorial choice that might mean the sentence break is not ideal in English. I could fix the PO file by hand, but then the script won’t produce the correct result in future. Or I could just edit the Pali text and run the script again. Ideas?
  4. I can’t find a find and replace in Pootle! We should be able to handle this on a global level, i.e. change everything in the project. Some online advice says to do this outside Pootle in your text editor. That’s okay for me, if not ideal, but if we want to make a truly simple universal online service it will need proper find and replace.
  5. Do you have any thoughts re punctuation? It is a little simpler and quicker for me to enter "'''" than “‘’”, but I am happy to do the right ones. However if we are to create our online service it would be best to not expect users to do this. This would mean incorporating punctuation transformation in the scpo2html script, I guess.
  6. There are some things that I want to leave untranslated, such as uddanas and the like. However there doesn’t seem to be any way to enter an empty translation, so it still registers as having untranslated sections.
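On the punctuation transformation mentioned in the list above, a simple heuristic may be enough to start with: a straight quote is treated as an opening quote when it follows whitespace, an opening bracket, a dash, or another opening quote, and as a closing quote otherwise. This is only a sketch, not how scpo2html does anything today. `curl_quotes` is an invented name, and words that begin with an apostrophe will come out wrong.

```python
# Characters after which a quote is assumed to open rather than close.
OPENING_CONTEXT = set(" \t\n([{\u201c\u2018\u2014-")


def curl_quotes(text):
    """Replace straight quotes with curly ones, judging by the preceding character."""
    out = []
    for char in text:
        if char in "\"'":
            previous = out[-1] if out else " "
            opening = previous in OPENING_CONTEXT
            if char == '"':
                char = "\u201c" if opening else "\u201d"
            else:
                char = "\u2018" if opening else "\u2019"
        out.append(char)
    return "".join(out)
```

Because the check looks at the already converted output, nested quotes at the start of a segment open correctly one after another.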

Regarding point 3 above, I was chatting about this with Ven Kassapa, and he made an interesting point. The problem is that if we change the segmenting we mess up the numbering in the PO file; we either have unnumbered segments, or add numbers in between or whatever. However, there is no need for the PO msgctxt numbers to be the same as the final outputted numbers. So when we run scpo2html we can simply create another set of numbers to mark up the HTML file. In fact, it may be a good idea to rewrite the PO numbers at the same time, so we end up with consistency.

To illustrate: let us assume we have a PO file with msgctxt 1, 2, 3, 4, 5. While translating I discover it would be better to split the third string in two. So I do that, let’s say manually in the PO file, and assign an arbitrary number to the new segment: 1, 2, 3a, 3b, 4, 5. When we later run scpo2html this does two things: it rewrites the msgctxt numbers in the PO file to 1, 2, 3, 4, 5, 6; and it also assigns <span id="1">xyz</span> or similar to the segments in the HTML file, with matching ids of 1 thru 6. Does this make sense?
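The renumbering idea can be sketched in a few lines. This is a hypothetical illustration only, with invented function names, assuming each msgctxt in a file is unique and that segments are already in reading order. It is not the current behaviour of scpo2html.

```python
def renumber(contexts):
    """Map each old msgctxt to a new sequential number, preserving order."""
    return {old: str(index) for index, old in enumerate(contexts, start=1)}


def mark_up(segments):
    """segments: list of (msgctxt, text). Returns HTML spans carrying the new ids."""
    mapping = renumber([ctx for ctx, _text in segments])
    return ['<span id="%s">%s</span>' % (mapping[ctx], text)
            for ctx, text in segments]
```

With the example above, `renumber(["1", "2", "3a", "3b", "4", "5"])` maps 3a to 3, 3b to 4, and the old 4 and 5 to 5 and 6. The same mapping can be used to rewrite the PO file.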


I’ve used scpo2html and generally it works great, but:

  1. There are some problems with spacing. If no space is left at the end of a segment, the generated HTML has no spacing either, so there is no space after full stops and the like. Rather than expecting users to get this right - it’s not obvious at all in the editor - it would be better to simply ensure that every relevant segment finishes with a single space. Note that dashes do not have trailing spaces.
  2. Conversely, an extra space is sometimes (I’m not sure if always) inserted after the id tags, so that a paragraph begins with a space.
  3. A later detail, but we should have a metadata entry section in Pootle, which would automatically add the relevant metaarea section… The translator should also be added to the head. Perhaps other metadata from the PO file could be included also.
  4. The generated HTML mistakenly has the language set as “pi” in the metadata.
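The two spacing points could be handled by one small normalisation step applied to each translated segment before the HTML is assembled. The following is a sketch under my own assumptions, with an invented function name. It treats hyphen, en dash and em dash as the dashes that take no trailing space, and it is not part of scpo2html as it stands.

```python
# Segment endings that should not be followed by a space (assumed set).
DASHES = ("\u2014", "\u2013", "-")


def normalize_spacing(segment):
    """Strip stray leading/trailing whitespace, then add exactly one trailing
    space unless the segment is empty or ends with a dash."""
    text = segment.strip()
    if not text or text.endswith(DASHES):
        return text
    return text + " "
```

A trailing space on the last segment of a paragraph is harmless in HTML, so no special case is made for it here.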

Okay, that’s all I can think of for now.

Just for the fun of it, I have added my new MN10 to production: it’s live! Which raises a more systematic problem for later: how are we to handle multiple translations?

O, and one more thing: don’t forget, we still have not implemented the Pali text with section numbers for DN and MN (https://github.com/suttacentral/suttacentral/issues/113). It would surely be good to do this before segmenting the Pali text, to ensure the numbering is also included in the translations. The text was sent some time ago by Khinabija. As far as I know, it should be trivial to update it, it should merely have the new numbers included. But since you were the one who mainly worked with the Pali, I have left it for you. Let me know if you’d like me to look at this.

Interesting translation. I guess it is a bug that on this page: https://suttacentral.net/mn10 the author is still listed as Bhikkhu Bodhi?

Thanks for pointing this out. Our system periodically creates something called a “Textual Information Model” (TIM), which is a digital representation of all the texts we have. I’m guessing that this hasn’t updated from the old TIM.

… And just checking again, it is updated now.