Notes on the segmentation of Pali Vinaya with Brahmali's translation

Just let me know when it is done, and I’ll do the next step.

What that will involve is essentially this:

  1. Download the PO files from Pootle. From then on, no work, corrections, or anything should be done on these texts on Pootle.
  2. I will go over the files and massage them until they are all in a consistent and clean form:
    • Ensure markup is correct, deduplicate where necessary.
    • Deduplicate references and put them in plain data form (e.g., <a class="sc" id="sc12"></a> becomes sc12).
    • Ensure all meta content is on separate and labelled lines in PO files.
    • Ensure each kind of content is on one line per segment.
  3. Adjust segmenting.
  4. Adjust paragraphing.
  5. Run automated tests to ensure data reliability.
  6. Convert to JSON.
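Most of these steps operate on PO entries, so as a rough illustration, here is a minimal, deliberately simplified reader for them. Real PO files also have escaped quotes, flag lines, and multi-line msgids, which this sketch ignores; the function name and entry structure are my own, not part of any existing tooling.

```python
# A minimal, deliberately simplified PO entry reader: yields one dict per
# entry with its "#." comments, msgctxt, msgid, and msgstr.
def read_po_entries(text: str):
    for block in text.strip().split('\n\n'):
        entry = {'comments': []}
        for line in block.splitlines():
            if line.startswith('#.'):
                entry['comments'].append(line[2:].strip())
            elif line.startswith('msgctxt'):
                entry['msgctxt'] = line.split(' ', 1)[1].strip('"')
            elif line.startswith('msgid'):
                entry['msgid'] = line.split(' ', 1)[1].strip('"')
            elif line.startswith('msgstr'):
                entry['msgstr'] = line.split(' ', 1)[1].strip('"')
            elif line.startswith('"'):
                # continuation lines (assumed here to extend msgstr only)
                entry['msgstr'] = entry.get('msgstr', '') + line.strip('"')
        yield entry

sample = 'msgctxt "pli-tv-kd9:1.2.1"\nmsgid "Atha kho…"\nmsgstr "Soon afterwards…"'
for e in read_po_entries(sample):
    print(e['msgctxt'], '->', e['msgstr'])
```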

To describe this all in more detail, let me give an example from Kd 9:

#. HTML: </p><p>
#. REF: sc2
msgctxt "pli-tv-kd9:1.2.1"
msgid "Atha kho kassapagottassa bhikkhuno etadahosi—"
msgstr ""
"<p><a class=\"pts-cs\" id=\"Kd.9.1.2\" href=\"#Kd.9.1.2\">Kd.9.1.2</a><a "
"class=\"ms-pa\" id=\"MS.3.1775\" href=\"#MS.3.1775\">MS.3.1775</a>Soon "
"afterwards Kassapagotta thought,"

When I have processed the PO file (step 2), this will look like:

#. HTML: </p><p>
#. REF: sc2, kd.9.1.2, ms.3.1775
msgctxt "pli-tv-kd9:1.2.1"
msgid "Atha kho kassapagottassa bhikkhuno etadahosi—"
msgstr "Soon afterwards Kassapagotta thought,"
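The transformation between these two forms can be sketched with a couple of regular expressions: pull the reference anchors out of the msgstr, keep their ids as plain data for the "#. REF:" comment, and move any remaining bare tags (like <p>) into the "#. HTML:" comment. The function and pattern names here are my own illustration, not the actual script.

```python
import re

# Anchors like <a class="pts-cs" id="Kd.9.1.2" href="...">Kd.9.1.2</a>
ANCHOR = re.compile(r'<a class="[^"]*" id="([^"]+)"[^>]*>[^<]*</a>')
# Any remaining bare tag, e.g. <p> or </p>
TAG = re.compile(r'</?[a-z]+[^>]*>')

def clean_msgstr(msgstr: str):
    """Return (plain text, reference ids, leftover markup)."""
    refs = [m.lower() for m in ANCHOR.findall(msgstr)]
    stripped = ANCHOR.sub('', msgstr)
    html = ''.join(TAG.findall(stripped))
    text = TAG.sub('', stripped)
    return text, refs, html

msgstr = ('<p><a class="pts-cs" id="Kd.9.1.2" href="#Kd.9.1.2">Kd.9.1.2</a>'
          '<a class="ms-pa" id="MS.3.1775" href="#MS.3.1775">MS.3.1775</a>'
          'Soon afterwards Kassapagotta thought,')
text, refs, html = clean_msgstr(msgstr)
print(refs)   # ['kd.9.1.2', 'ms.3.1775']
print(text)   # Soon afterwards Kassapagotta thought,
print(html)   # <p>
```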

As you can see, this is much cleaner and better organized.

Next, I will review the segments to improve consistency and accuracy (step 3). In the PO files, any segment that has been marked as “needing work” carries the flag #, fuzzy. So I will go through all the fuzzy segments and resolve problems. I’ll also check generally for consistency and coherence in segmenting.

The next stage will be to review the paragraphing, as I have recently done with the nikayas (step 4). To do this, I take advantage of a little quirk in the PO files: since they have the HTML tags recorded as comments, with a few tweaks they can be made to render as actual HTML files! Then I can visually review the paragraph breaks just by opening the files in a browser. I will make the paragraphs conform to the normal rules, for example, a new paragraph for each speaker in a dialogue passage. Generally speaking, the outcome will be a more finely articulated and readable text with shorter paragraphs; occasionally, however, it will also mean combining existing paragraphs.
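The trick of rendering a PO file as HTML amounts to concatenating, in order, each segment's recorded markup and its translation. A hypothetical sketch, assuming each entry is a dict with its "#." comments and msgstr:

```python
# Render reviewable HTML from PO entries: each entry's "HTML:" comment
# holds the markup that precedes its text, so joining them in order
# reconstructs a viewable page for checking paragraph breaks.
def po_to_html(entries):
    parts = []
    for e in entries:
        html = next((c[5:].strip() for c in e.get('comments', [])
                     if c.startswith('HTML:')), '')
        parts.append(html + e['msgstr'])
    return ''.join(parts)

entries = [
    {'comments': ['HTML: <p>'], 'msgstr': 'First paragraph.'},
    {'comments': ['HTML: </p><p>'], 'msgstr': 'Second paragraph.'},
]
print(po_to_html(entries))
# <p>First paragraph.</p><p>Second paragraph.
```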

When segments are adjusted, their numbering gets thrown out of whack. This does not affect the reference numbers, only the msgctxt, which is the universal key for all information associated with a segment. So at the end of the process, Blake will regenerate the msgctxt numbers to ensure that they are all correct, sequential, and unduplicated (step 5). He will also run tests to ensure that the text itself remains exactly as it was before this process. We will also run tests to ensure that all the markup is valid and correct, and that all the reference numbers are sane (for example, checking whether any page reference numbers are omitted or doubled).
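The renumbering itself is mechanical. Here is a sketch of the idea, under the assumption that only the final component of an ID like pli-tv-kd9:1.2.1 needs regenerating, counting upwards within each paragraph; the actual script may well work differently.

```python
from collections import defaultdict

# Regenerate the last component of each segment ID so it runs 1, 2, 3, ...
# within its paragraph, with no gaps or duplicates.
def renumber(msgctxts: list[str]) -> list[str]:
    counters = defaultdict(int)
    out = []
    for ctx in msgctxts:
        prefix, _, _ = ctx.rpartition('.')   # e.g. 'pli-tv-kd9:1.2'
        counters[prefix] += 1
        out.append(f'{prefix}.{counters[prefix]}')
    return out

# A duplicated ID and a gapped ID both get corrected:
print(renumber(['pli-tv-kd9:1.2.1', 'pli-tv-kd9:1.2.1', 'pli-tv-kd9:1.2.4']))
# ['pli-tv-kd9:1.2.1', 'pli-tv-kd9:1.2.2', 'pli-tv-kd9:1.2.3']
```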

Up to now, we are still working with PO files, and they can, in principle, be re-uploaded to Pootle for further editing and so on. However, the aim is to move on to Bilara, so the next step is to adjust the data for Bilara (step 6). If the preparation work has been done well, this will be an automated process, merely duplicating what is being done at the moment for the nikayas. It will split the PO data into separate JSON files. Currently, the PO files keep everything in the same file: original text, translation, segment ID, reference numbers, HTML markup, variant readings, and comments, as well as PO-specific metadata. Keeping all of this straight in the same file is ridoinculous. So the idea is that the data is cleanly separated by type, and may be recombined at will, all coordinated by the universal ID supplied by the segment number (which in PO is called msgctxt).
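In outline, the split looks like this. The entry structure below is a placeholder of my own, assuming each parsed PO entry carries its msgctxt, msgid, msgstr, and reference list; each data type then becomes its own JSON object, all keyed by the same segment ID.

```python
import json

# Split one combined PO entry into the separate Bilara data types,
# every dict keyed by the universal segment ID (the PO msgctxt).
entries = [
    {'msgctxt': 'pli-tv-kd9:1.2.1',
     'msgid': 'Atha kho kassapagottassa bhikkhuno etadahosi—',
     'msgstr': 'Soon afterwards Kassapagotta thought,',
     'refs': ['sc2', 'kd.9.1.2', 'ms.3.1775']},
]

root        = {e['msgctxt']: e['msgid'] for e in entries}
translation = {e['msgctxt']: e['msgstr'] for e in entries}
reference   = {e['msgctxt']: ', '.join(e['refs']) for e in entries}

# Each dict would then be written to its own file under source/,
# translation/en/, reference/, and so on.
print(json.dumps(translation, ensure_ascii=False, indent=2))
```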

To see how beautiful these look, check out SN 1.1.

Root text:

https://github.com/suttacentral/bilara-data/blob/master/source/pli/ms/sn/sn1/sn1.1.json

Translation:

https://github.com/suttacentral/bilara-data/blob/master/translation/en/sujato/sn/sn1/sn1.1.json

References:

https://github.com/suttacentral/bilara-data/blob/master/reference/sn/sn1/sn1.1.json

In the markup files, we have the HTML skeleton, fleshed out with the ID numbers.

https://github.com/suttacentral/bilara-data/blob/master/markup/sn/sn1/sn1.1.html

By abstracting and separating concerns like this, we can combine these things across any language. The same set of references will work in Pali, English, Italian, or Thai. The same HTML markup will apply. If we like, we can apply comments across the different languages. None of this was previously possible, because the relevant data was embedded in a single file and couldn’t be transferred from one context to another except by hand, which is exactly what you folks have been doing these past months. Now that you’ve done it, no one else will have to. Yay!
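Because every file shares the same segment IDs, recombining any root with any translation is just a dictionary join. A trivial sketch with placeholder data (not the real SN 1.1 content):

```python
# Pair a root text with a translation by segment ID; segments with no
# translation yet simply come through empty.
root = {'seg:1.1': 'pali text A', 'seg:1.2': 'pali text B'}
translation = {'seg:1.1': 'English text A'}

combined = {k: (v, translation.get(k, '')) for k, v in root.items()}
print(combined)
# {'seg:1.1': ('pali text A', 'English text A'), 'seg:1.2': ('pali text B', '')}
```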

I would estimate roughly a month to complete the process described above.
