Notes on the segmentation of Pali Vinaya with Brahmali's translation

sujato · September 17, 2018, 1:32pm

Updated!

I am currently working on the PO files for Brahmali’s Vinaya translation. I have made some adjustments to the way the PO is handled in line with:

@Brahmali @blake @sabbamitta @greenTara

Here are the corrected files:

pootle-corrections-sep-2018.zip (2.6 MB)

All Pali and Brahmali translations up to Pc 50.
All Pali only texts for the remainder of the Vinaya.
Detailed reference info up to Pc 50.
Incomplete ref info for some Khandhakas.
A few Sutta texts that also need updating.
A set of corrections to sutta numberings, based on sanity checking of msgctxt numbers. The folder “corrected numbers” contains both the corrected files and the sanity test with notes.

Things I have done

Basis of numbering is the pts-cs numbers of the PTS edition. Segments subdivide these.
However, the notation pts-cs has been removed from the segment numbers. It is assumed.
Generally speaking, the msgctxt numbers in these files do correctly follow the pts-cs numbers. This should be preserved as much as possible. Nonetheless, when the English is added it will also contain the pts-cs numbers. Where there are two sets, they can be checked against each other for accuracy, and if a number is missing from one it can be supplied from the other. For this reason, ensure that all numbers are preserved for now. We can reconcile them later.
Unnumbered text is assigned zeroth numbers. This is usually for headings or front matter that is not assigned a number in the pts-cs scheme. In such cases we insert a 0 so we can count the numbers without affecting the pts-cs count. For example, beginning the Aniyata rules we have pli-tv-bu-vb-ay1:0.1. This zeroth level continues up to pli-tv-bu-vb-ay1:0.5.
Segments that are in the translation but not in the Pali text are assigned an ID suffix a (e.g. pli-tv-bu-vb-ss6:3.1.0a), or in some cases also b. This is used for extra headings. In such cases the segment is blank in the Pali. In a few cases I have adjusted and simplified these from Brahmali: it is best to have as few extra segments as possible.
Reference numbers included class pts-cs-empty. These have been deleted. They were originally added as a display convenience.
Explicit markers have been added for each kind of segment metadata. The form of the mark follows the current #. VAR: for variant readings. Thus we have:
- #. HTML: = HTML markup
- #. VAR: = variant reading
- #. REF: = reference
- # NOTE: = note by Brahmali
Each of these kinds of metadata normally takes one line only. An exception is in the case of notes and variant readings, which occasionally have more than one per segment. These are genuine cases where they have two or more notes, or two or more variant readings per segment, so should be retained.
Note that # NOTE: lacks a trailing period after the hash. I think this is how to get the notes to work as comments in Pootle.
The REF numbers are stripped of the HTML trappings and made consistent. They are comma separated.
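To make the marker scheme concrete, here is a minimal Python sketch of how the comment lines of a single PO entry could be sorted by kind. It is only an illustration under stated assumptions: the function name `parse_metadata` and the sample values are hypothetical, and it assumes each marker starts its comment line exactly as listed above.

```python
# Marker prefixes as listed above. NOTE uses "# " (no period after the hash).
MARKERS = {
    "#. HTML:": "html",
    "#. VAR:": "var",
    "#. REF:": "ref",
    "# NOTE:": "note",
}


def parse_metadata(comment_lines):
    """Group the comment lines of one PO entry by kind of metadata."""
    meta = {kind: [] for kind in MARKERS.values()}
    for line in comment_lines:
        for prefix, kind in MARKERS.items():
            if line.startswith(prefix):
                value = line[len(prefix):].strip()
                if kind == "ref":
                    # REF values are comma separated, so split them out.
                    meta[kind].extend(v.strip() for v in value.split(",") if v.strip())
                else:
                    # Notes and variants may occur more than once per segment.
                    meta[kind].append(value)
                break
    return meta


# Placeholder values for illustration only; not taken from the real files.
example = [
    "#. HTML: <p>",
    "#. REF: ref-one, ref-two",
    "# NOTE: first note",
    "# NOTE: second note",
]
print(parse_metadata(example))
```

Keeping every kind as a list means the occasional segment with two notes or two variant readings needs no special handling.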
For some reason the word Ānanda appeared multiple times in the reference data. It has been removed.
Sometimes the reference numbers and HTML structural markup were not cleanly separated. They have now been separated.
The refs inconsistently used wt-pa and ms-pa. I have changed these all to ms. They represent the ms (not msdiv) numbers in the Mahasangiti edition, which is the primary system in that edition.
Use markdown link syntax for cross-references. This has [square brackets for the displayed text](followed by round brackets for the ID reference). The original uses both class="cr" and <ref>, but both can be treated the same way. Example:
- To be expanded as in [Relinquishment 1, paragraphs 13–17](pli-tv-bu-vb-np1#13), with appropriate substitutions
The original included rule counts at the end of the text. These were of the form <em>42</em>:<strong>91</strong>. The first number is a rule count in that class of rules, the second the total rule count. These had been assigned a separate segment. In fact, there is no need for them at all. I have deleted them and their segments.
Certain other numbers had also been assigned separate segments. But we should only give segments to genuine text, so I have moved the numbers into the previous msgid.
All HTML/XML style markup has been removed from the text and translation. Instead we use the markdown-style conventions as defined in Nilakkhana. Things actually used in the text are:
- *abc* — emphasis, = <em>
- _abc_ — Pali text quoted in another language. = <i lang=“pli”>
- **abc** — strong emphasis = <strong> (I think this is only in the unfinished Khandhaka texts and will end up being replaced.)
- # — numbers found in text = .counter.
- «abc» — A note in the text identifying the speaker = .speaker (once only!)
- [link text](link ID) — link for cross reference, etc. = <a class=“cr”> (usually).
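As a rough illustration of how these conventions map back to HTML, here is a minimal Python sketch covering emphasis, strong emphasis, quoted Pali, and cross-reference links. It is not the Nilakkhana implementation: the function name `to_html` is hypothetical, putting the link ID in an `href` attribute is an assumption, and the counter and speaker conventions are left out because their exact form is not spelled out here.

```python
import re


def to_html(text):
    """Convert a subset of the plain-text markup to HTML (sketch only)."""
    # Strong first, so that ** is not consumed by the single-* rule.
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    # Underscores mark Pali text quoted in another language.
    text = re.sub(r"_(.+?)_", r'<i lang="pli">\1</i>', text)
    # Assumption: the link ID goes straight into href.
    text = re.sub(r"\[(.+?)\]\((.+?)\)", r'<a class="cr" href="\2">\1</a>', text)
    return text


print(to_html("the *only* way"))
print(to_html("[Relinquishment 1](pli-tv-bu-vb-np1#13)"))
```

The sketch does no HTML escaping and would mangle a link ID containing underscores, so a real converter would need to be stricter.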
The reference numbers have been separated and keyed off the segment numbers. These are in a separate CSV file. I have retained the REF data in the PO files, and made all corrections in both places.
I have used the new and simpler HTML “starter” code, which leaves off everything before <section>.
I have added the missing text from Ss 13. This required re-segmenting it from the beginning.
DN 6 fixes a numbering issue.
AN 4.106 is a ghost text. We supply a file to explain that it is missing.
The AN Elevens got seriously borked on Pootle; I have corrected them.

Things to be done

Blake

Each set of reference numbers must be checked for sanity and completeness. See “Correcting the segment numbers” below.
The msgctxt numbers must be re-incremented, as certain segments have been removed or merged.
Check that NOTE displays as comments in Pootle.
Upload all texts to Pootle and check that they work.
Export texts to the site.

Later

These things can be left until the entire collection is done.

Check and ensure heading levels are correct and consistent.
Semantic labelling of headings and sections needs to be made consistent.

About the reference data

Here are the reference types we have, and what they mean.

pts-cs-seg: The pts-cs numbers, with an added level to subdivide for the segments.
pts-cs: The chapter and section numbers for the PTS Pali (and English) edition. These need to be checked to ensure they match up properly with the pts-cs-seg numbers. Once done, the pts-cs can be deleted, as they are redundant.
sc: The SuttaCentral paragraph numbers. These are created from the ms numbers, adjusted so that they are a simple increment from the start of each sutta (i.e. an HTML file on SC). As with the pts-cs numbers, they are redundant, so we should check that they match correctly with the ms numbers, then delete them.
ms: These are the primary reference system of the Mahasangiti edition. They should be retained. Once we are confident that they are correct, we can use them as the key to import the remaining ms reference data for multiple editions, which is found here.
pts-vp-pli: Volume/page for the PTS Pali edition.
pts-vp-en: Volume/page for the PTS English edition.
msdiv: They are from the Mahasangiti edition, and equal the paragraph numbering in the VRI source text. Only in SS 13.

Correcting the segment numbers

Sometimes the numbering of extra segments is incorrect. The basic principle is that no extra segment should interfere with the numbering of the original text. Now, in our edition we have cases like this:

Added headings using zeroth numbers

pli-tv-bu-vb-ay1:0.6
pli-tv-bu-vb-ay1:1.0a
pli-tv-bu-vb-ay1:1.1

This is fine. The extra heading appears as the numbering restarts, and is accommodated with a zero. Zero level numbers can, of course, be added in the Mahasangiti itself, as the heading levels are not incorporated as numbered sections. So adding the -a suffix makes it explicit that this is in the translation alone.

Added segments in sequence

Sometimes the sequence does not restart yet we have an added segment, again usually a heading.

pli-tv-bu-vb-ay1:1.35
pli-tv-bu-vb-ay1:1.35a
pli-tv-bu-vb-ay1:1.36

Here the extra segment has the same number as the previous, with an added letter suffix. This is also correct.

Added segments that mess the sequence

However, sometimes we have cases where the extra segments mess with the numbers:

pli-tv-bu-vb-np9:1.42
pli-tv-bu-vb-np9:1.43a
pli-tv-bu-vb-np9:1.44

This is incorrect; it should be:

pli-tv-bu-vb-np9:1.42
pli-tv-bu-vb-np9:1.42a
pli-tv-bu-vb-np9:1.43

So we shall have to test for such cases and fix them.
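One way such a test could look is sketched below in Python. It assumes segment IDs have the form `uid:number` with an optional single-letter suffix, and it treats a suffixed segment as valid only if it reuses the number of the segment before it or sits at a zeroth position (last component 0). The names `parse_segment_id` and `find_bad_extra_segments` are hypothetical, and the rule is my reading of the three cases above rather than a complete specification.

```python
import re

# uid, dotted number, optional one-letter suffix, e.g. pli-tv-bu-vb-np9:1.43a
SEG_RE = re.compile(r"^(?P<uid>[\w-]+):(?P<num>\d+(?:\.\d+)*)(?P<suffix>[a-z]?)$")


def parse_segment_id(seg_id):
    """Split a segment ID into (uid, number tuple, suffix)."""
    m = SEG_RE.match(seg_id)
    if m is None:
        raise ValueError(f"unrecognised segment ID: {seg_id}")
    num = tuple(int(part) for part in m.group("num").split("."))
    return m.group("uid"), num, m.group("suffix")


def find_bad_extra_segments(seg_ids):
    """Return suffixed segments whose number disturbs the sequence."""
    bad = []
    prev_num = None
    for seg_id in seg_ids:
        _, num, suffix = parse_segment_id(seg_id)
        if suffix:
            reuses_previous = num == prev_num  # e.g. 1.35 then 1.35a
            is_zeroth = num[-1] == 0           # e.g. 0.6 then 1.0a
            if not (reuses_previous or is_zeroth):
                bad.append(seg_id)
        prev_num = num
    return bad


print(find_bad_extra_segments([
    "pli-tv-bu-vb-np9:1.42",
    "pli-tv-bu-vb-np9:1.43a",
    "pli-tv-bu-vb-np9:1.44",
]))
```

This only flags the offending segment. Fixing it means renumbering everything that follows in that section, which is a separate step.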

karl_lew · September 17, 2018, 4:22pm

For auto-recitation, I have been stripping html found in text segments ( e.g., <i>...</i>). Will I need to strip full markdown within text segments?

I just need a simple way to access semantics. So far the text segments are awesome since they are normally plain-text. Embedded HTML is rare.

sujato · September 18, 2018, 8:17am

I have actually been working on this right now. The aim is that there will be no HTML in text segments, but there will occasionally be plain-text markup like Markdown. This can either be stripped or converted to HTML etc as needed.

Having said which, surely the HTML tags, or at least some of them, are useful for voice? An <em> tag denotes emphasis; an <i> tag denotes a word in a foreign language, usually defined with a lang attribute.
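For the stripping case, something like the following Python sketch might be enough. It assumes only the inline conventions from the first post (emphasis, strong emphasis, quoted Pali, links) need handling, and `strip_markup` is a hypothetical name, not part of any existing tool.

```python
import re


def strip_markup(text):
    """Reduce a marked-up segment to plain text for recitation (sketch only)."""
    # Links: keep the displayed text, drop the ID.
    text = re.sub(r"\[(.+?)\]\((.+?)\)", r"\1", text)
    # Strong before emphasis, so ** is handled before single *.
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    # Quoted Pali: keep the words, drop the underscores.
    text = re.sub(r"_(.+?)_", r"\1", text)
    return text


print(strip_markup("with *appropriate* substitutions"))
```

If the voice software later gains a way to mark stress or language, the same patterns could emit that instead of discarding the information.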

karl_lew · September 18, 2018, 12:04pm

The only emphasis we can place in voice is pauses. Bold vs. italic is indistinguishable. Also, Markdown has headings, tables, links, images and more. There is no voice equivalent for “#” vs. “##”, however, it will be easy to remove the “#” and simply put in a larger pause of 2 seconds perhaps.

sujato · September 20, 2018, 12:49am

Sure; but <em> is not italic. It’s visually represented by italics in browsers, but it should be represented by verbal stress when read. But I’m guessing that’s not possible? It’s not a big deal, as it is used sparingly in formal contexts, but there are a few places it is warranted.

karl_lew · September 20, 2018, 12:51am

I can put in periods of silence (breaks), but we’re already using some for inserted Pali words as well as paragraphs. I can make sections louder, but I’m uncertain if that’s the effect desired?

sujato · September 20, 2018, 7:04am

Yeah, it doesn’t sound like a great idea. If the voices don’t contain a mechanism for stressing certain words, then just don’t worry about it.

karl_lew · September 26, 2018, 6:43pm

Bhante @sujato, was the omission of the “en” folder in the pli-tv folder intended? Without it we would have to assume that all pli-vi content is English. This was a change from “pi-vi/en”.

sujato · September 27, 2018, 1:50am

Good question! @blake?

These texts are still settling down, give us a few days.

sujato · September 27, 2018, 12:59pm

@Brahmali @greenTara @sabbamitta

We’re back! The Vinaya texts have now been completely updated and are available on Pootle.

So please start work when you’re ready!

sabbamitta · September 27, 2018, 2:05pm

Will come back to you if I get stuck somewhere…

(from here)

The same here too.

sujato · September 28, 2018, 2:14am

Oops, some bad news folks. Brahmali has discovered some file errors, due to a simple coding bug. Blake will fix it when he is back online, but I recommend not doing any work on the Vinaya until then.

sujato · September 30, 2018, 11:50pm

@Brahmali Blake has fixed the problems, I hope, and the files have been refreshed. Can I ask you to check once more?

greenTara · October 1, 2018, 12:50am

I was just starting to look at

https://pootle.suttacentral.net/en/pli-tv/translate/pli-tv-bu-vb/pli-tv-bu-vb-pc/pli-tv-bu-vb-pc51.po

and I see that this segment is missing

#. </p><p>
msgctxt "pli-tv-bu-vb-pc51:0.3"
msgid "Pācittiyakaṇḍa"
msgstr "The chapter on confession (<i lang=\"pli\">pācittiya</i>)"
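A quick way to confirm whether a segment is really absent from a downloaded PO file, rather than just hidden by the interface, is to list the msgctxt values directly. This is a minimal Python sketch with hypothetical function names; it assumes each msgctxt sits on its own line in the form shown above and does not attempt to parse the full PO format.

```python
import re

# Matches lines such as: msgctxt "pli-tv-bu-vb-pc51:0.3"
MSGCTXT_RE = re.compile(r'^msgctxt "(.*)"$', re.MULTILINE)


def list_msgctxt(po_text):
    """Return every msgctxt value in the PO text, in file order."""
    return MSGCTXT_RE.findall(po_text)


def has_segment(po_text, seg_id):
    """True if the given segment ID appears as a msgctxt."""
    return seg_id in list_msgctxt(po_text)


sample = 'msgctxt "pli-tv-bu-vb-pc51:0.3"\nmsgid "Pācittiyakaṇḍa"\nmsgstr ""\n'
print(has_segment(sample, "pli-tv-bu-vb-pc51:0.3"))
```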

sujato · October 1, 2018, 12:54am

? Sorry, the segment shows for me.

greenTara · October 1, 2018, 1:10am

That’s really strange.

So it shows me that there are 76 segments, and the third one has index 0.4.

It also looks like I don’t have editing privileges.

EDIT: now I see 77 segments, and index 0.3 is showing up. Don’t know what that was about - maybe there was something cached.

sujato · October 1, 2018, 3:38am

Yes, probably a cache error. I’ll look into assigning permissions later today.

sujato · October 2, 2018, 12:58am

I have now created a Pootle project for the Vinaya, and assigned Brahmali, Tara, and Sabbamitta as editors.

This means that everyone should be able to get on with their work when ready.

May I ask Tara and Sabbamitta to let me know when you start work. I’d like to work with you in the beginning and make sure we’re all on the same page.

Note that the permissions enable Sabbamitta and Tara to edit anything in the Vinaya. So please make sure you don’t edit one of Brahmali’s edits by mistake! This shouldn’t usually be a problem, but just beware that it can happen.

greenTara · October 2, 2018, 2:14am

I’ve now started on pc51, following the example of pc41 as a template, and have completed the front matter and first paragraph (of the English version).

I have ignored any html markup having to do with indexing, such as

<a class="pts-cs" id="Bu-Pc.51.1">Bu-Pc.51.1</a><a class="pts-vp-en" id="BD.2.382">BD.2.382</a><a class="wt-pa">1018</a>

<a class="wt-pa">1019</a>

and

<a class="pts-vp-pi" id="Vin.4.109">Vin.4.109</a>

and then there is a </p> at the end, which doesn’t match the paragraph markup in the Pali.

Please advise if this is how you want it, or should the markup also be inserted in the translation box?

Also, there is a segment missing (1.01a) for the <h2 id="nidana">Origin story</h2> subheading.

sujato · October 2, 2018, 2:29am

I have to go to lunch now, I will be with you soon!

However, please review this first:

https://discourse.suttacentral.net/t/segmentation-of-ajahn-brahmalis-vinaya-texts/10451/110