Notes on the segmentation of Pali Vinaya with Brahmali's translation


Sādhu! Sādhu!! Sādhu!!!

Thanks so much @sabbamitta for hanging in there until this was completed. You have been a stalwart of this project. You have been incredibly helpful over such a long period of time. Without your support, this would have taken much longer. Not only that, but now that I have reviewed much of your work (and that of your two excellent colleagues), I can say that it is all of a very high quality. It’s rather amazing you have been able to do this without much knowledge of Pali.

The input project is now almost complete. There is a little bit left for @tracy to do on Kd14, but she too is getting very close. In a few days’ time we will be ready for the next stage. I am not sure how we proceed from here. @sujato?


Just let me know when it is done, and I’ll do the next step.

What that will involve is essentially this:

  1. Download the PO files from Pootle. From then on, no work, corrections, or anything should be done on these texts on Pootle.
  2. I will go over the files and massage them until they are all in a consistent and clean form:
    • Ensure markup is correct, deduplicate where necessary.
    • Deduplicate references and put them in data form (eg, <a class="sc" id="sc12"></a> will become sc12)
    • Ensure all meta content is on separate and labelled lines in PO files.
    • Ensure each kind of content is on one line per segment.
  3. Adjust segmenting
  4. Adjust paragraphing
  5. Run automated tests to ensure data reliability
  6. Convert to JSON.

To describe this all in more detail, let me give an example from Kd 9:

#. HTML: </p><p>
#. REF: sc2
msgctxt "pli-tv-kd9:1.2.1"
msgid "Atha kho kassapagottassa bhikkhuno etadahosi—"
msgstr ""
"<p><a class=\"pts-cs\" id=\"Kd.9.1.2\" href=\"#Kd.9.1.2\">Kd.9.1.2</a><a "
"class=\"ms-pa\" id=\"MS.3.1775\" href=\"#MS.3.1775\">MS.3.1775</a>Soon "
"afterwards Kassapagotta thought,"

When I have processed the PO file (step 2) this will look like:

#. HTML: </p><p>
#. REF: sc2, kd.9.1.2, ms.3.1775
msgctxt "pli-tv-kd9:1.2.1"
msgid "Atha kho kassapagottassa bhikkhuno etadahosi—"
msgstr "Soon afterwards Kassapagotta thought,"

Which I think you can see is much cleaner and better organized.
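For anyone curious how the step 2 cleanup could be automated, here is a minimal sketch in Python. It is not the actual script. It assumes the msgstr has already been joined into one string with the PO quoting removed, and the names `ANCHOR`, `PARA` and `clean_segment` are made up for illustration.

```python
import re

# Matches reference anchors such as <a class="sc" id="sc12"></a> or
# <a class="pts-cs" id="Kd.9.1.2" href="#Kd.9.1.2">Kd.9.1.2</a>
ANCHOR = re.compile(r'<a\s+class="[^"]*"\s+id="([^"]+)"[^>]*>[^<]*</a>')
PARA = re.compile(r'</?p>')  # paragraph tags already live in the "#. HTML:" comment

def clean_segment(existing_refs, msgstr):
    """Move reference ids out of the translation text and into a list."""
    refs = list(existing_refs)
    for ref in ANCHOR.findall(msgstr):
        ref = ref.lower()
        if ref not in refs:  # deduplicate
            refs.append(ref)
    text = PARA.sub("", ANCHOR.sub("", msgstr)).strip()
    return refs, text

raw = ('<p><a class="pts-cs" id="Kd.9.1.2" href="#Kd.9.1.2">Kd.9.1.2</a>'
       '<a class="ms-pa" id="MS.3.1775" href="#MS.3.1775">MS.3.1775</a>'
       'Soon afterwards Kassapagotta thought,')
refs, text = clean_segment(["sc2"], raw)
print("#. REF: " + ", ".join(refs))
print('msgstr "' + text + '"')
```

Run on the Kd 9 segment above, this prints the same REF and msgstr lines as the cleaned version.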

Next, I will review the segments to improve consistency and accuracy (step 3). In the PO files, any segments that have been marked as “needing work” will carry the fuzzy flag (written "#, fuzzy" in the file). So I go through all the fuzzy segments and resolve problems. I’ll also check generally for consistency and coherence in segmenting.
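As a rough illustration of how the fuzzy segments could be listed, here is a small Python sketch. It assumes entries are separated by blank lines and that each flagged entry has a "#, fuzzy" line, which is the standard gettext convention. The second segment ID and its "..." contents are placeholders, not real data.

```python
def fuzzy_contexts(po_text):
    """Return the msgctxt of every entry that carries the fuzzy flag."""
    found = []
    for entry in po_text.strip().split("\n\n"):  # entries are blank-line separated
        lines = entry.splitlines()
        flagged = any(l.startswith("#,") and "fuzzy" in l for l in lines)
        if not flagged:
            continue
        for l in lines:
            if l.startswith("msgctxt "):
                found.append(l[len("msgctxt "):].strip('"'))
    return found

sample = '''msgctxt "pli-tv-kd9:1.2.1"
msgid "Atha kho kassapagottassa bhikkhuno etadahosi—"
msgstr "Soon afterwards Kassapagotta thought,"

#, fuzzy
msgctxt "pli-tv-kd9:1.2.2"
msgid "..."
msgstr "..."
'''
needs_work = fuzzy_contexts(sample)
print(needs_work)
```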

The next stage will be to review the paragraphing, as I have recently done with the nikayas (step 4). To do this, I take advantage of a little quirk in the PO files: since they have HTML tags recorded as comments, with a few tweaks they can be made to render as actual HTML files! Then I can visually review the paragraph breaks by just opening the files in a browser. I will make the paragraphs conform to the normal rules, for example, paragraphs for each speaker in a dialogue passage. Generally speaking, the outcome will be to make more finely articulated and readable text by having shorter paragraphs; occasionally, however, it also means combining existing paragraphs.
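One way to picture the "few tweaks" is the sketch below. It is only a guess at the idea, not the actual procedure: it assumes cleaned files where each msgstr sits on a single line, and it simply emits the "#. HTML:" comments as real tags around the translation text. The function name `po_to_html` and the first sample segment are invented.

```python
import html

def po_to_html(po_text):
    """Turn '#. HTML:' comments into real tags and keep only the translation."""
    parts = []
    for line in po_text.splitlines():
        if line.startswith("#. HTML:"):
            parts.append(line[len("#. HTML:"):].strip())  # emit tag as markup
        elif line.startswith("msgstr "):
            parts.append(html.escape(line[len("msgstr "):].strip('"')))
    return "\n".join(parts)

sample = '''#. HTML: <p>
msgctxt "pli-tv-kd9:1.1.1"
msgid "..."
msgstr "An earlier segment."

#. HTML: </p><p>
#. REF: sc2, kd.9.1.2, ms.3.1775
msgctxt "pli-tv-kd9:1.2.1"
msgid "Atha kho kassapagottassa bhikkhuno etadahosi—"
msgstr "Soon afterwards Kassapagotta thought,"
'''
page = po_to_html(sample)
print(page)
```

Saving the result with an .html extension and opening it in a browser would show where each paragraph starts and ends.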

When adjusting the segments, the numbering of the segments gets put out of whack. This does not affect the reference numbers, only the msgctxt, which is the universal key for all information associated with that segment. So at the end of the process, Blake will re-generate the msgctxt numbers to ensure that they are all correct, sequential, and unduplicated (step 5). He will also run tests to ensure that the text remains exactly as it was before this process. We will also run tests to ensure all the markup is valid and correct, and all the reference numbers are sane (for example, checking if any page reference numbers are omitted or doubled).
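To make the three kinds of check concrete, here is a hedged Python sketch. It is not Blake's code. It assumes the msgctxt has the shape shown above (everything before the last dot is the section, the last number is the segment), and the function names and sample values are invented.

```python
from collections import Counter

def renumber(contexts):
    """Renumber the last component sequentially within each section prefix."""
    counters = Counter()
    out = []
    for ctx in contexts:
        prefix = ctx.rpartition(".")[0]
        counters[prefix] += 1
        out.append(f"{prefix}.{counters[prefix]}")
    return out

def same_text(before, after):
    """True if the text is unchanged apart from where the segment breaks fall."""
    return " ".join(before).split() == " ".join(after).split()

def check_pages(pages):
    """Return (missing, doubled) page numbers from a list of integers."""
    counts = Counter(pages)
    doubled = sorted(p for p, n in counts.items() if n > 1)
    missing = sorted(set(range(min(pages), max(pages) + 1)) - set(pages))
    return missing, doubled

ids = renumber(["pli-tv-kd9:1.2.1", "pli-tv-kd9:1.2.1",
                "pli-tv-kd9:1.2.4", "pli-tv-kd9:1.3.1"])
print(ids)
print(same_text(["a b", "c"], ["a", "b c"]))
print(check_pages([1775, 1776, 1776, 1778]))
```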

Up to now, we are still working with PO files, and they can, in principle, be re-uploaded to Pootle for further editing and so on. However, the aim is to move on to Bilara, so the next step is to adjust the data for Bilara (step 6). If the preparation work has been done well, this will be an automated process, merely duplicating the process that is being done at the moment for the nikayas. This will split the PO data into separate JSON files. Currently, in the PO files, we have in the same file: original text, translation, segment ID, reference numbers, HTML markup, variant readings, and comments, as well as PO-specific file data. Keeping all of this straight in the same file is ridiculous. So the idea is that this is cleanly separated by data type, and may be recombined at will, all coordinated by the universal ID supplied by the segment number (which in PO is called msgctxt).
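A minimal sketch of the split, in Python, might look like this. The key names (`root`, `translation`, `reference`, `html`) and the one-dict-per-type layout are assumptions for illustration, not a description of the actual Bilara file format.

```python
import json

def split_by_type(entries):
    """Separate combined PO entries into one mapping per data type,
    each keyed by the segment ID."""
    out = {"root": {}, "translation": {}, "reference": {}, "html": {}}
    for entry in entries:
        for kind in out:
            if entry.get(kind):
                out[kind][entry["id"]] = entry[kind]
    return out

entries = [{
    "id": "pli-tv-kd9:1.2.1",
    "root": "Atha kho kassapagottassa bhikkhuno etadahosi—",
    "translation": "Soon afterwards Kassapagotta thought,",
    "reference": "sc2, kd.9.1.2, ms.3.1775",
    "html": "</p><p>",
}]
files = split_by_type(entries)
for kind, data in files.items():
    # Each of these would be written to its own JSON file.
    print(kind, json.dumps(data, ensure_ascii=False))
```

Because every mapping uses the same segment ID as its key, any of them can be looked up or recombined independently of the others.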

To see how beautiful these look, check out SN 1.1.

Root text:



In the markup files, we have the HTML skeleton, fleshed out with the ID numbers.

By abstracting and separating concerns like this, we can combine these things across any language. The same set of references will work in Pali, English, Italian, or Thai. The same HTML markup will apply. If we like, we can apply comments across the different languages. None of this has been previously possible, because the relevant data is embedded in a file, and can’t be transferred from one context to another except by hand, which is exactly what you folks have been doing these past months. Now that you’ve done it, no-one else will have to. Yay!
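To show what recombining could look like, here is a small Python sketch. It assumes a markup mapping where each segment ID points to an HTML fragment with a "{}" placeholder for the text. That placeholder convention and the `render` function are assumptions made for this example.

```python
def render(markup, text):
    """Fill one shared HTML skeleton with the text of any language."""
    parts = []
    for seg_id, template in markup.items():
        parts.append(template.replace("{}", text.get(seg_id, "")))
    return "".join(parts)

markup = {"pli-tv-kd9:1.2.1": "<p>{}</p>"}
pali = {"pli-tv-kd9:1.2.1": "Atha kho kassapagottassa bhikkhuno etadahosi—"}
english = {"pli-tv-kd9:1.2.1": "Soon afterwards Kassapagotta thought,"}

print(render(markup, pali))
print(render(markup, english))
```

The same `markup` is used for both languages; only the text mapping changes.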

I would estimate roughly a month to get the above process completed.


Shiny!!! Hooray data organization.

I hope to finish this weekend, definitely will within the next week.


I assume you will be doing the entire Vinaya Piṭaka in one go. If so, this would mean no editing for the duration of one month, right?

The original text that I have entered on Notepad is already formatted in this way. Many (all?) of the paragraph breaks I inserted in the plain text file have been kept in the Pootle version. Most of the time all you need to do is to make use of the html paragraph tags to recreate paragraph breaks at the right place.

So once the month of processing is over, I may continue the editing on Bilara?

I see what you mean. :lying_face:

And because there is nothing more to do, saṃsāra comes to an end. :thinking:


Huhh… wow! I never thought we could end Saṁsāra by means of copy & paste… :joy: :rofl:


That’s correct.

Oh, excellent, well that makes that much easier. In that case, I will just do a brief review of the paragraphing. In any case, so long as it is generally okay, it can always be adjusted later; it is, after all, a matter of presentation rather than content.


In some cases the paragraphs don’t quite match with the segment breaks. They can easily be found searching for “</p><p>” within a segment.
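A quick way to run that search over a whole file could look like the Python sketch below. It assumes the segments are available as a mapping from segment ID to text, and it ignores a paragraph break that sits at the very start or end of a segment. The IDs and texts in the sample are invented.

```python
BREAK = "</p><p>"

def has_internal_break(text):
    """True if a paragraph break falls inside the segment, not at its edge."""
    core = text.strip()
    if core.startswith(BREAK):
        core = core[len(BREAK):]
    if core.endswith(BREAK):
        core = core[:-len(BREAK)]
    return BREAK in core

def mismatched_segments(segments):
    """Return the IDs of segments whose text contains an internal break."""
    return [sid for sid, text in segments.items() if has_internal_break(text)]

segments = {
    "example:1.1": "</p><p>A segment that starts a paragraph.",
    "example:1.2": "End of one paragraph.</p><p>Start of the next.",
    "example:1.3": "No break here.",
}
print(mismatched_segments(segments))
```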


Right, this would be part of item 2: moving all HTML to the segment level where possible. In cases where HTML markup cannot be moved to a segment level, for example inline emphasis, we use markdown, as we do here on Discourse. (Actually a specific SC version of markdown called nilakkhana.)


kd14 is entered! It was fun to read an accessible translation and dip into the Vinaya and Pali, thank you! Now I gotta find another task…


Yay!!! Well done @tracy! Your support has been very valuable. You have done a tremendous job.

I have now reviewed eight Khandhakas, including at least one from each one of you, and the quality is very high. There are occasional mistakes, of course, a lack of which would only be attributable to super-normal powers! Not that you haven’t got them, it’s just that I am sure you wouldn’t flaunt them here on the forum. :grinning:

I wish to thank all three of you once more for your generous and kind contribution to this project. I am hoping this Vinaya translation will be of use to monastics and others for a long time to come, at least several decades. What a wonderful thing it is to have this available on the web. And that’s thanks to the three of you!

I wish you all a long and joyful association with the Dhamma.

@sabbamitta @greenTara


Nice name! :orange_heart:


And I would like to thank you in return for patiently answering all our questions, silly or otherwise, and for accompanying our work, never short of encouraging words!

For me this has been a great opportunity to learn about both the Vinaya and Pali. Even if a systematic study of Pali still lies ahead of me, my knowledge and understanding now is so much better than when I started working on this project. Hopefully that will be very useful in other respects, so thank you for the opportunity! :anjal:


Dear friends, may I echo the celebratory words of Ven Brahmali, and add to the list Brahmali himself! It all looks like it’s coming together fantabulously.

If I understand correctly, everything is done now, is that right? If the work from your part is complete, I’ll download it and get to work.


Yes, you know, sort of. I was hoping to review the input before you download it. I’ve done 9 out of 22 Khandhakas so far. But perhaps it is not required? Or rather, perhaps I can do this at a later stage?

One of the problems is that segmentation of the Pali is often awkward. This will make the line-by-line display on SuttaCentral seem awkward too. I was hoping to go through all of this and streamline it. I am wondering, however, whether this can be done on Github, once everything has been uploaded there? Or is the Pali segmenting going to be fixed and unchangeable, as it was in Pootle?


Well, it’s up to you. Once I have finished my work, the whole text will be much cleaner and more consistent, which would make it easier for you. So it really just depends on how you want to work. If it’s something that can be readily done on Pootle, then by all means go ahead. Or if you are happy to work offline also, that is fine, but it may be better to wait until I have done my bit first.

The segmenting can be adjusted; it will not be as rigid as it is on Pootle (which is really just a problem with Pootle’s database). However, it is best to get it right first up and keep any later adjustments to a minimum.

I’m wondering whether you want to make similar adjustments to the Vibhangas?


When it comes to Pali segments that need merging, all I can do on Pootle is mark them with “needs work”. It’s not all that satisfactory, I feel.

Not sure. But it will take me a while to go through the entire Vinaya. If you have the time right now, I think it’s probably better for you to just go ahead. So I say, go for it!

Is this because the segment numbering gets out of whack?


And also, I am still making changes, so if you would please let me know the exact time you intend to download it, that would be very helpful.


Okay, keep going for now, I will tell you when I’m ready to start.


A suggestion for the modification of segment breaks:

In passages like Atha kho āyasmā kaccānagotto yena bhagavā tenupasaṅkami; upasaṅkamitvā bhagavantaṃ abhivādetvā ekamantaṃ nisīdi. Ekamantaṃ nisinno kho āyasmā kaccānagotto bhagavantaṃ etadavoca: (here from the Kaccānagottasutta; but in the Vinaya there are plenty of such instances) the segment usually breaks after ekamantaṃ nisīdi, and then it leaves something like “and said” for the next segment. Wouldn’t it make more sense to merge segments of this sort?


I’ll check. But the basic logic of segment breaks is primarily based on the Pali. If a translation is only a word or two, or indeed nothing at all, it doesn’t really matter.