Major changes to paragraphs in 4 nikayas

sujato · May 22, 2019, 6:39am

Summary

I have just completed a major set of corrections and adjustments to the paragraphing of the four main nikayas. Previously our paragraphs simply inherited those of the Mahasangiti edition on which our Pali text is based. These are generally sane, but far from perfect.

Improvements have been made in the following main areas:

Use standard conventions for dialogue: each speaker gets a new paragraph.
Ensure proper semantic divisions of doctrinal passages.
Ensure abbreviated passages present properly.

To this end, over 12,000 new paragraph tags have been added, and a few hundred removed.

In addition, I made a number of other adjustments:

Prefer using comma rather than colon to introduce direct speech. Exceptions include such things as quotes of doctrinal statements.
Many minor corrections and adjustments along the way.
Use “gentleman” for kulaputta.

Process

To get it done, I used a quirky feature of our PO files: since they contain HTML tags in the notes, they can be renamed, and with a couple of tweaks, work as HTML files. This allowed me to view the presentation as I worked. In addition, it allowed me to run HTML Tidy over all files, picking up a number of errors.

My aim was to produce something that would require as little further processing as possible.

Segment conventions

Trailing space is correct as is, in both English and Pali. All segments include trailing space, except where they end with em-dash —. (It seemed to me this is the simplest solution, since no processing is required.)
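A rough way to check the trailing-space rule might look like the sketch below. The function name `has_valid_ending` is hypothetical and not part of any SuttaCentral tooling. It only tests the rule as stated above: a segment ends with a space, or with an em-dash.

```python
EM_DASH = "\u2014"  # the em-dash character named in the convention


def has_valid_ending(segment: str) -> bool:
    """Return True if the segment ends with a space or with an em-dash.

    Hypothetical helper: it encodes only the rule described in the post.
    """
    return segment.endswith(" ") or segment.endswith(EM_DASH)
```

Running this over every English and Pali segment would flag any segment that has lost its trailing space.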

The text follows nilakkhana. There are only two cases of markup, both in English only. These will have to be transformed on export to HTML:

For list markup: ~ → <li>
- Let the list items self-close. There is no need to add <ol> tags, these are already present.
For emphasis: \*(.*?)\* → $1
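As a minimal sketch, the two export transformations could be applied like this. The function name `export_segment` is hypothetical. The emphasis wrapper tag is an assumption too, since the replacement shown above survives only as `$1`; it is left as a parameter for that reason. The sketch also assumes `~` occurs only as the list marker.

```python
import re

# Same pattern as in the post: non-greedy match between two asterisks.
EMPHASIS = re.compile(r"\*(.*?)\*")


def export_segment(text: str, emphasis_tag: str = "em") -> str:
    """Apply the two markup transformations to one English segment.

    emphasis_tag is an assumption: the post does not show which tag
    wraps the emphasised text.
    """
    # List markup: ~ becomes an unclosed <li>, since list items
    # self-close and the <ol> tags are already present.
    text = text.replace("~", "<li>")
    # Emphasis: wrap the captured group in the chosen tag.
    return EMPHASIS.sub(
        lambda m: f"<{emphasis_tag}>{m.group(1)}</{emphasis_tag}>", text
    )
```

For example, `export_segment("~ The *first* item ")` gives `<li> The <em>first</em> item ` under the default assumption.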

In line with the Vinaya, all meta-info for segments, i.e. every line starting with #, has an explicit type. Here are examples of each type:

#. HTML: 
# NOTE: See BB.
#. VAR: bhaddasāritīre → kaddamadahatīre (bj, s1-3, km, pts1)
#. REF: pts-vp-pli5.477, sc1

Notes:

There may be zero or one instances of HTML, REF, and NOTE per segment.
There may be multiple VAR.
The sequence of types is not necessarily consistent.
Only NOTE lacks a trailing period after the #.
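The rules in these notes can be sketched as a small validator. This is an illustration only: `parse_meta` and its error messages are hypothetical. It assumes the meta lines for a single segment are passed in as a list of strings.

```python
import re
from collections import Counter

# "#" with an optional period, a space, the type, a colon, then the value.
META = re.compile(r"^#(\.?) (HTML|NOTE|VAR|REF):\s?(.*)$")


def parse_meta(lines):
    """Parse the meta lines of one segment into (type, value) pairs.

    Raises ValueError if a line has no explicit type, if the period
    after '#' does not follow the convention, or if HTML, REF or NOTE
    occurs more than once. VAR may repeat, and order is not checked.
    """
    entries = []
    for line in lines:
        match = META.match(line)
        if match is None:
            raise ValueError(f"meta line without explicit type: {line!r}")
        dot, kind, value = match.groups()
        # Only NOTE lacks the period after '#'.
        if (kind == "NOTE") == bool(dot):
            raise ValueError(f"unexpected prefix for {kind}: {line!r}")
        entries.append((kind, value.rstrip()))
    counts = Counter(kind for kind, _ in entries)
    for kind in ("HTML", "REF", "NOTE"):
        if counts[kind] > 1:
            raise ValueError(f"more than one {kind} line in segment")
    return entries
```

Because the sequence of types is not necessarily consistent, the sketch returns the entries in file order rather than sorting them.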

File conventions

The HTML for each file is complete, and ends with the required close tags. There is no need to add close </blockquote>, </section>, etc.

Versions

The added and deleted paragraphs are not explicitly indicated in the final commit.

However, if at any point someone wishes to reconstruct the added <p> tags, they are indicated with class="added" in this commit.

To recover the deleted paragraph tags: the original sc numbers were added per paragraph. So for any segment that contains an sc number, but has no <p> (or other block-level tag), a <p> can be inferred.
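The inference for deleted paragraph tags could be sketched roughly as follows. Several things here are assumptions rather than facts from the files: that the sc number appears in the segment's REF value (as in the `sc1` example above), that the markup is in the HTML value, and that the list of block-level tags in `BLOCK_TAG` is adequate. The function name is hypothetical.

```python
import re

# An sc number such as "sc1" inside a REF value.
SC_NUMBER = re.compile(r"\bsc\d+\b")
# Opening block-level tags; this particular list is an assumption.
BLOCK_TAG = re.compile(
    r"<(p|blockquote|section|article|div|h[1-6]|ol|ul|li)\b", re.IGNORECASE
)


def paragraph_was_deleted(ref, html):
    """Return True if a paragraph tag can be inferred for this segment.

    A segment qualifies when its REF value carries an sc number but its
    HTML value opens no block-level element.
    """
    has_sc = bool(SC_NUMBER.search(ref or ""))
    has_block = bool(BLOCK_TAG.search(html or ""))
    return has_sc and not has_block
```

A closing tag on its own, such as `</p>`, does not count as a block-level tag here, because only opening tags are matched.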

karl_lew · May 22, 2019, 1:50pm

Thank you for the advance notice, Bhante. About when might this updated content be available on staging? We shall need to update Voice in coordination with you.

sujato · May 22, 2019, 8:54pm

Soon. The ball is in Blake’s court now. Assuming I’ve done my job well—which is always possible if not probable—it can be pushed right away to staging and production. It should be days rather than weeks.

There are not many changes in the translation per se, although there are some; and there are an even smaller number of cases where I have changed the segment breaks (fewer than a dozen). But mostly this change is only in the HTML so it won’t affect you too much.

mikenz66 · May 22, 2019, 9:20pm

Bhante, Will this break current sutta links like: SuttaCentral ?

karl_lew · May 22, 2019, 11:28pm

Perfect! Thanks for letting us know. That does help us plan.

sujato · May 23, 2019, 5:59am

No, it doesn’t affect the IDs of the segments, just the structural markup of them.

Having said which, I did adjust a very small number of segment IDs, and any link to those segments would be now out by one. But this affects a fraction of a percent of all segment numbers.

karl_lew · May 23, 2019, 11:49am

Unlike SuttaCentral itself, Voice currently has no link dependencies on segment numbers. The numbers are only used for display. In the future, however, it might prove interesting/valuable (?) to provide urls to a segment spoken in a particular language by a particular speaker for a particular supported translation.

Snowbird · June 5, 2019, 4:00am

Bhante, I’m wondering if there is an update on the status of these changes. And will the changes automatically appear in the epub files? Thanks!!!

sujato · June 5, 2019, 6:55am

I’ll be talking to Blake in a couple of hours, and we’ll see if we can get the changes pushed through. He’s been making a bunch of corrections and refinements, and we want to make sure this is all 100% before going live. You can see Blake’s work here:

Snowbird · June 14, 2019, 7:03am

I don’t mean to pester, but I just wanted to see if the changes had been pushed out. I’m not able to understand what the github feed means.
Mostly I’m interested in creating Kindle versions of the epubs and I don’t want to do that before the changes go out. I’m really happy about the new paragraphing.
So I’d also like to know if the links on the download page to the e-pubs are generated on the spot from the current public version of the site.

sujato · June 14, 2019, 7:07am

Sorry about the delay, I too am trying to get it pushed out! But Blake is travelling at the moment, hopefully it will be ASAP.

Not on the spot per se, but they are re-generated every time we update the text.

Snowbird · June 14, 2019, 7:57am

Thanks! So in effect, it sounds like the links always give you the latest.

I really appreciate all the work people do behind the scenes to make this site possible.

Snowbird · June 27, 2019, 4:51am

Will you be making an announcement when the changes go public, or is it good for us to ask for periodic updates?