SuttaCentral

Opinions on breaking dialogue paragraphs


#1

I thought I would try to break up paragraphs with multiple speakers into what I understand to be more traditional paragraphs. I did this simply by searching for instances of a closing quote followed by a space followed by an opening quote.
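As a rough illustration of that search, here is a minimal Python sketch. It assumes plain text with curly quotes; the function name `split_dialogue` and the sample sentence are made up for the example, and the real ebook files would also have markup to deal with.

```python
import re

# A closing curly quote, one space, then an opening curly quote:
# treated here as a change of speaker.
SPEAKER_CHANGE = re.compile(r"(?<=”) (?=“)")


def split_dialogue(paragraph: str) -> list[str]:
    """Split one merged paragraph into one paragraph per speaker turn."""
    return SPEAKER_CHANGE.split(paragraph)


# Invented sample text, only to show the effect.
merged = "“How are you?” “I am well.” “That is good.”"
for part in split_dialogue(merged):
    print(part)
```

The lookbehind and lookahead keep both quote marks in the output, so only the space between the two speakers is consumed.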

Here are the results. I would love to hear people’s opinions about them.

Kindle: www.dropbox.com/s/ig8wnqwfj2yjpxx/Pgh-Split_%20Majjhimanikaya_%20Midd%20-%20Bhikkhu%20Sujato.azw3?dl=1
EPUB: https://www.dropbox.com/s/k3yob375gs5ugos/Pgh-Split_%20Majjhimanikaya_%20Midd%20-%20Bhikkhu%20Sujato.epub?dl=1

Personally I’ve never liked the merged paragraphs, but I understood the need, especially in print editions. And I know that if one is reading with 100% concentration, it’s not an issue. However, once I looked at the final product, I felt an instant sense of ease at how clear it was to read.

As a related issue, I’ve always wanted to ask whether there are plans to include the traditional open quotes on paragraphs that continue with the same speaker from the previous paragraph. I had assumed their lack was due to the segmentation.


#2

I unmerged some paragraphs on SuttaCentral. Here is the comparison:

Merged (actual):

Unmerged:


#3

Thanks for looking into this, the changes are definitely an improvement.

Improving the paragraphs has always been on my to-do list; I just haven’t gotten around to it. The current paragraphs are strictly inherited from the Pali Mahasangiti edition, and as you point out, they don’t follow the normal rules of paragraphing in English. It’s not a question of space-saving; it’s just what we have inherited.

I haven’t looked into it in any detail, but I would guess that using simple methods like yours we could get maybe 80% or 90% of them done. Even so, the whole text would still need to be gone over.

I am also not sure what method should be used. I’m sure there is a nice approach that won’t take too long, but I don’t know what that is, yet. I’ll look into it.

As for the open quote marks on paragraphs, we left these out for this very reason. When you change the paragraphs, it all goes awry. The current approach is simpler and more robust, and so far no-one has complained that it makes things less comprehensible. (In fact, I suspect few people even understand the normal convention.)


#4

This morning I did a little work on this; here is a proposed method.

Proposal for correcting paragraph breaks in the nikayas

Stage 1: easy pickings

We can probably get 80%–90% done with a few simple regexes.

The following will apply to all double quotes. Single quotes are often not candidates for their own paragraph, e.g. “It is said that ‘this is so’.” They will have to be handled manually. Also, this method will likely bug out with nested quotes.

  • Use the translation, not the root, as the basis (target msgstr).
  • Segment ends in :|?|!|.|—|” AND next segment begins with “[A-Z].
  • Does not apply where any structural markup exists (in PO files, such markup begins a line with a close tag, or is a <br>).

To start with, get the PO files consistently on four lines per segment.

  • "\n" -->
  • #(.*)\n --> #$1
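For anyone who prefers a script to an editor's find-and-replace, the two reductions above might be sketched in Python as follows. This is only a sketch: the function name `reduce_lines` and the sample segment ID are hypothetical, and it assumes that `#` does not occur inside the translated strings themselves.

```python
import re


def reduce_lines(po_text: str) -> str:
    """Apply the two line-reduction steps to the text of a PO file."""
    # Step 1: delete quote-newline-quote, which joins strings that
    # gettext has wrapped over several lines.
    text = po_text.replace('"\n"', "")
    # Step 2: pull each comment line up onto the line that follows it.
    text = re.sub(r"#(.*)\n", r"#\1", text)
    return text


# Invented sample entry; the ID "mn1:1.1" is a placeholder.
sample = (
    "#. </p><p>\n"
    'msgctxt "mn1:1.1"\n'
    'msgid ""\n'
    '"pali "\n'
    '"text"\n'
    'msgstr "English"\n'
)
print(reduce_lines(sample))
```

After this step the comment and `msgctxt` share a single line, so each segment has one line each for `msgctxt`, `msgid`, and `msgstr`, plus the blank separator.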

Then the following will add paragraphs. Mark them “added”.

(:|;|\?|!|\.|”)"\n\n^(?!#\. (</|<b))(.*)\n(.*)\nmsgstr "“([A-Z])

$1"\n\n#. </p><p class="added">$3\n$4\nmsgstr "“$5
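The same search and replace could be tried out in Python roughly like this. It is a sketch under a few assumptions: the input has already been through the line reduction described above, the segment IDs in the sample are placeholders, and the function name `add_paragraphs` is made up. The group inside the lookahead is written as non-capturing here, so the group numbers differ from the `$3`, `$4`, `$5` used above.

```python
import re

# re.MULTILINE is needed so that ^ matches at the start of each line.
ADD_PARA = re.compile(
    r'(:|;|\?|!|\.|”)"\n\n^(?!#\. (?:</|<b))(.*)\n(.*)\nmsgstr "“([A-Z])',
    re.MULTILINE,
)


def _mark_added(match: re.Match) -> str:
    punct, first_line, second_line, capital = match.groups()
    # Rebuild the matched text with an "added" paragraph marker in front.
    return (
        f'{punct}"\n\n#. </p><p class="added">{first_line}\n'
        f'{second_line}\nmsgstr "“{capital}'
    )


def add_paragraphs(po_text: str) -> str:
    """Insert an 'added' paragraph break before segments that open a quote."""
    return ADD_PARA.sub(_mark_added, po_text)


# Invented two-segment sample.
sample = (
    'msgctxt "mn1:1.1"\nmsgid "a"\nmsgstr "“How are you?”"\n\n'
    'msgctxt "mn1:1.2"\nmsgid "b"\nmsgstr "“I am well.”"\n'
)
print(add_paragraphs(sample))
```

Segments whose first line already begins with `#. </` or `#. <b` are left alone by the negative lookahead, which is how existing structural markup is skipped.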

Finally reverse the line reduction regex:

  • >(#|msgctxt) --> >\n$1
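In Python, that reversal might look like the sketch below. The function name `restore_lines` and the sample ID are hypothetical. Note that it only restores a line break after a `>`, as the regex above does; it does not re-wrap long strings, and a comment that does not end in `>` would stay joined to the following line.

```python
import re


def restore_lines(po_text: str) -> str:
    """Put a newline back between a markup comment and what follows it."""
    return re.sub(r">(#|msgctxt)", r">\n\1", po_text)


# Invented sample line, as it would look after paragraphs were added.
reduced = '#. </p><p class="added">msgctxt "mn1:1.2"\nmsgid "b"\nmsgstr "B"\n'
print(restore_lines(reduced))
```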

Upload the updated files to Pootle. Make them visible on staging.

Stage 2: the hard stuff

There are several cases that this won’t fix.

  • Single quotes
  • Nested quotes
  • Unwanted paragraph breaks in original
  • Additional paragraphs that are not quotes.

So far as I can see, all these cases must be dealt with by hand.

  • View cooked text on staging or local environment side by side with Pootle.
  • Use the comment field: p+ means “add paragraph”, p- means “remove paragraph”.
  • This will be a bit fiddly if there is already a comment!
  • Be sure to add comments to the segment before which a change is needed.
  • Go through each text and add marks.
  • When ready, download text and use regex to add or remove markup as appropriate.
  • Reupload text to Pootle.
  • Check on site.
  • When happy, push to production.
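To make the p+ / p- step a little more concrete, here is one possible sketch in Python. Everything in it is an assumption rather than a description of the actual setup: it supposes that the comment field is exported as a gettext translator comment (a line starting with `# `), that paragraph markup sits on its own `#. </p><p…>` line, and that entries are separated by blank lines. The function name `apply_marks` and the segment IDs are invented.

```python
def apply_marks(po_text: str) -> str:
    """Add or remove paragraph markup according to p+ / p- comments."""
    result = []
    for entry in po_text.split("\n\n"):
        lines = entry.split("\n")
        marks = {line.strip() for line in lines if line.startswith("# ")}
        # Drop the p+ / p- marks themselves once they have been read.
        lines = [ln for ln in lines if ln.strip() not in ("# p+", "# p-")]
        if "# p+" in marks:
            # Assumed markup for a new paragraph before this segment.
            lines.insert(0, '#. </p><p class="added">')
        elif "# p-" in marks:
            # Remove an existing paragraph break before this segment.
            lines = [ln for ln in lines if not ln.startswith("#. </p><p")]
        result.append("\n".join(lines))
    return "\n\n".join(result)


# Invented sample: add a break before 1.2, remove the one before 1.3.
sample = (
    '# p+\nmsgctxt "mn1:1.2"\nmsgid "b"\nmsgstr "B"\n\n'
    '#. </p><p>\n# p-\nmsgctxt "mn1:1.3"\nmsgid "c"\nmsgstr "C"'
)
print(apply_marks(sample))
```

Other translator comments on the same segment are left untouched, which may ease the "already a comment" case mentioned above, though that would need checking against real exported files.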

#5

:pray: :pray: :pray:

Achieving English paragraph structure would make the texts much easier to follow for readers of English. The process as described is too complex for me to manage, but I’m happy to volunteer to help with the final checking.


#6

Thanks! For now, I’m just doodling, trying to figure out a way to do it reasonably efficiently. I’ll chat with Blake and see if we come up with anything.


#7

This is surely easier to read! To additionally preserve the original paragraphing in some way, one might give the extracted partial paragraphs an additional indentation.

I once created a “nano-index” of my favorite suttas on my homepage and formatted them manually for better readability (see the subpages at http://go.helms-net.de/txt/palikanon/index.htm ), but this manual reformatting takes a great deal of time [1]. It also takes some sensitivity, because the paragraphing also indicates a “weighting”, accenting, or pausing between parts of the talks, and it might not be the best idea to just wipe this out completely and unrecoverably … (I even used boldface and underlining where I thought it would help the reader, but I wouldn’t propose such subjective formatting for any larger corpus.)


[1] Which I am not criticizing here, btw. On the contrary, I liked working with those texts so intensely and mindfully. And while I started that project in 2008 (and have left it at that so far), I’m used to downloading important texts and reformatting them for my own needs. It’s really a meditation and a way of befriending the text!


#8

So just to let y’all know, I have added extra paragraphs using the method outlined above (with a few tweaks). This means that the regular paragraphs that are defined with quotes should become visible when the site is updated. This won’t be complete or perfect, but it is a step in the right direction.

From here, I think we’ll need to make further updates by hand. We will figure out a way to do this. But currently our Pootle translation server is badly misbehaving; it keeps falling over for reasons unknown. So we’ll have to put further improvements on hold until further notice, I’m afraid.


#9

One consequence of the new paragraphs is that the paragraph numbers don’t really indicate paragraphs anymore.

Not sure if this is a problem, but pointing it out just in case.


#10

Just my $.02, but I don’t think it matters. If I was using the pgh number to link to a paragraph, I would probably want to link to the beginning of the dialogue anyway.


#11

Yes, that’s correct, the “numbers formerly known as paragraph numbers” in fact indicate the paragraph breaks of the underlying Mahasangiti edition (and also of the VRI Pali edition). So we keep them the same for compatibility reasons. More granular references are provided to the segment within the paragraph.