SuttaCentral text segment naming inconsistencies

…continued from Github 1056

We seem to have several different conventions for segment ids, all based on the Mahāsaṅgīti Pali.

dn1:1.1.1 So I have heard

mn1:1.1 Why is that?

mn1:28-49:1 They directly know water

Further, although grouping is implied in these segment ids, it is not always consistent. For example, the following segments actually come from a single paragraph group in the Pali original.

mn1:29-49.23 They directly know extinguishment …
mn1:50.1 But they shouldn’t conceive extinguishment …

Because of these inconsistencies, the current segment ids cannot be used to infer semantic grouping. Essentially, they can only serve as unique, invariant identifiers.
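To make the point concrete, here is a minimal sketch of how one might parse the ids quoted above. The `SegmentId` type and `parse_segment_id` function are hypothetical names for illustration, not part of any SuttaCentral code, and the rule "last number is the segment, everything before it is the section" is only my assumption about the intent of the ids.

```python
import re
from typing import NamedTuple


class SegmentId(NamedTuple):
    uid: str      # text identifier, e.g. "mn1"
    section: str  # everything between the uid and the final number
    segment: int  # final number: position within that section


def parse_segment_id(raw: str) -> SegmentId:
    """Split an id such as 'mn1:29-49.23' into uid, section and segment.

    Assumption: the separator before the final number may be '.' or ':',
    since both appear in the examples quoted in this post.
    """
    uid, _, rest = raw.partition(":")
    parts = re.split(r"[.:]", rest)
    if not uid or len(parts) < 2 or not parts[-1].isdigit():
        raise ValueError(f"unrecognised segment id: {raw!r}")
    return SegmentId(uid, ".".join(parts[:-1]), int(parts[-1]))


a = parse_segment_id("mn1:29-49.23")
b = parse_segment_id("mn1:50.1")
# Same paragraph in the Pali, but the ids place them in different sections.
print(a.section, b.section)
```

The ids parse without trouble, so they work as identifiers. The section part simply does not line up with the paragraph grouping in this case.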

Why do these inconsistencies matter?
For voice assistance, it is important not to overwhelm the listener. Sighted people are accustomed to skimming massive amounts of information at a glance. The visually impaired do not have the luxury of skimming and would be overwhelmed and rapidly bogged down in minutiae. Grouping can greatly improve the search experience for the visually impaired by providing progressive disclosure.

For example, a voice-assisted search for “pleasure in extinguishment” would provide a choice of phrases:

  1. extinguishment is mine, they take pleasure in extinguishment
  2. pleasure in extinguishment
  3. that extinguishment is mine, they don’t take pleasure in extinguishment
  4. that extinguishment is mine, he doesn’t take pleasure in extinguishment

For a given choice (e.g., #3), progressive disclosure could then offer this paragraph:

They directly know extinguishment as extinguishment. But they shouldn’t conceive extinguishment, they shouldn’t conceive regarding extinguishment, they shouldn’t conceive as extinguishment, they shouldn’t conceive that ‘extinguishment is mine’, they shouldn’t take pleasure in extinguishment. (MN1 Sujato)

However, given the inconsistency of labeling, what they would actually get is this:

But they shouldn’t conceive extinguishment, they shouldn’t conceive regarding extinguishment, they shouldn’t conceive as extinguishment, they shouldn’t conceive that ‘extinguishment is mine’, they shouldn’t take pleasure in extinguishment. (MN1 Sujato)

This example shows how segment naming inconsistency affects the user experience and leads to the omission of potentially useful information.
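The omission can be reproduced with a small sketch. It assumes that progressive disclosure expands a matched segment to every segment sharing its section prefix. The names `SEGMENTS`, `section_key` and `disclose` are hypothetical, and the two texts are the truncated ones quoted earlier in this post.

```python
# Two adjacent segments from MN1, as quoted above (texts left truncated).
SEGMENTS = {
    "mn1:29-49.23": "They directly know extinguishment …",
    "mn1:50.1": "But they shouldn’t conceive extinguishment …",
}


def section_key(segment_id: str) -> str:
    """Drop the final '.N' so that 'mn1:50.1' becomes 'mn1:50'."""
    return segment_id.rsplit(".", 1)[0]


def disclose(segment_id: str, segments: dict[str, str]) -> list[str]:
    """Return the texts of all segments in the same section as segment_id."""
    key = section_key(segment_id)
    return [text for sid, text in segments.items() if section_key(sid) == key]


# The listener chose a phrase that lives in mn1:50.1.
for line in disclose("mn1:50.1", SEGMENTS):
    print(line)
```

Only the second sentence comes back, because the first one carries the section prefix `mn1:29-49` rather than `mn1:50`.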

What should we do about it?
Given that changing segment ids in any way would be laborious, it’s probably not worth the effort. The user experience will be affected, but not horribly so. This post is essentially a disclaimer and an answer to future questions.


As I alluded to in my previous post, I think the problem here is simply that we are using a different edition, which divides the text slightly differently.

Just so you know, the general principle we use for our segments (and references generally) is:

  1. If available, use an established standard.
  2. If not, use the paragraphs of our edition (for Pali, the Mahasangiti).

In practice, for the Pali texts, this means:

  • For Digha and Vinaya, use section numbers from the PTS Pali editions (pts-cs). This is why some DN texts use extra section levels: it is from the PTS.
  • For Majjhima, use section numbers from the Nyanamoli translation (nya) as adopted by Ven Bodhi.

These have been quite widely adopted in translations and reference material.

  • For everything else, use the paragraphs. The Mahasangiti edition also based numbering on the same paragraphs, so these are compatible. These numbers are rarely used in any other referencing.
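Read as a lookup rule, the scheme above might be sketched like this. The dictionary keys and the function name are purely illustrative assumptions. Only the scheme labels `pts-cs` and `nya` come from the list above, and `"paragraphs"` is a placeholder label for the Mahasangiti paragraph numbering.

```python
# Which section numbering each collection follows, per the list above.
SECTION_SCHEME = {
    "digha": "pts-cs",   # PTS Pali edition section numbers
    "vinaya": "pts-cs",  # PTS Pali edition section numbers
    "majjhima": "nya",   # Nyanamoli translation section numbers
}
DEFAULT_SCHEME = "paragraphs"  # paragraphs of the Mahasangiti edition


def section_scheme(collection: str) -> str:
    """Return the numbering scheme for a collection name (case-insensitive)."""
    return SECTION_SCHEME.get(collection.lower(), DEFAULT_SCHEME)


print(section_scheme("Digha"))
print(section_scheme("Samyutta"))
```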

Segments add granularity to these sections.

Note that the implementation of this is currently incorrect in some respects on SC, and this is something we will resolve in coming months.

I’m not sure how it is inconsistent: it is used the same way throughout the text, so far as I can see.

It’s not perfect, but the section numbers do represent meaningful divisions in the text, as determined by the editors/translators of the editions upon which we rely. BTW, in the history of Pali/Buddhist studies, no-one else has ever done this!

In the example you give, the text as supplied by us is in fact semantic: it is a meaningful bunch of words. Sure, more context is provided by the preceding sentence, but this is just how it is with natural language. You have to draw a line somewhere, and it is necessarily arbitrary.

The segments are based primarily on punctuation (. , : ; ? ! …), reviewed and adjusted by hand. The goal was to create the smallest meaningful units. As a general rule, however, I preferred to keep doctrinal statements in a single segment. So, for example, the jhana formulas are one per jhana, even though they may have more than one sentence. In other cases, especially with repetitions, a segment may be a single word or, in translations, nothing at all.
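As a rough illustration of the punctuation-based first pass only, something like the following would produce candidate segments. This is a sketch under my own assumptions, not the tool actually used; `split_candidates` is a hypothetical name, and the hand review that merges doctrinal statements back together is not modelled at all.

```python
import re

# Break after any of . , : ; ? ! … when it is followed by whitespace.
_BREAK = re.compile(r"(?<=[.,:;?!…])\s+")


def split_candidates(text: str) -> list[str]:
    """Return candidate segments; a human editor would then adjust them."""
    return [piece for piece in _BREAK.split(text.strip()) if piece]


print(split_candidates("So I have heard. Why is that?"))
```

A purely mechanical split like this breaks at every comma, which is why the manual adjustment step matters.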


Dear Venerable

I was thinking about starting a new topic on a similar matter, but I found this topic and thought that maybe I could ask here.

I have some Chinese friends who are working on a computer program for word-by-word translation from Pali into other languages, and they are also thinking about adding an option to translate verses.
They have asked me whether I know of a way to obtain well-organized data for the Pali canon, so that they can extract the whole thing programmatically.
I thought of SuttaCentral because, at least for some percentage of the texts, we have parallel Pali and English suttas, but I have only basic computer skills, so I don’t know whether it is easy to get such data.


I asked Bhante Sujato a similar question recently, and he kindly pointed me at the suttacentral/translation repository on GitHub (storage for Pootle texts, managed by Pootle). This repository has a large subset of all the information on SuttaCentral itself. I am using it myself for generating audio files.


Sounds awesome! Can you put us in touch, or let us know what exactly they’re working on?

Sure, they can get it from our Github, mostly at the link supplied by Karl. But if we have a better idea what they want, we can help them get exactly what they need.
