Making sense of the segment numbering system

chaz · January 15, 2022, 5:05am

I’m looking at this page on segment numbering to understand what each value in a segment ID means.

It takes dn1:10.1 as an example: dn1 is the sutta, 10 is the section number, and 1 is the segment number. The actual segment, however, is dn1:1.10.3. What is the extra 1 in front of the 10? Or is it that the section number is 1.10?

My initial understanding was that all segment IDs will have the form [sutta]:[section_number].[segment_number]. How should I make sense of segments with more than 1 dot? It seems that the DN has up to 3 dots (for example, dn14:1.33.8.0), the MN has segments with 2 dots on 4 occasions, and the rest of the sutta pitaka has 2 dots on another 2 occasions (thag1.1:1.0.1 and thag1.1:1.0.2).

And I guess that whatever explanation applies here will also be applicable to the Vinaya translations?

And finally a side question: dn1, for example, can be referred to as a “discourse” or “sutta”. What’s the equivalent unit called in the Vinaya?

Thank you!

sujato · January 15, 2022, 5:44am

That page is actually a little out of date, sorry about that.

The basic principles still apply, but there are a few nuances.

Essentially, the system is:

sutta_number:section_number.segment_number

And most texts are like this.

However, in certain cases, notably dn and the Vinaya, we endeavored to keep our section numbers in line with the previously-published section numbers of the PTS editions. So what happened was that the PTS published certain texts with section numbers, but they didn’t do it for all texts, and people ended up not really using them and just using volume page instead. I think this is a big missed opportunity!

I didn’t want to unnecessarily impose a new section system where one already existed. So we made the decision to respect the pre-existing section numbers for dn and vinaya; and also Nyanamoli’s numbers for mn, but these don’t get into the extra level.

The cases where there are more deeply nested numbers are usually those places where the PTS edition used such systems. Maybe this was a mistake, maybe we should have just kept the three-part number throughout, but oh well!

So dn1:1.10.3 is section 1.10 in the PTS edition, segment three.
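As a rough illustration of that reading, here is a minimal Python sketch that splits an ID mechanically: everything before the colon is the text identifier, the last number is the segment, and whatever lies between is the (possibly nested) section. The names `SegmentId` and `parse_segment_id` are made up for this example, and it assumes every part after the colon is a plain integer.

```python
from typing import NamedTuple, Tuple


class SegmentId(NamedTuple):
    uid: str                  # text identifier, e.g. "dn1" or "thag1.1"
    section: Tuple[int, ...]  # one or more section levels, e.g. (1, 10)
    segment: int              # the final number, e.g. 3


def parse_segment_id(segment_id: str) -> SegmentId:
    """Split 'uid:a.b.c' so that the last number is the segment."""
    # Split on the colon first: the uid itself may contain dots (thag1.1).
    uid, _, numbers = segment_id.partition(":")
    # Assumption: every dot-separated part after the colon is an integer.
    parts = [int(p) for p in numbers.split(".")]
    return SegmentId(uid, tuple(parts[:-1]), parts[-1])
```

With this, `parse_segment_id("dn1:1.10.3")` yields section `(1, 10)` and segment `3`, while a two-part ID such as `dn1:0.2` yields section `(0,)` and segment `2`.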

The situation with thag is different, there the numbers merely distinguish a heading and homage section that precede the first verse. For these, we use zeroth-level numbering, and it’s just an unusual case that required an extra level. Similarly in mn10, it’s caused by the nested heading structure, otherwise they all have the normal form.

We usually just use “sutta” for everything, or maybe “text” if we’re being inclusive. There’s no specific term for “individual Vinaya unit”.

chaz · January 17, 2022, 3:31am

Awesome, thanks for the detailed response.

chaz · January 17, 2022, 6:47am

Would this be accurate?

DN:
  section_num = 0
    segment_num = 1 -> sutta number
    segment_num = 2 -> sutta title

MN:
  section_num = 0
    segment_num = 1 -> sutta number
    segment_num = 2 -> sutta title

SN:
  section_num = 0
    segment_num = 1 -> samyutta number
    segment_num = 2 -> vagga title
    segment_num = 3 -> sutta title

AN:
  section_num = 0
    segment_num = 1 -> book number
    segment_num = 2 -> vagga title
    segment_num = 3 -> sutta title
    segment_num = 4 -> sutta subtitle

I’m not sure if I’ve got the terms right - book/vagga/nipata etc…

sujato · January 18, 2022, 9:02am

Sorry, I don’t really understand what you’re doing here? Maybe illustrate with actual segment numbers?

chaz · January 18, 2022, 11:47pm

Sorry, I was just trying to understand which segments I should be looking at to extract the title of each sutta. So as examples:

For the DN, [sutta]:0.2 would correspond to the title.

For AN, I’m guessing it would be [sutta]:0.3

I think I’ve gotten it right but I just wanted to double check.

I think it’s all good though, thanks.

sujato · January 21, 2022, 6:57am

The most reliable way to extract sutta titles is to correlate with the html class='sutta-title'. But to do that you have to collate the text and HTML. Use bilara i/o for this.

Otherwise, the logic goes something like:

Find the last segment number at the start of the file that has a zero immediately after the colon (for example, dn1:0.2).

This will get you the sutta title in most cases.
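That rule can be sketched in Python roughly as follows. This assumes the file has already been loaded into a dict mapping segment IDs to text in file order (for instance with `json.load`, which preserves key order). The function name `find_title_segment` and the sample data are illustrative only.

```python
from typing import Dict, Optional


def find_title_segment(segments: Dict[str, str]) -> Optional[str]:
    """Return the last ID in the leading run of 'uid:0...' segments."""
    title_id = None
    for segment_id in segments:  # dicts keep insertion (file) order
        _, _, numbers = segment_id.partition(":")
        if numbers.split(".")[0] != "0":
            break  # the leading run of zero-numbered segments has ended
        title_id = segment_id
    return title_id


# Illustrative sample data, not copied from a real file.
sample = {
    "dn1:0.1": "Dīgha Nikāya 1 ",
    "dn1:0.2": "Brahmajālasutta ",
    "dn1:1.1.1": "Evaṁ me sutaṁ—",
}
title_id = find_title_segment(sample)
```

For the sample above, `title_id` is `"dn1:0.2"`. If a file has no zero-numbered segments at the start, the function returns `None`.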

However, you will run into complications with what we call “range suttas”. These are cases where files consist of a range of suttas. In this case, the method above will get you the title of the “range”. (Usually a chapter or an abbreviated repetition series.)

If this works for you, that’s fine. If not, probably matching with the sutta-title is the way to go.

We’re currently working on these range suttas; here is the full list.

github.com/suttacentral/suttacentral

Serve range suttas per-sutta.

suttacentral:master ← suttacentral:Hongda-issue2296

opened 03:34PM - 08 Dec 21 UTC

ihongda

e.g. http://localhost/an1.71-81/pli/ms?layout=sidebyside&reference=main&notes=s…idenotes&highlight=false&script=latin Currently, the above URL can be accessed normally, but when the user uses the URL http://localhost/an1.72/pli/ms?layout=sidebyside&reference=main&notes=sidenotes&highlight=false&script=latin the relevant scriptures cannot be displayed normally. This PR solves this problem.
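To make the range-sutta case concrete: a file ID such as an1.71-81 covers the individual suttas an1.71 through an1.81. Below is a small Python sketch for checking whether a single sutta ID falls inside such a range. It is only an illustration of the idea, not the code from the pull request, and the helper names `parse_range_uid` and `range_contains` are invented for this example.

```python
import re
from typing import Optional, Tuple

# A range uid ends in "<start>-<end>", e.g. "an1.71-81".
RANGE_UID = re.compile(r"^(?P<prefix>.*?)(?P<start>\d+)-(?P<end>\d+)$")


def parse_range_uid(uid: str) -> Optional[Tuple[str, int, int]]:
    """Return (prefix, start, end) for a range uid, or None otherwise."""
    match = RANGE_UID.match(uid)
    if match is None:
        return None
    return match["prefix"], int(match["start"]), int(match["end"])


def range_contains(range_uid: str, uid: str) -> bool:
    """True if the single-sutta uid falls inside the range uid."""
    parsed = parse_range_uid(range_uid)
    if parsed is None:
        return range_uid == uid  # not a range: only an exact match counts
    prefix, start, end = parsed
    if not uid.startswith(prefix):
        return False
    tail = uid[len(prefix):]
    return tail.isdecimal() and start <= int(tail) <= end
```

Here `range_contains("an1.71-81", "an1.72")` is true, while `an1.82` and `an2.72` fall outside the range.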

chaz · January 23, 2022, 10:54pm

WOW!! Bilara-io is amazing!!! I wish I had known about this 6 months ago! It even gives you the matching PTS refs at the segment level…

sujato · January 25, 2022, 10:15pm

Right! Nice to see someone who gets it!

It is, admittedly, a bit of a burden to set it up, but the results are great.

Hmm, I wonder if we could automate it as a web service? Would there be any demand? I’ll make a thread asking about this.

chaz · January 25, 2022, 10:51pm

I was thinking exactly the same thing. I’ve been toying with the idea of improving my serverless computing skills, and was wondering if there was a point making an API for it.

Also maybe it would be a good idea improving the documentation for bilara I/O? I’m still a bit vague as to its limits. Is it just an extraction tool for Sutta, vinaya and abhidhamma texts? Or is there more data in bilara-data that it can pull?

sujato · January 26, 2022, 10:32pm

Bilara i/o works with anything in its repo, i.e. bilara-data.

Currently this includes the entire Pali canon, and a growing selection of Chinese texts thanks to Charles.

We’re hoping to start adding Sanskrit soon.

Sure, if you like, any suggestions are welcome. (Actually I believe the README is somewhat outdated; Blake made pyexcel an optional dependency, you don’t need it normally, just export tsv.)

Gillian · March 23, 2023, 1:58am

I’ve read this and a couple of associated threads with interest in order to understand the nature of sections and segments, and found two definitions:

“A segment is a meaningful chunk of text, such as a sentence or a line of verse.” ref
“we endeavored to keep our section numbers in line with the previously-published section numbers of the PTS editions. So what happened was that the PTS published certain texts with section numbers, but they didn’t do it for all texts,” ref

So, for the purpose of understanding segment boundaries, the sections simply reflect a historical fact, and the decimal representation of the segment boundaries is not significant.

@sujato here are a couple of follow-up questions, asked because I’ve become involved in the project mentioned here:

(Just for interest) Who put in the paragraph breaks: the Mahāsaṅgīti editors, the PTS or SC?
Can we unpack “a meaningful chunk of text” a bit further? A discourse analyst or a psycholinguist would find the term unproblematic, but a grammarian could only work with “such as a sentence”. It’s clear that there’s not a one-to-one correspondence between sentence and segment. Do you recall how you were operating? (Maybe looking for ‘translatable chunks’ or remembering something like ‘chantable chunks’?)

sujato · March 23, 2023, 7:17am

Not significant for compatibility purposes.

The basics were done by VRI, inherited by Mahasangiti (and possibly changed in some cases, I haven’t checked this, but it’s basically the same), and then in some cases adjusted by myself or Brahmali when the text was clearly in error. That doesn’t happen often.

Is exactly it. It can only ever be approximate, but that’s the idea. We started by dividing on major punctuation (. : ; — ? ! …), then I corrected some thousands of instances when doing my translations. These days it’s pretty stable, but I still do make a few adjustments to ensure consistency.

github.com/suttacentral/bilara-data

segment fixes

opened 11:54PM - 10 Jan 23 UTC

sujato

dn3:1.3.2": "“yamahaṁ jānāmi taṁ tvaṁ jānāsi; ", "dn3:1.3.3": "yaṁ tvaṁ jānās…i tamahaṁ jānāmī”ti. ", "dn3:1.4.9": "Yadi vā so bhavaṁ gotamo tādiso, yadi vā na tādiso, tathā mayaṁ taṁ bhavantaṁ gotamaṁ vedissāmā”ti. ", etu kho, bhante, bhagavā. fix quotes: “sutaṁ metaṁ, bho gotama, samaṇo gotamo brahmānaṁ sahabyatāya maggaṁ desetī”ti. vuddhiyeva, licchavī, t’s fantastic that the Buddha is comfortable and well check lines around dn16:5.27.10, unclear where verse starts. sumuttā mayaṁ tena mahāsamaṇena dukkhā sāpekkhassa kālaṅkiriyā Paccānusiṭṭhavacanāpi aviññāpitatthā Ejā, bhikkhave, rogo _Yāva supaññattā
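For readers curious what the initial punctuation-based pass described above might look like, here is a minimal Python sketch. It is a guess at the general technique, not the script actually used: it only splits after runs of the listed marks (. : ; — ? ! …) and keeps any trailing whitespace with the preceding chunk. The name `rough_segments` and the sample sentence are made up for illustration.

```python
import re
from typing import List

# One or more of the listed punctuation marks, plus trailing whitespace.
BOUNDARY = re.compile(r"[.:;—?!…]+\s*")


def rough_segments(text: str) -> List[str]:
    """Cut text after each run of major punctuation."""
    segments = []
    start = 0
    for match in BOUNDARY.finditer(text):
        segments.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        segments.append(text[start:])  # leftover text with no final mark
    return segments


# Illustrative sample sentence.
chunks = rough_segments("Evaṁ me sutaṁ—ekaṁ samayaṁ bhagavā sāvatthiyaṁ viharati. ")
```

A mechanical split like this can only be a starting point; as noted above, the boundaries were then corrected by hand in thousands of places.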