List of concatenated texts for debaking

sujato · June 20, 2019, 10:50pm

Concatenated texts are those where more than one “sutta” is contained within a single file. For this list, I only include those texts that may be meaningfully split into the respective suttas.

If we are to adopt the principle of not baking presentation assumptions into text data, then these are candidates for debaking. The segments of each of these need to be renamed from the range ID to the individual sutta ID.

Note that the distinction between debakable and not debakable texts usually follows explicit indications in the texts. Often these suttas are indicated in the MS edition with a bracketed number (1) at the end of each divisible sutta. In other cases, the different suttas are indicated by a repetition of the setting, or a final Pali number (pathamam). On the other hand, Peyyala suttas are often indicated with an explicit remark that the text should be expanded. They are clearly intended to be a “range” and thus are not debaked.
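As a rough illustration of the renaming step, here is a minimal Python sketch. It assumes the sutta boundaries have already been identified by hand from the indications described above and are passed in as `starts`. The function name `debake`, the flat per-sutta counter used for the new segment numbers, and the placeholder strings are all hypothetical, not the scheme actually used in the data.

```python
def debake(segments, starts):
    """Rename range-based segment IDs to per-sutta segment IDs.

    segments: dict of old segment ID -> text, in reading order.
    starts:   dict of old segment ID -> ID of the sutta that begins at
              that segment (hypothetical, hand-made editorial input).
    """
    renamed = {}
    current = None  # sutta ID currently being filled
    counter = 0     # segment counter, restarted for each sutta (an assumption)
    for old_id, text in segments.items():
        if old_id in starts:
            current = starts[old_id]
            counter = 0
        if current is None:
            raise ValueError(f"{old_id} comes before the first sutta boundary")
        counter += 1
        renamed[f"{current}:{counter}"] = text
    return renamed


# Placeholder strings stand in for the real text.
segments = {
    "sn33.11-15:1.1": "first placeholder",
    "sn33.11-15:1.2": "second placeholder",
    "sn33.11-15:2.1": "third placeholder",
}
starts = {"sn33.11-15:1.1": "sn33.11", "sn33.11-15:2.1": "sn33.12"}
print(debake(segments, starts))
```

The boundaries are kept as explicit input because, as noted, they depend on editorial markers in the text rather than on anything that can be derived from the IDs alone.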

Anguttara Ones and Twos.
an11.22-29
an3.156-162
an3.163-182
sn12.83-92
sn12.93-213
sn23.23-33
sn23.35-45
sn33.11-15
sn33.16-20
sn33.51-54
sn35.33-42
sn35.43-51
sn43.14-43
sn45.104-108
sn45.110-114
sn45.116-120
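For anyone scripting against this list, the range IDs above can be expanded into the individual sutta IDs they cover. This is only a sketch: `expand_range_id` is a hypothetical helper, and the pattern assumes every range ID has the shape seen in the list (prefix, dot, first number, hyphen, last number).

```python
import re
from typing import List

# Matches IDs such as "sn12.83-92": collection prefix, first number, last number.
RANGE_ID = re.compile(r"^([a-z]+\d+)\.(\d+)-(\d+)$")


def expand_range_id(range_id: str) -> List[str]:
    """Return the individual sutta IDs covered by a range ID."""
    match = RANGE_ID.match(range_id)
    if match is None:
        raise ValueError(f"not a range ID: {range_id!r}")
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    if first > last:
        raise ValueError(f"range runs backwards: {range_id!r}")
    return [f"{prefix}.{n}" for n in range(first, last + 1)]


print(expand_range_id("sn33.11-15"))
# ['sn33.11', 'sn33.12', 'sn33.13', 'sn33.14', 'sn33.15']
```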

Also some segment corrections:

"an11.502-981:5.5": ".",

"sn45.50-54:1.4": "yadidaṃ—",
"sn45.50-54:1.5": "chandasampadā … pe …",

Aminah · June 24, 2019, 2:27pm

And going in the other direction, should sn24.38 to sn24.44 actually be baked?

sujato · July 10, 2019, 3:04am

No, they are unbaked in BB, so we follow him.

Vimala · September 11, 2019, 7:58am

Note that doing this might have an effect on the parallels that refer to certain parts of these suttas. So that is something to keep in mind when you do this: check the parallels afterwards.

sujato · September 11, 2019, 8:42am

Right, yes, we shall have to take care of this.

sujato · September 13, 2019, 5:30am

Note, debaking is done.

Aminah · September 13, 2019, 7:58am

Thanks.
Do you have a handy, easy to read final list?
As it happens, just yesterday I was referring to this list when coding up some texts and thought it curious that sn12.83-92 was on the list but sn12.93-213 wasn’t, but lo, they’re both included in the update.

sujato · September 13, 2019, 8:07am

I honestly have no idea how that happened. I was just using the list above. Must have stumbled on the extra text at some point! I’ll add it above, so both of these comments will look weird!

But apart from that, it is the same as the list. The only variation is that not all of the Ones and Twos needed debaking, a few were genuinely combined texts so were left as-is.

Aminah · September 13, 2019, 10:36am

For the sake of clarity, as I’m trying to follow this for legacy coding, can I confirm that (going by the commit comments on GH) the only ones that remain as “single vaggasutta suttas” are: an1.378-393 and an2.230-279?

sujato · September 14, 2019, 12:43am

Yes, that’s correct.

Vimala · September 14, 2019, 6:09am

Please note that this will also affect the menu structure as well as the parallels.
So what needs to be changed are:

sc-data/an.json at master · suttacentral/sc-data · GitHub
sc-data/sn.json at master · suttacentral/sc-data · GitHub
sc-data/sutta.json at master · suttacentral/sc-data · GitHub
sc-data/parallels.json at master · suttacentral/sc-data · GitHub
All AN and SN files in all languages in the html texts, as Aminah already pointed out

Otherwise the files will simply not show up in the menu and suttaplex cards and parallels will not show.
You might want to make an issue on ZenHub for this.

Besides this, the Buddhanexus data need changing, but we were already discussing pulling certain things together because of the large amount of repetition, for instance in the Samyuttas.

With regard to parallels, I wonder how this affects the paragraph numbers. Most of the time, paragraph numbers start at sc1 for each file (these were created automatically) and the parallels reflect this. I guess now all the parallels (95 for AN and 164 for SN) will have to be gone over by hand.

sujato · September 14, 2019, 7:59am

Only the segment IDs are changed, so no, I don’t think any of this will be affected.

Vimala · September 14, 2019, 9:45am

Ta … thanks for checking. My bad … I thought it was also the filenames. Will just check the parallels after it has been implemented just in case.

sujato · September 14, 2019, 8:42pm

The parallels are based on the actual text number (e.g. an1.1) rather than the range, which was what the old (pre-debaking) segment IDs had (e.g. an1.1-10:1). So the new system is in fact closer to what the parallels have. The only issue might be, not in the data, but in the processing. So once the new changes are pulled into the system, we can see if they throw any errors.
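To make the point concrete, here is a small Python sketch comparing a parallel reference with the text part of a segment ID. The helper names `split_segment_id` and `text_covers` are hypothetical, and the new-style ID `an1.1:1` is an assumed example of what a debaked segment ID might look like; none of this is the actual processing code.

```python
def split_segment_id(segment_id):
    """Split 'an1.1-10:1' into the text ID 'an1.1-10' and the segment number '1'."""
    uid, _, number = segment_id.partition(":")
    return uid, number


def text_covers(uid, sutta_id):
    """True if uid is sutta_id itself or a range ID that includes it."""
    if uid == sutta_id:
        return True
    prefix, _, tail = uid.rpartition(".")
    sutta_prefix, _, sutta_number = sutta_id.rpartition(".")
    if prefix != sutta_prefix or "-" not in tail or not sutta_number.isdigit():
        return False
    first, _, last = tail.partition("-")
    if not (first.isdigit() and last.isdigit()):
        return False
    return int(first) <= int(sutta_number) <= int(last)


parallel_ref = "an1.1"

old_uid, _ = split_segment_id("an1.1-10:1")  # range-based ID
print(old_uid == parallel_ref)               # False: needs a range lookup
print(text_covers(old_uid, parallel_ref))    # True

new_uid, _ = split_segment_id("an1.1:1")     # assumed debaked form
print(new_uid == parallel_ref)               # True: direct match
```

With range-based IDs a parallel has to be resolved through a range check, while with per-sutta IDs a plain equality test is enough.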