On segmenting SA

For @llt and @blake

We’ve mentioned the problems with segmenting the Chinese texts based on punctuation, and that there is a version with superior punctuation available on http://www.mahabodhi.org/

I’ve needed to check a text for my translations, and it is clear that the revised version does indeed have better punctuation. Not only does this make the ideas clearer, it obviously improves the segmentation for us.

A brief perusal of the site shows that the texts are numbered and have a class=“sutra”, so it shouldn’t be difficult to extract them. If it’s possible to meld the two versions that would be useful.

A separate issue is the sequence, but we can leave that for another day.

Where is the superior version on that site?


@llt will be more helpful, but you can see an example here


The actual sutras are the numbered portions, appearing blue, and marked class=“sutra”. The rest is commentary, etc. I’m not sure how to find an index of all sutras. Perhaps we can just suck out everything marked “sutra” and save them in individual files named after the relevant number …
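As a rough sketch of the "suck out everything marked sutra" idea, something like the following could work using only the Python standard library. The class name `SutraExtractor` and the function `extract_sutras` are made up for illustration, and I'm assuming the pages are reasonably well-formed HTML where the sutra text sits inside a single element carrying `class="sutra"`; the real markup may well differ.

```python
from html.parser import HTMLParser


class SutraExtractor(HTMLParser):
    """Collect the text of every element whose class list includes 'sutra'."""

    # Void elements never get a closing tag, so they must not affect nesting.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the current sutra element
        self.buffer = []    # text pieces of the sutra being read
        self.sutras = []    # finished sutra texts, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested element inside a sutra
        elif "sutra" in classes:
            self.depth = 1           # a new sutra element starts here
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:          # the sutra element has just closed
            self.sutras.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_sutras(html):
    """Return the text of each class="sutra" element found in the page."""
    parser = SutraExtractor()
    parser.feed(html)
    parser.close()
    return parser.sutras
```

Each returned string could then be written to its own file once the leading number has been read off. Pages with unclosed tags inside the sutra blocks would confuse the depth counter, so that would need checking against the real site.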

I think those pages are for some classes being taught on the āgamas, and just have a selection of texts. The full collection is here in three volumes:

However, I think I found a somewhat better source than the other site:

The numbering is vastly different than what is used in the Taishō edition, in part because whenever there are some notes about how the sūtra is repeated for this list or that list, each of those is counted as a sūtra. As a result, hundreds of new sūtras are counted that have basically no text associated with them. For example, the first real page of the first volume starts here:


一(5); 一(6)( 一) (7)如是我聞:一時,佛住舍衛國祇樹給孤獨園。爾時、世尊告諸比丘:「當觀色無常,如是 觀者,則為正見(8);正見者則生厭離,厭離者喜貪盡,喜貪盡者說心解脫。如是觀受、想、行、 識無常,如是觀者,則為正見;正見者則生厭離,厭離者喜貪盡,喜貪盡者說心解脫。如是比丘 [P3] !心解脫者,若欲自證,則能自證:我生已盡,梵行已立,所作已作,自知不受後有」。

二──四(9); 二──四( )

五(10); 五( 二)
如是我聞:一時,佛住舍衛國祇樹給孤獨園。爾時、世尊告諸比丘:「於色當正思惟,觀色 無常如實知。所以者何?比丘於色正思惟,觀色無常如實知者,於色欲貪斷,欲貪斷者說心解脫 。如是受……。想……。行……。識,當正思惟,觀識無常如實知。所以者何?於識正思惟,觀 識無常者,則於識欲貪斷,欲貪斷者說心解脫。如是心解脫者,若欲自證,則能自證:我生已盡 ,梵行已立,所作已作,自知不受後有」。(11)時諸比丘聞佛所說,歡喜奉行。

In general, sūtras that are not just expansions of previous texts will start with “如是我聞:” (Thus have I heard), and they usually also end with “歡喜奉行。” (and they joyfully practiced in accordance). Beyond that, the numbering is what tells us exactly which sūtra in the collection it is.
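A minimal sketch of how those two markers could be checked when sorting blocks into full sūtras and expansion notes. The function name `classify_block` is hypothetical, and since the closing formula is only "usually" present, this is a heuristic, not a reliable test.

```python
OPENING = "如是我聞"      # "Thus have I heard"
CLOSING = "歡喜奉行。"    # "...and they joyfully practiced in accordance"


def classify_block(text):
    """Report which of the two stock formulas a block of text contains."""
    body = text.strip()
    return {
        # The opening may be preceded by header numbers, so search the body
        # instead of requiring it at position 0.
        "has_opening": OPENING in body,
        "has_closing": body.endswith(CLOSING),
    }
```

A block with neither marker is probably one of the "repeat for this list" notes, while a block with the opening but no closing (like the first sūtra quoted above) would still count as a real text.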

In the above example, the first block of text is sūtra #1. The next bit, which tells how it is expanded for other items, is counted as #2-4. Then the third block of text is #5. In parentheses, we see that sūtra #1 was also #1 in the Taishō. But for #2-4, there is nothing in parens (not in Taishō). For sūtra #5, we see in parens that it is #2 in the Taishō.

As for how to scrape out the data, notice that in this second source, the sūtra headers are marked (經):

二──四(9); 二──四()

Then before each important block of text, there seems to be a span element marking it as a section (段):

Also, I want to mention that there are three sets of numbers. Some of these go very high, though:



Another example:

一〇六九七──一〇八六四; 一二五七四──一二七四一()

10697~10864; 12574~12741 ( )


“As with the twenty-four sūtras on severance…” and then they are said for all these other items too, effectively multiplying them. There are over 10,000 sūtras just in that one varga, for example, while only 2000 or so preceded them.

There are three sets of numbers in these headers. If I understand them correctly, they are:

  1. The sūtra number within the varga
  2. The sūtra number within the entire corrected SA
  3. The sūtra number within the entire Taishō SA (in parens)
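Here is a rough sketch of how the three numbers might be pulled out of a header line. It rests on a few assumptions drawn only from the examples quoted in this thread: the numbers are written as positional Chinese digits (so 一〇六九七 is 10697), a range is joined with ──, an Arabic number in parentheses right after the first figure is a footnote marker and can be skipped, and the final parentheses hold the Taishō number or are empty. The names `parse_header`, `cn_to_int` and `to_range` are made up. If the source writes some numbers with 十 or 百, this would need extending.

```python
import re

DIGITS = {ch: i for i, ch in enumerate("〇一二三四五六七八九")}
NUM = "[〇一二三四五六七八九]+"
RANGE = rf"({NUM})(?:─+({NUM}))?"      # a single number or a start──end range
HEADER = re.compile(
    rf"\s*{RANGE}\s*(?:[((]\d+[))])?\s*[;;]\s*{RANGE}\s*[((]\s*(?:{RANGE})?\s*[))]"
)


def cn_to_int(digits):
    """Convert positional Chinese digits to an int, e.g. '一〇六九七' -> 10697."""
    value = 0
    for ch in digits:
        value = value * 10 + DIGITS[ch]
    return value


def to_range(first, last):
    """Return (start, end) as ints, or None when the number is absent."""
    if first is None:
        return None
    start = cn_to_int(first)
    return (start, cn_to_int(last) if last else start)


def parse_header(header):
    """Split a header into its local, corrected-SA and Taishō number ranges."""
    m = HEADER.match(header)
    if m is None:
        raise ValueError(f"unrecognised header: {header!r}")
    g = m.groups()
    return {
        "local": to_range(g[0], g[1]),       # first number in the header
        "corrected": to_range(g[2], g[3]),   # number within the corrected SA
        "taisho": to_range(g[4], g[5]),      # None when the parens are empty
    }
```

For example, `parse_header("五(10); 五( 二)")` gives 5 for the first two numbers and 2 for the Taishō, and a header with empty parentheses comes back with `"taisho": None`.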

Well, this is all getting very complicated, isn’t it. I can see the reason for the changes, but it does make things difficult.

The good part is that the two numbering systems are both present, so they can be extracted as data. Yay!

I suppose it would be possible to parse the texts so that the suttas were broken only on the Taisho numbers, to keep the sutta count the same as we have it currently.

Just to clarify, this means that there is no sutra number for this in Taisho. The text is actually present, it is just a kind of addendum to the previous sutra. It is, however, punctuated quite differently.
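To illustrate breaking only on the Taishō numbers, here is a small sketch. It assumes the extracted text is already a list of `(taisho_number, text)` pairs in reading order, with `None` where the header had empty parentheses; the function name `merge_on_taisho` and that input shape are my own invention, not anything in the source files.

```python
def merge_on_taisho(entries):
    """Fold entries without a Taishō number into the preceding sutta.

    entries: list of (taisho_number_or_None, text) pairs in reading order.
    """
    merged = []
    for taisho, text in entries:
        if taisho is None and merged:
            # No Taishō number: treat the block as an addendum to the
            # previous sutta instead of a sutta in its own right.
            prev_number, prev_text = merged[-1]
            merged[-1] = (prev_number, prev_text + "\n" + text)
        else:
            merged.append((taisho, text))
    return merged
```

This keeps the sutta count the same as the Taishō while still carrying the revised punctuation of the addendum text. An unnumbered block at the very start, with nothing before it, is left as it is.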

Okay, let me make a suggestion for @blake. At the moment, we’re just playing around. Rather than trying to figure out what’s best, why don’t we put up both forms. Pull a few suttas from both sources, segment them and upload them to pootle. Then @llt can play with them for a few days. It shouldn’t take long to figure out if there is any real advantage to using one source or the other. Once we have a verdict on this, we can think about how to do it. Sound reasonable?

Right, but I guess it depends on what you consider to be a text. My impression is that these were just instructions. For example, “repeat these 24 sūtras for this list of 10 items,” could be expanded to 240 texts, but the sentence itself is not 240 texts.

On the Wikipedia page for the SA, they actually have two sequences. One is the Taishō numbering (on the right), still with the text on King Asoka mistaken for three SA sūtras. The numbering goes up to 1362.

The other is an alternate numbering for the text in the proper order, but following the sūtra divisions as before. It simply removes the Asoka text, and marks those fascicles as “King Asoka Biography” (阿育王傳). Since the extraneous text has been removed, the numbering goes up to 1359 instead of 1362.

I’ve added a sa project to pootle.suttacentral.net. I stripped out stuff fairly aggressively: along with the paragraph numbers, I stripped out the left side of the mirror-headings and the .suttainfo paragraph. Please take a look.

At the moment I’ve only added the texts from suttacentral.net; I haven’t looked into the alternative sources.

@llt feel free to make an account on pootle.suttacentral.net, I can then give you the proper privileges.


Okay, I’ve done a few rough translations of texts on Pootle. It works very well and because there is a lot of repetition in the SA, it is definitely useful. The fuzzy matching is nice, and even when the suggestion is not exactly what I need, the suggestions often remind me of how I translated something similar earlier in the text, or in another text.

The interface is also nice, and because everything works like a queue, the whole process seems pretty straightforward and linear. The only thing I found was that without seeing the final text, it’s hard to get a good handle on how the finished translation reads. I didn’t put much effort into actually formatting and finishing the texts I did, because I assume this is mostly just for trying out.

For that matter, I’m not sure exactly what the best way is to format abbreviations and lists of items in these texts. Also, I’ve been numbering some important lists of items like [1] form, [2] sensation, etc., and I find this can be useful for reading sometimes (as long as it’s not overdone). I don’t know how that would relate to doing translation with something like Pootle, though. Maybe that would make matching a bit more difficult.

In any case, segmenting the SA and using Pootle this way definitely works. I don’t think there are any major problems with the approach at all.

Well, that is excellent. With better segmentation and some extras like the Chinese lookup, it should be even better.

This is true, it’s important to keep the flow in mind. At the end of the day, the important thing is that the translation reads well. Frequently you have to bend the sequence and so on. One trick I use is to export to plain text, then have my computer read it to me!

Can you give me some examples?

We do have a means of exporting as a list. You can see details for this, and lots of other stuff, here: http://pootle.suttacentral.net/pages/guidelines/ This is a page I’m developing as I go on using pootle. Basically if you want to make a list, you can simply insert a tilde before ~each and ~every ~single item you want listified. This will export as an HTML <ol>.

I use this in very limited cases, essentially for cakkas of cases such as “A not B, B not A, A and B, neither A nor B”.
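Just to make the tilde convention concrete, here is a toy sketch of what such an export step might look like. This is not the actual exporter: the function name `tilde_list_to_html` is made up, and I'm assuming that any text before the first tilde becomes an ordinary paragraph and each tilde-prefixed chunk becomes one list item.

```python
from html import escape


def tilde_list_to_html(segment):
    """Turn '~item ~item' markup into an HTML <ol> (illustrative only)."""
    lead, *items = segment.split("~")
    parts = []
    if lead.strip():
        # Text before the first tilde is kept as a plain paragraph.
        parts.append(f"<p>{escape(lead.strip())}</p>")
    if items:
        list_items = "".join(f"<li>{escape(item.strip())}</li>" for item in items)
        parts.append(f"<ol>{list_items}</ol>")
    return "".join(parts)
```

Note that punctuation between items (the commas in a cakka, say) stays attached to each item here; whether the real export trims it is something to check in the guidelines.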

Personally I prefer to leave such things out, but if you want to include, no problems, what you’ve done should be fine. There’s no special markup for such inline lists, and there probably shouldn’t be, otherwise you end up creating block-level elements in HTML and having to fix it in CSS and it’s all too much. As long as the numbering is consistent we can do what we want later on.

Oh, I mean the way the Saṃyukta sometimes just lists items rather than fully expanding them, like it will say something for form, and then just list all the other skandhas that it also applies to. But sometimes the last one is expanded as well, which is not common in Mahāyāna texts, at least. Sometimes it seems like just commas are necessary, sometimes semicolons, sometimes ellipses… So I have to come up with a consistent style.

An example of abbreviating the items in the middle of the list:

That is to say, [1] dwelling in mindfulness of the body, observing the inner body: ardent, aware and mindful, setting aside worldly sorrow and distress; and of the outer body; and of dwelling observing both inner and outer body: ardent, mindful and aware, setting aside worldly sorrow and distress. It is also such as this [2] for sensations; [3] for the mind; [4] and for dharmas, dwelling in the mindfulness of observing inner dharmas, outer dharmas, and both inner and outer dharmas: ardent, mindful and aware, setting aside worldly sorrow and distress.

Normally modern English writing doesn’t do things like this, so it seems kind of awkward.

Okay, I always left that sort of thing out of my translations before, but after I translated SA 379, some people complained that it was just “saying the same thing over and over.” With numbering, though, it’s clear that there are three turnings and twelve motions. I use it a bit like that if I think it may clarify the text for readers.

Otherwise I try to keep the text as “clean” as possible.

How does this look for an extracted version of the revised SA?

Okay, yes, well I guess there are no hard and fast rules; we just do what seems best.

Looks great. So here we have the different versions, all nicely mapped against each other.

First up, you weren’t kidding when you said that the YS text counts everything: 13,402 sutras. :mindblown:

The YS_VARGA_SN number is, what, the sutra number within the varga? If so, don’t we need the varga number as well?

Can you briefly describe the difference in sequence between the three editions? Are the Fo Guang and YS sequences the same?

Yeah, it’s a pretty extreme numbering system. Somewhere in volume 3, it jumps from around 2000 to around 13,000 mostly due to instructions about different ways to recite the sūtras.

Yes, this is the Yin Shun sūtra number within the varga. It is just extracted from the book as published on their website. Varga numbers could be added as well, but I didn’t know if the Yin Shun sūtra number within the varga was actually something that would be used or useful.

Okay, it makes more sense to look back in terms of the Taishō edition. The Taishō version has a few fascicles out of place, and two fascicles missing. The sūtras are numbered 1-1362 based on their position within the received (faulty) text. The Asoka text is mistakenly counted as 3 sūtras.

In 1983, after a lot of this research was done to rearrange the text back to its proper order, a new edition of the Chinese Buddhist Canon was printed called the Foguang Tripiṭaka. This text has the SA in its correct order, and the numbering of the texts is similar to that of the Taishō, but it numbers them according to their position in what would have been the original text. The numbering goes 1-1359. The original text, however, had the three aṅgas somewhat interspersed with each other, and some saṃyuktas were even split up as a result. You can see the layout on the Wikipedia page.

Also in 1983, Yin Shun published his set of three volumes of the corrected SA, which is what we have extracted the sūtras from. His own ordering is different and “cleaner.” He clearly divides the text into its three aṅgas, and then puts the vargas, saṃyuktas, and sūtras in order within them. This means the aṅgas, vargas, and saṃyuktas are not interspersed at all. His numbering system follows this “cleaned up” ordering of the text, and goes from 1-13,412.

I extracted the Yin Shun numbering and Taishō sūtra numbers from the book, and then used a lookup table to calculate the Foguang sūtra numbers. These are the three numbering schemes that seem really useful and relevant.

As for which to actually use, in my opinion, the Taishō numbering is still the standard. After 33 years, the other numbering schemes have never caught on. Basically every reference to SA sūtras published by scholars uses the Taishō numbers. However, the Yin Shun ordering of sūtras logically by aṅga, varga, and saṃyukta, seems the most sensible for a modern canon (especially in light of the importance of the three aṅgas in the early history of Buddhism).

In brief:

  1. Taishō: 1-1362, based on the mistaken order after the filing error.
  2. Foguang: 1-1359, after correcting the order and removing Asoka.
  3. Yin Shun: 1-13,412, likely the most original, based on the three aṅgas.

It would be useful, I think. This could be the basis for one display of the text. We display SN per samyutta, as there are too many texts to put on one page (especially for mobiles, etc.) But for SA due to the lack of clear structure in the Taisho we just arbitrarily divided it into groups of 100 sutras, which is far from ideal. A semantic structure based on samyuktas would be ideal.

Thanks, this is very useful. I’m still not sure what exactly the basis for Yin Shun’s additional suggestions is. Did he do this on purely logical grounds, or is there evidence in the text? And, leaving aside the purely incidental details of the numbering, how different is his structure to the FGS edition?

Given the significance of this for early Buddhism, I could see an argument for including all versions, or maybe just the Taisho and the YS version. Obviously we will always include the Taisho version and numbers, but it may be possible to develop different views so a reader could examine the text in the “corrected” sequence. Again, one of the advantages of the segmented approach is that we could automatically apply these different views to translations in any language.

Perhaps we could deploy the corrected semantic structure of SA as the default version, and give the user the option of selecting the current Taisho view for the purposes of reference.

I think dividing the text into three angas was based on commentary about the structure in the Yogacarabhumi. Actually, though, I am now finding that Yin Shun believed that the vyakarana materials were subordinate to big sections of the sutra anga. Looking at the corrected Gunabhadra text, as it appears in the Foguang, the angas appear like:

  1. sutra (五陰誦)
  2. vyakarana (佛所說誦)
  3. sutra (六入處誦 and 雜因誦)
  4. vyakarana (弟子所說誦 and 佛所說誦)
  5. sutra (道品誦)
  6. vyakarana (佛所說誦)
  7. geya (八眾誦)

This would have been the structure of Gunabhadra’s own text. It gets a bit messy, though, and it looks in some ways like there was another format that was also prevalent. When Yin Shun reorganized the SA and put it alongside commentary, he organized everything according to anga, varga, and samyukta, as:

  1. sutra
  2. geya
  3. vyakarana

I’m not sure what specifically caused him to do so, except that it may have been influenced by the Yogacarabhumi, and the idea that the sutra anga was the first to appear historically.

Providing different “views” of the text should be pretty straightforward, because a table of contents is basically just a bunch of hyperlinks, and they can be presented in any order.
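A small sketch of that idea: keep one record per text with its number in each scheme, and build each table of contents by sorting on whichever scheme the reader picks. The record fields, the `uid` values, the URL shape and the numbers below are all placeholders for illustration, not real mappings between the editions.

```python
# Placeholder records: the uids and numbers are invented for this example.
RECORDS = [
    {"uid": "text-a", "taisho": 3, "foguang": 1, "yinshun": 2},
    {"uid": "text-b", "taisho": 1, "foguang": 2, "yinshun": 3},
    {"uid": "text-c", "taisho": 2, "foguang": 3, "yinshun": 1},
]


def toc(records, scheme):
    """Return one hyperlink per text, ordered by the chosen numbering scheme."""
    ordered = sorted(records, key=lambda r: r[scheme])
    return [f'<a href="/{r["uid"]}">{scheme} {r[scheme]}</a>' for r in ordered]
```

Calling `toc(RECORDS, "taisho")` and `toc(RECORDS, "yinshun")` yields the same three links in different orders, while the link targets stay tied to a single ID per text.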

Also, I found there was a mistake in my file before. There is actually no number within the varga. It is instead the number within the samyukta. I’ve uploaded a new file and also numbered the samyuktas themselves. Having the samyukta information should be a lot more useful.

Okay, sounds good, thanks.

The tricky part, I imagine, will be because our whole system is built on the idea of unique IDs, which are found in the file names for the sutras. But anyway, it should be doable.

There are so many possible directions to go in with this collection.

The problem is maybe just that the current organization presumes a single “correct” version of each collection. In that case, one route might be to simply replace the Taisho text with the revised and edited version, but still maintain the Taisho numbering and sutra boundaries. That’s just one possibility, though.

I was looking for the version of the SA as it is in the Foguang Tripitaka 佛光大藏經, to see what was in it. The canon appears to have been scanned and released as the 佛光電子大藏經, but basically behind glass (Flash), like the Tripitaka Koreana. I don’t see any digitized text like HTML.


The organization is of course similar to the received Taisho version, but the punctuation has all been revised.


Would this have any advantages over the YS edition? The sense I’m getting is that that’s the way to go for an updated digital text.

In terms of organization, the 1359 sūtras is obviously a more manageable organizational scheme than the 13,414 of YS.

In terms of content, they both add and “fix” parts of the sutra text. The difference is that in the YS edition, he clearly marks all such additions with parentheses. In the Foguang versions, they just made the changes.

In YS:


He added “desire” to mean “separation from desire.”


Here “if there is severance” was missing, and he added it in.

The punctuation in the Foguang version seems a bit nicer than in the YS edition, though. Notice in the above example, YS has 生老病死 (birth, old age, sickness, death) without any commas between those items. It would be nicer as 生、老、病、死, which is what we see in the Foguang edition. It’s a minor complaint, but just something I noticed.