SuttaCentral

Requirements for representing suttas in JSON


#1

To help software process suttas, it is convenient to have a single internal representation for each sutta. JSON has become a common data format and is distinguished by being THE data exchange format for Javascript, which is understood by all browsers. Notably, a JSON document can be parsed directly with a language primitive: JSON.parse(document). To understand how we might represent a sutta in JSON, we should first examine requirements.

Here are some requirements to consider for representing suttas in JSON. Most of the following should be obvious, but there may be some nuances for discussion:

  • unique JSON document for each Pali sutta: we must have one and only one canonical JSON document for each Pali sutta.
  • unique JSON document for each sutta translation: we must have at most one JSON document for each sutta translation. I.e., it’s fine to include translations with the Pali original JSON.
  • segment id lookup: we must be able to get from a text segment id to the corresponding text segment with ease. Note: a JSON object keyed by segment id would do this.
  • immutable segment ids: segment ids are used for translation references and must therefore be global and permanent. Note: we should probably just leave the current segment ids as they are, inconsistencies and all.
  • segment sequence: must be derivable from the JSON. Note: if we use segment ids as JSON object keys, the segment ids must be sortable, since there is no guarantee of key order in JSON objects. Given the inconsistencies in segment ids today, I’m not sure we can guarantee that sorting order matches sutta order.
  • diff-able: we must be able to compare different revisions of a sutta or of a translation. Note: existing JSON diff utilities already do this.
  • presentation support at sutta granularity: it must be possible to generate HTML for the entire sutta from the JSON document. I.e., let’s not have a separate document for presentation.
  • presentation support at segment granularity: it must be possible to generate HTML for a single text segment from the JSON document. Note: this is probably trivially achieved by wrapping the segment in <div class="myfavoritesegmentcss">.
  • support for Pali canon line groups: the Pali canon groups lines together, and so should we, for Dhamma transmission fidelity.
  • presentation support at Pali canon line group granularity: it must be possible to generate HTML for a Pali segment group from the JSON document. Note: this is probably trivially achieved by wrapping the segments in <div class="myfavoritelinegroupcss">.
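
To make the lookup and segment-granularity requirements concrete, here is a minimal sketch in Javascript. The ids are MN 1 segment ids, but the text and the CSS class are purely illustrative:

```javascript
// Hypothetical segment-id-keyed sutta document (text abbreviated).
const sutta = {
  'mn1:1.1': 'Evaṁ me sutaṁ—',
  'mn1:1.2': 'ekaṁ samayaṁ bhagavā ukkaṭṭhāyaṁ viharati.'
};

// Segment id lookup is a direct key access.
const text = sutta['mn1:1.1'];

// Segment-granularity presentation: wrap one segment in a div.
const segmentHtml = id => `<div class="segment">${sutta[id]}</div>`;
```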

#2

We can absolutely 100% guarantee this. The ids are in fact programmatically generated. They can’t be sorted with a trivial sort key like an ASCII sort; instead they are sorted by generating tuples of integers, splitting on hyphens and periods. To take the example from the github issue:

id            sort key
mn1:27.1      (27, 1)
mn1:28-49.1   (28, 49, 1)
mn1:28-49.22  (28, 49, 22)
mn1:28-49.23  (28, 49, 23)
mn1:50.1      (50, 1)

Those tuples are then sorted using Python’s standard sort algorithm for tuples, and this produces the desired order; in Python it’s just about a two-liner. We’ve been using this sort algorithm for quite a while to put files in proper order, because the Mahasangiti original came in the form of a lot of HTML files whose filenames use an insane number of periods and ranges, such as 2.12.2.4-9 Saṃyojanagocchakādi.html, and I’d always sort the files before processing them. It so happens the MS files also contain strictly ascending paragraph ids, so it could trivially be verified that sorting by filename yielded the correct order despite all the wacky uses of periods and hyphens.
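
The same approach translates directly to Javascript. This is only a sketch, not the actual SC code: split the part after the colon on hyphens and periods, map to integers, and compare the resulting tuples lexicographically:

```javascript
// Derive a numeric sort key from a segment id such as "mn1:28-49.22".
// The part after the colon is split on hyphens and periods into integers.
function sortKey(id) {
  return id.split(':')[1].split(/[-.]/).map(Number);
}

// Lexicographic comparison of integer tuples, like Python's tuple sort.
function compareIds(a, b) {
  const ka = sortKey(a), kb = sortKey(b);
  for (let i = 0; i < Math.max(ka.length, kb.length); i++) {
    const x = ka[i] ?? -1, y = kb[i] ?? -1; // a shorter tuple sorts first
    if (x !== y) return x - y;
  }
  return 0;
}

const ids = ['mn1:50.1', 'mn1:28-49.22', 'mn1:27.1', 'mn1:28-49.1', 'mn1:28-49.23'];
ids.sort(compareIds);
```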


#3

Markup

The Markup represents a conundrum. In the original github issue, something like this is proposed:

  "dn1:1.1.1": "<p><span class=\"evam\">{dn1:1.1.1}</span>",
  "dn1:1.1.2": "{dn1:1.1.2}</p>"

There are two problems, though. First, these are not valid HTML snippets: in this case you can only get a block of valid HTML by concatenating the two strings, and in the general case you could only get a block of valid HTML by concatenating ALL the strings. One question: what is the purpose of breaking the markup into smaller units if those units are not individually valid?

The second problem is one of ownership. The first string defines “open paragraph” and the second defines “close paragraph”, but the “close paragraph” doesn’t really “belong” to the second string alone: it closes the paragraph opened before the first string. The paragraph owns the two strings.

At the moment we just use blocks of HTML. Taking a random example and cleaning it up a little:

<section class="sutta" id="sn1.6">
  <article>
    <div class="hgroup">
      <p class="division">{sn1.6:0.1}</p>
      <p>{sn1.6:0.2}</p>
      <h1>{sn1.6:0.3}</h1>
    </div>
    <p>{sn1.6:1.1} {sn1.6:1.2}</p>
    <blockquote class="gatha">
      <p>
        {sn1.6:2.1}<br>
        {sn1.6:2.2}<br>
        {sn1.6:2.3}<br>
        {sn1.6:2.4}
      </p>
      <p>
        {sn1.6:3.1}<br>
        {sn1.6:3.2}<br>
        {sn1.6:3.3}<br>
        {sn1.6:3.4}
      </p>
    </blockquote>
  </article>
</section>

The block is valid HTML. At render time the relevant strings are inserted into the markup.

The first solution to the Markup conundrum is to continue doing exactly this: we just have one big block of markup for each sutta.

The second possibility would be to break it up into smaller chunks, but we would want to guarantee that each chunk is itself legal HTML, so that it’s actually meaningful to render only part of the markup.

For example we could chop off the section and article: that’s easily templated away. Then we could divide it like this:

[
 `<div class="hgroup">
    <p class="division">{sn1.6:0.1}</p>
    <p>{sn1.6:0.2}</p>
    <h1>{sn1.6:0.3}</h1>
  </div>`,
  
 `<p>{sn1.6:1.1} {sn1.6:1.2}</p>`,

 `<blockquote class="gatha">
    <p>
      {sn1.6:2.1}<br>
      {sn1.6:2.2}<br>
      {sn1.6:2.3}<br>
      {sn1.6:2.4}
    </p>
    <p>
      {sn1.6:3.1}<br>
      {sn1.6:3.2}<br>
      {sn1.6:3.3}<br>
      {sn1.6:3.4}
    </p>
  </blockquote>`
]

The blockquote has to remain as one thing.

Other options would essentially involve “hacks” - actually rather literally, hacking apart things in ways that generate illegal fragments.

I would tend to favor the second solution: breaking the document up into the smallest valid units of HTML and putting them in an array. This would have a measure of value for generating rich previews and such. It does mean the markup would be rather different from the other data, although the ownership problem kind of guarantees it has to be handled differently.
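
As a sketch of how the array-of-chunks form could be consumed (the segment text here is illustrative, not the canonical translation), rendering is just placeholder substitution per chunk:

```javascript
// Hypothetical render step: replace {segment-id} placeholders in a markup
// chunk with the corresponding text from a segment map.
const segments = {
  'sn1.6:1.1': 'At Sāvatthī.',
  'sn1.6:1.2': 'Standing to one side, ...'
};
const chunks = ['<p>{sn1.6:1.1} {sn1.6:1.2}</p>'];

const render = chunk =>
  chunk.replace(/\{([^}]+)\}/g, (_, id) => segments[id] ?? `(missing: ${id})`);

const html = chunks.map(render).join('\n');
```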


#4

(That’s an enormous relief. I had this horrific vision of Bhante Sujato laboriously counting and typing!)

Would you help me understand the oddity of MN1:28-49.23 and MN1:50.1? To my mind, looking at the original Pali, they should actually be in the same line group, just above the 26th Taɱ kissa hetu. This break is not in the Pali.

NOTE: With the segment ids being canonically sortable as tuples, I’m totally happy with the proposed use of segment ids as keys in a JSON object. The values could be the exact text, or they could be extended to full JSON objects with multiple properties for text, annotation, etc. Thanks!
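
For instance, the extended form might look like this; the property names are just a guess at what could be useful, not an agreed schema:

```javascript
// Hypothetical extended values: segment id keys map to objects
// rather than bare strings (property names are illustrative).
const sutta = {
  'mn1:1.1': {
    text: 'Evaṁ me sutaṁ—',
    annotation: 'The standard opening of a sutta.'
  }
};
```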


#5

Since Javascript is quite adept at templates, I would propose using a template syntax directly evaluable by Javascript, e.g., ${sn(1.6,2,1)}. The use of dollar sign and curly braces invokes Javascript; in this example, sn is just a function. This would allow the source to be viewed both as a template with the references or as HTML with the references evaluated:

    var snsutta = {
         "sn1.6:2.1": "How many sleep while others wake?",
    };
    var sn = (s,a,b) => {
        var key = `sn${s}:${a}.${b}`; 
        return snsutta[key] || `(not found:${key})`;
    };
    var html = '<html>${sn(1.6,2,1)}</html>';
    console.log(`template: ${html}`);
    console.log("HTML:", eval("`" + html + "`"));

Executable templates can help a bit with the separator issue you mentioned, since a Javascript blockquote(...args) function can do all the separator finagling:

  <html>${blockquote(sn(1.6,2,1),sn(1.6,2,2))}</html>
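
For example, a minimal blockquote() helper (the class name is assumed from the markup example above) might look like this, with the segment references left as literal placeholders here:

```javascript
// Sketch of a blockquote() template helper: it joins verse lines with
// <br> so the template need not manage separators itself.
const blockquote = (...lines) =>
  `<blockquote class="gatha"><p>${lines.join('<br>')}</p></blockquote>`;

const verse = blockquote('{sn1.6:2.1}', '{sn1.6:2.2}');
```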

#6

Thanks for the useful summary, it seems we are moving in the right direction.

For us, a Sutta is an abstract entity that has multiple concrete representations. These may include such things as:

  • Original text, which currently has only one edition, but which may (probably will) have several editions
  • translations (ditto)
  • markup
  • references
  • audio files
  • parallels
  • and so on.

So the abstract entity is a single thing, but it has multiple representations. For the most part, these are accessible to a user in our “suttaplex” cards.

I’m not exactly sure what you mean here: are you saying keep the HTML in the same JSON file as the text? 'Cos that’s not going to happen. But see Blake’s proposal.

Remember, while HTML is dominant ATM, in principle we could support multiple sets of markup. Practically, that means we will probably do LaTeX.

Have a look at how the SC app works under the hood. It’s based on web components, so this is currently achieved with a custom element:

<sc-seg id="mn1:1.3" class="translated-text" lang="en">There the Buddha addressed the mendicants:</sc-seg>

I am not sure exactly what you mean by these. Are you referring to the lines on Obo’s site like this?

Tatra kho Bhagavā bhikkhū āmantesi:|| ||
“Bhikkhavo” ti.|| ||

If so, then these are merely editorial decisions for this edition (Buddha Jayanthi) and are not represented in the SC text. Our granular support is our segments, and we do not propose adding any other granular system.

Well, the idea was that the HTML has no intrinsic relation to the text. It is purely contingent, and is only used to conveniently build a web page, at which time it is in fact valid HTML. But I see your point: it comes down to the basic nature of HTML, which is that it is for “documents”.

I prefer the first example; it seems nice and simple. The second one seems like overkill. Surely in the vast majority of cases we simply render a full text, so why should we take on extra complexity to cater for marginal cases? Currently we either render a full text or a single segment, so I am not sure there is really a role for what is essentially “render a single block-level element”.

Even to take your example of a “rich preview”, I doubt if this would be useful. Consider a case of a verse text. Since they are wrapped in <blockquote> we have to show the whole set of verses, which might easily be several tens of verses. Furthermore, how is the HTML going to actually be relevant in the specialized context? The point of having a verse markup is to distinguish it from prose, but what does that mean in another context, or even, say, in audio?

Surely it would be better to simply use segments for this. For a rich preview, show three segments. Done!

Anyway, it’s up to you, but I am not convinced by that example so far. BTW, please feel free to update my Github issue, it should represent a reasonably current proposal.

It would help if you could link to the exact segment in our text, or show the literal text from the site you’re linking. Otherwise I am not really sure what the problem is here; it looks fine to me.

Remember, all the breaks, segments, paragraphs and so on in modern editions are conventions introduced by modern editors, they do not exist in the manuscripts.

Manuscripts do, it is true, sometimes make use of certain dividers, the danda | and double danda ||. But the relation between the manuscript punctuation and that used in modern editions is, so far as I know, completely unknown. It is safest to simply assume that all punctuation is modern. If you want to see how a Pali manuscript actually looks, check out our bendall-cv project:

I know nothing about JS, but just to ensure we’re on the same page, our app will be transitioning over to LitElement, which instantiates HTML <template> tags using ES2015 tagged template literals. So any templating we do should be compatible with this. But I think you’re already using template literals in your example?

https://polymer.github.io/lit-html/


#7

Bhante, from what Blake said, I believe the existence of || || triggers a new section number in the SC auto-numbering software. Here is the Mulapariyaya I am referencing. It is a single section that ends with a double line break || ||:

sabbaɱ -||
nibbānaɱ nibbānato abhijānāti,||
nibbānaɱ nibbānato abhiññāya nibbānaɱ mā maññi,||
nibbānasmiɱ mā maññi,||
nibbānato mā maññi,||
nibbānaɱ-me ti mā maññi,||
nibbānaɱ mā abhinandi.|| ||

In this Pali line group there is no || || after nibbānaɱ nibbānato abhijānāti. It is only a single line break ||. However, the SC MN 1 sutta numbering does have a section break between MN1:28-49.23 and MN1:50.1:

MN1:28-49.22 all …
MN1:28-49.23 They directly know extinguishment as extinguishment
MN1:50.1 But they shouldn’t conceive extinguishment,

Since the automated SC numbering of MN1 doesn’t match the Pali text in the link above, the SC Pali text used for MN1 numbering was probably slightly different from the Obo Pali text linked above. It’s a minor maddening point, but I had assumed the Pali text to be invariant.

SC MN1 is also inconsistent with itself in that we have a later section with a consecutive numbering:

MN1:54-72.22 all …
MN1:54-72.23 They directly know extinguishment as extinguishment
MN1:54-72.24 But they don’t conceive extinguishment,

The primary concern is validity of reference. We all rely heavily on references for quotes and comparison. If segment ids change, our links break. Indeed, I believe that if we applied the SC auto-segmenting algorithm to the Pali text I linked above, we would end up with different numbering. But I don’t think we should renumber anything.

To maintain link integrity, we should probably NOT change any segment id. They are what they are. In particular, I’d recommend that we simply live with the semantic quirks of inconsistent section id breaks. They also are what they are. These are just like two bad bricks in a beautiful wall.


#8

Ah. OK. I see that LitElement does actually adopt the Javascript ${template} syntax. That’s a relief, since it would clear up some of the hiccups mentioned by Blake.
:pray:


#9

I agree. In fact, thinking further on it, I think it might just be best to store the HTML in .html files. It’s not exactly pleasant editing HTML when it’s stored in JSON strings, because you can’t have literal newlines; this also screws up diffing. Transmitting it as JSON is fine.
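
To illustrate the newline problem: a multi-line HTML chunk survives a JSON round trip, but the serialized form holds it as one long line with \n escapes, which is what hurts hand-editing and line-based diffs:

```javascript
// Multi-line HTML stored in a JSON string: literal newlines must be
// escaped, so the serialized form is a single physical line.
const markup = '<p>\n  {sn1.6:1.1}\n</p>';
const serialized = JSON.stringify({ markup });

// The JSON text itself contains no literal newline characters...
console.log(serialized.includes('\n')); // false
// ...though the value round-trips intact.
console.log(JSON.parse(serialized).markup === markup); // true
```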


#10

Wow! Great to see a pristine canonical Pali source in Github. Since these will no doubt be immune to Western editorial changes, I hope we can use these directly for segment numbering to avoid any future hiccups. Basing segment numbering on the vagaries of Western editions is brittle.


#11

The issue with multiple representations is that they can fracture the value available to a user. For example, suppose the HTML has a wonderful annotation in a .html file that elaborates on a fine semantic point. That annotation would have to be copied and maintained in a LaTeX version. And the annotation would not be available in audio, because the audio doesn’t consult the HTML.

Even something as minor as section headings editorially introduced in the HTML for “Lay person”, “Trainee”, “Arahant” and “Tathagata” would be valuable to audio listeners as navigational aids. I.e., the existence of a heading is semantic, since grouping is in this sense a semantic concern for navigation. I think we all agree that the heading font and color are presentation concerns alone.

For this reason, it is valuable to have each sutta and each translation represented semantically as a single document. For example, at Cengage our source documents were XML, and they were automatically translated into HTML via XSLT stylesheets. We did not have separate HTML files. Indeed, we could have generated LaTeX from the XML files.

For SuttaCentral, JSON is much better than XML. And Javascript is better than XSLT. I would still, however, hesitate to adopt multiple semantic representations independently maintained. The effort to keep them semantically aligned is potentially daunting. I am new to this domain, so this is merely an observation.


#12

I’m glad you made this edit, because otherwise I would have had to write a diatribe about how much better XML is than JSON. Whew! Now I just have professional curiosity: what are the factors that lead you to say that for SC, JSON is much better than XML?


#13

Oh dear. I didn’t mean to cause an uproar. I used XML/XSLT/Java for about 5 years. It was my job as software content architect to recommend, design, and even implement XML-based solutions. I found everything involving XML to be rather massive and monolithic: XML has the nimbleness of a battleship. It requires special editors (e.g., Arbortext) and really complicated transformation software (XSLT). Etc. In contrast, modern web development with JSON/Javascript is light and fast: JSON.stringify(obj) and you’re done. With Java and XML, I couldn’t begin to write the serialization from memory, even though I’ve written the code countless times. And traversing the DOM in Java? Lots of code. In Javascript, not so much, since JSON is the serialization of a Javascript object. Things like this make the XML skillset quite large, making it difficult to hire engineers who want to work with XML/XSLT. Hiring Javascript/JSON engineers? No problem.

Please say more about your passion for XML. I am quite curious. :heart:


#14

Re the greatness of XML, I was (mostly) joking. I do use XML professionally, in a specialized field (knowledge representation) that puts up with the cumbersome aspects of XML in order to take advantage of XML schemas for standardization and validation purposes. We regularly look longingly at the sleekness of JSON, but the development of JSON schemas is still lagging far behind.


#15

There’s no general “autonumbering software”. Let me give you a little background.

We wanted to translate text, so we looked for translation software. We settled on Pootle, which uses segmented texts. We discussed how to segment our Pali texts—the Mahasangiti edition—and settled on the system I mentioned above. The reason we thought this would work is because it is by and large consistently edited. The segmenting was then done in Python by Blake, and adjusted by hand by me. The native numbering of the MS system was adopted—essentially, one number per paragraph, with segments as increments of that—with the exceptions listed previously, which were added by hand and later checked by machine (because there’s no automatic way of knowing where the numbers from another edition should go).

The BJT edition that you link to has nothing to do with this process. If there is any relation between its breaks and ours, it is purely because the breaks do, by and large, occur on genuine semantic breaks in the text itself. Whether either of them has anything to do with an actual manuscript tradition is unknown.

They are not. The text is fairly consistent, but it would be wise to avoid making any assumptions about the punctuation.

You’re really digging into the details here! In this case, compare with Ven Bodhi’s translation, which is the source for these numbers. He gives a much-abbreviated translation of this passage, hence we only have the range (54-72) to work with, and number the segments per range. Another approach would have been to try to isolate each item within the range and number that, but that would be a lot of work, and would likely also result in problematic situations.

Yep, it’s plain ol’ JS inside.

Sounds reasonable.

I’m afraid that’s not gonna work. I haven’t studied the manuscripts in detail, but from what I understand, modern editions—specifically our edition—are far more consistent in applying punctuation than any manuscript. You can see this for yourself by looking at the sanitized version of bendall-cv:

And searching on the page for āroces. This will find all the instances of a stock phrase in Vinaya texts, where the Buddha makes an announcement. There are 19 cases: of these, 13 are followed by ||, 3 by |, and 3 by no punctuation.

Now compare this with the Mahasangiti text of the same passage, which you can find here:

https://raw.githubusercontent.com/sujato/bendall-cv/master/mahasangiti-files/ms-kd14--kd15-segmented.txt

Here there are 25 occurrences of āroces (more than bendall-cv because that text is incomplete). All of these are followed by a full stop, except in cases where there is abbreviation, in which case they have ellipsis.

Thus the modern edition has been corrected and made more consistent as compared to the manuscript. This is only one example, and bendall-cv is a very unusual case, but you get the point.

In the future, we’ll segment the whole of the Pali canon, using the Mahasangiti edition.

That’s the plan.

The whole point of this was to avoid this kind of situation. Notes would not be kept in HTML, but in a separate JSON file, or in many JSON files for many sets of notes. Since every note is identified by the same segment number, it can be applied anywhere: to original texts, translations, or whatever. If you want notes in your audio, great, grab them from the JSON.

Again, that is the point. But there is no single text that is “MN 1”. There are multiple Pali editions, each of which has a different version of MN 1 and must be distinguished by language, text number, and edition. I.e.:

  • MN 1 equals:
    • MN 1 Pali Mahasangiti
    • MN 1 Pali Buddha Jayanthi
    • MN 1 Pali PTS
    • MN 1 English Sujato
    • MN 1 English Bodhi
    • references …
    • parallels ….
    • audio files …
    • ye vā pan’aññā (“or whatever else…”)

Now, currently we only have one Pali text, but that will change. The bendall-cv project is the first step towards working out how to integrate multiple Pali editions. And it is based on the fact that, for all their differences, at the semantic segment level different Pali texts can be aligned. So the idea will be to align each Pali text on the segments as already established via the Mahasangiti.

Have you seen the work done on standoff properties? This was by a group of programmers frustrated with the limitations of XML. Not just the contingent limitations of tooling and complexity, but the deeper problem that any markup system must represent a hierarchy, and in ancient editions we are often faced with multiple overlapping and inconsistent hierarchies. So essentially all markup is separated from text and maintained in JSON files. Standoff properties are awesome, and I believe, the future; but they are very much a work in progress.
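
A toy illustration of the standoff idea, entirely schematic and not any real standoff-properties format: the plain text is never touched, and the markup lives separately as character offsets into it:

```javascript
// Schematic standoff markup: annotations reference character offsets
// into an immutable plain text, instead of being embedded in it.
const text = 'Thus have I heard.';
const standoff = [
  { start: 0, end: 18, type: 'paragraph' }, // the whole sentence
  { start: 0, end: 4, type: 'emphasis' }    // just the opening word
];

// Each layer can be resolved against the text independently.
const spanText = a => text.slice(a.start, a.end);
```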

You can see a brief discussion in our issue here (with the caveat that this issue lags behind the discussion here on Discourse!)


#16

(reads pdf…) :astonished:
Wow. Coming from the pristine world of canonical current documents, I now see the enormous challenge of representing historical documents with their embedded historical edits and variations (!). :scream:

I really like the stable plaintext Unicode manuscript annotated by standoff layer semantic markup. The plaintext itself is stable, and yet each standoff layer can faithfully represent an independent perspective. Hence we have multiple files. It’s all a bit head-exploding. And good.

As a consumer of this information for vision assistance, all I really need is a JSON representation of the “SC accepted” layer as a sequence of text segments. Ideally, a rudimentary paragraph-level grouping of text segments would be welcome for assisted navigation. Currently, a grouping that exists in the source manuscript can be inferred from the text segment id numbering. However, if a translation standoff layer introduces additional hierarchy useful for navigation, we may wish to represent that additional hierarchy or grouping somehow in the JSON.

I’m having fun puzzling out voice assistance, but if you and Blake need assistance with any of this I will happily re-prioritize. Thank you both for all your patient explanation. I certainly have learned a lot today!

:pray:


#17

Well, the segment IDs essentially give you a section, which by default is a “paragraph”. But since in some cases (i.e. DN and MN) there may be multiple paragraphs per section, adding a layer of “paragraph” information would be possible. In addition, the HTML in the texts does encode various other kinds of information: whether something is a heading, a verse, a statement summarizing a text, and so on. I suspect that most of this would be of little value for a text reader, but the information is there if needed.


#18

@Blake, I’ve been parsing PO files and have found the following JSON sutta structure quite adequate for voice assistance:

var sutta = {
  meta: { ... },
  segments: [{
    scid: 'mn1:1.1',
    pli: '...Pali...',
    en: '...English...',
  },{
     ...
  }],
}

Oddly, the direct key lookup via scid we discussed previously was not needed. I am simply searching segments using the following line of Javascript:

 var result = segments.filter(seg => /root of suffering/.test(seg.en));

Here is how I find all the segments of section 2 in mn1:

 var result = segments.filter(seg => /^mn1:2\./.test(seg.scid));

And here is how I search the Pali:

 var result = segments.filter(seg => /viharati/.test(seg.pli));

Although this is an exhaustive O(N) search, it’s actually quite fast and good enough for all my needs. In the spirit of Donald Knuth’s warning about premature optimization, I’ll be shying away from mapping scids to segments unless it proves critically necessary.
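
If O(1) lookup ever does become necessary, building an index over the same array is a one-liner, sketched here with a couple of stand-in segments in the shape described above:

```javascript
// Stand-in segment data in the {scid, pli, en} shape described above.
const segments = [
  { scid: 'mn1:1.1', pli: 'Evaṁ me sutaṁ—', en: 'So I have heard.' },
  { scid: 'mn1:1.2', pli: '...', en: '...' }
];

// A Map from scid to segment gives O(1) lookup when needed.
const byScid = new Map(segments.map(seg => [seg.scid, seg]));
const seg = byScid.get('mn1:1.1');
```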

Basically, I’d prefer the above JSON object with a simple array of segments as the JSON response for a REST API.


#19

To be clear, PO is deprecated and as soon as possible will be completely absent from our system. The JSON structure we have outlined will be our new data format. So don’t build on PO!


#20

The po-parser is already written and took only a day. Its sole purpose is to generate the JSON I described above so that we can get a prototype in front of people. The rest of the code does require the JSON I outlined above. If SuttaCentral returns some other JSON schema, I will need to spend about a day converting that JSON to this JSON. I’d rather not spend that day, but there may be other consumers with separate requirements.