SuttaCentral

Requirements for representing suttas in JSON


#1

To help software process suttas, it is convenient to have a single internal representation for each sutta. JSON has become a common data format and is distinguished by being THE data exchange format for Javascript, which is understood by all browsers. Notably, a JSON document can be parsed directly with a language primitive: JSON.parse(document). To understand how we might represent a sutta in JSON, we should first examine requirements.

Here are some requirements to consider for representing suttas in JSON. Most of the following should be obvious, but there may be some nuances for discussion:

  • unique JSON document for each Pali sutta: we must have one and only one canonical JSON document for each Pali sutta.
  • unique JSON document for each sutta translation: we must have at most one JSON document for each sutta translation. I.e., it’s fine to include translations with the Pali original JSON.
  • segment id lookup: we must be able to get from a text segment id to the corresponding text segment with ease. Note: a JSON object keyed by segment id would do this.
  • immutable segment ids: segment ids are used for translation references and must therefore be global and permanent. Note: we should probably just leave the current segment ids as they are, inconsistencies and all.
  • segment sequence: must be derivable from the JSON. Note: if we use segment ids as JSON object keys, the segment ids must be sortable, since there is no guarantee of key order in JSON objects. Given the inconsistencies in segment ids today, I’m not sure we can guarantee that sorting order matches sutta order.
  • diff-able: we must be able to compare different revisions of a sutta or of a translation. Note: existing JSON diff utilities already do this.
  • presentation support at sutta granularity: it must be possible to generate HTML for the entire sutta from the JSON document. I.e., let’s not have a separate document for presentation.
  • presentation support at segment granularity: it must be possible to generate HTML for a single text segment from the JSON document. Note: this is probably trivially achieved by wrapping the segment in <div class="myfavoritesegmentcss">.
  • support for Pali canon line groups: the Pali canon groups lines together, and so should we, for Dhamma transmission fidelity.
  • presentation support at Pali canon line group granularity: it must be possible to generate HTML for a Pali segment group from the JSON document. Note: this is probably trivially achieved by wrapping the segments in <div class="myfavoritelinegroupcss">.
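
To make the lookup and segment-granularity requirements concrete, here is a minimal sketch in Javascript. The ids are MN 1 segment ids, but the text and the CSS class are purely illustrative:

```javascript
// Hypothetical segment-id-keyed sutta document (text abbreviated).
const sutta = {
  'mn1:1.1': 'Evaṁ me sutaṁ—',
  'mn1:1.2': 'ekaṁ samayaṁ bhagavā ukkaṭṭhāyaṁ viharati.'
};

// Segment id lookup is a direct key access.
const text = sutta['mn1:1.1'];

// Segment-granularity presentation: wrap one segment in a div.
const segmentHtml = id => `<div class="segment">${sutta[id]}</div>`;
```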

#2

We can absolutely 100% guarantee this. The ids are in fact programmatically generated. They can’t be sorted with a trivial sort key like an ASCII sort; instead they are sorted by generating tuples of integers, splitting on hyphens and periods. To take the example from the github issue:

id            sort key
mn1:27.1      (27, 1)
mn1:28-49.1   (28, 49, 1)
mn1:28-49.22  (28, 49, 22)
mn1:28-49.23  (28, 49, 23)
mn1:50.1      (50, 1)

Those tuples are then sorted using Python’s standard sort algorithm for tuples, and this produces the desired order; in Python it’s just about a two-liner. We’ve been using this sort algorithm for quite a while to put files in proper order, because the Mahasangiti original came in the form of a lot of HTML files whose filenames use an insane number of periods and ranges, such as 2.12.2.4-9 Saṃyojanagocchakādi.html, and I’d always sort the files before processing them. It so happens the MS files also contain strictly ascending paragraph ids, so it could trivially be verified that sorting by filename yielded the correct order despite all the wacky uses of periods and hyphens.
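
The same approach translates directly to Javascript. This is only a sketch, not the actual SC code: split the part after the colon on hyphens and periods, map to integers, and compare the resulting tuples lexicographically:

```javascript
// Derive a numeric sort key from a segment id such as "mn1:28-49.22".
// The part after the colon is split on hyphens and periods into integers.
function sortKey(id) {
  return id.split(':')[1].split(/[-.]/).map(Number);
}

// Lexicographic comparison of integer tuples, like Python's tuple sort.
function compareIds(a, b) {
  const ka = sortKey(a), kb = sortKey(b);
  for (let i = 0; i < Math.max(ka.length, kb.length); i++) {
    const x = ka[i] ?? -1, y = kb[i] ?? -1; // a shorter tuple sorts first
    if (x !== y) return x - y;
  }
  return 0;
}

const ids = ['mn1:50.1', 'mn1:28-49.22', 'mn1:27.1', 'mn1:28-49.1', 'mn1:28-49.23'];
ids.sort(compareIds);
```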


#3

Markup

The Markup represents a conundrum. In the original github issue, something like this is proposed:

  "dn1:1.1.1": "<p><span class=\"evam\">{dn1:1.1.1}</span>",
  "dn1:1.1.2": "{dn1:1.1.2}</p>"

There are two problems, though. First, these are not valid HTML snippets: in this case you can only get a block of valid HTML by concatenating the two strings, and in the general case you could only get a block of valid HTML by concatenating ALL the strings. One question: what is the purpose of breaking the markup into smaller units if those units are not individually valid?

The second problem is one of ownership. The first string defines “open paragraph” and the second defines “close paragraph”, but the “close paragraph” doesn’t really “belong” to the second string alone: it closes the paragraph opened before the first string. The paragraph owns the two strings.

At the moment we just use blocks of HTML. Taking a random example and cleaning it up a little:

<section class="sutta" id="sn1.6">
  <article>
    <div class="hgroup">
      <p class="division">{sn1.6:0.1}</p>
      <p>{sn1.6:0.2}</p>
      <h1>{sn1.6:0.3}</h1>
    </div>
    <p>{sn1.6:1.1} {sn1.6:1.2}</p>
    <blockquote class="gatha">
      <p>
        {sn1.6:2.1}<br>
        {sn1.6:2.2}<br>
        {sn1.6:2.3}<br>
        {sn1.6:2.4}
      </p>
      <p>
        {sn1.6:3.1}<br>
        {sn1.6:3.2}<br>
        {sn1.6:3.3}<br>
        {sn1.6:3.4}
      </p>
    </blockquote>
  </article>
</section>

The block is valid HTML. At render time the relevant strings are inserted into the markup.

The first solution to the Markup conundrum is to continue doing exactly this: we just have one big block of markup for each sutta.

The second possibility would be to break it up into smaller chunks, but we would want to guarantee that each chunk is itself legal HTML, so that it’s actually meaningful to render only part of the markup.

For example we could chop off the section and article: that’s easily templated away. Then we could divide it like this:

[
 `<div class="hgroup">
    <p class="division">{sn1.6:0.1}</p>
    <p>{sn1.6:0.2}</p>
    <h1>{sn1.6:0.3}</h1>
  </div>`,
  
 `<p>{sn1.6:1.1} {sn1.6:1.2}</p>`,

 `<blockquote class="gatha">
    <p>
      {sn1.6:2.1}<br>
      {sn1.6:2.2}<br>
      {sn1.6:2.3}<br>
      {sn1.6:2.4}
    </p>
    <p>
      {sn1.6:3.1}<br>
      {sn1.6:3.2}<br>
      {sn1.6:3.3}<br>
      {sn1.6:3.4}
    </p>
  </blockquote>`
]

The blockquote has to remain as one thing.

Other options would essentially involve “hacks” - actually rather literally, hacking apart things in ways that generate illegal fragments.

I would tend to favor the second solution: breaking the document up into the smallest valid units of HTML and putting them in an array. This would have a measure of value for generating rich previews and such. It does mean the markup would be rather different from the other data, although the ownership problem kind of guarantees it has to be handled differently.
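
As a sketch of how the array-of-chunks form could be consumed (the segment text here is illustrative, not the canonical translation), rendering is just placeholder substitution per chunk:

```javascript
// Hypothetical render step: replace {segment-id} placeholders in a markup
// chunk with the corresponding text from a segment map.
const segments = {
  'sn1.6:1.1': 'At Sāvatthī.',
  'sn1.6:1.2': 'Standing to one side, ...'
};
const chunks = ['<p>{sn1.6:1.1} {sn1.6:1.2}</p>'];

const render = chunk =>
  chunk.replace(/\{([^}]+)\}/g, (_, id) => segments[id] ?? `(missing: ${id})`);

const html = chunks.map(render).join('\n');
```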


#4

(That’s an enormous relief. I had this horrific vision of Bhante Sujato laboriously counting and typing!)

Would you help me understand the oddity of MN1:28-49.23 and MN1:50.1? To my mind, looking at the original Pali, they should actually be in the same line group, just above the 26th Taɱ kissa hetu. This break is not in the Pali.

NOTE: With the segment ids being canonically sortable as tuples, I’m totally happy with the proposed use of segment ids as keys in a JSON object. The values could be the exact text, or they could be extended to full JSON objects with multiple properties for text, annotation, etc. Thanks!
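
For instance, the extended form might look like this; the property names are just a guess at what could be useful, not an agreed schema:

```javascript
// Hypothetical extended values: segment id keys map to objects
// rather than bare strings (property names are illustrative).
const sutta = {
  'mn1:1.1': {
    text: 'Evaṁ me sutaṁ—',
    annotation: 'The standard opening of a sutta.'
  }
};
```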


#5

Since Javascript is quite adept at templates, I would propose using a template syntax directly evaluable by Javascript, e.g., ${sn(1.6,2,1)}. The use of dollar sign and curly braces invokes Javascript; in this example, sn is just a function. This would allow the source to be viewed both as a template with the references or as HTML with the references evaluated:

    var snsutta = {
         "sn1.6:2.1": "How many sleep while others wake?",
    };
    var sn = (s,a,b) => {
        var key = `sn${s}:${a}.${b}`; 
        return snsutta[key] || `(not found:${key})`;
    };
    var html = '<html>${sn(1.6,2,1)}</html>';
    console.log(`template: ${html}`);
    console.log("HTML:", eval("`" + html + "`"));

Executable templates can help a bit with the separator issue you mentioned, since a Javascript blockquote(...args) function can do all the separator finagling:

  <html>${blockquote(sn(1.6,2,1),sn(1.6,2,2))}</html>
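
For example, a minimal blockquote() helper (the class name is assumed from the markup example above) might look like this, with the segment references left as literal placeholders here:

```javascript
// Sketch of a blockquote() template helper: it joins verse lines with
// <br> so the template need not manage separators itself.
const blockquote = (...lines) =>
  `<blockquote class="gatha"><p>${lines.join('<br>')}</p></blockquote>`;

const verse = blockquote('{sn1.6:2.1}', '{sn1.6:2.2}');
```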

#6

Thanks for the useful summary, it seems we are moving in the right direction.

For us, a Sutta is an abstract entity that has multiple concrete representations. These may include such things as:

  • Original text, which currently has only one edition, but which may (probably will) have several editions
  • translations (ditto)
  • markup
  • references
  • audio files
  • parallels
  • and so on.

So the abstract entity is a single thing, but it has multiple representations. For the most part, these are accessible to a user in our “suttaplex” cards.

I’m not exactly sure what you mean here: are you saying keep the HTML in the same JSON file as the text? 'Cos that’s not going to happen. But see Blake’s proposal.

Remember, while HTML is dominant ATM, in principle we could support multiple sets of markup. Practically, that means we will probably do LaTeX.

Have a look at how the SC app works under the hood. It’s based on web components, so this is currently achieved with a custom element:

<sc-seg id="mn1:1.3" class="translated-text" lang="en">There the Buddha addressed the mendicants:</sc-seg>

I am not sure exactly what you mean by these. Are you referring to the lines on Obo’s site like this?

Tatra kho Bhagavā bhikkhū āmantesi:|| ||
“Bhikkhavo” ti.|| ||

If so, then these are merely editorial decisions for this edition (Buddha Jayanthi) and are not represented in the SC text. Our granular support is our segments, and we do not propose adding any other granular system.

Well, the idea was that the HTML has no intrinsic relation to the text. It is purely contingent, and is only used to conveniently build a web page, at which time it is in fact valid HTML. But I see your point: it comes down to the basic nature of HTML, which is that it is for “documents”.

I prefer the first example; it seems nice and simple. The second one seems like overkill. Surely in the vast majority of cases we simply render a full text, so why should we take on extra complexity to cater for marginal cases? Currently we either render a full text or a single segment, so I am not sure there is really a role for what is essentially “render a single block-level element”.

Even to take your example of a “rich preview”, I doubt if this would be useful. Consider a case of a verse text. Since they are wrapped in <blockquote> we have to show the whole set of verses, which might easily be several tens of verses. Furthermore, how is the HTML going to actually be relevant in the specialized context? The point of having a verse markup is to distinguish it from prose, but what does that mean in another context, or even, say, in audio?

Surely it would be better to simply use segments for this. For a rich preview, show three segments. Done!

Anyway, it’s up to you, but I am not convinced by that example so far. BTW, please feel free to update my Github issue, it should represent a reasonably current proposal.

It would help if you could link to the exact segment in our text, or show the literal text from the site you’re linking. Otherwise I am not really sure what the problem is here; it looks fine to me.

Remember, all the breaks, segments, paragraphs and so on in modern editions are conventions introduced by modern editors, they do not exist in the manuscripts.

Manuscripts do, it is true, sometimes make use of certain dividers, the danda | and double danda ||. But the relation between the manuscript punctuation and that used in modern editions is, so far as I know, completely unknown. It is safest to simply assume that all punctuation is modern. If you want to see how a Pali manuscript actually looks, check out our bendall-cv project:

I know nothing about JS, but just to ensure we’re on the same page, our app will be transitioning over to LitElement, which instantiates HTML <template> tags using ES2015 tagged template literals. So any templating we do should be compatible with this. But I think you’re already using template literals in your example?

https://polymer.github.io/lit-html/


#7

Bhante, from what Blake said, I believe the existence of || || triggers a new section number in the SC auto-numbering software. Here is the Mulapariyaya I am referencing. It is a single section that ends with a double line break || ||:

sabbaɱ -||
nibbānaɱ nibbānato abhijānāti,||
nibbānaɱ nibbānato abhiññāya nibbānaɱ mā maññi,||
nibbānasmiɱ mā maññi,||
nibbānato mā maññi,||
nibbānaɱ-me ti mā maññi,||
nibbānaɱ mā abhinandi.|| ||

In this Pali line group there is no || || after nibbānaɱ nibbānato abhijānāti. It is only a single line break ||. However, the SC MN 1 sutta numbering does have a section break between MN1:28-49.23 and MN1:50.1:

MN1:28-49.22 all …
MN1:28-49.23 They directly know extinguishment as extinguishment
MN1:50.1 But they shouldn’t conceive extinguishment,

Since the automated SC numbering of MN1 doesn’t match the Pali text in the link above, the SC Pali text used for MN1 numbering was probably slightly different from the Obo Pali text linked above. It’s a minor maddening point, but I had assumed the Pali text to be invariant.

SC MN1 is also inconsistent with itself in that we have a later section with a consecutive numbering:

MN1:54-72.22 all …
MN1:54-72.23 They directly know extinguishment as extinguishment
MN1:54-72.24 But they don’t conceive extinguishment,

The primary concern is validity of reference. We all rely heavily on references for quotes and comparison. If segment ids change, our links break. Indeed, I believe that if we applied the SC auto-segmenting algorithm to the Pali text I linked above, we would end up with different numbering. But I don’t think we should renumber anything.

To maintain link integrity, we should probably NOT change any segment id. They are what they are. In particular, I’d recommend that we simply live with the semantic quirks of inconsistent section id breaks. They also are what they are. These are just like two bad bricks in a beautiful wall.


#8

Ah. OK. I see that LitElement does actually adopt the Javascript ${template} syntax. That’s a relief, since it would clear up some of the hiccups mentioned by Blake.
:pray:


#9

I agree. In fact, thinking further on it, I think it might just be best to store the HTML in .html files. It’s not exactly pleasant editing HTML when it’s stored in JSON strings, because you can’t have literal newlines; this also screws up diffing. Transmitting it as JSON is fine.
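
To illustrate the newline problem: a multi-line HTML chunk survives a JSON round trip, but the serialized form holds it as one long line with \n escapes, which is what hurts hand-editing and line-based diffs:

```javascript
// Multi-line HTML stored in a JSON string: literal newlines must be
// escaped, so the serialized form is a single physical line.
const markup = '<p>\n  {sn1.6:1.1}\n</p>';
const serialized = JSON.stringify({ markup });

// The JSON text itself contains no literal newline characters...
console.log(serialized.includes('\n')); // false
// ...though the value round-trips intact.
console.log(JSON.parse(serialized).markup === markup); // true
```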


#10

Wow! Great to see a pristine canonical Pali source in Github. Since these will no doubt be immune to Western editorial changes, I hope we can use these directly for segment numbering to avoid any future hiccups. Basing segment numbering on the vagaries of Western editions is brittle.


#11

The issue with multiple representations is that they can fracture the value available to a user. For example, suppose the HTML has a wonderful annotation in a .html file that elaborates on a fine semantic point. That annotation would have to be copied and maintained in a LaTeX version. And the annotation would not be available in audio, because the audio doesn’t consult the HTML.

Even something as minor as section headings editorially introduced in the HTML for “Lay person”, “Trainee”, “Arahant” and “Tathagata” would be valuable to audio listeners as navigational aids. I.e., the existence of a heading is semantic, since grouping is in this sense a semantic concern for navigation. I think we all agree that the heading font and color are presentation concerns alone.

For this reason, it is valuable to have each sutta and each translation represented semantically as a single document. For example, at Cengage our source documents were XML, and they were automatically translated into HTML via XSLT stylesheets. We did not have separate HTML files. Indeed, we could have generated LaTeX from the XML files.

For SuttaCentral, JSON is much better than XML. And Javascript is better than XSLT. I would still, however, hesitate to adopt multiple semantic representations independently maintained. The effort to keep them semantically aligned is potentially daunting. I am new to this domain, so this is merely an observation.


#12

I’m glad you made this edit, because otherwise I would have had to write a diatribe about how much better XML is than JSON. Whew! Now I just have professional curiosity: what are the factors that lead you to say that for SC, JSON is much better than XML?


#13

Oh dear. I didn’t mean to cause an uproar. I used XML/XSLT/Java for about 5 years. It was my job as software content architect to recommend, design, and even implement XML-based solutions. I found everything involving XML to be rather massive and monolithic: XML has the nimbleness of a battleship. It requires special editors (e.g., Arbortext) and really complicated transformation software (XSLT). Etc. In contrast, modern web development with JSON/Javascript is light and fast: JSON.stringify(obj) and you’re done. With Java and XML, I couldn’t begin to write the serialization from memory, even though I’ve written the code countless times. And traversing the DOM in Java? Lots of code. In Javascript, not so much, since JSON is the serialization of a Javascript object. Things like this make the XML skillset quite large, making it difficult to hire engineers who want to work with XML/XSLT. Hiring Javascript/JSON engineers? No problem.

Please say more about your passion for XML. I am quite curious. :heart:


#14

Re the greatness of XML, I was (mostly) joking. I do use XML professionally, in a specialized field (knowledge representation) that puts up with the cumbersome aspects of XML in order to take advantage of XML schemas for standardization and validation purposes. We regularly look longingly at the sleekness of JSON, but the development of JSON schemas is still lagging far behind.


#15

There’s no general “autonumbering software”. Let me give you a little background.

We wanted to translate text, so we looked for translation software. We settled on Pootle, which uses segmented texts. We discussed how to segment our Pali texts—the Mahasangiti edition—and settled on the system I mentioned above. The reason we thought this would work is because it is by and large consistently edited. The segmenting was then done in Python by Blake, and adjusted by hand by me. The native numbering of the MS system was adopted—essentially, one number per paragraph, with segments as increments of that—with the exceptions listed previously, which were added by hand and later checked by machine (because there’s no automatic way of knowing where the numbers from another edition should go).

The BJT edition that you link to has nothing to do with this process. If there is any relation between its breaks and ours, it is purely because the breaks do, by and large, occur on genuine semantic breaks in the text itself. Whether either of them has anything to do with an actual manuscript tradition is unknown.

They are not. The text is fairly consistent, but it would be wise to avoid making any assumptions about the punctuation.

You’re really digging into the details here! In this case, compare with Ven Bodhi’s translation, which is the source for these numbers. He gives a much-abbreviated translation of this passage, hence we only have the range (54-72) to work with, and number the segments per range. Another approach would have been to try to isolate each item within the range and number that, but that would be a lot of work, and would likely also result in problematic situations.

Yep, it’s plain ol’ JS inside.

Sounds reasonable.

I’m afraid that’s not gonna work. I haven’t studied the manuscripts in detail, but from what I understand, modern editions—specifically our edition—are far more consistent in applying punctuation than any manuscript. You can see this for yourself by looking at the sanitized version of bendall-cv:

And searching on the page for āroces. This will find all the instances of a stock phrase in Vinaya texts, where the Buddha makes an announcement. There are 19 cases: of these, 13 are followed by ||, 3 by |, and 3 by no punctuation.

Now compare this with the Mahasangiti text of the same passage, which you can find here:

https://raw.githubusercontent.com/sujato/bendall-cv/master/mahasangiti-files/ms-kd14--kd15-segmented.txt

Here there are 25 occurrences of āroces (more than bendall-cv because that text is incomplete). All of these are followed by a full stop, except in cases where there is abbreviation, in which case they have ellipsis.

Thus the modern edition has been corrected and made more consistent as compared to the manuscript. This is only one example, and bendall-cv is a very unusual case, but you get the point.

In the future, we’ll segment the whole of the Pali canon, using the Mahasangiti edition.

That’s the plan.

The whole point of this was to avoid this kind of situation. Notes would not be kept in HTML, but in a separate JSON file, or in many JSON files for many sets of notes. Since every note is identified by the same segment number, it can be applied anywhere: to original texts, translations, or whatever. If you want notes in your audio, great, grab them from the JSON.

Again, that is the point. But there is no single text that is “MN 1”. There are multiple Pali editions, each of which has a different version of MN 1 and must be distinguished by language, text number, and edition. I.e.:

  • MN 1 equals:
    • MN 1 Pali Mahasangiti
    • MN 1 Pali Buddha Jayanthi
    • MN 1 Pali PTS
    • MN 1 English Sujato
    • MN 1 English Bodhi
    • references …
    • parallels ….
    • audio files …
    • ye vā pan’aññā (“or whatever else…”)

Now, currently we only have one Pali text, but that will change. The bendall-cv project is the first step towards working out how to integrate multiple Pali editions. And it is based on the fact that, for all their differences, at the semantic segment level different Pali texts can be aligned. So the idea will be to align each Pali text on the segments as already established via the Mahasangiti.

Have you seen the work done on standoff properties? This was by a group of programmers frustrated with the limitations of XML. Not just the contingent limitations of tooling and complexity, but the deeper problem that any markup system must represent a hierarchy, and in ancient editions we are often faced with multiple overlapping and inconsistent hierarchies. So essentially all markup is separated from text and maintained in JSON files. Standoff properties are awesome, and I believe, the future; but they are very much a work in progress.
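
A toy illustration of the standoff idea, entirely schematic and not any real standoff-properties format: the plain text is never touched, and the markup lives separately as character offsets into it:

```javascript
// Schematic standoff markup: annotations reference character offsets
// into an immutable plain text, instead of being embedded in it.
const text = 'Thus have I heard.';
const standoff = [
  { start: 0, end: 18, type: 'paragraph' }, // the whole sentence
  { start: 0, end: 4, type: 'emphasis' }    // just the opening word
];

// Each layer can be resolved against the text independently.
const spanText = a => text.slice(a.start, a.end);
```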

You can see a brief discussion in our issue here (with the caveat that this issue lags behind the discussion here on Discourse!)


#16

(reads pdf…) :astonished:
Wow. Coming from the pristine world of canonical current documents, I now see the enormous challenge of representing historical documents with their embedded historical edits and variations (!). :scream:

I really like the stable plaintext Unicode manuscript annotated by standoff layer semantic markup. The plaintext itself is stable, and yet each standoff layer can faithfully represent an independent perspective. Hence we have multiple files. It’s all a bit head-exploding. And good.

As a consumer of this information for vision assistance, all I really need is a JSON representation of the “SC accepted” layer as a sequence of text segments. Ideally, a rudimentary paragraph-level grouping of text segments would be welcome for assisted navigation. Currently, a grouping that exists in the source manuscript can be inferred from the text segment id numbering. However, if a translation standoff layer introduces additional hierarchy useful for navigation, we may wish to represent that additional hierarchy or grouping somehow in the JSON.

I’m having fun puzzling out voice assistance, but if you and Blake need assistance with any of this I will happily re-prioritize. Thank you both for all your patient explanation. I certainly have learned a lot today!

:pray:


#17

Well, the segment IDs essentially give you a section, which by default is a “paragraph”. But since in some cases (i.e. DN and MN) there may be multiple paragraphs per section, adding a layer of “paragraph” information would be possible. In addition, the HTML in the texts does encode various other kinds of information: whether something is a heading, a verse, a statement summarizing a text, and so on. I suspect that most of this would be of little value for a text reader, but the information is there if needed.


#18

@Blake, I’ve been parsing PO files and have found the following JSON sutta structure quite adequate for voice assistance:

var sutta = {
  meta: { ... },
  segments: [{
    scid: 'mn1:1.1',
    pli: '...Pali...',
    en: '...English...',
  },{
     ...
  }],
}

Oddly, the direct key lookup via scid we discussed previously was not needed. I am simply searching segments using the following line of Javascript:

 var result = segments.filter(seg => /root of suffering/.test(seg.en));

Here is how I find all the segments of section 2 in mn1:

 var result = segments.filter(seg => /^mn1:2\./.test(seg.scid));

And here is how I search the Pali:

 var result = segments.filter(seg => /viharati/.test(seg.pli));

Although this is an exhaustive O(N) search, it’s actually quite fast and good enough for all my needs. In the spirit of Donald Knuth’s warning about premature optimization, I’ll be shying away from mapping scids to segments unless it proves critically necessary.
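
If O(1) lookup ever does become necessary, building an index over the same array is a one-liner, sketched here with a couple of stand-in segments in the shape described above:

```javascript
// Stand-in segment data in the {scid, pli, en} shape described above.
const segments = [
  { scid: 'mn1:1.1', pli: 'Evaṁ me sutaṁ—', en: 'So I have heard.' },
  { scid: 'mn1:1.2', pli: '...', en: '...' }
];

// A Map from scid to segment gives O(1) lookup when needed.
const byScid = new Map(segments.map(seg => [seg.scid, seg]));
const seg = byScid.get('mn1:1.1');
```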

Basically, I’d prefer the above JSON object with a simple array of segments as the JSON response for a REST API.


#19

To be clear, PO is deprecated and as soon as possible will be completely absent from our system. The JSON structure we have outlined will be our new data format. So don’t build on PO!


#20

The po-parser is already written and took only a day. Its sole purpose is to generate the JSON I described above so that we can get a prototype in front of people. The rest of the code does require the JSON I outlined above. If SuttaCentral returns some other JSON schema, I will need to spend about a day converting that JSON to this JSON. I’d rather not spend that day, but there may be other consumers with separate requirements.