SuttaCentral, the Bilara edition: from markup to JSON, a journey!

sujato · November 2, 2020, 11:31am

The Updates section has been slow for a long time now, and I wanted to rectify that by talking about some of the changes we’ve been working on. We’ve undertaken a rather major revision of the project, and it has unfortunately become deeply intertwingled, so we have to release a massive bunch of changes all at the same time. Even though we’re not ready to go live yet, I thought I’d make some posts outlining the major changes. These posts will have the following disclaimer:

These updates are a work in progress. You can check how we’re going on our staging site, but expect things to be broken. We’ll be seeking feedback at some point, but for now, we’re just trying to keep our heads down and work through the issues.

Many of these issues are things that have been discussed before, but the discussions are often preliminary and disorganized. So here I’ll aim to present a more accessible survey of the major changes.

Okay!

The past

The most fundamental change we are making is to put all of our text and other main data in JSON rather than HTML. Here I’ll explain what that means and why it is cool.

Mahasangiti XML/HTML

Let me take as an example SN 1.1. Here is more-or-less what it looked like in our source files from the Mahasangiti project (“more-or-less” because there are a few versions).

<div>
   <CRUMBS><a href="/">Home</a> » <a href="/tipitaka/12S1">12S1 Sagāthāvaggasaṃyuttapāḷi</a> » <a href="/tipitaka/12S1/1">1 Devatāsaṃyutta</a> » <a href="/tipitaka/12S1/1/1.1">1.1 Naḷavagga</a></CRUMBS>
   <heading>1.1.1 Oghataraṇasutta</heading>
   <div class="hidden">Devatāsaṃyutta</div>
   <div class="hidden">Naḷavagga</div>
   <div class="hidden">Oghataraṇasutta</div>
   <div class="q" id="p_12S1_2">
      <div class="divNumber">1.</div>
      <div class="p"><span class="pN">2</span>Evaṃ me sutaṃ—  ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati jetavane anāthapiṇḍikassa ārāme. Atha kho aññatarā devatā abhikkantāya rattiyā abhikkantavaṇṇā kevalakappaṃ jetavanaṃ obhāsetvā yena bhagavā tenupasaṅkami; upasaṅkamitvā bhagavantaṃ abhivādetvā ekamantaṃ aṭṭhāsi. Ekamantaṃ ṭhitā kho sā devatā bhagavantaṃ etadavoca—  “kathaṃ nu tvaṃ, mārisa, oghamatarī”ti? “Appatiṭṭhaṃ khvāhaṃ, āvuso, anāyūhaṃ oghamatarin”ti. “Yathākathaṃ pana tvaṃ, mārisa, appatiṭṭhaṃ anāyūhaṃ oghamatarī”ti? “Yadāsvāhaṃ, āvuso, santiṭṭhāmi tadāssu saṃsīdāmi; yadāsvāhaṃ, āvuso, āyūhāmi tadāssu nibbuyhāmi. Evaṃ khvāhaṃ, āvuso, appatiṭṭhaṃ anāyūhaṃ oghamatarin”ti. </div>
   </div>
   <div class="q" id="p_12S1_3">
      <div class="p">
         <table class="singleColumn">
            <tr>
               <td>
                  <div class="pN">3</div>
                  <div class="G">“Cirassaṃ vata passāmi,</div>
               </td>
            </tr>
            <tr>
               <td>
                  <div class="G">brāhmaṇaṃ parinibbutaṃ;</div>
               </td>
            </tr>
            <tr>
               <td>
                  <div class="G">Appatiṭṭhaṃ anāyūhaṃ,</div>
               </td>
            </tr>
            <tr>
               <td>
                  <div class="G">tiṇṇaṃ loke visattikan”ti.</div>
               </td>
            </tr>
         </table>
      </div>
   </div>
   <div class="q" id="p_12S1_4">
      <div class="p"><span class="pN">4</span>Idamavoca sā devatā. Samanuñño satthā ahosi. Atha kho sā devatā—  “samanuñño me satthā”ti bhagavantaṃ abhivādetvā padakkhiṇaṃ katvā tatthevantaradhāyīti. </div>
   </div>
</div>

It’s an XML source, lightly transformed into HTML. It’s mostly clear, although there are some eccentricities that betray its era, like using a table to mark up verse.

Notice that this file combines a range of different kinds of data:

Application logic: It starts with a set of breadcrumbs that define the place in the canon.
Multiple headings: we find a repetition of the headings, and it’s not entirely clear what purpose this serves.
Multiple references: we find the divNumber, the ID on class q, and class pN (= “paragraph number”). These overlapping references all play distinct roles.
Mixed semantics and content: tags such as div demarcate the content, albeit in a vague and unsemantic way.
Style information: For example, the class singleColumn.
Text content: I almost forgot! There’s the actual text, too.

So that’s quite a lot. And this is a good example, a well-formed and considered markup system, so pretty much every other source is less good.

It’s generally a good idea to separate out concerns, especially when managing a large body of content with a strict emphasis on precision and quality control. The idea that we can use “markup” to mix up content and code like this is an old one, going back to the early seventies at least, and it obviously works for many things. But it also leads to problems. For example, what do we do with the breadcrumbs? Not everything in SuttaCentral is organized the same way, so we can’t rely on them.

SuttaCentral HTML

So when we adopted this text for SC, we transformed it to the following:

<section class="sutta" id="sn1.1">
<article>
<div class="hgroup">
<p class="division">Saṃyutta Nikāya 1</p>
<p>1. Naḷavagga</p>
<h1>1. Oghataraṇasutta</h1>
</div>
<p><a class="sc" id="1"></a><a class="pts1ed" id="pts1ed1.1"></a><a class="pts2ed" id="pts2ed1.1"></a><a class="ms" id="p_12S1_2"></a><a class="msdiv" id="msdiv1"></a><span class="evam">Evaṃ me sutaṃ—</span>   ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati jetavane anāthapiṇḍikassa ārāme. Atha kho aññatarā devatā abhikkantāya rattiyā abhikkantavaṇṇā kevalakappaṃ jetavanaṃ obhāsetvā yena bhagavā tenupasaṅkami; upasaṅkamitvā bhagavantaṃ abhivādetvā ekamantaṃ aṭṭhāsi. Ekamantaṃ ṭhitā kho sā devatā bhagavantaṃ etadavoca: “kathaṃ nu tvaṃ, mārisa, oghamatarī”ti? “Appatiṭṭhaṃ khvāhaṃ, āvuso, anāyūhaṃ oghamatarin”ti. “Yathākathaṃ pana tvaṃ, mārisa, appatiṭṭhaṃ anāyūhaṃ oghamatarī”ti? “Yadāsvāhaṃ, āvuso, santiṭṭhāmi tadāssu saṃsīdāmi; yadāsvāhaṃ, āvuso, āyūhāmi tadāssu <span class="var" title="nivuyhāmi (s1-3, km, mr)" id="note1">nibbuyhāmi</span>. Evaṃ khvāhaṃ, āvuso, appatiṭṭhaṃ anāyūhaṃ oghamatarin”ti.</p>
<blockquote class="gatha">
<p><a class="sc" id="2"></a><a class="verse-num-sc" id="verse-num-sc1"></a><a class="ms" id="p_12S1_3"></a>“Cirassaṃ vata passāmi,<br> brāhmaṇaṃ parinibbutaṃ;<br> Appatiṭṭhaṃ anāyūhaṃ,<br> tiṇṇaṃ loke visattikan”ti.
</p>
</blockquote>
<p><a class="sc" id="3"></a><a class="ms" id="p_12S1_4"></a>Idamavoca sā devatā. Samanuñño satthā ahosi. Atha kho sā devatā: “samanuñño me satthā”ti bhagavantaṃ abhivādetvā padakkhiṇaṃ katvā tatthevantaradhāyīti.</p>
</article>
</section>

In making this conversion, we aimed to:

Use pure HTML5, unmixed with XML.
Use semantic tags (no tables!).
Retain the MS reference data, and add some more.
Import variant readings (absent from the source example above).

This helped us to remove some of the cruft in the original:

no application logic
no style information
no unused headings

This allows us to define the content more precisely and in a way that is easily consumed, since HTML is universal.

But we still have a number of limitations.

How can we match a portion of the text to a portion of a translation?
The reference numbers remain in this text and cannot be applied to translations.
As we add more kinds of descriptive data (variant readings, comments, etc.), the file becomes more clogged up.
Data is described in a verbose way (<a class="verse-num-sc" id="verse-num-sc1"></a>), as HTML is not a data format.
With multiple translations it is easy for corruptions and variations to creep in.
And so on!

SuttaCentral PO

In particular, we needed to adapt the texts in a way that was suitable for translation. To do this, we used the PO file format, part of the open Gettext translation system. Now our text looked like this.

"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2018-06-18 13:15+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: templates\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"X-Generator: Translate Toolkit 2.2.5\n"
"X-Pootle-Path: /templates/sn/sn1.01.pot\n"
"X-Pootle-Revision: 17458\n"

#. <section class="sutta" id="sn1.1"><article><div class="hgroup"><p class="division">
msgctxt "sn1.1:0.1"
msgid "Saṃyutta Nikāya 1"
msgstr ""

#. </p><p>
msgctxt "sn1.1:0.2"
msgid "1. Naḷavagga"
msgstr ""

#. </p><h1>
msgctxt "sn1.1:0.3"
msgid "1. Oghataraṇasutta"
msgstr ""

#. </h1></div><p>
#. <a class="pts1ed" id="pts1ed1.1"></a>
#. <a class="pts2ed" id="pts2ed1.1"></a>
#. <a class="sc" id="sc1"></a>
#. <span class="evam">
msgctxt "sn1.1:1.1"
msgid "Evaṃ me sutaṃ—"
msgstr ""

#. </span>
msgctxt "sn1.1:1.2"
msgid ""
"ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati jetavane anāthapiṇḍikassa ārāme."
msgstr ""

msgctxt "sn1.1:1.3"
msgid ""
"Atha kho aññatarā devatā abhikkantāya rattiyā abhikkantavaṇṇā kevalakappaṃ "
"jetavanaṃ obhāsetvā yena bhagavā tenupasaṅkami; upasaṅkamitvā bhagavantaṃ "
"abhivādetvā ekamantaṃ aṭṭhāsi. Ekamantaṃ ṭhitā kho sā devatā bhagavantaṃ "
"etadavoca:"
msgstr ""

msgctxt "sn1.1:1.4"
msgid "“kathaṃ nu tvaṃ, mārisa, oghamatarī”ti?"
msgstr ""

# BB has “not halting “and” not straining”, and other translators have rendered similarly. However I am not quite happy with such translations, as they never appear to make the metaphor clear. The point, it seems to me, is that when one stands in a flood, when the water gets too deep you sink over your head; and if you swim, you get swept away. Perhaps we have been misled by the connotations of “sink” in English; it doesn’t quite work to say that “if you stand, you sink”. Rather, it means “get submerged”.
msgctxt "sn1.1:1.5"
msgid "“Appatiṭṭhaṃ khvāhaṃ, āvuso, anāyūhaṃ oghamatarin”ti."
msgstr ""

msgctxt "sn1.1:1.6"
msgid "“Yathākathaṃ pana tvaṃ, mārisa, appatiṭṭhaṃ anāyūhaṃ oghamatarī”ti?"
msgstr ""

msgctxt "sn1.1:1.7"
msgid "“Yadāsvāhaṃ, āvuso, santiṭṭhāmi tadāssu saṃsīdāmi;"
msgstr ""

#. VAR: nibbuyhāmi → nivuyhāmi (s1-3, km, mr)
msgctxt "sn1.1:1.8"
msgid "yadāsvāhaṃ, āvuso, āyūhāmi tadāssu nibbuyhāmi."
msgstr ""

msgctxt "sn1.1:1.9"
msgid "Evaṃ khvāhaṃ, āvuso, appatiṭṭhaṃ anāyūhaṃ oghamatarin”ti."
msgstr ""

#. </p><blockquote class="gatha"><p>
#. <a class="sc" id="sc2"></a>
msgctxt "sn1.1:2.1"
msgid "“Cirassaṃ vata passāmi,"
msgstr ""

#. <br>
msgctxt "sn1.1:2.2"
msgid "brāhmaṇaṃ parinibbutaṃ;"
msgstr ""

#. <br>
msgctxt "sn1.1:2.3"
msgid "Appatiṭṭhaṃ anāyūhaṃ,"
msgstr ""

#. <br>
msgctxt "sn1.1:2.4"
msgid "tiṇṇaṃ loke visattikan”ti."
msgstr ""

#. </p></blockquote><p>
#. <a class="sc" id="sc3"></a>
msgctxt "sn1.1:3.1"
msgid "Idamavoca sā devatā."
msgstr ""

msgctxt "sn1.1:3.2"
msgid "Samanuñño satthā ahosi."
msgstr ""

msgctxt "sn1.1:3.3"
msgid "Atha kho sā devatā:"
msgstr ""

msgctxt "sn1.1:3.4"
msgid ""
"“samanuñño me satthā”ti bhagavantaṃ abhivādetvā padakkhiṇaṃ katvā "
"tatthevantaradhāyīti."
msgstr ""

Now the content is more cleanly separated, with each text string labelled as msgid and assigned a unique ID in msgctxt. A blank space is left for the translation in msgstr. Usually such files will have that string filled in, which makes them bilingual files.

Because the reference data applies to both text and translation, this solves the problem of siloed references, as well as matching text and translation.
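To make the structure concrete, here is a minimal sketch of pulling the msgctxt, msgid, and msgstr fields out of entries like the ones above. It is not the code we actually used. The function name `parse_po` is hypothetical, and it assumes a simplified PO file with no escape sequences or plural forms. A real pipeline would more likely lean on an established PO library.

```python
import re

# Two entries copied from the example above.
PO_SAMPLE = '''\
#. </p><h1>
msgctxt "sn1.1:0.3"
msgid "1. Oghataraṇasutta"
msgstr ""

msgctxt "sn1.1:3.4"
msgid ""
"“samanuñño me satthā”ti bhagavantaṃ abhivādetvā padakkhiṇaṃ katvā "
"tatthevantaradhāyīti."
msgstr ""
'''

FIELD = re.compile(r'(msgctxt|msgid|msgstr) "(.*)"$')


def parse_po(text):
    """Return {msgctxt: (msgid, msgstr)} for a simplified PO file.

    Assumption: entries are separated by blank lines and strings
    contain no backslash escapes.
    """
    entries = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        current = None
        for line in block.splitlines():
            line = line.strip()
            if line.startswith("#"):
                continue  # comment lines, where markup and notes were stashed
            match = FIELD.match(line)
            if match:
                current = match.group(1)
                fields[current] = match.group(2)
            elif line.startswith('"') and line.endswith('"') and current:
                # Continuation line: append to the field being built.
                fields[current] += line[1:-1]
        if "msgctxt" in fields:
            entries[fields["msgctxt"]] = (
                fields.get("msgid", ""),
                fields.get("msgstr", ""),
            )
    return entries


entries = parse_po(PO_SAMPLE)
```

Since both the root text and the translation hang off the same msgctxt, the reference works for either side of the pair.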

While this served its purpose for translation, it is not ideal for long-term data storage.

We hacked the comments area to add HTML markup, variants, and comments; while this worked, it was messy and error-prone.
There’s a bunch of mostly useless metadata.
All the info is still included in the one file.

The present: Bilara JSON

Not finding an ideal solution off the shelf, we made our own: Bilara! Bilara is a translation system based on the plain-text JSON data exchange format. JSON is the native data exchange format of the web, so it works really well for us. Unlike traditional translation formats such as PO, each Bilara file holds a single type of content. All of the content in the original files is separated by type and kept in different directories, resulting in a super-clean and flexible format.

This is no longer a markup system, as the HTML is completely separate. In fact there is no special need to use HTML at all; we could add markup in XML, LaTeX, or whatever we want. This approach is known as “standoff markup”.

Here’s the same file in Bilara:

{
  "sn1.1:0.1": "Saṁyutta Nikāya 1 ",
  "sn1.1:0.2": "1. Naḷavagga ",
  "sn1.1:0.3": "1. Oghataraṇasutta ",
  "sn1.1:1.1": "Evaṁ me sutaṁ—",
  "sn1.1:1.2": "ekaṁ samayaṁ bhagavā sāvatthiyaṁ viharati jetavane anāthapiṇḍikassa ārāme. ",
  "sn1.1:1.3": "Atha kho aññatarā devatā abhikkantāya rattiyā abhikkantavaṇṇā kevalakappaṁ jetavanaṁ obhāsetvā yena bhagavā tenupasaṅkami; upasaṅkamitvā bhagavantaṁ abhivādetvā ekamantaṁ aṭṭhāsi. Ekamantaṁ ṭhitā kho sā devatā bhagavantaṁ etadavoca: ",
  "sn1.1:1.4": "“kathaṁ nu tvaṁ, mārisa, oghamatarī”ti? ",
  "sn1.1:1.5": "“Appatiṭṭhaṁ khvāhaṁ, āvuso, anāyūhaṁ oghamatarin”ti. ",
  "sn1.1:1.6": "“Yathākathaṁ pana tvaṁ, mārisa, appatiṭṭhaṁ anāyūhaṁ oghamatarī”ti? ",
  "sn1.1:1.7": "“Yadāsvāhaṁ, āvuso, santiṭṭhāmi tadāssu saṁsīdāmi; ",
  "sn1.1:1.8": "yadāsvāhaṁ, āvuso, āyūhāmi tadāssu nibbuyhāmi. ",
  "sn1.1:1.9": "Evaṁ khvāhaṁ, āvuso, appatiṭṭhaṁ anāyūhaṁ oghamatarin”ti. ",
  "sn1.1:2.1": "“Cirassaṁ vata passāmi, ",
  "sn1.1:2.2": "brāhmaṇaṁ parinibbutaṁ; ",
  "sn1.1:2.3": "Appatiṭṭhaṁ anāyūhaṁ, ",
  "sn1.1:2.4": "tiṇṇaṁ loke visattikan”ti. ",
  "sn1.1:3.1": "Idamavoca sā devatā. ",
  "sn1.1:3.2": "Samanuñño satthā ahosi. ",
  "sn1.1:3.3": "Atha kho sā devatā: ",
  "sn1.1:3.4": "“samanuñño me satthā”ti bhagavantaṁ abhivādetvā padakkhiṇaṁ katvā tatthevantaradhāyīti. "
}

As you can see, there is no markup, application logic, style information, reference data or other metadata, or literally anything else. Just a series of segments in key/value pairs where:

The key is the Bilara segment ID. This is globally unique in the corpus and serves to identify any instance of this segment.
The value is the content, in this case the root text.
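As a rough illustration of working with these keys, here is a small sketch that loads a Bilara-style file, rejects duplicate keys, and splits each segment ID into its parts. The helper names are hypothetical. It also assumes that everything after the colon is a dot-separated list of integers, which holds for this example but may not hold for every ID in the corpus.

```python
import json
from typing import NamedTuple


class SegmentId(NamedTuple):
    uid: str        # text identifier, e.g. "sn1.1"
    numbers: tuple  # numeric parts after the colon, e.g. (1, 3)


def parse_segment_id(key):
    """Split "sn1.1:1.3" into ("sn1.1", (1, 3))."""
    uid, _, rest = key.partition(":")
    if not uid or not rest:
        raise ValueError(f"not a segment ID: {key!r}")
    return SegmentId(uid, tuple(int(n) for n in rest.split(".")))


def load_segments(json_text):
    """Parse a segment file, raising if a key appears twice.

    This only checks uniqueness within one file, not across the corpus.
    """
    def no_duplicates(pairs):
        seen = {}
        for key, value in pairs:
            if key in seen:
                raise ValueError(f"duplicate segment ID: {key}")
            seen[key] = value
        return seen

    return json.loads(json_text, object_pairs_hook=no_duplicates)


segments = load_segments(
    '{"sn1.1:1.10": "(hypothetical)", "sn1.1:1.9": "b", "sn1.1:0.1": "a"}'
)
# Sort numerically, so that 1.10 follows 1.9 rather than 1.1.
ordered = sorted(segments, key=lambda k: parse_segment_id(k).numbers)
```

Sorting on the parsed numbers rather than the raw string is the point of the exercise: a plain string sort would put "sn1.1:1.10" before "sn1.1:1.9".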

This system has a lot of advantages for us, not least that we can continue adding more and more information on the same set of IDs, without creating messy data. Think multiple translation languages, multiple different translations in the same language, different sets of comments, extra reference details, and so on. We can even associate each segment ID with an MP3 file for audio, which is how SC-Voice works. And many more possibilities.
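Here is a sketch of what layering on shared IDs might look like. The root text and variant note are taken from the examples above. The shape of the markup layer, with "{}" marking where the text goes, is an assumption made for illustration, as are the function names `combine` and `render_html`.

```python
import html

# Root text: two segments from the Bilara example above.
root = {
    "sn1.1:1.7": "“Yadāsvāhaṁ, āvuso, santiṭṭhāmi tadāssu saṁsīdāmi; ",
    "sn1.1:1.8": "yadāsvāhaṁ, āvuso, āyūhāmi tadāssu nibbuyhāmi. ",
}

# A separate layer of variant readings, keyed by the same IDs.
variant = {"sn1.1:1.8": "nibbuyhāmi → nivuyhāmi (s1-3, km, mr)"}

# Hypothetical markup layer: "{}" is a placeholder for the segment text.
markup = {"sn1.1:1.7": "<p>{}", "sn1.1:1.8": "{}</p>"}


def combine(root, **layers):
    """Join any number of layers to the root text by segment ID."""
    rows = []
    for seg_id, text in root.items():
        row = {"id": seg_id, "root": text}
        for name, layer in layers.items():
            if seg_id in layer:
                row[name] = layer[seg_id]
        rows.append(row)
    return rows


def render_html(text, markup):
    """Pour segment text into the markup layer to rebuild HTML."""
    parts = []
    for seg_id, content in text.items():
        template = markup.get(seg_id, "{}")
        parts.append(template.replace("{}", html.escape(content, quote=False)))
    return "".join(parts)


rows = combine(root, variant=variant)
page = render_html(root, markup)
```

Because each layer is just another dictionary on the same keys, adding a new kind of data means adding a new file, not touching the existing ones.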

The future? Standoff properties

Is this the end of the line? Is it the best we can do? Maybe not! The major limitation of the Bilara system is that we can only specify content at the level of the segment or range of segments, not within the segment or across parts of more than one segment. Currently this doesn’t affect us greatly, but it does mean, for example, that a comment or variant reading can only be attached to the segment as a whole, not to the exact portion we are commenting on. To deal with this, we must include the lemma of the variant in the variant note.
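To show how the lemma does the work that a character range would otherwise do, here is a minimal sketch that finds the lemma of a variant note inside its segment. It assumes the note has the form "lemma → reading (sources)" seen in the PO example above, and the function name `locate_variant` is hypothetical.

```python
import unicodedata


def locate_variant(segment, note):
    """Return (start, end) of the note's lemma within the segment, or None.

    Offsets refer to the NFC-normalized segment. Normalizing both sides
    avoids misses when diacritics are encoded differently.
    """
    lemma, arrow, _reading = note.partition("→")
    lemma = unicodedata.normalize("NFC", lemma.strip())
    if not arrow or not lemma:
        raise ValueError(f"note has no lemma: {note!r}")
    text = unicodedata.normalize("NFC", segment)
    start = text.find(lemma)  # first occurrence only
    if start == -1:
        return None
    return start, start + len(lemma)


segment = "yadāsvāhaṁ, āvuso, āyūhāmi tadāssu nibbuyhāmi. "
span = locate_variant(segment, "nibbuyhāmi → nivuyhāmi (s1-3, km, mr)")
```

This only finds the first match, so it becomes ambiguous when the same word occurs twice in a segment, which is one of the reasons explicit ranges are attractive.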

It gets harder when you have complex text-critical data on, for example, fragmentary manuscripts. This is the case for some of our Sanskrit texts. In these cases you need to specify exactly the range of characters to which a set of metadata applies.

The technical solution for this is called “standoff properties”. Essentially you define the range of characters and type of data in a separate file. While gaining traction, this approach still faces complex challenges. It is, however, of interest to the web community and there is a standards proposal. We’re keeping an eye on this and hope it may be applicable in the future.
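A minimal sketch of the idea, not of any particular proposal: each property records a segment, a character range, and a type of data, and lives apart from the text itself. The `Property` class and its field names are assumptions for illustration. Offsets here count Python string characters (code points), while other systems may count bytes or UTF-16 units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    segment_id: str
    start: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)
    kind: str   # type of data, e.g. "variant"
    value: str


def covered_text(segments, prop):
    """Return the exact characters a property applies to."""
    text = segments[prop.segment_id]
    if not 0 <= prop.start < prop.end <= len(text):
        raise ValueError("range falls outside the segment")
    return text[prop.start:prop.end]


segments = {"sn1.1:1.8": "yadāsvāhaṁ, āvuso, āyūhāmi tadāssu nibbuyhāmi. "}

# Compute the range for the word the variant applies to.
word = "nibbuyhāmi"
begin = segments["sn1.1:1.8"].index(word)
prop = Property(
    segment_id="sn1.1:1.8",
    start=begin,
    end=begin + len(word),
    kind="variant",
    value="nivuyhāmi (s1-3, km, mr)",
)

target = covered_text(segments, prop)
```

The catch is that any edit to the text shifts the offsets, so the properties have to be updated in step with it. That is one of the challenges alluded to above.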