SuttaCentral, the Bilara edition: metadata, a journey!

sujato · November 25, 2020, 10:17pm

These updates are a work in progress. You can check how we’re going on our staging site, but expect things to be broken. We’ll be seeking feedback at some point, but for now, we’re just trying to keep our heads down and work through the issues.

In the second of my occasional series of posts on the upcoming improvements, I want to discuss the perhaps tedious topic of metadata. Put simply, this is “data about data”, or in our case, descriptive data that identifies important context about texts.

The ancient roots of metadata

Each sutta, at least in theory, begins with a statement like the following:

So have I heard. At one time, the Buddha was staying at Sāvatthī, Jeta’s Grove, Anāthapiṇḍika’s monastery. There he addressed the mendicants.

This is the first metadata for Buddhist texts. It is an identifying label that supplies a bit of crucial information. It tells you that this is a teaching of the Buddha; where it is, (kind of) when it is, and the manner of transmission, i.e. that it was “heard”.

We can represent this metadata in JSON. Fun!

{
"speaker": "Buddha",
"location": "Sāvatthī, Jeta's Grove, Anāthapiṇḍika’s monastery",
"occasion": true,
"transmission": "oral"
}

Okay, so we’re not getting much use out of the “occasion” data, but I think you see the point. The suttas from the beginning were recorded together with metadata. This tells us that the early Buddhist community cared about this, that they wanted to have a record about the teachings that would authenticate them and help organize and place them.

The records of the Councils show that the Buddhist traditions maintained this interest in recording the metadata about the texts. Starting with the accounts of the First and Second Councils, which are added as appendices to the Vinaya, we have records of the manner in which the Buddhist community organized itself, resolved disputes, and importantly, how it maintained its texts.

The account of the First Council sets up Ven Mahākassapa as questioning Ven Upāli on Vinaya and Ven Ānanda on the suttas. It emphasizes the questioning of metadata. (Brahmali’s translation)

“Where was the Prime Net spoken?”
“At the royal rest-house at Ambalaṭṭhikā, between Rājagaha and Nāḷanda.”
“Whom is it about?”
“The wanderer Suppiya and the young brahmin Brahmadatta.”
And Mahākassapa asked Ānanda about the origin story of the Prime Net and about the person.
“Where was the Fruits of the Ascetic Life spoken?”
“In Jīvaka’s mango grove at Rājagaha.”
“Whom is it with?”
“Ajātasattu of Videha.”
And Mahākassapa asked Ānanda about the origin story of the Fruits of the Ascetic Life and about the person.
In this way he asked about the five collections.
And Ānanda was able to reply to every question.

The curious thing, and I’m not sure that this has been noted before, is that this account only speaks of metadata. We assume that the actual texts were recited, but it doesn’t actually say that. As far as the Pali record goes, the Council was purely an exercise in recording metadata. I’m not sure if this is significant at all; we’d have to check the other Vinaya accounts. But it is interesting!

As the tradition moved to manuscripts, we find metadata recorded there, too. I haven’t studied the matter deeply, and there is not much in the way of very old manuscripts, but typically we can find two kinds of metadata:

“page” numbers recording the number of leaves used.
a colophon giving details of the scribe, the sponsor, and perhaps other details.

Taking the Dutiya-Parakkamabāhu Cullavagga as an example, the colophon was ably discussed by P.E.E. Fernando, and you can read it here. The colophon is one of the pieces of evidence that confirm that this manuscript dates to the thirteenth century, making it the oldest complete Pali manuscript. SuttaCentral is currently working on making a digital edition.

Here is Fernando’s translation of the colophon:

This is the Pāli book Suluvaga that the Venerable Great Lord Medhaṅkara of the Konduruvā forest caused to be transcribed by the Grand Thera Sumedha of Beligala as a gift to the saṅgha, after collating a whole nikāya, being satisfied (with regard to its accuracy) after consultation (with competent scholars), with the patronage of King Parākramabāhu, the Sovereign of Laṅkā and the participation of fellow-monks living the holy life, such as Theras and Grand Theras, for the purpose of transcribing (providing) one book for each monk as a gift to the venerable saṅgha living in the island of Laṅkā.

As you can see, there is a fair amount of interesting detail here, and it gives us useful background for understanding the text and its production. We could attempt to transcribe and update the metadata something like this:

{
  "title": {
    "sinhala": "Suluvaga",
    "pali": "Cullavagga"
  },
  "sponsor": "King Parākramabāhu (ii)",
  "scribe": "Sumedha of Beligala",
  "manager": "Medhaṅkara of Konduruvā",
  "project": "Free distribution for monks of Lanka",
  "method": "collation of nikāya"
}

Metadata in print

The modern era introduced a new technology, print, and with it a new set of standards for metadata. I’m sure you’re all familiar with the publication page at the front of books, which lists information such as author, publisher, date, and address. Such information continues the concerns of the ancient scribes to annotate their texts with contextual data. In addition, we see the introduction of new forms of metadata, such as the idea of dedicated metadata systems like ISBN, which help facilitate a central or universal understanding of metadata.

Digital metadata

The digital arena would seem to be an excellent context in which to maintain detailed and accurate metadata. No longer must we rest content with vague statements about time such as “at one time” as in the suttas; or “a hundred years after the parinibbana” as with the Second Council; or even the year as in modern publications. We can record the exact time of a publication down to the second.

The reality has turned out to be much less optimal. True, the digital platform offers the potential for recording metadata. But at the same time, because it lowers the bar for content creation, it creates a culture that really doesn’t seem to care about keeping records. People seem to assume that somehow the machines will keep track. Sadly, unless you tell them to, they don’t.

Thus in the realm of Buddhist texts, we constantly find files that have poorly specified metadata, if they have any at all. There are multiple editions, even of a huge work like the Pali canon, where we can hardly identify who was responsible or what the scope of the project was. In other words, we frequently have worse metadata on digital texts than we did on manuscripts.

When we started bringing texts onto SuttaCentral, having no experience in the field, and faced with a wide range of divergent cases, I decided that we would not attempt to maintain machine-readable metadata, but would simply include an unstructured metadata statement with every text. That would include all the relevant information. The information was typically derived from the source files, and thus is as complete and accurate as they are. In some cases we were able to enhance the metadata in a more descriptive way. But still, it is only useful for human readers, and is far from systematic and complete.

Bilara metadata: `_publication.json`

We have been developing our own translation engine for new translations, and this gives us the chance to make our own metadata specification.

Since our texts live on GitHub, we automatically leverage the impressive record-keeping of git. Each edit records the author, the time, and the exact details of the edit. This data is kept (theoretically) in perpetuity and is available for anyone to review. Here is an example of a recent edit done by Sabbamitta to her German translation.

However, the problem now is too much information. We need a convenient form to record the essential information that will be useful for readers, libraries, and so on.

To do this, we create a file called `_publication.json`, which lives in our Bilara repo. This contains metadata about all translations in that repo. The file is a simple JSON structure with relevant fields for author, publication, and so on. Here is a typical example, with annotations.

  "scpub2": {
    "publication_number": "scpub2", /* simple increment for tracking our internal publications */
    "root_lang_iso": "pli", /* records the language of the texts on which the translation is based */
    "root_lang_name": "Pali",
    "translation_lang_iso": "en", /* The target language */
    "translation_lang_name": "English",
    "source_url": "https://github.com/suttacentral/bilara-data/tree/master/translation/en/sujato/sutta/dn", /* This gives a canonical and stable reference for the source files. It unambiguously identifies the content of the publication. */
    "author_uid": "sujato",
    "author_name": "Bhikkhu Sujato",
    "author_github_handle": "sujato",
    "text_uid": "dn", /* UID as used on SuttaCentral */
    "translation_title": "Long Discourses",
    "translation_subtitle": "A translation of the Dīgha Nikāya",
    "root_title": "Dīgha Nikāya",
    "translation_process": "Primary source was the digital Mahāsaṅgīti edition of the Pali Tipiṭaka. Translated from the Pali, with reference to several English translations, especially those of Bhikkhu Bodhi.", /* this is an unstructured field whose purpose is to record the way the translator did their work. */
    "translation_description": "This translation was part of a project to translate the four Pali Nikāyas with the following aims: plain, approachable English; consistent terminology; accurate rendition of the Pali; free of copyright. It was made during 2016–2018 while Bhikkhu Sujato was staying in Qimei, Taiwan.", /* An unstructured field to record the general information about the translation. */
    "is_published": "true",  /* This field determines whether the text is published or not. Once a text is published it is pushed to the `published` branch and made available for apps, including SuttaCentral. */
    "publication_status": "Completed, revision is ongoing.", /* An informal field to clarify the ongoing status */
    "license": { /* the license is specified unambiguously */
      "license_type": "Creative Commons Zero", 
      "license_abbreviation": "CC0",
      "license_url": "https://creativecommons.org/publicdomain/zero/1.0/",
      "license_statement": "This translation is an expression of an ancient spiritual text that has been passed down by the Buddhist tradition for the benefit of all sentient beings. It is dedicated to the public domain via Creative Commons Zero (CC0). You are encouraged to copy, reproduce, adapt, alter, or otherwise make use of this translation. The translator respectfully requests that any use be in accordance with the values and principles of the Buddhist community." /* Informal field to describe the license and the author's wishes. */
    },
    "edition": [ /* This records the manner of publication, for example as website, ebook, print, etc. */
      {
        "edition_number": "1",
        "publication_date": "2018",
        "publisher": "SuttaCentral",
        "publication_type": "website",
        "edition_url": "https://suttacentral.net/dn"
      }
    ]
  },

As you can see, this gives a far more detailed and satisfactory account of the metadata.
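For anyone who wants to consume this file in a script, here is a minimal sketch in Python of reading one entry. It rests on a few assumptions: that the real `_publication.json` is plain JSON without the `/* ... */` annotations shown above (standard JSON parsers reject comments), and that the file is a single object keyed by publication number, as the `"scpub2"` key suggests. The helper name `summarize_publication` and the abbreviated sample are purely illustrative and are not part of Bilara.

```python
import json

# Abbreviated sample in the shape shown above, with the annotations removed.
# Assumption: the real file is one JSON object keyed by publication number.
SAMPLE = """
{
  "scpub2": {
    "publication_number": "scpub2",
    "root_lang_iso": "pli",
    "translation_lang_iso": "en",
    "author_name": "Bhikkhu Sujato",
    "text_uid": "dn",
    "translation_title": "Long Discourses",
    "is_published": "true",
    "license": {"license_abbreviation": "CC0"},
    "edition": [
      {"edition_number": "1", "publication_date": "2018", "publication_type": "website"}
    ]
  }
}
"""


def summarize_publication(entry):
    """Return a one-line, human-readable summary of one publication entry.

    Hypothetical helper for illustration only.
    """
    # In the example, "is_published" is the string "true", not a JSON boolean.
    status = "published" if entry.get("is_published") == "true" else "unpublished"
    editions = entry.get("edition", [])
    dates = ", ".join(e["publication_date"] for e in editions if e.get("publication_date"))
    return (
        f'{entry["publication_number"]}: {entry["translation_title"]} '
        f'by {entry["author_name"]} ({entry["root_lang_iso"]} > {entry["translation_lang_iso"]}), '
        f'{entry["license"]["license_abbreviation"]}, {status}, editions: {dates or "none"}'
    )


publications = json.loads(SAMPLE)
for entry in publications.values():
    print(summarize_publication(entry))
```

Because `edition` is a list, the sketch joins the dates of all editions rather than assuming there is only one.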

In some cases, long-standing metadata forms have dropped away. For example, we do not specify a “place”, as that is hard to determine. Is it the location where the translator works? (but they may work in different places). Is it where the servers live? Where the organization is registered? The reality is that in the digital sphere, with a highly distributed corpus like ours, physical location is almost meaningless. Still, if there is a case for it, we can add it. And likewise, we can add and expand these fields as we go.

The publication data lives in the same repo as our texts, so anyone who accesses the texts has this information. It should be made available to readers in a suitable way. The plan for the main SuttaCentral website is here:

github.com/suttacentral/suttacentral

Add publication info to segmented texts

opened 06:21AM - 29 Oct 20 UTC

closed 10:53PM - 11 Mar 21 UTC

sujato

Legacy texts have the licensing and other info hard-coded in the file, eg.

```
<footer>
  <p>Translated by Venerable <span class='author'>Dhammarakkhita</span>.</p>
  <p>Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.</p>
  <p>Prepared for SuttaCentral by Bhikkhu <span class='editor'>Sujato</span>.</p>
</footer>
```

This is then displayed via the "info" button as a dialog (this may change, but anyway there is some means of displaying it). We need a way of displaying similar info for the segmented texts.

## Data source

The relevant information is in `bilara-data/_publication.json`. Here I give an HTML structure.

- json keys as class names.
- `{{data}}` is indicated like this
- In the case of `edition` and `author` there may be more than one, so check this.
- If a field is empty in `_publication.json` then omit it.
- Include rdfa metadata

Just put the following on the page, insert the `{{data}}` from `_publication.json`, and we're good to go!

```
<footer>
  <h2>About this text</h2>
  <section class='text_metadata' xmlns:dc='http://purl.org/dc/elements/1.1/' about='{{source_url}}'>
    <dl class='main_details'>
      <dt class='translation_title'>Translation title</dt>
      <dd class='translation_title' property='dc:title'>{{translation_title}}</dd>
      <dt class='translation_subtitle'>Translation subtitle</dt>
      <dd class='translation_subtitle' property='dc:title'>{{translation_subtitle}}</dd>
      <dt class='translation_language'>Translation language</dt>
      <dd class='translation_language' property='dc:language'>{{translation_language}}</dd>
      <dt class='root_title'>Root title</dt>
      <dd class='root_title' property='dc:title'>{{root_title}}</dd>
      <dt class='root_language'>Root language</dt>
      <dd class='root_language'>{{root_language}}</dd>
      <dt class='author_name'>Translator</dt>
      <dd class='author_name' property='dc:creator'>{{author_name}}</dd>
    </dl>
    <dl class='descriptive_details'>
      <dt class='translation_description'>Translation description</dt>
      <dd class='translation_description' property='dc:description'>{{translation_description}}</dd>
      <dt class='translation_process'>Translation process</dt>
      <dd class='translation_process' property='dc:description'>{{translation_process}}</dd>
    </dl>
    <dl class='metadata_details'>
      <dt class='text_uid'>Text identifier (UID)</dt>
      <dd class='text_uid' property='dc:identifier'>{{text_uid}}</dd>
      <dt class='source_url'>Source</dt>
      <dd class='source_url'><a href='{{source_url}}' target='_blank'>{{source_url}}</a></dd>
      <dt class='publication_status'>Publication status</dt>
      <dd class='publication_status'>{{publication_status}}</dd>
      <dt class='publication_number'>SuttaCentral publication number</dt>
      <dd class='publication_number' property='dc:identifier'>{{publication_number}}</dd>
    </dl>
    <dl class='edition'>
      <dt class='edition_number'>Edition</dt>
      <dd class='edition_number'>{{edition_number}}</dd>
      <dt class='publication_date'>Publication date</dt>
      <dd class='publication_date' property='dc:date'>{{publication_date}}</dd>
      <dt class='publisher'>Publisher</dt>
      <dd class='publisher' property='dc:publisher'>{{publisher}}</dd>
      <dt class='edition_url'>URL</dt>
      <dd class='edition_url'>{{edition_url}}</dd>
      <dt class='publication_type'>Publication type</dt>
      <dd class='publication_type' property='dc:format'>{{publication_type}}</dd>
      <dt class='number_of_volumes'>Number of volumes</dt>
      <dd class='number_of_volumes'>{{number_of_volumes}}</dd>
    </dl>
  </section>
  <section class='license'>
    <p class='license_type' xmlns:dc='http://purl.org/dc/elements/1.1/' property='dc:rights'>{{license_type}} <span class='license_abbreviation'>{{license_abbreviation}}</span></p>
    <p class='creative_commons' xmlns:dct='http://purl.org/dc/terms/' xmlns:vcard='http://www.w3.org/2001/vcard-rdf/3.0#'>
      <a rel='license' href='{{license_url}}'>
        <img src='http://i.creativecommons.org/p/zero/1.0/88x31.png' style='border-style: none;' alt='CC0' />
      </a>
      <br>
      To the extent possible under law, <a rel='dct:publisher' href='https://suttacentral.net/'> <span property='dct:title'>{{author_name}}</span></a> has waived all copyright and related or neighboring rights to <span property='dct:title'>{{translation_title}}</span>. This work is published from: <span property='vcard:Country' datatype='dct:ISO3166' content='AU' about='https://suttacentral.net/licensing'> Australia</span>.
    </p>
    <p class='license_statement'>{{license_statement}}</p>
  </section>
</footer>
```

This will be part of the rollout of the Bilara texts.
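To make the "omit empty fields" rule from the issue concrete, here is a rough sketch in Python of how one `<dl>` block could be filled from a publication entry. This is not how the SuttaCentral site actually does it; the function name `render_dl` and the field list are hypothetical, and the RDFa `property` attributes are left out to keep the example short.

```python
from html import escape

# (key, label) pairs taken from the "main_details" list in the issue above.
MAIN_DETAILS = [
    ("translation_title", "Translation title"),
    ("translation_subtitle", "Translation subtitle"),
    ("root_title", "Root title"),
    ("author_name", "Translator"),
]


def render_dl(css_class, fields, data):
    """Render a <dl> from (key, label) pairs, skipping missing or empty keys.

    Hypothetical helper for illustration only.
    """
    rows = []
    for key, label in fields:
        value = data.get(key)
        if not value:  # rule from the issue: omit empty fields
            continue
        # JSON keys double as class names; values are escaped for safe HTML.
        rows.append(f"<dt class='{key}'>{escape(label)}</dt>")
        rows.append(f"<dd class='{key}'>{escape(str(value))}</dd>")
    return f"<dl class='{css_class}'>\n" + "\n".join(rows) + "\n</dl>"


entry = {
    "translation_title": "Long Discourses",
    "translation_subtitle": "",  # empty, so it is left out of the output
    "root_title": "Dīgha Nikāya",
    "author_name": "Bhikkhu Sujato",
}
print(render_dl("main_details", MAIN_DETAILS, entry))
```

Since `edition` is a list, the same function could be called once per item to produce one `<dl class='edition'>` for each edition.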

This is our first venture in this area, and there is much room for improvement.

Use RDFa standards
Automate metadata creation as much as possible
Build UI for user input of metadata (currently we just edit the JSON file by hand.)

We’ll get there!

Vimala · November 27, 2020, 11:11am

Maybe not entirely (un)related but as I am currently doing a lot of research, one feature I found on various sites has proved very helpful: the CITE button, which gives me a choice of different ways to cite a text, including BibTeX. It saves me from having to search for lots of info to create a BibTeX entry. Would something like that be doable?
