Description of markup in Methodology static page

sujato · December 7, 2017, 2:36am

Following the review of the text markup that Ayya Vimala and I are doing for SC Next, it would be good to give some more explanation for this on the main site.

I propose that we include a short section in the “Methodology” static page explaining the basic approach. This can include links to the /zz pages where the markup is maintained in detail.

Here is a first draft.

Markup

SuttaCentral texts are provided with a clear, accurate, and detailed markup using valid, semantic, standards-compliant HTML5. One of our major tasks has been to transform the many texts we have inherited from diverse sources and put them all in the same format. This is exceedingly complex, as the source texts rarely have a clear semantic markup of any sort, but range from unmarked plain text to verbose messes produced by word processors. We aim to render all this complexity in clear and simple markup, without losing detail.

Some sister projects to SuttaCentral—notably CBETA—use the academic Text Encoding Initiative (TEI) XML standard for markup. While we appreciate the advantages of an XML approach, we have found that modern HTML allows us to do everything we need, and is substantially easier to work with and deploy. Our source texts can be displayed natively in a browser without any preprocessing. Nevertheless, while we don’t use TEI, we do adopt many of the TEI names and semantics for text-critical markup.

Here we outline some of the less obvious details.

References

Our text files include complex reference data. Currently we list more than a hundred different reference sources. There is no centralized, detailed, and uniform reference system for Buddhist texts, so we have done the best we could. The data is encoded as empty tags, and may be displayed optionally via JavaScript.

See list of references here.
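
To make the "empty tags" idea concrete, here is a minimal sketch of how such reference data might be read back out of a text file. It is an illustration only: the choice of an `a` element, the `ref` class, the scheme names, and the ids in the sample are assumptions made for this example, not the actual SuttaCentral markup.

```python
from html.parser import HTMLParser

# Hypothetical sample: the element, class names, and ids are illustrative only.
SAMPLE_REFS = (
    '<p><a class="ref pts" id="pts1.1"></a>'
    '<a class="ref sc" id="sc1"></a>Evaṃ me sutaṃ.</p>'
)


class RefCollector(HTMLParser):
    """Collect (scheme, id) pairs from empty reference anchors."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        # Assumption: a reference is an <a> carrying the class "ref"
        # plus one further class naming the reference scheme.
        if tag == "a" and "ref" in classes:
            scheme = next((c for c in classes if c != "ref"), None)
            self.refs.append((scheme, attr.get("id")))


def collect_refs(html):
    """Return every reference found in the given HTML fragment."""
    parser = RefCollector()
    parser.feed(html)
    parser.close()
    return parser.refs


print(collect_refs(SAMPLE_REFS))
```

Because the tags are empty, they add nothing to the visible text unless a script or stylesheet chooses to show them.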

Text-critical Information

Our original-language texts include a range of text-critical information of interest to scholars. We try to present this in a way that will not be intrusive to readers, but is informative when researching. Traditional digital files often use print-based conventions such as [brackets] to indicate such matter. But this is ugly, interrupting the reading flow. Worse, it’s uninformative. It is often hard to find out exactly what these various marks mean; in fact we have sometimes had to resort to educated guesswork.

Text-critical information is marked in the files with classes such as “var” for variant readings, “supplied” for supplied text, and so on. Typically we indicate this visually with styles, and display a tooltip explaining the meaning. Most of this display is optional and can be activated by the reader.

See list of text-critical classes here.
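
As a rough sketch of how class-based text-critical markup can drive tooltips, the following pairs each marked passage with an explanation. The class names "var" and "supplied" are the ones mentioned above; the use of `span` elements, the tooltip wording, and the sample sentence are assumptions for illustration.

```python
from html.parser import HTMLParser

# Class names come from the description above; the tooltip wording is assumed.
TOOLTIPS = {
    "var": "Variant reading",
    "supplied": "Supplied text",
}

# Hypothetical sample fragment, not taken from an actual text file.
SAMPLE_CRITICAL = (
    '<p>ekaṃ <span class="var">samayaṃ</span> bhagavā '
    '<span class="supplied">sāvatthiyaṃ</span> viharati</p>'
)


class CriticalMarkup(HTMLParser):
    """Collect (class, marked text, tooltip) for text-critical spans."""

    def __init__(self):
        super().__init__()
        self.open_spans = []  # one [class-or-None, text] entry per open span
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        kind = next((c for c in classes if c in TOOLTIPS), None)
        self.open_spans.append([kind, ""])

    def handle_data(self, data):
        # Text belongs to every span that is currently open.
        for entry in self.open_spans:
            entry[1] += data

    def handle_endtag(self, tag):
        if tag != "span" or not self.open_spans:
            return
        kind, text = self.open_spans.pop()
        if kind is not None:
            self.found.append((kind, text, TOOLTIPS[kind]))


def find_critical(html):
    """Return the text-critical passages found in the given HTML fragment."""
    parser = CriticalMarkup()
    parser.feed(html)
    parser.close()
    return parser.found


for kind, text, tip in find_critical(SAMPLE_CRITICAL):
    print(f"{text!r}: {tip}")
```

The point of the design is that the meaning lives in a named class rather than in a bracket convention, so it can be styled, explained, or hidden without touching the text.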

Structural and other

We use around a hundred different classes to indicate the detailed structure of texts. These include such things as:

Headings, properly structured using hX tags. Pali texts have such a detailed hierarchy that in some cases we use every level down to h6.
Uddānas, which summarize vaggas and similar structural elements.
Various kinds of numbers, such as rule numbers.
Verses.
End of sutta or end of section titles. Buddhist texts usually give the title at the end of a section, rather than in a heading at the start as is the modern practice.

See list of structural and other classes here.
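
One practical benefit of properly nested hX headings is that the structure of a text can be recovered mechanically. The sketch below builds a simple outline from heading tags; the sample hierarchy is an assumption for illustration and does not reproduce the heading levels used in any actual file.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Hypothetical sample: the assignment of titles to levels is illustrative only.
SAMPLE_HEADINGS = (
    "<h1>Dīgha Nikāya</h1>"
    "<h2>Sīlakkhandhavagga</h2>"
    "<h3>Brahmajālasutta</h3>"
    "<p>Evaṃ me sutaṃ.</p>"
)


class OutlineBuilder(HTMLParser):
    """Collect (level, title) pairs from h1 to h6 headings."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def build_outline(html):
    """Return the heading outline of the given HTML fragment."""
    parser = OutlineBuilder()
    parser.feed(html)
    parser.close()
    return parser.outline


for level, title in build_outline(SAMPLE_HEADINGS):
    print("  " * (level - 1) + title)
```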