Markup redux



In our legacy texts, we supplied a simple and consistent HTML structure, with the idea that the text can be pretty much plonked into a web page and work as-is. With the shift to JSON-based texts we have, so far, not deeply considered how this might be taken advantage of. This obviously has implications for our legacy texts as well as our segmented ones.

This is not concerned with the internal markup of texts, which is fine. I am thinking about how best to represent texts as a whole, and sections of texts.

We made a number of compromises and assumptions in working in a web-based context. Moving towards a headless future, we should consider creating a more powerful and flexible representation of content, which can be processed for:

  • web pages
  • ebooks
  • PDF via LaTeX
  • voice

And so on.

Since the basic idea of SC is the sutta parallel we have always organized our content with the sutta as the primary unit. In our navigation, we therefore have a complex set of overlapping hierarchies, as the sutta parallels intersect with the structures of the texts themselves. For example, sometimes a sutta will parallel a part of another sutta, or a part of one sutta will parallel a range of other suttas.

In some cases the idea of a sutta is inadequate:

  • In some series of short texts where the individual text is only nominally extant (see an1 and an2).
  • Collections of verses, eg. Dhammapada.
  • In vinaya, the relation between the patimokkha and the vibhanga.
  • Certain long texts are included as division-length texts, not divided into suttas.

A simple case

So my first thought is this: let us eliminate “wrapping” from text markup.

Currently we have:

<section class="sutta" id="mn1"><article>

But whether something is a section or article is a property of a web page, not of the text itself. The class of “sutta” is universal to all texts, so there is no need for it. The id can be inferred from the file name or the segment numbers, so there is no need for it either.

So this markup need not be part of the text itself, it can be added as part of the pre-processing if needed.
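As a sketch of what that pre-processing might look like, here is a minimal Python function. It assumes the UID is simply the file stem; the function name and file layout are my illustration, not existing SC code:

```python
from pathlib import Path

def wrap_text(path: str, body_html: str) -> str:
    # Infer the UID from the file name (e.g. "mn1.html" -> "mn1") and
    # add the wrapper that was previously baked into the text itself.
    uid = Path(path).stem
    return f'<section class="sutta" id="{uid}"><article>{body_html}</article></section>'

print(wrap_text("mn1.html", "<header>…</header>"))
```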

Thus the normal situation would be that our JSON markup files, or indeed the rejigged legacy text files, would simply start with <header>. (However, see below for an alternative suggestion.)

But what about …

A distinct advantage of the existing system is that considerations about the texts can be reflected on by humans and baked into the files. A disadvantage is that humans are dumb and make bad decisions. Or at least, we make decisions that are valid within a certain set of concerns.

Consider a case such as DN 16. Though it is the longest sutta, we nonetheless present the whole thing as one HTML file. This is costly, and creates issues on mobiles, for example when using text controls. But we accept that cost for the sake of our own sanity. It is also easy to see that different uses will want to handle it differently, for example by paginating or extracting.

Similarly, consider the Dhammapada and related texts. There are in fact at least four possible display options:

  • the whole text
  • by vagga
  • by verse
  • by story (or thematic verses, eg the first two verses of Pali dhp)

One can even imagine more baroque scenarios: mixing root text and translation, or mixing verses from different versions. There’s no right or wrong way, just more or less suitable in different media.

So basically my proposal is that such decisions should not be made at the level of text markup. Markup should only indicate things that are intrinsic to the texts, not how they are organized.

How then should we proceed? Let us see!

Basic principles

  1. Structural info is in JSON.
  2. Flexibly create HTML or other markup structure from JSON.
  3. Implement well-defined standards to handle all cases found on SC.

1. Structural info is in JSON

Rather than inferring structure from ambiguous HTML, explicitly define structural elements in JSON. In fact we already have much of this in place.

Let us say we want to display mn1. We can look it up in sutta.json, where essential information is found. Note however that sutta.json does not identify this explicitly as a “text”; rather it is inferred from the file name.

A more complex case: if we look up an1.1-10 in sutta.json, it does not exist. Instead each sutta is listed independently, and nothing corresponding to the vagga is found. The data for an1.1-10 is contained in an.json, where it occurs only as the final element of the “path”. This seems unfortunate: should not such information be marked more unambiguously? Here, nikayas are labelled as “division”, in-between structures as “div”, and texts as “text”. Again, perhaps we would be better off simply having “node” and “text”. A text corresponds to a single sutta, whereas a node is a point in the hierarchy that includes a defined set of texts (and possibly other nodes).

Unambiguously define UIDs in JSON, indicating type of text or node, and locating them in the hierarchy.

Note that the text/node distinction is the same as that used for the main navigation on SC: a node is anything that appears in the sidebar, a text is anything that has a suttaplex.
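To make the distinction concrete, a UID table entry might look something like the following. This is a sketch only; the field names “type” and “children” are my assumptions, not the actual sutta.json schema:

```python
import json

# Hypothetical uid table marking each UID explicitly as "node" or "text".
uid_table = json.loads("""
{
  "mn1": {"type": "text"},
  "an1.1-10": {
    "type": "node",
    "children": ["an1.1", "an1.2", "an1.3", "an1.4", "an1.5",
                 "an1.6", "an1.7", "an1.8", "an1.9", "an1.10"]
  }
}
""")

print(uid_table["an1.1-10"]["type"])
```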

A more serious issue is that the segment IDs in an1.1-10.json are based on the vagga range. Thus we start with an1.1-10:0.1, the last segment of the first sutta is an1.1-10:1.2.3, the first segment of the second sutta is an1.1-10:1.2.4, and the second is an1.1-10:2.1.1. This is all predicated on the assumption that the vagga is primary. It conflicts with the UIDs as given in sutta.json, which assume each sutta is an1.1, an1.2, and so on. Surely the latter is the better approach.

Thus for the end of the first and start of the second sutta we have:

  "an1.1-10:1.2.2": "Itthirūpaṃ, bhikkhave, purisassa cittaṃ pariyādāya tiṭṭhatī”ti.",
  "an1.1-10:1.2.3": "Paṭhamaṃ.",
  "an1.1-10:1.2.4": "2",
  "an1.1-10:2.1.1": "“Nāhaṃ

And we should change to:

  "an1.1:2.2": "Itthirūpaṃ, bhikkhave, purisassa cittaṃ pariyādāya tiṭṭhatī”ti.",
  "an1.1:2.3": "Paṭhamaṃ.",
  "an1.2:0.1": "2",
  "an1.2:1.1": "“Nāhaṃ

The part before the colon in the segment ID must always equal the text UID.

Now, if that is so, then the fact that a given set of texts is included within the same file is purely arbitrary. We move away from one file = one sutta = one text UID; instead we simply have one text UID = one sutta.
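In code, recovering the text UID from a segment ID is then a one-liner, and querying segments is a simple prefix match. Note the prefix must include the colon, so that an1.1 does not also match an1.10. A sketch, with illustrative names and data:

```python
def text_uid(segment_id: str) -> str:
    # The text UID is everything before the colon.
    return segment_id.split(":", 1)[0]

def segments_for(uid: str, segments: dict) -> dict:
    # Match on "uid:" rather than "uid", so "an1.1" does not also catch "an1.10".
    prefix = uid + ":"
    return {k: v for k, v in segments.items() if k.startswith(prefix)}

segs = {"an1.1:2.3": "Paṭhamaṃ.", "an1.10:1.1": "…", "an1.2:0.1": "2"}
print(segments_for("an1.1", segs))
```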

A corollary of this is that hyphenated UIDs (ranges) can only be used for texts when the text cannot be further defined, usually when it is so abbreviated that the actual text content is missing. For example, the peyyala suttas of AN and SN such as AN 4.304-783.

We could in principle put the whole of AN into one JSON file and it would make no difference. We would simply query the string an1.1: and return all matching segments. They can then be wrapped in whatever HTML (or other markup) is appropriate.

Likewise we could query an1.1-10. Our JSON UID file would tell us this is a node, it contains ten children, the UIDs of the children are an1.1 … an1.10. We could then query the segments for the child texts, return the text, and wrap them in HTML.

Then the flexibility of returning text would be entirely a question of the app requirements, the data makes no such assumptions.

Present the Dhammapada per verse? No worries. Per vagga? Sure. Whole thing? Why not! All that needs to be done is define an appropriate HTML template for each scenario.

2. Flexibly create HTML or other markup structure from JSON

To me it makes sense to maintain a consistent distinction between node and text. Let us assume that that is also represented in the HTML, with the following convention:

  • text = <article>
  • node = <section>

An “article” in HTML represents a single coherent work, while a “section” is a part of a larger whole, which does not have a more precise name.

So in the simple case, we would wrap a sutta thus:

<article id="mn1">

For a vagga-sutta:

<section id="an1.1-10">
    <article id="an1.1">
    <article id="an1.10">

If we were to present, say, the Dhammapada as a full text, we could nest sections:

<section id="dhp">
    <section id="dhp_vagga1">
        <header>Yamaka Vagga</header>
        <article id="dhp1">

In any case, the article is always the most fundamental level, and is never nested.

Remembering that the text UID is simply that which comes before the colon in the segment id, we can construct the texts like this.

When requesting a text:

  • Pull all segments that match that UID.
  • Wrap in <article id="text-UID"></article>

When requesting a node:

  • Determine the text-UIDs of the texts and possibly nested nodes belonging to that node.
  • Pull all segments that match the text-UIDs.
  • Wrap the whole thing in <section id="node-UID"></section>
  • Wrap each text in <article id="text-UID"></article>

How do we determine where to insert <article>? The most obvious way would be whenever the text-UID increments. But this is problematic, since it means a fundamental piece of information can only be established by comparing it to something else. Better to hard-code it, but how?

  • Go back to one sutta = one file. Okay, but when extracted into a database this needs to be represented somehow anyway.
  • Put <article></article> in the markup file? Okay, but this is HTML-dependent, not explicit.
  • Define this data explicitly in JSON: {"uid": "mn1", "start": "mn1:0.1", "end": "mn1:172-194.32"}

What about the case of long texts such as DN 16? This would need some more attention, but let us assume we can define parts of the text based on, say, the presence of <h2>, then wrap each part in a <section>. Thus we can nest sections in articles and articles in sections, but not articles in articles. Each section can then be presented as an individual web page, etc., if desired. One issue with this is that we have to see whether the segment IDs neatly map onto the heading structure; it is not obvious that they do.
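As a rough illustration of the <h2>-splitting idea (assuming well-formed, top-level <h2> tags, which would need checking against the real markup; in practice an HTML parser would be safer than a regex):

```python
import re

def split_on_h2(article_body: str) -> str:
    # Split at each <h2>, keep any preamble as-is, and wrap each
    # chunk beginning with a heading in a <section>.
    parts = re.split(r"(?=<h2>)", article_body)
    preamble, chunks = parts[0], parts[1:]
    return preamble + "".join(f"<section>{c}</section>" for c in chunks)

print(split_on_h2("<p>intro</p><h2>First</h2><p>…</p><h2>Second</h2><p>…</p>"))
```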

3. Implement well-defined standards to handle all cases found on SC

The normal way we handle edge cases is to build to the simple case (one sutta = one file = one UID) and then act surprised when other cases don’t work and scramble to fix them. Maybe there’s a better way!

Here are the various different cases.

  • Simple: A division contains several texts; one text = one UID = one file (currently). This is the default.
  • Concatenated texts: several texts presented as one. AN 1, AN 2, Dhammapadas. See here for full list.
  • Undivided texts: spp, lzh-mi-vs, lzh-mi-bu-kv, lzh-dg-vs, lzh-dg-ve, lzh-dg-bu-kv, lzh-dg-bu-kv-2, lzh-dg-bi-kv, lzh-sarv-mt, lzh-sarv-bu-kv, lzh-sarv-bu-kv-2, san-sarv-sm-div, san-sarv-bu-kv, san-sarv-bi-kv, lzh-mu-khv, lzh-mu-bu-kv, san-mu-bu-kv, xct-mu-khv, xct-mu-utg, xct-mu-bu-kv, sammitiya-sastra, pariprccha, om
  • Undivided plus: text is represented in suttaplexes as both a whole division and as parts of that division. This applies to the patimokkhas and san-mu-mpt-bu-pm-div. However, it is poorly implemented: the text applies to the division only, while parallels are in the parts only. Also, it does not work for partial patimokkhas: san-bu-pm-qizil, pgd-pm-bf13

We should specify a way of handling the variations. Not only will this enable us to better control what we are doing now, it will let us envisage and generalize different presentations for different use cases, for example:

  • Any long text can be subdivided based on the presence of <h2> tags.
  • Any texts may be concatenated at any node level.
  • The user may control how the text is presented.
  • Make explicit APIs for clients to consume as they wish.

Fixing HTML on legacy texts

Thanks for all this.

It’s great to go over all this. I’m just wondering if this has been written with reference to the draft reforms from earlier in the year:

The idea had been to amend the legacy texts in line with the template given in the wiki by script (having made sure any necessary adjustments to dependent code had been addressed). Does the above now supersede anything previously determined?

Though I haven’t really dealt with them, I think this is all good with respect to the JSONs, as far as the info file that goes along with them. In terms of the legacy texts, though, additional rejigging would be needed to include the author, which is currently given in the <head> (likewise the language attribute is currently given in the opening <div>, which was due to be moved to the <article> now being eyed up for the chop?).

Hooraaah! :heart_eyes:

Are there any implications of this for the legacy range files?

While I understand the will to make neat, consistent distinctions between text types corresponding to nodes and such, from an HTML POV, taking the example of an1.1-10, vagga-suttas would more naturally fit the <article> element: they form a coherent whole that could, in principle, be syndicated. Further, it is perfectly semantically correct to nest <article>s.

<section> elements are for something “which doesn’t have a more specific semantic element to represent it”. I haven’t read them extensively enough to know how far this holds, but from a cursory look vagga-suttas seem to be meaningfully grouped texts that revolve around a common refrain, or explore a single detail from several aspects which, combined, give a complete examination of that detail and stand as equivalent to other suttas that systematically analyse a given point (i.e. an <article>).


Are there any other text types in the non-Pali texts?

Fixing HTML on legacy texts

No, just what I have in front of me. So as per the other thread also, I have forgotten more than I know!

Okay, fair enough, it seems reasonable to include these details.

Probably. Best to consider how to treat the JSON first, then we can work out how it applies to the HTML files.

But they don’t, not all of them. In fact there are several cases where the natural grouping of texts crosses over the vagga boundaries (see my guide to AN for examples). In other cases, as you say, they correspond closely to a coherent whole. In such cases they may in fact be a single sutta split into five or ten, and indeed some of our parallels are defined as such. But that can’t be assumed.

Maybe! We’ll have to check; the main case is the so-called “division-length text” which may, however, be handled the same as the Dhammapada, so is not a separate type.


Yeah, absolutely. I was already uncertain as to how far the point held; I don’t know the canon well enough. I was just speaking to those cases where I’ve seen it, which is at least enough to know that whatever rule is applied, the truth is messier. :smile:


This is so above my paygrade that I must necessarily move tangentially in a most peculiar direction. :eyes:

For Voice, if it looks like a duck, it is a duck. And by duck, I mean spoken segment. Here is a duck:

{
  "api": "aws-polly",
  "apiVersion": "v4",
  "audioFormat": "mp3",
  "voice": "Amy",
  "prosody": {
    "rate": "-30%",
    "pitch": "-0%"
  },
  "language": "en-GB",
  "text": "<prosody rate=\"-30%\" pitch=\"-0%\">who fears nothing from any quarter,",
  "guid": "02a4eebe5fc6bc54cf27e86c2939d6e2",
  "volume": "kn_en_sujato_amy"
}

The duck’s name is 02a4eebe5fc6bc54cf27e86c2939d6e2. Each duck names itself. Ducks are shared: there is no sutta reference. There is no segment reference. Voice searches for ducks.

Everything presented by Voice is a line of ducks with the occasional pause between duck groups. We are working on MichaelH’s family of Bhante Sujato ducks for v1.5. Other than aggregates of ducks and the element of space, we have nothing to offer.