A more meaningful markup for segmented texts

sujato · September 9, 2019, 12:56am

Background

With the launch of the updated site a couple of years ago we introduced the concept of “segmented texts”, i.e. texts that are matched root to translation segment by segment. A segment is a meaningful chunk of text, such as a sentence or a line of verse.

When building the site, I proposed a range of views and functions to take advantage of this, and our diligent dev team was able to mostly implement them. These include:

Swapping between translation-only, line by line, side by side, and translation as tooltip.
Showing/hiding reference and other details per context.

All this is in addition to other functionality such as script-changing and dictionary lookup. All told, it means the text pages are quite complex, running a lot of JavaScript. That’s okay, but if we can reduce the complexity it would be great.

How it works

You can see the basic idea in the browser console tool.

Open it up with just translated text, and you have custom elements for each segment, inside the normal HTML markup for paragraphs and the like. Let us look at SN 1.1 as an example.

<sc-seg id="sn1.1:1.4" class="translated-text" lang="en">
<a class="textual-info-paragraph segmented-textual-info-paragraph textual-info-paragraph-inline sc" id="1.4_inline" href="#1.4" title="SuttaCentral segment number">1.4</a>
“Good sir, how did you cross the flood?”  
</sc-seg>

This contains three things:

The segment markup in <sc-seg>, with class translated-text.
Reference data in <a>, along with a lot of CSS classes! These are there to deal with the different views.
Text content.

Now, open up the tools and select “line by line”. This inserts a second set of elements, with class original-text. (Each word is wrapped in a span so that the lookup can work.)

Note that while it would be simpler to just load both sets of text at once, we do not do this. The reason is that most users will not use the Pali text, so it is inefficient to supply all users with double the amount of text (and HTML). Pali text is actually added to or removed from the DOM at the user’s request, not merely hidden from view. This will remain the same.

When switching between side by side, line by line, and popup views, other changes are made to the styling and DOM.

What’s wrong with how it is now?

The current implementation works, but there are a number of issues.

There is no meaningful representation of the “segment”. The two text segments simply appear one after the other, with the same ID (!). But there is no relation between them per se.
References are not elegantly handled. Some references appear on the translation segment, others outside the segment on the paragraph markup. This is all simply due to display convenience; it’s not wrong, but it doesn’t really represent the situation: references should apply to the segment.
There are various bugs and infelicities: line breaks in verses are inelegant, sometimes references are not aligned, etc. Tooltip text is broken in Firefox.
Putting references outside the flow relies on an old-fashioned CSS hack (float: right; width: xyz; margin-right: -xyz). There are better ways.
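
To make the duplicate ID problem concrete, here is a minimal Python sketch that scans a fragment of markup and reports any id used more than once. The sample markup is a simplified stand-in for the line-by-line view, not the actual page output, and the names IdCollector and duplicate_ids are just for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute seen in a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def duplicate_ids(html):
    """Return the sorted list of id values that occur more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(collector.ids)
    return sorted(value for value, n in counts.items() if n > 1)


# Simplified stand-in: translation and root segments share one id.
current = (
    '<sc-seg id="sn1.1:1.4" class="translated-text" lang="en">…</sc-seg>'
    '<sc-seg id="sn1.1:1.4" class="original-text">…</sc-seg>'
)
print(duplicate_ids(current))
```

A check like this could run over generated pages to confirm that a new markup scheme really does leave each id unique.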

New features

For the upcoming revision, I would like to add:

Notes support.
Reduced reliance on tooltips.
More flexible display options, for example, taking advantage of wide screens.

A new approach: build from the data on up

Now that our data is sourced from bilara, we have a much clearer picture of the actual data included in the text pages. Let us build a direct pipeline so that data from bilara ends up in the page with minimum fuss.

(Bilara is flexible and might be extended in future, but let us just deal with what we have today.)

Bilara-data currently has the following data types. (Leave aside markup, which is outside the segments.)

reference
root
translation
comment
variant

These types are structured: root, translation, and reference apply directly to the segment as a whole. Variant and comment are children of root and translation respectively. That is to say, comments apply to a particular translation, not the segment as a whole. Likewise, variants apply to a particular edition.

reference
translation
- comment
root
- variant
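
As a rough sketch of this hierarchy in code, the Python dataclasses below model a segment with its reference, translation, and root, where comments hang off the translation and variants hang off the root. The class and field names are my own assumptions for illustration; they are not taken from bilara.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Translation:
    text: str
    # Comments belong to this particular translation, not to the segment.
    comments: List[str] = field(default_factory=list)


@dataclass
class Root:
    text: str
    # Variants belong to this particular edition of the root text.
    variants: List[str] = field(default_factory=list)


@dataclass
class Segment:
    segment_id: str
    references: List[str] = field(default_factory=list)
    translation: Optional[Translation] = None
    # Optional because the root text is only supplied when the user asks for it.
    root: Optional[Root] = None


segment = Segment(
    segment_id="sn1.1:1.4",
    references=["1.4"],
    translation=Translation(text="“Good sir, how did you cross the flood?”"),
)
print(segment.root is None)
```

Leaving root optional mirrors the current behaviour, where a reader who never asks for Pali never receives it.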

For HTML, it is better to wrap each thing in its own span for ease of styling, so we can consider the “text” and the comment or variant as siblings that are children of the (abstract idea of) translation or root. Then we can represent this hierarchical structure literally in the HTML, simply assigning classes to spans.

<span class="segment" id="sn1.1">
    <span class="reference">
        <a> …
    </span>
    <span class="translation"> 
        <span class="text"> …
        <span class="comment"> …
    </span>
    <span class="root">
        <span class="text"> …
        <span class="variant"> …
    </span>
</span>

This gives us a very clean, one-to-one relationship between the data in bilara and the HTML produced for the page. It is immediately obvious exactly what each element is and how they relate. It removes the necessity for multiple elements with the same ID, and ensures references stay where they belong, on the segment.
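
Here is a minimal Python sketch of how such a pipeline might turn the data for one segment into the nested spans. It assumes the values for a segment have already been looked up and are passed in as plain strings; the function names are hypothetical, and the link inside the reference is left as a bare <a> because its attributes are not settled here.

```python
from html import escape


def span(css_class, inner_html):
    """Wrap already-escaped HTML in a span with the given class."""
    return f'<span class="{css_class}">{inner_html}</span>'


def text_with_children(parent_class, text, child_class, children):
    """Render translation or root: a text span followed by its child spans."""
    inner = span("text", escape(text))
    inner += "".join(span(child_class, escape(child)) for child in children)
    return span(parent_class, inner)


def render_segment(segment_id, reference=None, translation=None,
                   comments=(), root=None, variants=()):
    """Build one segment span; parts that are None are simply left out."""
    parts = []
    if reference is not None:
        # Link attributes are omitted on purpose (assumption: decided later).
        parts.append(span("reference", f"<a>{escape(reference)}</a>"))
    if translation is not None:
        parts.append(
            text_with_children("translation", translation, "comment", comments)
        )
    if root is not None:
        parts.append(text_with_children("root", root, "variant", variants))
    return f'<span class="segment" id="{escape(segment_id)}">{"".join(parts)}</span>'


html = render_segment(
    "sn1.1:1.4",
    reference="1.4",
    translation="“Good sir, how did you cross the flood?”",
)
print(html)
```

Because the root is just another optional child of the segment, adding or removing Pali on request means adding or removing a single span, with no second element carrying the same ID.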

Extra data

On bilara, we have some extra information on each segment, namely language and author (or edition in the case of root texts; both are identical from SC’s perspective). It would be good to retain this information in the HTML, especially as we move towards building more complex and customizable layouts. Uses:

Pali is not by author “sujato”!
Comments by different authors: a reader in Italian may want to see comments in English.
Multiple translations: a bilingual user may want to read Sinhala and English.
Multiple editions: when available, a reader may want to compare them.

To make this granularity possible, assign such data to the last level of the hierarchy, even though this makes the markup somewhat more verbose. An example:

<span class="translation">
	<span class="text" lang="en" data-author="sujato">And when I swam, I was swept away.</span>
</span> 
<span class="root">
	<span class="text" lang="pli-Latn" data-edition="ms">yadāsvāhaṃ, āvuso, āyūhāmi tadāssu nibbuyhāmi.</span>
	<span class="variant"  lang="pli-Latn" data-edition="ms" title="variant reading" data-tooltip="🔀  nibbuyhāmi → nivuyhāmi (s1-3, km, mr)">nibbuyhāmi → nivuyhāmi (s1-3, km, mr)</span>
</span>
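
A small Python sketch of how the leaf spans could be generated with this extra data. The helper name leaf is hypothetical, and it assumes the language and the author or edition are known for each piece of text; keyword arguments are turned into data-* attributes.

```python
from html import escape


def leaf(css_class, text, lang, **data):
    """Render one leaf span; keyword arguments become data-* attributes."""
    attrs = [f'class="{escape(css_class)}"', f'lang="{escape(lang)}"']
    for key, value in data.items():
        attrs.append(f'data-{key.replace("_", "-")}="{escape(value)}"')
    return f'<span {" ".join(attrs)}>{escape(text)}</span>'


translation_text = leaf(
    "text", "And when I swam, I was swept away.", "en", author="sujato"
)
root_text = leaf(
    "text",
    "yadāsvāhaṃ, āvuso, āyūhāmi tadāssu nibbuyhāmi.",
    "pli-Latn",
    edition="ms",
)
print(translation_text)
print(root_text)
```

With the language and author on every leaf, a layout could select, say, all English comments regardless of which translation language the reader has chosen.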

Next steps

The question then becomes, can we use this markup to achieve the various display scenarios we want? I believe we can, and this will be the next post.

Question: Do we need a custom element for this? It seems to be the kind of thing that traditional HTML handles just fine.

@blake @aminah @hongda