Fixing HTML on legacy texts

Aminah · August 14, 2019, 9:54pm

I guess it’s about time to update with the latest on this front. Tinkerings to date have been pushed to the html-clean5 branch for checking and can wait there until Hongda is ready to update the site code as necessary.

A for the first person to report a batch-edit-boo-boo!

What’s been done, in brief:

convert em to i
discover & correct sutta ID errors
add full sutta ID numbers in titles where only partial ID was given (eg. “53. Title” > “8.53. Title”)
move sutta ID numbers from sub/division p to sutta title h1
periods added to/removed from sutta titles as suitable
in SN & AN vagga-suttas (where relevant) convert:
- h2 to h3
- h1 to h2
Add <article> to individual vagga-sutta suttas
HTML-spec-aligned markup conversion

You can also check out my enthralling commit history from mid April onwards for more of an overview (although I have to admit I wasn’t always systematic about what was committed under what message, so all sorts of bits and bobs might also be included in commits as I meandered along).

In detail:

`<em>` & `<i>`

The conversions made here mostly concern pli/san words, but I’ve since found some instances where a second look might be needed. A closer peek is also needed before adding lang attributes, as that can’t (accurately) be done in a blanket fashion.
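For illustration only, here is a minimal sketch of the kind of blanket tag swap described above. The function name `em_to_i` is hypothetical and this is not the script actually used on the branch; it assumes the markup is regular enough for a regular expression, and it deliberately makes no attempt to add lang attributes.

```python
import re

# Matches an opening or closing <em> tag; \b stops it from touching <embed>.
EM_TAG = re.compile(r"<(/?)em\b([^>]*)>", re.IGNORECASE)


def em_to_i(html: str) -> str:
    """Swap every <em> tag for <i>, keeping any attributes on the opening tag."""
    return EM_TAG.sub(r"<\1i\2>", html)


print(em_to_i('<p>The word <em>dukkha</em> appears here.</p>'))
```

Because the swap is blind to context, any `<em>` that really was meant as emphasis would also be converted, which is why a second look at the results is still needed.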

ID number corrections

Initially this just concerned some odd discoveries here and there, but when trying to make the agreed heading amendments I found the VI title numbers were out of whack, so I used the original source site (SN, AN) to bring them in sync with SC numbering. As far as my wits and Google Translate can be relied on, I think the numbers are now right, but it might be good if some Vietnamese friends could double check (I wonder, @phineas-pta, would you be able and so kind as to do a few spot checks later on?).

Also, I believe the .fakechapter has been made redundant in the process. Following the original source, as best as I can tell, these did not “[Indicate] what appears to be erroneous or misplaced titles in Vietnamese texts”, but instead matched the vaggas pretty well. There were a few clarification headings the translator seems to have added and they remain as .subheading.

Add full sutta ID numbers in (some) titles

This got more ‘involved’ than I’d have hoped. I’ve covered as many “broad-sweep” pattern variations as I can think of, but as I went about it I found that in many cases language folders (and sub-folders) had to be targeted individually, and this is definitely not a complete process.

From where I sit, the whole <div class="hgroup"> (now <header>) across the different language translations just requires individual attention to treat the many slight variations correctly, so to my mind it’s better to leave this as a bumbling mission for later on.
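As a rough sketch of one “broad-sweep” pattern (the “53. Title” > “8.53. Title” case from the summary), something like the following could work. The function `add_full_id` and its `prefix` parameter are hypothetical names, and the sketch assumes the caller already knows the leading number from the file or folder it is processing.

```python
import re

# A partial ID is a bare number followed by a period and whitespace at the
# start of the title. "8.53. Title" does not match, because the first period
# is followed by a digit rather than whitespace.
PARTIAL_ID = re.compile(r"^(\d+)\.\s+")


def add_full_id(title: str, prefix: str) -> str:
    """Turn '53. Title' into '<prefix>.53. Title'; leave other titles untouched."""
    return PARTIAL_ID.sub(lambda m: f"{prefix}.{m.group(1)}. ", title, count=1)


print(add_full_id("53. Title", "8"))
```

Titles that already carry a full ID, or no number at all, pass through unchanged, so the function can be run more than once over the same file without doubling up prefixes.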

Convert vagga-sutta headings as relevant (`h1` > `h2` > `h3` > etc.)

This was limited to just Pali AN and SN, as a broader survey suggested that outside of this it probably wasn’t applicable and definitely was too messy. Even with this limitation the amendments had to be done by individual language folder, as there was just too much inconsistency to be able to target accurately otherwise.

The purpose of this was to produce a single “range-title” h1 per file, with individual sutta titles becoming h2 as previously agreed.
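One detail worth spelling out is the order of the conversion: h2 has to go to h3 before h1 goes to h2, otherwise the freshly made h2 headings get demoted a second time. A minimal sketch, assuming regular markup and using a hypothetical function name:

```python
import re


def demote_headings(html: str) -> str:
    """Demote h2 to h3, then h1 to h2, in both opening and closing tags."""
    # Step 1: existing h2 headings move down to h3.
    html = re.sub(r"<(/?)h2\b", r"<\1h3", html)
    # Step 2: only now can h1 headings safely move down to h2.
    html = re.sub(r"<(/?)h1\b", r"<\1h2", html)
    return html


print(demote_headings('<h1>1. Title</h1><h2>Subsection</h2>'))
```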

Add `<article>` to individual vagga-sutta suttas

So, where a vagga-sutta contains several suttas under h2 titles, these have each been wrapped in <article> tags.

It should be noted that:

there are instances where translators’ handling of a vagga-sutta really doesn’t lend itself to the “every sutta gets an <article>” scheme (eg. DE an2.180-229, sn17.13-20.html, or FR an2.98-117.html).
there are some “vagga-suttas” that are more just “multiple sutta” files (they contain some of the suttas within the vagga-sutta, but not all) (eg. NL an2.11-20, PT sn53.45-54, or FR an2.1-10).
even though it isn’t exactly in accordance with the definition (on account of not falling into a .hgroup), in FR an1.394-574.html and DE an2.130-140.html I’ve used .subheading for clarifying headings the translator has inserted to group smaller clusters of suttas within the vagga-sutta. Personally, I think it wouldn’t be a bad idea to extend the definition (in zz3) as there is a use case (albeit minor; there are maybe a handful or so of places where it might be used in this way), but if not, it’s only been applied in these two instances so can easily be removed.
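For the straightforward files, the wrapping can be pictured as splitting the body at each h2 and enclosing every chunk in `<article>`. The sketch below is only that, a picture: `wrap_suttas` is a hypothetical name, it assumes every h2 starts a new sutta, and it would do the wrong thing on exactly the kinds of files listed in the caveats above.

```python
import re


def wrap_suttas(body: str) -> str:
    """Wrap each run of content that starts at an <h2> in its own <article>."""
    # Split just before every <h2 so each chunk begins with its own title.
    parts = re.split(r"(?=<h2\b)", body)
    head, suttas = parts[0], parts[1:]
    wrapped = "".join(f"<article>\n{s.rstrip()}\n</article>\n" for s in suttas)
    return head + wrapped


sample = "<h1>Range title</h1>\n<h2>1</h2>\n<p>a</p>\n<h2>2</h2>\n<p>b</p>\n"
print(wrap_suttas(sample))
```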

HTML-spec-aligned markup conversion

The changes follow the template and guiding principles wiki drafted on the basis of what was discussed a little while back. The following tweak mentioned in the “Markup redux” thread has also been incorporated:

In single sutta files:

~~<div id="text" lang="en"> <section class="sutta" id="kv1.1"> <article>~~

has become:

<article id="kv1.1" lang="en">

~~<div class="hgroup"> … </div>~~

has become:

<header>
…
</header>

~~</article> <aside id="metaarea"> … </aside> </section> </div>~~

has become:

<footer class="metaarea">
…
</footer>
</article>

In vagga-sutta files with individual suttas combined into a single sutta (eg. ID an10.156-166):

As above.

In vagga-sutta files with multiple suttas in articles:

~~<div id="text" lang="it"> <section class="sutta" id="an1.31-40"> <div class="hgroup"> … </div>~~

has become:

<section id="an1.31-40" lang="it">
<header>
…
</header>

~~<aside id="metaarea"> … </aside> </section> </div>~~

has become:

<footer class="metaarea">
…
</footer>
</section>

As mentioned, I can’t see a way to batch handle <header> supertitles, so I think they’ll just have to be plodded through later on.

Other notes about “html-clean5”:

Apart from the HTML-spec-aligned markup conversion, most of the amendments made in this round do not apply to the SI files as they’re a bit of a mess and newer files are pending (I am about to follow up with the chap who’s preparing them for his own site).
I still have a healthy list of other bibbly-bobbly-cleany things that can be done, but y’know; rainy day 'n’all.
One not “bibbly-bobbly” bit I’m keen to look at, that nevertheless hasn’t been approached yet is adding structured data (perhaps with schema.org’s vocabulary). Of course, this probably overlaps with the RDF ticket, which I’m also quite interested in.

As an aside, while having (almost) endless fun with the Vietnamese texts, I found their publication dates on the source site so put them in. It turns out the AN was published over two years so I put the span in a span (<span class="publication-date">1980–1981</span>). I was then curious how that would be handled using Schema’s datePublished property. I discovered all dates follow the ISO_8601 format, and time intervals just use a /.

ISO_8601 also has a nice solution for approximations, which would be relevant for the VI SN. The ‘early 1980s’ is given as the date of publication and there is a pretty ~ to denote the approximate date (although I should mention, I’m going by a review draft of the standard; maybe it’s just down to my being economically limited, but I just find there’s something perverse about having to pay to know how to write the date correctly! It’s only a step or two on from charging for the alphabet.)
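To make the two notations concrete, here is a small sketch that builds the strings. The helper names are hypothetical, and the placement of the ~ after the year is an assumption based on the draft notation described above, so it should be checked against the published standard before being relied on. The year passed to `iso_approximate` is purely illustrative and is not the VI SN publication date.

```python
def iso_interval(start_year: int, end_year: int) -> str:
    """Join two years with a solidus, the interval separator mentioned above."""
    if end_year < start_year:
        raise ValueError("end_year must not be earlier than start_year")
    return f"{start_year}/{end_year}"


def iso_approximate(year: int) -> str:
    """Mark a year as approximate with a trailing tilde (assumed placement)."""
    return f"{year}~"


# The AN span from the publication-date example in the post.
print(iso_interval(1980, 1981))
# An illustrative year only.
print(iso_approximate(1984))
```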

General notes about the HTML template

What with the rolling on of time and plenty of mulling-over opportunity, it might be good to revisit what was set out in the template and guiding principles wiki. It was put together as a draft so it might be nice to make a “finalized” reference point. Certainly one thing that would be good to check over, tweak and confirm is the supertitle section.

Fun fact: you probably already know, but WHATWG has now become the sole keeper of the HTML standard. The agreement between W3C and WHATWG was signed at the end of May. The passing relevance here is that the zz3 page notes that <hgroup> is a deprecated element, and this is no longer true as the element does exist in the WHATWG spec. Thing is, the WHATWG spec’s a bit rubbish on this, and just as our trusty friend Bruce Lawson says, the reason why <hgroup> wasn’t in the W3C spec is because “it’s completely useless”. This, of course, is all happily sidestepped with the implementation of <header> for titles and supertitles.

Aaaaanyway,

Regarding the principle of using Pali terms for structural attribute values, as you noted this “gets complicated when considering non-Pali texts”. While on the one hand stating “we should use the same basic principle of following the naming as used in the sidebar”, it’s also mentioned that “all classes in supertitles should be in Pali.” I don’t see how both of these guiding principles can be followed in the case of non-Pali texts.

Is the basic idea to map non-Pali structure onto the Pali model? In which case, would it be worth explicitly setting out Pali equivalences for the non-Pali texts, something like the following (just a quick, rough presentation to sketch the idea):

 <header>
   <p class="nikaya handle">[ DA | MA | SA | EA | Minor Chinese ]</p>
   <p class="nipata handle">[ - | - | - | Jātaka; Sūtrasannipāta; Dhāraṇī; Sūtravyākaraṇa ]</p>
   <p class="pannasaka handle">[ - | - | - | - | - ]</p>
   <p class="vagga handle">[ - | - | SA 1–100 | EA 4 | - ]</p>
   <h1 class="sutta handle">4.4. Sutra title</h1>
 </header>

or…

<header>
  <p class="agama handle">Saṃyuktāgama 1–100</p>
  <h1 class="sutra handle">1. Discourse on Impermanence</h1>
</header>

<header>
  <p class="minor-chinese handle">Jātaka</p>
  <h1 class="sutra handle">152.87 摩調王</h1>
</header>

or…???

I still have a niggling reservation over .handle for supertitles. It works okay, but particularly in view of your point re disambiguating, what do you think about .text-title, or just the very straightforward .supertitle [Added: okay, strictly speaking .supertitle wouldn’t fit as the class would also be applied to the h1 title, which thus by definition is not “super”. Another alternative might be .headertitle]?

Of course, brevity over being long-winded is better; but at the same time, I lean towards clarity over brevity. Both options suitably circumvent any muddle with the title attribute, and are also one character shorter than appellation which you put forward as a viable option.

Also,

In the Markup Redux thread, you were looking for code to cull from the text files and were interested in just beginning at <header>. I pointed out this wouldn’t necessarily work for the legacy files as the author name is given in the <head> area. However, <meta charset="UTF-8"> can certainly get the chop.

And just to remind,

It might be good to remember these related tickets when the work on updating the site code begins:

social sharing issue

adding <main>
It was previously mentioned that it would probably be better to make the URLs of pannasakas and such more clearly separate, i.e. an3-dutiya-pannasaka.
It was previously mentioned that we shall have to review the CSS to ensure everything in the new markup is clearly distinguished.
