Fixing HTML on legacy texts

Yeah, that one’s above my head, and I also spot that there’s now a html_text (copy) folder on master with Tidy-ifed texts. When I went to bed yesterday, there was a simple conflict with a few amendments to the MVK that have been made to master. I’ve no idea why it’s suddenly got more complicated. :going_back_to_bed_emoji:

Okay, I merged master into html-clean and handled the conflicts, and the above pull request now has the green light to be merged.

@sujato as it is now 1am AEDT I trust it is safe to handle the ones you mentioned above before the beginning of the moratorium. I will be sure to down tools in an hour or two.

Just to update:

Removed:

<u> & </u>
data-name
data-type
target="_blank" (again :D, crept back in with the merge from master)
<li style="list-style: none">
<meta author="…"> (again, again :smiley: )

Amended

<span style="visibility: hidden"> changed to class="hidden" as per https://suttacentral.net/zz3/zz/test, but reading the description I’m not completely confident it fits.

Left unchanged

hidden - I understand that of course it should be got rid of, but with regard to the note above on the pt sn3.13 text I’m still not clear how this should be treated. As for the span hiddens, they seem to contain notes, and I want to confirm the hidden class is suitable for them (with the en Vinaya texts I’m guessing it’s more or less a moot point).

hidden

en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss8.html:36
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss8.html:124
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss8.html:134
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss7.html:26
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss7.html:71
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss6.html:19
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss6.html:79
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss9.html:19
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss13.html:19
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss4.html:36
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss4.html:39
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss4.html:58
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss10.html:45
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss10.html:47
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss2.html:35
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss2.html:128
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss3.html:29
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss1.html:44
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss1.html:44
en/pli/vinaya/pli-tv-bu-vb/pli-tv-bu-vb-ss/pli-tv-bu-vb-ss1.html:253
en/pli/sutta/mn/horner/mn86.html:18
en/pli/sutta/mn/horner/mn86.html:39
en/pli/sutta/mn/horner/mn86.html:47
en/pli/sutta/mn/horner/mn87.html:44
en/pli/sutta/mn/horner/mn89.html:22
en/pli/sutta/mn/horner/mn89.html:31
en/pli/sutta/mn/horner/mn89.html:32
en/pli/sutta/mn/horner/mn82.html:73
en/pli/sutta/mn/horner/mn85.html:49
en/pli/sutta/mn/horner/mn85.html:73
en/pli/sutta/mn/horner/mn85.html:163
en/pli/sutta/mn/horner/mn70.html:36
en/pli/sutta/mn/horner/mn70.html:53
en/pli/sutta/mn/horner/mn66.html:41
en/pli/sutta/mn/horner/mn88.html:23
en/pli/sutta/mn/horner/mn88.html:24
en/pli/sutta/mn/horner/mn88.html:24
en/pli/sutta/mn/horner/mn79.html:46
en/pli/sutta/mn/horner/mn79.html:47
en/pli/sutta/mn/horner/mn79.html:50
en/pli/sutta/mn/horner/mn79.html:51
en/pli/sutta/mn/horner/mn79.html:59
en/pli/sutta/mn/horner/mn79.html:72
en/pli/sutta/mn/horner/mn75.html:88
en/pli/sutta/mn/horner/mn90.html:26
en/pli/sutta/mn/horner/mn90.html:36
en/pli/sutta/mn/horner/mn73.html:41
en/pli/sutta/mn/horner/mn71.html:22
en/pli/sutta/mn/horner/mn71.html:23
en/pli/sutta/mn/horner/mn71.html:24
pt/pli/sutta/sn/sn3/sn3.13.html:33
pt/pli/sutta/sn/sn3/sn3.13.html:34
pt/pli/sutta/sn/sn3/sn3.13.html:35
pt/pli/sutta/sn/sn3/sn3.13.html:36

Also,

before pushing again to the html-clean branch I ran HTML Tidy over all the files as per your given options:

find . -name '*.html' -type f -print -exec tidy --doctype html5 --output-html 1 --tidy-mark 0 --quiet 1 --output-encoding utf8 -w 0 --show-warnings 0 -m '{}' \;

(actually I ran it at the start as well, but also as my last action).

I’ll now not touch html_text until word from you.

Random bad things:

 <a class="sc" id="sc&nbsp;1">

Eek!

 </p>\n</aside>

Eew!

In html_text/sr/pli/sutta/mn/mn55

I have full confidence there’ll be a bunch more “Eek!” and “Eew” to be found (mostly misfired regexes would be my guess), but the first target of this process has been HTML elements with stripped attributes.

Still would be good to keep a list of other things to look out for once that’s done. I’m willing to bet a bag of kumquats that some &gt; and &lt;s will be lying around.


Another one: sometimes (eg /home/user/html_text-date/html_text/it/pli/sutta/sn/sn9) the metaarea is at the start of the file, not the end. It’s not necessarily wrong per se, but it is inconsistent.


My inner compulsive wants to make attribute orders uniform (eg. narrow down <a href="…" class="…" title="…"> and <a href="…" title="…" class="…"> to one). Aside from with respect to my immediate task, where it would have minimal benefit, it has no benefit whatsoever.

Also <span> in metaarea. As far as I know, span should always have a class, else what’s the point?


Righto, ready for another round? Here is my plan for a next batch of edits. If you reckon the plan, or part of it seems reasonable, I’ll go ahead.

Firstly just taking up the most recently above mentioned points.

Regex misfire check

  • &nbsp;
  • &gt;
  • &lt;
  • \n
  • \s
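The misfire check above could be roughed out along these lines. This is only a sketch: the function name and the labels are made up for illustration, and it assumes the `\n` and `\s` in the list mean the literal two-character sequences left behind by a replacement string. Every hit is just something for a human to look at, since a `&lt;` in running text may be perfectly legitimate.

```python
import re

# Hypothetical patterns: each one flags text worth a second look,
# not necessarily an error.
SUSPECTS = {
    "nbsp": re.compile(r"&nbsp;"),
    "gt": re.compile(r"&gt;"),
    "lt": re.compile(r"&lt;"),
    "backslash-n": re.compile(r"\\n"),  # a literal backslash followed by "n"
    "backslash-s": re.compile(r"\\s"),  # a literal backslash followed by "s"
}


def find_suspects(text):
    """Return (line_number, label, stripped_line) for every suspicious line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECTS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits
```

Run over the two “Eek!”/“Eew!” samples above, it should report the `&nbsp;` in the id and the stray literal `\n` before `</aside>`.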

For consistency & ease of reading:

  • move all metaareas to file end
  • element list duplication/attribute order as per @mdo’s code guide:
    • <a class="…" data-uid="…" id="…">
    • <a class="…" id="…" data-uid="…">
    • <a href="…" title="…" class="…">
    • <a href="…" class="…" title="…">
    • <p lang="…" class="…">
    • <p class="…" lang="…">
    • <section id="…" class="…">
    • <section class="…" id="…">
    • <span class="…" title="…" id="…">
    • <span class="…" id="…" title="…">
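For the attribute-order item, a minimal sketch of how the variants could be collapsed to one. The priority list is recalled from @mdo’s code guide (class, then id/name, then data-*, and so on) and should be checked against the guide itself; `lang` and `dir` are not covered there, so sending them to the end is purely this sketch’s assumption. The regex only touches opening tags whose attributes are all of the form name="value"; anything else is left alone.

```python
import re

# Assumed priority, recalled from @mdo's Code Guide. Attributes not listed
# (e.g. lang, dir) sort last, which is this sketch's own choice.
ORDER = ["class", "id", "name", "data-", "src", "for", "type", "href",
         "value", "title", "alt", "role", "aria-"]

ATTR = re.compile(r'([\w:-]+)="([^"]*)"')
TAG = re.compile(r'<([a-zA-Z][\w-]*)((?:\s+[\w:-]+="[^"]*")+)\s*>')


def rank(name):
    """Position of an attribute name in ORDER; unknown names go last."""
    for position, prefix in enumerate(ORDER):
        if name == prefix or (prefix.endswith("-") and name.startswith(prefix)):
            return position
    return len(ORDER)


def reorder_tag(match):
    attrs = ATTR.findall(match.group(2))
    attrs.sort(key=lambda pair: rank(pair[0]))  # stable sort keeps ties in place
    rendered = " ".join(f'{name}="{value}"' for name, value in attrs)
    return f"<{match.group(1)} {rendered}>"


def normalise_attribute_order(html):
    """Rewrite every matching opening tag with its attributes in ORDER."""
    return TAG.sub(reorder_tag, html)
```

With this ordering both `<a href="…" title="…" class="…">` and `<a href="…" class="…" title="…">` end up as class, href, title.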

Comments

The below constitute items that are not elements but nevertheless may carry some information that may as well be preserved using the comment tag:

  • <again 2.65 is missing in the English text. The following is included in 2.64 there.>
  • <chapter 2>
  • <chapter 3>
  • <end of chapter 2>
  • <end of first chapter>
  • <The English text is missing chapter 2.28. All the following is included in 2.27 there>
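A rough sketch of turning those pseudo-elements into real comments. The heuristic is an assumption, not something taken from the files: it treats anything in angle brackets that contains whitespace but no `=` as an editorial note. That would also catch a tag with a bare boolean attribute, so the hits should be reviewed before anything is written back.

```python
import re

# Heuristic (an assumption): a note looks like <some words here>, i.e. it has
# whitespace but no "=", and is not a closing tag, a comment/doctype, or a
# self-closing tag such as <br />.
NOTE = re.compile(r"<(?![/!])([^<>=]*\s[^<>=]*)(?<!/)>")


def notes_to_comments(html):
    """Wrap each note-like pseudo-tag in an HTML comment."""
    return NOTE.sub(lambda m: f"<!-- {m.group(1).strip()} -->", html)
```

So `<chapter 2>` would become `<!-- chapter 2 -->`, while `<p class="…">` and `</p>` are left untouched.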

Bits & bobs:

<a href="…">
<a href="…" target="…">
This one had already been done, but managed to get reintroduced somewhere along the way.

<article style="…">
<article dir="rtl">
In all instances the style attribute is used to set a rtl direction in Hebrew translations.

<aside>
<aside id="metaarea">

<div dir="…">
Shift dir attribute into suitable semantic element (eg <h1 dir="rtl">)
This only concerns 5 cases in 2 files.

<div id="…">
<main id="…" lang="…">

  • A couple hundred files are missing the lang attribute. See below for explanation of main. BUT,

  • There are also approx one hundred files with a random “toc” div that should be removed eg.:

    <body>
    <div id="text" lang="de">
    <div id="toc"></div>
    <section class="sutta" id="pli-tv-bu-pm">
    <article>
    

    and even more oddly in one Norwegian file:

    <p>Slik talte Mesteren, og glade til sinns tok munkene imot Mesterens ord.</p>
    div id="DsDKxRy2ord4kgmX"></div>
    <div id="RVJr4ZVZ2plxlffJl6DeJ"></div>
    <div id="RIbPC0YJDXzJvkVzI4Nr"></div>
    </article>
    

<em id="Cunda">
This only occurs once (sr/pli/sutta/dn/dn16.html:336)

<en>
<em>
This also only occurs once.

<span>
As per above, a span without an attribute is pointless (although I will do a few spot checks with Google Translate of the relevant 221 files, to see if perhaps an add class was meant, or some such thing). All but 11 of these files concern Slovenian translations.

Basic structure:

As well you know, currently the basic legacy text structure is as follows:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="author" content="Bhikkhu Bodhi">
<title></title>
</head>
<body>
<div id="text" lang="en">
<section class="sutta" id="an4.4">
<article>
<div class="hgroup">
<p class="division">Aṅguttara Nikāya</p>
<p>The Book of the Fours</p>
<h1>4. Maimed (2)</h1>
</div>
<p>
…
</p>
</article>
<aside id="metaarea">
<p>…</p>
</aside>
</section>
</div>
</body>
</html>

Based on my understanding of info primarily found in MDN docs and on w3schools pages, I think it makes sense to employ HTML5’s semantic elements. Setting aside the satisfactions of having a clearer logical flow that conforms a little more closely to the recommended usage of some of the new elements HTML5 introduced, one really significant potential benefit would be to make these pages a smidgen more screen reader friendly, should the rest of the site be reviewed further down the line to make it accessible to those using such devices.

My suggested amendments would be as follows:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="author" content="Bhikkhu Bodhi">
<title></title>
</head>
<body>
<main id="text" lang="en">
<article class="sutta" id="an4.4">
<header class="hgroup">
<p class="division">Aṅguttara Nikāya</p>
<p>The Book of the Fours</p>
<h1>4. Maimed (2)</h1>
</header>
<p>
…
</p>
<footer id="metaarea">
<p>…</p>
</footer>
</article>
</main>
</body>
</html>

details:

  • <div id="text" lang="en"> to <main id="text" lang="en">:
    div is meaningless, main is meaningful. More significantly, main would clearly indicate where the main page content begins which in turn could potentially be used to support those using screen readers getting straight to the content they want (see eg. here).

  • <section class="sutta" id="an4.4"> removed and contained attributes shifted to article:
    I can’t see any especially good reason to have the section element in our particular texts, so why not keep things as trim as possible? There is one slight variation on this concerning files containing multiple suttas, where a <section id="vagga"> has been used in addition to the primary section. Here, I’d suggest following a similar pattern of shifting the section attributes into articles with each sutta article nested in a vagga article.

  • <div class="hgroup"> to <heading class="hgroup">:
    div is meaningless, heading is meaningful.

  • </article> moved after copyright and other translation info:
    the rule of thumb I’ve been pointed to several times is that an article is a complete/discrete item that in theory could be independently distributable or reusable elsewhere. The </article> makes sense where it currently is, but I think the suggested new location also makes sense, and is (1) consistent with descriptions of how the article should be used, and (2) allows the section to be removed.

  • <aside id="metaarea"> to <footer id="metaarea">:
    Use of aside here doesn’t seem to conform very well with how the element was intended to be used, whereas footer does.

    As a side suggestion, in the same way that we have introduced the publication-date class into the aside/footer (although not with such a clear and present use on the cards as the publication date field), we may also wish to add the address element here too, in instances where the translator has given contact info. Naturally, it’s not all that important, and it’s definitely not something I’d do in this round of edits.

Here endeth my plan. Wha’dya think; reasonable?


Questions:

<a class="…" id="…" pi="…">

files concerned

vi/pli/sutta/sn/sn18/sn18.22.html
vi/pli/sutta/sn/sn18/sn18.21.html
vi/pli/sutta/sn/sn1/sn1.37.html
vi/pli/sutta/sn/sn2/sn2.8.html

<b>
There are 453 of these; I’m guessing they’ve been used semantically?

<i>
From what I can tell from a quick scroll through the 12,591 instances, the <i> is consistently used meaningfully. However, while it usually surrounds a Pali/Sanskrit/Chinese (I think :confused: ) word, sometimes it surrounds (what I’m guessing is) a definition in the local language. Also, it’s not exactly a big deal, but there’s a slight inconsistency with sometimes using the <i> with attributes, and sometimes not.

<span a="… + several other spans listed in the OP containing a super wild attribute list.
My first thought is to just strip them of all non-global attribute ‘attributes’, but I’m a bit cautious that in and amongst it there is some information that maybe should be retained.

Of course. Just to clarify, we have always used HTML5 semantics. It’s just that the definitions and usage have evolved somewhat: we designed the page at a time when HTML5 was rolling out.

Indeed. In the legacy site, <main> was used as part of the site page structure, hence why it is not in the text file. But so far as I can see the current site doesn’t use <main>.

Again, yes. (Originally <section> was defined as the container of an <h1> element, hence our usage here, but it has since changed.)

In such cases, I think it would make more sense to think of a vagga as a <section>. It is a “thematic grouping of content”, and it is not “a self-contained composition in a document, page, application, or site” as an <article> is. Also it just seems clearer semantically to avoid nesting articles.

Yes. (It is <header> tho!)

Indeed.

Nah, sounds great. Of course we’ll have to see how it might impact the rest of the site JS and CSS. I’d suggest ensuring that everything else is fixed first. Then we can write a simple script to make those transformations, roll it onto staging, and figure out how they impact the JS/CSS.
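As a starting point for that script, here is a very rough sketch of the single-sutta case. It leans on assumptions: that the files have the Tidy-normalised layout shown in the “basic structure” example, that the hgroup contains no nested div, and that there is exactly one sutta per file. Files with a vagga section are deliberately not handled, and a proper HTML parser would be more robust than regexes. The function name is made up.

```python
import re


def modernise(html):
    """Apply the proposed restructuring to one legacy single-sutta file."""
    # <div id="text" ...> ... </div></body>  ->  <main id="text" ...> ... </main></body>
    html = re.sub(r'<div (id="text"[^>]*)>', r'<main \1>', html, count=1)
    html = re.sub(r'</div>\s*</body>', '</main>\n</body>', html, count=1)
    # <section class="sutta" id="..."><article>  ->  <article class="sutta" id="...">
    html = re.sub(r'<section (class="sutta"[^>]*)>\s*<article>',
                  r'<article \1>', html, count=1)
    # <div class="hgroup"> ... </div>  ->  <header class="hgroup"> ... </header>
    html = re.sub(r'<div class="hgroup">(.*?)</div>',
                  r'<header class="hgroup">\1</header>',
                  html, count=1, flags=re.S)
    # </article><aside id="metaarea"> ... </aside></section>
    #   ->  <footer id="metaarea"> ... </footer></article>
    html = re.sub(r'</article>\s*<aside id="metaarea">(.*?)</aside>\s*</section>',
                  r'<footer id="metaarea">\1</footer>\n</article>',
                  html, count=1, flags=re.S)
    return html
```

Each substitution is limited to one match per file, so a file that doesn’t fit the expected shape is simply left partly or wholly unchanged, which should make the odd ones easy to spot on staging with a diff.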

Of that I can be absolutely certain, (1) because I’m still smarting from being reprimanded for using bold rather than a heading (2) because it’s really evident in the texts. :grinning: All I meant by it was to explain why I wanted to suggest changing the divs.

Yes, I can very well see the case, and in fact, was originally close to suggesting as much. I dithered a little over the point and my eventual test was “would an article here pass the syndication test?”. I believe it would (and that this is essentially how it is being treated on SC), and went with it to keep in line with the idea that (by the proposed changes) an article would contain meta info, but I agree, avoiding nesting would be preferable.

Mmmrrumblemumble. Yup, it most surely is! Slip of the mind-fingers, and y’know, I’m gonna take it as a win that there aren’t more! :grin:

Oooh, the perks of hanging out nearby the clever kids!

Okay, so for now are you happy for me to go ahead with that listed under:

  • Regex misfire check
  • For consistency & ease of reading
  • Comments
  • Bits & bobs (except with the last one <div id="…">, I’ll convert it to <div id="…" lang="…"> for now.)

If so I will check in with Ayya @Vimala to see if they are working with any of the legacy files at the moment before doing anything.

Also, as an aside, I have some new Norwegian files to code, should I stick with the current legacy structure for now?

No, I’m not. Just the parallels.json file. I have about 5000 possible parallels between pali texts to go over and it will take some time!


Of course you should not have said that!!! Speak of the devil and I find an error.
ja541 is not displayed on the site. I cannot readily find out what is wrong with it. It is also the same on Staging. Maybe the problem is that there is a note in the heading:
<span class="var" id="note1216" title="nemirājajātakaṃ (s2, s3)">
So maybe you can check that out or make a ticket for it.

Thanks. Done: ja541 is not displayed on the site · Issue #1242 · suttacentral/suttacentral · GitHub


Crumbs! Forgot to include one other amendment I think we agreed on ages ago: to convert all endsubsection classes into endsection (thereby deleting endsubsection). I would include this under the first half of the plan if it suits.
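A small sketch of how that class merge might go, assuming the class only ever appears inside a double-quoted class attribute (the function name is made up). It only rewrites attributes that actually contain endsubsection, and it drops the duplicate if an element happens to carry both classes.

```python
import re

CLASS_ATTR = re.compile(r'class="([^"]*)"')


def merge_endsubsection(html):
    """Rename the endsubsection class token to endsection."""
    def replace(match):
        tokens = match.group(1).split()
        if "endsubsection" not in tokens:
            return match.group(0)  # leave unrelated class attributes untouched
        renamed = ["endsection" if t == "endsubsection" else t for t in tokens]
        unique = list(dict.fromkeys(renamed))  # de-duplicate, keep order
        return 'class="{}"'.format(" ".join(unique))
    return CLASS_ATTR.sub(replace, html)
```

Matching on whole class tokens rather than doing a plain search-and-replace means running text that merely mentions the word is not affected.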


Something I just remembered:

Okay. At first glance that seems clear enough. With the <i> and <cite> that seems straightforward, but the issue heading mentions “titles”. I haven’t looked yet, but from memory, at least in some cases Pali titles have just been put into <p>s with no other attributes that would facilitate targeting. How should this be handled?


Coincidentally, today, I got lost on quite a fascinating sidetrack and discovered that many elements don’t need to be closed.

https://google.github.io/styleguide/htmlcssguide.html#Optional_Tags

https://html.spec.whatwg.org/multipage/syntax.html#optional-tags

(the w3 spec does note that this is only possible when “the document is conforming, in particular, that there are no content model violations”)

I assume you’re talking about the supertitles here?

They can be targeted with a regex, something like <div class="hgroup">\n<p>
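To make that regex idea concrete, a minimal sketch that lists the unclassed <p> lines inside each hgroup. It is slightly broader than the pattern above, in that it picks up any bare <p> in the block rather than only one directly after the opening div, and it assumes the hgroup never contains a nested div. The function name is hypothetical.

```python
import re

HGROUP = re.compile(r'<div class="hgroup">(.*?)</div>', re.S)
BARE_P = re.compile(r"<p>(.*?)</p>", re.S)


def unlabelled_supertitles(html):
    """Return the text of every attribute-less <p> found inside an hgroup."""
    titles = []
    for block in HGROUP.findall(html):
        titles.extend(BARE_P.findall(block))
    return titles
```

Whether a returned line is actually Pali still needs a human (or a word list) to decide; this only narrows down where to look.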

I just did a spot check, and such titles appear to be pretty consistently given like that. However, there is some variation as to exactly what is given in that div, eg:

sv:

<div class="hgroup">
<p class="division">Majjhima Nikāya</p>
<h1>23. Liknelsen om myrstacken</h1>
h1&gt;</div>

no:

<div class="hgroup">
<p class="division">De mellomlange tekstene</p>
<p>Bhayabheravasutta</p>
<h1>4. Frykt og redsel</h1>
</div>

pt:

<div class="hgroup">
<p class="division">Majjhima Nikāya 4</p>
<p>Bhayabherava Sutta</p>
<h1>Medo e Terror</h1>
</div>

A bug, no?

Yes, the supertitles are inconsistent, depending on the sources.

The idea was that when labelling the specific levels, they could be targeted for linkification, providing another navigation aid. I.e.

<p class="division">Majjhima Nikāya</p>

Could be automagically transformed to:

<p class="division"><a href="/mn">Majjhima Nikāya</a></p>

But we never got around to implementing this. Still, it might be a nice idea!
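Just to illustrate what such a transformation might look like, a sketch with a hand-written lookup table. Only the Majjhima Nikāya to /mn pair comes from the example above; the table, the function name, and the idea of skipping anything not in the table are assumptions.

```python
import re

# Hypothetical lookup table: only this one pair is taken from the example above.
DIVISION_LINKS = {"Majjhima Nikāya": "/mn"}

DIVISION = re.compile(r'<p class="division">([^<]+)</p>')


def linkify_divisions(html, links=DIVISION_LINKS):
    """Wrap known division names in a link; leave everything else as it is."""
    def replace(match):
        name = match.group(1).strip()
        href = links.get(name)
        if href is None:
            return match.group(0)  # unknown or translated name: leave untouched
        return f'<p class="division"><a href="{href}">{name}</a></p>'
    return DIVISION.sub(replace, html)
```

Because the pattern only matches a division whose content is plain text, running it twice does nothing the second time. Translated division names such as “De mellomlange tekstene” would need their own table entries, and variants like “Majjhima Nikāya 4” would be skipped until someone decides how to treat them.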

Divisions and vaggas should, in that case, be labelled, but discourse titles need not, as, well, you’re already there.

In any case, I see the problem, we can’t always detect where Pali is used. That’s okay, just apply it where we can, otherwise never mind.