Fixing HTML on legacy texts

Aminah · January 26, 2019, 10:46pm

Of course, it will be included.

Okay.

Aminah · January 31, 2019, 2:12am

There were:

14532 `&nbsp;`
457 `&gt;`
28 `&lt;`

Most of these were introduced by HTML Tidy (they were already in the UVS files). The quote-nbsp option can be set to 0 to prevent them being converted (although, perhaps, they shouldn’t be), but there doesn’t look to be any way to stop the `&gt;` and `&lt;` conversion, and I guess this is actually kind of the point of HTML Tidy.
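A tally like this could be reproduced with a short script. The following is only a sketch in Python, not the method actually used here (that was sed); the function name `count_entities` and the sample string are made up for illustration.

```python
import re
from collections import Counter

# The three entities discussed in this post.
ENTITY_RE = re.compile(r"&(nbsp|gt|lt);")


def count_entities(html: str) -> Counter:
    """Count &nbsp;, &gt; and &lt; occurrences in one HTML string."""
    return Counter(match.group(0) for match in ENTITY_RE.finditer(html))


# Hypothetical sample; real input would be the contents of each HTML file.
sample = "<p>a&nbsp;b &lt; 11 &gt; c&nbsp;d</p>"
counts = count_entities(sample)
print(counts)
```

Summing the counters over every file in the repository would give totals comparable to the figures listed.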

Initially, I had thought these would have been extremely straightforward to deal with, but alas no! Just finding an actual NBSP character was harder than it should be as on every online resource I visited it’s just given as a regular space (I forgot about Linux’s Character Map, where I eventually found it).

Then I picked through mixed usage of the NBSP in the files: different kinds of errors (now cleared) and a few different kinds of intended/meaningful uses. Then I discovered the “Narrow no-break space” (U+202F) character, which the linked article highlighted is the correct character for the French punctuation (depending on whose French you want to go by), but alas I further discovered that U+202F isn’t always supported and can render funny in some browsers.

Anyway, whichever character is used, this article makes the reasonable point that:

If you believe in the principle that source code should be optimized for readability (I do), then you should use the `&nbsp;` escape code, as it makes the nonbreaking space visible and explicit.

HTML Tidy’s default seems to agree with this view. The article further suggests that it might be better to find an alternative solution to the NBSP in the UVS files.

So you’ll need to decide on those points before the remaining `&nbsp;`s are treated (regarding French texts, if they are to be preserved, then those texts that don’t currently have the proper French punctuation spacing probably ought to be updated).

Most of the `&gt;` and `&lt;` were errors and have been deleted, but there are a few instances where the angle brackets were intended, and will have been (or will be) reverted following Tidy-fication. Ones that may want looking at include “< 11 >”, “<22–23>”, “<470-72>”, “<492>” which are page refs to PTS texts (I’m not sure why they haven’t been handled in the usual way).

Praise be to sed!

In this process I bagged two bonuses: (1) noticing that in some cases the publication date attribute has been applied to the text the translator has translated from, not the translation (I haven’t investigated this any further), (2) finding 956 instances of </article> and </section> s in places that deviate from the standard template. Most of these weren’t wrong, per se, but as with the other consistency points, were worth standardizing to make sure nothing gets omitted by the element updating script later on.

Most weren’t wrong, some were (resulting from a regex mishap by the look of things) and I think this one might need to be followed up on, as it looks like some of the text might be missing: sc-data/html_text/vi/pli/sutta/sn/sn53/sn53.45-54.html

This has been added to <cite>.

Also, with respect to division divs I realized it was possible to target at least those supertitles that included “Nikāya” (1/5 of the texts, or 13127 texts). In the process, I noticed that some included a lang attribute and some did not, so I added it for all of them. Similarly, it was possible to add the translate attribute to Pali sutta name supertitles (well, at least 4964 of them; others may have dodged my search term). I’m sure this isn’t perfect, but it does fulfil the “where possible” principle.

As for the <i>, the translate attribute has been added to all the elements with a pre-existing lang attribute (and in addition <span>s being used in this way have been converted to <i> and had the attribute added). But! When initially replying to this point, I forgot about this:

Looking at the ‘plain’ <i>s again (i-element.zip), while again the vast majority of these are used to mark off root language words there are enough instances of deviation to prefer sorting these out before adding the translate attribute. Some just look like plain wrong applications, but others I’m not sure about.
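For the elements that already carry a lang attribute, the span-to-`<i>` conversion could look roughly like the sketch below. It is an illustration only: the value `translate="no"` is an assumption (the thread does not state which value was used), and the regex only handles a `<span>` whose sole attribute is `lang` and which contains no nested spans.

```python
import re

# Matches only <span lang="xx">...</span> with no other attributes.
SPAN_RE = re.compile(r'<span lang="([a-z-]+)">(.*?)</span>', re.DOTALL)


def span_to_i(html: str) -> str:
    """Turn language-marking spans into <i> elements with a translate attribute.

    translate="no" is an assumed value, chosen for illustration.
    """
    return SPAN_RE.sub(r'<i lang="\1" translate="no">\2</i>', html)


converted = span_to_i('<p>the <span lang="pli">dhamma</span></p>')
print(converted)
```

Spans used for other purposes (for example those carrying only a class) are left untouched, which is why the 'plain' cases still need manual review.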

The rest:

All the other items mentioned above (save the hanging questions points, and of course, the basic structure update) have now been done, and can be reviewed on the new ‘html-clean4’ branch I’ve pushed.

Btw, the zz/zz3.html should be updated to remove the listing for endsubsection.

So, what’s the final damage?

Well, there remain some questions, and no doubt there’ll be some other things that pop up before handing over to They Who Write Clever Scripts, but it certainly seems to be looking more orderly than when I began: SChtmlTags-swp4-complete.txt.zip (1.2 KB)

If it looks good to you, perhaps you can merge the pull request:

sujato · January 31, 2019, 6:01am

Okay, that’s looking terrific, I have merged it.

Sometimes I cheat and copy exotic spaces from the Wikipedia article on spaces!

Personally I prefer the Unicode rather than the escape code: escape code is specific to HTML, Unicode is, well, universal. But so long as it’s consistent.

That should probably be on a 2-do list for checking in the future.

Hallelujah!

I have added this to the list of issues with Vietnamese texts.

Done.

Aminah · January 31, 2019, 7:57am

Exactly! That’s what I wanted to do, but at least on the page I checked (and then a bunch of others), it just gives a regular space. It can, indeed, be tough to find a little space in this world!

regular space: " "
Narrow no-break space: " "
No-break space: " "
Figure space: " "

Fair-dos re the preference for Unicode; I agree.

I do have to confess though that I didn’t have a clue about French punctuation style and that at first I just converted all the `&nbsp;`s to regular spaces. It was just a fortunate flash of second thought that led me to recognise and undo my mistake. It kind of smushes my face in the value of having a clear indication in the code that there is a distinct and purposeful character being used (which should be all the more clear now that wrong uses have been removed). It’s also very trivial to convert them if migrating to another language.
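To show how trivial the conversion is in either direction, here is a minimal Python sketch. The function names and the French sample string are hypothetical; U+00A0 is the standard code point for the no-break space.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE


def entity_to_unicode(html: str) -> str:
    """Replace the HTML escape with the literal no-break space character."""
    return html.replace("&nbsp;", NBSP)


def unicode_to_entity(html: str) -> str:
    """Replace the literal no-break space with the visible HTML escape."""
    return html.replace(NBSP, "&nbsp;")


# Hypothetical French sample with spacing before the punctuation marks.
french = "Qui&nbsp;? Moi&nbsp;!"
as_unicode = entity_to_unicode(french)
round_trip = unicode_to_entity(as_unicode)
```

Because the round trip is lossless, the choice between the escape and the raw character can be revisited later without risk, provided one form is used consistently.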

In any case, I’m 100% with you on the being consistent point, so whichever way you think best.

It only concerns 4 files, so won’t be too taxing:

en/pli/sutta/sn/sn11/sn11.2.html:17:<470-72>
en/pli/sutta/sn/sn11/sn11.10.html:45:<492>
en/pli/sutta/sn/sn1/sn1.20.html:34:<22–23>
en/pli/sutta/sn/sn1/sn1.11.html:19:< 11 >
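A search for such angle-bracket page references could be sketched as below. This is an assumed pattern written for illustration, not the command that produced the list above; it matches digits, optionally a hyphen or en dash range, and optional inner spaces.

```python
import re

# "<", optional spaces, digits, optional "-digits" or en-dash range, ">".
PAGE_REF_RE = re.compile(r"<\s*\d+(?:[-\u2013]\d+)?\s*>")


def find_page_refs(text: str) -> list:
    """Return every angle-bracket page reference found in the text."""
    return PAGE_REF_RE.findall(text)


# Hypothetical line mixing real tags with the four kinds of reference.
line = "<p>foo <470-72> bar < 11 > baz <22\u201323> <492></p>"
refs = find_page_refs(line)
print(refs)
```

Ordinary tags such as `<p>` or `<h1>` are not matched, since the pattern requires a digit directly after the opening bracket (ignoring spaces).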

There have been four Vietnamese files that have been haunting me ever since this process began (and have fooled me into rechecking them several times):

vi/pli/sutta/sn/sn18/sn18.22.html
vi/pli/sutta/sn/sn18/sn18.21.html
vi/pli/sutta/sn/sn1/sn1.37.html
vi/pli/sutta/sn/sn2/sn2.8.html

I think it’s also about time to finally clear off those gobbledegook spans. They concern these files:

pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc42.html:24
pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc35.html:25
pli/vinaya/pli-tv-kd/pli-tv-kd10.html:29
pli/vinaya/pli-tv-kd/pli-tv-kd17.html:18
pli/vinaya/pli-tv-kd/pli-tv-kd1.html:211
pli/vinaya/pli-tv-kd/pli-tv-kd1.html:212

Would I be right to think that these will be removed from html_text soonish? In turn, can these ones just be written off?

Wouldn’t you know it! By the looks of things the <em>s are being used for the same function as the <i>s (marking off a root language word), and this should be corrected (by MDN this is just a poor application of the element), but I’ll have a closer look at instances of this before doing so.

sujato · February 12, 2019, 8:15am

I’ve pushed the fixes mentioned here, also a bunch of malformed pts refs in the Vietnamese SN.

I couldn’t identify the issues in the last set of texts (pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc42.html:24 etc.), I guess you’ve gone ahead and fixed them?

Is there anything else left to do on this?

Aminah · February 12, 2019, 8:53am

Brill! I’ll pull and then do the <em> and add nipata numbers (and I think there was one more thing that came up on the list from yesterday… I was just about to clean up the structure principles that emerged yesterday… but then may have got a bit distracted by my morning “what’s new” adventures: New JavaScript Features That Will Change How You Write Regex — Smashing Magazine )

Aminah · June 16, 2019, 2:03pm

As per discussion elsewhere this task plods on (currently having got itself embroiled in correcting VI SN numbers). An update will follow in due course.

However, I wanted to quickly check up on a detail, prompted by the following in connection to po/json texts:

What is the correct way for the legacy texts? Currently, a couple of ways can be found:

<a class="sc" id="1">
(this mostly only appears in root texts, but also can be found elsewhere: nl/…/snp/dubois/, en/…/thig/thanissaro/, en/…/thig/others/, in vagga suttas and also in all but one of the dhp translations, <a class="sc" id="1" data-uid="dhp1">)
<a class="sc" id="sc1"> (this is most common across the legacy translations—where there are ids—and as <a class="sc" id="sc1" data-uid="dhp1"> in the IT dhp)

Would, indeed, be good if this could be explained.

With regard to the AN ones and twos, in the majority of cases (de, en, es*, fr, hu, it, my, nl, ru, si, sl, vi) it’s not applied at all and suttas are just separated by <h1> (well, now <h2> in my working branch), or have been translated as one sutta. In the instances where it is used (ca, cs, es – 1 suttas, id, no – 2 suttas, pt, th) there are a couple of slight variations in application:

<div id="text" lang="cs">
<section id="vagga">
<section class="sutta" id="1" data-uid="an1.1-10">

<div id="text" lang="th">
<section id="vagga" data-uid="an1-10">
<section class="sutta" id="1" data-uid="an1.1">

<div id="text" lang="id">
<section id="vagga">
<section class="sutta" id="1" data-uid="an1.1">

Trying to synthesize what has previously been discussed (particularly with reference to your outline of how vagga-suttas should be handled), and the impression I have from looking over the files that use data-uid, the following is my current thought on what might need to be done (before the great reform by script):

Current (live site)

<div id="text" lang="th">
<section id="vagga" data-uid="an1-10">
<section class="sutta" id="1" data-uid="an1.1">
<article>
<div class="hgroup">
<p class="division">อังคุตตรนิกาย</p>
<p class="subdivision">เอกนิบาต</p>
<h1>1.1</h1>
</div>
<p>…</p>
</article>
</section>
<section class="sutta" id="2" data-uid="an1.2">
<article>
<div class="hgroup">
<h1>1.2</h1>
</div>
<p>…</p>
</article>
</section>
…
<aside id="metaarea">
…

As presently on working HTML branch

<div id="text" lang="th">
<section class="vagga" id="an1.1-10">
<article>
<div class="hgroup">
<p class="division">อังคุตตรนิกาย</p>
<p class="subdivision">เอกนิบาต</p>
<h1>1.1–10</h1>
</div>
<h2 id="an1.1">1.1</h2>
<p>…</p>
<h2 id="an1.2">1.2</h2>
<p>…</p>
…
</article>
<aside id="metaarea">
…

Deduced/proposed

…
<div id="text" lang="th">
<section id="vagga" data-uid="an1-10">
<article>
<div class="hgroup">
<p class="division">อังคุตตรนิกาย</p>
<p class="subdivision">เอกนิบาต</p>
<h1>1.1-10</h1>
</div>
<article>
<h2 class="sutta" id="sc1" data-uid="an1.1">1.1</h2>
<p>…</p>
</article>
<article>
<h2 class="sutta" id="sc2" data-uid="an1.2">1.2</h2>
<p>…</p>
</article>
…
</section>
<aside id="metaarea">
…

(I didn’t yet put in the articles for vagga-sutta suttas as I figured it could be handled by the pending script; but I may as well add them in now if I’m going to revisit these suttas anyway).

sujato · June 17, 2019, 7:33am

The system was built to accommodate a degree of flexibility, so either option would work. However I think it’s best to standardize on one markup and force consistency.

The preferred form is class="sc" id="sc1" as this is less ambiguous.
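Standardizing on that form could be sketched as follows. This is an illustrative Python regex, assuming the attributes always appear in the order `class="sc" id="…"` as in the examples quoted above; files with a different attribute order would need a proper HTML parser.

```python
import re

# Matches <a class="sc" id="N" where N is purely numeric.
SC_ID_RE = re.compile(r'(<a class="sc" id=")(\d+)(")')


def normalize_sc_ids(html: str) -> str:
    """Rewrite id="1" as id="sc1" on <a class="sc"> anchors."""
    return SC_ID_RE.sub(r"\1sc\2\3", html)


fixed = normalize_sc_ids('<a class="sc" id="1" data-uid="dhp1">')
print(fixed)
```

Anchors that already use the `sc` prefix are left alone, so the function is safe to run more than once.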

Okay, so I have a few thoughts on more general things. Forgive me if I’m stating the obvious or obviously wrong, I am revisiting this after some time!

Why are we using divs for text markup at all? Should we not just put the lang attribute on the section or article?

IIRC there was talk of using <main> for this. But it seems to me that <main> is more a property of the overall HTML page: the sutta is the “main” content on the page. It is possible to imagine, however, a context where the content was used and it was not the “main” part of the page; for example, a Dhammapada verse illustrating an image. So <main> should be part of the application logic, not the text markup.

I think it’s a mistake to use id in this kind of case. I believe, IIRC, that I originally used this not really understanding the proper use of classes and IDs. The only reason to use an ID is if you want to link to that thing in the page, but that isn’t the case. Since sutta is a class, vagga should also be one. The same applies to the ID on metaarea: it should be a class. It doesn’t really cause problems ATM, but in principle if we combined material differently we may end up with two identical IDs on the same page.
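The duplicate-ID risk can be checked mechanically. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and the sample simply joins two fragments that each carry `id="metaarea"`.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute value seen in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html: str) -> list:
    """Return the sorted list of id values that occur more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return sorted(i for i, n in collector.ids.items() if n > 1)


fragment = '<article id="mn21"><aside id="metaarea"></aside></article>'
other = '<article id="mn22"><aside id="metaarea"></aside></article>'
clashes = duplicate_ids(fragment + other)
print(clashes)
```

With `class="metaarea"` instead of an ID, the same combined page would report no clashes.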

Also, why do we still have the ugly <div class="hgroup">? Didn’t we decide to replace that with <header>?

Use classless HTML where possible

From the beginning of SC, my inclination has been to use classes as minimally as possible, relying on HTML semantics instead. As the semantics grow more precise, and with the use of shadowDom to isolate elements of a page, the idea of classless HTML is getting some traction. It doesn’t mean classes can’t be used, obviously, but that the core structure is defined as best as possible without classes. That keeps the markup clean, and enhances the reusability of the content.

Considering our current case of a sutta inside a vagga, we can do this with classes, nesting article or section with class “sutta” and “vagga”. But it is only applicable in our system. Also, it is brittle, in the sense that other cases will operate differently. Consider Dhp, where we have a vagga inside a “sutta”.

So let us consider things a little more in terms of the basic semantics. I’m just free-associating here!

An <article> is a coherent piece of text. So let us consider that an article is a sutta: that works pretty well.

Then a group of articles would be a <section>. This might be a vagga, a pannasa, etc. The point is that a section is less coherent; it is part of a document outline (if the nikaya or tipitaka is considered as a document), but does not necessarily contain any thematic coherence.

Then we can use the two together without classes, yet fully articulate the content. Of course, if classes are required for purely styling purposes they can be injected at the front end, they do not need to be in the document itself.

Use `id` to define texts, only use `data-uid` if needed

I can’t see the need for data-uid at all in this example. The current schema seems unnecessarily complex and inconsistent. I would propose the following general principle when defining the overall text in SC:

id is always the canonical SC UID.
Only use data-uid where id does not suffice.

This is implicitly already followed in the “normal” sutta-suttas, eg.:

<section class="sutta" id="mn21">

Which may become a classless article:

<article id="mn21">

The vagga-suttas should also do the same thing. We shouldn’t put text-defining info in <hX> tags (because the ID refers to the whole text, not just the heading).

<section id="an1.1-10">
 <header><h1> …
<article id="an1.1">
<h2> …

Boom.

There may be cases where data-uid is needed, maybe in Dhp, but I can’t see the point here.

Aminah · June 18, 2019, 3:26pm

Yes, but this has already been covered, and you suggested it should be left for Blake or Hongda to convert by a script (in the first instance, I believe, so as to avoid breaking any code that requests stuff, or whatever). Of course, I could implement those changes too and they can just sit in my working branch until they are ready to be merged. That said, given the markup redux thread, it looks like what was previously discussed might need some tweaking.

Yes and no, there was talk of using main (#1365), but not for this (and yes as per the ticket, that would be added to the application page). The idea had been to shift the lang attribute into the <article>.

Okay, great. Thanks.

Yes, again, this was already agreed (see first point of this reply).

Overall, I think this is perfectly excellent, but as with my comment on the markup redux thread, while in many instances the proposed <article>, <section> divisions work well, in at least the case of vagga suttas, I think it might be overly forcing the definition of the elements to fit a nice scheme.

Likewise, thinking on the fly, I might wonder about assuming <article> equals a sutta (ie. no class), unless a vagga class is given (ie <article class="vagga">) (no need to spend any time shooting it down, I mention it already full well knowing it isn’t as pretty and satisfying as the markup alignment you want to make; I’m just sayin’…)

In any case, nice link! Nice articles (well, the couple that I read). Actually, though both bring out a really cool balance of the point, (1) push back against wild class proliferation is better understood in the context of popular frameworks with utterly mental class usage (2) the goal of classlessness is a wonderful idea in terms of social structure (okay, I added that bit), but is a bit OTT in terms of HTML. A better goal is to use classes that “are explicit and reasoned”, not just get rid of them all lest naming pandemonium breaks out as hilariously described in the article.

Neat, because I couldn’t either, but this was just based on looking at what I found in the files and is why I was keen for someone to explain the purpose of its presence.

Good point!

Very good!

Aminah · July 5, 2019, 8:52am

At least Bruce Lawson understands me.

Aminah · August 14, 2019, 9:54pm

I guess it’s about time to update with the latest on this front. Tinkerings to-date have been pushed to the html-clean5 branch for checking and can wait there until Hongda is ready to update the site code as necessary.

A for the first person to report a batch-edit-boo-boo!

What’s been done, in brief:

convert em to i
discover & correct sutta ID errors
add full sutta ID numbers in titles where only partial ID was given (eg. “53. Title” > “8.53. Title”)
moved sutta ID numbers from sub/division p to sutta title h1.
periods added to/removed from sutta titles as suitable
in SN & AN vagga-suttas (where relevant) convert:
- h2 to h3
- h1 to h2
Add <article> to individual vagga-sutta suttas
HTML-spec-aligned markup conversion

You can also check out my enthralling commit history from mid April onwards for more overview (although I have to admit I wasn’t always systematic about what was committed under what message, so all sorts of bits and bobs might also be included in commits as I meandered along)

In detail:

`<em>` & `<i>`

The conversions made here mostly concern pli/san words, but I’ve since found some instances where a second look might be needed. A closer peek is also needed before adding lang attributes, as this can’t (accurately) be done in a blanket fashion.

ID number corrections

Initially this just concerned some odd discoveries here and there, but when trying to make the agreed heading amendments I found the VI title numbers were out of whack so used the original source site (SN, AN) to bring them in sync with SC numbering. As far as my wits and Google Translate can be relied on, I think the numbers are now right, but it might be good if some Vietnamese friends could double check (I wonder, @phineas-pta, would you be able and so kind as to do a few spot checks later on?).

Also, I believe the .fakechapter has been made redundant in the process. Following the original source, as best as I can tell, these did not “[Indicate] what appears to be erroneous or misplaced titles in Vietnamese texts”, but instead matched the vaggas pretty well. There were a few clarification headings the translator seems to have added and they remain as .subheading.

Add full sutta ID numbers in (some) titles

This got more ‘involved’ than I’d have hoped. I’ve covered as many “broad-sweep” pattern variations as I can think of, but as I went about it I found that in many cases language folders (and sub-folders) had to be targeted individually and this is definitely not a complete process.

From where I sit, the whole <div class="hgroup"> (now <header>) across the different language translations just requires individual attention to treat the many slight variations correctly, so to my mind it’s better to leave this as a bumbling mission for later on.
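For the simplest pattern (“53. Title” becoming “8.53. Title”), the rewrite could be sketched like this. It covers only that one variation; the function name and the way the division number is supplied are assumptions for illustration, and the real files needed many more per-folder patterns.

```python
import re

# A title that starts with a bare number, a period and a space.
PARTIAL_RE = re.compile(r"^(\d+)\. ")


def add_full_number(title: str, division: int) -> str:
    """Prefix a partial sutta number with its division number."""
    return PARTIAL_RE.sub(
        lambda m: f"{division}.{m.group(1)}. ", title, count=1
    )


full = add_full_number("53. Title", 8)
print(full)
```

A title that already carries the full number does not match the pattern, so it passes through unchanged.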

Convert vagga-sutta headings as relevant (`h1` > `h2` > `h3` > etc.)

This was limited to just Pali AN, and SN as a broader survey suggested outside of this it probably wasn’t applicable and definitely was too messy. Even with this limitation the amendments had to be done by individual language folder as there was just too much inconsistency to be able to target accurately otherwise.

The purpose of this was to produce a single “range-title” h1 per file with individual sutta titles becoming h2 as previously agreed.
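The order of the two heading conversions matters: h2 has to be demoted before h1, otherwise the freshly created h2s would be demoted a second time. A minimal Python sketch of that ordering (the function name is hypothetical, and it demotes every h1 and h2 it is given, so the range title would have to be handled separately):

```python
import re


def demote_headings(html: str) -> str:
    """Demote h2 to h3 first, then h1 to h2, keeping any attributes."""
    html = re.sub(r"<(/?)h2\b", r"<\1h3", html)
    html = re.sub(r"<(/?)h1\b", r"<\1h2", html)
    return html


demoted = demote_headings('<h1>1.1</h1><h2 class="x">sub</h2>')
print(demoted)
```

Running the substitutions in the opposite order would turn both headings into h3.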

Add `<article>` to individual vagga-sutta suttas

So, where a vagga-sutta contains several suttas under h2 titles, these have each been wrapped in <article> tags.
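As a rough illustration of the wrapping step, the sketch below splits a body on each `<h2` and wraps every resulting chunk in `<article>`. It assumes the input is only the stretch of markup between the header and the footer, with nothing after the last sutta that should stay outside an article; the function name is made up.

```python
import re


def wrap_suttas(body: str) -> str:
    """Wrap each h2 heading and the content after it in <article> tags."""
    # Zero-width split before every <h2, keeping the heading with its text.
    parts = re.split(r"(?=<h2\b)", body)
    out = []
    for part in parts:
        if part.startswith("<h2"):
            out.append(f"<article>\n{part.rstrip()}\n</article>\n")
        else:
            out.append(part)  # anything before the first h2 is kept as-is
    return "".join(out)


body = '<h2 id="an1.1">1.1</h2>\n<p>…</p>\n<h2 id="an1.2">1.2</h2>\n<p>…</p>\n'
wrapped = wrap_suttas(body)
print(wrapped)
```

Files where the translator groups suttas differently will not fit this mechanical treatment.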

It should be noted, that:

there are instances where translators’ handling of a vagga-sutta really doesn’t lend itself to the “every sutta gets an <article>” scheme (eg. DE an2.180-229, sn17.13-20.html, or FR an2.98-117.html).
there are some “vagga-suttas” that are more just “multiple sutta” files (they contain some of the suttas within the vagga-sutta, but not all) (eg. NL an2.11-20, PT sn53.45-54, or FR an2.1-10).
even though it isn’t exactly in accordance with the definition (on account of not falling into a .hgroup) in FR an1.394-574.html and DE an2.130-140.html I’ve used .subheading for clarifying headings the translator has inserted to group smaller clusters of suttas within the vagga-sutta. Personally, I think it wouldn’t be a bad idea to extend the definition (in zz3) as there is a use case (albeit minor, there are maybe a handful, or so of places where it might be used in this way), but if not it’s only been applied in these two instances so can easily be removed.

HTML-spec-aligned markup conversion

The changes follow the template and guiding principles wiki drafted on the basis of what was discussed a little while back. The following tweak mentioned in the “Markup redux” thread has also been incorporated:

In single sutta files:

~~<div id="text" lang="en"> <section class="sutta" id="kv1.1"> <article>~~

has become:

<article id="kv1.1" lang="en">

~~<div class="hgroup"> … </div>~~

has become:

<header>
…
</header>

~~</article> <aside id="metaarea"> … </aside> </section> </div>~~

has become:

<footer class="metaarea">
…
</footer>
</article>
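Put together, the single-sutta conversion could be sketched with three regex substitutions, as below. This is an illustration under stated assumptions, not the script actually used: it assumes the old wrappers appear exactly as in the struck-out lines above, that the hgroup div contains no nested div, and the function name is hypothetical.

```python
import re


def convert_single_sutta(html: str) -> str:
    """Collapse the old div/section/article wrappers into one <article>."""
    # Opening wrappers: move id and lang onto the article itself.
    html = re.sub(
        r'<div id="text" lang="([^"]+)">\s*'
        r'<section class="sutta" id="([^"]+)">\s*<article>',
        r'<article id="\2" lang="\1">',
        html,
    )
    # hgroup div becomes a header (assumes no nested div inside it).
    html = re.sub(
        r'<div class="hgroup">(.*?)</div>',
        r"<header>\1</header>",
        html,
        flags=re.DOTALL,
    )
    # Closing wrappers: the aside becomes a footer inside the article.
    html = re.sub(
        r'</article>\s*<aside id="metaarea">(.*?)</aside>\s*</section>\s*</div>',
        r'<footer class="metaarea">\1</footer>\n</article>',
        html,
        flags=re.DOTALL,
    )
    return html


old = (
    '<div id="text" lang="en">\n<section class="sutta" id="kv1.1">\n<article>\n'
    '<div class="hgroup">\n<h1>Title</h1>\n</div>\n<p>Text</p>\n</article>\n'
    '<aside id="metaarea">\n<p>Meta</p>\n</aside>\n</section>\n</div>'
)
new = convert_single_sutta(old)
print(new)
```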

In vagga-sutta files with individual suttas combined into a single sutta (eg. ID an10.156-166):

As above.

In vagga-sutta files with multiple suttas in articles:

~~<div id="text" lang="it"> <section class="sutta" id="an1.31-40"> <div class="hgroup"> … </div>~~

has become:

<section id="an1.31-40" lang="it">
<header>
…
</header>

~~<aside id="metaarea"> … </aside> </section> </div>~~

has become:

<footer class="metaarea">
…
</footer>
</section>

As mentioned, I can’t see a way to batch handle <header> supertitles, so I think they’ll just have to be plodded through later on.

Other notes about “html-clean5”:

Apart from the HTML-spec-aligned markup conversion, most of the amendments made in this round do not apply to the SI files as they’re a bit of a mess and newer files are pending (I am about to follow up with the chap who’s preparing them for his own site).
I yet have a healthy list of other bibbly-bobbly-cleany things that can be done, but y’know; rainy day 'n’all.
One not “bibbly-bobbly” bit I’m keen to look at, that nevertheless hasn’t been approached yet is adding structured data (perhaps with schema.org’s vocabulary). Of course, this probably overlaps with the RDF ticket, which I’m also quite interested in.

As an aside, while having (almost) endless fun with the Vietnamese texts, I found their publication dates on the source site so put them in. It turns out the AN was published over two years so I put the span in a span (<span class="publication-date">1980–1981</span>). I was then curious how that would be handled using Schema’s datePublished property. I discovered all dates follow the ISO_8601 format, and time intervals just use a /.

ISO_8601 also has a nice solution for approximations which would be relevant for the VI SN. The ‘early 1980s’ is given as the date of publication and there is a pretty ~ to denote the approximate date (although I should mention, I’m going by a review draft of the standard; maybe it’s just down to my being economically limited, but I just find there’s something perverse about having to pay to know how to write the date correctly! It’s only a step or two on from charging for the alphabet.)
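The interval case is simple enough to sketch: a displayed year range such as “1980–1981” becomes “1980/1981”. The function name is hypothetical, and the approximate-date marker is deliberately left out because its exact placement isn’t spelled out here.

```python
import re


def to_iso_interval(label: str) -> str:
    """Turn 'YYYY–YYYY' (hyphen or en dash) into the ISO 8601 'YYYY/YYYY'."""
    text = label.strip()
    match = re.fullmatch(r"(\d{4})\s*[-\u2013]\s*(\d{4})", text)
    if match:
        return f"{match.group(1)}/{match.group(2)}"
    return text  # single years and anything else are returned unchanged


interval = to_iso_interval("1980\u20131981")
print(interval)
```

The human-readable range can stay in the span while the converted value goes into the machine-readable property.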

General notes about the HTML template

What with the rolling on of time and plenty of mulling-over opportunity, it might be good to revisit what was set out in the template and guiding principles wiki. It was put together as a draft so it might be nice to make a “finalized” reference point. Certainly one thing that would be good to check over, tweak and confirm is the supertitle section.

Fun fact: you probably already know, but WHATWG has now become the sole keeper of the HTML standard. The agreement between W3C and WHATWG was signed at the end of May. The passing relevance here is that the zz3 page notes that the <hgroup> is a deprecated element and this is no longer true as the element does exist in the WHATWG spec. Thing is, the WHATWG spec’s a bit rubbish on this, and just as our trusty friend Bruce Lawson says, the reason why <hgroup> wasn’t in the W3C spec is because “it’s completely useless”. This of course, is all happily sidestepped with the implementation of <header> for titles and supertitles.

Aaaaanyway,

Regarding the principle of using Pali terms for structural attribute values, as you noted this “gets complicated when considering non-Pali texts”. While on the one hand stating “we should use the same basic principle of following the naming as used in the sidebar”, it’s also mentioned that “all classes in supertitles should be in Pali.” I don’t see how both of these guiding principles can be followed in the case of non-Pali texts.

Is the basic idea to map non-Pali structure onto the Pali model? In which case, would it be worth explicitly setting out Pali equivalences for the non-Pali texts something like the following (just a quick, rough presentation to sketch the idea):

 <header>
   <p class="nikaya handle">[ DA | MA | SA | EA | Minor Chinese ]</p>
   <p class="nipata handle">[ - | - | - | Jātaka; Sūtrasannipāta; Dhāraṇī; Sūtravyākaraṇa ]</p>
   <p class="pannasaka handle">[ - | - | - | - | - ]</p>
   <p class="vagga handle">[ - | - | SA 1–100 | EA 4 | - ]</p>
   <h1 class="sutta handle">4.4. Sutra title</h1>
 </header>

or…

<header>
  <p class="agama handle">Saṃyuktāgama 1–100</p>
  <h1 class="sutra handle">1. Discourse on Impermanence</h1>
</header>

<header>
  <p class="minor-chinese handle">Jātaka</p>
  <h1 class="sutra handle">152.87 摩調王</h1>
</header>

or…???

I still have a niggling reservation over .handle for supertitles. It works okay, but particularly in view of your point re disambiguating what do you think about .text-title, or just the very straightforward .supertitle [Added: okay, strictly speaking .supertitle wouldn’t fit as the class would also be applied to the h1 title, which thus by definition is not “super”. Another alternative might be .headertitle] ?

Of course, brevity over being long-winded is better; but at the same time, I lean towards clarity over brevity. Both options suitably circumvent any muddle with the title attribute, and are also one character shorter than appellation which you put forward as a viable option.

Also,

In the Markup Redux thread, you were looking for code to cull from the text files and were interested in just beginning at <header>. I pointed out this wouldn’t necessarily work for the legacy files as the author name is given in the <head> area. However, <meta charset="UTF-8"> can certainly get the chop.

And just to remind,

might be good to remember these related tickets when the work on updating the site code begins:

social sharing issue

adding <main>
It was previously mentioned that it would probably be better to make the URLs of pannasakas and such more clearly separate, i.e. an3-dutiya-pannasaka.
It was previously mentioned that we shall have to review the CSS to ensure everything in the new markup is clearly distinguished.

phineas-pta · August 19, 2019, 1:58pm

@Aminah so I’ll have to check all translated texts?

Nadine · August 27, 2019, 10:04pm

@phineas-pta,
Aminah has been away from the forum for several days while on retreat. She will return in roughly a week.

Aminah · September 3, 2019, 1:34pm

Hey, thanks so much for getting back, much appreciated! Likewise, huge thanks for being willing to help out!

To answer your question, no, I don’t think it’s necessary to check everything, just some random checks in the Samyutta Nikaya to make sure that the title numbers are correct. For now, however, there’s no need to do anything; it can wait until the texts have been updated on the site and I’ll let you know when that’s been done.

sujato · September 10, 2019, 8:18am

Sorry about the delayed response, let me go through these.

Okay, cool. As noted in our meeting, revising the text pages is the next step for Hongda. I have focused on the segmented texts, but the legacy texts are included as well. When Hongda is ready to start working on this, we should get him to use this branch.

We should probably assume <i> = <i lang="pli"> unless otherwise defined. It won’t have a practical implication immediately, but I am still looking at, for example, a terminology dictionary where this would be useful.

great, thanks.

Indeed, let’s scrap it.

Okay.

Great, and difficult cases are noted. We should ensure that these cases at least display reasonably sanely, but if the system is a bit wonky, never mind.

Does this need a class? There will only ever be one footer in a sutta. “Footer in a sutta”

Will individual suttas in such cases have IDs?

To make sure we’re on the same page, and include @blake in this, there are two cases:

Where a vagga-sutta has no individual suttas, use the range for ID only.
Where a vagga-sutta includes individual suttas, use the range for the file ID, but sutta IDs internally.
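For the second case, the internal sutta IDs can be derived from the range ID. The sketch below is illustrative only: the function name is made up, and it assumes range IDs of the shape seen in this thread (for example `an1.1-10`), with both ends written out in full.

```python
import re

# Letters, a division number, a dot, then "start-end".
RANGE_RE = re.compile(r"^([a-z]+\d+\.)(\d+)-(\d+)$")


def expand_range(uid: str) -> list:
    """Expand 'an1.1-10' into ['an1.1', ..., 'an1.10']; pass other IDs through."""
    match = RANGE_RE.match(uid)
    if not match:
        return [uid]
    prefix = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if end < start:
        raise ValueError(f"range end before start in {uid!r}")
    return [f"{prefix}{n}" for n in range(start, end + 1)]


sutta_ids = expand_range("an1.1-10")
print(sutta_ids)
```

In the first case (no individual suttas) the range ID is simply used as-is and no expansion is needed.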

I made a list of these debatable texts.

I will take a look at the sutta segments and ensure they are all correct.

Note similar issues are encountered in the Vinaya Sekhiyas and Adhikaranasamathas, and elsewhere, umm, maybe the bhikkhuni patidesaniyas.

In the case of legacy texts, this may well fall into the category of “too messy to fix”, which is fine, but we should be clear about it.

Well, let’s get the HTML finished and live, then we can look at this. Turns out Anders is not looking likely to get this done!

Good to know.

Yeah, not good.

Yes, indeed, we should preserve a proper spec on the wiki.

Best leave it where it is for now, and let’s implement the bilara texts properly, then we can see if there are lessons to be learned for the legacy texts.

One possibility is to create the supertitles dynamically based on bilara-data. Then we could even throw a nice little display:none on the hard-coded supertitles in legacy texts and deliver the same supertitles as the segmented texts. clutches pearls

So sweet, everyone is getting along!

The main point, I think, is simply that we use the same naming system as found in the sidebar. For Pali texts, this will be in Pali, in other cases … let’s see!

Okay, in SA we immediately have a problem: the grouping is simply by hundreds of suttas. This is purely arbitrary, as the Taisho doesn’t represent the samyuttas properly, and the text is in any case disordered… Actually this should be structured to follow the Samyutta structure, but it seems unlikely that this will happen in the Taisho anyway.

For EA, the sidebar number is just ea2, etc. This is not really useful.

Hmm, well perhaps this is a problem in search of a solution. It is only really in the Pali texts that the hierarchy is both spelled out in detail and actively used. Why not limit the scope of this to apply only to Pali texts? In any case, these are the only segmented texts, so other cases are purely theoretical for now. Let’s leave it until we create other segmented texts.

Not really, no.

Umm, not sure about this now. The idea is to have a simple way to target all things that are titles, which would include things both inside the <header> and outside. If it is just stuff in the <header> we can simply scope it that way. But if it is to include titles that are included within vagga-suttas then headertitle or whatever doesn’t work.

Maybe we’re overthinking it. Can we just ignore this idea and hope it goes away?

Indeed.

Aminah · September 10, 2019, 12:40pm

Great. In the meantime I will be adding new legacy texts to this branch.

This does make fair sense for pli folders, but if going by assumption I would assume san for translations from Sanskrit. In lzh folders, I wouldn’t be sure. In any case, the point remains that a margin of error would have to be accepted for this approach because I know <i> is not always being used in this way.
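As a rough sketch of the "assume by folder" idea: the mapping below is only my reading of the discussion (`pli` for pli folders, `san` for translations from Sanskrit, no guess for lzh), and `tag_bare_italics` is a hypothetical helper. It only touches `<i>` tags with no attributes at all, so anything already marked up is left alone, and the margin of error mentioned above still applies.

```python
# Assumed defaults per source-language folder; None means "do not guess".
DEFAULT_I_LANG = {"pli": "pli", "san": "san", "lzh": None}


def tag_bare_italics(html: str, folder: str) -> str:
    """Add a lang attribute to attribute-less <i> tags, based on the folder."""
    lang = DEFAULT_I_LANG.get(folder)
    if lang is None:
        # Unknown or undecided folder: leave the markup untouched.
        return html
    # Only the exact string "<i>" is replaced, so <i lang="..."> or
    # <i class="..."> are not affected.
    return html.replace("<i>", f'<i lang="{lang}">')
```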

Cool. Maybe it would be good to have a general review of zz3. The other thing I have in mind at the moment is the various end classes. I’m now coding some Bengali texts, and either what exactly these classes include needs to be defined more precisely, or some new classes might be needed.

This being so, then no, I don’t think the class is necessary and will remove it.

Yes.

Eg

</header>
<article id="an1.21">
<h2>1.21</h2>
<p>“I don’t envision a single thing that, when undeveloped, is as unpliant as the mind. The mind, when undeveloped, is unpliant.”</p>
</article>
<article id="an1.22">
<h2>1.22</h2>
<p>“I don’t envision a single thing that, when developed, is as pliant as the mind. The mind, when developed, is pliant.”</p>
</article>

As per discussion in one of the other threads, I’ve used the regular ol’ id. One of the “bibbly-bobbly bits” still on the to-do list is to come back and have a look at some remaining data-uids (mostly in the KN, from memory).
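For tracking down the remaining data-uids, something along these lines might do. It is only a sketch using Python's standard `html.parser`; `find_data_uids` is a hypothetical name and the sample markup in the usage note is invented for illustration.

```python
from html.parser import HTMLParser


class DataUidFinder(HTMLParser):
    """Collect every start tag that still carries a data-uid attribute."""

    def __init__(self):
        super().__init__()
        # Each entry is (tag name, data-uid value, id value or None).
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "data-uid" in attr_map:
            self.found.append((tag, attr_map["data-uid"], attr_map.get("id")))


def find_data_uids(html: str) -> list:
    """Return the (tag, data-uid, id) triples found in an HTML string."""
    finder = DataUidFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Run over a file's contents, an empty list would mean the file relies on plain `id` only; for example, `find_data_uids('<section data-uid="example1"></section>')` reports the one `section` element.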

As a principle this is excellent. But again, just to note, some translators’ handlings in some cases will not conform to how this list breaks things down.

Oooh! Fascinating! Sounds neat; looking forward to hearing more further down the road.

As a side point to this, the sidebar uses diacriticals; should the corresponding classes?

I don’t quite follow you on this one, there are several legacy translations from non-Pali texts…

That said, without checking my guess would be that most/all of them only give an āgama level title. If this is so, then I guess it doesn’t need to be thought through beyond replacing the division line in the following example with <p class="agama handle">Saṃyuktāgama</p> (and of course adding a .sutta to the heading)…?:

<article id="sa1" lang="en">
<header>
<p class="division">Saṃyuktāgama</p>
<h1>1. <span class="add">Discourse on Impermanence</span></h1>
</header>

LOL a deft demonstration of solid leadership skills—and I’m not even being sarcastic.

Anyway, in fact, in a meeting a month or so ago I suggested that with the removal of the old .sutta it was now possible to just drop the handle/X class altogether and just follow the guide of only assigning .nikaya/vagga/sutta/etc exclusively to super/titles, but you said that would be too brittle …

sujato · September 10, 2019, 11:37pm

I guess so.

Indeed.

I agree, and the same issue comes up for the Vinaya. Let’s discuss it.

Great!

Sure.

We should use the slug form, i.e. what appears in the URL. These don’t have diacriticals.
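A small sketch of getting from a sidebar title to a diacritic-free form. Note this only strips diacriticals and lowercases; whatever other rules the real slugs follow (spacing, suffixes) are not covered here, and both function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. turning "ā" into "a" and "ṃ" into "m"."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def plain_class_name(title: str) -> str:
    """Lowercased, diacritic-free form of a title (assumed class-name style)."""
    return strip_diacritics(title).lower()
```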

My main point here is that, since for non-Pali texts there is usually either no hierarchy, or the hierarchy is not really actually used, then there is not much advantage to be gained from implementing such a system. As you point out, most of them will simply point to the Agama. And fine, if we can add such a class, great, and it can be linkified. But if it is a hassle, it can be left, I think.

Ahh, but the question is, who is right, past me or present me?

Aminah · September 11, 2019, 6:13am

If by “me” you mean me, then both past me and present me, who have consistently been of the feeling there’s now no need for the handle class, are right.

Thanks for the rest.

Aminah · September 13, 2019, 9:02pm

Out of curiosity (mostly, though perhaps it has application as well): the Chaṭṭha Saṅgāyana edition has some samyuttas with no vaggas, just the samyutta and the suttas (eg SN13, SN16, SN20, or SN21). However, in all the cases I’ve casually checked on the SC sidebar, a vagga has been included with the same name as the samyutta (so in the first of the above examples it goes “…/Abhisamaya Saṃyutta/Abhisamaya Vagga”). Is this a feature of the Mahasangiti edition? If not, were these vaggas added for a particular purpose?

sujato · September 13, 2019, 10:57pm

It is, yes. They nest the folders with a samyutta and a vagga, and in addition the vagga is acknowledged in the file:

<heading>2.1.1 Nakhasikhāsutta</heading>
<span class="hidden">Abhisamayasaṃyutta</span>
<span class="hidden">Abhisamayavagga</span>
<span class="hidden">Nakhasikhāsutta</span>

Note that these source files have something similar to our supertitles, mixing text with structure.
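To illustrate how the structure could be pulled back out of such a file, here is a sketch that collects the text of the `hidden` spans in order. It assumes the spans are not nested and that their order is samyutta, vagga, sutta as in the snippet above; `hidden_levels` is a hypothetical helper.

```python
from html.parser import HTMLParser


class HiddenSpanCollector(HTMLParser):
    """Collect the text content of <span class="hidden"> elements, in order."""

    def __init__(self):
        super().__init__()
        self.levels = []
        self._in_hidden = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "hidden" in classes:
            self._in_hidden = True
            self.levels.append("")

    def handle_data(self, data):
        # Text outside a hidden span (including the <heading>) is ignored.
        if self._in_hidden:
            self.levels[-1] += data

    def handle_endtag(self, tag):
        # Assumes hidden spans contain no nested <span> elements.
        if tag == "span":
            self._in_hidden = False


def hidden_levels(html: str) -> list:
    """Return the stripped text of each hidden span found in the HTML."""
    collector = HiddenSpanCollector()
    collector.feed(html)
    collector.close()
    return [level.strip() for level in collector.levels]
```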


The rest:

Current (live site)

As presently on working HTML branch

Deduced/proposed

Use classless HTML where possible

Use id to define texts, only use data-uid if needed

What’s been done, in brief:

In detail:

<em> & <i>

ID number corrections

Add full sutta ID numbers in (some) titles

Convert vagga-sutta headings as relevant (h1 > h2 > h3 > etc.)

Add <article> to individual vagga-sutta suttas

HTML-spec-aligned markup conversion

In single sutta files:

In vagga-sutta files with individual suttas combined into a single sutta (eg. ID an10.156-166):

In vagga-sutta files with multiple suttas in articles:

Other notes about “html-clean5”:

General notes about the HTML template

Also,
