Fixing HTML on legacy texts

While the quality of our HTML files is generally good, a number of mistakes and inconsistencies have crept in over the years. @Aminah has started fixing these, and I paste her GitHub post below.


I’ve now done a preliminary sweep through /sc-data/html_text, pushed to a new branch html-clean, and opened a pull request in case you wanted to have a glance before merging.

The following have been amended:

<!-- gaiji, … -->
<!doctype html>
<!DOCTYPE HTML>
</ p>
</a">
</A>
</blockquotes>
</br>
</H3>
</P>
<504>
<</h1>
<</p>
<a  href="…">
<a class="…" id="…" id="…">
<a href="…"  target="_blank">
<a href="…" target="_blanc">
<a href="…" target="blank">
<a href="…" target='_blank'>
<a href="…" target=blanc>
<A HREF="…">
<a href='http://a-buddha-ujja.hu/Info/Balint-Gyongyi'>
<a href='http://www.accesstoinsight.org/tipitaka/kn/thag/thag.08.01.bodh.html' rel='nofollow'>
<a href=http://obo.genaud.net/dhamma-vinaya/pts/mn/mn.080.horn.pts.htm target="_blank">
<A id="…" class="…">
<A NAME="…">
<a title=’Pali: cetana’>
<blockquote clas="gatha">
<blockquote class="…"<p>
<br />
<br/>
<bR>
<BR>
<div  id="…" lang="…">
<H3>
<hr/>
<lI>
<meta author="…">
<meta charset="utf-8"/>
<meta charset="utf-8">
<meta charset="utf8">
<meta>
<P>
<section class="…"  id="…">
<span <span class="…">
<span class="…" >
<span class="…" title=">
<span class="…" title='…>
<span claass="…">
<span clas="…">
<span style="“text-decoration:" underline>
<table  class="…">
<|blockquote>

There were a few other items that seemed obviously not quite right, but I had a question mark over how they should be handled, so I have left them for the time being. They are marked with an indent and “???” in the new whittled-down list of unique tags found in /sc-data/html_text: SChtmlTags-swp2.txt. They are as follows, with the relevant file/s given (line numbers correspond to the .txt):

fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.50.html
Line 45: <2.50 is missing in the English text. The following is included in 2.49 there>

vi/pli/sutta/sn/sn18/sn18.22.html
vi/pli/sutta/sn/sn18/sn18.21.html
vi/pli/sutta/sn/sn1/sn1.37.html
vi/pli/sutta/sn/sn2/sn2.8.html
Line 51: <a class="…" id="…" pi>

fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.65.html
Line 80: <again 2.65 is missing in the English text. The following is included in 2.64 there.>

fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.45.html:
fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.19.html:
Line 110: <end of chapter 2>
Line 111: <end of first chapter>
Line 93: <chapter 2>
Line 94: <chapter 3>

es/pli/vinaya/pli-tv-bu-pm.html
de/pli/sutta/kn/ja/ja313.html
de/pli/sutta/kn/ja/ja313.html
de/pli/sutta/kn/ja/ja313.html
Line 112: <font face="Times New Roman"> (used to wrap “ā” character, or troll :smiling_imp:)

pli/vinaya/pli-tv-kd/pli-tv-kd10.html
Line 159: <span a="…

pli/vinaya/pli-tv-kd/pli-tv-kd17.html
Line 160: <span ag="…

pli/vinaya/pli-tv-kd/pli-tv-kd1.html
Line 161: <span ariyasaccan="…

pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc42.html
Line 162: <span asa="…

pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc35.html
Line 163: <span bhikkhuvibha="…

pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc35.html
pli/vinaya/pli-tv-pvr/pli-tv-pvr17.html
Line 166: <span class="…" dissati=""…

pli/vinaya/pli-tv-pvr/pli-tv-pvr16.html
Line 169: <span class="…" id="…" itip="" kesuci="" ni="" pariv="" potthakesu="" title="…">

fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.28.html
Line 194: <The English text is missing chapter 2.28. All the following is included in 2.27 there>
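
A list of unique tags like the one referred to above could be produced roughly as follows. This is only a sketch, not the script actually used for SChtmlTags-swp2.txt: the function names are hypothetical, and it assumes tags contain no literal ">" inside attribute values. Quoted attribute values are collapsed to "…" so that tags differing only in their values count as one shape, while the quote style is kept so that oddities such as target='_blank' still stand out.

```python
import re
from collections import Counter

# Matches an opening or closing tag; comments and doctypes are ignored in this sketch.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# Matches a quoted attribute value, remembering which quote character was used.
VALUE_RE = re.compile(r"""=\s*(["']).*?\1""")


def normalise_tag(tag: str) -> str:
    """Replace each quoted attribute value with an ellipsis, keeping the quote style."""
    return VALUE_RE.sub(r"=\1…\1", tag)


def unique_tags(html: str) -> Counter:
    """Count each distinct tag shape found in a string of HTML."""
    return Counter(normalise_tag(m.group(0)) for m in TAG_RE.finditer(html))


sample = '<p><a class="sc" id="sc1"></a>Text <a class="ms-pa" id="ms-pa2"></a></p>'
for tag, count in sorted(unique_tags(sample).items()):
    print(count, tag)
```

Running the same function over every file and merging the counters would give a corpus-wide list to review by eye.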

I’m guessing the Vinaya ones may be just about to become moot.

I will check the cases you mention in a little while. Meanwhile:

What’s been done with these? They should be okay, if non-standard HTML.

I’m wondering what situation this shows up in. Technically it is possible to have multiple IDs on the same element, but it’s probably good to avoid it unless there is a need.
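
For tracking down where a repeated attribute shows up, a minimal sketch using Python's standard html.parser is given here. The class and function names are hypothetical; the only behaviour relied on is that the parser passes every attribute of a start tag to handle_starttag, repeats included.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttrFinder(HTMLParser):
    """Collect start tags in which the same attribute name appears more than once."""

    def __init__(self):
        super().__init__()
        self.found = []  # (line number, tag name, repeated attribute names)

    def handle_starttag(self, tag, attrs):
        counts = Counter(name for name, _ in attrs)
        repeated = sorted(name for name, n in counts.items() if n > 1)
        if repeated:
            line, _ = self.getpos()
            self.found.append((line, tag, repeated))


def find_duplicate_attrs(html: str):
    """Return a list of (line, tag, repeated attribute names) for one document."""
    finder = DuplicateAttrFinder()
    finder.feed(html)
    finder.close()
    return finder.found


print(find_duplicate_attrs('<p><a class="sc" id="ms-pa" id="ms-pa1714"></a> Other monks</p>'))
```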

There are a few things that still should not be here.

</center>, </font>: These are deprecated tags and should be removed/fixed.

</st1:country-region>, </st1:place>: These are not HTML, but bastard spawn of Microsoft Word.

</style> is valid HTML5, but should not exist in our files: all styles are supplied via CSS stylesheets. If these tags have any associated CSS in the <head>, that too should be deleted.

</subhead-ending>, this is not proper HTML; well, technically it could be a custom element, but I think it’s just a mistake.

</sup> is odd, I don’t think we should use it, but maybe in the metadata?

</u> is valid HTML5, but I don’t think we use it. Unless it is in the text-critical markup?

<a class="…" data-uid="…" id="…">, <a class="…" id="…" data-uid="…">, <section class="…" id="…" data-uid="…">: I think data-uid is no longer used by us. We used to use this to mark “embedded parallels”, but this was replaced by Vimala when she put everything into parallels.json. Maybe check with her and/or Blake and see if these can be removed?

<a data-name="…" data-type="…" id="…">, <a data-type="…" id="…">: not sure what data-name and data-type are doing, so far as I know we don’t use them so I suspect they should be deleted. Best check with Blake.

<a name="…" id="…">, <a name="…">: The name attribute is deprecated in HTML5, it should be replaced with id.

target="_top": I don’t think this should be there, best delete.

target="_blank": I’m not sure that it’s a good idea to hard code this. It makes the site policy hard to enforce consistently. And there are good arguments to be had that it’s simply a bad idea:

They missed one further advantage of not using target="_blank": watching a Mac user open things in a new tab. Try it, it’s hilarious! I’m like, “middle click, it opens in a new tab”. “Sure”, they say smugly, “I can do that. You just wiggle in the shiny thing here, open that and invoke this; sacrifice a chicken in the dead of night, stand on your head and recite the entire Torah, and it’s all good.”

How about we just delete all target="_blank", and if users riot we can apply them client side?
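
Deleting the attribute in bulk could look something like the sketch below. It is a regex-based assumption rather than anything taken from the html-clean branch, the name strip_target is hypothetical, and it removes every target attribute on <a> tags regardless of its value, which also catches the misspelt variants such as target="_blanc".

```python
import re

# A target attribute with a double-quoted, single-quoted, or bare value,
# together with the whitespace in front of it.
TARGET_RE = re.compile(r"""\s+target\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)
# Only <a ...> start tags are rewritten, so body text is never touched.
A_TAG_RE = re.compile(r"<a\b[^>]*>", re.IGNORECASE)


def strip_target(html: str) -> str:
    """Remove the target attribute from every <a> start tag."""
    return A_TAG_RE.sub(lambda m: TARGET_RE.sub("", m.group(0)), html)


print(strip_target('<a href="…" target="_blank">link</a>'))
```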

<article style="direction: rtl;">, <div dir="rtl">, <em dir="rtl">: I’m not familiar with the appropriate conventions for marking right-to-left text (eg. Hebrew, Arabic). Can you check this and ensure we are following best practices?

<em lang="…">, <span lang="…">: this doesn’t seem right. If a text is quoting an alternate language (usually Pali), then use <i lang="…">.

<img src="http://ad.wz.cz/openx/www/delivery/avw.php?zoneid=27&amp;cb=123&amp;n=a5977468" alt="">: What the dickens! There are no images in our files, and this looks like malware.

<li style="list-style: none">: Normally this should not be needed.

<meta: I am not all that familiar, but again, please check and make sure we are following best standards.

<p align="center">, <p align="left">. Yeah, no. The align attribute is obsolete in HTML5.

<p hidden>. Also no.

<p lang="…" class="…">, <p lang="…">: Is there a valid reason to override the document language? Normally this is only needed for occasional foreign words, which should take an <i> tag, I can’t imagine why there would be a whole paragraph in a different language.

<p style="text-indent: 0em; margin-left: 0em">,
<span style="text-decoration: underline">, <span style="visibility: hidden">: And … no. There should be no inline styles whatsoever. But it would be worth checking why these are present.
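
To check why inline styles are present before removing them, it may help to tally which style values occur on which tags. The following is a minimal sketch with hypothetical names, again using the standard html.parser; it only reports and does not modify anything.

```python
from collections import Counter
from html.parser import HTMLParser


class InlineStyleFinder(HTMLParser):
    """Tally every (tag, style value) pair found on start tags."""

    def __init__(self):
        super().__init__()
        self.styles = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style":
                self.styles[(tag, value)] += 1


def find_inline_styles(html: str) -> Counter:
    """Return a Counter keyed by (tag name, style attribute value)."""
    finder = InlineStyleFinder()
    finder.feed(html)
    finder.close()
    return finder.styles


sample = '<p style="text-indent: 0em; margin-left: 0em">a<span style="visibility: hidden">*</span></p>'
for (tag, style), count in find_inline_styles(sample).items():
    print(count, tag, style)
```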

<span style="background: var(--sc-…">, <span style="color: var(--sc-…">: these are weird, they use the Polymer mixins for styles, but what are they doing? I think they should go.

<table class="…" style="background-color: white">: and once more with the inline styles. As for tables, there’s not much use for them at all, although they are used occasionally in Abhidhamma translations.

<td width="…" valign="TOP">: Once more with feeling … no.

Just removing a space so there’s only one listing for it. Originally there were two:

<!-- gaiji, … -->

and

<!--gaiji, … -->

Not really necessary, but well, y’know…

There were two instances of these:

<a class="sc" id="ms-pa" id="ms-pa1714">

and

<a class="sc" id="ms-pa" id="ms-pa1779">

It’s good that you highlight it though, because I just checked more closely and see that the surrounding context shows the change wasn’t right, which produced a double ms-pa, and I also see there was a bit of a glitch in the text as well. I’ve now amended:

Original pli-tv-bu-vb-ss11, ln 36:

<p><a class="ms-pa" id="ms-pa1714"></a> Other monks,<a class="sc" id="ms-pa" id="ms-pa1714"></a> Other monks,<a class="sc11"></a> those who see it or hear about it …

To:

<p><a class="sc" id="sc11"></a><a class="ms-pa" id="ms-pa1714"></a>Other monks, those who see it or hear about it …

Original pli-tv-bu-vb-ss13, ln 70:

<p><a class="ms-pa" id="ms-pa1779"></a> Other monks,<a class="sc" id="ms-pa" id="ms-pa1779"></a> Other monks,<a class="sc27"></a> those who see it or hear about it …

To:

<p><a class="sc" id="27"></a><a class="ms-pa" id="ms-pa1779"></a>Other monks, those who see it or hear about it

Much thanks, as you know, the impetus for this exercise was my efforts to learn more about the basics.

Much thanks for all the rest, I’ve only done an extremely quick skim, but will go through it all more closely shortly.

Oh, okay.

Good, thanks.

Me too! I never knew that before.

@Aminah I’m having another look at these cases, and I’m not 100% clear about what remains to be done. When you get the chance, can you repost the outstanding issues?

Yeah, 'fraid that had to go on hold, so it isn’t any further along than as reported above. Now that I mostly have a desk again, I can resume over the weekend.

No worries, I have 700 corrections to get through on Pootle, mostly just looking for something—anything!—more interesting.

:joy:

Several world contractions and expansions later…

Finally getting back round to this list. Ayya @vimala & @blake do you have any input on the above quoted items?

No, the data-uid are not used for embedded parallels. I have removed all the code for those. The data-uid are used for marking things in files like the Dhammapada.

Can be removed.

So they should be retained? If so, can you write an explainer somewhere (maybe on the /zz page?) to say what they do? :pray:

@sujato I’ve now updated the html-clean branch and picked off some of the above. The related pull request has been automatically updated.

Naturally, you and Blake will have the best idea of how to proceed from here.

In terms of what’s changed:

Removed

</center>
</font>
</st1:country-region>
</st1:place>
</style>
</subhead-ending>
<a target="_blank" href="…"> (target="_blank" removed)
<a target="_top" href="…"> (target="_top" removed)
<center>
<li style="list-style: none">
<p align="center">
<p align="left">
<p style="text-indent: 0em; margin-left: 0em">
<font face="Times New Roman">
<td width="…" valign="TOP">
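
Removals of whole tags such as <center>, <font> and the Word st1:* elements could be scripted along these lines. This is a hedged sketch, not a record of how the branch was actually edited: the function name is hypothetical, and it assumes the wrapped text should be kept and only the tags themselves dropped.

```python
import re

# Opening and closing forms of the unwanted tags, with any attributes.
UNWANTED_RE = re.compile(r"</?(?:center|font|st1:[\w-]+)\b[^>]*>", re.IGNORECASE)


def unwrap_unwanted(html: str) -> str:
    """Delete the unwanted tags but keep the text they wrapped."""
    return UNWANTED_RE.sub("", html)


print(unwrap_unwanted('<p><font face="Times New Roman">ā</font>nanda</p>'))
```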

Replaced

<em lang="…"> replaced with <i …>
<a name="…" id="…"> replaced with class="…"
<a name="…"> replaced with id attributes, but I’m slightly suspicious of these ones altogether

relevant files

fr/pli/sutta/kn/iti/iti83.html:18:<a name="devas83">
vi/pli/sutta/sn/sn12/sn12.50.html:20:<a name="_Toc430081987">
id/pli/sutta/sn/sn35/sn35.234.html:17:<a name="udayi_234">
id/pli/sutta/sn/sn56/sn56.10.html:18:<a name="MemutarRodaDhamma">
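
The name-to-id replacement could be sketched as below. The names are hypothetical, and one choice here is purely an assumption: when a tag already carries an id, the name attribute is simply dropped rather than converted. Whether anchors like these should survive at all is a separate question.

```python
import re

A_TAG_RE = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
# A quoted name attribute, capturing the leading whitespace and the "=value" part.
NAME_RE = re.compile(r"""(\s)name(\s*=\s*(?:"[^"]*"|'[^']*'))""", re.IGNORECASE)
ID_RE = re.compile(r"\sid\s*=", re.IGNORECASE)


def name_to_id(html: str) -> str:
    """Turn name="…" into id="…" on <a> tags, or drop name if an id already exists."""

    def fix(match):
        tag = match.group(0)
        if ID_RE.search(tag):
            # Assumption: the existing id wins and the name is discarded.
            return NAME_RE.sub("", tag)
        return NAME_RE.sub(r"\1id\2", tag, count=1)

    return A_TAG_RE.sub(fix, html)


print(name_to_id('<a name="devas83"></a>'))
```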

Partial amendments & questions

(some instances of the below tags have been cleared or amended, but a Q remains over others and they’ve been left unchanged)

:question: </sup>

used in several instances in the Patna Dharmapada. Eg. SuttaCentral. Necessary special character?

:question: </u>

used in:

  • Several Udānavarga files in metre marking. Eg. SuttaCentral

  • Also one instance found in pli-tv-kd1/de/ in what probably should belong to an add class: (seltene Konstruktion des Textes, in Singh. Version: Wenn (sie)<u>nicht</u> den tiefen, unübertroffenen Bereich ...) (as an aside the MT reference numbers for this are a bit confusing, what’s the point of having several MT 1s in the same text?)

:question: <p class="…" lang="…">

This is used in Tamil snp texts to indicate a Pali title (Eg. snp4.16). It is a title, which is why it isn’t in <i> tags. I don’t know if this answers to your ‘valid reason’, but it at least makes sense. That said, I seem to recall that at one point recently-ish you suggested Pali titles should not be included in translations???

:question: <p lang="…"> – used in:

  • Sindarin kp9; I’m wondering if it’s in some way necessary for the special script?

  • san-lo-bu-pn14.html;

    <p lang="xct" class="scribe">
    <a class="sc" id="7"></a>
    ’phag-spa dge-’dun-phal-chen-pai ’jig rten-las-’das-par-smra-bai dge- sloṅ-gi
    ’dul-ba sil-bu rdzogs-so ||</p>

  • ko/pli/sutta/mn/mn63.html; again for a Pali sutta title

:question: <p hidden>

This only concerns the pt sn3.13 file. Checking it against the en and pli, I’m wondering if there is a valid reason why the paragraphs at the end of this sutta were hidden. Perhaps @Gabriel_L can help?

:question: <span style="visibility: hidden">

used in san dk files to hide asterisks. Eg. dk5 ln.25. Seems to be serving some deliberate purpose.

:question: <span style="background: var(--sc-…"> & <span style="color: var(--sc-…">

These Polymer mixins only concern https://suttacentral.net/zz2/zz/test

:question: <table class="…" style="background-color: white">

Styling has been removed from tables, but in terms of where tables are used:

files with tables

(excluding zz files)
en/pli/abhidhamma/ds/ds2.1.2.html
en/pli/sutta/dn/dn14.html
en/pli/sutta/dn/dn14.html
de/pli/sutta/sn/sn34/sn34.46-49.html
de/pli/sutta/sn/sn34/sn34.28-34.html
de/pli/sutta/sn/sn34/sn34.35-40.html
de/pli/sutta/sn/sn34/sn34.12.html
de/pli/sutta/sn/sn34/sn34.50-52.html
de/pli/sutta/sn/sn34/sn34.14.html
de/pli/sutta/sn/sn34/sn34.17.html
de/pli/sutta/sn/sn34/sn34.15.html
de/pli/sutta/sn/sn34/sn34.20-27.html
de/pli/sutta/sn/sn34/sn34.13.html
de/pli/sutta/sn/sn34/sn34.18.html
de/pli/sutta/sn/sn34/sn34.16.html
de/pli/sutta/sn/sn34/sn34.41-45.html
de/pli/sutta/sn/sn34/sn34.53-54.html
de/pli/sutta/sn/sn34/sn34.55.html
de/pli/sutta/sn/sn34/sn34.19.html
de/pli/sutta/sn/sn34/sn34.11.html

In addition the Qs from the OP remain.

We’re down from the initial 253 items to 167 with still some easy ones to pick off! :grin:

Final tag list after this round

<2.50 is missing in the English text. The following is included in 2.49 there>

Hmm, 'pparently a teeny-weeny 167 item long list is still too long to post in a summary box so here’s a zip: html-clean2.zip (1.3 KB)

I would retain them for the time being. They are used in all division-length texts. It ensures that the links to parallels actually go to the right place. So for instance dhp42 = dhp#42. And it might do a few more things too having to do with the menu structure, etc. But I’ve been out of it for too long to really know the full implications at the moment. Maybe @Blake can say a bit more about this.
I’d be happy to put something on zz but as I do not know the full implications right now I’m a bit reluctant to do so.

Okay, no worries, they’re not hurting anyone, we can leave them.

Yes, leave this.

These are definitely mistakes, remove them.

Also remove.

None whatsoever!

Okay fine.

As a rule yes, but if they are present in legacy texts, never mind. It’s mainly for the segmented translations, because the Pali is always available.

Okay, leave these.

Okay, but they should use a class rather than inline styles.

But they are hard! I wanted them to go away! :crying_cat_face:

I’ll get started with the dates.

Am I correct to think that we are GO as far as moratoriums on changes???

I don’t see a reason why it would be hidden. Note that I was not involved in preparing the HTML files for those translations (my task was to move the translations to clear .txt files). Was it done by Venerable @Vimala?

Thanks, let’s assume it was a regex bug and remove it!

Oh so sorry, I tagged you in because, well… you speak Portuguese, but also because the file says that you and Bhante Sujato prepared it. Must be a mistake.

Yes. By yes, I meant there are changes waiting there in the html-clean branch which is still a branch, but that I will not be doing anything further until you indicate it is okay. I have not merged the branch because

  1. I do not want to make so many changes to sc-data master without any 2nd party review and as per above it’s best if you or Blake accept them.

  2. there was a conflict with the MVK before I updated the branch and now some new af snp conflicts (GH says they’re too complicated to view, so I’ll do a diff)

I checked the original translation by Beisert and it turns out that indeed the last paragraph was not translated!
