Fixing HTML on legacy texts


Of course, it will be included.



There were:

  • 14532 &nbsp;
  • 457 &gt;
  • 28 &lt;

Most of these were introduced by HTML Tidy (the characters themselves were already in the UVS files). The quote-nbsp option can be set to 0 to prevent the NBSPs being converted (although, perhaps, it shouldn’t be), but there doesn’t look to be any way to stop the &gt; and &lt; conversion, and I guess this is actually kind of the point of HTML Tidy.
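For anyone wanting to reproduce counts like the ones above, here is a minimal Python sketch. It is not the method actually used (that was sed and friends); the function names and the `*.html` glob are assumptions for illustration only.

```python
import re
from collections import Counter
from pathlib import Path

# The three entities counted above; extend the alternation if others matter.
ENTITY_RE = re.compile(r"&(?:nbsp|gt|lt);")


def count_entities(text: str) -> Counter:
    """Count occurrences of &nbsp;, &gt; and &lt; in a string of HTML."""
    return Counter(m.group(0) for m in ENTITY_RE.finditer(text))


def count_in_tree(root: str) -> Counter:
    """Sum the counts over every .html file below root (hypothetical layout)."""
    total = Counter()
    for path in Path(root).rglob("*.html"):
        total.update(count_entities(path.read_text(encoding="utf-8")))
    return total


sample = "<p>Oui&nbsp;! &lt; 11 &gt; et&nbsp;non</p>"
counts = count_entities(sample)
print(counts)
```

Running `count_in_tree` on a checkout before and after Tidy would show which entities Tidy itself introduced.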

Initially, I had thought these would have been extremely straightforward to deal with, but alas no! Just finding an actual NBSP character was harder than it should be as on every online resource I visited it’s just given as a regular space (I forgot about Linux’s Character Map, where I eventually found it).

Then I picked through mixed usage of the NBSP in the files: different kinds of errors (now cleared) and a few different kinds of intended/meaningful uses. Then I discovered the “Narrow no-break space” (U+202F) character, which the linked article highlighted is the correct character for the French punctuation (depending on whose French you want to go by), but alas I further discovered that U+202F isn’t always supported and can render funny in some browsers.

Anyway, whichever character is used, this article makes the reasonable point that:

If you believe in the principle that source code should be optimized for readability—I do—then you should use the &nbsp; escape code, as it makes the nonbreaking space visible and explicit.

HTML Tidy’s default seems to agree with this view. The article further suggests that it might be better to find an alternative solution to the NBSP in the UVS files.

So you’ll need to decide on those points before the remaining &nbsp;s are treated (regarding French texts, if the NBSPs are to be preserved, then those texts that don’t currently have the proper French punctuation spacing probably ought to be updated).
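If the French spacing does get updated, something along these lines might do as a starting point. It is only a sketch: it assumes the rule is "no-break space before ; : ! ? » and after «", it works on plain text rather than parsed HTML (so attribute values would need excluding), and the function name is made up.

```python
import re

NBSP = "\u00a0"  # swap for "\u202f" if the narrow no-break space is preferred

# A regular space directly before ; : ! ? or », or directly after «.
BEFORE_RE = re.compile(r" (?=[;:!?»])")
AFTER_RE = re.compile(r"(?<=«) ")


def fix_french_spacing(text: str, space: str = NBSP) -> str:
    """Replace regular spaces around French high punctuation with `space`."""
    text = BEFORE_RE.sub(space, text)
    return AFTER_RE.sub(space, text)


fixed = fix_french_spacing("Il dit : « Venez ! »")
print(repr(fixed))
```

Keeping the replacement character as a parameter means the NBSP versus U+202F decision can be made later without touching the patterns.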

Most of the &gt; and &lt; were errors and have been deleted, but there are a few instances where the angle brackets were intended, and these have been (or will be) reverted following Tidy-fication. Ones that may want looking at include “< 11 >”, “<22–23>”, “<470-72>”, “<492>”, which are page refs to PTS texts (I’m not sure why they haven’t been handled in the usual way).
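To hunt down any further bracketed page refs of this shape, a pattern like the one below could be used. It is a sketch under the assumption that the refs are always digits (optionally a range with a hyphen or en dash) wrapped in either raw angle brackets or their entities; the names are illustrative.

```python
import re

# Matches "< 11 >", "<22–23>", "&lt;470-72&gt;" and so on, but not real tags
# such as <p>, because a digit is required straight after the opening bracket.
PAGE_REF_RE = re.compile(
    r"(?:<|&lt;)\s*(\d+(?:\s*[–-]\s*\d+)?)\s*(?:>|&gt;)"
)


def find_page_refs(text: str) -> list:
    """Return the numeric part of every bracketed page reference in text."""
    return PAGE_REF_RE.findall(text)


sample = "see &lt; 11 &gt; and <22–23> and &lt;470-72&gt; and <492> but <p>"
refs = find_page_refs(sample)
print(refs)
```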

Praise be to sed!

In this process I bagged two bonuses: (1) noticing that in some cases the publication date attribute has been applied to the text the translator has translated from, not the translation (I haven’t investigated this any further), and (2) finding 956 instances of </article>s and </section>s in places that deviate from the standard template. Most of these weren’t wrong, per se, but as with the other consistency points, were worth standardizing to make sure nothing gets omitted by the element updating script later on.

Most weren’t wrong, some were (resulting from a regex mishap by the look of things) and I think this one might need to be followed up on, as it looks like some of the text might be missing: sc-data/html_text/vi/pli/sutta/sn/sn53/sn53.45-54.html

This has been added to <cite>.

Also, with respect to division divs I realized it was possible to target at least those supertitles that included “Nikāya” (1/5 of the texts, or 13127 texts). In the process, I noticed that some included a lang attribute and some did not, so I added it for all of them. Similarly, it was possible to add the translate attribute to Pali sutta name supertitles (well, at least 4964 of them; others may have dodged my search term). I’m sure this isn’t perfect, but it does fulfil the “where possible” principle.

As for the <i>, the translate attribute has been added to all the elements with a pre-existing lang attribute (and in addition <span>s being used in this way have been converted to <i> and had the attribute added). But! When initially replying to this point, I forgot about this:
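Roughly, the <span> to <i> conversion and attribute addition described here could look like this in Python. This is a sketch and not the commands actually run: it assumes the spans carry nothing but a lang attribute, are not nested, and that the value wanted is translate="no".

```python
import re

# Assumed shapes: <span lang="xx">…</span> and <i lang="xx">…</i> with no
# other attributes. Regex on HTML is fragile, so review the diff afterwards.
SPAN_RE = re.compile(r'<span lang="([^"]+)">(.*?)</span>', re.DOTALL)
I_LANG_RE = re.compile(r'<i lang="([^"]+)">')


def mark_untranslatable(html: str) -> str:
    """Turn lang-only spans into <i> and add translate="no" to <i lang=…>."""
    html = SPAN_RE.sub(r'<i lang="\1" translate="no">\2</i>', html)
    # Elements already converted above no longer match, so this is idempotent.
    return I_LANG_RE.sub(r'<i lang="\1" translate="no">', html)


sample = '<p><span lang="pi">dhamma</span> and <i lang="pi">sutta</i> and <i>plain</i></p>'
result = mark_untranslatable(sample)
print(result)
```

Plain <i> elements without a lang attribute are deliberately left alone, in line with the caveat that follows.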

Looking at the ‘plain’ <i>s again, while the vast majority of these are used to mark off root language words, there are enough instances of deviation to prefer sorting these out before adding the translate attribute. Some just look like plain wrong applications, but others I’m not sure about.

The rest:

All the other items mentioned above (save the hanging question points and, of course, the basic structure update) have now been done, and can be reviewed on the new ‘html-clean4’ branch I’ve pushed.

Btw, the zz/zz3.html should be updated to remove the listing for endsubsection.

So, what’s the final damage?

Well, there remain some questions, and no doubt there’ll be some other things that pop up before handing over to They Who Write Clever Scripts, but it certainly seems to be looking more orderly than when I began.

If it looks good to you, perhaps you can merge the pull request:


Okay, that’s looking terrific, I have merged it.

Sometimes I cheat and copy exotic spaces from the Wikipedia article on spaces!

Personally I prefer the Unicode rather than the escape code: escape code is specific to HTML, Unicode is, well, universal. But so long as it’s consistent.

That should probably be on a 2-do list for checking in the future.


I have added this to the list of issues with Vietnamese texts.



Exactly! That’s what I wanted to do, but at least on the page I checked (and then a bunch of others), it just gives a regular space. It can, indeed, be tough to find a little space in this world!

regular space: " "
Narrow no-break space: " "
No-break space: " "
Figure space: " "
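Since these all look identical on screen, a quick way to tell them apart is to ask Python for the code point and Unicode name of each one. A small sketch, with the helper name made up:

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The four spaces listed above, written as escapes so they stay distinguishable.
for ch in (" ", "\u202f", "\u00a0", "\u2007"):
    print(describe(ch))
```

Pasting a suspect character from a file into `describe` shows straight away whether it is a regular space or one of the no-break variants.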

Fair-dos re the preference for Unicode; I agree.

I do have to confess though that I didn’t have a clue about French punctuation style and that at first I just converted all the &nbsp;s to regular spaces. It was just a fortunate flash of second thought that led me to recognise and undo my mistake. It kind of smushes my face in the value of having a clear indication in the code that there is a distinct and purposeful character being used (which should be all the more clear now that wrong uses have been removed). It’s also very trivial to convert them if migrating to another language.

In any case, I’m 100% with you on the being consistent point, so whichever way you think best.
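Whichever way it goes, switching between the two forms is a one-liner each way. A minimal sketch (function names are just for illustration), deliberately touching only the NBSP so that &lt; and &gt; stay escaped:

```python
NBSP = "\u00a0"


def entity_to_char(html: str) -> str:
    """Replace every &nbsp; entity with the literal U+00A0 character."""
    return html.replace("&nbsp;", NBSP)


def char_to_entity(html: str) -> str:
    """Replace every literal U+00A0 character with the &nbsp; entity."""
    return html.replace(NBSP, "&nbsp;")


source = "<p>Oui&nbsp;! &lt;11&gt;</p>"
as_char = entity_to_char(source)
print(repr(as_char))
print(char_to_entity(as_char) == source)
```

A general unescape (for example `html.unescape`) would also decode &lt; and &gt;, which is why a targeted replace is used here.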

It only concerns 4 files, so won’t be too taxing:

en/pli/sutta/sn/sn1/sn1.11.html:19:< 11 >

There have been four Vietnamese files that have been haunting me ever since this process began (and have fooled me into rechecking them several times):


I think it’s also about time to finally clear off those gobbledegook spans. They concern these files:


Would I be right to think that these will be removed from html_text soonish? In turn, can these ones just be written off?

Wouldn’t you know it! By the looks of things the <em>s are being used for the same function as the <i>s (marking off a root language word), and this should be corrected, since going by MDN this is just a poor application of the element, but I’ll have a closer look at instances of this before doing so.


I’ve pushed the fixes mentioned here, also a bunch of malformed pts refs in the Vietnamese SN.

I couldn’t identify the issues in the last set of texts (pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc42.html:24 etc.), I guess you’ve gone ahead and fixed them?

Is there anything else left to do on this?


Brill! I’ll pull and then do the <em> and add nipata numbers (and I think there was one more thing that came up on the list from yesterday… I was just about to clean up the structure principles that emerged yesterday… but then may have got a bit distracted by my morning “what’s new” adventures: New JavaScript Features That Will Change How You Write Regex — Smashing Magazine :heart_eyes:)