While the quality of our HTML files is generally good, a number of mistakes and inconsistencies have crept in over the years. @Aminah has started fixing these, and I paste her GitHub post below.
I’ve now done a preliminary sweep through /sc-data/html_text, pushed to a new branch html-clean, and opened a pull request if you wanted to have a glance before merging.
The following have been amended:
<!-- gaiji, … -->
<!doctype html>
<!DOCTYPE HTML>
</ p>
</a">
</A>
</blockquotes>
</br>
</H3>
</P>
<504>
<</h1>
<</p>
<a href="…">
<a class="…" id="…" id="…">
<a href="…" target="_blank">
<a href="…" target="_blanc">
<a href="…" target="blank">
<a href="…" target='_blank'>
<a href="…" target=blanc>
<A HREF="…">
<a href='http://a-buddha-ujja.hu/Info/Balint-Gyongyi'>
<a href='http://www.accesstoinsight.org/tipitaka/kn/thag/thag.08.01.bodh.html' rel='nofollow'>
<a href=http://obo.genaud.net/dhamma-vinaya/pts/mn/mn.080.horn.pts.htm target="_blank">
<A id="…" class="…">
<A NAME="…">
<a title=’Pali: cetana’>
<blockquote clas="gatha">
<blockquote class="…"<p>
<br />
<br/>
<bR>
<BR>
<div id="…" lang="…">
<H3>
<hr/>
<lI>
<meta author="…">
<meta charset="utf-8"/>
<meta charset="utf-8">
<meta charset="utf8">
<meta>
<P>
<section class="…" id="…">
<span <span class="…">
<span class="…" >
<span class="…" title=">
<span class="…" title='…>
<span claass="…">
<span clas="…">
<span style="“text-decoration:" underline>
<table class="…">
<|blockquote>
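For anyone wanting to repeat this kind of sweep, a list of unique tags can be gathered with a short script. The following is only a minimal sketch, not the method actually used for the pull request: the function names are made up for illustration, and collapsing double-quoted attribute values to "…" is an assumption about how the list above was condensed.

```python
import re
from collections import Counter
from pathlib import Path

# Deliberately loose: anything from "<" up to the next ">" counts as a tag,
# so malformed ones such as "</ p>" or "<|blockquote>" are collected too.
TAG_RE = re.compile(r"<[^<>]*>")
# Double-quoted attribute values, replaced by an ellipsis so that tags
# differing only in their values collapse into a single entry.
VALUE_RE = re.compile(r'"[^"]*"')


def normalise(tag: str) -> str:
    """Replace every double-quoted attribute value with "…"."""
    return VALUE_RE.sub('"…"', tag)


def unique_tags(text: str) -> Counter:
    """Count each distinct (normalised) tag found in one document."""
    return Counter(normalise(m.group(0)) for m in TAG_RE.finditer(text))


def sweep(root: str) -> Counter:
    """Accumulate tag counts over every .html file below root (hypothetical helper)."""
    totals = Counter()
    for path in Path(root).rglob("*.html"):
        totals.update(unique_tags(path.read_text(encoding="utf-8", errors="replace")))
    return totals
```

Sorting the keys of the resulting counter gives a list that can be read through by eye. Tag names are left in their original case on purpose, so that variants like `<BR>` and `<bR>` stay visible.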
There were a few other items that seemed obviously not quite right, but I had a question mark over how they were to be handled, so I have left them for the time being. They are marked with an indent and “???” in the new whittled-down list of unique tags found in /sc-data/html_text: SChtmlTags-swp2.txt. They are as follows, with the relevant file/s given (line numbers correspond to the .txt):
fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.50.html
Line 45: <2.50 is missing in the English text. The following is included in 2.49 there>
vi/pli/sutta/sn/sn18/sn18.22.html
vi/pli/sutta/sn/sn18/sn18.21.html
vi/pli/sutta/sn/sn1/sn1.37.html
vi/pli/sutta/sn/sn2/sn2.8.html
Line 51: <a class="…" id="…" pi>
fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.65.html
Line 80: <again 2.65 is missing in the English text. The following is included in 2.64 there.>
fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.45.html:
fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.19.html:
Line 110: <end of chapter 2>
Line 111: <end of first chapter>
Line 93: <chapter 2>
Line 94: <chapter 3>
es/pli/vinaya/pli-tv-bu-pm.html
de/pli/sutta/kn/ja/ja313.html
de/pli/sutta/kn/ja/ja313.html
de/pli/sutta/kn/ja/ja313.html
Line 112: <font face="Times New Roman">
(used to wrap the “ā” character, or troll)
pli/vinaya/pli-tv-kd/pli-tv-kd10.html
Line 159: <span a="…
pli/vinaya/pli-tv-kd/pli-tv-kd17.html
Line 160: <span ag="…
pli/vinaya/pli-tv-kd/pli-tv-kd1.html
Line 161: <span ariyasaccan="…
pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc42.html
Line 162: <span asa="…
pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc35.html
Line 163: <span bhikkhuvibha="…
pli/vinaya/pli-tv-bi-vb/pli-tv-bi-vb-pc/pli-tv-bi-vb-pc35.html
pli/vinaya/pli-tv-pvr/pli-tv-pvr17.html
Line 166: <span class="…" dissati=""…
pli/vinaya/pli-tv-pvr/pli-tv-pvr16.html
Line 169: <span class="…" id="…" itip="" kesuci="" ni="" pariv="" potthakesu="" title="…">
fr/lzh/sutta/lzh-nbs/t1670b/t1670b2.28.html
Line 194: <The English text is missing chapter 2.28. All the following is included in 2.27 there>
I’m guessing the Vinaya ones may be just about to become moot.
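Stray attributes such as the ones in the span tags above (itip, kesuci, and so on) could also be flagged automatically. Here is a rough sketch using Python's standard html.parser. The allow-list KNOWN_ATTRS is purely hypothetical and would need to be replaced with whatever attributes the texts are actually meant to use; the class and function names are likewise only illustrative.

```python
from html.parser import HTMLParser

# Hypothetical allow-list: the real set of permitted attributes is not
# stated in this thread and would have to be agreed on first.
KNOWN_ATTRS = {"class", "id", "title", "href", "lang", "target", "rel", "name", "charset"}


class AttrChecker(HTMLParser):
    """Record every attribute whose name is not in KNOWN_ATTRS."""

    def __init__(self):
        super().__init__()
        self.problems = []  # list of (line number, tag, attribute name)

    def handle_starttag(self, tag, attrs):
        line, _col = self.getpos()  # position of the tag being handled
        for name, _value in attrs:
            if name not in KNOWN_ATTRS:
                self.problems.append((line, tag, name))


def find_unknown_attrs(html: str):
    """Return the unknown attributes found in one HTML string."""
    checker = AttrChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

The parser lowercases tag and attribute names, so this catches misspellings like clas or claass as well as leaked words, but it will not notice case inconsistencies such as `<A HREF="…">`.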