Problems with marking up references for CBETA texts

sujato · April 24, 2016, 11:17am

In the Chinese texts we must form each text into a semantic unit. However, they are not found as such in the CBETA source, which is organized by juan (fascicle), not by actual text. It seems that splitting the files exposes some weaknesses in our parallels data. Here is an example.

Here is the source code from http://tripitaka.cbeta.org/T04n0211_003 for the juncture of t211.30 and t211.31:

</p>
<a name="0599c19" id="0599c19"></a>
<a pin_name="30 地獄品"></a>
<a pin_name=""></a>
<a pin_name="31 象品"></a>
<p style="margin-left: 2em">
<span class="headname">法句譬喻經象品第三十一</span>
</p>
<a name="0599c20" id="0599c20"></a>
<a pin_name="31 象品"></a>
<p style="text-indent: 0em; margin-left: 0em">
<span class="linehead">[0599c20] </span>

I’ve put each tag on a separate line for clarity, but apart from that this is exactly what they have.

There’s a lot of cruft here, so removing that for clarity and using a proper heading, we come closer to the markup used on SC:

</p>
<a id="0599c19"></a>
<a pin_name="30 地獄品"></a>
<a pin_name="31 象品"></a>
<h1>法句譬喻經象品第三十一</h1>
<a id="0599c20"></a>
<a pin_name="31 象品"></a>
<p>
<span class="linehead">[0599c20] </span>

So we create a juncture between the texts as marked here with the “pin_name”. Pin is Chinese for vagga. The first pin_name marks the end of the previous vagga, while the following pin_name marks the beginning of the next vagga. As you can see, it’s not very clear, but mostly it works okay.
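A juncture of this kind can be located mechanically. The following is only a minimal sketch, assuming the anchors look exactly like the CBETA snippet quoted above; the names `find_junctures`, `ANCHOR_RE` and `SAMPLE_HTML` are made up for illustration and are not part of any SC or CBETA tooling. It groups the non-empty pin_name values by the line anchor that precedes them and reports the lines that carry more than one.

```python
import re

# Matches either a line anchor (<a ... id="0599c19">) or a pin_name anchor.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*\bid="(?P<id>\w+)"|<a pin_name="(?P<pin>[^"]*)"'
)

# The CBETA fragment quoted earlier in this thread.
SAMPLE_HTML = """
<a name="0599c19" id="0599c19"></a>
<a pin_name="30 地獄品"></a>
<a pin_name=""></a>
<a pin_name="31 象品"></a>
<p style="margin-left: 2em">
<span class="headname">法句譬喻經象品第三十一</span>
</p>
<a name="0599c20" id="0599c20"></a>
<a pin_name="31 象品"></a>
"""


def find_junctures(html):
    """Return {line id: [pin names]} for lines followed by 2+ distinct pins."""
    pins_by_line = {}
    current = None
    for match in ANCHOR_RE.finditer(html):
        if match.group("id"):
            # A new line anchor: start collecting pins for this line.
            current = match.group("id")
            pins_by_line[current] = []
        elif current is not None:
            pin = match.group("pin")
            # Skip empty pin_name values and repeats on the same line.
            if pin and pin not in pins_by_line[current]:
                pins_by_line[current].append(pin)
    return {line: pins for line, pins in pins_by_line.items() if len(pins) > 1}


print(find_junctures(SAMPLE_HTML))
# {'0599c19': ['30 地獄品', '31 象品']}
```

On the fragment above, only line 0599c19 is reported, which is the line where the two vaggas meet.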

The problem is that in this instance, our parallels data gives the reference as “T 599c19”. With such a fine-grained reference system, it is easy to see how a line number could drift like this. If we’re referencing the “native” CBETA file, it’s not such a problem, as we can easily see that what is meant is the beginning of the next text. However, once the texts are split, the line number refers to an entirely different text.

Notice that this is a case of the “overlapping hierarchies” problem, which is one of the issues that standoff properties are intended to resolve.

In this case it will be best if our data agrees with the now-universal text of CBETA. We should investigate to see how widespread this problem is and whether there is a simple programmatic fix.

In this case, the problem could be fixed by doing this:

1. Look for the reference.
2. Scan ahead to find the next t-linehead.
3. If the t-linehead is different, change the reference to match it.
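As a rough illustration of these three steps, here is a minimal sketch. It assumes the cleaned markup shown earlier, where each line has an `<a id="...">` anchor and prose lines carry a `<span class="linehead">[...]</span>`; the function names and the `linehead_class` parameter are hypothetical, and on SC the class may be `t-linehead` instead.

```python
import re
from typing import Optional

# The cleaned-up fragment from earlier in this post.
SAMPLE = """
<a id="0599c19"></a>
<a pin_name="30 地獄品"></a>
<a pin_name="31 象品"></a>
<h1>法句譬喻經象品第三十一</h1>
<a id="0599c20"></a>
<a pin_name="31 象品"></a>
<p>
<span class="linehead">[0599c20] </span>
"""


def next_linehead(html: str, ref: str, linehead_class: str = "linehead") -> Optional[str]:
    """Step 1 and 2: find the anchor for `ref`, then the first linehead after it."""
    anchor = re.search(r'<a\b[^>]*\bid="%s"' % re.escape(ref), html)
    if anchor is None:
        return None  # the reference does not occur in this text at all
    pattern = r'<span class="%s">\[(\w+)\]' % re.escape(linehead_class)
    found = re.search(pattern, html[anchor.end():])
    return found.group(1) if found else None


def corrected_reference(html: str, ref: str) -> str:
    """Step 3: return the following linehead if it differs, else keep `ref`."""
    head = next_linehead(html, ref)
    return head if head and head != ref else ref


print(corrected_reference(SAMPLE, "0599c19"))  # 0599c20
```

If the reference is missing or no linehead follows it, the sketch leaves the reference unchanged rather than guessing.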

Whether this can be applied generally is another matter, of course. I have examined a number of texts. In some cases the line number is in the previous file, as here; in some cases it is exactly the same as the t-linehead; and in still other cases it is in the correct text, but on a line not marked as t-linehead.

Also note that not all texts start with t-linehead; verses, for example, do not, so if a file starts with a verse this will not work.

Pending a more satisfactory fix, we should not rely on the SC Taisho vol/page data to reliably match the exact start of the relevant text.

yap · August 9, 2016, 5:33am

Look closely at http://imgur.com/a/IO3qm
<a pin_name="30 地獄品"> is found in every paragraph covered by 地獄品,
but line 0599c19 has two anchors, 30 地獄品 and 31 象品.

A 品 is unlikely to start in the middle of a line.
Let's look at the XML in the CBETA CDROM 2016, filename XML/T04/T04n0211_003.xml:

<lb n="0599c17" ed="T"/>入山中殞命精進，思惟偈義，守一正心閑居
<lb n="0599c18" ed="T"/>寂滅，得羅漢道。</p></cb:div>
<lb n="0599c19" ed="T"/><cb:div type="pin"><cb:mulu level="1" n="32" type="品">31 象品</cb:mulu><head><title>法句譬喻經</title>象品第三十一</head>
<lb n="0599c20" ed="T"/><p xml:id="pT04p0599c2001">昔者羅雲未得道時，心性麤獷言少誠信。
<lb n="0599c21" ed="T"/>佛敕羅雲：「汝到賢提精舍中住，守口攝意

while
<cb:div type="pin"><cb:mulu level="1" n="31" type="品">30 地獄品</cb:mulu>
is on T04p0598c01 (and its </cb:div> ends at T04p0599c18, not including 599c19).

I guess the 地獄品 anchor below 0599c19 is a repetition, and it might have been inserted incorrectly by a program during the conversion from XML to HTML.
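The line on which each 品 opens can be read straight from the XML. This is only a minimal sketch, assuming the `<lb n="..."/>` and `<cb:mulu ... type="品">` tags appear as in the fragment quoted above; `pin_start_lines`, `XML_TOKEN_RE` and `SAMPLE_XML` are names invented for this example. It uses a plain regex scan instead of an XML parser, so it works on a fragment that lacks namespace declarations.

```python
import re

# Matches either a line break marker or a 品-level mulu entry.
XML_TOKEN_RE = re.compile(
    r'<lb n="(?P<lb>\w+)"[^>]*/>'
    r'|<cb:mulu\b[^>]*type="品"[^>]*>(?P<title>[^<]*)</cb:mulu>'
)

# Part of the fragment from XML/T04/T04n0211_003.xml quoted above.
SAMPLE_XML = """
<lb n="0599c18" ed="T"/>寂滅，得羅漢道。</p></cb:div>
<lb n="0599c19" ed="T"/><cb:div type="pin"><cb:mulu level="1" n="32" type="品">31 象品</cb:mulu><head><title>法句譬喻經</title>象品第三十一</head>
<lb n="0599c20" ed="T"/><p xml:id="pT04p0599c2001">昔者羅雲未得道時，心性麤獷言少誠信。
"""


def pin_start_lines(xml_text):
    """Return {品 title: line number of the <lb> that precedes its mulu entry}."""
    starts = {}
    current = None
    for match in XML_TOKEN_RE.finditer(xml_text):
        if match.group("lb"):
            current = match.group("lb")  # remember the most recent line number
        elif current is not None:
            # Keep the first line seen for each title.
            starts.setdefault(match.group("title"), current)
    return starts


print(pin_start_lines(SAMPLE_XML))
# {'31 象品': '0599c19'}
```

Comparing this map with the pin_name anchors in the HTML would show which lines carry an anchor for a 品 that has already closed in the XML.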