For the Chinese texts, we must form each text into a semantic unit. However, they are not found as such in the CBETA source, which is organized by juan rather than by actual text. Splitting the texts this way seems to expose a weakness in our parallels data. Here is an example.
Here is the source code from http://tripitaka.cbeta.org/T04n0211_003 for the juncture of t211.30 and t211.31:
</p>
<a name="0599c19" id="0599c19"></a>
<a pin_name="30 地獄品"></a>
<a pin_name=""></a>
<a pin_name="31 象品"></a>
<p style="margin-left: 2em">
<span class="headname">法句譬喻經象品第三十一</span>
</p>
<a name="0599c20" id="0599c20"></a>
<a pin_name="31 象品"></a>
<p style="text-indent: 0em; margin-left: 0em">
<span class="linehead">[0599c20] </span>
I’ve put each tag on a separate line for clarity, but apart from that this is exactly what they have.
There’s a lot of cruft here. Removing it and using a proper heading, we come closer to the markup used on SC:
</p>
<a id="0599c19"></a>
<a pin_name="30 地獄品"></a>
<a pin_name="31 象品"></a>
<h1>法句譬喻經象品第三十一</h1>
<a id="0599c20"></a>
<a pin_name="31 象品"></a>
<p>
<span class="linehead">[0599c20] </span>
So we create a juncture between the texts as marked here with the “pin_name”. Pin is Chinese for vagga. The first pin_name marks the end of the previous vagga, while the following pin_name marks the beginning of the next vagga. As you can see, it’s not very clear, but mostly it works okay.
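As a rough illustration of the juncture rule just described, here is a minimal Python sketch that reports where the pin_name value changes. The function name `find_junctures` is hypothetical, and the regular expression assumes the markers appear exactly as in the CBETA snippet quoted above; it is not how the actual splitting is implemented.

```python
import re

def find_junctures(html):
    """Yield (previous_name, next_name, offset) wherever pin_name changes."""
    previous = None
    for m in re.finditer(r'<a pin_name="([^"]*)"></a>', html):
        name = m.group(1)
        if not name:
            continue  # skip empty pin_name markers, as seen in the raw CBETA source
        if previous is not None and name != previous:
            # offset is the position of the first marker carrying the new name
            yield previous, name, m.start()
        previous = name

# The raw CBETA snippet quoted above, shortened to the relevant markers.
raw = (
    '</p> <a name="0599c19" id="0599c19"></a> '
    '<a pin_name="30 地獄品"></a> <a pin_name=""></a> '
    '<a pin_name="31 象品"></a> '
    '<a name="0599c20" id="0599c20"></a> <a pin_name="31 象品"></a>'
)

for prev, nxt, _ in find_junctures(raw):
    print(prev, "->", nxt)  # 30 地獄品 -> 31 象品
```

Note that in this snippet the line anchor 0599c19 sits before the reported offset, so after splitting it belongs to the earlier text.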
The problem is that in this instance, our parallels data gives the reference as “T 599c19”. With such a fine-grained reference system, it is easy to see how a line number could drift like this. If we’re referencing the “native” CBETA file, it’s not such a problem, as we can easily see that what is meant is the beginning of the next text. However, once the texts are split, the line number refers to an entirely different text.
Notice that this is a case of the “overlapping hierarchies” problem, which is one of the issues that standoff properties are intended to resolve.
In this case it will be best if our data agrees with the now-universal text of CBETA. We should investigate to see how widespread this problem is and whether there is a simple programmatic fix.
In this case, the problem could be fixed by doing this:
- Look for the reference
- Scan ahead to find the next t-linehead
- If the t-linehead is different, change the reference to match it.
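The three steps above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the function names are hypothetical, the reference has already been normalized to the anchor id form (for example “T 599c19” to `0599c19`), and a linehead looks like the `<span class="linehead">[0599c20] </span>` in the snippet quoted earlier. The class name is a parameter, since the SC markup may use `t-linehead` instead.

```python
import re

def next_linehead(html, ref, linehead_class="linehead"):
    """Return the id of the first linehead after the anchor for `ref`, or None."""
    # Step 1: look for the reference, i.e. an <a> tag whose id equals `ref`.
    anchor = re.search(r'<a\b[^>]*\bid="%s"' % re.escape(ref), html)
    if anchor is None:
        return None  # reference not present in this file
    # Step 2: scan ahead from the anchor to the next linehead span.
    pattern = r'<span class="%s">\s*\[(\w+)\]' % re.escape(linehead_class)
    match = re.search(pattern, html[anchor.start():])
    return match.group(1) if match else None

def corrected_reference(html, ref, linehead_class="linehead"):
    """Step 3: if the next linehead differs from `ref`, return it instead."""
    found = next_linehead(html, ref, linehead_class)
    return ref if found is None or found == ref else found

# The cleaned-up snippet quoted earlier, on one line.
sample = (
    '</p> <a id="0599c19"></a> <a pin_name="30 地獄品"></a> '
    '<a pin_name="31 象品"></a> <h1>法句譬喻經象品第三十一</h1> '
    '<a id="0599c20"></a> <a pin_name="31 象品"></a> '
    '<p> <span class="linehead">[0599c20] </span>'
)

print(corrected_reference(sample, "0599c19"))  # 0599c20
print(corrected_reference(sample, "0599c20"))  # 0599c20
```

When the reference is missing or no linehead follows it, the sketch leaves the reference unchanged rather than guessing.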
Whether this can be applied generally is another matter, of course. I have examined a number of texts. In some cases the line number is in the previous text, as here; in some cases it is exactly the same as the t-linehead; and in still other cases it is in the correct text, but in a line not marked as t-linehead.
Also note that not all texts start with t-linehead; verses, for example, do not, so if a file starts with a verse this will not work.
Pending a more satisfactory fix, we should not rely on the SC Taisho vol/page data to match the exact start of the relevant text.