Pali text integrity checking

sujato · June 21, 2020, 2:20am

Today I have the very pleasant task of announcing the completion of our process of text integrity checking for our Pali texts.

We have run a thorough set of tests comparing our Pali text with the original source files, and fixed the differences between them. The corrected texts are available on bilara-data and will be pushed to our site in due course.

Background

On the dissolution of the Mahasangiti project, SuttaCentral inherited from Ven Yuttadhammo’s Digital Pali Reader project a complete set of files for the Mahasangiti edition of the Pali Tipitaka. This was used as the basis for our Pali texts.

However, the form of the texts was quite different from what we wanted.

The markup was complex and sometimes suboptimal;
texts were divided in complex ways, whereas we wanted to keep one sutta per page;
as a result, the heading structures needed changing;
the numbering system, while accurate and systematic, is unique to ms and not in use anywhere else;
various other details.

Blake therefore transformed the Pali files for use by SuttaCentral.

The Mahasangiti also had extensive files with variant readings and references. These were kept in separate XML files, and it was not easy to see how they were meant to work. Nonetheless, Blake succeeded in importing the variants, although there were some bugs. The references, however, have remained unused by SC.

This work was done around 2012.

Since then, there have been some minor changes, for example, due to ensuring that the numbering matched Ven Bodhi’s editions of AN and SN.

When I began my translation project, we needed to segment the texts. We therefore created segmented versions of the 4 nikayas. This process essentially broke the text on major punctuation (. ; : ! ? …). It was then modified and corrected by hand.
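
As a rough illustration of that first automatic pass, here is a minimal Python sketch. It is not the script that was actually used; the function name `segment` and the regular expression are assumptions, and the real output was corrected by hand afterwards.

```python
import re

# Split at whitespace that follows major punctuation; the punctuation
# stays attached to the segment it closes.
_BREAK = re.compile(r"(?<=[.;:!?…])\s+")

def segment(text):
    """Return a list of segments broken on major punctuation."""
    return [s for s in _BREAK.split(text.strip()) if s]

for part in segment("Evaṁ me sutaṁ. Ekaṁ samayaṁ bhagavā viharati; tatra kho āmantesi: bhikkhavo!"):
    print(part)
```

A sketch like this does not handle closing quote marks after the punctuation, which is one reason a manual pass would still be needed.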

Gradually, we segmented the Vinaya, then Thera/Therigatha, then the remainder of the Pali texts. Since the segmenting was done in stages, various discrepancies arose.

All these segmented texts needed to be added to bilara-data so as to be consumed by Bilara for translations, and our new front end, as well as SC-Voice. In bilara-data, all the kinds of data are kept strictly separated and coordinated via segment ID numbers. Thus it is imperative that those numbers be correct and consistent.
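
To show why the IDs matter, here is a small Python sketch that finds segment IDs used in one layer of data (say, references) that do not exist in the root text. It assumes each layer is a simple mapping from segment ID to string, as if loaded from a JSON file; the function name and the sample IDs are made up for illustration.

```python
def orphan_ids(root, layer):
    """Return IDs used in `layer` that do not exist in `root`, in order."""
    return [seg_id for seg_id in layer if seg_id not in root]

# Made-up sample data; real files would be loaded with json.load().
root = {"x1:1.1": "first segment", "x1:1.2": "second segment"}
reference = {"x1:1.2": "ref-a", "x1:1.3": "ref-b"}

print(orphan_ids(root, reference))  # ['x1:1.3']
```

Any ID reported here points at data that can no longer be attached to the text.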

In addition, I wanted to take the chance to import and finally make available the reference information from the Mahasangiti edition. I had done preparatory work for this, but the data needed to be added to bilara-data. We also needed to check the correctness of the variant readings.

I worked for some time with @karl_lew on this project, but it became clear that it was too big and we needed more resources. So I approached STXnext, with whom we have worked successfully in the past, and they took on the job. I have been working with them and @Robbie for the past several weeks on this.

Accomplishments

Variant readings have been checked, and various problems resolved.
All legacy reference numbers have been imported (there are a couple of minor tasks remaining here).
Inconsistencies of segment numbering between files and texts have been resolved.

Detailed list here:

Text integrity

The most important task was to ensure that the text remains identical with the MS original. During the various changes in file format, it is hard to avoid certain errors creeping in, and in a corpus as large as the Pali, it is not easy to check.

STXnext devised a program that would match our text against the original source, relying on the ms ID numbers found in both texts.
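
The general idea can be sketched in Python as follows. This is not the STXnext program; it assumes both texts have already been parsed into sequences of `(ms_id, text)` pairs, and the function names and sample IDs are hypothetical.

```python
from collections import defaultdict

def group_by_ms_id(chunks):
    """Join the text of all chunks that share an ms ID, in order."""
    grouped = defaultdict(list)
    for ms_id, text in chunks:
        grouped[ms_id].append(text)
    return {ms_id: " ".join(parts) for ms_id, parts in grouped.items()}

def find_mismatches(source_chunks, our_chunks):
    """Return the ms IDs whose text differs or that exist on one side only."""
    source = group_by_ms_id(source_chunks)
    ours = group_by_ms_id(our_chunks)
    return sorted(
        ms_id
        for ms_id in source.keys() | ours.keys()
        if source.get(ms_id) != ours.get(ms_id)
    )

# Made-up IDs and text: our side splits "id1" into two segments.
source = [("id1", "a b"), ("id2", "c")]
ours = [("id1", "a"), ("id1", "b"), ("id2", "x")]
print(find_mismatches(source, ours))  # ['id2']
```

Grouping by ID first means that our finer segmentation does not count as a difference by itself.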

The original ms edition is represented by a set of files called ms_yuttadhammo/html. We believe this is an uncorrupted set of original files. The only transformations we have applied are converting the XML to HTML and using Unicode instead of HTML entities to aid legibility. Please note that, while we have done our best to check and resolve every issue, we cannot guarantee that there is no corruption in our files. Thus if you need the ms edition, best use these files.
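
For readers unfamiliar with the entity issue, this short Python sketch shows what replacing HTML entities with Unicode looks like, using the standard library's `html.unescape`. The function name and sample string are assumptions, and this is not the conversion that was actually run.

```python
import html

def entities_to_unicode(markup):
    """Replace HTML character references with the characters they stand for."""
    return html.unescape(markup)

# &#7747; is the numeric reference for ṃ (U+1E43).
print(entities_to_unicode("eva&#7747; me suta&#7747;"))
```

Note that `html.unescape` also converts references such as `&amp;` and `&lt;`, so applying it to a whole file would need more care than this sketch suggests.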

If you want to run the tests yourself, you can use the scripts:

Fortunately we did not find any major losses or corruptions. Here is a summary of issues; for full details see here.

Prior to the systematic checks we identified and fixed a number of miscellaneous issues, such as duplicated text. The most significant issue was the loss of the tassuddānaṁ and similar markers. These little snippets of text serve to introduce the uddāna, or summary of contents. Essentially they are like a heading saying “Table of Contents”. Due to what must have been a find/replace error at some point, all these had disappeared from our texts. I restored some 500 cases, and in a few instances, subsequent testing made further corrections. These all take class='uddana-intro'.

The scripts ignore certain kinds of differences:

file structure
headings
markup
punctuation
capitalization
ṃk/ṃg vs ṅk/ṅg
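
One way to ignore such differences is to normalize both texts before comparing them. The Python sketch below covers markup, punctuation, capitalization, and the ṃk/ṃg spelling; it does not deal with file structure or headings. It is an assumption about how this could be done, not the logic of the actual scripts, and the name `normalize` is hypothetical.

```python
import re
import unicodedata

M_DOT_BELOW = "\u1e43"  # ṃ
N_DOT_ABOVE = "\u1e45"  # ṅ

def normalize(text):
    """Reduce text to a form where only wording differences remain."""
    text = unicodedata.normalize("NFC", text)      # one encoding per letter
    text = re.sub(r"<[^>]+>", " ", text)           # drop markup tags
    text = text.lower()                            # ignore capitalization
    text = re.sub(M_DOT_BELOW + r"(?=[kg])", N_DOT_ABOVE, text)  # ṃk/ṃg -> ṅk/ṅg
    text = re.sub(r"[^\w\s]", " ", text)           # punctuation becomes space
    return " ".join(text.split())                  # collapse whitespace

print(normalize("<p>Saṃgha, SAṄGHA!</p>"))
```

Replacing punctuation with a space is a design choice: it means an apostrophe inside a word shows up as a spacing difference rather than being silently ignored.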

There are a number of areas where the tests could be improved.

They compare overly large clusters of text. In one case, the compared ranges included 43,000 words, the length of a small novel. The comparison should be on the smallest range possible.
There are still 101 bugs reported; these are false positives.
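
On the first point, one possible improvement is to report only the words that differ inside a compared range, rather than flagging the whole range. Here is a minimal Python sketch using the standard library's `difflib`; the function name is hypothetical, and this is a suggestion rather than something the current scripts do.

```python
import difflib

def word_differences(source_text, our_text):
    """List (kind, source words, our words) for each span that differs."""
    a, b = source_text.split(), our_text.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

print(word_differences("a b c d", "a x c d"))  # [('replace', 'b', 'x')]
```

Word-level comparison gets slow on very long ranges, so it would work best after the ranges themselves have been made smaller.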

In the current round of testing, here are the main issues:

Text incorrect in MS, corrected in SC: In one instance we found textual corruption in the original files. At 15A1_449 we find ‘ānāmi passāmī’ti repeatedly. This error is not found in the source VRI text. It is present in all versions of ms that we have, so it seems it may have been a mistake introduced by the ms itself.
- Corrected in SC.
Text incorrect in MS, not corrected in SC: In one instance we found an error in ms that had already been corrected in SC: 18Sn_519 has eka samayaṃ. This spelling is, however, found in the source VRI edition, so perhaps it should be regarded as a genuine variant.
- Reverted to original spelling eka samayaṃ in SC.
Apostrophes introduced in SC: eg. evam’idhekacce vs. evamidhekacce.
- retain apostrophe
Extra text in ms: Some words or phrases were missing in SC, eg. pharasuppahārānaṃ at 15A4_754.
- Missing text has been restored.
Spacing error: Sometimes words were spelled with inconsistent spacing in one or other edition. Eg. we find both me taṁ and metaṁ.
- Prefer spaced version, apply consistently.
Quote errors: Due to different handling of quote marks, some words were confused.
- quotes go before the end -ti, eg. pahātabbo’ti
Random spelling errors in SC: 4 or 5 times we find some random mistake in SC text, such as suṇohhi. In one case the word gihibhāvaṃ had become corrupted to agihibhāvaṃ. This is perhaps the only case I encountered that might result in a genuine change of meaning.
- Correct spelling in SC.
Heading errors: Given that headings are insertions by modern editors, and that our file structure demands a different heading structure, we opted to ignore headings for these tests. However, in some cases the tests did not correctly identify headings, giving a false positive.
- No action.
Missing speaker tag in SC: In verse conversations, we sometimes find a little tag to identify the current speaker. In Sutta Nipata, these had all been lost.
- Restore speaker tag to Snp.
Missing pe: In quite a few cases, the word pe, indicating a section of text to be expanded, was missing in SC. Instead we found just ellipses. Note too that in at least one case, where pe had been used inconsistently in the MS edition, we retain the inconsistency and just have ellipses instead of adding pe.
- restore pe
Text corruption due to variants: In a few cases, there had evidently been a confusion with the variant readings, and text belonging to the variant reading had made its way into the mainline text.
- Restore mainline text and put variants in variants.
Translation in root: In one case, two words from the translation appeared in the root.
- Restore mainline text.

That is the end of this process. Hopefully we are settled on a stable file structure now and such work will not be needed in the future. Nevertheless, the scripts are available for future testing.

I was thinking of making a few systematic changes to our texts that would diverge from ms. These are mainly for aesthetics, and making search easier.

MS uses ṃk/ṃg and ṅk/ṅg in about a 2/3 ratio. I suggest we use ṅk/ṅg always.
yoniso manas* vs. yonisomanas* is inconsistent; I propose we always use yoniso manas*.
Change ṃ to ṁ

On investigation, yoniso manas* vs. yonisomanas* seems to depend on grammatical form and must be intentional, so I won’t change it. I have made the other changes.
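
For anyone wanting to apply the same two changes to their own copy of the ms text, this Python sketch shows the idea. It is an illustration under assumptions, not the procedure actually used: the function name is hypothetical, it handles lowercase letters only, and it expects precomposed (NFC) characters.

```python
import re

M_DOT_BELOW = "\u1e43"  # ṃ
M_DOT_ABOVE = "\u1e41"  # ṁ
N_DOT_ABOVE = "\u1e45"  # ṅ

def apply_spelling_changes(text):
    """Use ṅ before k/g, then write every remaining ṃ as ṁ."""
    # Order matters: the ṅk/ṅg rule must run while ṃ is still in the text.
    text = re.sub(M_DOT_BELOW + r"(?=[kg])", N_DOT_ABOVE, text)
    return text.replace(M_DOT_BELOW, M_DOT_ABOVE)

print(apply_spelling_changes("saṃgho evaṃ"))
```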

Viveka · June 21, 2020, 2:43am

Sadhu! and congratulations!

Gillian · June 21, 2020, 3:47am

Wow. Sadhu to everyone who worked so hard on this obviously huge task.

sabbamitta · June 21, 2020, 7:24am

Congratulations, this is great news!

karl_lew · June 21, 2020, 1:10pm

Wow. No wonder why Anagarika and I had trouble tracing back the origin of these files. Thank you, Bhante, for the wonderful news!

It is really great to be able to move forward with the Pali canon secured and whole.