Pali text on SuttaCentral

sujato · July 27, 2020, 12:08am

We have recently made a substantial set of transformations for the Pali text on SuttaCentral. This post is intended as a summary and history of the text for scholars and those who wish to use our Pali text. These issues have been discussed in various places, and here I am simply bringing them together for convenience.

Summary: use the Pali text from bilara-data for all applications.

Background

In the modern era, Pali texts have been edited and published as book editions. One of the most ambitious and influential such projects was the 6th Council (Chaṭṭha Saṅgāyana or Chaṭṭha Saṅgīti), held in Myanmar in 1954–1956, with the participation of international bhikkhus. Several printed editions of this text followed. During the 1990s the official Burmese edition was digitized by the Vipassana Research Institute (VRI) and published as a CD-ROM and website.

The VRI digital edition was passed to a group called the Dhamma Society Fund in Bangkok. Between 1999 and 2004 they created the Mahāsaṅgīti edition based on the VRI. It is also called the “World Tipiṭaka Edition”, but the full name is “Mahāsaṅgīti Tipiṭaka Buddhavasse 2500”, or in English, “The B.E. 2500 Great International Council Pāḷi Tipiṭaka”.

I met the Dhamma Society team several times in the early 2000s to discuss the project, and had the chance to visit their headquarters in Bangkok. They informed me that their text had been proofread three times orally, recited from beginning to end with auditors checking as they went, and three times visually. This was done using the first Burmese printed edition of the Chaṭṭha Saṅgīti as the primary source. It was checked against about a dozen other printed editions (many, but not all, of which were alternate versions of the Chaṭṭha Saṅgīti). They did not, however, use Sanskrit, Tibetan, or Chinese sources at all, and did not go back to manuscripts.

The scholarship team was led by a mae chi, whose name I have sadly not been able to discover, with background in Pali and Sanskrit at Nalanda University. The stated editorial policy of the Dhamma Society was simply to rely on the opinions of experts in determining the best reading. The Dhamma Society itself did not aim to establish a new edition, but to faithfully represent the Chaṭṭha Saṅgīti text.

Comparison of the Mahasangiti with the source VRI text shows that the textual changes are well-considered, and attests to the Dhamma Society’s careful editorial work.

Unfortunately, in late 2012 the Dhamma Society dissolved and their online resources disappeared. SuttaCentral obtained a copy of their source text from Venerable Yuttadhammo, who has made it available in his Digital Pali Reader. These source files, with only light transformations, are maintained by SuttaCentral in this repository.

Adapting the Mahasangiti to HTML for SuttaCentral

The work of adapting the Pali texts for SuttaCentral was begun by Blake Walshe and continued by Ayya Vimala and Bhante Sujato, assisted by several volunteers.

The initial task was to transform the XML files into HTML. We made a number of basic decisions at that time.

Usually one sutta per file (whereas the Mahasangiti split long suttas into many sections).
Sometimes very short suttas were combined (eg. in the Anguttara Ones and Twos, and Dhammapada).
Use modern, semantic HTML throughout, for example, for headings.
The Mahasangiti sources kept the variant readings and cross-reference notes separate to the text files. There is an extensive set of files that link the two, but the system is extremely obscure and difficult to work with. Nevertheless, we succeeded in importing these notes into the HTML text files.
Use our own set of abbreviations and nomenclature. This is primarily required by the fact that we must deal with the parallels, and hence have a much wider corpus.

This created the set of HTML files that we used on SuttaCentral. These files are maintained in this repository.

From HTML to JSON for Bilara translation

While the HTML texts have served us well, they have a number of severe limitations. These are not details of implementation, but fundamental problems with the very idea of markup. These problems became exposed as we developed our own set of translations. We accordingly developed an entirely new system for maintaining the texts in the data exchange format JSON. This forms the basis for all our Pali texts going forward, and we aim to extend it to other languages as well.

To understand the reasoning here, it is necessary to appreciate the limitations of markup in dealing with text, especially complex text such as ours. The basic problem is simple: text with markup mixes up different things in the same file, and makes it hard or impossible to separate them when needed. These things might include:

Application logic (such as navigation)
Styles (fonts, colors, etc.)
Structure (eg. paragraphs, headings)
Metadata (eg. copyright, authorship)
Supplementary data (eg. variant readings)
Notes
References

Maintaining all of these in a consistent way across tens of thousands of texts is not impossible, but it is not easy. We have tried to keep things as clean as possible, for example by removing as many things as possible from the HTML files (including application logic and styles). But what remains is still unwieldy.

And there are certain things that are practically impossible with markup. This is particularly the case when it comes to translations. Much of the information that we have is equally applicable to each translation. For example, regardless of whether you are reading a translation in English, Spanish, or Japanese, you may want to check the underlying Pali text for a given paragraph. You may also want to see if there are variant readings, or check where the passage appears in a particular printed edition. In markup files, all this information is siloed per file, and cannot be easily shared.

When undertaking our new translations of the Pali canon, we transformed our Pali files into the gettext PO format. We initially had no interest other than making it work for translation. The more we worked with this format, however, the more we realized the possibilities. PO itself is not well-specified for our requirements, so we developed our own specification based on JSON, the universal data-exchange format for the web.

This system is simple but flexible and powerful. The texts are maintained in the bilara-data repository.

Each text is divided into segments, where a segment is a minimum meaningful string of text (such as a sentence, a clause, a line of verse, or a doctrinal pericope.)
Each segment is assigned an ID that is unique within the corpus.
Each kind of content is maintained in separate files in separate folders.
Applications use the segment ID to combine content as appropriate.

Let us see a simple example. MN 1 Mulapariyaya Sutta begins with the stock phrase, represented in bilara-data thus:

"mn1:1.1": "Evaṁ me sutaṁ—",

Like every entry in bilara-data, this has two parts, the segment ID and the content.

The segment ID has two parts. The portion before the colon is the ID for the text, which is also the name of the current file. The portion after the colon is the number of the segment within this text. Thus mn1:1.1 means “Majjhima Nikaya discourse number one, segment 1.1”.
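
As a rough illustration of this convention, a segment ID can be split at its colon. The names `SegmentId` and `parse_segment_id` below are hypothetical and are not part of bilara-data or any SuttaCentral tool; this is only a minimal sketch of the rule just described.

```python
from typing import NamedTuple


class SegmentId(NamedTuple):
    text_uid: str  # portion before the colon, e.g. "mn1" (also the file's base name)
    number: str    # portion after the colon, e.g. "1.1"


def parse_segment_id(segment_id: str) -> SegmentId:
    """Split an ID such as 'mn1:1.1' at its first colon."""
    text_uid, sep, number = segment_id.partition(":")
    if not sep or not text_uid or not number:
        raise ValueError(f"not a segment ID: {segment_id!r}")
    return SegmentId(text_uid, number)


parsed = parse_segment_id("mn1:1.1")
print(parsed.text_uid, parsed.number)
```

The segment number is kept as a string here, since the sketch makes no assumption about how many dotted levels a number may have.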

The content is the string of Pali text for this segment.

The various folders contain different content for the same string, as indicated by the folder names. There is no limit to how many folders or types of information are assigned. And because each form of data is kept strictly separate, applications can simply consume whatever content they require and ignore the rest.

Let us see what the other folders contain for this same segment.

In /reference we find:

"mn1:1.1": "bj10.2, cck12.1, csp1ed9.1, csp2ed9.1, dr12.1, ms9M_2, msdiv1, ndp9.3, nya1, pts-vp-pli1.1, sc1, sya12.1, vri12.1",

Thus this particular segment may be checked against the Pali text in this whole list of editions.

In /html we find:

"mn1:1.1": "<p><span class='evam'>{}</span>",

The {} acts as placeholder for the text. Here, the string is the start of a paragraph, and in addition, is wrapped in a span for styling (represented as small-caps on SuttaCentral).

In /translation/sujato we find:

"mn1:1.1": "So I have heard. ",

Any number of other translations can be added, and mixed and matched as desired. On SuttaCentral, we might present the above data in HTML as follows:

<p>
    <span class='evam'>
        <span class='segment' id='mn1:1.1' data-reference='bj10.2, cck12.1, csp1ed9.1, csp2ed9.1, dr12.1, ms9M_2, msdiv1, ndp9.3, nya1, pts-vp-pli1.1, sc1, sya12.1, vri12.1'>
            <span class='root' lang='pli' translate='no'>Evaṁ me sutaṁ—</span>
            <span class='translation' lang='en'>So I have heard. </span>
        </span>
    </span>

Note that as of writing, these changes have not yet made their way to the actual SuttaCentral site.
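
To make the idea of combining content by segment ID concrete, here is a minimal sketch that assembles markup like the above from four separate mappings. It is not SuttaCentral's actual rendering code: the function name `render_segment` and the in-memory dictionaries are assumptions for illustration, and in practice each mapping would be loaded from its own JSON file.

```python
from html import escape

# Sample entries copied from the examples above, one mapping per kind of content.
root = {"mn1:1.1": "Evaṁ me sutaṁ—"}
translation = {"mn1:1.1": "So I have heard. "}
reference = {
    "mn1:1.1": "bj10.2, cck12.1, csp1ed9.1, csp2ed9.1, dr12.1, ms9M_2, "
               "msdiv1, ndp9.3, nya1, pts-vp-pli1.1, sc1, sya12.1, vri12.1"
}
markup = {"mn1:1.1": "<p><span class='evam'>{}</span>"}


def render_segment(seg_id: str) -> str:
    """Build the markup for one segment by filling the {} placeholder."""
    inner = (
        f"<span class='segment' id='{seg_id}' "
        f"data-reference='{escape(reference.get(seg_id, ''))}'>"
        f"<span class='root' lang='pli' translate='no'>{escape(root[seg_id])}</span>"
        f"<span class='translation' lang='en'>{escape(translation.get(seg_id, ''))}</span>"
        "</span>"
    )
    # str.replace is used instead of str.format so that any braces
    # occurring in the text itself are left alone.
    return markup.get(seg_id, "{}").replace("{}", inner, 1)


print(render_segment("mn1:1.1"))
```

Because each kind of content lives in its own mapping, swapping in a different translation only means passing a different dictionary; the root text, references, and markup are untouched.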

Text integrity checking

In transforming the text through multiple formats, various errors and issues arose. We undertook a major project to test our text against the original source files and rectify any errors. Fortunately the errors discovered were minor, and we have fixed all outstanding issues. This project is discussed in more detail here.

This project is complete so far as the Pali text is concerned, and we are confident that our Pali text is an accurate letter-by-letter representation of the Mahasangiti edition. Of course it is still possible we have made mistakes, so if anyone notices any, please let us know.

Note that, in addition to the structural changes noted above, the following things are different in SuttaCentral’s texts:

Headings are handled differently.
Punctuation is sometimes corrected.
Class nasals are used consistently (i.e. we always have ṅg not ṁg, etc.). This is to make search easier.
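
For anyone searching these texts with input that uses ṁ before a consonant, a query can be normalized in the same direction. The sketch below is illustrative only and is not the procedure SuttaCentral used; the function name `use_class_nasals` is hypothetical. It assumes lowercase input, handles only ṁ (dot above), and replaces it with the nasal of the following consonant's class, leaving ṁ unchanged at the end of a word or before any other letter.

```python
import re
import unicodedata

# Nasal belonging to the class (vagga) of each following consonant.
# Aspirates (kh, gh, ...) begin with the same letter, so they are covered too.
CLASS_NASALS = {
    "k": "ṅ", "g": "ṅ",
    "c": "ñ", "j": "ñ",
    "ṭ": "ṇ", "ḍ": "ṇ",
    "t": "n", "d": "n",
    "p": "m", "b": "m",
}

# Match ṁ only when a class consonant follows; the lookahead captures
# that consonant without consuming it.
_NIGGAHITA = re.compile("ṁ(?=([kgcjṭḍtdpb]))")


def use_class_nasals(text: str) -> str:
    """Replace ṁ with the class nasal of the consonant that follows it."""
    text = unicodedata.normalize("NFC", text)  # ensure precomposed letters
    return _NIGGAHITA.sub(lambda m: CLASS_NASALS[m.group(1)], text)


print(use_class_nasals("saṁgha"))
```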

Future changes to Pali texts

We do not anticipate any further changes to the text content of the Mahasangiti text. However, there will be some minor ongoing improvements in different areas:

Certain variant readings and references are still to be added. You can see the progress of these tasks on Github.
Certain segments may be split or combined for consistency and semantic correctness. This may affect a small number of segment numbers in restricted portions of text. These corrections are made piecemeal as we work with the texts, so we cannot predict the number, but it will be only a few cases.
In a few cases we use Pali texts outside the Mahasangiti edition. Currently we publish the two patimokkhas, which are derived from the VRI text. We are also working on digitizing manuscripts.

Making use of SuttaCentral Pali texts

We encourage you to make use of our texts for your projects. The primary source for our Pali (and other segmented texts) is in the published branch of the bilara-data repo.

We would recommend that you clone or fork this repo. Check back every six months or so to see if there are any changes.

In the .scripts folder of this repo we provide a tool called bilara i/o. This uses pyexcel to provide a handy interface for combining the content of bilara-data files and exporting in a convenient format, typically a spreadsheet. From there you can organize the data as you like, and export it in any form you wish.
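
If you would rather not depend on that tool, the same kind of merge can be sketched with the Python standard library alone. The following is not bilara i/o and does not reproduce its interface; the function names are hypothetical, and the file paths in the closing comment are placeholders that you would replace with the actual locations in your clone of the repo.

```python
import csv
import json
from pathlib import Path


def load_segments(path: Path) -> dict[str, str]:
    """Read one bilara-style JSON file: a flat {segment_id: content} object."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def merge_columns(columns: dict[str, dict[str, str]]) -> list[dict[str, str]]:
    """Join several {segment_id: content} mappings into one row per segment."""
    segment_ids: list[str] = []
    seen: set[str] = set()
    for data in columns.values():
        for seg_id in data:  # keep first-seen order of segment IDs
            if seg_id not in seen:
                seen.add(seg_id)
                segment_ids.append(seg_id)
    return [
        {"segment_id": seg_id,
         **{name: data.get(seg_id, "") for name, data in columns.items()}}
        for seg_id in segment_ids
    ]


def write_csv(rows: list[dict[str, str]], fieldnames: list[str], out_path: Path) -> None:
    """Write the merged rows as a UTF-8 CSV file with a header line."""
    with out_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Example with in-memory data; with real files you would instead call, e.g.,
# load_segments(Path("path/to/root-file.json")) for each column (placeholder path).
rows = merge_columns({
    "root": {"mn1:1.1": "Evaṁ me sutaṁ—"},
    "translation": {"mn1:1.1": "So I have heard. "},
})
print(rows)
```

A segment missing from one of the mappings simply gets an empty cell in that column, so partially translated texts still line up against the root text.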