Some issues with the Chinese texts

I had a bit of a chat a few days ago with Prof Lewis Lancaster. He’s one of the luminaries of modern Buddhist studies, and a pioneer in the area of digitizing the Chinese texts.

He mentioned some issues with the Chinese texts that I was not aware of, so I thought I’d post them here. Perhaps some of our Chinese-speaking users might have more to add on this.

Prof Lancaster said that when they printed the Taisho canon the fonts available to the printers did not have all the characters they needed to fully represent all that was in the manuscripts. So they kludged it to some degree, substituting modern versions of characters, or using the closest in form.

With the tens of thousands of Chinese characters, can you imagine what this job was like? Seeing a shape on a page—maybe blurry or indistinct—and trying to ascertain whether that exact form is in your font collection, and if not, which one is the closest substitute.

In most cases these variations won’t substantially affect the meaning. Nevertheless, it is still important to stay as close as possible to the original.

When the text was digitized, these kludges were not corrected. Moreover, at that time—over a decade ago—the Unicode specification was not as complete as today, and there were many characters missing. These are represented with special codes, or even with images.

Today the Unicode specification has expanded to represent a fairly complete view of Chinese characters (not so the Indic characters, but that’s another story). We have the excellent Noto fonts that can display the entire range of glyphs.

Yet the digitized versions of the Taisho are not yet completely updated to the new specification. According to Lewis Lancaster, the SAT version is more updated than the CBETA, which is the one we have adapted for SC.
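One rough way to gauge how much of a digitized text still relies on placeholders is to count code points in the Unicode Private Use Areas. This is only a sketch: it assumes the missing characters are stored as private-use code points, which may not be how CBETA or SAT actually encode them (entity-style codes or images would need a different check). The function name is my own.

```python
import unicodedata
from collections import Counter

def private_use_characters(text):
    """Count characters whose Unicode general category is 'Co' (private use).

    Assumption: un-encoded characters were stored as private-use code points.
    """
    return Counter(ch for ch in text if unicodedata.category(ch) == "Co")
```

Running this over each file and sorting by count would show which texts most need updating to current Unicode.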

(There is a further problem with the Taisho and derived digital texts, which is that while most of the texts are taken from the famous Tripitaka Koreana, not all of them are. Apparently the text did not identify these, and it was only recently discovered that they derive from an inferior privately printed edition. However, I think it’s unlikely that this affects any of the texts on SC.)

Clearly it would be better to sidestep these problems by using an edition that is based directly on the Tripitaka Koreana. Such an edition exists; it was digitized by Prof Lancaster himself.

However I can’t find it online. (Most of the links on the relevant Wikipedia pages are broken and outdated, a problem that infests most of the Buddhist text pages on Wikipedia.) All I can find is a site with the scanned images. (warning: flash plugin required!) Does anyone know anything about this?

According to Prof Lancaster, his edition is also not updated to the latest Unicode specification, and he said it was unlikely that he would do this himself. Nevertheless, it is still the most accurate digitized text.

Of course the Taisho edition is still the standard for referencing, so there’s no need to remove it. But it would be nice to update the site with the more authentic text if it is available.

It’s a messy issue, but just one of many issues when reading and interpreting these texts. On CBETA, sometimes there are forms such as:


These are not so bad, and the basic components of the character are evident enough. At some point they could all be pretty easily identified and automatically replaced with the relevant Unicode characters. I would be more worried about mistakes that cannot be seen just by looking at the text, like problems in the original printed Taisho.
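As a rough sketch of that kind of automatic replacement, assume the composed forms appear in the text as bracketed expressions such as `[女*子]` (component, operator, component) and that a lookup table from expression to Unicode character is available. The table below is a made-up illustration, not CBETA's actual data, and the function name is hypothetical.

```python
import re

# Hypothetical lookup table: composed-form expression -> single Unicode character.
# These entries are illustrative only; a real table would come from the project's data.
COMPOSED_TO_UNICODE = {
    "[女*子]": "好",
    "[日*月]": "明",
}

# Matches one bracketed expression with no nested brackets, e.g. "[女*子]".
COMPOSED_FORM = re.compile(r"\[[^\[\]]+\]")

def replace_composed_forms(text, table=COMPOSED_TO_UNICODE):
    """Replace known composed forms and report the ones left untouched."""
    unresolved = []

    def substitute(match):
        expr = match.group(0)
        if expr in table:
            return table[expr]
        unresolved.append(expr)  # keep a record for manual review
        return expr

    return COMPOSED_FORM.sub(substitute, text), unresolved
```

Returning the unresolved expressions alongside the text keeps the replacement conservative: anything not in the table stays as it was and can be reviewed by hand.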

The issue of punctuation in the Taisho Tripitaka is really ugly too.

Occasionally I’ve had to ignore punctuation altogether when translating because it was obviously misplaced. While the Taisho is now viewed as the “standard” canon by scholars, traditionally China used the Longzang (Qianlong Tripitaka), which was woodblock printed and did not have any punctuation.

Scans of all the Longzang volumes can be found on the Internet (i.e. PDFs of images), but I have not come across actual text data for either the Tripitaka Koreana or the Longzang.

Thanks for the feedback.

If you’re interested, let me know!

This is the thing: what’s the point in further correcting the CBETA text if the basis for it is dubious?

Thanks for the essay, it illustrates the issues nicely. (whinge: What’s with these PDFs with unembedded fonts?)

Apparently the Fo Guang is better punctuated. But it’s not digitized.

On an unrelated note, it is disappointing to see the authors perpetuate the confusion about the syntax of evam me sutam. Texts in the Pali show beyond doubt that it is used as an independent clause meaning “this is what I heard (passed down from someone who was there)”, and should not be run into the following ekam samayam. Thus the regular Chinese punctuation is in fact correct in this (presumably they were influenced by the Indological consensus at the time.)

The fact that some later commentaries explain it otherwise is easy to explain. As it stands, the phrase contradicts a basic tenet of textual authenticity, that Ānanda was present at each and every discourse. He was not, and the phrase is meant as a reminder that we are dealing with an oral tradition, not a first-hand account. Perhaps we should translate it “This is what I heard of”, or more idiomatically, “So I heard.”

Do you have a link? If we can get image sets for this and the Koreana, we could show them. All we’d need is a set of data to equate the image with the Taisho vol/page details.

I’ll try contacting Prof Lancaster and see how we go.

I agree on that point, and even in Chinese without punctuation, it is clear that “one time” is the beginning of the next sentence. It would read very awkwardly otherwise. There are also some methodological problems with simply changing punctuation in one language based on that in another.

All the volumes in PDF format are hosted on the creatively-named:

It looks to be about 168 volumes ~30 MB apiece, so about 5 GB of data. The scans themselves are just simple black and white images.

The Chinese Wikipedia page has an outline of the structure of the canon:

The canon is divided into an actual “Tripitaka” format of Sutra 經, Vinaya 律, and Abhidharma 論, which is neat.

Thanks for this. I’ve had a look at the site. It looks like it’s unchanged since 2004, which must rank as one of the most stable websites ever!

The PDF files are reasonably well done. I’ve downloaded some of the Agama files, and I think we can do some useful things with them. They are not searchable or compressed, and the download on that site is extremely slow. We can make searchable compressed PDFs available at a decent download speed without much trouble, so that would be something.

More useful would be if we can make the images available while reading the Chinese texts, as we have done with the PTS. To do this, we need the following:

  1. Extract the images and convert to png (this is simple)
  2. Rename the images according to volume/page (this is simple as long as the source is reliable; if it omits or doubles pages it makes it harder.)
  3. Data that correlates these vol/page numbers with the Taisho vol/page/line reference.

It’s the last part that’s hard. It’s too much for us to do: might you know of any source that has this data already? We really need proper data, not scanned images, for this.

If we can get this data, we can make the images available.
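To make the three steps concrete, here is a minimal sketch of the naming scheme and the lookup. Everything specific in it is an assumption for illustration: the `lz` prefix, the file name pattern, the CSV column names, and the shape of the Taisho reference string are invented, and the concordance data itself is exactly the part we don't have yet.

```python
import csv
import io

def image_name(canon, volume, page):
    """Build a predictable file name, e.g. 'lz-v049-p0214.png' (hypothetical scheme)."""
    return f"{canon}-v{volume:03d}-p{page:04d}.png"

def load_concordance(csv_text, canon="lz"):
    """Map a Taisho reference string to the image file showing the same passage.

    Assumed CSV columns (hypothetical): taisho_ref, volume, page.
    """
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        lookup[row["taisho_ref"]] = image_name(canon, int(row["volume"]), int(row["page"]))
    return lookup
```

With a table like this, showing the image next to a passage is a single dictionary lookup on the passage's Taisho reference.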

It’s possible to do the same thing for the Tripitaka Koreana images. However, they are hidden behind a Flash web app, so I can’t download them. Do you know of any other sources?

I’m not sure where that might be found, or even if it can be found. Considering the Samyukta Agama, there are over a thousand individual sutras, which is a lot to go through. Instead, it could be divided up by fascicle, which would only be about 48 divisions. It’s still a fairly big project in all, though.

I wish I did, but I think these two canons have probably not had any big digitization effort as the Taisho has. I’m guessing the Longzang is just available because it is still widely available in printed format.

I was messing around with a PDF from the Longzang today out of curiosity, and I found that some of the pages are actually made from multiple images, and were scanned at different resolutions. However, most pages are just from single monochrome images at about 150 dpi. It’s mostly readable, although it kind of looks like a xerox copy of a copy. The original printed volumes appear to be much cleaner.

It definitely has that classic 1990s look.



Edited with Netscape 4.7 on Windows 98… Domain registered August 1999… These are the real heroes of the Internet. :anjal:

Okay, well, we’ll keep our eyes out.

In the volume I’ve checked, this is only the front matter; the bulk of the text is fine. But maybe not all are consistent.

Indeed, that’s a very nice image! It would be a shame to spend time on an inferior scanning. Given that it’s a simple mechanical task, perhaps we could sponsor a new scanning project. It’s probably quite easy to do in Taiwan.

Here’s an OCR-d PDF example of the Longzang text. I extracted the images, performed OCR with Tesseract, created the PDF and compressed it using smallpdf. Even compressed it is somewhat larger than the original, due to the presence of the OCR data.

I’d really appreciate it if you had a look at this and let me know if you think it’s useful. The OCR took a lot longer than I expected: about 18 hours. So before I go ahead and do the rest I’d want to know that it’s going to be worthwhile. It’s not such a big deal, the program runs in the background, but still.

Based on just a couple of searches I can see that it works pretty much, but is far from perfect. Anyway, have a look and let me know if you think this would be useful.

qt-49_ocr.pdf (40.9 MB)

On another note, I have had a further look at the Tripitaka Koreana website, and I’d appreciate any guidance you could give. Frankly, I can’t make head or tail of it. It just seems to go around in loops. There’s clearly a lot of very sophisticated stuff there, but I can’t figure out how you’re meant to find and read sutras, or even if they have actual text.

For that matter, the Tripitaka Koreana is also still available in printed form.


If you were to go down that route, and just interested in the texts for this site (EBT), they are all grouped together in the Tripitaka Koreana as…

小乗三蔵(No647-No978) "Hinayana Tripitaka", vols. 17-29
    小乗経蔵(No647-No888) "Hinayana Sutra Pitaka", vols. 17-20
        阿含部(No647-No800) Agama division, vols. 17-19
        単訳経(No801-No888) Single translated texts, vol. 20
    小乗律蔵(No889-No942) "Hinayana Vinaya Pitaka", vols. 21-24
    小乗論蔵(No943-No978) "Hinayana Abhidharma Pitaka", vols. 24-29

The Japanese Wikipedia page for that canon has the divisions. I’m not sure of the exact dimensions and format of the pages, though. It looks like different modern editions have somewhat different dimensions… some wide format, some smaller…

The characters identified are pretty unreliable (like maybe 50-60% correct), so it kind of defeats the purpose of scanning the canon. For example, in the OCR, these are all supposed to be the same…

弟子 and 弟于 and 弟孑 and 弟乎 and 弟予 and 矛子 and 茅子

However, if SC were to go ahead and use scans of the Koreana or the Longzang, then maybe the OCR data could be used for identifying where the individual sutras are located.
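Searching such noisy OCR output would have to tolerate misreadings like the ones listed above. A minimal sketch, assuming we keep a small hand-made table of characters the OCR tends to confuse (the table here only covers the 弟子 example, and the names are my own):

```python
import re

# Hand-made confusion sets, taken from the 弟子 variants seen in the OCR output.
CONFUSABLE = {
    "弟": "弟矛茅",
    "子": "子于孑乎予",
}

def tolerant_pattern(term):
    """Compile a regex that matches term or any listed OCR misreading of it."""
    parts = []
    for ch in term:
        alternatives = CONFUSABLE.get(ch, ch)
        # One character class per position, listing every accepted reading.
        parts.append("[" + re.escape(alternatives) + "]")
    return re.compile("".join(parts))
```

The trade-off is more false positives: the pattern for 弟子 also accepts combinations such as 矛于 that never appeared in the OCR, so hits would still need a quick visual check.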

Using pdftotext, I extracted the OCR data, which is plain Unicode text with an ASCII “form feed” character beginning each page. With some amount of manual effort, the OCR data might be helpful to find exactly where each sutra begins and ends. So for example, the Madhyama Agama has 222 sutras, and we might be able to tell that sutra 24 is located on pages 214-230 (not really, just an example). The OCR could help provide that data.
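For what it's worth, a small sketch of how that page lookup could work. It treats each form feed as a page separator, so the text before the first form feed counts as page 1; if the pages in a given file are delimited differently, the numbering would be off by one. The function names are my own.

```python
def split_pages(ocr_text):
    """Split pdftotext-style output into pages on the form feed character."""
    pages = ocr_text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages

def pages_containing(ocr_text, marker):
    """Return the 1-based page numbers whose text contains marker."""
    return [
        number
        for number, page in enumerate(split_pages(ocr_text), start=1)
        if marker in page
    ]
```

Calling `pages_containing(text, "品")` would then give the candidate pages to check by eye for chapter headings.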

In the Madhyama Agama specifically, the texts are identified on the pages with just a number like 三四, which is really difficult to work with given the OCR quality. Better results can be had when grepping for 品, which gives some 159 results of the 222 total, which is really not too bad. Other texts such as the Samyukta Agama and Ekottarika Agama would probably be more challenging, though.

I don’t quite get that Tripitaka Koreana website either. I was able to view a few texts that were tagged for one reason or another (just images through flash, not actual text data). For example, here you can see a few. It may be necessary to sign up to access more features, though, because I see some features are only available to members.

What would solve a lot of these problems with the Chinese canon would be a major effort to do a careful and proper Unicode digitization of the Tripitaka Koreana. Then the characters could be easily compared between the two canons, and the Taisho Tripitaka could be corrected.
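Once both texts exist as Unicode, the comparison itself is mechanical. A minimal sketch using Python's standard `difflib`, assuming the two versions of a passage are already roughly aligned (same text, same starting point); the function name is my own:

```python
import difflib

def character_differences(base, other):
    """List the spans where two versions of a passage differ.

    Each item is (tag, text_in_base, text_in_other), where tag is
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    return [
        (tag, base[i1:i2], other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

This only reports where the readings differ; deciding which reading is correct, or whether a difference is just an alternate form of the same character, still needs a human editor.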

I did not know this. Is this a true printed edition, or a reproduction of the pencil rubbings from the woodblocks? The latter would be most interesting for us.

Ouch. Okay, I think we can file this under, “not a good idea …”

Maybe, it’s an interesting idea. But unless we can get much better reliability, it’s going to need so much hand-checking that it may not save any time. To do the OCR on all these files, we’re talking about running one of my CPUs at 100% for three weeks …

Glad that it’s not just me!

I also managed to see the images you linked to, after borrowing a computer that has Flash installed. From there, I could find what seemed to be actual digitized text, and a widget for breaking up and analyzing the characters. But I couldn’t get anything through the main links on the site.

I tried downloading the whole site, and stopped after a couple of hours of getting nothing but junk gifs and the like.

I’ve found out some more on the TK site. The project summary page is actually quite helpful:

It seems they do have a digitized text. This has both a “Variant Character” version, which uses their own in-house encoding to accurately represent the original, and a “Standard Character Version”, which approximates these in Unicode.

The standard character version was completed in 2004; since then, over 10,000 characters have been added to the CJK Unicode standard.

They also have “collation data” of TK and Taisho in an Excel file. No idea where this is, though! Or, for that matter, how to access anything else.

I didn’t know that either because I always saw sets of the Longzang being sold instead. I just searched for the Tripitaka Koreana 高麗大藏經, and it came up with printed editions. It looks like different series are done in different formats, though, at different qualities. I don’t know exactly where the images came from…

On the matter of Tripitaka Koreana alternate characters, one thing that has been in the back of my mind is just that Chinese texts did sometimes use different forms of the same character. I don’t know how it breaks down, but I wouldn’t be surprised if there were quite a few alternate forms between canons.

I just noticed another revealing context for this. In AN 4.183, a brahmin is describing to the Buddha how he thinks you can avoid fault in speech. He says there’s nothing wrong with speaking about what you’ve seen, heard, thought, or cognized, as long as you say that you’ve seen, heard, thought, or cognized it.

Here, the normal evam me sutam is joined by the parallel phrases evam me dittham and so on.

The point is that the use of the phrases insulates you against the possibility of false speech by making clear what your sources are. It’s a bit like Wikipedia: you’re not saying these things are true, merely that they are attested.

So the original purpose of the phrase would seem to be to attest that a text is passed down by oral tradition, and therefore needs to be treated as such, not as a literal witness of the truth.

The traditions, by baselessly claiming that Ānanda literally heard all of the Suttas, fall into precisely the trap that the phrase was meant to guard against. The Buddha’s epistemological caution is abandoned in favor of fundamentalist absolutism. And so it goes.

In the book Spreading Buddha’s Word in East Asia, there is more material on the approach of the CBETA project, and some of it is quite interesting. The major takeaway, at least for me, is that CBETA is attempting to create a new and revised canon rather than just digitizing the Taisho Canon. A digitized Taisho Canon is more like their starting point.

They are not just adding punctuation, but also paragraph breaks, and also correcting mistakes that were present even in the printed Taisho, by comparing the CBETA versions to those in previous canons, including the Tripitaka Koreana.

Of course, all of this depends on volunteer efforts, and so it is not exactly clear how thoroughly this checking is done, and to what extent it has already been completed. However, their approach is to change the texts in the canon where they are found to be unreliable:

It seems their goal is to create a fully corrected edition of the canon through comparison, and keeping track of all the changes and versioning with metadata. For example, if a character was changed, the reader can flip between versions using the CBETA reader software.

So their goals are actually pretty ambitious and aimed at making a canon without errors. I’m not sure if or how they measure that sort of progress, though…

On a slightly different note, their goals are actually very broad and even some texts from the Tripitaka Koreana and other canons have been digitized and made available through the CBETA website.

Included here is…

漢譯南傳大藏經 Chinese Translation of the Southern Transmission of the Tripitaka

Basically, Chinese translations of the Pali Canon, translated in the 1980s in Taiwan. I’m not sure exactly how the canon is organized or exactly what it covers, though. Since the translations were done more recently, they are likely covered by ordinary copyright law and not in the public domain.

Thanks for the info. It’s a huge job; I hope they can keep up the momentum. Hopefully while in Taiwan I’ll have the chance to talk to them about these things.

Apparently these translations are complete, but not of great quality. I think they were made from the Japanese. I believe the translations we use are better.

That would be really interesting. After just this short discussion, I am left wondering more about how to verify the accuracy of these canons. It’s certainly a huge task.

For that matter, earlier imperial canons also had some issues with accuracy, so it’s nothing new. With such a large project and a huge number of potential characters, it is basically inevitable that some errors will work their way in. Yet mechanical verification of the canon would require something to check against: other digitized canons.

This type of information is interesting, and I am sure Pootle is able to capture these sorts of common phrases. With the size of the Pali Canon and its antiquity, it seems that it would be a better source to look to on stock phrases such as this.

I’m slowly realizing how much very basic information is not readily available on the Web to a general readership, so I did a little write-up introducing some historical Chinese Buddhist canons like the Longzang and others.


Thanks so much, I’ve promoted this to its own thread.

Thank you llt.