Some issues with the Chinese texts

Thanks for this. I’ve had a look at the site. It looks like it’s unchanged since 2004, which must make it one of the most stable websites ever!

The PDF files are reasonably well done. I’ve downloaded some of the Agama files, and I think we can do some useful things with them. They are not searchable or compressed, and the download on that site is extremely slow. We can make searchable compressed PDFs available at a decent download speed without much trouble, so that would be something.

More useful would be if we can make the images available while reading the Chinese texts, as we have done with the PTS. To do this, we need the following:

  1. Extract the images and convert to png (this is simple)
  2. Rename the images according to volume/page (this is simple as long as the source is reliable; if it omits or doubles pages it makes it harder.)
  3. Data that correlates these vol/page numbers with the Taisho vol/page/line reference.

It’s the last part that’s hard. It’s too much for us to do: might you know of any source that has this data already? We really need proper data, not scanned images, for this.

If we can get this data, we can make the images available.
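Step 2 above (renaming by volume/page) could be sketched roughly like this. It is only an illustration: the `plan_renames` helper and the `vol049_p0001.png` naming pattern are made up for the example, and it assumes the extracted images sort into page order with nothing omitted or doubled, which is exactly the condition that may not hold.

```python
from pathlib import Path


def plan_renames(image_names, volume, first_page=1):
    """Map extracted image names to volume/page names.

    Assumes one image per page, that sorting the names gives page order
    (i.e. zero-padded counters), and that no page is omitted or doubled.
    """
    plan = {}
    for offset, name in enumerate(sorted(image_names)):
        page = first_page + offset
        suffix = Path(name).suffix  # keep the original extension, e.g. ".png"
        plan[name] = f"vol{volume:03d}_p{page:04d}{suffix}"
    return plan
```

Building the plan first, rather than renaming straight away, leaves room to eyeball a few pages against the printed page numbers before touching any files.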

It’s possible to do the same thing for the Tripitaka Koreana images. However, they are hidden behind a Flash web app, so I can’t download them. Do you know of any other sources?

I’m not sure where that might be found, or even if it can be found. Considering the Samyukta Agama, there are over a thousand individual sutras, which is a lot to go through. Instead, it could be divided up by fascicle, which would only be about 48 divisions. It’s still a fairly big project in all, though.

I wish I did, but I think these two canons have probably not had any big digitization effort as the Taisho has. I’m guessing the Longzang is just available because it is still widely available in printed format.

I was messing around with a PDF from the Longzang today out of curiosity, and I found that some of the pages are actually made from multiple images that were scanned at different resolutions. However, most pages are just single monochrome images at about 150 dpi. It’s mostly readable, although it kind of looks like a xerox copy of a copy. The original printed volumes appear to be much cleaner.

It definitely has that classic 1990s look.



Edited with Netscape 4.7 on Windows 98… Domain registered August 1999… These are the real heroes of the Internet. :anjal:

Okay, well, we’ll keep our eyes out.

In the volume I’ve checked, this is only the front matter; the bulk of the text is fine. But the volumes may not all be consistent.

Indeed, that’s a very nice image! It would be a shame to spend time on an inferior scan. Given that it’s a simple mechanical task, perhaps we could sponsor a new scanning project. It’s probably quite easy to do in Taiwan.

Here’s an OCR-d PDF example of the Longzang text. I extracted the images, performed OCR with Tesseract, created the PDF and compressed it using smallpdf. Even compressed it is somewhat larger than the original, due to the presence of the OCR data.

I’d really appreciate it if you had a look at this and let me know if you think it’s useful. The OCR took a lot longer than I expected: about 18 hours. So before I go ahead and do the rest I’d want to know that it’s going to be worthwhile. It’s not such a big deal, the program runs in the background, but still.

Based on just a couple of searches I can see that it pretty much works, but is far from perfect. Anyway, have a look and let me know if you think this would be useful.

qt-49_ocr.pdf (40.9 MB)
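For anyone wanting to try the OCR step described above, here is a minimal sketch that only builds the command lines rather than running them. The post doesn’t say which tool was used to extract the images, so `pdfimages` (from Poppler) is an assumption, as are the `chi_tra` language model and the file names; `ocr_commands` is just a hypothetical helper.

```python
def ocr_commands(pdf_path, image_prefix, page_image, out_base, lang="chi_tra"):
    """Build (but do not run) the commands for extracting and OCR-ing one page.

    Assumptions: Poppler's pdfimages does the extraction, and Tesseract has
    the traditional Chinese model installed under the name given in `lang`.
    """
    # pdfimages writes files named <image_prefix>-000.png, -001.png, ...
    extract = ["pdfimages", "-png", pdf_path, image_prefix]
    # The trailing "pdf" config asks Tesseract for a searchable <out_base>.pdf
    ocr = ["tesseract", page_image, out_base, "-l", lang, "pdf"]
    return extract, ocr
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. The per-page PDFs would still need to be merged and compressed afterwards, which this sketch doesn’t cover.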

On another note, I have had a further look at the Tripitaka Koreana site, and I’d appreciate any guidance you could give. Frankly, I can’t make head or tail of the website. It just seems to go around in loops. There’s clearly a lot of very sophisticated stuff there, but I can’t figure out how you’re meant to find and read sutras, or even if they have actual text.

For that matter, the Tripitaka Koreana is also still available in printed form.


If you were to go down that route, and were just interested in the texts for this site (EBT), they are all grouped together in the Tripitaka Koreana as…

小乗三蔵(No647-No978) "Hinayana Tripitaka", vols. 17-29
    小乗経蔵(No647-No888) "Hinayana Sutra Pitaka", vols. 17-20
        阿含部(No647-No800) Agama division, vols. 17-19
        単訳経(No801-No888) Single translated texts, vol. 20
    小乗律蔵(No889-No942) "Hinayana Vinaya Pitaka", vols. 21-24
    小乗論蔵(No943-No978) "Hinayana Abhidharma Pitaka", vols. 24-29
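As a rough illustration, the number ranges in that outline can be turned into a small lookup from a text number to its division. The ranges and volume numbers are copied from the list above; the `find_division` helper itself is hypothetical.

```python
# (first_no, last_no, division, volumes), taken from the outline above
DIVISIONS = [
    (647, 800, "阿含部 Agama division", "17-19"),
    (801, 888, "単訳経 Single translated texts", "20"),
    (889, 942, "小乗律蔵 Hinayana Vinaya Pitaka", "21-24"),
    (943, 978, "小乗論蔵 Hinayana Abhidharma Pitaka", "24-29"),
]


def find_division(text_no):
    """Return (division, volumes) for a Tripitaka Koreana text number, or None."""
    for first, last, name, volumes in DIVISIONS:
        if first <= text_no <= last:
            return name, volumes
    return None  # outside the No647-No978 range listed above
```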

The Japanese Wikipedia page for that canon has the divisions. I’m not sure of the exact dimensions and format of the pages, though. It looks like different modern editions have somewhat different dimensions… some wide format, some smaller…

The characters identified are pretty unreliable (like maybe 50-60% correct), so it kind of defeats the purpose of scanning the canon. For example, in the OCR, these are all supposed to be the same…

弟子 and 弟于 and 弟孑 and 弟乎 and 弟予 and 矛子 and 茅子
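One way to search despite these misreadings might be to match each character against its known look-alikes. This is just a sketch: the `CONFUSABLES` table contains only the variants listed above, and a real table would have to be built up by hand from the OCR output.

```python
import re

# Look-alikes seen in the OCR output above for each character of 弟子
CONFUSABLES = {
    "弟": "弟矛茅",
    "子": "子于孑乎予",
}


def fuzzy_pattern(term):
    """Compile a regex that accepts any listed look-alike for each character."""
    parts = []
    for ch in term:
        alternatives = CONFUSABLES.get(ch, ch)  # unknown characters match only themselves
        parts.append("[" + re.escape(alternatives) + "]")
    return re.compile("".join(parts))
```

Note that this over-matches: the pattern for 弟子 also accepts combinations such as 矛于 that may be genuine text elsewhere, so any hits would still need checking by eye.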

However, if SC were to go ahead and use scans of the Koreana or the Longzang, then maybe the OCR data could be used for identifying where the individual sutras are located.

Using pdftotext, I extracted the OCR data, which is plain Unicode text with an ASCII “form feed” character beginning each page. With some amount of manual effort, the OCR data might be helpful to find exactly where each sutra begins and ends. So for example, the Madhyama Agama has 222 sutras, and we might be able to tell that sutra 24 is located on pages 214-230 (not really, just an example). The OCR could help provide that data.
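A minimal sketch of working with that extracted text: split it into pages on the form feed character, then list the pages where a marker appears. The function names are made up for illustration. The `feed_starts_page` switch is there because I’m going by the description above that the form feed begins each page; if the output instead ends each page with one, pass `False`.

```python
def split_pages(ocr_text, feed_starts_page=True):
    """Split pdftotext output into a list of page texts on the form feed."""
    pieces = ocr_text.split("\f")
    if feed_starts_page:
        # A leading form feed leaves an empty piece before the first page
        return pieces[1:] if pieces[0] == "" else pieces
    # A trailing form feed leaves an empty piece after the last page
    return pieces[:-1] if pieces[-1] == "" else pieces


def pages_containing(pages, marker):
    """Return the 1-based page numbers whose text contains the marker."""
    return [n for n, text in enumerate(pages, start=1) if marker in text]
```

The page numbers here are positions in the PDF, not printed page numbers, so they would still need to be mapped to volume/page by hand.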

In the Madhyama Agama specifically, the texts are identified on the pages with just a number like 三四, which is really difficult to work with given the OCR quality. Better results can be had when grepping for 品, which gives some 159 results of the 222 total, which is really not too bad. Other texts such as the Samyukta Agama and Ekottarika Agama would probably be more challenging, though.
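If the number markers were to be used at all, they would first need converting to integers. Here is a small sketch, assuming that a marker like 三四 is read digit by digit as 34 (I haven’t confirmed that this is how every volume numbers its texts). The `positional_to_int` helper is hypothetical, and it deliberately rejects anything containing a non-digit, since a misread character is more likely than not given the OCR quality.

```python
DIGITS = {"〇": 0, "○": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}


def positional_to_int(numeral):
    """Read a digit-by-digit numeral such as 三四 as 34.

    Returns None for an empty string or if any character is not a digit.
    Forms using 十, 百 and so on are not handled.
    """
    if not numeral:
        return None
    value = 0
    for ch in numeral:
        if ch not in DIGITS:
            return None
        value = value * 10 + DIGITS[ch]
    return value
```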

I don’t quite get that Tripitaka Koreana website either. I was able to view a few texts that were tagged for one reason or another (just images through flash, not actual text data). For example, here you can see a few. It may be necessary to sign up to access more features, though, because I see some features are only available to members.

What would solve a lot of these problems with the Chinese canon would be a major effort to do a careful and proper Unicode digitization of the Tripitaka Koreana. Then the characters could be easily compared between the two canons, and the Taisho Tripitaka could be corrected.
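To give an idea of what such a comparison could look like once both canons exist as Unicode text, here is a minimal sketch using Python’s standard `difflib`. The `character_differences` helper is hypothetical, and it assumes the two passages have already been lined up so that they cover the same stretch of text.

```python
from difflib import SequenceMatcher


def character_differences(text_a, text_b):
    """List (position_in_a, chars_in_a, chars_in_b) wherever the texts differ."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions give an empty chars_in_a, deletions an empty chars_in_b
            diffs.append((a1, text_a[a1:a2], text_b[b1:b2]))
    return diffs
```

A difference found this way only says that the two editions disagree. Deciding which reading is right, or whether it is just an alternate form of the same character, would still be editorial work.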

I did not know this. Is this a true printed edition, or a reproduction of the pencil rubbings from the woodblocks? The latter would be most interesting for us.

Ouch. Okay, I think we can file this under, “not a good idea …”

Maybe, it’s an interesting idea. But unless we can get a much better reliability, it’s going to need so much hand-checking that it may not save any time. To do the OCR on all these files, we’re talking about running one of my CPUs at 100% for three weeks …

Glad that it’s not just me!

I also managed to see the images you linked to, after borrowing a computer that has Flash installed. From there, I could find what seemed to be actual digitized text, and a widget for breaking up and analyzing the characters. But I couldn’t get anything through the main links on the site.

I tried downloading the whole site, and stopped after a couple of hours of getting nothing but junk gifs and the like.

I’ve found out some more on the TK site. The project summary page is actually quite helpful:

It seems they do have a digitized text. This has both a “Variant Character” version, which uses their own in-house encoding to accurately represent the original, and a “Standard Character Version”, which approximates these in Unicode.

The standard character version was completed in 2004; since then there have been over 10,000 characters added to the CJK Unicode standard.

They also have “collation data” of TK and Taisho in an Excel file. No idea where this is, though! Or, for that matter, how to access anything else.

I didn’t know that either because I always saw sets of the Longzang being sold instead. I just searched for the Tripitaka Koreana 高麗大藏經, and it came up with printed editions. It looks like different series are done in different formats, though, at different qualities. I don’t know exactly where the images came from…

On the matter of Tripitaka Koreana alternate characters, one thing that has been in the back of my mind is just that Chinese texts did sometimes use different forms of the same character. I don’t know how it breaks down, but I wouldn’t be surprised if there were quite a few alternate forms between canons.

I just noticed another revealing context for this. In AN 4.183, a brahmin is describing to the Buddha how he thinks you can avoid fault in speech. He says there’s nothing wrong with speaking about what you’ve seen, heard, thought, or cognized, as long as you say that you’ve seen, heard, thought, or cognized it.

Here, the normal evam me sutam is joined by the parallel phrases evam me dittham and so on.

The point is that the use of the phrases insulates you against the possibility of false speech by making clear what your sources are. It’s a bit like Wikipedia: you’re not saying these things are true, merely that they are attested.

So the original purpose of the phrase would seem to be to attest that a text is passed down by oral tradition, and therefore needs to be treated as such, not as a literal witness of the truth.

The traditions, by baselessly claiming that Ānanda literally heard all of the Suttas, fall into precisely the trap that the phrase was meant to guard against. The Buddha’s epistemological caution is abandoned in favor of fundamentalist absolutism. And so it goes.

In the book Spreading Buddha’s Word in East Asia, there is more material on the approach of the CBETA project, and some of it is quite interesting. The major takeaway, at least for me, is that CBETA is attempting to create a new and revised canon rather than just digitizing the Taisho Canon. A digitized Taisho Canon is more like their starting point.

They are not just adding punctuation, but also paragraph breaks, and also correcting mistakes that were present even in the printed Taisho, by comparing the CBETA versions to those in previous canons, including the Tripitaka Koreana.

Of course, all of this depends on volunteer efforts, and so it is not exactly clear how thoroughly this checking is done, and to what extent it has already been completed. However, their approach is to change the texts in the canon where they are found to be unreliable:

It seems their goal is to create a fully corrected edition of the canon through comparison, and keeping track of all the changes and versioning with metadata. For example, if a character was changed, the reader can flip between versions using the CBETA reader software.
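To make the idea concrete, here is a toy sketch of keeping each change as a record next to the base text, so a reader can switch between the original and corrected readings. This is not CBETA’s actual data model, which I don’t know; the `Correction` record, its fields, and the `render` function are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    position: int   # index of the character in the base text
    original: str   # reading in the base edition
    corrected: str  # revised reading
    source: str     # where the revised reading comes from


def render(base_text, corrections, use_corrected=True):
    """Return the base text with the recorded corrections switched on or off."""
    chars = list(base_text)
    for c in corrections:
        if chars[c.position] != c.original:
            raise ValueError(f"record at {c.position} does not match the base text")
        if use_corrected:
            chars[c.position] = c.corrected
    return "".join(chars)
```

Because the base text is never overwritten, every change stays reversible and carries a note of where it came from.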

So their goals are actually pretty ambitious and aimed at making a canon without errors. I’m not sure if or how they measure that sort of progress, though…

On a slightly different note, their goals are actually very broad and even some texts from the Tripitaka Koreana and other canons have been digitized and made available through the CBETA website.

Included here is…

漢譯南傳大藏經 Chinese Translation of the Southern Transmission of the Tripitaka

Basically, Chinese translations of the Pali Canon, translated in the 1980s in Taiwan. I’m not sure exactly how the canon is organized or exactly what it covers, though. Since the translations were done more recently, they are likely covered by ordinary copyright law and not in the public domain.

Thanks for the info. It’s a huge job, I hope they can keep up the momentum. Hopefully while in Taiwan I’ll have the chance to talk to them about these things.

Apparently these translations are complete, but not of great quality. I think they were made from the Japanese. I believe the translations we use are better.

That would be really interesting. After just this short discussion, I am left wondering more about how to verify the accuracy of these canons. It’s certainly a huge task.

For that matter, earlier imperial canons also had some issues with accuracy, so it’s nothing new. With such a large project and a huge number of potential characters, it is basically inevitable that some errors will work their way in. Yet mechanical verification of the canon would require something to check against: other digitized canons.

This type of information is interesting, and I am sure Pootle is able to capture these sorts of common phrases. With the size of the Pali Canon and its antiquity, it seems that it would be a better source to look to on stock phrases such as this.

I’m slowly realizing how much very basic information is not readily available on the Web to a general readership, so I did a little write-up introducing some historical Chinese Buddhist canons like the Longzang and others.


Thanks so much, I’ve promoted this to its own thread.

Thank you llt.

Glad that it’s found helpful. It can probably be revised and expanded a bit in the future as well. :smile_cat:

If so, that would be helpful.

I have a reasonable idea of how the texts were brought to China and translated. But I don’t know about what happened next. How were the texts gathered into canons, and how were those canons passed down? Any clarity would be most appreciated!

On a slightly different note, perhaps I may be so bold as to make another request. I’ve found it very useful to check the Chinese texts when comparing with the Pali. I have studied Chinese only very little, and rely almost exclusively on the lookup tool on SC. Fortunately, Chinese lookup is a lot easier than Pali, and we have an excellent dictionary to back us up.

But it would be nice to have a short, simple how-to for people who’d like to do this. Not necessarily to be able to read Chinese texts in depth, which obviously takes serious study, but to check up a different reading or variation. Something that would encourage people to at least give it a go, so that these texts are not hidden away behind walls of obscurity. Perhaps even an online course? Anyway, just a thought.

Good point, and I think that whole development is often overlooked.

Sure, I’ve made a little post about this here:

I recommend others give it a go and try for themselves. The biggest hurdle is just the vocabulary. Once the vocabulary is clear, then the grammar is usually pretty apparent as well. The lookup tool will give most of the vocabulary, although there are always some occasional problems with grouping and false positives (but it’s not a big deal most of the time).

Wow, so good, thank you! Hopefully this will encourage some people to look beyond the Pali texts.