I had a bit of a chat a few days ago with Prof Lewis Lancaster. He’s one of the luminaries of modern Buddhist studies, and a pioneer in the area of digitizing the Chinese texts.
He mentioned some issues with the Chinese texts that I was not aware of, so I thought I’d post them here. Perhaps some of our Chinese-speaking users might have more to add on this.
Prof Lancaster said that when they printed the Taisho canon the fonts available to the printers did not have all the characters they needed to fully represent all that was in the manuscripts. So they kludged it to some degree, substituting modern versions of characters, or using the closest in form.
With the tens of thousands of Chinese characters, can you imagine what this job was like? Seeing a shape on a page—maybe blurry or indistinct—and trying to ascertain whether that exact form is in your font collection, and if not, which one is the closest substitute.
In most cases these variations won’t substantially affect the meaning. Nevertheless, it is still important to stay as close as possible to the original.
When the text was digitized, these kludges were not corrected. Moreover, at that time (over a decade ago) the Unicode specification was not as complete as it is today, and many characters were missing. The missing characters are represented with special codes, or even with images.
Today the Unicode specification has expanded to represent a fairly complete view of Chinese characters (not so the Indic characters, but that’s another story). We have the excellent Noto fonts that can display the entire range of glyphs.
However, the digitized versions of the Taisho have not yet been completely updated to the new specification. According to Lewis Lancaster, the SAT version is more up to date than the CBETA version, which is the one we have adapted for SC.
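For anyone who wants a rough idea of how much of a given text file is affected, here is a minimal sketch in Python. It assumes the stand-in "special codes" are stored as Unicode Private Use Area code points, which may not be how CBETA or SAT actually encode them; placeholders expressed as markup, entity references, or images would not be caught. The function name `audit_codepoints` and the sample string are purely illustrative.

```python
import unicodedata
from collections import Counter


def audit_codepoints(text):
    """Count characters that may be stand-ins for glyphs missing from Unicode.

    Assumption: placeholders are Private Use Area code points (category "Co").
    Unassigned code points ("Cn") are counted too, since they suggest the text
    was encoded against a different or nonstandard character set.
    """
    counts = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Co":
            counts["private_use"] += 1
        elif category == "Cn":
            counts["unassigned"] += 1
        elif ord(ch) > 0xFFFF and category == "Lo":
            # Letters outside the Basic Multilingual Plane, such as the CJK
            # extension blocks added in later Unicode versions.
            counts["supplementary_letter"] += 1
    return counts


# Hypothetical sample: ordinary characters, one private-use placeholder
# (U+E000), and one CJK Extension B character (U+20000).
sample = "如是我聞\ue000一時\U00020000"
print(audit_codepoints(sample))
```

A high private-use count would suggest the file still relies on placeholders, while the supplementary count gives a sense of how many rarer characters need a font with wide coverage to display.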
(There is a further problem with the Taisho and derived digital texts, which is that while most of the texts are taken from the famous Tripitaka Koreana, not all of them are. Apparently the text did not identify these, and it was only recently discovered that they derive from an inferior privately printed edition. However, I think it’s unlikely that this affects any of the texts on SC.)
Clearly it would be better to sidestep these problems by using an edition that is based directly on the Tripitaka Koreana. Such an edition exists; it was digitized by Prof Lancaster himself.
However, I can’t find it online. (Most of the links on the relevant Wikipedia pages are broken or outdated, a problem that infests most of the Buddhist text pages on Wikipedia.) All I can find is a site with the scanned images. (Warning: Flash plugin required!) Does anyone know anything about this?
According to Prof Lancaster, his edition is also not updated to the latest Unicode specification, and he said it was unlikely that he would do this himself. Nevertheless, it is still the most accurate digitized text.
Of course the Taisho edition is still the standard for referencing, so there’s no need to remove it. But it would be nice to update the site with the more authentic text if it is available.