Changes in the Chinese source texts

Up until now, we have always used CBETA as the primary source for our Chinese texts. One of the major technical problems in dealing with ancient Chinese texts is the huge number of characters, including thousands that are not used in modern Chinese. These are known as gaiji.

CBETA developed an XML specification for handling these, which looks like this:

共比丘尼<span class="gaiji">鬪<!--gaiji,鬥,1[鬥@(豆*寸)],2&#x9B2A;,3--> </span>諍時
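To make the format concrete, here’s a minimal Python sketch of how such a span might be parsed. My reading of the comment fields is an assumption based on this one example: the literal tag gaiji, a normalized character, a composition description, and an HTML entity giving the Unicode codepoint.

    import html
    import re

    # Assumed field layout: gaiji, <normalized form>, 1<composition>,
    # 2<HTML entity for the Unicode codepoint>, 3<trailing field>.
    GAIJI_RE = re.compile(
        r'<span class="gaiji">(?P<glyph>[^<]*)'
        r'<!--gaiji,(?P<norm>[^,]*),1(?P<comp>[^,]*),2(?P<entity>[^,]*),3.*?-->'
    )

    def extract_gaiji(markup):
        """Yield (inline glyph, normalized form, decoded Unicode character)."""
        for m in GAIJI_RE.finditer(markup):
            yield (m.group('glyph').strip(),
                   m.group('norm'),
                   html.unescape(m.group('entity')))  # '&#x9B2A;' -> '鬪'

    sample = '共比丘尼<span class="gaiji">鬪<!--gaiji,鬥,1[鬥@(豆*寸)],2&#x9B2A;,3--> </span>諍時'
    print(list(extract_gaiji(sample)))  # [('鬪', '鬥', '鬪')]

The point being that the real Unicode character is buried in a comment, rather than simply sitting in the text.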

Cute, right? So far my policy with gaiji has been to get a cup of coffee and hope it goes away. It’s been years of diligent effort, but hey, it looks like it’s finally paying off!

Unicode is expanding, and many of the gaiji formerly requiring special handling no longer do so. It seems, based on a recent chat with Lewis Lancaster, that the other digital Chinese Tripitaka, the SAT project, is now more advanced than CBETA in this regard. Their home page has a discussion of this issue.

http://21dzk.l.u-tokyo.ac.jp/SAT/index_en.html

If you check the relevant Wikipedia article, you’ll see that SAT is in fact the largest single contributor to the most recent expansion of Unicode CJK coverage, aka “Extension F”:

I don’t know how far along this process is. Have they encoded all the necessary characters? Are there still more to be done? There’s a PDF explanation on the SAT site, so perhaps @vimalanyani, you can look at this and see what we can glean from it.

I’m not really sure what CBETA is doing about this, so perhaps we should find out. In any case, the Chinese files on SC are a few years old, so they don’t reflect any recent changes.

If, as it seems, SAT is now taking the lead in this field, perhaps we should switch to them as our source? Given that we are now beginning a new generation of translations on Pootle, should we use it as a testing ground for the new texts?

So here’s the glass half empty.

Just because something’s in Unicode doesn’t mean you can see it: you need a font that actually contains the characters. Go here to view the new characters:

You will, I suspect, see what I see: a page of empty boxes, or “tofu”, like this.

Now, I have the latest version of Noto CJK installed; the whole point of Noto is “no tofu”, but here we are.

Noto CJK is the font we use on SC for this, chosen for several reasons, one of them being its extensive coverage. The number of glyphs is limited by the font file format: OpenType caps a single font at 65,535 glyphs, so it is literally impossible to stick more in the file.

Nevertheless, it is far from complete: only about 30,000 of the 74,616 CJK unified ideographs defined in Unicode version 6.0 are covered by Noto fonts. Google are currently rolling out the Phase 3 extension, which will cover Plane 0 (BMP) CJK characters in Unicode 9.0, i.e. not all the archaic glyphs, and not the latest additions. They do ultimately plan to cover all of Unicode, but this will take time. To get all the glyphs working, they’ll have to split the font into multiple subsets. Or else, being Google, maybe they’ll just invent a new font file format.
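Incidentally, if you want to check for yourself whether a particular font file covers a given codepoint, the Python fontTools library makes this easy. A quick sketch, with a hypothetical font path:

    # pip install fonttools
    from fontTools.ttLib import TTFont

    def covers(font_path, char):
        cmap = TTFont(font_path).getBestCmap()  # Unicode codepoint -> glyph name
        return ord(char) in cmap

    # U+2CEB0 is the first codepoint in CJK Extension F.
    print(covers('NotoSansCJKtc-Regular.otf', '\U0002CEB0'))

If this prints False, that character will render as tofu, unless some other installed font happens to cover it.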

Now, go here:

http://21dzk.l.u-tokyo.ac.jp/SAT/unicode/satunicode.html

This is the same list of glyphs as found on the above Wikipedia page. But hey! You can see them all. V. cool. (There are some boxy shapes here; these aren’t tofu, they’re ideographic description characters, that’s how they’re meant to look!) Wikipedia uses system fonts, so you just use what’s on your computer. SAT uses @font-face (like SC), so they send you the font over the wire.

SAT has designed its own font with these characters. What they’ve done is to expand the Hanazono Mincho fonts to include all the recently updated Unicode glyphs. That’s great work on their part, to make so many glyphs freely available so quickly! You can grab the latest versions here:

Throw them in your font folder, and the tofu on the Wikipedia page looks like this:

Yay!


Okay, so what should we do about it?

We should embrace the most recent Unicode spec, especially in a case like this where it is clearly superior. Going back over all our Chinese texts, however, would be a huge undertaking: say, three months’ full-time work. Maybe we could figure out a way to automate or partly automate it, but it won’t be easy. The problem is not converting the HTML files per se, but that we frequently structure the texts quite differently, by sutta rather than by folio. Anyway, I haven’t worked with the SAT source code, so maybe it’s possible. On initial inspection, though, their source code looks brutal: spans for literally every character, and no, repeat no, structural markup at all. Just a series of lines, with nothing to indicate a title or anything. Yikes!

We could, however, start small, with the new translation of the Mahasanghika Vinaya by @vimalanyani on Pootle. It’s a fairly small text, with a well-known and predictable structure. Once it’s done, we can look at deploying it to SC.

We then have to figure out how to handle the fonts on SC. Basically, we’ll have to use Hanazono Mincho until and unless Noto expands to Extension F, which will be years if ever. That’s fine, we can use Hanazono Mincho as a text font, and keep Noto for the UI. Anyway, it’s a bit of fiddling to sort that out, but it is some way down the track. The bottom line is, we can finally handle most (all?) archaic characters properly!

6 Likes

I looked at the PDF but it’s in Japanese. What I could get from the gibberish on Google Translate is that they will continue to do similar projects, but whether that means with Taisho texts or with other old Chinese works I’m not sure.

2 Likes

I don’t know enough about what amount of work would be involved if we switched to SAT texts to make an informed comment. I’m just thinking, if they still don’t have all the characters, wouldn’t our texts be outdated again in a few years? Is there any way to keep them up to date?

I’m happy to start small. The Mg Pm has 45 gaiji characters (45 in total, not 45 distinct ones; often the same character comes up multiple times); the other Pms have between 30 and 61 gaiji.


Just to clarify: I am translating parallel rules together. So I’m working on all the Patimokkhas at the same time. That I started with the Mg was a coincidence. So we might want to consider using the SAT texts for all 5 Patimokkhas.

1 Like

Okay, well, we are in no hurry here, so we should find out some info before proceeding.

Do you know Ayya @Suvira? She speaks good Japanese, let’s see if she can help us out.

It’s possible, yes. Hopefully we can determine the extent of the ongoing process. Maybe they’re already complete, who knows? Or maybe they have a planned completion date. I mean, there are a lot of different glyphs, but it’s not an infinite number.

So far there doesn’t seem to be any way to ensure that the texts on SC are kept up to date. There are simply too many variables involved in converting the files for it to be automated. Maybe it is possible, but we haven’t found out how to do it yet.

Okay, good to know. Well, let’s figure out what’s up with SAT. Then we can ask CBETA where they stand. If they have updated, or are planning to update, with the new glyphs, it’s probably best to stick to using them as a source. Their markup is much friendlier.

4 Likes

Hi, I am Joey Hung; I just came back from CBETA’s monthly meeting. :slight_smile:

In fact, CBETA noticed this SAT news months ago, and has already been checking its gaiji database for a while. So far CBETA has added 1,388 new Unicode mappings proposed by SAT to its gaiji databases. This information will be used to generate the next edition of the CBETA corpus, which can probably be seen before the end of this year on our new CBETA online platform (http://cbetaonline.dila.edu.tw/en), or in the next version of the CBETA DVD, to be released in April 2018.

But I am afraid the newly added characters from Unicode 10.0 will still be treated as gaiji in the CBETA corpus. As you mentioned, users still need to install a proper font to display the content correctly. CBETA does not want to ask the majority of its users to install extra packages just to see the correct characters rather than “tofu”. In particular, some users really enjoy reading CBETA content on mobile devices, and we have not yet found any good solution to help them install a pre-specified font.

4 Likes

Dear Joey Hung,

Thank you so much for your quick update! This is very helpful!

2 Likes

Wow, thank you so much! That’s really helpful. Please accept my thanks and gratitude, both to you and to everyone at CBETA. You might be aware that I’m in Taiwan. I’ve been mostly keeping to myself to focus on work, but I would love to get a chance to meet some time.

I understand the difficulty of the issue; it’s such a complex corpus. It’s wonderful to hear that you’re making these upgrades, as we would definitely prefer to continue using CBETA as our primary source.

May I ask a few questions?

Will you eventually upgrade, or is this intended as a permanent situation?

Also I’m wondering if you can answer the question we had about the latest SAT additions: is that all? Are they still adding glyphs, or is this work finished?

Yes, mobile is so important, and we certainly don’t want to hit users with huge downloads. We’re currently getting around 40% of users on mobile, so a little lower than average, which is over 50% these days.

On SuttaCentral, we have developed quite a nice solution for this. Essentially the Chinese texts are scanned and a list of all glyphs is extracted. Then that list is used to create a subset of the Noto CJK font with only those glyphs, which is then served as usual with @font-face.
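For anyone curious, the pipeline looks roughly like this using the Python fontTools library. This is a sketch with assumed file paths, not our actual script:

    # pip install fonttools brotli  (brotli is needed for woff2 output)
    from pathlib import Path
    from fontTools import subset

    # 1. Scan the texts and collect every codepoint actually used.
    codepoints = set()
    for path in Path('texts/zh').glob('**/*.html'):
        codepoints.update(ord(ch) for ch in path.read_text(encoding='utf-8'))

    # 2. Subset the full CJK font down to just those codepoints,
    #    saving as woff2 for better compression.
    options = subset.Options(flavor='woff2')
    font = subset.load_font('NotoSansCJKtc-Regular.otf', options)
    subsetter = subset.Subsetter(options)
    subsetter.populate(unicodes=codepoints)
    subsetter.subset(font)
    subset.save_font(font, 'noto-cjk-subset.woff2', options)

The resulting subset file is what gets served with @font-face.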

I just checked: this currently results in a font size of only 2.3 MB. That’s still a lot, but given that the average webpage size today is over 3 MB, and that most CJK fonts are well over 20 MB, it’s quite acceptable, and we still manage to serve a page in around 1 second. And of course the font is cached for subsequent pages.

There are a number of factors working in our favor here:

  • Noto is very efficiently built and is quite a bit smaller per glyph than most CJK fonts.
  • We use woff2 font files, which have improved compression.
  • Obviously we have less Chinese text than CBETA, so there are fewer glyphs.

If we were to switch to Hanazono Mincho the size would increase, but we haven’t yet tested to what extent.

1 Like

hi, thank you for the reply.

The CBETA office is currently located near MRT Ximen Station (西門站); I am sure they will always welcome their users to come to the office and have a cup of tea. You can contact them through the website contact form.

Yes, CBETA will eventually take Unicode 10.0 as the default standard, but I can’t say when that will happen. However, I would like to thank you for bringing up this discussion. It will make CBETA reconsider its criteria for determining gaiji in the next edition of the CBETA corpus. Currently, CBETA is a bit conservative on this point. The judgment line falls at Unicode 2.0 in CBETA 2016. I hope in CBETA 2018 they can agree to raise this line at least to Unicode 6.0 (supported by the default Chinese font in Windows 8, covering the BMP through CJK Extension D).

I actually have no answer to this question. Maybe you should directly contact people from SAT. Kiyonori Nagasaki (http://www.dhii.jp/nagasaki/ ) would be the best person to answer this question.

This sounds like a very good idea; I will share this information with the CBETA staff.

Thanks so much, I’ll see if I can do that next time I’m in Taipei.

That’s excellent news.

Well, for what it’s worth, I certainly support moving forward with this. But once again, in a project of this size, I understand that they have to be careful with priorities.

Thank you, I will do so.

That’s great. If they want any technical support, just contact us. Our main developer @blake made the scripts for this, so he should be able to help.

1 Like

I managed to find an answer to this: Unicode 11, due to land mid-2018, plans to include IRG Working Set 2015 (AKA CJK Extension G), which includes several hundred new glyphs from SAT. So clearly it is an ongoing process.

http://babelstone.co.uk/CJK/IRG2015/index.html

I have also found the proposed Extension H, which has a list of some 300 glyphs. However, this does not include any from SAT, so perhaps with Extension G they have completed this work.

1 Like

Over 2800 character units encoded in Unicode.pdf (866.2 KB)
Here is the translated notice (in English) from the SAT website about the encoding for the remaining unencoded characters/gaiji: http://21dzk.l.u-tokyo.ac.jp/SAT/ucs_encoded.pdf

It’s not perfect, but you should be able to get the general idea from the document about where SAT is AT in relation to the Unicode problem. If there are any questions, or if you need general or technical translation etc., please let me know. @vimalanyani I hope Europe is treating you well!

3 Likes

Thanks so much, Ayya. So there are 6,000 gaiji altogether, and some 3,000+ have been or will be added to Unicode via SAT. So it seems there’s still a way to go.

3 Likes

Ayya Suvira, thanks for your help! :anjal:

3 Likes

It looks like it’ll be a few years before this is fully resolved, so let’s leave our texts as they are for now, and update them all when the time is right.

3 Likes