Fonts for Buddhism

Yes, amazing, right! They have Brahmi, too.

Source Sans Pro and Roboto are both designed as UI fonts, so they have good legibility in small sizes, but as you say, not so ideal for body text. Noto is more humanist, less geometric, and I agree, it sits better in body text.

Yes, that’s right. The CJK versions, and other non-Latin scripts, are new, but designed to harmonize. It’s the CJK fonts that are the real gem in Noto; they’re just so nice. Blake has been working on a custom subsetting method for SC, which you can see on staging. On startup, the system lists all the glyphs used in the CJK texts, extracts just those from the relevant Noto CJK files, and creates a woff2 subset. The result is that we can serve even large pages of traditional CJK text via @font-face, something unheard of till now. In addition, it guarantees that the correct regional variants for Chinese, Japanese, and Korean are used.
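
To make that concrete, here is a rough sketch of the extraction step in Python using fontTools. It isn’t Blake’s actual build code; the file names and the `used_chars` input are just placeholders, but it shows the general shape of “keep only the glyphs we need and emit a woff2”:

```python
# Sketch only, not SuttaCentral's real build step: subset a Noto CJK font down to the
# characters that actually occur in the site's texts and write a woff2 file.
# Requires the fonttools and brotli packages (brotli is needed for woff2 output).
from fontTools import subset

def build_cjk_subset(source_font_path, used_chars, out_path):
    options = subset.Options()
    options.flavor = "woff2"                               # emit woff2 rather than otf/ttf
    font = subset.load_font(source_font_path, options)
    subsetter = subset.Subsetter(options)
    subsetter.populate(text="".join(sorted(used_chars)))   # keep only these characters
    subsetter.subset(font)
    subset.save_font(font, out_path, options)

# Placeholder input; in practice the set comes from scanning the corpus on startup.
build_cjk_subset("NotoSansCJKtc-Regular.otf",
                 set("如是我聞一時佛在"),
                 "noto-sans-tc_subset.woff2")
```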

Wow, is it specific for each page? That makes sense, given the huge size of the total number of CJK characters. But you’re right, that is a very custom sort of thing to do.

/fonts/compiled/noto-sans-tc_bold_zh_lzh_79d5953da549.woff2

Interesting…

You know, when I was looking at the “staging” area before, I noticed that a few glyphs seemed unusual, and then realized that they are the Taiwan variants of those characters. I downloaded Noto Sans CJK fonts and made a test page, trying them with different language codes (zh, zh-cn, zh-hk, zh-tw, zh-trad, zh-simp, lzh), and I found that “lzh” automatically picks the Taiwan forms of those characters.

I was quite sure that the mainland variants are closer to the old forms used in the Qing dynasty and before, like in the Kangxi Dictionary. After looking at a few example characters that differ, though, and comparing them with scans of the pages in Kangxi, I think the Taiwan forms are also reasonably authentic. In some cases the variants used in Taiwan are more traditional, and in some cases less.

Part of the problem is that there are many variants of Chinese characters, and specifying “lzh” does not necessarily tell us anything about which variants to use. Nevertheless, the font or browser tries to guess when it sees “lzh”. For example, a book in traditional Chinese may be published with different character variants between mainland China, Taiwan, and Japan.

An example of how this is handled by companies like Microsoft is that they commission two different fonts. Microsoft JhengHei has both traditional and simplified characters, and uses Taiwan character variants. Microsoft YaHei has both traditional and simplified characters, and uses Mainland character variants.

For example, from the article linked to below, we can see the two variants of these traditional characters. Despite the text alongside, none of the characters are simplified. The first two are how they are printed in Taiwan. The second two are how they are printed in mainland China.

[image: the same traditional characters, printed first in their Taiwan forms, then in their mainland forms]

It’s kind of an ugly problem. This article goes into it in more detail:

https://blog.zydeo.net/chinese-typefaces-simplified-and-traditional/

No, we thought of that, but then you have to download a new font for each page. What we do is make a single font that subsets all the Chinese (or Japanese or Korean) glyphs used on SC. The user gets that on their first visit to a page that requires that language. So that’s a bit of a hit, about 2.2MB on the initial download (which is, in fact, about average for total page size on the web today…). But it’s cached, so subsequent visits are very fast.
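
The “glyphs used on SC” part is just a scan of the corpus. Something along these lines (a sketch with a made-up directory layout, not the real code) yields the character set for one language, which is then subsetted once and cached:

```python
# Sketch of the glyph-collection step: walk one language's corpus and gather every
# CJK ideograph that appears. The directory name and file extension are assumptions.
from pathlib import Path

def collect_cjk_chars(corpus_dir):
    used = set()
    for path in Path(corpus_dir).rglob("*.html"):
        for ch in path.read_text(encoding="utf-8"):
            cp = ord(ch)
            # CJK Unified Ideographs (U+4E00-U+9FFF) plus Extension A (U+3400-U+4DBF)
            if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
                used.add(ch)
    return used

chars = collect_cjk_chars("texts/lzh")   # hypothetical path
print(f"{len(chars)} distinct CJK characters used on the site")
```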

Well, this is complicated, isn’t it? I am still not entirely clear on the differences and how they’re handled in Noto and Unicode more generally, but for now, do you have a recommendation? Obviously it is partly to do with one’s background. But I assume any reader sophisticated enough to read ancient Buddhist texts in Chinese will be familiar with the fact of regional variants. I’m wondering which seems more “native”, more appropriate for canonical texts in general.

I’m not sure that the use of the language specifier is relevant here as such on the site (as opposed to local testing). It just happens to be how we identify the necessary font. Our traditional font is generated by extracting the relevant glyphs from the Noto Sans CJK TC font. As such we should exclude all unneeded glyphs, including regional variants. So when the font is served via @font-face, there should be no question of selecting variants. Of course, this might not be working correctly at the moment; possibly the subsetting doesn’t take this into account and we do include the variants.

Mostly the Noto/Source documentation treats simplified = mainland and traditional = Taiwan (with a nod to Hong Kong). These particular glyphs are not simplified, but they’re part of the simplified font. I’m not sure that it’s possible to specify “traditional characters in mainland China form”.

Ah, very nice! It’s a numbers game: the total number of characters actually used in the texts is really not that many. :slight_smile:

Okay, I was playing around with Noto Sans a bit, and I found that depending on the font face chosen (TC or SC) and the language tags set, glyphs are rendered differently.

The following is with “Noto Sans CJK TC”, for different language tags. The form with the broken radical at the top is the form used in Taiwan or Hong Kong.

When Noto Sans SC is specified instead, the results are quite different:

Apparently, even when specifying the TC or SC font, and not the other at all, the language tags will still influence which forms are selected. For example, in CSS I have “Noto Sans CJK SC” followed by “Unifont”, and it still picks the Taiwan variant glyphs for “zh-tw”.

Basically, using TC will cause “lzh” text to have Taiwan variants, while using SC will cause “lzh” text to have Mainland variants (although note all are traditional characters using the same Unicode points).
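
If anyone wants to reproduce the test, a small script like this (my own sketch, using just the two example characters from this thread) will produce an HTML page that renders the same code points under different lang tags:

```python
# Sketch of a lang-tag test page generator; nothing here is site code.
TEST_CHARS = "茶無"                                    # the same code points in every row
LANG_TAGS = ["zh", "zh-CN", "zh-TW", "zh-HK", "lzh"]
FONT_STACK = '"Noto Sans CJK SC", Unifont'             # swap in "Noto Sans CJK TC" to compare

rows = "\n".join(
    f'<p lang="{tag}" style=\'font-family: {FONT_STACK};\'>{tag}: {TEST_CHARS}</p>'
    for tag in LANG_TAGS
)
with open("lang-variant-test.html", "w", encoding="utf-8") as f:
    f.write("<!doctype html>\n<meta charset='utf-8'>\n" + rows + "\n")
# Open the file in a browser with the Noto CJK fonts installed and compare the glyph shapes.
```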

These sorts of things are potentially sensitive issues, because they are regional variants, and there are always some politics about Chinese characters in East Asia and the extent to which they should be unified (with “unification” usually revolving around the Chinese language, and mainland China).

Neither variant is more “correct” than the other. After comparing a few characters and looking at the many variants used between countries, and in the Kangxi Dictionary, it’s apparent that there was always a bit of variation, and that no modern forms are exactly the same as the Kangxi forms.

In terms of numbers and demographics, the mainland has some 40 times the number of people in Taiwan and Hong Kong combined. This inevitably means that most readers around the world are more familiar with the SC forms, and most Chinese books will also use SC forms. Additionally, Chinese (“zh”) text on the Web seems to render in SC forms by default, including everything on CBETA. So my tendency is to use the SC forms (and also because they look more familiar to my eyes).

Nevertheless, Taiwan and Hong Kong have long histories of using the traditional characters, and still use them to this day for everything. Their forms are no less correct, and so it mostly comes down to personal preference.

Finally, here are a few dictionary entries that show different forms. In the sequence, it shows (1) mainland China, (2) Taiwan, (3) Hong Kong, (4) Japan, (5) Korea, and (6) Kangxi Dictionary.

#茶

http://www.zdic.net/z/22/zy/8336.htm

#無

http://www.zdic.net/z/1d/zy/7121.htm

In general, it seems like mainland China, Japan, and Korea often use similar forms, while Hong Kong and Taiwan often agree about slightly different ones.

Okay, thanks. It is complicated! But it does seem as if using the mainland forms would make the most sense.

I just checked 茶 in a bunch of sites—Wikipedia, Google, Facebook, WeChat, Baidu, DuckDuckGo—and on my machine at least most of them have the unbroken mainland form; the exceptions are google.com.tw, which makes sense, and Facebook, for some reason.

I still don’t understand how this can work with the simplified forms. My understanding is that in some cases the traditional forms were radically simplified or combined, so how can we simply present “simplified” glyphs and have them appear as traditional? If the simplified forms are assigned different Unicode points, it’s not an issue, but my understanding was that this was done for glyphs with a major difference, whereas for minor stylistic changes the same Unicode point is used. So doesn’t this look, well, “simplified”? Or is this not an issue?

As far as I know, the simplified characters used in the mainland were all given different Unicode points, so this shouldn’t be an issue. The simplifications codified existing shorthand forms and made them standard. This was mostly done to aid literacy. There are around 2500 simplified characters, but they are assigned their own Unicode points.

(As an anecdote, Mao Zedong wanted to do away with Chinese characters completely and adopt the Latin alphabet. The simplification and adoption of Pinyin for transliteration was part of that push, but they later settled on more conservative goals.)

From the example characters above, 茶 stayed the same, but the character 無 was simplified into 无. As a result, modern books printed in simplified Chinese use a combination of simplified and traditional characters.

For example, 無知 simplified would be written as 无知. The first character was changed drastically, while the second character stayed the same. Using traditional characters, though, it would always be 無知.

The regional variants as they relate to fonts would be the same Unicode points, but different font representations of those points, like slightly different ways to write the character 無.
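
A quick way to see the distinction is to print the code points: the simplification pairs live at different code points, while the regional forms of a single traditional character share one code point and differ only at the font level. Just as a sanity check, using the characters above:

```python
# Simplified/traditional pairs occupy separate code points; regional variants do not.
for ch in "無无茶":
    print(ch, f"U+{ord(ch):04X}")
# 無 U+7121  traditional form
# 无 U+65E0  its simplified counterpart: a different code point entirely
# 茶 U+8336  one code point; the Taiwan vs. mainland difference is purely a font-level choice
```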

Huh, okay, well I’ll check, maybe we will end up using the SC version after all. Thanks so much for the help.

Sure, any time.

And I understand that it was Stalin who persuaded him to keep the Chinese characters.

Okay we’re now using CJK SC instead of TC on staging. The TC font is still available if you want to toggle between them in inspector.

Btw, since it came up: Taiwanese users are actually a very significant minority of SuttaCentral users, and perhaps this should influence our choice of font. For example, the Google Analytics language stats show:

  1. en-us: 55.6%
  2. en-gb: 11.6%
  3. zh-tw: 3.8%

So zh-tw is the most reported language ISO code after the English variants.

The other zh languages to pop up are zh-cn at 0.9% and zh-hk at 0.14%.

In addition, geographically, Taiwan and Hong Kong are the most prominent locations using a CJK script. I mention this because browser language is really unreliable, but in this case the location and language statistics are in close agreement. Mainland users are poorly represented, with only about a fifth as many as from the islands.

Interesting stats. In many ways the internet in China is a separate beast; perhaps the TW/HK communities are more international. And of course, Buddhism has a more solid grounding there.

Huh, interesting. On my site, China is around 6%, Singapore around 4%, and HK and Taiwan each below 1%.

Internet in mainland China is indeed very different. Google, Facebook, Twitter, and most western blogging platforms are blocked. Social media widgets and Google fonts will also cause websites to hang as the connections time out.

I didn’t know this; good thing we don’t have them. CBETA has social media widgets, though, so I wonder how they fare?

What about Google analytics? Currently we use this, though I’d like to get rid of it. Does this cause any problem in China?

I noticed in the Baidu results that SC tends not to feature very highly. You get a lot of results for an SC mirror at dhammatalks.net. Perhaps this is one reason for the relatively low usage in mainland China.

The site still renders after about 30 seconds, so not too unreasonable.

What about Google analytics? Currently we use this, though I’d like to get rid of it. Does this cause any problem in China?

Just looking briefly, it appears that the site functions fairly normally. The jQuery stuff hosted off-site times out, and maybe a few other occasional things, but the site still works.

Blimey. 30 seconds is reasonable? I suppose you can meditate while waiting for a page to load. Is this common?

I just tested it at dotcom-monitor, which gives results from multiple countries. China is way out there, 40-50 seconds to load. On inspection, it seems resources from Google domains are being blocked, even benign things like jQuery or the font loader.

@blake, this seems like a serious issue that will majorly impact our users in mainland China. Can you check it out? It seems we should probably avoid serving anything from a google domain. I will expand on this in another thread.

Blimey. 30 seconds is reasonable? I suppose you can meditate while waiting for a page to load. Is this common?

It’s common for foreign websites. The quality of Internet is generally pretty low, and varies randomly as well. To get around this, most foreigners (and others who want to use Google, Facebook, Gmail, Twitter, etc.), just use a VPN, which solves the problem pretty well.

this seems like a serious issue that will majorly impact our users in mainland China. Can you check it out? It seems we should probably avoid serving anything from a google domain.

It does affect users from mainland China, but realistically, websites shouldn’t have to design around government censorship. China could have better Internet, but they choose to hobble it for political and economic reasons.

With my site, I just host everything on one server, and it is safe as long as the website itself is not blocked. The two sites can’t really be compared easily, though, because my site is like some Web 1.0 sort of thing. Even then, though, sometimes the site randomly loads slowly.

Before you do, perhaps you or someone with access could create a brief overview of the current usage of SC. It would be interesting to know how many people in how many countries read the EBTs here, perhaps there’s enough data to point out some regional or worldwide trends, top suttas read etc.

Actually, it would be nice if such basic stats were generated live and displayed to all users all the time… there are probably some free tools around to achieve that without much effort and without sacrificing anyone’s privacy. Or maybe something like this is already available somewhere and I just don’t know about it.

Good idea. If we were to roll our own analytics using Piwik, we could do this easily. Or at least, we could do it; I don’t know how easy it would be! Actually, Piwik supports this natively, so it’s quite trivial.

I think we’re a bit busy to make such a report, but if you’re interested, maybe you’d like to do it yourself? Let me know and I can grant you access to our analytics account.

At the moment, I’m pretty busy myself so I probably don’t have time to go into content…but I could whip something together on visitors. You can grant access to the email I use in Discourse.

Done.