Fonts for Buddhism

Buddhist texts and typography on the Web have changed drastically over just a few years. Web fonts are common now, but web fonts for Buddhism are still fairly rare. It seems the world of Buddhism is still somewhat asleep, or at least somewhat disorganized (although SC is doing very well).

Long ago when the world was young, when IE 6 could still be found, and web fonts were something very new, I wanted a web font for my website that would have all the necessary glyphs for romanized Chinese, Pali, and Sanskrit. I made one based on Droid Serif, and used it for a number of years.

It was very useful at a time when web typography was mostly limited to the system fonts that shipped with Windows XP. I can still clearly remember that the only normal font in Windows XP that supported all necessary characters was “Microsoft Sans Serif,” for which at least two faces were faked in the ugliest ways.

I found, though, that Droid Serif was not as legible as fonts like Helvetica and Arial, so I switched over. This was at a time when Windows XP was dying, and the newer versions of Arial shipping with Windows had all the necessary glyphs.

I wrote a CSS font stack that would carefully and strategically “stack” Helvetica and Arial fonts from best to worst. It was alright, but the results were somewhat inconsistent and some small problems with spacing and diacritics continued to bother me.
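
To give a rough idea, the kind of stack I mean looked something like this (a simplified sketch, not my exact stylesheet):

```html
<!-- A simplified sketch of a Helvetica/Arial stack, ordered from best
     to worst; the names and ordering here are only illustrative. -->
<style>
  body {
    font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
  }
</style>
```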

So I finally decided to do something about it. Yesterday I made “Helvetii Dharma,” which is a fork of Nimbus Sans L, as used in the Ghostscript project, and donated by URW++ (although they used Adobe font sources).

In contrast to my previous font attempt, this one separates every stage of development, and automates the build process with scripts to generate everything, do auto-hinting, and convert to different web font formats.

It won’t win any awards for innovation (Helvetica is an old and widely used font with a long history), but it’s general and useful. And no, this doesn’t have anything in particular to do with SuttaCentral, but I tricked you into reading this far! :smile_cat:

Just kidding. Check it out here if you are curious. I still need to tweak the diacritic placement a bit, and I will do that gradually in the coming days and weeks.

You can see it in use as a web font here:

http://lapislazulitexts.com/articles/H-1957_sans

I was playing around with some other fonts as well such as Roboto, which already has the necessary glyphs, but I found that the extra spacing hurt legibility a bit, and the letter forms were not as elegant as the older Helvetica.

Nice, thanks for the heads up, @llt.

Did you consider using Noto? We’re implementing it for all the non-Latin scripts, including CJK; you can test it on staging here.

Very nice! I feel like I’m unlocking secret areas and passageways in SuttaCentral. It looks especially good for Chinese bold titles. Previously in Chrome, the titles for Chinese texts were being rendered with GNU Unifont (a fake bold of a bitmap font), so it looked like they were displaying on a Commodore 64 or something (to be fair, it’s probably Chrome’s fault). The new text looks great.

Just recently I was looking a bit at what fonts are available now. As I wrote above, the situation now is so much different than it was even a few years ago. It looks like the Noto fonts are based on the earlier Droid fonts, but they are being expanded to include all of Unicode. This is a huge project and really impressive. I even saw something like “Noto Sans Kharosthi,” and checked out the glyphs in FontForge. For a project like SC, which offers texts in many different languages, it’s a perfect fit.

Some years ago I wanted to try making an expanded version of Droid Sans. At that time, though, the Italic and Bold-Italic faces were “faked” by Android, and so it just didn’t seem to be a great fit. Now I see all the faces in Noto Sans are real and it’s a wonderfully complete font.

Just out of curiosity, I tried changing the CSS for the local version of my site, and I found that Noto Sans works very nicely and looks very legible. The design of Noto Sans also looks a bit less extreme than Source Sans Pro, which probably doesn’t work too well for large titles and the like. The spacing in Noto Sans also seems to work better than that in Roboto, which I find has kind of a “dazzling” effect in running text (maybe a bit too much letter spacing for that style of font?).

Noto Sans also has quite a different style compared to Helvetica. With Helvetica, there is some sense of modernism, efficiency, sleekness, uprightness, etc. Noto Sans seems less assuming, more relaxed, more open, more airy.

Well, those are my thoughts on the matter. A good sans-serif font has some nice qualities. One thing I have found is that with serif fonts, we tend to look at the font and its subtle details and stylings. With sans-serif fonts, though, we tend to look more at the words. The font becomes sort of transparent and just functions, and there is a type of beauty in that too.

Yes, that’s right. The CJK versions, and other non-Latin scripts, are new, but designed to harmonize. It’s the CJK fonts that are the real gem in Noto; they’re just so nice. Blake has been working on a custom subsetting method for SC, which you can see on staging. On startup, the system lists all the glyphs used in the CJK texts, extracts just those from the relevant Noto CJK files, and creates a woff2 subset. The result is that we can serve even large pages of traditional CJK text via @font-face, something unheard of till now. In addition, it guarantees that the correct regional variants for Chinese, Japanese, and Korean are used.
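
To illustrate the idea (this is just a sketch with made-up family and file names, not the actual declarations on staging), the generated subset ends up being served through an ordinary @font-face rule, roughly like so:

```html
<!-- Sketch only: hypothetical family name and file path. -->
<style>
  @font-face {
    font-family: "Noto Sans CJK TC Subset";
    src: url("/fonts/compiled/noto-sans-tc-subset.woff2") format("woff2");
    font-weight: normal;
  }
  /* Apply the subsetted font to the Chinese text on the page. */
  .lzh {
    font-family: "Noto Sans CJK TC Subset", sans-serif;
  }
</style>
```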

Yes, amazing, right! They have Brahmi, too.

Source Sans Pro and Roboto are both designed as UI fonts, so they have good legibility in small sizes, but as you say, not so ideal for body text. Noto is more humanist, less geometric, and I agree, it sits better in body text.

Wow, is it specific for each page? That makes sense, given the huge size of the total number of CJK characters. But you’re right, that is a very custom sort of thing to do.

/fonts/compiled/noto-sans-tc_bold_zh_lzh_79d5953da549.woff2

Interesting…

You know, when I was looking at the “staging” area before, I noticed that a few glyphs seemed unusual, and then realized that they are the Taiwan variants of those characters. I downloaded Noto Sans CJK fonts and made a test page, trying them with different language codes (zh, zh-cn, zh-hk, zh-tw, zh-trad, zh-simp, lzh), and I found that “lzh” automatically picks the Taiwan forms of those characters.

I was quite sure that the mainland variants are closer to the old forms used in the Qing dynasty and before, like in the Kangxi Dictionary. After looking at a few example characters that differ, though, and comparing them with scans of the pages in Kangxi, I think the Taiwan forms are also reasonably authentic. In some cases the variants used in Taiwan are more traditional, and in some cases less.

Part of the problem is that there are many variants of Chinese characters, and specifying “lzh” does not necessarily tell us anything about which variants to use. Nevertheless, the font or browser tries to guess when it sees “lzh”. For example, a book in traditional Chinese may be published with different character variants between mainland China, Taiwan, and Japan.

An example of how this is handled by companies like Microsoft is that they commission two different fonts. Microsoft JhengHei has both traditional and simplified characters, and uses Taiwan character variants. Microsoft YaHei has both traditional and simplified characters, and uses Mainland character variants.
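
In CSS terms, a site respecting that split might select fonts by language, along these lines (just a sketch; the fallbacks are illustrative):

```html
<!-- Sketch: Taiwan-variant font for zh-TW text, mainland-variant font
     for zh-CN text. -->
<style>
  :lang(zh-TW) { font-family: "Microsoft JhengHei", sans-serif; }
  :lang(zh-CN) { font-family: "Microsoft YaHei", sans-serif; }
</style>
```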

For example, from the article linked to below, we can see the two variants of these traditional characters. Despite the text alongside, none of the characters are simplified. The first two are how they are printed in Taiwan. The second two are how they are printed in mainland China.

[image: the same traditional characters in their Taiwan and mainland printed forms]

It’s kind of an ugly problem. This article goes into it in more detail:

https://blog.zydeo.net/chinese-typefaces-simplified-and-traditional/

No, we thought of that, but then you have to download a new font for each page. What we do is make a single font that subsets all the Chinese (or Japanese or Korean) glyphs used on SC. The user gets that on their first visit to a page that requires that language. So that’s a bit of a hit, about 2.2MB on initial download. (Which is, however, in fact about average for page sizes on the web today …) But it’s cached, so subsequent visits are very fast.

Well, this is complicated, isn’t it? I am still not entirely clear on the differences and how they’re handled in Noto and Unicode more generally, but for now, do you have a recommendation? Obviously it is partly to do with one’s background. But I assume any reader sophisticated enough to read ancient Buddhist texts in Chinese will be familiar with the fact of regional variants. I’m wondering which seems more “native”, more appropriate for canonical texts in general.

I’m not sure that the use of the language specifier is relevant here as such on the site (as opposed to local testing). It just happens to be how we identify the necessary font. Our traditional font is generated by extracting the relevant glyphs from the Noto Sans CJK TC font. As such we should exclude all unneeded glyphs, including regional variants. So when the font is served via @font-face, there should be no question of selecting variants. Of course, this might not be working correctly at the moment; possibly the subsetting doesn’t take this into account and we do include the variants.

Mostly the Noto/Source documentation treats simplified = mainland and traditional = Taiwan (with a nod to Hong Kong). These particular glyphs are not simplified, but they’re part of the simplified font. I’m not sure that it’s possible to specify “traditional characters in mainland China form”.

Ah, very nice! That’s a numbers game: the total number of characters actually used in the texts is really not that large. :slight_smile:

Okay, I was playing around with Noto Sans a bit, and I found that glyphs are rendered differently depending on which font face (TC or SC) is set and which language tags are used.

The following is with “Noto Sans CJK TC”, for different language tags. The form with the broken radical at the top is the form used in Taiwan or Hong Kong.

When Noto Sans SC is specified instead, the results are quite different:

Apparently even when specifying only the TC or SC font, and not the other at all, the language tags will still influence which forms are selected. For example, in CSS I have “Noto Sans CJK SC”, followed by “Unifont,” and it still picks out the Taiwan variant glyphs for “zh-tw”.
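
A test page along these lines reproduces the effect (a sketch of the sort of thing I tried, not the exact page):

```html
<!-- Sketch: the same code points tagged with different languages.
     Even with "Noto Sans CJK SC" applied to everything, the zh-tw and
     zh-hk lines still come out in the Taiwan/Hong Kong glyph forms. -->
<style>
  p { font-family: "Noto Sans CJK SC", "Unifont", sans-serif; }
</style>
<p lang="zh-cn">茶 無</p>
<p lang="zh-tw">茶 無</p>
<p lang="zh-hk">茶 無</p>
<p lang="lzh">茶 無</p>
```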

Basically, using TC will cause “lzh” text to have Taiwan variants, while using SC will cause “lzh” text to have Mainland variants (although note all are traditional characters using the same Unicode points).

These sorts of things are potentially sensitive issues, too, because they are regional variants, and there are always some politics about Chinese characters in East Asia, and to what extent they should be unified (with “unification” usually revolving around the Chinese language, and mainland China).

Neither variant is more “correct” than the other. After comparing a few characters and looking at the many variants used between countries, and in the Kangxi Dictionary, it’s apparent that there was always a bit of variation, and that no modern forms are exactly the same as the Kangxi forms.

In terms of numbers and demographics, the mainland has some 40 times the number of people in Taiwan and Hong Kong combined. This inevitably means that most readers around the world are more familiar with the SC forms, and most Chinese books will also use SC forms. Additionally, Chinese (“zh”) text on the Web seems to render in SC forms by default, including everything on CBETA. So my tendency is to use the SC forms (and also because they look more familiar to my eyes).

Nevertheless, Taiwan and Hong Kong have long histories of using the traditional characters, and still use them to this day for everything. Their forms are no less correct, and so it mostly comes down to personal preference.

Finally, here are a few dictionary entries that show different forms. In the sequence, it shows (1) mainland China, (2) Taiwan, (3) Hong Kong, (4) Japan, (5) Korea, and (6) Kangxi Dictionary.

#茶

http://www.zdic.net/z/22/zy/8336.htm

#無

http://www.zdic.net/z/1d/zy/7121.htm

In general, it seems like mainland China, Japan, and Korea often use similar forms, while Hong Kong and Taiwan often agree about slightly different ones.

Okay, thanks. It is complicated! But it does seem as if using the mainland forms would make the most sense.

I just checked 茶 on a bunch of sites (Wikipedia, Google, Facebook, WeChat, Baidu, DuckDuckGo), and on my machine at least most of them have the unbroken mainland form; the exceptions are google.com.tw, which makes sense, and Facebook for some reason.

I still don’t understand how this can work with the simplified forms. My understanding is that in some cases the traditional forms were radically simplified or combined, so how can we simply present “simplified” glyphs and have them appear as traditional? If the simplified forms are assigned different Unicode points, it’s not an issue, but my understanding was that this was done for glyphs with a major difference, whereas for minor stylistic changes the same Unicode point is used. So doesn’t this look, well, “simplified”? Or is this not an issue?

As far as I know, the simplified characters used in the mainland were all given different Unicode points, so this shouldn’t be an issue. The simplifications codified existing shorthand forms and made them standard. This was mostly done to aid literacy. There are around 2500 simplified characters, but they are assigned their own Unicode points.

(As an anecdote, Mao Zedong wanted to do away with Chinese characters completely and adopt the Latin alphabet. The simplification and adoption of Pinyin for transliteration was part of that push, but they later settled on more conservative goals.)

From the example characters above, 茶 stayed the same, but the character 無 was simplified into 无. As a result, modern books printed in simplified Chinese use a combination of simplified and traditional characters.

For example, 無知 simplified would be written as 无知. The first character was changed drastically, while the second character stayed the same. Using traditional characters, though, it would always be 無知.

The regional variants as they relate to fonts would be the same Unicode points, but different font representations of those points, like slightly different ways to write the character 無.
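
To make the distinction concrete, here is a small sketch using the example characters from above:

```html
<!-- Simplification changes the code point: 無 is U+7121, 无 is U+65E0.
     Regional variation does not: 茶 is U+8336 in both of the last two
     lines, and only the declared language changes which glyph form a
     font like Noto Sans CJK chooses. -->
<p lang="zh-Hant">無知</p> <!-- traditional: U+7121 U+77E5 -->
<p lang="zh-Hans">无知</p> <!-- simplified:  U+65E0 U+77E5 -->
<p lang="zh-TW">茶</p>     <!-- U+8336, Taiwan glyph form -->
<p lang="zh-CN">茶</p>     <!-- U+8336, mainland glyph form -->
```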

Huh, okay, well I’ll check, maybe we will end up using the SC version after all. Thanks so much for the help.

Sure, any time.

And I understand that it was Stalin who persuaded him to keep the Chinese characters.

Okay, we’re now using CJK SC instead of TC on staging. The TC font is still available if you want to toggle between them in the inspector.

Btw, since it came up: Taiwanese users are actually a very significant minority of SuttaCentral users, and perhaps this should influence our choice of font. For example, the Google Analytics language stats show:

  1. en-us: 55.6%
  2. en-gb: 11.6%
  3. zh-tw: 3.8%

So zh-tw is the second most reported language ISO code after the Englishes.

The other zh languages to pop up are zh-cn at 0.9% and zh-hk at 0.14%.

In addition, geographically, Taiwan and Hong Kong are the most prominent locations using a CJK script. I mention this because browser language is really unreliable, but in this case the location and language statistics are in close agreement. Mainland users are poorly represented, with only about a fifth as many as from the islands.

Interesting stats. In many ways the internet in China is a separate beast, perhaps the Tw/HK communities are more international. And of course, Buddhism has a more solid grounding there.

Huh, interesting. On my site, China is around 6%, Singapore around 4%, and HK and Taiwan each below 1%.

Internet in mainland China is indeed very different. Google, Facebook, Twitter, and most western blogging platforms are blocked. Social media widgets and Google fonts will also cause websites to hang as the connections time out.

I didn’t know this, good thing we don’t have them. CBETA has social media widgets, though, so I wonder how they fare?

What about Google analytics? Currently we use this, though I’d like to get rid of it. Does this cause any problem in China?

I noticed in the Baidu results that SC tends not to feature very highly. You get a lot of results for an SC mirror at dhammatalks.net. Perhaps this is a reason for the relatively low usage in mainland China.

The site still renders after about 30 seconds, so not too unreasonable.

What about Google analytics? Currently we use this, though I’d like to get rid of it. Does this cause any problem in China?

Just looking briefly, it appears that the site functions fairly normally. The jQuery stuff hosted off-site times out, and maybe a few other occasional things, but the site still works.

Blimey. 30 seconds is reasonable? I suppose you can meditate while waiting for a page to load. Is this common?

I just tested it at dotcom-monitor which gives results from multiple countries. China is way out there, 40-50 seconds to load. On inspection it seems it is blocking resources from google domains, even benign things like jQuery or the font loader.

@blake, this seems like a serious issue that will majorly impact our users in mainland China. Can you check it out? It seems we should probably avoid serving anything from a google domain. I will expand on this in another thread.

Blimey. 30 seconds is reasonable? I suppose you can meditate while waiting for a page to load. Is this common?

It’s common for foreign websites. The quality of Internet is generally pretty low, and varies randomly as well. To get around this, most foreigners (and others who want to use Google, Facebook, Gmail, Twitter, etc.), just use a VPN, which solves the problem pretty well.

this seems like a serious issue that will majorly impact our users in mainland China. Can you check it out? It seems we should probably avoid serving anything from a google domain.

It does affect users from mainland China, but realistically, websites shouldn’t have to design around government censorship. China could have better Internet, but they choose to hobble it for political and economic reasons.

With my site, I just host everything on one server, and it is safe as long as the website itself is not blocked. The two sites can’t really be compared easily, though, because my site is like some Web 1.0 sort of thing. Even then, though, sometimes the site randomly loads slowly.

Before you do, perhaps you or someone with access could create a brief overview of the current usage of SC. It would be interesting to know how many people in how many countries read the EBTs here, perhaps there’s enough data to point out some regional or worldwide trends, top suttas read etc.

Actually, it would be nice if such basic stats were generated live and displayed to all users all the time… there are probably some free tools around to achieve that without much effort and without sacrificing anyone’s privacy. Or maybe something like this is already available somewhere and I just don’t know about it.