Add texts in Japanese and Afrikaans

sujato · April 13, 2015, 7:25am

I have just added a number of the Khuddaka texts (Dhammapada, Udana, Khuddakapatha, Theragatha, Therigatha) in Japanese. Curiously enough, we cannot find any digital text of Japanese translations of the main nikayas, even in scanned form. If you know of anything, please let us know.

In addition, @vimala has added the Sutta Nipata in Afrikaans. This brings the total to thirty languages represented on SuttaCentral.

JacquelineR · April 17, 2015, 3:47pm

I don’t know if it’s helpful, but http://komyojikyozo.web.fc2.com has material from the Majjhima and Digha nikayas. The formatting is piecemeal but it might be possible to email the site admin.

RE digital text of Japanese translations of the other nikayas: the National Diet Library of Japan had made portions of the Nanden Daizokyo Japanese tipitaka available online, but the digitised text was withdrawn after a complaint from the publisher, Daizo Shuppan, as they are still selling hard copies by print-on-demand. The digital edition is now only available for in-house use. To the best of my understanding, the copyright on the Nanden Daizokyo expired in 1995 at latest and the explanation from the NDL for the removal of digital access has been somewhat evasive.

sujato · April 17, 2015, 8:51pm

Thank you so much, this is wonderful.

As for http://komyojikyozo.web.fc2.com , I have no Japanese; would you be able to point me to pages that actually host sutta texts?

This is pretty shocking if it is true. May I ask, where did you hear this from? Do you know of anyone at the NDL we might contact for information? And how might we determine the actual copyright expiration date?

I might add that it is not unheard of for certain publishers, even within the Buddhist sphere, to not only insist on copyright, but to claim rights that go well beyond the legal requirements, and bully people who use texts in accord with the licences.

JacquelineR · April 19, 2015, 4:43pm

I just did a quick web search for “digital” +“nanden daizokyo”. The 2013 statement from the National Diet Library about this is here http://www.ndl.go.jp/jp/news/fy2013/report140107.html. The digitisation dispute ended up going to the review of an expert panel, which took six months to complete. The full report of the expert panel is here http://www.ndl.go.jp/jp/news/fy2013/report140107.pdf The TLDR version is that the panel found that Daizo Shuppan held no legal right to the text, but that the library should proceed in accordance with internal guidelines, which state that the digitisation project should take business considerations into account (which apparently means making greater concessions to publishers than what is required by Japanese copyright law). The report lists the names of the relevant library representatives: a good question might be when is this decision coming up for review?

The fact that the copyright had expired on the Nanden Daizokyo wasn’t in dispute: according to article 57 of the current Japanese copyright law, the copyright expires 50 years after the death of the copyright holder. If the copyright holder was general editor Takakusu Junjiro (not Mizuno Kogen) the copyright ended in 1995. I am no expert in this area but I think that would mean that there would be no legal barrier if someone else wanted to create a digital edition.

By way of comparison, the Taisho Tripitaka (also published by Daizo Shuppan) is already available digitally in Japanese from both the SAT database and the NDL digital library project. This also went to the NDL review panel but it was found that the digitisation wouldn’t affect Daizo Shuppan’s profits as the SAT database was already available. This is why there is the unusual situation that the Taisho Tripitaka is readily available online in Japanese but the Pali Tipitaka is not.

As for the suttas at http://komyojikyozo.web.fc2.com, the links under “パーリ三蔵の現代語訳” are nikaya+vagga, the next page is sutta, and the page after that is subsection. There are 30 suttas in total.

sujato · April 19, 2015, 8:38pm

Wow, thanks so much for this swift and detailed information. It is bad enough that any Buddhist texts are placed under copyright, to be used for the profit of corporations, but it is terrible that this goes even beyond the scope of copyright. The purpose of copyright law is to protect the interests of the creators of original works. Here, it would seem, neither the law nor the Dhamma is being respected, only the profits of publishers.

Just to clarify, the text has been digitized, right? And it is still available internally? (so the Google translate version of the notice seems to say: については、当分の間、インターネット提供は行わず、館内限定の提供を行う)

The text says that they are working hard for a resolution (解決に向けて、引き続き努力を重ねてまいりたいと存じます。) I presume this means that no-one is doing anything.

We could try contacting the NDL and asking about it: would you be willing to be a go-between for SuttaCentral? We could also try contacting relevant Buddhist organizations in Japan and asking for their support. Unlikely that this will get any traction, we are outsiders and don’t know how these things work.

Perhaps it would be effective, however, if we were to propose our own digitization project. We are already digitizing the texts in Hindi and Sinhala, with Bangla getting started, and are in contact with a similar project for Khmer.

We could contact the NDL and various other Buddhist organizations, express our regret at the situation, and ask for their support in undertaking our own digitization, to be made freely available for everyone. It’s possible this might shake them up enough to revise their decision and release their digital text. If not, we can simply do the digitizing ourselves and be done with it.

sujato · April 23, 2015, 10:07am

This issue is also discussed here:

Henry · January 28, 2020, 6:47am

I have begun to scan the 70 vols of the 南傳大藏經.

Since just presenting the plain scanned files as they do on archive.org seems rather pointless, I’d like to create pdf documents with backlinks (in the form of a concordance) to the other materials on suttacentral using LaTeX.

There are a few technical matters I need help with:

1. Given that suttacentral is a rather complex app, I have not bothered to wade through the code on github (as yet), which leaves a question about this statement: “The basic idea is that anything must be labelled with the standard SC UIDs.” What are these, and where do I find the “standard classes”? (I assume these are the ones mentioned as substituting for TEI annotation?)
2. OCR: tesseract, although working with Japanese, produces terrible results – I have more or less given up on that. Does anybody know of, or have experience with, other non-commercial software? I’d say 95%+ accuracy is absolutely necessary. The original text contains unreformed (= non-Toyo) characters. Text is written vertically, right-to-left.
3. The original has a frame around the body text. Is there anybody out there who could write a Python script for GIMP that detects the printed frame, crops the page to it, and lightens the grey by way of the histogram?
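The frame-cropping request in the last item can be sketched in plain Python, independent of GIMP. This is only an illustration under stated assumptions: the page is taken to be a grid of grey values (0 = black, 255 = white), a frame line is assumed to be any row or column that is mostly dark, and every function name and threshold below is hypothetical rather than part of any existing tool.

```python
def frame_lines(pixels, dark=128, min_fraction=0.6):
    """Indices of rows and columns that look like printed frame lines.

    A row or column qualifies when at least `min_fraction` of its pixels
    are darker than `dark`. Both thresholds are guesses to be tuned.
    """
    height, width = len(pixels), len(pixels[0])
    rows = [y for y in range(height)
            if sum(p < dark for p in pixels[y]) >= min_fraction * width]
    cols = [x for x in range(width)
            if sum(row[x] < dark for row in pixels) >= min_fraction * height]
    return rows, cols


def inner_edge(lines, from_start):
    """First index just inside the frame line at one end of `lines`."""
    ordered = lines if from_start else lines[::-1]
    step = 1 if from_start else -1
    edge = ordered[0]
    # Walk across the thickness of the line (a run of consecutive indices).
    for idx in ordered[1:]:
        if idx != edge + step:
            break
        edge = idx
    return edge + step


def crop_to_frame(pixels, dark=128, min_fraction=0.6):
    """Return only the area enclosed by the detected frame."""
    rows, cols = frame_lines(pixels, dark, min_fraction)
    if not rows or not cols:
        raise ValueError("no frame lines found")
    top, bottom = inner_edge(rows, True), inner_edge(rows, False)
    left, right = inner_edge(cols, True), inner_edge(cols, False)
    if top > bottom or left > right:
        raise ValueError("frame is incomplete")
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]


def stretch_levels(pixels, black=60, white=200):
    """Linear levels stretch: greys at or above `white` become pure white."""
    span = white - black
    return [[max(0, min(255, (p - black) * 255 // span)) for p in row]
            for row in pixels]
```

On a real scan the grid would come from an image library, and the thresholds would need tuning per volume, since the printed frame does not span the full page and scans are rarely perfectly straight.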

Viveka · January 29, 2020, 7:03am

Greetings @Henry and welcome to the forum

Many thanks for your post and the great work you are doing

Look forward to seeing you around the boards. If you have any questions or need some assistance, please feel free to ask one of the Mods
@sujato

sujato · January 29, 2020, 9:59pm

Hi Henry,

You’ll be pleased to know we have begun discussions with a translator interested to create modern Japanese translations!

Numbering is defined here:

https://suttacentral.net/numbering

But to be honest, I think this is going to be a hapless task, especially once you get into AN and SN and more obscure works. Who knows what system they used? It is extremely difficult to isolate the exact boundaries of texts and numbers even in a well-formed digital edition, and I would suspect it will prove nigh-impossible in a scanned edition, without long and laborious work checking the Japanese.

And who is it for? The translation is, according to our Japanese friend, almost unreadable to modern Japanese. It’ll be a great resource for translators creating a new edition, or perhaps for scholars (although Japanese scholars usually just learn Pali), but it will never be widely read by Japanese Buddhists interested to read the suttas.

Personally, I think creating good PDFs and uploading to archive.org and similar is the best option. Keep it simple; just that much accomplishes most of the benefit with a reasonable amount of effort. Doing so, we rescue this important historical work from obscurity, make it freely available to scholars, and provide the basis for a modern translation.

Indeed, I’ve not had any success using OCR on Buddhist texts in complex scripts. It would be nice if it worked, we could create a searchable PDF, but I’m not optimistic.

I’m not sure exactly what you want to achieve here? Is this just to crop to the main text for OCR purposes?

Henry · January 30, 2020, 12:21am

I should point out that I read and translate Japanese. During my 4 years there I started in the automobile sector, completed the 4-stage course of the Nihon Kagagaku Gijjutsu Kyokai and branched into patents. When I started preparing Hermann Bohner’s work (a brilliant translator whose quality I’ll never attain) for the Web in 2006, I took up Kanbun (i.e classic Chinese with reading marks), initially to compare his translations.

So far I have only borrowed Vols. 1 and 2. Their introduction mentions that the texts were translated into Japanese from Oldenberg’s 1881 edition of the Vinaya. I assume the suttas will be based on other PTS editions – no point in translating that into a third language from the Japanese. This Japanese academic argues for any new translation to be based on the 6th council texts.

Sorry, are we talking about the same book here, the Nanden Daizokyo, and not the Taisho, which is written in Kanbun?
Looking at the ND, apart from the fact that some pre-reformed Kanji are being used the type is clear. The text is modern Japanese ( – assuming one does not take Manga-Japanese as the standard). Special Buddhist readings are given as furigana – nothing special here, provided one has a bit of background knowledge. The style as such is of course antiquated, but then so are some of the suttas as such (just try and find a translation of the Metta Sutta in any Western language that makes pleasant reading).

I agree that the aim must of course be [quote=“sujato, post:10, topic:363”]
creating good PDFs
[/quote]

To this end I have explored various technical (open source) options available. I have spent most of today refining the script that semi-automatically prepares a XeTeX frame from which a usable pdf is created (and I think we agree – herein differing from Oxford Univ. Press – that LaTeX produces the best results).
By providing a ToC at least down to subsection level (and backlinks) plus placing correct page numbers (i.e. pdf matching the original) the file becomes usable, since at least the headings are machine readable (and would show up as search results even when on archive.org). Since image quality has to be checked anyway, adding a few headlines does not cause too much extra work. For me, say 4-6 hours, i.e. a dreary German winter evening, is not too much effort to put into “post-production” cleanup for each volume. (I’ll look into the numbering once I have scanned Vol 2).
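As a rough illustration of what such a script might emit per heading, here is a minimal Python sketch. It is not the actual script. It assumes the TeX master loads the hyperref package (for `\phantomsection` and `\href`), the function name and arguments are hypothetical, and the UID in the usage note is only an example of the kind of identifier meant.

```python
def tex_heading(title, label, page, uid=None):
    """Build the LaTeX lines for one manually added heading.

    title: heading text as it should appear in the ToC (not escaped here,
           so LaTeX special characters must be handled by the caller)
    label: name for \\label, so other documents can link to this spot
    page:  page number printed in the original volume
    uid:   optional SuttaCentral ID for a backlink (assumed URL pattern)
    """
    lines = [
        # Make the pdf page number match the printed original.
        r"\setcounter{page}{" + str(page) + "}",
        # Give hyperref an anchor, then add a ToC entry without printing text.
        r"\phantomsection",
        r"\addcontentsline{toc}{subsection}{" + title + "}",
        r"\label{" + label + "}",
    ]
    if uid:
        lines.append(r"\href{https://suttacentral.net/" + uid + "}{" + uid + "}")
    return "\n".join(lines)
```

Calling `tex_heading("波羅夷一", "sec:pj1", 21, "pli-tv-bu-vb-pj1")` returns five lines that can be pasted before the `\includegraphics` of the corresponding scanned page; leaving out `uid` omits the backlink.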

Providing the scans of course has, like any web-based presentation, to take into account ALL potential users.
The lack of OCR and therefore accessibility excludes most visually impaired.
Removing the frame (and superfluous info outside) will, should sometime in the future a workable OCR solution become available, make it easier.
In the same vein is my insistence on creating PDF/A (or PDF/UA) standard-compliant files on a level that major online repositories/libraries use – these pdfs will be permanently portable as long as electronic “permanence” will exist.
(Note: I am well aware that regarding “standard compliance” the old technician rears his head. Some of my explanations are too technical, particularly if one is not academically inclined, e.g. not liking footnotes etc.)

Henry · February 6, 2020, 10:11am

An update:

Regarding the mentioned GIMP script, I’d say Fred’s ImageMagick-based “textcleaner” together with some cropping does a reasonable job.

So far Vol 1 and 2 are almost ready as pdfs, Vol 3 is in the process of being scanned. Touching up and adding links to suttacentral won’t take more than a week, after which I should be able to upload the files to archive.org – to an account named “NandenDaizokyo.” In the hope that scanning the ND may still turn into a collective enterprise, anybody may contact me for other uses of high-resolution files/scans (several hundred MB per Vol.) and the TeX master template.

Additionally adding the pdfs to suttacentral-files github repo would make sense.

On a cheeky note, I did find this Japanese woodcut by Katsushika Taito 葛飾戴斗 (1818–54):¹ Journey to the West (西遊記): “The Seiryu-zan Goblin (青竜山妖怪²) steals the Tripitaka from the ‘monkey.’” Anybody who has read this Chinese classic knows that “the monkey” is Hsüan-tsang, and he gets the Canon back …

¹) Scanned from BSB L.jap. Aa 2-60, p. 460
²) 妖怪 can also be rendered as “apparition,” which of course does not fit a librarian.

Henry · February 6, 2020, 11:09am

PS: Regarding numbering
(Hardlinking seems rather pointless):
Is there a concordance table already in use internally?

“These same segment numbers apply to both text and translation. Here is how this looks in the underlying file, which uses the PO file type of the open source gettext translation system.“ Said underlying file would be a start.

sujato · February 6, 2020, 8:56pm

A concordance of what, exactly? I’m really not sure what you’re looking for.

Our sutta reference concordances are here. These allow you to reference between SC and other legacy editions of the Pali text.

https://raw.githubusercontent.com/suttacentral/sc-data/master/misc/all_pali_concordance.json

The root Pali texts are in bilara-data, for example here:

https://github.com/suttacentral/bilara-data/blob/master/root/pli/ms/mn/mn1_root-pli-ms.json

All links to suttas must link to the SuttaCentral ID, eg. mn1.
All links to things within suttas must link to segment ID, eg. mn1:1.1
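A minimal Python sketch of handling these two kinds of reference follows. It assumes that a segment ID is simply the sutta ID, a colon, and a segment number, as in the examples above, and that a bilara-data root file is a flat JSON object keyed by segment ID. The function names and the placeholder text are hypothetical, not part of SuttaCentral’s code.

```python
def split_segment_id(reference):
    """Split 'mn1:1.1' into ('mn1', '1.1'); a bare 'mn1' gives ('mn1', None)."""
    uid, sep, number = reference.partition(":")
    if not uid or (sep and not number):
        raise ValueError("not a valid reference: " + repr(reference))
    return uid, number or None


def sutta_url(reference):
    """Link to the sutta as a whole, dropping any segment number."""
    uid, _ = split_segment_id(reference)
    return "https://suttacentral.net/" + uid


def segments_for(uid, data):
    """Pick out the segments of one sutta from a dict keyed by segment ID."""
    return {key: text for key, text in data.items()
            if split_segment_id(key)[0] == uid}


# Placeholder data in the assumed shape of a root file; not real text.
sample = {
    "mn1:0.1": "(placeholder heading)",
    "mn1:1.1": "(placeholder first segment)",
    "mn2:1.1": "(placeholder from another sutta)",
}
mn1_only = segments_for("mn1", sample)
```

A real file would be read with `json.load` first; the filtering step matters only if several suttas share one file.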

See discussions here:

Henry · February 10, 2020, 9:56am

Thanks, the above seems to be just what I was looking for. Amazing how complex the technology behind suttacentral is.

Apart from that, here is an unordered list of what is in the Nanden Daizokyo:

5 vols, Ritsuzō 律蔵 (= Vinaya)
3 vols 長部経 (= Dīgha-nikāya)
4 vols, Chūbu kyōten 中部経典 (= Majjhima-nikāya)
22 vols, Shōbu kyōten 小部経 (= Khuddaka Nikāya)
6 vols, Sōō-bu kyōten 相應部経 (= Saṁyutta-nikāya)
2 vols, 論事 (= Kathāvatthu)
7 vols Zō-shibu 增支部経 (= Ekôttarikâgama)
3 vols, Sōron 雙論 (= Yamaka)
2 vols, Ronji 論事 (= kathā-vatthu)
7 vols, Hosshu-ron 發趣論 (= Paṭṭhāna)
1 vol, Hōshū-ron 法集論 (= Dhamma-saṅgaṇi)
2 vols, Bumbetsu-ron 分別論 (= Vibhaṅga)
1 vol, Shō-ō tōshi 小王統史 (= Cūḷavaṁsa)
3 vols, 清浄道論 (= Visuddhimagga)
1 vol 島王統史 (= Dīpavaṁsa + Mahāvaṁsa)
3 vols 彌蘭 (= Milindapañha)
2 vol bits and pieces, Ashoka edicts among them: 一切善見律註序. 攝阿毘達磨義論. - 阿育王刻文
3 vols, glossary + register crossreferencing Pali + Kanji

Henry · February 17, 2020, 9:47pm

The first two scanned volumes of the Nanden Daizōkyō (Japanese Pāli canon) are up on archive.org as of tonight. These are:

律蔵 [Vinaya] 1: 経分別一 (Sutta-Vibhaṅga 1)
律蔵 [Vinaya] 2: 経分別二 (Sutta-Vibhaṅga 2), 比丘尼分別一 (Bhikkhuni-Vibhaṅga 1)
律蔵 [Vinaya] 3 should follow tomorrow, I’ll just play around with another script for better image conversion.

Page numbering follows exactly the printed original, it’s therefore possible to quote correctly.

As there is no decent open-source OCR available for Japanese with lots of unreformed Kanji (short of buying a license for ABBYY), chapter and section headings have been manually added so users can at least navigate to (sub)section headings.

The pdf contains backlinks to the relevant parallels on suttacentral. [Note: these work only in the original large size pdf (350 Mb) not the automatically converted files, neither pdf nor djvu.]
Somebody capable ought to place links in the other direction. The TeX master (also uploaded, see “other files”) has \labels for each subsection.
Additionally, from vol 2 onward an archive of the original tiff scans will also be uploaded. Should usable OCR for CJK become available in the future, anybody may improve on the existing work.

sujato · February 17, 2020, 9:57pm

Sadhu! Sadhu! Sadhu! Thanks so much for this Henry. I’m thinking that we should get in touch with the Theravadin people in Japan, they would be the ones most interested in this. Hmm, not sure how to do that.

Kaz · February 18, 2020, 12:04am

Bhante @sujato I know there is the Japan Theravada Buddhism Society in Japan.

Am I on the right track? Is this kind of thing you are thinking of? Or more academic ones?

Henry · February 18, 2020, 9:16am

@Kaz, sorry mate I read Japanese, writing is a totally different story.

But then Sumanasara Thero is Sri Lankan. And then of course there is a link to the [Sujata Buddhist sisters](https://sujaatabs.wixsite.com/mysite)

Henry · March 18, 2020, 5:14pm

Volume 4 is up as well. Using ScanTailor to improve images did make a big difference.

As much as I’d like to complete the Vinaya quickly (i.e. the final Vol 5), I am prevented from doing so by the current worldwide outbreak of virus-induced madness, which here in Europe takes the form of blind actionism on the part of hapless politicians who for the last thirty years have ruined public health services in the name of “efficiency better provided by profit-oriented private sectors.” Of course one can see all those ordered closures as a dry run for an authoritarian form of government soon to come – one of the effects is the restriction of research, by closing libraries.
Since the copy to be scanned may only be used in the reading room, there is no way of getting access for the immediate future.

Viveka · March 18, 2020, 8:16pm

Sorry for your troubles. It too will pass

Metta and Karuna