Machine translations of the CBETA corpus: Discussion on H-Buddhism

Khemarato.bhikkhu · June 18, 2022, 5:37am

Isn’t that precisely the situation glossed earlier as “gleaners and cleaners following behind the machines”?

sujato · June 19, 2022, 1:31am

One more in my series of notes on AI. I’d like to raise another ethical question, namely energy. Here’s some background.

The ethical problem is simple and unanswerable: how much energy are we justified in using to create such models?

The energy use of AI modelling is driven by the fact that they have to create extremely large datasets through highly intensive number crunching. This is ultimately moving electrons around, which requires energy and generates heat; the same heat that’s coming out of your computer or phone now, just a lot more of it.

When making more advanced systems, the basic method is to use bigger datasets (which in this case is limited, since it requires more human translation as input) or more intensive data-crunching. Enabling, for example, the kind of “context-aware” translation spoken of above, which expands the data vectors beyond a single segment, is orders of magnitude more complex. This is a basic physical constraint. Of course you can do things more efficiently, but the problem is that efficiency gains tend to be linear, while complexity increases exponentially. This is why the field as a whole keeps demanding more and more energy.

Now we can address this with practical solutions:

more efficient chips
use renewable energy
make more efficient systems

Which, fun fact, are the very arguments used by the crypto bros to justify burning the planet so they can play with their toy money. Crypto and AI are similar in that they both aim to replace or enhance conventional technologies with highly energy-intensive computational processes. Of course, crypto uses this to run scams and fleece fools of their money, while AI can, in principle, be used for good things. But still, it should prompt us to consider these ethical questions seriously, and not just fob them off.

There should be a story front and center about these ethical considerations. Is there an ethical justification for applying AI in this context? What is it? How much energy use is too much? What are the criteria that AI projects should consider when evaluating their energy usage?

SebastianN · June 19, 2022, 9:28am

Thank you all, and especially Bhante Sujato, for the elaborate messages. I do not have the time to answer them all in detail, but I enjoy reading along here and seeing what direction the discussion is heading in.
Just to give a few more observations from my side:
Regarding the environmental impact: To give a sense of how much energy the training of the Linguae Dharmae model took: 1x 3090 with a TDP of 350 watts running for 12 hours, or about 4.2 kWh. An electric car uses about 0.24 kWh/mile, so training this model is roughly equal to driving 17 miles in an electric car. This is not nothing, but sometimes it is also good to put things into perspective. The pretraining of larger models of the GPT class requires more energy, certainly. For example, the training of a Latin BERT model on a Google Cloud TPU with a 300 W TDP took 4 full days. Once such a model for our data is trained, it will remain in use for a few years to come. It’s not like we need to retrain these large models every couple of months, as is the case for other languages: the source material isn’t changing, and the availability of more translations only affects the fine-tuning process, which only requires a few hours every now and then.
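As a rough check on the arithmetic above, here is a minimal sketch. It assumes the accelerator draws its full rated TDP for the entire run and ignores the rest of the machine (CPU, cooling, data-centre overhead), so it is only a ballpark. The function names are illustrative, and the 0.24 kWh/mile figure is the one quoted above.

```python
# Back-of-the-envelope energy estimate for a training run.
# Assumption: the device draws its full TDP for the whole run; real draw
# differs, and the rest of the system is not counted.

EV_KWH_PER_MILE = 0.24  # electric-car consumption figure quoted in the post


def training_energy_kwh(power_watts: float, hours: float) -> float:
    """Energy in kWh for a device drawing `power_watts` for `hours`."""
    return power_watts * hours / 1000.0


def equivalent_ev_miles(energy_kwh: float,
                        kwh_per_mile: float = EV_KWH_PER_MILE) -> float:
    """Miles an electric car could drive on the same energy."""
    return energy_kwh / kwh_per_mile


# (power in watts, duration in hours) for the two runs mentioned in the post
runs = {
    "Linguae Dharmae (1x 3090, 12 h)": (350, 12),
    "Latin BERT (Cloud TPU, 4 days)": (300, 4 * 24),
}

for name, (watts, hours) in runs.items():
    kwh = training_energy_kwh(watts, hours)
    miles = equivalent_ev_miles(kwh)
    print(f"{name}: {kwh:.1f} kWh, about {miles:.1f} EV miles")
```

Under these assumptions both runs land in the range of a short to medium car trip, which is the same ballpark as the estimate given above.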
Regarding ethical considerations of the development and availability of machine translations of Buddhist source material both from the research and the religious perspective, I would like to mention another aspect:
I think it’s vital for us as independent researchers and translators to have our own models that are not tied to any agenda (apart from that of gaining knowledge). If we don’t engage in this research now, we might see a political or religious party (with more cash and manpower at hand than we do) publish their machine translations in a few months or years. Now that will happen anyway; in my eyes it’s just a question of when. I think the best we can do as a community is to develop our own models transparently and make the data available to the public. This might be the best way to take the wind out of the sails of big companies and religious/political institutions who might develop such models with certain agendas in mind, similar to how it worked out with GNU/Linux and the big software companies of the 90s.
I think the interest in machine translation of Buddhist material in East Asia is huge. And I can already see people publishing translations that might be just marginally better than our models at the moment in printed books etc. claiming them to be genuine works. I don’t think it is a process that can be stopped, it is just something our field has to go through. I don’t think there is a big danger to “Buddhism as such”, but it will certainly become more important to educate the community about the possible dangers and pitfalls of just trusting ‘translation x’ that has been found on the internet or in a bookstore.
I think we will see a lot of improvement in the quality of the models in the coming years. Neural machine translation only really kicked off in the mid 2010s and has since overtaken other, more limited machine translation approaches. The transformer networks introduced in 2017 brought another big increase in quality, as did the pretraining methods applied since 2018. Since 2020 we have seen how pretraining methods can be leveraged to improve the translation quality of low-resource languages (for which only very limited data is available). Linguae Dharmae is one outcome of these developments, but it is still in an extremely rudimentary state. There is so much that can be improved easily.
Currently we only used about 10%, maybe less than that, of the available translated data to train the model. We didn’t use any domain-specific pretraining, we didn’t address the problem of punctuation/text segmentation, we didn’t deal with document-level context. And then, of course, we are hoping that the availability of machine translations will motivate some people to do a bit of post-correction here and there that can then flow back into retraining and improving the models. We already get feedback from that direction and I also engage in that task in my own free time every now and then since, to me, it feels like a meaningful exercise. In my eyes it is already more time-efficient to post-correct the output of our model than to translate from scratch, at least for the material that I am working on.
I think Marcus was intentionally a bit provocative in the announcement, but at the same time I do believe that this managed to raise the important questions for our field and caused a lot of people to discuss this. And, of course, nobody needs to rely on machine translations for their work. So if somebody feels uncomfortable with the idea of just becoming a ‘gleaner and cleaner’ of machine translations, they don’t have to do that, of course. It is similar to bicycles still being in use despite the availability of trains, cars and airplanes.
So I am mildly optimistic about our future here and see it as a good sign that we have lively debates about what is happening here!

sujato · June 22, 2022, 9:00am

Thanks Sebastian, I’m meaning to follow up on what you said, as well as writing what I promise will be a positive contribution to the debate!

In the meantime, though, I raised this question today with the rector of the Nagananda Buddhist University, where I’m currently holding a course, Ven Bodagama Chandima. I asked whether he thought a machine translation of suttas from Chinese would be a good idea. He replied with an enthusiastic “yes”, saying that, based on his many years of teaching in Taiwan, we needed more knowledge of Chinese-language texts and sutras. So you have a new fan!

Dheerayupa · July 1, 2022, 4:56am

This statement really hits the point. Thailand has three widely known translations, all by major monk institutes, so naturally the general public who are interested in reading the Tipitaka ‘trust’ their translations. Those who can’t make heads or tails of the stylish language used just don’t read the Tipitaka.

As a professional translator, I don’t mind machine translation and find it helps save precious time (I have fewer than 24 hours a day) when translating/typing. Of course, when I use a machine (Microsoft Office Word), I treat it as a kitchen hand who chops and places all ingredients that their little brain can figure out for me. It’s up to me, the chef, to decide what to use and how much, to prepare a good dish.

If and when the machine translation of the Tipitaka is available in Thai, the reader has to be made aware that mistakes should be expected.

Vimala · July 1, 2022, 6:59am

Dear all,

I apologize I have been very absent from the discussion and forgive me for not having the time to read everything. I just want to share here that we made some decisions that might be of interest with regards to the machine translations, also those of the Pali.

First of all we will not host the machine translations on BuddhaNexus and also remove the Pali translations there. We will make a new website that hosts these machine translations with the specific purpose to allow scholars to make corrections to the machine’s output translations. This data can then serve as training data for the model so as to improve future versions. This website will also make it very clear what these “translations” are.

We had discussed this earlier for the Pali also and I think that would be absolutely great. I would suggest, however, that we make some further headway first before going this route. For the Pali I have some more segmented translations of commentarial texts from our friends at the Theravada Group Moscow, so I would like to incorporate those in the next run of the Pali texts.

Bird-of-Paradise · July 1, 2022, 1:12pm

SebastianN wrote

One can of course disagree with this approach and I am very much happy to hear opinions that diverge from what we are doing.

I appreciate your approach. Machines have been crucial to me for a sound understanding of Samyukta Agama. Sometimes Google translator becomes the saviour when DeepL flounders. I use Yandex too.

You wrote

One thing to consider is that what we are doing here is to develop a linguistically driven, alternative approach to the question of translation of Buddhist texts; it’s not about cloning human beings or building nuclear weapons, and certainly not about putting well-crafted human-made translations of Buddhist texts into question. In that sense, I hope we as the technicians can keep a positive relationship with the translators here, who have created and still are, without any doubt, creating constant merit for the Buddhist community that a machine will never be able to achieve.

On occasion a machine can correct a human error, SN 12.63 comes to mind. My main focus is on Samyuktagama, which VBB has said is the closest to the Buddha, in his introduction to Samyutta Nikaya.

It is puzzling why the Pali translators omitted some seminal suttas. An omission of a critical sutta at the beginning of a Samyutta, and subtle modifications of the ensuing text can lead to misleading interpretations. Anapana Samyutta comes to mind.

It begins with SA 803/SN 54.1, but omits SA 801. SA 801 (Easing into breath), as the introductory sutta, would have made a significant difference.

If not for machines I would not have detected such.

Why did the Pali tradition exclude the “Seal of Dhamma”, SA 80, sandwiched between SA 79 (SN 22.9: impermanence) and SA 81 (SN 22.60: Mahali)?

SA 80 stands out in the subtle forcefulness and power of its content. It is a synopsis of the liberating process.
Without the machines, curious people could not investigate which suttas were tampered with over time. A comparison of AN 9.37 and SA 557 presents a case of sutta tampering. SA 556 and SA 558 support SA 557, but Pali compilers left these untranslated.

I was able to read the untranslated Chinese suttas thanks to the work of dedicated technicians like you, who enable machines to echo the voice/spirit of the Buddha. Thanks also to cdpatton’s contributions: Yinshun's Reconstruction of the Chinese Saṃyukta Āgama (Taisho 99)

Thanks to Sutta Central for the Chinese versions of the suttas of Samyukta Agama, both those that were left untranslated and those that were translated.

Dheerayupa · July 2, 2022, 1:59am

I personally don’t mind. I’m still confident that human translators today are still better than machines, but who knows one day we humans may be able to create a Universal Translator as used in the Star Trek universe. I wouldn’t object to it at all.

Vimala · July 2, 2022, 9:08am

UPDATE: The repository has been moved to:

(but the old link will redirect you too)

twestermann · October 21, 2022, 10:23am

Dear suttacentral community,

Thank you very much for all the discussions and comments on this topic. I’m Till, one of the people who have been working on this topic at DeepL, and one of the main drivers for it out of personal interest.

I’d like to clarify a few points about this project and our intentions, but first I’d like to refer you to the disclaimer we published the translations with:

About these translations

This text was translated by DeepL using an experimental model that was trained as an internal, non-commercial research project. Given the sparsity of the training material and the diverse nature of the Taisho and Shinsan corpus, the model might not always meet DeepL’s usual quality standards. Please take extreme caution when interpreting the results. We did not translate sentences with more than 400 characters as the resulting translations wouldn’t be reliable at all.

Personally, I think that machine translation can be very useful if used correctly. It could be a good start to a higher quality translation, or just give people an idea of what these texts are about. We are trying to see if we can tackle this problem from DeepL’s side and use our knowledge and infrastructure to do something good as a company. This is not and will never be a commercial project.

We decided to put our translations on GitHub so that there would be a starting point. That starting point is not great from a quality perspective. This is why we put the disclaimer above. It seems to me that this disclaimer didn’t get as much attention as the announcement on H-Net.

You might ask why we put this translation on GitHub in the first place: Vimala and I discussed publishing it, collecting corrected translations, and re-training to improve the quality. This might lead us to a more reliable model.

Putting the translation out there did indeed stir things up: More training material has emerged and we are working on improving the model. Others have joined in and hopefully this will improve things even more. So from our point of view it was actually a success. Not in the sense that we have perfect translations, but that we have made progress towards better translations.

The focus of this project is to provide the community with helpful data with our translation models to better decipher these texts. We are already working on collecting multiple translations per source sentence and alignments between the Chinese characters and the English words to gain additional insight into the original texts.

So I can only invite everyone here to provide feedback in the form of corrected translations so that we can improve the translations. As the corpus is not homogeneous, this will be a very interesting challenge - but not impossible.

I’d like to state that we as DeepL employees didn’t write the announcement on H-Buddhism. Personally, I don’t know what will happen in the future and what the relationship between machine translation and human translators will look like. The project might be a dead end or lead to great insight. Only time will tell.

Till

josephzizys · October 21, 2022, 10:36am

thanks Till! This is a fascinating project!!

Ric · October 21, 2022, 11:47am

Hi @twestermann,

Welcome to the D&D forum!

Enjoy the multiple resources here available: may these be of assistance along the path.

Should you have any questions related to the forum, feel free to contact the @moderators.

With Metta,
Ric
On behalf of the moderators

josephzizys · October 22, 2022, 3:54am

@twestermann or @Vimala, I am looking to check the parallels on SuttaCentral against the machine translations. For example, MN35 is listed with 2 parallels, given as Tii715a28 and Tii035a17, but the machine files are named in the format T01n0001_010. How do the 2 formats relate? Thanks for any help you can give!!

addendum: Analayo gives SĀ 110 at T 99, 35a-37b. EĀ 37.10 at T 125, 715a-717b which is different again?

Metta.

Snowbird · October 22, 2022, 5:06am

Welcome to the forum!

Yes. Honestly, I think that it should have been much more prominent and direct. Instead of “About these translations” it should have been headed “Warning”. And personally I believe it should have been included in the text of every result. More like a black box warning.

In the Read Me, again you talk about a warning, but it is not labeled as such. If you are serious about this I request that you actually present them with a warning section, rather than a “how to publish” section (although that’s good to have too). In this warning section you might keep in mind that non-specialists will be reading it and use language that conveys the natural caution someone knowledgeable in machine translation would have.

The convention of putting quotation marks around the word “translation” also does not convey any real meaning to the general public because of the frequent misuse of scare quotes. If they are not translations it would be good to have a plain language way of indicating this rather than scare quotes.

Honestly, most people aren’t qualified to decide how to use the work you have done. I think it is just as likely to be misused as not.

Certainly I’m happy you are doing this work, and I hope it progresses. And while the openhandedness of the Buddha with the Dhamma should usually be a guiding principle, he also spoke very strongly about teaching “non-Dhamma as Dhamma.” This is especially relevant as people will be using your work to try and decide what is Dhamma and non-Dhamma in other canons.

josephzizys · October 22, 2022, 5:29am

But who gets to decide who is “qualified” and who is “misusing” the work @Snowbird ? Is there some qualification for the people who get to determine who is qualified? a qualifier qualification perhaps.

The more translations, be they human or machine, the better, in my opinion. Let a thousand flowers bloom!

Snowbird · October 22, 2022, 6:51am

No need to make this personal.

cdpatton · October 22, 2022, 7:29am

Their files are divided into Taisho texts and then individual fascicles. Fascicles were individual scrolls in ancient times. The filenames look like this:

T01n0001_001.json

That means Taisho Vol. 1, text no. 1, fascicle no. 1. Inside that file, they label the individual Chinese segments:

"source": "T01n0001_001:0001a04_0"

The part to the right of the colon means page no. 1, column a, line 4. Taisho pages have three columns of text (a, b, c from top to bottom), and each is about 29 lines long (counted from right to left). They looked like this in the print edition.

Normally, we would write that as T1.1a4, or T1.1.1a4 (the volume number is optional since we hardly ever look at physical copies these days). SuttaCentral’s notation, which skips the text no., would be T i 1a4.
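To make the mapping between these notations concrete, here is a minimal Python sketch that splits a segment label like the one above into its parts and prints it in the three citation styles just mentioned. The field widths in the pattern are inferred only from the single example shown, the meaning of the trailing `_0` is not explained here so it is captured but otherwise ignored, and the names `TaishoRef`, `parse_segment`, and `to_lower_roman` are made up for illustration.

```python
import re
from dataclasses import dataclass

# Pattern inferred from the one example "T01n0001_001:0001a04_0".
# Assumption: fixed digit widths and purely numeric text numbers.
SEGMENT_RE = re.compile(
    r"^T(?P<volume>\d{2})n(?P<text>\d{4})_(?P<fascicle>\d{3})"
    r":(?P<page>\d{4})(?P<column>[abc])(?P<line>\d{2})_(?P<suffix>\d+)$"
)


def to_lower_roman(n: int) -> str:
    """Lower-case Roman numeral; enough for Taisho volume numbers."""
    pairs = [(100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in pairs:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


@dataclass(frozen=True)
class TaishoRef:
    volume: int
    text: int
    fascicle: int
    page: int
    column: str
    line: int

    def short(self) -> str:
        # Text no., then page/column/line, e.g. T1.1a4
        return f"T{self.text}.{self.page}{self.column}{self.line}"

    def with_volume(self) -> str:
        # Volume, text no., then page/column/line, e.g. T1.1.1a4
        return f"T{self.volume}.{self.text}.{self.page}{self.column}{self.line}"

    def suttacentral(self) -> str:
        # Volume in Roman numerals, text no. skipped, e.g. T i 1a4
        return f"T {to_lower_roman(self.volume)} {self.page}{self.column}{self.line}"


def parse_segment(source: str) -> TaishoRef:
    """Parse a segment label taken from a "source" field."""
    m = SEGMENT_RE.match(source)
    if m is None:
        raise ValueError(f"unrecognised segment label: {source!r}")
    return TaishoRef(
        volume=int(m["volume"]), text=int(m["text"]),
        fascicle=int(m["fascicle"]), page=int(m["page"]),
        column=m["column"], line=int(m["line"]),
    )


ref = parse_segment("T01n0001_001:0001a04_0")
print(ref.short(), ref.with_volume(), ref.suttacentral())
```

Going the other way, from a SuttaCentral reference to a file, needs the text number as well, since the SuttaCentral notation leaves it out.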

Just to help out with typical Chinese Agama parallels, the Dirgha Agama is Taisho No. 1 (volume 1), the Madhyama Agama is Taisho No. 26 (volume 1), the Samyukta Agama is Taisho No. 99 (volume 2), and the Ekottarika Agama is Taisho No. 125 (volume 2). That would cover most parallels listed in SuttaCentral.
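Those four numbers are enough to work out which group of files to look in for a parallel. The sketch below is only illustrative: the abbreviations DA, MA, SA, and EA and the function names are assumptions, and the filename pattern is inferred from the example T01n0001_001.json above. It does not tell you which fascicle holds a given page; for that you would still need to search the "source" labels inside the files.

```python
# Agama collection -> (Taisho volume, Taisho text no.), as listed above.
# The keys are illustrative abbreviations, not an official scheme.
AGAMAS = {
    "DA": (1, 1),     # Dirgha Agama
    "MA": (1, 26),    # Madhyama Agama
    "SA": (2, 99),    # Samyukta Agama
    "EA": (2, 125),   # Ekottarika Agama
}


def file_prefix(collection: str) -> str:
    """Common filename prefix for every fascicle of a collection."""
    volume, text = AGAMAS[collection]
    return f"T{volume:02d}n{text:04d}"


def fascicle_filename(collection: str, fascicle: int) -> str:
    """Filename of one fascicle, following the pattern T01n0001_001.json."""
    return f"{file_prefix(collection)}_{fascicle:03d}.json"


for name in AGAMAS:
    print(name, file_prefix(name))
```

So a Samyukta Agama parallel cited from volume 2 would be somewhere in the files whose names start with the Samyukta Agama prefix, and an Ekottarika Agama parallel in those starting with the Ekottarika Agama prefix.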

I personally don’t think anyone who doesn’t read classical Chinese should take AI translations of the Taisho very seriously. None of these algorithms is competent yet. People who have a decent grasp of the original language can find it useful to scan texts that haven’t been translated yet. The quality of the translation is very uneven. Sometimes it’s fine, and sometimes it’s gobbledegook (as my mother would say). It’s just really difficult to automate translation of East Asian languages like Chinese and Japanese, much less ancient texts.

josephzizys · October 22, 2022, 8:00am

You’re completely right @Snowbird, I apologise. I will revise my post.

josephzizys · October 22, 2022, 8:04am

Thank you so much @cdpatton! And I agree with you completely. My main hope is to be able to find things by recourse to the machine translations, and then ask you to translate them.

But seriously, it’s mostly to help me locate parallels rather than to understand the meaning of the text. For that I will rely on the experts.

Metta!

cdpatton · October 22, 2022, 6:14pm

It can be useful in certain ways. At first, I thought maybe it could be used as a first draft for future translations, but it’s so inconsistent from one sentence to the next, it would be too much work. I’d be rewriting every sentence, so why not just start from scratch?

I could see it as a learning aid or a way to quickly skim through untranslated Taisho texts for people who aren’t fluent in classical Chinese.