Machine translations of the CBETA corpus: Discussion on H-Buddhism

Isn’t that precisely the situation glossed earlier as “gleaners and cleaners following behind the machines”?

One more in my series of notes on AI. I’d like to raise another ethical question, namely energy. Here’s some background.

The ethical problem is simple and unanswerable: how much energy are we justified in using to create such models?

The energy use of AI modelling is driven by the fact that these systems have to create extremely large datasets through highly intensive number crunching. This is ultimately moving electrons around, which requires energy and generates heat; the same heat that’s coming out of your computer or phone now, just a lot more of it.

When making more advanced systems, the basic method is to use bigger datasets (which in this case are limited, since that requires more human translation as input) or more intensive data-crunching. Enabling, for example, the kind of “context-aware” translation spoken of above, which expands the data vectors beyond a single segment, is orders of magnitude more complex. This is a basic physical constraint. Of course you can do things more efficiently, but the problem is that efficiency gains tend to be linear, while complexity increases exponentially. This is why the field as a whole keeps demanding more and more energy.

Now we can address this with practical solutions:

  • more efficient chips
  • use renewable energy
  • make more efficient systems

Which, fun fact, are the very arguments used by the crypto bros to justify burning the planet so they can play with their toy money. Crypto and AI are similar in that they both aim to replace or enhance conventional technologies with highly energy-intensive computational processes. Of course, crypto uses this to run scams and fleece fools of their money, while AI can, in principle, be used for good things. But still, it should prompt us to consider these ethical questions seriously, and not just fob them off.

There should be a story front and center about these ethical considerations. Is there an ethical justification for applying AI in this context? What is it? How much energy use is too much? What are the criteria that AI projects should consider when evaluating their energy usage?


Thank you all, and especially Bhante Sujato, for the elaborate messages. I do not have the time to answer them all in detail, but I enjoy reading along here and seeing what direction the discussion is heading in.
Just to give a few more observations from my side:
Regarding the environmental impact: to give a sense of how much energy the training of the Linguae Dharmae model took, it was a single 3090 GPU with a TDP of 350 watts running for 12 hours, which is at most about 4.2 kWh. An electric car uses roughly 0.24 kWh per mile, so training this model is equal to driving roughly 15 miles in an electric car (at most about 17.5 miles if the card drew its full TDP the whole time). This is not nothing, but sometimes it is also good to put things into perspective. The pretraining of larger models of the GPT class requires more energy, certainly. For example, the training of a Latin BERT model on a Google Cloud TPU with a 300 W TDP took 4 full days. Once such a model for our data is trained, it will remain in use for a few years to come. It’s not like we need to retrain these large models every couple of months, as is the case for other languages: the source material isn’t changing, and the availability of more translations only affects the fine-tuning process, which only requires a few hours every now and then.
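The back-of-the-envelope estimate above can be written out as a short sketch. It assumes the device draws its full TDP for the whole run (so the result is an upper bound for the chip itself), it ignores the host machine and cooling, and it takes the 0.24 kWh per mile figure from the post at face value. The function names are made up for illustration.

```python
def training_energy_kwh(tdp_watts: float, hours: float) -> float:
    """Upper-bound energy in kWh, assuming the device runs at full TDP throughout."""
    return tdp_watts * hours / 1000.0


def equivalent_ev_miles(energy_kwh: float, kwh_per_mile: float = 0.24) -> float:
    """Miles an electric car could drive on the same energy (0.24 kWh/mile assumed)."""
    return energy_kwh / kwh_per_mile


# Figures quoted in the post: one 350 W GPU for 12 hours,
# and one 300 W TPU for 4 full days (96 hours).
runs = {
    "Linguae Dharmae (350 W, 12 h)": training_energy_kwh(350, 12),
    "Latin BERT (300 W, 96 h)": training_energy_kwh(300, 4 * 24),
}

for name, kwh in runs.items():
    print(f"{name}: {kwh:.1f} kWh, about {equivalent_ev_miles(kwh):.1f} EV miles")
```

Under these assumptions the first run comes to about 4.2 kWh (17.5 miles) and the second to about 28.8 kWh (120 miles). Real consumption depends on actual power draw and on the rest of the hardware, so these numbers are only an order-of-magnitude guide.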
Regarding ethical considerations of the development and availability of machine translations of Buddhist source material both from the research and the religious perspective, I would like to mention another aspect:
I think it’s vital for us as independent researchers and translators to have our own models that are not tied to any agenda (apart from that of gaining knowledge). If we don’t engage in this research now, we might see a political or religious party (with more cash and manpower at hand than we have) publish their machine translations in a few months or years. That will happen anyway; in my eyes it’s just a question of when. I think the best we can do as a community is to develop our own models transparently and make the data available to the public. This might be the best way to take the wind out of the sails of big companies and religious/political institutions who might develop such models with certain agendas in mind, similar to how it worked out with GNU/Linux and the big software companies of the 90s.
I think the interest in machine translation of Buddhist material in East Asia is huge. And I can already see people publishing translations that might be just marginally better than our models at the moment in printed books etc. claiming them to be genuine works. I don’t think it is a process that can be stopped, it is just something our field has to go through. I don’t think there is a big danger to “Buddhism as such”, but it will certainly become more important to educate the community about the possible dangers and pitfalls of just trusting ‘translation x’ that has been found on the internet or in a bookstore.
I think we will see a lot of improvement in the quality of the models in the coming years. Neural machine translation only really kicked off in the mid-2010s and has since overtaken other, more limited machine translation approaches. The transformer networks introduced in 2017 brought another big increase in quality, as did the pretraining methods applied since 2018. Since 2020 we have seen how pretraining methods can be leveraged to improve the translation quality of low-resource languages (for which only very limited data is available). Linguae Dharmae is one outcome of these developments, but it is still in an extremely rudimentary state. There is so much that can be improved easily.
Currently we only used about 10%, maybe less than that, of the available translated data to train the model. We didn’t use any domain-specific pretraining, we didn’t address the problem of punctuation/text segmentation, we didn’t deal with document-level context. And then, of course, we are hoping that the availability of machine translations will motivate some people to do a bit of post-correction here and there that can then flow back into retraining and improving the models. We already get feedback from that direction and I also engage in that task in my own free time every now and then since, to me, it feels like a meaningful exercise. In my eyes it is already more time-efficient to post-correct the output of our model than to translate from scratch, at least for the material that I am working on.
I think Marcus was intentionally a bit provocative in the announcement, but at the same time I do believe that this managed to raise the important questions for our field and caused a lot of people to discuss them. And, of course, nobody needs to rely on machine translations for their work. So if somebody feels uncomfortable with the idea of just becoming a ‘gleaner and cleaner’ of machine translations, they don’t have to do that. It is similar to bicycles still being in use despite the availability of trains, cars and airplanes.
So I am mildly optimistic about our future here and see it as a good sign that we have lively debates about what is happening here!


Thanks Sebastian, I’m meaning to follow up on what you said, as well as writing what I promise will be a positive contribution to the debate!

In the meantime, though, I raised this question today with the rector of the Nagananda Buddhist University, where I’m currently holding a course, Ven Bodagama Chandima. I asked whether he thought a machine translation of suttas from Chinese would be a good idea. He replied with an enthusiastic “yes”, saying that, based on his many years of teaching in Taiwan, we need more knowledge of Chinese-language texts and sutras. So you have a new fan!


This statement really hits the point. Thailand has three widely known translations, all by major monastic institutes, so naturally the general public who are interested in reading the Tipitaka ‘trust’ their translations. Those who can’t make heads or tails of the stylish language used just don’t read the Tipitaka.

As a professional translator, I don’t mind machine translation and find it helps save precious time (I have fewer than 24 hours a day :smiley:) when translating and typing. Of course, when I use a machine (Microsoft Office Word), I treat it as a kitchen hand who chops and lays out all the ingredients that its little brain can figure out for me. It’s up to me, the chef, to decide what to use and how much, to prepare a good dish.

If and when machine translation of the Tipitaka is available in Thai, the reader has to be made aware that mistakes should be expected.


Dear all,

I apologize for having been so absent from the discussion, and please forgive me for not having had the time to read everything. I just want to share here that we made some decisions that might be of interest with regard to the machine translations, including those of the Pali.

First of all we will not host the machine translations on BuddhaNexus and also remove the Pali translations there. We will make a new website that hosts these machine translations with the specific purpose to allow scholars to make corrections to the machine’s output translations. This data can then serve as training data for the model so as to improve future versions. This website will also make it very clear what these “translations” are.

We had discussed this earlier for the Pali also, and I think that would be absolutely great. I would suggest, however, that we make some further headway first before going this route. For the Pali I have some more segmented translations of commentarial texts from our friends at the Theravada Group Moscow, so I would like to incorporate those in the next run of the Pali texts.


SebastianN wrote

One can of course disagree with this approach and I am very much happy to hear opinions that diverge from what we are doing.

I appreciate your approach. Machines have been crucial to me for a sound understanding of Samyukta Agama. Sometimes Google translator becomes the saviour when DeepL flounders. I use Yandex too.

You wrote

One thing to consider is that what we are doing here is to develop a linguistically driven, alternative approach to the question of translation of Buddhist texts. It’s not about cloning human beings or building nuclear weapons, and certainly not about putting well-crafted, human-made translations of Buddhist texts into question. In that sense, I hope we as the technicians can keep a positive relationship with the translators here, who have created and still are, without any doubt, creating constant merit for the Buddhist community that a machine will never be able to achieve.

On occasion a machine can correct a human error; SN 12.63 comes to mind. My main focus is on the Samyukta Agama, which VBB has said, in his introduction to the Samyutta Nikaya, is the closest to the Buddha.

It is puzzling why the Pali translators omitted some seminal suttas. An omission of a critical sutta at the beginning of a Samyutta, and subtle modifications of the ensuing text can lead to misleading interpretations. Anapana Samyutta comes to mind.

It begins with SA 803/SN 54.1, but omits SA 801. SA 801 (Easing into breath), as the introductory sutta, would have made a significant difference.

If not for machines I would not have detected such.

Why did the Pali tradition exclude the “Seal of Dhamma”, SA 80, sandwiched between SA 79 (SN 22.9: impermanence) and SA 81 (SN 22.60: Mahali)?

SA 80 stands out in the subtle forcefulness and power of its content. It is a synopsis of the liberating process.
Without the machines, curious people could not investigate which suttas were tampered with over time. A comparison of AN 9.37 and SA 557 presents a case of sutta tampering. SA 556 and SA 558 support SA 557, but Pali compilers left these untranslated.

I was able to read the untranslated Chinese suttas thanks to the work of dedicated technicians like you, who enable machines to echo the voice/spirit of the Buddha. Thanks also to cdpatton’s contributions: Yinshun's Reconstruction of the Chinese Saṃyukta Āgama (Taisho 99).

Thanks to Sutta Central for the Chinese versions of the Samyukta Agama suttas, both those that were left untranslated and those that have been translated.


I personally don’t mind. I’m still confident that human translators today are better than machines, but who knows, one day we humans may be able to create a Universal Translator like the one used in the Star Trek universe. I wouldn’t object to it at all. :slight_smile: