Machine translations of the CBETA corpus: Discussion on H-Buddhism

I guess I’d like to see examples of this thing side by side with human translations, to see how it actually stacks up and, more importantly, what big mistakes it makes and what its limitations are.

My first impression is that this won’t be able to make scholarly acceptable translations, but that it could help in that task. In that sense, it’s like how we make use of online dictionaries, but on a whole other level.

Also, I see @Vimala was part of the team who worked on this; it would be nice to have their take on it.


Thank you all for your feedback. It is important that such discussions are being held.

As the instigator of this project I just wish to share my reasons, but I also wish to say that I firmly believe that human translations can never be replaced by a machine. A machine can merely provide a tool, a supplemental technology, for the human translator.

Machines simply cannot do translators’ jobs as well. Of course it is impressive how much AI can already do and what it will be able to do soon. But, particularly when it comes to language, humans cannot simply be removed from the equation. A good translation is so much more than the pure transfer of semantics. The necessary fine-tuning is not something that can be done by a machine.

That is not to say that the translation industry should simply ignore the potential of neural networks for translation. Instead, machine translation should be seen as an auxiliary technology which can be implemented into modern workflows. After all, there is much more to it than simply “churning” a text through a machine and then using the initial result unchanged.

My reason for initiating this project was that I needed to have translations of Chinese texts that were not previously translated into English for my research into transgender ordination. Without the combination of BuddhaNexus and DeepL I would not have been able to identify and interpret the relevant passages of the corpus that were useful for my research. My interpretations were subsequently checked by an Associate Professor in Buddhist Studies, who made adjustments.

The current model we have for the CBETA texts is just a first step and needs a lot of improvement; much more training is needed. This is our next focus, and we hope that in the future we will be able to provide a useful tool that can aid human translators in their work. We would welcome the help of scholars in the field to improve the model with their expertise.

For some more background into what machine translations are and what they are not:

I hope this helps to clarify the idea behind the project.

As I am currently conducting a retreat and have further engagements, I will not be online much for the next few weeks.


Thanks Venerable, it seems that people’s concerns in this thread and in the H-Net thread have just as much to do with Marcus Bingenheimer’s cavalier statements (“We will be gleaners and cleaners following behind the translating machines.”) as with the actual output of the AI translator.

But it seems other people in the project are not exactly of the opinion that this technology is going to replace human translators (or reduce their work to a mere secondary job of cleaning up after the machine translation). One would hope anyways.

And this is exactly the positive work that this tech makes possible.

Also FWIW I am wondering whether integrating this with Bilara would be a useful thing; make the ML translations available for Chinese as suggestions.

But back to my much-interrupted thread:

I want to discuss ethical consequences.

One issue that has been raised a few times is that of inaccurate machine translations being out in the wild. Again, I think Marcus’ take on this is somewhat intemperate. Fair enough, he’s being provocative and throwing some ideas around. But I looked into machine translation when I was starting SC, fifteen years ago. It was bad then and is somewhat better now.

Then there were zero machine translations of Buddhist texts in the wild, and today there are still zero. I’m going to go out on a limb and say that in fifteen years there’ll still be zero. Why? Because it’s bloody hard to get anyone interested in reading suttas even if they are well translated by someone who knows what they are doing. I know, I spend half my life doing it. Almost all of the actually interesting texts have been translated, and remain largely unread. In the Buddhist world at large, who really cares about all those obscure texts tucked away in the Taisho? If they cared, they would have translated them already.

Now, having said this, I’d like to make a proposal for the ethical management of this issue.

The criteria for making machine translations of Buddhist texts available to the general public should be no less stringent than the availability of self-driving cars.

With self-driving cars, there is a clear issue: people will die. There’s no avoiding that. As a general rule, it would be ethically problematic to make them available unless we can show that fewer people will die using self-driving cars than human-driven cars. So we trust regulators to ensure that self-driving cars are not made widely available until we the public can be assured of their safety.

For Buddhists, the sanctity and holiness of their sacred scriptures is unparalleled. We should treat the ethical issues no less seriously than we do self-driving cars, if not more seriously since the very purpose of the scriptures is to support an ethical life.

There is precedent in the world of AI to restrict access to content because of—IMHO—well-grounded concerns for its use. OpenAI’s GPT-3 program is one of the better-known examples. In another recent case, an independent researcher trained an AI using content stripped from 4chan, the vilest pit of misogyny and racism on the web. The bot was predictably horrible, which was the point: to see what would happen. The model was not made publicly available, but nonetheless, many responded by criticizing the very act of creating such a thing. Even the existence of a digital machine for creating hate can be seen as an abomination, something that inherently should not exist.

We could see the machine translations of the Dhamma as the antithesis of this. What happens when we create a machine programmed to emit Dhamma? If the sheer existence of an evil AI is ethically abhorrent, does it not follow that the existence of a virtuous AI is inherently good? Is it a way of training AI to be better, morally?

I’ll return to these issues later on, but for now I just want to make the point that, while in general I am a firm advocate of making everything freely available, in this case I would recommend that the content not be publicly available. It should only be accessible to scholars and researchers on application. As a minimum necessary requirement, wider public accessibility should only be considered if and when there is a consensus of expert opinion that the translations are no less reliable than human translations. This is not an unattainable goal: there are a lot of bad human translations.

To be honest, I’m not personally concerned about the problem of inaccurate translations. People already believe all kinds of nonsense in Buddhism; it will hardly make things worse. But if these models are made available, it may create a reaction against AI within the Buddhist community, and potentially against those who are also working in the sphere of digital texts and translations. It was only recently that there was a proposal in Sri Lanka to ban unauthorized translations. It would be easy to whip up public sentiment against digital colonizers who were appropriating scriptures and creating new texts by AI. Of course that’s not what you are doing. But that is irrelevant. What matters is how people can spin what you are doing.

This could discredit the very foundations of the field, and be used to justify draconian legislation giving control of scriptures to authoritarian governments. This may sound alarmist, but again, look at how governments in Buddhist nations work. Control over the Tipitaka is a core principle of political authority. I’ve been subject to this sort of pressure by people trying to stop a project even though we had the full support of the Sri Lankan President, the Minister for Culture, and the monastic head of Pali studies at a major university.

So I would urge caution, and move slowly in making things publicly available. It’s not just data to throw in a model. It’s sacred scripture.

You think this is the end? I’m just getting started.


A bit off topic here but related to this statement.

As a professional translator, I can tell you for a fact that a translator’s place on the corporate ladder is much lower than an event organiser’s, even though all the latter does is make phone calls to different contractors to ‘do’ or ‘make’ things that are part of an event. :smiley:

To gain a higher status, one has to become a translation teacher. :grin:


I don’t think that’s true. There’s this Niddesa translation mentioned back in March which was made by machine translation. In the thread, you were even open to it being added to SC…


Well no, this is a human-created translation, which used a machine as a rough first draft.

Isn’t that precisely the situation glossed earlier as “gleaners and cleaners following behind the machines”?

One more in my series of notes on AI. I’d like to raise another ethical question, namely energy. Here’s some background.

The ethical problem is simple and unanswerable: how much energy are we justified in using to create such models?

The energy use of AI modelling is driven by the fact that training requires crunching extremely large datasets through highly intensive computation. This is ultimately moving electrons around, which requires energy and generates heat; the same heat that’s coming out of your computer or phone now, just a lot more of it.

When making more advanced systems, the basic method is to use bigger datasets—which in this case is limited, since it requires more human translation as input—or more intensive data-crunching. Enabling, for example, the kind of “context-aware” translation spoken of above, which expands the data vectors beyond a single segment, is orders of magnitude more complex. This is a basic physical constraint. Of course you can do things more efficiently, but the problem is that efficiency gains tend to be linear, while complexity increases exponentially. This is why the field as a whole keeps demanding more and more energy.

Now we can address this with practical solutions:

  • more efficient chips
  • use renewable energy
  • make more efficient systems

Which, fun fact, are the very arguments used by the crypto bros to justify burning the planet so they can play with their toy money. Crypto and AI are similar in that they both aim to replace or enhance conventional technologies with highly energy-intensive computational processes. Of course, crypto uses this to run scams and fleece fools of their money, while AI can, in principle, be used for good things. But still, it should prompt us to consider these ethical questions seriously, and not just fob them off.

There should be a story front and center about these ethical considerations. Is there an ethical justification for applying AI in this context? What is it? How much energy use is too much? What are the criteria that AI projects should consider when evaluating their energy usage?


Thank you all, and especially Bhante Sujato, for the elaborate messages. I do not have the time to answer them all in detail, but I enjoy reading along and seeing what direction the discussion is heading.
Just to give a few more observations from my side:
Regarding the environmental impact: to give a sense of how much energy the training of the Linguae Dharmae model took: a single RTX 3090 with a TDP of 350 W running for 12 hours, i.e. about 4.2 kWh. An electric car uses roughly 0.24 kWh per mile, so training this model is equivalent to driving about 17 miles in an electric car. This is not nothing, but sometimes it is also good to put things into perspective. The pretraining of larger models of the GPT class certainly requires more energy; for example, the training of a Latin BERT model on a Google Cloud TPU with a 300 W TDP took 4 full days. Once such a model for our data is trained, it will remain in use for a few years to come. It’s not like we need to retrain these large models every couple of months, as is the case for other languages – the source material isn’t changing, and the availability of more translations only affects the fine-tuning process, which only requires a few hours every now and then.
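As a rough sanity check on the arithmetic in the previous paragraph (taking the stated figures of 350 W for 12 hours and 0.24 kWh per mile at face value), the comparison works out like this:

```python
# Back-of-the-envelope comparison of one GPU training run vs. electric-car driving.
TDP_KW = 0.350           # RTX 3090 thermal design power, in kilowatts (stated above)
TRAIN_HOURS = 12         # stated training time
EV_KWH_PER_MILE = 0.24   # stated electric-car consumption

energy_kwh = TDP_KW * TRAIN_HOURS      # total energy drawn at full TDP
miles = energy_kwh / EV_KWH_PER_MILE   # equivalent driving distance
print(f"{energy_kwh:.1f} kWh ~ {miles:.1f} miles of EV driving")
# prints: 4.2 kWh ~ 17.5 miles of EV driving
```

This assumes the card runs at its full TDP for the whole run, which is a worst-case estimate; actual draw during training is typically somewhat lower.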
Regarding ethical considerations of the development and availability of machine translations of Buddhist source material both from the research and the religious perspective, I would like to mention another aspect:
I think it’s vital for us as independent researchers and translators to have our own models that are not tied to any agenda (apart from that of gaining knowledge). If we don’t engage in this research now, we might see a political or religious party (with more cash and manpower at hand than we have) publish their machine translations in a few months or years. That will happen anyway; in my eyes it’s just a question of when. I think the best we can do as a community is to develop our own models transparently and make the data available to the public. This might be the best way to take the wind out of the sails of big companies and religious/political institutions who might develop such models with certain agendas in mind, similar to how it worked out with GNU/Linux and the big software companies of the 90s.
I think the interest in machine translation of Buddhist material in East Asia is huge. And I can already see people publishing translations that might be just marginally better than our models at the moment in printed books etc. claiming them to be genuine works. I don’t think it is a process that can be stopped, it is just something our field has to go through. I don’t think there is a big danger to “Buddhism as such”, but it will certainly become more important to educate the community about the possible dangers and pitfalls of just trusting ‘translation x’ that has been found on the internet or in a bookstore.
I think we will see a lot of improvement in the quality of the models in the coming years. Neural machine translation only really kicked off in the mid-2010s and has since overtaken other, more limited machine-translation approaches. The transformer networks introduced in 2017 brought another big increase in quality, as did the pretraining methods applied since 2018. Since 2020 we have seen how pretraining methods can be leveraged to improve the translation quality of low-resource languages (for which only very limited data is available). Linguae Dharmae is one outcome of these developments, but it is still in an extremely rudimentary state. There is so much that can be improved easily.
Currently we have used only about 10%, maybe less, of the available translated data to train the model. We didn’t use any domain-specific pretraining, we didn’t address the problem of punctuation/text segmentation, and we didn’t deal with document-level context. And then, of course, we are hoping that the availability of machine translations will motivate some people to do a bit of post-correction here and there that can then flow back into retraining and improving the models. We already get feedback from that direction, and I also engage in that task in my own free time every now and then since, to me, it feels like a meaningful exercise. In my eyes it is already more time-efficient to post-correct the output of our model than to translate from scratch, at least for the material that I am working on.
I think Marcus was intentionally a bit provocative in the announcement, but at the same time I do believe that this managed to raise important questions for our field and caused a lot of people to discuss this. And, of course, nobody needs to rely on machine translations for their work. So if somebody feels uncomfortable with the idea of becoming a ‘gleaner and cleaner’ of machine translations, one doesn’t have to do that, of course – similar to how bicycles are still in use despite the availability of trains, cars and airplanes.
So I am mildly optimistic about our future here and see it as a good sign that we have lively debates about what is happening here!


Thanks Sebastian, I’m meaning to follow up on what you said, as well as write what I promise will be a positive contribution to the debate!

In the meantime, though, I raised this question today with the rector of the Nagananda Buddhist University, Ven Bodagama Chandima, where I’m currently holding a course. I asked whether he thought a machine translation of suttas from Chinese would be a good idea. He replied with an enthusiastic “yes”, saying that, based on his many years of teaching in Taiwan, we needed more knowledge of Chinese-language texts and sutras. So you have a new fan!


This statement really hits the point. Thailand has three widely known translation versions, all by major monastic institutions, so naturally the general public who are interested in reading the Tipitaka ‘trust’ their translations. Those who can’t make heads or tails of the stylized language used just don’t read the Tipitaka.

As a professional translator, I don’t mind machine translation and find it helps save precious time (I have fewer than 24 hours a day :smiley: ) when translating/typing. Of course, when I use a machine (Microsoft Office Word), I treat it as a kitchen hand who chops and lays out all the ingredients their little brain can figure out for me. It’s up to me, the chef, to decide what to use and how much, to prepare a good dish.

If and when a machine translation of the Tipitaka is available in Thai, the reader has to be made aware that mistakes should be expected.


Dear all,

I apologize for having been rather absent from the discussion, and forgive me for not having had the time to read everything. I just want to share that we have made some decisions that might be of interest with regard to the machine translations, including those of the Pali.

First of all, we will not host the machine translations on BuddhaNexus, and we will also remove the existing Pali translations there. We will make a new website that hosts these machine translations with the specific purpose of allowing scholars to make corrections to the machine’s output. This data can then serve as training data for the model so as to improve future versions. This website will also make it very clear what these “translations” are.

We had discussed this earlier for the Pali too, and I think that would be absolutely great. I would suggest, however, that we make some further headway first before going this route. For the Pali I have some more segmented translations of commentarial texts from our friends at the Theravada Group Moscow, so I would like to incorporate those in the next run of the Pali texts.


SebastianN wrote

One can of course disagree with this approach and I am very much happy to hear opinions that diverge from what we are doing.

I appreciate your approach. Machines have been crucial to me for a sound understanding of the Samyukta Agama. Sometimes Google Translate becomes the saviour when DeepL flounders. I use Yandex too.

You wrote

One thing to consider is that what we are doing here is to develop a linguistically driven, alternative approach to the question of translation of Buddhist texts, its not about cloning human beings or building nuclear weapons, and certainly not about putting well-crafted human made translations of Buddhist texts into question. In that sense, I hope we as the technicians can keep a positive relationship with the translators here, who have created and still are, without any doubt, creating constant merit for the Buddhist community that a machine will never be able to achieve.

On occasion a machine can even correct a human error; SN 12.63 comes to mind. My main focus is on the Samyuktagama, which VBB, in his introduction to the Samyutta Nikaya, has said is the closest to the Buddha.

It is puzzling why the Pali translators omitted some seminal suttas. An omission of a critical sutta at the beginning of a Samyutta, and subtle modifications of the ensuing text can lead to misleading interpretations. Anapana Samyutta comes to mind.

It begins with SA 803/SN 54.1 but omits SA 801. SA 801 (“Easing into breath”) as the introductory sutta would have made a significant difference.

If not for machines I would not have detected such.

Why did the Pali tradition exclude the “Seal of Dhamma” (SA 80), sandwiched between SA 79 (SN 22.9: impermanence) and SA 81 (SN 22.60: Mahali)?

SA 80 stands out in the subtle forcefulness and power of its content. It is a synopsis of the liberating process.
Without the machines, curious people could not investigate which suttas were tampered with over time. A comparison of AN 9.37 and SA 557 presents a case of sutta tampering. SA 556 and SA 558 support SA 557, but Pali compilers left these untranslated.

I was able to read the untranslated Chinese suttas, thanks to the work of dedicated technicians like you who enable machines to echo the voice/spirit of the Buddha. Thanks to cdpatton’s contributions. Yinshun's Reconstruction of the Chinese Saṃyukta Āgama (Taisho 99)

Thanks to SuttaCentral for hosting the Chinese versions of the Samyukta Agama suttas, both those that have been translated and those left untranslated.


I personally don’t mind. I’m still confident that human translators today are still better than machines, but who knows one day we humans may be able to create a Universal Translator as used in the Star Trek universe. I wouldn’t object to it at all. :slight_smile:


UPDATE: The repository has been moved to:

(but the old link will redirect you too)


Dear suttacentral community,

Thank you very much for all the discussions and comments on this topic. I’m Till, one of the people who has been working on this topic at DeepL, and I have been one of its main drivers out of personal interest.

I’d like to clarify a few points about this project and our intentions, but first I’d like to refer you to the disclaimer we published the translations with:

About these translations

This text was translated by DeepL using an experimental model that was trained as an internal, non-commercial research project. Given the sparsity of the training material and the diverse nature of the Taisho and Shinsan corpus, the model might not always meet DeepL’s usual quality standards. Please take extreme caution when interpreting the results. We did not translate sentences with more than 400 characters as the resulting translations wouldn’t be reliable at all.

Personally, I think that machine translation can be very useful if used correctly. It could be a good start to a higher-quality translation, or just give people an idea of what these texts are about. We are trying to see if we can tackle this problem from DeepL’s side and use our knowledge and infrastructure to do something good as a company. This is not and will never be a commercial project.

We decided to put our translations on github so that there would be a starting point. That starting point is not great from a quality perspective. This is why we put the disclaimer above. It seems to me that this disclaimer didn’t get as much attention as the announcement on h-net.

You might ask why we put this translation on github in the first place: Vimala and I discussed publishing it, collecting corrected translations, and re-training to improve the quality. This might lead us to a more reliable model.

Putting the translation out there did indeed stir things up: more training material has emerged and we are working on improving the model. Others have joined in, and hopefully this will improve things even more. So from our point of view it was actually a success – not in the sense that we have perfect translations, but in that we have made progress towards better translations.

The focus of this project is to use our translation models to provide the community with helpful data for deciphering these texts. We are already working on collecting multiple translations per source sentence, and alignments between the Chinese characters and the English words, to gain additional insight into the original texts.

So I can only invite everyone here to provide feedback in the form of corrected translations so that we can improve the translations. As the corpus is not homogeneous, this will be a very interesting challenge - but not impossible.

I’d like to state that we as DeepL employees didn’t write the announcement on H-Buddhism. Personally, I don’t know what will happen in the future or what the relationship between machine translation and human translators will look like. The project might be a dead end or lead to great insight. Only time will tell.



Thanks Till! This is a fascinating project!!

Hi @twestermann,

Welcome to the D&D forum!

Enjoy the multiple resources here available: may these be of assistance along the path.

Should you have any questions related to the forum, feel free to contact the @moderators.

With Metta,
On behalf of the moderators

@twestermann or @Vimala, I am looking to check out the parallels on SuttaCentral in the machine translations. For example, I am looking at MN 35, which says there are 2 parallels, given as Tii715a28 and Tii035a17; however, the machine files are listed in a format like T01n0001_010. How do the 2 formats relate? Thanks for any help you can give!!

Addendum: Analayo gives SĀ 110 at T 99, 35a–37b and EĀ 37.10 at T 125, 715a–717b, which is different again?