Machine translations of the CBETA corpus: Discussion on H-Buddhism

Hey, it’s Sebastian writing here, who was/is to some extent involved in the development of these models.
First of all, regarding “context awareness”, I think we need to make clear what we are talking about. In the announcement, we are referring to models that are able to access previous/following sentences/paragraphs to infer knowledge for the translation of the current sentence. This is something that is being successfully addressed in machine translation research under the heading ‘document-level translation’. In the last five years we have seen a lot of activity when it comes to context-aware language models (BERT and the like), and these have been a true revolution in NLP research. Context-aware models and large-scale pretraining on text corpora have significantly boosted the accuracy of language models for different tasks.
Currently, our model is not able to use context beyond the current sentence, but we are working on that. It has been done for other languages, so let’s see what we can achieve here. It might be that Bhante Sujato is referring to something else here, so I am happy to learn more.
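To make the idea of "context beyond the current sentence" a bit more concrete, here is a minimal sketch of the input side only: for each sentence, collect its neighbouring sentences so they could be handed to a model together with the sentence being translated. This is an illustration under assumptions, not how our model or any particular system is implemented; the names `ContextWindow` and `context_window` and the window sizes are hypothetical.

```python
from typing import List, NamedTuple


class ContextWindow(NamedTuple):
    """The sentence to translate plus its neighbours (hypothetical structure)."""
    previous: List[str]
    current: str
    following: List[str]


def context_window(sentences: List[str], index: int,
                   before: int = 1, after: int = 1) -> ContextWindow:
    """Return the sentence at `index` with up to `before`/`after` neighbours."""
    if not 0 <= index < len(sentences):
        raise IndexError("sentence index out of range")
    return ContextWindow(
        previous=sentences[max(0, index - before):index],
        current=sentences[index],
        following=sentences[index + 1:index + 1 + after],
    )


# Placeholder sentences stand in for segments of a source text.
doc = ["s1", "s2", "s3", "s4"]
window = context_window(doc, 2, before=2, after=1)
print(window.previous, window.current, window.following)
```

A sentence-level model corresponds to `before=0, after=0`. How a document-level model actually consumes the extra sentences is a separate question that this sketch does not address.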
Regarding the general scepticism and critique raised by Charles Patton, I can understand most of that. It needs to be emphasized that this is an early model; it’s not even close to reaching a level of quality at which it would be ethically correct to offer it to people who do not have any direct access to the source languages themselves. If there is any misunderstanding about that, my apologies for us not making it clear enough.
I think one should not think about machine translation systems as a means to ‘replace’ human beings but rather as tools to make the life of human translators easier. Similarly, the appearance of computers and pocket calculators didn’t replace the discipline of math as such.
Our main goal with the publication of this material is to draw people that are working on various ends of the Chinese corpus into what could be a potential collaboration to jointly improve the data and to further develop the models. I do believe that an open collaborative attitude could be very beneficial for the creation of what might develop into an ML-assisted ‘community translation’ of the Buddhist canon, much like Wikipedia managed to develop into a quite usable resource. Of course, such an approach will never be able to meet academic standards or even the standards of certain individual translators but that might, after all, not be a problem if one from the outset is not aiming for that.
One can of course disagree with this approach and I am very much happy to hear opinions that diverge from what we are doing. One thing to consider is that what we are doing here is to develop a linguistically driven, alternative approach to the question of translation of Buddhist texts; it’s not about cloning human beings or building nuclear weapons, and certainly not about putting well-crafted human-made translations of Buddhist texts into question. In that sense, I hope we as the technicians can keep a positive relationship with the translators here, who have created and still are, without any doubt, creating constant merit for the Buddhist community that a machine will never be able to achieve.


Hey Sebastian, thanks for the comments. Perhaps the greatest value of your work will, in fact, be the interest and discussion it generates. I have some more things to say, but am very short of time, so I’ll comment when I get the chance.

Also, just so as to ramp up the tension, I actually have a positive thesis to make, and a suggestion for your team. But I’m going to take my time to get there!


I was thinking more in line with what Charles was talking about, the extended contexts of time and place and culture and humanity, in addition to the purely linguistic features of the text. Doubtless it will be possible to expand the context of the text to some degree, but I’m skeptical as to how far. Still, that is an empirical question, so please prove me wrong!

A related point is the role of the translator, which both Marcus and Charles hinted at. Certainly being a proofreader for a machine doesn’t sound that exciting.

It’s probably not something that would have affected my work too much, as my translations have mostly been for texts that are already well-translated, so I can refer to the excellent work of prior scholars where needed. Charles is obviously in a different situation. I have only a small idea of the complexities involved; I’ve just been reading Nyanatusita’s translation of the Vimuttimagga, and he spells out a lot of the difficulties, which go beyond the merely linguistic to things like understanding the history of Abhidhamma, or comparing with the passages in Tibetan, and so on.

Now, I don’t want to repeat what others have said here, but I would like to simply draw attention to the question of, well, attention. What are we paying attention to? As Charles pointed out, the corpus translated is so vast that there is no reasonable possibility that humans could review it all. And more will be made. So whose job is this? How much time should we devote to checking the output of neural nets? Is this what we choose to do with our time?

There’s a similar issue in programming, as we can now have programs that output programs. These have all the bugs that you’d expect from such a process, but they can be quite astonishing. At the end of the day, though, you can’t just roll this stuff into production: a programmer needs to review it.

I for one would love to see Marcus making some more translations of his own. When I spoke with him years ago on the topic, he explained how translation is not valued in academia, so to do translation work is akin to professional suicide. I wonder to what extent this is implicitly guiding the choices we’re making? Are we internalizing the devaluation of translation? Is this something where we will simply have to accept that it is the way it is?

Sorry, have to go now!


I suspect for precisely the same reason that translation and commentary are valued by the religious traditions: translation is ideally an expression of humility


I guess I’d like to see examples of this thing side by side with human translations and see how it actually stacks up and more importantly, what are the big mistakes it makes and its limitations.

My first impression is that this won’t be able to make scholarly acceptable translations, but that it could help in that task. In that sense, it’s like how we make use of online dictionaries, but on a whole other level.

Also, I see @Vimala was part of the team who worked on this, it would be nice to have their take on this.


Thank you all for your feedback. It is important that such discussions are being held.

As the instigator of this project I just wish to share my reasons, but I also wish to say that I firmly believe that human translations can never be replaced by a machine. They can merely provide a tool, a supplemental technology, for the human translator.

Machines cannot do translators’ jobs just as well. Of course it is impressive how much AI can already do and what it will be able to do soon. But, in particular when it comes to language, humans cannot simply be removed from the equation. A good translation is so much more than the pure transfer of semantics. The necessary fine-tuning is not something which can be done by a machine.

That is not to say that the translation industry should simply ignore the potential of neural networks for translation. Instead, machine translation should be seen as an auxiliary technology which can be implemented into modern workflows. After all, there is much more to it than simply “churning” a text through a machine and then using the initial result unchanged.

My reason for initiating this project was that I needed to have translations of Chinese texts that were not previously translated into English for my research into transgender ordination. Without the combination of BuddhaNexus and DeepL I would not have been able to identify and interpret the relevant passages of the corpus that were useful for my research. My interpretations were subsequently checked by an Associate Professor in Buddhist Studies, who made adjustments.

The current model we have for the CBETA texts is just a first step and needs a lot of improvement; much more training is needed. This is our next focus and we hope that in the future we will be able to provide a useful tool that can aid human translators in their work. We would welcome the help of scholars in the field to improve the model with their expertise.

For some more background into what machine translations are and what they are not:

I hope this helps to clarify the idea behind the project.

As I am currently conducting a retreat and have further engagements I will not be online much for the next weeks.


Thanks Venerable, it seems that people’s concerns in this thread and in the Hnet thread have just as much to do with Marcus Bingenheimer’s cavalier statements (“We will be gleaners and cleaners following behind the translating machines.”) as with the actual output of the AI translator.

But it seems other people in the project are not exactly of the opinion that this technology is going to replace human translators (or reduce their work to a mere secondary job of cleaning up after the machine translation). One would hope anyways.

And this is exactly the positive work that this tech makes possible.

Also FWIW I am wondering whether integrating this with Bilara would be a useful thing; make the ML translations available for Chinese as suggestions.

But back to my much-interrupted thread:

I want to discuss ethical consequences.

One that has been raised a few times is the issue with inaccurate machine translations being in the wild. Again, I think Marcus’ take on this is somewhat intemperate. Fair enough, he’s being provocative and throwing some ideas around. But I looked into machine translation when I was starting SC, fifteen years ago. It was bad then and is somewhat better now.

Then there were zero machine translations of Buddhist texts in the wild, and today there are still zero. I’m going to go out on a limb and say that in fifteen years there’ll still be zero. Why? Because it’s bloody hard to get anyone interested to read suttas even if they are well translated by someone who knows what they are doing. I know, I spend half my life doing it. Almost all of the actually interesting texts have been translated, and remain largely unread. In the Buddhist world at large, who really cares about all those obscure texts tucked away in the Taisho? If they cared, they would have translated them already.

Now, having said this, I’d like to make a proposal for the ethical management of this issue.

The criteria for making machine translations of Buddhist texts available to the general public should be no less stringent than the availability of self-driving cars.

With self-driving cars, there is a clear issue: people will die. There’s no avoiding that. As a general rule, it would be ethically problematic to make them available unless we can show that fewer people will die using self-driving cars than human-driven cars. So we trust regulators to ensure that self-driving cars are not made widely available until we the public can be assured of their safety.

For Buddhists, the sanctity and holiness of their sacred scriptures is unparalleled. We should treat the ethical issues no less seriously than we do self-driving cars, if not more seriously since the very purpose of the scriptures is to support an ethical life.

There is precedent in the world of AI to restrict access to content because of, IMHO, well-grounded concerns for its use. OpenAI’s GPT-3 program is one of the better known examples. In another recent case, an independent researcher trained an AI using content stripped from 4chan, the vilest pit of misogyny and racism on the web. The bot was predictably horrible, which was the point: to see what would happen. The model was not made publicly available, but nonetheless, many responded by criticizing the very act of creating such a thing. Even the existence of a digital machine for creating hate can be seen as an abomination, something that inherently should not exist.

We could see the machine translations of the Dhamma as the antithesis of this. What happens when we create a machine programmed to emit Dhamma? If the sheer existence of an evil AI is ethically abhorrent, does it not follow that the existence of a virtuous AI is inherently good? Is it a way of training AI to be better, morally?

I’ll return to these issues later on, but for now I just want to make the point that, while in general I am a firm advocate of making everything freely available, in this case I would recommend that the content not be publicly available. It should only be accessible to scholars and researchers on application. As a minimum necessary requirement, wider public accessibility should only be considered if and when there is a consensus of expert opinion that the translations are no less reliable than human translations. This is not an unattainable goal: there are a lot of bad human translations.

To be honest, I’m not personally concerned about the problem of inaccurate translations. People already believe all kinds of nonsense in Buddhism, it will hardly make things worse. But if these models are made available, it may create a reaction against AI within the Buddhist community. And potentially, against those who are also working in the sphere of digital texts and translations. It was only recently that there was a proposal in Sri Lanka to ban unauthorized translations. It would be easy to whip up public sentiment against digital colonizers who were appropriating scriptures and creating new texts by AI. Of course that’s not what you are doing. But that is irrelevant. What matters is how people can spin what you are doing.

This could discredit the very foundations of the field, and be used to justify draconian legislation giving control of scriptures to authoritarian governments. This may sound alarmist, but again, look at how governments in Buddhist nations work. Control over the Tipitaka is a core principle of political authority. I’ve been subject to this sort of pressure by people trying to stop a project even though we had the full support of the Sri Lankan President, the Minister for Culture, and the monastic head of Pali studies at a major university.

So I would urge caution, and move slowly in making things publicly available. It’s not just data to throw in a model. It’s sacred scripture.

You think this is the end? I’m just getting started.


A bit off topic here but related to this statement.

As a professional translator, I can tell you for a fact that a translator’s professional level on the corporate ladder is much lower than an event organiser’s, though all the latter does is make phone calls to different contractors to ‘do’ or ‘make’ things that are part of an event. :smiley:

To gain a higher status, one has to become a translation teacher. :grin:


I don’t think that’s true. There’s this Niddesa translation mentioned back in March which was made by machine translation. In the thread, you were even open to it being added to SC…


Well no, this is a human-created translation, which used a machine as a rough first draft.

Isn’t that precisely the situation glossed earlier as “gleaners and cleaners following behind the machines”?

One more in my series of notes on AI. I’d like to raise another ethical question, namely energy. Here’s some background.

The ethical problem is simple and unanswerable: how much energy are we justified in using to create such models?

The energy use of AI modelling is driven by the fact that they have to create extremely large datasets through highly intensive number crunching. This is ultimately moving electrons around, which requires energy and generates heat; the same heat that’s coming out of your computer or phone now, just a lot more of it.

When making more advanced systems, the basic method is to use bigger datasets—which in this case is limited, since it requires more human translation as input—or more intensive data-crunching. To enable, for example, the kind of “context-aware” translation spoken of above, which expands the data vectors beyond a single segment, is orders of magnitude more complex. This is a basic physical constraint. Of course you can do things more efficiently, but the problem is that efficiency gains tend to be linear, while complexity increases exponentially. This is why the field as a whole keeps demanding more and more energy.

Now we can address this with practical solutions:

  • more efficient chips
  • use renewable energy
  • make more efficient systems

Which, fun fact, are the very arguments used by the crypto bros to justify burning the planet so they can play with their toy money. Crypto and AI are similar in that they both aim to replace or enhance conventional technologies with highly energy-intensive computational processes. Of course, crypto uses this to run scams and fleece fools of their money, while AI can, in principle, be used for good things. But still, it should prompt us to consider these ethical questions seriously, and not just fob them off.

There should be a story front and center about these ethical considerations. Is there an ethical justification for applying AI in this context? What is it? How much energy use is too much? What are the criteria that AI projects should consider when evaluating their energy usage?


Thank you all and especially Bhante Sujato for the elaborate messages. I do not have the time to answer them all in detail, but I enjoy reading along here and seeing what direction the discussion is heading in.
Just to give a few more observations from my side:
Regarding the environmental impact: to give a sense of how much energy the training of the Linguae Dharmae model took: 1x 3090 with a TDP of 350 watts running for 12 hours, which comes to at most about 4.2 kWh. An electric car uses about 0.24 kWh/mile, so training this model is roughly equal to driving 17 miles in an electric car. This is not nothing, but sometimes it is also good to put things into perspective. The pretraining of larger models of the GPT class requires more energy, certainly. For example, the training of a Latin BERT model on a Google Cloud TPU with a 300 W TDP took 4 full days. Once such a model for our data is trained, it will remain in use for a few years to come. It’s not like we need to retrain these large models every couple of months, as is the case for other languages – the source material isn’t changing, and the availability of more translations only affects the fine-tuning process, which only requires a few hours every now and then.
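For anyone who wants to check the arithmetic, here is a small back-of-the-envelope sketch. It assumes the device draws its full TDP for the whole run (so it is an upper bound for the chip itself) and ignores the rest of the machine, cooling and so on. The function names are made up for illustration; the wattages, durations and the 0.24 kWh/mile figure are the ones given above. With those inputs the single-GPU run comes out at about 4.2 kWh, or about 17.5 miles, which is in the same range as the mileage quoted above.

```python
def training_energy_kwh(tdp_watts: float, hours: float) -> float:
    """Upper-bound energy in kWh, assuming constant draw at full TDP."""
    return tdp_watts * hours / 1000.0


def equivalent_ev_miles(energy_kwh: float, kwh_per_mile: float = 0.24) -> float:
    """Distance an electric car could drive on the same energy."""
    return energy_kwh / kwh_per_mile


# Figures from the post: one 350 W GPU for 12 hours,
# and a 300 W TPU for 4 full days (Latin BERT example).
linguae_dharmae_kwh = training_energy_kwh(350, 12)
latin_bert_kwh = training_energy_kwh(300, 4 * 24)

for name, kwh in [("Linguae Dharmae", linguae_dharmae_kwh),
                  ("Latin BERT", latin_bert_kwh)]:
    print(f"{name}: {kwh:.1f} kWh, about {equivalent_ev_miles(kwh):.1f} EV miles")
```

Actual consumption depends on real utilisation and on everything around the accelerator, so these numbers are only meant to give an order of magnitude.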
Regarding ethical considerations of the development and availability of machine translations of Buddhist source material both from the research and the religious perspective, I would like to mention another aspect:
I think it’s vital for us as independent researchers and translators to have our own models that are not tied to any agenda (apart from that of gaining knowledge). If we don’t engage in this research now, we might see a political or religious party (with more cash and manpower at hand than we have) publish their machine translations in a few months or years. That will happen anyway; in my eyes it’s just a question of when. I think the best we can do as a community is to develop our own models transparently and make the data available to the public. This might be the best way to take the wind out of the sails of big companies and religious/political institutions who might develop such models with certain agendas in mind, similar to how it worked out with GNU/Linux and the big software companies of the 90s.
I think the interest in machine translation of Buddhist material in East Asia is huge. And I can already see people publishing translations that might be just marginally better than our models at the moment in printed books etc. claiming them to be genuine works. I don’t think it is a process that can be stopped, it is just something our field has to go through. I don’t think there is a big danger to “Buddhism as such”, but it will certainly become more important to educate the community about the possible dangers and pitfalls of just trusting ‘translation x’ that has been found on the internet or in a bookstore.
I think we will see a lot of improvement in the quality of the models in the coming years. Neural machine translation only really kicked off in the mid-2010s and has since overtaken other, more limited machine translation approaches. The transformer networks introduced in 2017 brought another big increase in quality, as did the pretraining methods applied since 2018. Since 2020 we have seen how pretraining methods can be leveraged to improve the translation quality of low-resource languages (for which only very limited data is available). Linguae Dharmae is one outcome of these developments, but it is still in an extremely rudimentary state. There is so much which can be improved easily.
Currently we only used about 10%, maybe less than that, of the available translated data to train the model. We didn’t use any domain-specific pretraining, we didn’t address the problem of punctuation/text segmentation, we didn’t deal with document-level context. And then, of course, we are hoping that the availability of machine translations will motivate some people to do a bit of post-correction here and there that can then flow back into retraining and improving the models. We already get feedback from that direction and I also engage in that task in my own free time every now and then since, to me, it feels like a meaningful exercise. In my eyes it is already more time-efficient to post-correct the output of our model than to translate from scratch, at least for the material that I am working on.
I think Marcus was intentionally a bit provocative in the announcement, but at the same time I do believe that this managed to raise the important questions for our field and caused a lot of people to discuss this. And, of course, nobody needs to rely on machine translations for their work. So if somebody feels uncomfortable with the idea of just becoming a ‘gleaner and cleaner’ of machine translations, they don’t have to do that. Similarly, bicycles are still in use despite the availability of trains, cars and airplanes.
So I am mildly optimistic about our future here and see it as a good sign that we have lively debates about what is happening here!


Thanks Sebastian, I’m meaning to follow up on what you said, as well as writing what I promise will be a positive contribution to the debate!

In the meantime, though, I raised this question today with the rector of the Nagananda Buddhist University, where I’m currently holding a course, Ven Bodagama Chandima. I asked whether he thought a machine translation of suttas from Chinese would be a good idea. He replied with an enthusiastic “yes”, saying that, based on his many years of teaching in Taiwan, we needed more knowledge of Chinese-language texts and sutras. So you have a new fan!


This statement really hits the point. Thailand has three widely known translation versions, all by major monastic institutes, so naturally the general public who are interested in reading the Tipitaka ‘trust’ their translations. Those who can’t make heads or tails of the stylish language used just don’t read the Tipitaka.

As a professional translator, I don’t mind machine translation and find it helps save precious time (I have fewer than 24 hours a day :smiley: ) to translate/type. Of course, when I use a machine (Microsoft Office Word), I treat it as a kitchen hand who chops and places all ingredients that their little brain can figure out for me. It’s up to me, the chef, to decide what to use and how much to prepare a good dish.

If and when the machine translation for the Tipitaka is available in Thai, the reader has to be made aware that mistakes should be expected.


Dear all,

I apologize I have been very absent from the discussion and forgive me for not having the time to read everything. I just want to share here that we made some decisions that might be of interest with regards to the machine translations, also those of the Pali.

First of all we will not host the machine translations on BuddhaNexus and also remove the Pali translations there. We will make a new website that hosts these machine translations with the specific purpose to allow scholars to make corrections to the machine’s output translations. This data can then serve as training data for the model so as to improve future versions. This website will also make it very clear what these “translations” are.

We had discussed this earlier for the Pali also and I think that would be absolutely great. I would suggest, however, that we make some further headway first before going this route. For the Pali I have some more segmented translations of commentarial texts from our friends at the Theravada Group Moscow, so I would like to incorporate those in the next run of the Pali texts.


SebastianN wrote

One can of course disagree with this approach and I am very much happy to hear opinions that diverge from what we are doing.

I appreciate your approach. Machines have been crucial to me for a sound understanding of Samyukta Agama. Sometimes Google translator becomes the saviour when DeepL flounders. I use Yandex too.

You wrote

One thing to consider is that what we are doing here is to develop a linguistically driven, alternative approach to the question of translation of Buddhist texts; it’s not about cloning human beings or building nuclear weapons, and certainly not about putting well-crafted human-made translations of Buddhist texts into question. In that sense, I hope we as the technicians can keep a positive relationship with the translators here, who have created and still are, without any doubt, creating constant merit for the Buddhist community that a machine will never be able to achieve.

On occasion a machine can correct a human error, SN 12.63 comes to mind. My main focus is on Samyuktagama, which VBB has said is the closest to the Buddha, in his introduction to Samyutta Nikaya.

It is puzzling why the Pali translators omitted some seminal suttas. An omission of a critical sutta at the beginning of a Samyutta, and subtle modifications of the ensuing text can lead to misleading interpretations. Anapana Samyutta comes to mind.

It begins with SA 803/SN 54.1, but omits SA 801. SA 801 (Easing into breath), as the introductory sutta, would have made a significant difference.

If not for machines I would not have detected such.

Why did the Pali tradition exclude the “Seal of Dhamma” SA 80, sandwiched between SA 79 (SN 22.9: impermanence) and SA 81 (SN 22.60: Mahali)?

SA 80 stands out in the subtle forcefulness and power of its content. It is a synopsis of the liberating process.
Without the machines, curious people could not investigate which suttas were tampered with over time. A comparison of AN 9.37 and SA 557 presents a case of sutta tampering. SA 556 and SA 558 support SA 557, but Pali compilers left these untranslated.

I was able to read the untranslated Chinese suttas, thanks to the work of dedicated technicians like you who enable machines to echo the voice/spirit of the Buddha. Thanks to cdpatton’s contributions. Yinshun's Reconstruction of the Chinese Saṃyukta Āgama (Taisho 99)

Thanks to Sutta Central for the Chinese versions of the suttas of Samyukta Agama, which were left untranslated, and translated.


I personally don’t mind. I’m still confident that human translators today are still better than machines, but who knows one day we humans may be able to create a Universal Translator as used in the Star Trek universe. I wouldn’t object to it at all. :slight_smile:


UPDATE: The repository has been moved to:

(but the old link will redirect you too)