Machine translations of the CBETA corpus: Discussion on H-Buddhism

Snowbird · June 16, 2022, 3:55pm

https://networks.h-net.org/node/6060/discussions/10365735/machine-translations-cbeta-corpus

Thought folks might be interested in this post and discussion. If there is a good thread to merge it into, feel free.

cdpatton · June 16, 2022, 5:29pm

There are so many issues with this. For instance, the Chinese translation of DA is itself corrupted in many passages (mainly, it has many lacunae). Does an AI translator have the ability to assess such passages? I also have no idea how a single translation model can translate DA on one hand and Xuanzang’s Abhidharma texts on the other. Not to mention native Chinese exegesis, which is more properly classical Chinese. The grammars and vocabularies of these types of texts are very different at times. They are like different dialects of Chinese.

Also, the amount of material that they have published on GitHub is mind-boggling. It’s orders of magnitude larger than the Theravada Buddhist canon. We’re talking about millions and millions of words of English. I’d estimate that just the Agamas section of the Taisho, which is two volumes of the collection, is over two million words of English. They’ve published a translation of 56 volumes of the Taisho, plus some additional material. They can’t possibly have adequately assessed how well the translator performed on that much heterogeneous material. Bingenheimer takes a few quotes from DA and from a biographical collection, but that doesn’t represent their output. It represents a few fairly simple sentences plucked out of two texts.

AI translators aren’t magic or intelligent. They certainly don’t know what language is or how meaning is arrived at. But humans do. The algorithms will get better as better English translations are created by humans who know what they are doing. In that regard, it’s a little disturbing to me to continue translating knowing that my work will be scraped and put to this kind of use. I started the project with human readers in mind, not expecting that my work would be put into a software blender with other translations or tweaked by an algorithm to produce something new. It’s all very strange to me. It’s as though Western civilization is yearning to reach a point at which no human intelligence need be brought to bear on any task, and no business should have to account for any human labor of any kind.

In the case of translations of ancient texts, one of the more pernicious things to me relates to the fact that language is a matter of habit. Words are associated with meaning culturally. I.e., humans understand language however the culture they belong to assigns meaning to words. The only objective basis for judging whether we understand words correctly is how most people understand them (or how a particular person does, in the case of translation). Thus, language naturally evolves and changes over time. Not only do meanings associated with words evolve, but the sounds used for words evolve, too. Scholars who study ancient languages become intimately familiar with this; it’s like linguistic paleontology.

So, what happens when we allow unintelligent algorithms into this process of human language evolution? For living languages, I think it has its effects mainly in how culture evolves. Algorithms are already inserting themselves into how we interact and become aware of media and the information media delivers to us. We’ve watched political and other forms of public discourse degenerate as algorithms quietly shepherd us into deeply emotional and delusional beliefs.

For ancient languages, we may have a more subtle problem. Dead languages are relics of history. The people who spoke them have disappeared into the mists of time, so they cannot help correct our misunderstandings of their words and meanings. We can read the texts they left behind, but we cannot talk to them. We understand them as best we can, using human intelligence, and that isn’t a perfect understanding. There are plenty of places where we guess or arbitrarily decide what was meant by a word or expression.

But what happens when we insert machine produced translations into that process? As a tool to help a human translator save time, they can be helpful, if the human translator is fluent in a given language. If the human translator or reader isn’t fluent, the algorithm will train the human, rather than vice versa. If algorithms become participants in creating meaning for the general public, I think we will have problems. At least, seen from an objective point of view. From a subjective point of view, people are quite happy to misunderstand as long as they don’t know about it.

Khemarato.bhikkhu · June 16, 2022, 9:40pm

“The singularity”… Deus ex machina…

To take on the project of properly translating Buddhism to the West is to sign up for something that can’t be finished in a lifetime. The materialistic (yolo) philosophy and the frantic pace of “late capitalism” make such a timeline literally unimaginable to many. Naturally, you’ll grasp for desperate measures if you feel time is “running out.”

sujato · June 17, 2022, 1:29am

Charles, that is a spectacularly well thought-out response, and I agree completely. I think there is a role for AI machinations, for example in finding relationships between text corpora (a la Buddhanexus), but for a scriptural tradition, human translators are essential. I’ll write more later, but I just feel the need to reinforce: why are the humanities so bent on eliminating the human?

seniya · June 17, 2022, 2:14am

Just a little comment here: I cannot read Chinese, but using AI translation (DeepL translator) and Buddhist Chinese dictionaries (e.g., the SC Chinese lookup tool, which uses the DBD database, and NTI Reader), I can “translate” some of the Chinese Agama texts, although not as well as an expert translator like Charles Patton does. I think machine translation is a good start, but it should be corrected by a human translator. In my case, I consult the Pali parallel of the Chinese passage (if any) and the modern Chinese translation of the text.

Suvira · June 17, 2022, 4:13am

“We will be gleaners and cleaners following behind the translating machines.”

No, never. Not for the Buddha word.

Khemarato.bhikkhu · June 17, 2022, 5:39am

If by this they mean e.g. Bhante @sujato and Bhante Analayo then yes, absolutely. They truly are translating machines and us mortals can hardly keep up with them!

On a more serious note, a friend of mine is working on an untranslated Mahayana commentary, and suggested that these machine translations may be useful when searching for references: he can more easily think up synonyms and skim the results in English than in Chinese. But yes, most of that benefit is there in the digital tools we have already, so it’s quite unclear what this accomplishes beyond generating some press (pun intended)

sujato · June 17, 2022, 10:32am

I have a little more time to respond more meaningfully.

There are a few issues to consider. I am snatching a few stolen moments to write on this, so expect a few more comments before I’m done!

On the simple side, obviously this is an initial pass and things will be expected to improve. That much is clear. But I think Marcus is being naive when he says that the models have limited understanding of context so far. There’s no reason to expect that they ever will. Technologies tend to evolve in a predictable way: rapid improvement, then a long plateau.

I’m sure Ayya @Vimala will see this thread! Ayya, maybe you could invite Marcus and other project folks to join the discussion here.

sujato · June 17, 2022, 11:06am

Some historical context: in the history of machine translation, programmers have been promising to deliver context-aware machine translations since the field began in the early 1950s. We’re still waiting.

SebastianN · June 17, 2022, 12:24pm

Hey, it’s Sebastian writing here, who was/is to some extent involved in the development of these models.
First of all, regarding “context awareness”, I think we need to make clear what we are talking about. To be specific, in the announcement we are referring to models that are able to access previous/following sentences/paragraphs to infer knowledge for the translation of the current sentence. This is something that is being successfully addressed in machine translation research under the heading ‘document level translation’. In the last five years we have seen a lot of activity when it comes to context-aware language models (BERT and the like) and these have been a true revolution in NLP research. Context-aware models and large-scale pretraining on text corpora have boosted the accuracy of language models for different tasks significantly.
Currently, our model is not able to use context beyond the current sentence but we are working on that. It has been done for other languages, so let’s see what we can achieve here. It might be that Bhante Sujato is referring to something else here, so I am happy to learn more.
Regarding the general scepticism and critique raised by Charles Patton, I can understand most of that. It needs to be emphasized that this is an early model; it’s not even close to reaching any level of quality at which it would be ethically correct to offer it to people who do not have any direct access to the source languages themselves. If there is any misunderstanding about that, my apologies for us not making it clear enough.
I think one should not think about machine translation systems as means to ‘replace’ human beings but more as tools to make the life of human translators easier. Similarly, the appearance of computers and pocket calculators didn’t replace the discipline of math as such.
Our main goal with the publication of this material is to draw people that are working on various ends of the Chinese corpus into what could be a potential collaboration to jointly improve the data and to further develop the models. I do believe that an open collaborative attitude could be very beneficial for the creation of what might develop into an ML-assisted ‘community translation’ of the Buddhist canon, much like Wikipedia managed to develop into a quite usable resource. Of course, such an approach will never be able to meet academic standards or even the standards of certain individual translators but that might, after all, not be a problem if one from the outset is not aiming for that.
One can of course disagree with this approach and I am very happy to hear opinions that diverge from what we are doing. One thing to consider is that what we are doing here is developing a linguistically driven, alternative approach to the question of translation of Buddhist texts; it’s not about cloning human beings or building nuclear weapons, and certainly not about putting well-crafted human-made translations of Buddhist texts into question. In that sense, I hope we as the technicians can keep a positive relationship with the translators here, who have created and still are, without any doubt, creating constant merit for the Buddhist community that a machine will never be able to achieve.

sujato · June 17, 2022, 12:26pm

Hey Sebastian, thanks for the comments. Perhaps the greatest value of your work will, in fact, be the interest and discussion it generates. I have some more things to say, but am very short of time, so I’ll comment when I get the chance.

Also, just so as to ramp up the tension, I actually have a positive thesis to make, and a suggestion for your team. But I’m going to take my time to get there!

sujato · June 17, 2022, 12:56pm

I was thinking more in line with what Charles was talking about, the extended contexts of time and place and culture and humanity, in addition to the purely linguistic features of the text. Doubtless it will be possible to expand the context of the text to some degree, but I’m skeptical as to how far. Still, that is an empirical question, so please prove me wrong!

A related point is the role of the translator, which both Marcus and Charles hinted at. Certainly being a proofreader for a machine doesn’t sound that exciting.

It’s probably not something that would have affected my work too much, as my translations have mostly been for texts that are already well-translated, so I can refer to the excellent work of prior scholars where needed. Charles is obviously in a different situation. I have only a small idea of the complexities involved; I’ve just been reading Nyanatusita’s translation of the Vimuttimagga, and he spells out a lot of the difficulties, which go beyond merely linguistic to things like understanding of the history of Abhidhamma, or comparing with the passages in Tibetan, and so on.

Now, I don’t want to repeat what others have said here, but I would like to simply draw attention to the question of, well, attention. What are we paying attention to? As Charles pointed out, the corpus translated is so vast that there is no reasonable possibility that humans could review it all. And more will be made. So whose job is this? How much time should we devote to checking the output of neural nets? Is this what we choose to do with our time?

There’s a similar issue in programming, as we can now have programs that output programs. These have all the bugs that you’d expect from such a process, but they can be quite astonishing. At the end of the day, though, you can’t just roll this stuff into production: a programmer needs to review it.

I for one would love to see Marcus making some more translations of his own. When I spoke with him years ago on the topic, he explained how translation is not valued in academia, so to do translation work is akin to professional suicide. I wonder to what extent this is implicitly guiding the choices we’re making? Are we internalizing the devaluation of translation? Is this something where we will simply have to accept that it is the way it is?

Sorry, have to go now!

Khemarato.bhikkhu · June 17, 2022, 1:43pm

I suspect for precisely the same reason that translation and commentary are valued by the religious traditions: translation is ideally an expression of humility

Javier · June 17, 2022, 6:42pm

I guess I’d like to see examples of this thing side by side with human translations to see how it actually stacks up and, more importantly, what big mistakes it makes and what its limitations are.

My first impression is that this won’t be able to make scholarly acceptable translations, but that it could help in that task. In that sense, it’s like how we make use of online dictionaries, but on a whole other level.

Also, I see @Vimala was part of the team who worked on this; it would be nice to have their take on this.

Vimala · June 17, 2022, 8:06pm

Thank you all for your feedback. It is important that such discussions are being held.

As the instigator of this project I just wish to share my reasons, but I also wish to say that I firmly believe that human translations can never be replaced by a machine. Machines can merely provide a tool, a supplemental technology, for the human translator.

Machines cannot do translators’ jobs just as well. Of course it is impressive how much AI can already do and what it will be able to do soon. But, in particular when it comes to language, humans cannot simply be removed from the equation. A good translation is so much more than the pure transfer of semantics. The necessary fine-tuning is not something which can be done by a machine.

That is not to say that the translation industry should simply ignore the potential of neural networks for translation. Instead, machine translation should be seen as an auxiliary technology which can be implemented into modern workflows. After all, there is much more to it than simply “churning” a text through a machine and then using the initial result unchanged.

My reason for initiating this project was that I needed to have translations of Chinese texts that were not previously translated into English for my research into transgender ordination. Without the combination of BuddhaNexus and DeepL I would not have been able to identify and interpret the relevant passages of the corpus that were useful for my research. My interpretations were subsequently checked by an Associate Professor in Buddhist Studies, who made adjustments.

The current model we have for the CBETA texts is just a first step and needs a lot of improvement; much more training is needed. This is our next focus and we hope that in the future we will be able to provide a useful tool that can aid human translators in their work. We would welcome the help of scholars in the field to improve the model with their expertise.

For some more background into what machine translations are and what they are not:

I hope this helps to clarify the idea behind the project.

As I am currently conducting a retreat and have further engagements I will not be online much for the next weeks.

Javier · June 17, 2022, 8:39pm

Thanks Venerable, it seems that people’s concerns in this thread and in the Hnet thread have just as much to do with Marcus Bingenheimer’s cavalier statements (“We will be gleaners and cleaners following behind the translating machines.”) as with the actual output of the AI translator.

But it seems other people in the project are not exactly of the opinion that this technology is going to replace human translators (or reduce their work to a mere secondary job of cleaning up after the machine translation). One would hope anyways.

sujato · June 18, 2022, 1:34am

And this is exactly the positive work that this tech makes possible.

Also FWIW I am wondering whether integrating this with Bilara would be a useful thing; make the ML translations available for Chinese as suggestions.

But back to my much-interrupted thread:

I want to discuss ethical consequences.

One that has been raised a few times is the issue with inaccurate machine translations being in the wild. Again, I think Marcus’ take on this is somewhat intemperate. Fair enough, he’s being provocative and throwing some ideas around. But I looked into machine translation when I was starting SC, fifteen years ago. It was bad then and is somewhat better now.

Then there were zero machine translations of Buddhist texts in the wild, and today there are still zero. I’m going to go out on a limb and say that in fifteen years there’ll still be zero. Why? Because it’s bloody hard to get anyone interested in reading suttas even if they are well translated by someone who knows what they are doing. I know, I spend half my life doing it. Almost all of the actually interesting texts have been translated, and remain largely unread. In the Buddhist world at large, who really cares about all those obscure texts tucked away in the Taisho? If they cared, they would have translated them already.

Now, having said this, I’d like to make a proposal for the ethical management of this issue.

The criteria for making machine translations of Buddhist texts available to the general public should be no less stringent than those for making self-driving cars available.

With self-driving cars, there is a clear issue: people will die. There’s no avoiding that. As a general rule, it would be ethically problematic to make them available unless we can show that fewer people will die using self-driving cars than human-driven cars. So we trust regulators to ensure that self-driving cars are not made widely available until we the public can be assured of their safety.

For Buddhists, the sanctity and holiness of their sacred scriptures is unparalleled. We should treat the ethical issues no less seriously than we do self-driving cars, if not more seriously since the very purpose of the scriptures is to support an ethical life.

There is precedent in the world of AI to restrict access to content because of—IMHO—well-grounded concerns for its use. OpenAI’s GPT-3 program is one of the better known examples. In another recent case, an independent researcher trained an AI using content stripped from 4chan, the vilest pit of misogyny and racism on the web. The bot was predictably horrible, which was the point: to see what would happen. The model was not made publicly available, but nonetheless, many responded by criticizing the very act of creating such a thing. Even the existence of a digital machine for creating hate can be seen as an abomination, something that inherently should not exist.

We could see the machine translations of the Dhamma as the antithesis of this. What happens when we create a machine programmed to emit Dhamma? If the sheer existence of an evil AI is ethically abhorrent, does it not follow that the existence of a virtuous AI is inherently good? Is it a way of training AI to be better, morally?

I’ll return to these issues later on, but for now I just want to make the point that, while in general I am a firm advocate of making everything freely available, in this case I would recommend that the content not be publicly available. It should only be accessible to scholars and researchers on application. As a minimum necessary requirement, wider public accessibility should only be considered if and when there is a consensus of expert opinion that the translations are no less reliable than human translations. This is not an unattainable goal: there are a lot of bad human translations.

To be honest, I’m not personally concerned about the problem of inaccurate translations. People already believe all kinds of nonsense in Buddhism; it will hardly make things worse. But if these models are made available, it may create a reaction against AI within the Buddhist community. And potentially, against those who are also working in the sphere of digital texts and translations. It was only recently that there was a proposal in Sri Lanka to ban unauthorized translations. It would be easy to whip up public sentiment against digital colonizers who were appropriating scriptures and creating new texts by AI. Of course that’s not what you are doing. But that is irrelevant. What matters is how people can spin what you are doing.

This could discredit the very foundations of the field, and be used to justify draconian legislation giving control of scriptures to authoritarian governments. This may sound alarmist, but again, look at how governments in Buddhist nations work. Control over the Tipitaka is a core principle of political authority. I’ve been subject to this sort of pressure by people trying to stop a project even though we had the full support of the Sri Lankan President, the Minister for Culture, and the monastic head of Pali studies at a major university.

So I would urge caution, and move slowly in making things publicly available. It’s not just data to throw in a model. It’s sacred scripture.

You think this is the end? I’m just getting started.

Dheerayupa · June 18, 2022, 2:52am

A bit off topic here but related to this statement.

As a professional translator, I can tell you for a fact that a translator’s professional level on the corporate ladder is much lower than an event organiser’s, though all the latter does is make phone calls to different contractors to ‘do’ or ‘make’ things that are part of an event.

To gain a higher status, one has to become a translation teacher.

Khemarato.bhikkhu · June 18, 2022, 4:25am

I don’t think that’s true. There’s this Niddesa translation mentioned back in March which was made by machine translation. In the thread, you were even open to it being added to SC…

sujato · June 18, 2022, 5:30am

Well no, this is a human-created translation, which used a machine as a rough first draft.