Machine translations of the CBETA corpus: Discussion on H-Buddhism

@josephzizys If you are interested in parallels, you can also use BuddhaNexus.net to find more.
Unfortunately, it currently only works within one language at a time, but we are working on making it multilingual, i.e. so that you can directly search for Pāli <–> Chinese parallels. Right now you still have to find the Pāli parallels and the Chinese parallels separately.

I want to make more use of BuddhaNexus, but it's a bit opaque to me. Is there a tutorial you could recommend?

Lack of consistency is indeed a problem of the first generation of DeepL/Linguae Dharmae translations, which were trained on extremely small amounts of data and also use small context windows for translation. We have made significant progress on several fronts in recent months. We will see when and how we can make the results available in the future.
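To make the consistency problem more concrete, here is a minimal, purely illustrative Python sketch. The segments, the glossary term, and the candidate English renderings are all invented for this example; the point is only that when each segment is translated independently with a small context window, the same source term can easily end up rendered several different ways across a text, and a simple check can flag that:

```python
from collections import defaultdict

# Invented example data: each segment was translated on its own, so the
# model had no memory of how it rendered a term in earlier segments.
segments = [
    {"source": "照見五蘊皆空", "target": "he saw that the five aggregates are all empty"},
    {"source": "五蘊無常", "target": "the five skandhas are impermanent"},
    {"source": "五蘊熾盛苦", "target": "the suffering of the blazing five heaps"},
]

# Known candidate English renderings for each source term (also invented).
glossary = {"五蘊": ["five aggregates", "five skandhas", "five heaps"]}

def find_inconsistent_renderings(segments, glossary):
    """For each source term, collect which candidate renderings actually
    appear in the targets of the segments containing that term."""
    used = defaultdict(set)
    for seg in segments:
        for term, candidates in glossary.items():
            if term not in seg["source"]:
                continue
            for cand in candidates:
                if cand in seg["target"]:
                    used[term].add(cand)
    # A term is flagged when the segment-by-segment translation has settled
    # on more than one rendering for it.
    return {term: sorted(r) for term, r in used.items() if len(r) > 1}

if __name__ == "__main__":
    for term, renderings in find_inconsistent_renderings(segments, glossary).items():
        print(f"{term} is rendered in {len(renderings)} different ways: {renderings}")
```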

I see two main purposes of this kind of (imperfect) translation system:

  1. To facilitate learning of the language (a learning aid), lowering the barrier for people who want to work with Buddhist primary sources, since less time is needed to acquire the language.
  2. To support the work of experienced translators by functioning as a kind of “dictionary++”: instead of just presenting all possible translations of the individual words, it makes reasonable first choices and comes up with a draft translation that the user can accept or reject (see the sketch after this list).
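As a rough illustration of the “dictionary++” idea in point 2, here is a hypothetical Python sketch. The glosses and their ranking scores are invented for this example and do not describe any existing system; the sketch only shows the workflow of proposing one draft choice per word while keeping the alternatives available for a human to override:

```python
# Invented glosses with invented ranking scores, purely to illustrate the
# "dictionary++" workflow; this is not how any existing system is built.
GLOSSES = {
    "諸": [("all", 0.9), ("the various", 0.6)],
    "行": [("conditioned things", 0.8), ("practices", 0.5), ("goes", 0.2)],
    "無常": [("are impermanent", 0.9), ("impermanence", 0.7)],
}

def draft_translation(tokens, glosses):
    """Pick the highest-scoring gloss for each token as a first draft,
    while keeping the alternatives so a human can override any choice."""
    draft = []
    for tok in tokens:
        candidates = sorted(glosses.get(tok, [(tok, 0.0)]),
                            key=lambda pair: pair[1], reverse=True)
        draft.append({
            "token": tok,
            "choice": candidates[0][0],
            "alternatives": [gloss for gloss, _ in candidates[1:]],
        })
    return draft

if __name__ == "__main__":
    for item in draft_translation(["諸", "行", "無常"], GLOSSES):
        alts = ", ".join(item["alternatives"]) or "none"
        print(f'{item["token"]}: "{item["choice"]}" (alternatives: {alts})')
```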

I also think that machine translations can help one skim through a text faster and get a first rough impression of what a passage or text is about, especially when it lies in a domain I am not familiar with. I agree that the quality of the uploaded translations is not yet high enough to serve as a basis for creating translations, but this might change in the coming years.

Thank you very much for your elaborations, Till! I find it very exciting that you and other people working at DeepL have taken an interest in this material and are putting your knowledge and resources into making these translations. We as academic researchers (currently without institutional or financial backing) have to make do with far fewer resources, but I still hope that we can make some lasting contributions to the problem. I think these will mostly be in the field of aggregating training and evaluation data, as well as coming up with a set of “best practices” for approaching this special material, including recommendations on pretraining objectives, transfer learning, and other methodological issues. I am aware that once the big players (Google, Meta, Microsoft, or you at DeepL) turn their attention to this material and incorporate it into their pretraining routines, our efforts will be overshadowed. That is actually a good thing to happen eventually, and I see it as our task to establish resources that will (hopefully) be used in large language model training in the future.
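As one possible reading of “aggregating training and evaluation data”, here is a minimal Python sketch of keeping aligned sentence pairs in a plain, model-agnostic format with a held-out evaluation split. The example pairs, file names, and field names are my own assumptions for illustration, not a description of any existing dataset:

```python
import json
import random

# Hypothetical aligned pairs; in practice these would come from existing
# human translations of the source texts.
pairs = [
    {"source_lang": "zh", "target_lang": "en",
     "source": "諸行無常", "target": "All conditioned things are impermanent."},
    {"source_lang": "zh", "target_lang": "en",
     "source": "諸法無我", "target": "All phenomena are without self."},
    {"source_lang": "zh", "target_lang": "en",
     "source": "涅槃寂靜", "target": "Nirvana is peace."},
]

def split_and_write(pairs, train_path="train.jsonl", eval_path="eval.jsonl",
                    eval_ratio=0.1, seed=42):
    """Shuffle the pairs and hold out a fraction for evaluation,
    writing both splits with one JSON object per line."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_ratio))
    splits = {eval_path: shuffled[:n_eval], train_path: shuffled[n_eval:]}
    for path, rows in splits.items():
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    split_and_write(pairs)
```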

@josephzizys Here are a few tutorials:
