TML: translating all Pāli texts into English

Vimala · July 30, 2021, 11:36am

When working on my paper re. transgender-ordination I made use of several channels of getting information. One of them was DeepL translator for Chinese. This did not do half bad, esp. for the later texts, and my interpretations were checked by an associate professor in Buddhist Studies in the USA. I came in contact with the people behind DeepL and they expressed an interest in working with me to improve their translator to incorporate Buddhist Chinese.

From this interaction grew the idea to start the TML (i.e. Transformer Machine Learning model) project to translate all Pāli texts to English. This means that we are using artificial intelligence to train a translation engine that can translate all the texts that so far have not been translated by either Bhante Sujato or Ajahn Brahmali. For DeepL this project would not be profitable enough, considering the relatively small number of people interested in Pāli translations, and therefore I started working on it myself together with my colleague Sebastian Nehrdich, using the Bilara data.

The Beta prototype of this is now online on BuddhaNexus.net. Go to any Pāli text and click on the English view-mode. This displays the original text together with a machine translation in English created by the Transformer machine learning model. Additionally, an English translation by Bhante @Sujato (Suttas) or Ajahn @Brahmali (Vinaya), if this exists, is also offered via the settings menu. Click on any segment to show the matching segment in the other columns. These are also the same segment numbers as used in Bilara and SuttaCentral.

As is to be expected, the translations for Suttas/Vinaya texts are nearly identical to the original ones by Bhante Sujato and Ajahn Brahmali, with only minor differences. The system is not so good with verses.

One major problem we ran into is that Pāli has developed over a rather long period of time. The EBTs use an older form of the language than later texts and also the texts we have in prose are rather different than those of other genres like the Abhidhamma. Therefore translations for these texts cannot be relied upon as actual correct translations. We hope that when more Bilara translations become available for the Abhidhamma and Commentaries we can train the TML to learn more and give better results in the long run.

Bodhipaksa · July 30, 2021, 11:53am

The quality of the machine translation is truly astonishing. Thank you for this important work.

Vimala · July 30, 2021, 12:06pm

Ha … don’t look at the commentaries … !

Bodhipaksa · July 30, 2021, 12:13pm

I confess that I never do!

Snowbird · July 30, 2021, 12:56pm

Sadhu sadhu! I’m keen to be able to read the background stories we can find in the commentaries. I guess until Bilara has these kinds of translations, we won’t see much progress on this site?

Is there a place to report problems in general with the site? (Not translation errors) For example this is what I see:

https://buddhanexus.net/pli/english/atk-s0401a13
Granted, I’m on a small laptop, but the sutta viewer shows, as you can see, only about three lines of text.

Vimala · July 31, 2021, 7:41am

Because of the large amount of data on the site it is usually best used with large screens. But if you want to make the viewport a bit bigger, click on the top right ‘screen’ icon to go to full-screen mode.
We have been struggling a bit with space and it is indeed not ideal.

And yes, any bugs please report to me in some way. Here is fine. Much appreciated.

Snowbird · July 31, 2021, 8:08am

Thanks for all your work on this! I know design can be really tricky. FWIW, I’m using a 13.5" screen, which while small, is not out of the ordinary. Esp for itinerant monastics!

Looks like you have several frames within frames. That might make it more difficult. Understandable. If you could work it out, I think having the heading hide on scroll down, like SC, would help. Or maybe have the full screen really be completely full screen and remove all the top matter. Just some thoughts. Another option could be to change the view radio buttons to a dropdown menu.

Vimala · July 31, 2021, 11:26am

It is not frames but web components nested in each other.
Personally I would like to have less info in the top header and put more in a side-menu but I’m unfortunately not the person in charge of such design changes! I will however pass on your comments in our next meeting.

Gabriel_L · July 31, 2021, 2:59pm

This is great. And I really hope it will evolve and enhance over time.

I do like and use Deepl sometimes to remind myself of how exactly something I read or write in English would be expressed in my mother tongue Portuguese!

If you want, please feel free to use the segmented Portuguese translations so far done in SuttaCentral to kickstart the Portuguese version of this!

sujato · July 31, 2021, 10:48pm

This is truly amazing Ayya, it’s an incredible resource.

Translation quality notwithstanding, this can be a great help for students, who can scan the text to find the passage they need, before looking more closely at the Pali.

I’m guessing, also, that a relatively small quantity of commentarial translations in segmented form might improve the output considerably.

I’m wondering whether we could integrate this with Bilara? Segment the commentarial texts, then run the ML, then give the results as suggestion prompts for translators. As translation proceeds, the “proper” translations would feed back into the system, improving quality of ML output.

Also, I can’t recall if I’ve asked this before, but have you thought of using GPT-3?

Vimala · August 1, 2021, 7:29am

The output is on Github in a format that should integrate with Bilara without major changes. You might however want different file/segment names and I’m happy to make changes accordingly and ensure that BN and SC also stay lined up. The whole structure of the pali files on BuddhaNexus is made to align with Bilara.

https://github.com/BuddhaNexus/segmented-pali/tree/master/outputfiles_TML_v1

That’s already done in the same repo. But like I said you might want different file/segment names.

A couple of files that are a bit more different from the training data for testing should do; maybe 1-2 longer translations of the commentaries. Then we can evaluate the model performance on this data and see in what direction we have to change the parameters. I think 1000 or 2000 segments would already be good for evaluation.
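
As a rough illustration of what such an evaluation could look like, here is a minimal Python sketch that compares machine output with reference translations segment by segment. It is not the metric used for the TML; it uses a simple word-overlap ratio from the standard library, and the function names and the assumption that both dictionaries share the same segment keys are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(machine: str, reference: str) -> float:
    """Word-level similarity between two translations, from 0.0 to 1.0."""
    return SequenceMatcher(
        None, machine.lower().split(), reference.lower().split()
    ).ratio()


def mean_similarity(machine: dict, reference: dict) -> float:
    """Average similarity over the segment keys present in both dicts.

    Assumption: keys have already been aligned (same naming in both files).
    """
    shared = sorted(machine.keys() & reference.keys())
    if not shared:
        return 0.0
    return sum(similarity(machine[k], reference[k]) for k in shared) / len(shared)
```

A low average on a held-out commentary file would suggest the model has not generalised beyond the style of its training data; a proper evaluation would likely use an established metric instead of this overlap ratio.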

Sebastian tells me that in theory it is an option but in practice not that straightforward. It’s difficult to use these high-resource models without expensive hardware and it’s also not easy to determine whether they will really improve the quality of the translation system. He tried to use the XLM-100 model for our purposes and it didn’t work out.

Thanks so much @Gabriel_L ! That would be great but we would need a lot of translations to get anything sensible. Actually the whole of what Bhante Sujato and Ajahn Brahmali have translated is already very limited for a system like this. In translation programs like DeepL they use hundreds of thousands of segments.

Vimala · August 1, 2021, 10:03am

Just to clarify the details: the GitHub repo BuddhaNexus/segmented-pali (“Segmented files for pali. These are the input files for the Buddhanexus neural network”) holds all the input files in segmented form in the directory inputfiles, and these use the exact same filenames as the VRI here: CSCD Tipitaka (Roman)

So for the Brahmajālasuttavaṇṇanā, the first part of the Aṭṭhakathā suttas, the original filename is s0101a.att1.xml.

<p rend="chapter">1. Brahmajālasuttavaṇṇanā</p>

<p rend="subhead">Paribbājakakathāvaṇṇanā</p>

<p rend="bodytext">Imissā <pb ed="P" n="1.0026" /><pb ed="V" n="1.0027" /><pb ed="M" n="1.0027" /> paṭhamamahāsaṅgītiyā vattamānāya vinayasaṅgahāvasāne suttantapiṭake ādinikāyassa ādisuttaṃ brahmajālaṃ pucchantena āyasmatā mahākassapena – ‘‘brahmajālaṃ, āvuso ānanda, kattha bhāsita’’nti, evamādivuttavacanapariyosāne yattha ca bhāsitaṃ, yañcārabbha bhāsitaṃ, taṃ sabbaṃ pakāsento āyasmā ānando evaṃ me sutantiādimāha. Tena vuttaṃ ‘‘brahmajālassāpi evaṃ me sutantiādikaṃ āyasmatā ānandena paṭhamamahāsaṅgītikāle vuttaṃ nidānamādī’’ti.</p>

<p rend="bodytext" n="1"><hi rend="paranum">1</hi><hi rend="dot">.</hi> Tattha <hi rend="bold">eva</hi>nti nipātapadaṃ. <hi rend="bold">Me</hi>tiādīni nāmapadāni. <hi rend="bold">Paṭipanno hotī</hi>ti ettha <hi rend="bold">paṭī</hi>ti upasaggapadaṃ, <hi rend="bold">hotī</hi>ti ākhyātapadanti. Iminā tāva nayena padavibhāgo veditabbo.</p>
etc.

The filename in inputfiles is then atk-s0101a1_root-pli-ms.json and in it the segments follow the XML structure of the original files. Segment numbers are consecutive. The numbers like <pb ed="P" n="1.0026" /><pb ed="V" n="1.0027" /><pb ed="M" n="1.0027" /> have not been used but can be easily extracted in any way that is desirable.

  "atk-s0101a1:0": "1. Brahmajālasuttavaṇṇanā",
  "atk-s0101a1:1": "Paribbājakakathāvaṇṇanā",
  "atk-s0101a1:2": "Imissā paṭhamamahāsaṅgītiyā vattamānāya vinayasaṅgahāvasāne suttantapiṭake ādinikāyassa ādisuttaṁ brahmajālaṁ pucchantena āyasmatā mahākassapena – ‘‘brahmajālaṁ, āvuso ānanda, kattha bhāsita’’nti, evamādivuttavacanapariyosāne yattha ca bhāsitaṁ, yañcārabbha bhāsitaṁ, taṁ sabbaṁ pakāsento āyasmā ānando evaṁ me sutantiādimāha. Tena vuttaṁ ‘‘brahmajālassāpi evaṁ me sutantiādikaṁ āyasmatā ānandena paṭhamamahāsaṅgītikāle vuttaṁ nidānamādī’’ti.",
  "atk-s0101a1:3": "1. Tattha evanti nipātapadaṁ. Metiādīni nāmapadāni. Paṭipanno hotīti ettha paṭīti upasaggapadaṁ, hotīti ākhyātapadanti. Iminā tāva nayena padavibhāgo veditabbo.",
etc.
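
To make the mapping from the XML to the segment file concrete, here is a minimal Python sketch. It is not the actual BuddhaNexus conversion script; the function name is hypothetical, and it only covers what is visible in the samples above: one segment per `<p>` element, inline tags such as `<pb>` and `<hi>` dropped, consecutive segment numbers, and ṃ written as ṁ.

```python
import unicodedata
import xml.etree.ElementTree as ET


def segment_paragraphs(xml_fragment: str, file_id: str) -> dict:
    """Turn each <p> element into one numbered segment (hypothetical helper)."""
    # Wrap the fragment so that several sibling <p> elements parse as one tree.
    root = ET.fromstring("<body>" + xml_fragment + "</body>")
    segments = {}
    for number, paragraph in enumerate(root.iter("p")):
        # itertext() keeps the text inside <hi> and the text after <pb />,
        # while the tags themselves (and their attributes) are discarded.
        text = "".join(paragraph.itertext())
        text = " ".join(text.split())  # collapse the gaps left by removed tags
        text = unicodedata.normalize("NFC", text)
        # The samples show underdot m (U+1E43) replaced by overdot m (U+1E41).
        text = text.replace("\u1e43", "\u1e41").replace("\u1e42", "\u1e40")
        segments[f"{file_id}:{number}"] = text
    return segments
```

Called with the XML above and the id `atk-s0101a1`, this would produce keys of the form `atk-s0101a1:0`, `atk-s0101a1:1`, and so on. The page numbers in the `<pb>` attributes are simply ignored here, as in the existing files.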

Because for the TML these segments are too long, they are split in the directory inputfiles_cut_segments_for_TML_v1:

  "atk-s0101a1:0_0": "1. Brahmajālasuttavaṇṇanā",
  "atk-s0101a1:1_0": "Paribbājakakathāvaṇṇanā",
  "atk-s0101a1:2_0": "Imissā paṭhamamahāsaṅgītiyā vattamānāya vinayasaṅgahāvasāne suttantapiṭake ādinikāyassa ādisuttaṁ brahmajālaṁ pucchantena āyasmatā mahākassapena –",
  "atk-s0101a1:2_1": "‘‘brahmajālaṁ, āvuso ānanda, kattha bhāsita’’nti, evamādivuttavacanapariyosāne yattha ca bhāsitaṁ, yañcārabbha bhāsitaṁ, taṁ sabbaṁ pakāsento āyasmā ānando evaṁ me sutantiādimāha.",
  "atk-s0101a1:2_2": "Tena vuttaṁ ‘‘brahmajālassāpi evaṁ me sutantiādikaṁ āyasmatā ānandena paṭhamamahāsaṅgītikāle vuttaṁ nidānamādī’’ti.",
  "atk-s0101a1:3_0": "1.",
  "atk-s0101a1:3_1": "Tattha evanti nipātapadaṁ.",
  "atk-s0101a1:3_2": "Metiādīni nāmapadāni.",
  "atk-s0101a1:3_3": "Paṭipanno hotīti ettha paṭīti upasaggapadaṁ, hotīti ākhyātapadanti.",
  "atk-s0101a1:3_4": "Iminā tāva nayena padavibhāgo veditabbo.",
etc.
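
The splitting shown above can be approximated with a short Python sketch. This is a guess at the rule based on the samples only, not the script that produced the files: it cuts after a full stop or an en dash, and it leaves short segments (such as headings) whole. The function name and the length threshold are hypothetical.

```python
import re

# Hypothetical threshold: segments this short are kept as a single piece,
# which is why the heading "1. Brahmajālasuttavaṇṇanā" is not cut after "1.".
MAX_LEN = 40


def cut_segment(key: str, text: str, max_len: int = MAX_LEN) -> dict:
    """Split one segment into sub-segments keyed as key_0, key_1, ..."""
    if len(text) <= max_len:
        parts = [text]
    else:
        # Cut at whitespace that directly follows a full stop or an en dash.
        parts = [p for p in re.split(r"(?<=[.–])\s+", text) if p]
    return {f"{key}_{index}": part for index, part in enumerate(parts)}
```

Because the sub-segment index is appended after an underscore, the original segment number can always be recovered from the key.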

The TML output files use that same structure in the directory outputfiles_TML_v1 with filename ai-atk-s0101a1.json:

    "ai-atk-s0101a1:0_0": "1. Brahmajālasuttavaṇṇanā",
    "ai-atk-s0101a1:1_0": "Talk on Wanderers",
    "ai-atk-s0101a1:2_0": "Because of this exposition of the teaching on the Great Wood, in accordance with the Monastic Law, and properly resolved.",
    "ai-atk-s0101a1:2_1": "‘Reverend Ānanda, the Brahmā realm is spoken to by me! That’s what I said.’",
    "ai-atk-s0101a1:2_2": "That is what I said.’",
    "ai-atk-s0101a1:3_0": "1.",
    "ai-atk-s0101a1:3_1": "and the doing of the performing of all states.",
    "ai-atk-s0101a1:3_2": "my baby is called ‘the step.’",
    "ai-atk-s0101a1:3_3": "And here they’re practicing to win in this way.",
    "ai-atk-s0101a1:3_4": "This is the extent of this penetration.",
etc.
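
If the output is to be shown next to the uncut segments (for example as suggestions in Bilara), the sub-segments have to be joined again. Here is a minimal Python sketch of that step, assuming only the key pattern visible above (`ai-` prefix, then the segment key, then `_` and the sub-segment index); the function name is hypothetical and the repo may handle this differently.

```python
import re
from collections import defaultdict

# Matches keys such as "ai-atk-s0101a1:3_2": segment "atk-s0101a1:3", part 2.
SUBSEGMENT_KEY = re.compile(r"^ai-(?P<segment>.+)_(?P<part>\d+)$")


def regroup(tml_output: dict) -> dict:
    """Join machine-translated sub-segments back into whole segments."""
    grouped = defaultdict(list)
    for key, text in tml_output.items():
        match = SUBSEGMENT_KEY.match(key)
        if match is None:
            continue  # skip anything that does not follow the key pattern
        grouped[match.group("segment")].append((int(match.group("part")), text))
    # Sort numerically by part index so that _10 comes after _9, then join.
    return {
        segment: " ".join(text for _, text in sorted(parts))
        for segment, parts in grouped.items()
    }
```

The resulting keys (`atk-s0101a1:3` and so on) then line up with the keys in the inputfiles directory.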

(Note that headings are not translated.)
But like I said, the translation leaves much to be desired.
One improvement I see is that certain code is still in the inputtext for the TML and taking that out will no doubt improve the results.

UPDATE: I made a new directory inputfiles_cut_segments_for_Aijato_next_run that removes some of the code that is probably one cause of problems, so this can be used for a next run of the TML. This also takes care of some of the wrong quote marks.

mikenz66 · August 1, 2021, 8:13pm

Ayya, this is amazing.

Can you tell me where to find the “settings” menu? I can’t locate it…

Vimala · August 2, 2021, 5:44am

Top right corner. It is actually a filter-icon because originally it used to be just the filter settings but we have changed that in the next update to a regular cog-icon for settings.

Vimala · August 2, 2021, 8:44am

@Snowbird

Vimala · September 7, 2021, 3:52pm

@Snowbird - you can see some of the new functionality at our staging site for testing. The transliteration buttons will also be moved to the settings menu in a next update. Also the full screen mode is now really full screen. These changes have not yet been approved but we are working on it.

https://buddhanexus2.kc-tbts.uni-hamburg.de/pli/english/dn1

Snowbird · September 8, 2021, 3:53am

Sadhu sadhu!
This is what I see from the link you sent:

It looks fairly similar to how it did before.
However the full screen view is much better.

Vic · September 8, 2021, 10:32pm

@Vimala, Ayya, this is incredible. Was the AI algorithm trained on translations from Sutta Central? It’s uncanny that the Nikaya translations are almost identical to those by Bhante Sujato.
What a great resource!

mikenz66 · September 8, 2021, 10:57pm

@Vimala, Ayya, there was a recent post on DhammaWheel from someone who has aligned the Pali and English text of the Visuddhimagga:
https://www.dhammawheel.com/viewtopic.php?f=23&t=41133
I suggested they contact you and/or others about putting it into Bilara format, but I don’t think they have been back in the last couple of days. Presumably with that your system could get some training in Commentarial Pali.

Here is the start of the Visuddhimagga, with Ñāṇamoli’s translation: История происхождения и прочее (“History of the origin, etc.”)
and here is the machine translation: Buddhanexus

Vimala · September 9, 2021, 8:21am

I still don’t like the margins in the normal mode so will make some changes to the calculations for that too. Unfortunately not straightforward css on that. At least in the full-screen mode you can work.

Yes, it was. The system learns to speak like Bhante Sujato and Ajahn Brahmali.
But I’m not happy with the commentaries and we are working on that. I’m glad to hear you like it!

Wow … that’s incredible! Thanks for pointing it out to me. I will certainly have a good look at that one. That is absolutely very useful for training the system!