AI-1: Let’s Make SuttaCentral 100% AI-free Forever

Sure! We still work together. I’m actually helping him with Pali pronunciation for his audio book.

Thank goodness!

Well, hopefully I can change your mind on that. They’re relevant because they express the desires of people with vast amounts of money and power who are absolutely bent on realizing those desires.

1 Like

Nice to hear that Meng is still working with you and SC, that’s great.

Generally, machine translation is something that companies don’t make money with, save for perhaps the first 10 high resource languages (French, English, German, Chinese, you know). That’s why DeepL is focussing on these. Providing MT services for all other languages is, believe it or not, more charity than anything else. There was a race between Meta and Google to train the largest open source MT models for the last couple of years, but that culminated with NLLB/MADLAD400 and since then MT is not ‘cool’ as a topic anymore.

In a sense, working on low-resource cases such as Buddhism is always something that doesn’t ‘pay off’; it’s a service to society. I see parallels to the merchants who made the distribution of Buddhism across Asia possible in the very early days. Nowadays it’s Buddhist donors who support tech projects to make the distribution of the teaching possible in the digital world.

2 Likes

Well, maybe.

Here on SC, we have a small enough community that we can afford real human moderators, agree on terms of conduct and foster deeper learning. I feel that many of the problems we have on the internet stem from trying to scale human interaction.

2 Likes

I think there is a productive dialog to be had about the potential negative effects of this technology. The most serious source of information on that front might be what folks in the human-computer-interaction field are finding about the possible harmful effects of machine translation and AI in general: how it affects translators, language as such, etc.

3 Likes

Very good and meaningful parallel. :anjal:

2 Likes

Like this?

EDIT: the preview isn’t showing Q39, “Can I add/create a thread or post which contains AI (Artificial Intelligence) content from popular sources such as ChatGPT?”, but the link will take you there, and here is the item in its entirety:

5 Likes

Well, if they had access to it, that would be possible. But the privileged position of people from a well-off background in Western countries, who can enroll in universities etc. to study Pali, is not comparable to the situation of people in other countries where a systematic Pali curriculum is not established.
The point I want to make is: learning the primary sources in order to engage with the material has so far been the privilege of a (mainly white) aristocratic elite class in Western countries, and people with less privileged status usually don’t survive the academic treadmill long enough. Not everybody can afford the time to learn these languages well enough, but the desire to interact with traditional material is nonetheless there.
I think the biggest positive effect of machine translation will be democratization of access to the material.
That of course lies at odds with both the academic and the broader Buddhist tradition, where gate-keeping of knowledge is an integral part of establishing hierarchy.

Sure, you can do that. I would compare machine translation here to how wonder bread compares to proper bread. Nobody can argue that the former is better than the latter, but wonder bread is certainly better than no bread.

:pray:

5 Likes

@sujato Looks like that is the upside and downside of giving away the data for free as public domain. The best gifts are given without attachment and without looking back; you just have to be okay with however it is used. They have scraped all the data that can be scraped since 2021, so there is not much that can be done. You would have to make a new policy if you wanted to prevent other engines from scraping. However, they might get it right where others get it wrong. It is too late: once public domain, always public domain. You can only re-copyright what is considered artistic interpretation (of your own work). A funny or sad case is what happened to A Course In Miracles.

ACIM consists of three sections: “Text”, “Workbook for Students”, and “Manual for Teachers”. Written from 1965 to 1972, some distribution occurred via photocopies before a hardcover edition was published in 1976 by the Foundation for Inner Peace.[6] The copyright and trademarks, which had been held by two foundations, were revoked in 2004[6] after lengthy litigation because the earliest versions had been circulated without a copyright notice.[7][8]

My guess is that the next phase of AI training is AI verifying its own knowledge.

Now that it has good vision, it will start with computer tasks and programming, actually doing things from its own knowledge and getting verification that things worked. Then it will spread to other areas, if it has not done so already. There are websites, and people posting about, how to tell AI art; surely those will be the best trainers.

While it cannot answer questions on Dhamma so well, or find sources so well, it does know languages and can translate Pali quite well (likely better than you, but less well than a degreed monk from Myanmar or Sri Lanka). It makes mistakes, but they are neither major nor common compared with those of an average Pali-reading person with all the modern tools.

The damage has been done. Hope for the best.

3 Likes

So we’re where we want to be already?

1 Like

I have made attempts to state rational arguments for the mindful use of AI. These statements have caused horror in the Sangha. That is not good. I have stated explicitly that the insights gained are from my looking at multiple Pali dictionaries as a human. They are not from generative AI. Has that been heard? I think not.

No I am not a PT native speaker. I am learning PT so that I can speak PT when I travel to PT this year out of respect for the Portuguese. It makes no sense to me to travel to a foreign country to look at dead things and expect the living people to speak English. I will speak PT to the Portuguese because that is the decent respectful thing to do. And the best way for me to learn the heart of Portuguese is to read the suttas segment by segment comparing PT with EN. Bhante’s EN translation sings in my heart. So when I read the PT translation from DeepL and it doesn’t sing in my heart I worry and fix the translation so that it rings in my heart. That is all I am doing. And it is causing an uproar in the Sangha.

The Sangha is arguing here and it is painful for me to see. I see no choice but to stop my translation efforts. The distinctions we are dealing with are subtle and confusing to many. The difference between simple mechanical translation and the hyped up claims of AI companies is losing itself in the uproar of fear and loathing. It is not good for us to proceed in this way.

The EBT-DeepL library project is at an end. There will be no more EBT-DeepL translations. And yes, Bhante, I am more than content with your translations. If segmented EN is all I have, then that will be quite fine.

:heart:
:pray:

6 Likes

This, for me, is the best distillation yet in Bhante’s threads of the technology’s benefit into the future.

(Granted, I am trying to keep up with the manic pace at which these ideas and thought experiments are flying :airplane:.)

This takes us back to Bhante’s AI 2 theme: What problem are we solving?

At the highest level, we’re both solving a problem & taking advantage of a benefit. (I’m not trying to dehumanize this but it sure sounds like it.)

Problem: privileged access
Benefit: massively increase access for everyone regardless of privilege
Risks: read these threads … they’re all in there
Remediation: pause AI-generated content on SuttaCentral (forever? :thinking:)

The US government wasted hundreds of millions of dollars over about 10 years trying to nail down the energy mix for scaling compute capacity forever (in order to stop the madness and consolidate a gazillion server rooms). It was a good-faith effort, in ways. But it proved to be a futile exercise. It’s not that it’s not possible theoretically. But time and again we (the federal agencies) all came up dry because it’s just too hard to do. I recommend no resources be spent on that data. It totally wears down the data center people.

:pray:t3: :elephant:

1 Like

Thank you; Bhante, I’m on the job. Not because I love high tech, other than the tech of consciousness, but because I have come to believe that one must choose between being domesticated by AI or being the one that domesticates the AI.

2 Likes

So are all the translations on SC currently composed by humans with Right View? Because mistakes are made by humans with Right View, to say nothing of those without Right View.

I see a middle ground in the way I was working with @karl_lew where the index is formed, the translation completed and then at least two human native speakers proofread the translation. Any translation can be proofread and concepts should be/ can be discussed. It can continue to be a group effort. It is not necessarily AI vs human. But maybe AI can produce the basic translation - to make use of technology in order to assist those who encounter the Dhamma in less privileged countries where Buddhism is nascent.

Does it have to be one or the other for SC? Is there a way to work with technology for the sake of others(?) removing our/selves/views, fortunately blessed with resources and privilege, from the equation. This wouldn’t cease human translation and it would possibly give entirely human translations of the Tipitaka more credence.

3 Likes

Thanks for your efforts and help. :anjal:
I did not mean to hurt you and apologise if I did so.

I am also available for PM conversation if you want also to use a Brazilian Portuguese native speaker human being to continue your learning journey.

3 Likes

I don’t think Bhante is against us using AI tools for our own studying… I believe he’s simply against us publishing that work, especially without sufficient warning labels.

2 Likes

Venerable @Khemarato, the 100% rule is 100%. That is the rule.

EBT-DeepL is actually tuned to Venerable @Sujato’s translations specifically. I had intended to make it general purpose, but the task was simply too difficult. To tame DeepL, I had to make it specific to one (1) author. The reason is that Bhante’s coherence in translation is phenomenal, and that very coherence tames DeepL’s randomness, affording DeepL no room for error. I have had to wrestle DeepL segment by segment to get intelligible and consistent output.

The goal here is coherence, not quality. EBT-DeepL ensures coherence. People ensure quality. Coherence guarantees fidelity of translation. The Dhamma is a truly stunning example of error correction and detection across all the languages the Sangha encountered at that time. This coherence is the shield for the Dhamma. Break the coherence, break the Dhamma. With EBT-DeepL providing coherence, humans provide quality by updating the glossaries with correct terms. EBT-DeepL isn’t generating anything. EBT-DeepL’s sole goal is to hammer the randomness out of DeepL, and to do it in a way that lets human translators insert correct Dhamma terms for the time and language at hand.
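To make the glossary idea concrete, here is a minimal sketch of the kind of post-processing pass involved. The variant terms, the glossary contents, and the function name are hypothetical illustrations, not EBT-DeepL’s actual code:

```python
import re

# Hypothetical human-curated glossary: MT renderings -> fixed Dhamma terms.
GLOSSARY = {
    "suffering": "dukkha",
    "unsatisfactoriness": "dukkha",  # two MT variants collapse to one term
    "mindfulness": "sati",
}

def enforce_glossary(segment: str, glossary: dict) -> str:
    """Replace whole-word glossary variants in one translated segment."""
    for variant, term in glossary.items():
        segment = re.sub(rf"\b{re.escape(variant)}\b", term, segment)
    return segment

print(enforce_glossary("the practice of mindfulness reduces suffering", GLOSSARY))
# -> the practice of sati reduces dukkha
```

The point of such a pass is exactly the coherence described above: however the engine happens to render a term on a given segment, the human-maintained glossary gets the last word.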

DeepL is an AI-translator trained on language datasets. EBT-DeepL is an AI-translator trained on Venerable @Sujato. Without Bhante’s blessing, EBT-DeepL should not proceed.

100% is 100%.

“Own study” :thinking:
I suppose that means “without good spiritual friends?”
Is that 100% of the spiritual life?

100% is 100%

Ven. @marajina, I will support your efforts in less privileged countries where Buddhism is nascent. For now, we can put DeepL aside for the coherence of the Sangha, but if your own translations appear in SuttaCentral, we will be able to hear them in SC-Voice in ES. It will be slower and painful, but I will help as I can. Most of the SC-Voice narrators are AI-based, but sc-voice.net is not suttacentral.net, so SuttaCentral is still 100% AI free. Although technically speaking, we could have curated EBT-DeepL ES translations on SC-Voice under your guidance, I think the greater benefit is to have your translations on both SuttaCentral and SC-Voice, which means they have to be 100% AI-Free.

100% is 100%

:pray:

7 Likes

I am a machine learning researcher. I worked on an open source LLM project a year ago.

At the time I spent a lot of time deciding whether democratizing the technology is the correct thing to do: should this technology be available to everyone, or is it really safer if only a few companies have access to it? An LLM run by a company can alert the police if someone asks it questions about making bombs, while a private one would allow you to ask all kinds of questions without a trace. Some people even compared the technology to nuclear reactions at the time: useful for generating energy, but devastating if anyone could make their own atom bombs with it. Once it’s out you have no control over what people will use it for, and I really started doubting whether releasing such technology is a good idea when some of the people in the group chat started arguing about why including the book Lolita in the training data is a must (according to them, for freedom and democracy reasons).

I went to a retreat to think about it and decided to leave the project, but later on I decided to go back and ultimately continue with the effort. The reason is that there are only two possible futures. Think about gen AI as fire: which of these futures would you want to live in?

  • Fire is invented, but only a handful of companies have it. You are allowed to use it after paying a small, negligible fee (which they might raise at some point). You are allowed to cook certain things that are on their list, use it for lighting but you are not allowed to use it for experimenting with steam engines, as they have decided that they are not interested in that. These companies will alert the police if you try to burn anything down.
  • Fire is invented, and is given to everyone for free, including people who are obviously arsonists. Some people figure out steam engines and others figure out hot air balloons.

I do not want to live in a future where all white-collar work has to go through FAANG for a small fee (note: their current API fees are lower than the cost of running these systems, in order to destroy their competition).

I’ve also asked my father about this and gave him a different example on AI:

  • Imagine a car company that has invented the motor engine. People can buy this engine, but the company decides whether it is willing to sell it for your purpose. This is safer, because people will not be allowed to put the engine into something like a tank; but on the other hand we would never invent motorboats or airplanes, because the company would not give away its engine for something like that.
    And then my father was like: I don’t know about AI, but you got this all wrong. Companies always create tanks first, and only then do they sell you their technology as cars.

So I decided to continue with the open-source LLM effort, and in retrospect it was the right decision: the world still hasn’t exploded, and the drama at OpenAI (firing and rehiring Sam Altman within days) is a clear example of why we shouldn’t just trust companies to do the “right thing for us”.

(I admit that, of course, besides making sure the technology is available to anyone and forcing these companies to step up their game through open-source competition, I also had my own agenda: using this effort to get more familiar with the technology and to allow myself to be seen.)

Please don’t say that, Bhante; one of the reasons I chose this field was that it allowed me to have a right livelihood. I worked on ML projects that would detect car crashes and summon paramedics to the spot automatically. Our civilization runs on ML; it’s just that we tend not to call these things AI since ChatGPT (by the way, GPS was called AI in the 2000s, but we no longer use that marketing term for it).

I saw you quoting Altman multiple times, but you should not take these people at face value. (Update: never mind, I started reading the rest of the essays. I wasn’t even aware of all the bullshit going on around Altman.)
They are trying to sell you their stuff, and whenever they mention how AI could be regulated, it is usually just them making sure it becomes far harder for any competition to appear. They never disclose the data they used or how they got it. Whenever they are pressed about what people are supposed to do once AI is able to replace their jobs (which is something both DeepMind and OpenAI are working on at full speed, trying to reach that level within the next decade), they just come up with some bullshit, like “AI is making us more human” or “it’s going to make such huge progress, we will be able to ponder about life on our spaceship”. Well, if AI is better at pondering about life, that’ll probably also be a waste of time.

There is a scene in Silicon Valley where everyone says they want to make the world a better place; then the CEO Gavin Belson gets really frustrated about their competition doing the same thing:
“I don’t want to live in a world where someone else makes the world a better place better than we do”

I love when these people talk about “universal income” as a solution and how “AI will make the world a better place for all of us”. I guess the reason Google fired thousands of engineers and anyone trying to unionize is because they still don’t have enough shareholder value to universally share it with these people. Better luck next year! And I am pretty sure the reason these companies need to buy exclusive rights for Reddit data is because they just want to be the ones to share it with all the other AI companies working on making the world a better place.

These projects cost a lot of money, and the winner of this race will be the company with the most resources. Transformers are extremely data- and computation-heavy. They might look simple because anyone can log in to ChatGPT, but even doing inference and making it widely available to so many people in such a short time without any lag is an engineering miracle in itself that is usually overlooked. Cutting-edge research has long since left academia. ML is a bit like doing physics: you could once do experiments in your own basement for several decades and come up with useful observations, but now you would need to go to CERN and work on the Large Hadron Collider to come up with anything new.

About the translation of the Suttas:

I am heavily invested in transformers and gen AI, yet I do believe banning LLMs on SuttaCentral is the right call. I had some ideas about creating Goenka voice models to fix his chanting or translate his talks, but realised that once a model like that exists, it would be really easy to generate talks that he never gave, which would hurt Vipassana more than it would benefit it. I also thought about collecting BSWA dhamma talk transcriptions and fine-tuning a model on them, but I am glad I haven’t started that effort (for similar reasons).
Using LLM embeddings for sutta search and recommendations would still be useful, although it’s a slippery slope as to where to draw the line.
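For what it’s worth, the embedding-search idea can be pictured with a minimal, runnable sketch. The snippet texts and sutta ids below are made up, and plain bag-of-words vectors stand in for real LLM embeddings so the example has no dependencies:

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding model: word-count vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical corpus of sutta snippets keyed by id.
corpus = {
    "MN 10": "mindfulness of breathing body feelings mind",
    "SN 56.11": "four noble truths suffering origin cessation path",
}

def search(query: str) -> str:
    """Return the id of the snippet most similar to the query."""
    qv = vectorize(query)
    return max(corpus, key=lambda k: cosine(qv, vectorize(corpus[k])))

print(search("cessation of suffering"))  # -> SN 56.11
```

A real deployment would swap `vectorize` for an embedding model; the retrieval itself is the same nearest-neighbour lookup, which is why it sits on the harmless end of the slope.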

On the other hand, I can also understand how researchers would feel attacked by this sudden ban, since until recently their work was regarded as noble and selfless. They took a huge gamble even working in the field (it was really niche before ChatGPT), and now they suddenly get interrogated about the carbon footprint of their GPUs.

I do not think you will be able to completely ban crawlers from the site, since the bots are fully automatic, so it’s unlikely that an actual person from OpenAI or DeepMind will read your post and follow your request. There have been many attempts by artists to “poison” their content so AI would not be able to use their work as training data, but honestly those are just snake oil. I am not sure a 100% total ban would be the most beneficial, because it could have the effect of search engines and agents giving far less useful answers about Buddhism (similar to how Wikipedia is a horrible source for Buddhism).
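For completeness, the standard (entirely voluntary) opt-out is a robots.txt file. The user-agent tokens below are ones the major AI crawlers have publicly documented, but, as noted above, nothing forces a crawler to honour them:

```txt
# robots.txt — asks (but cannot force) AI training crawlers to stay away
User-agent: GPTBot            # OpenAI's crawler
Disallow: /

User-agent: Google-Extended   # Google's AI-training opt-out token
Disallow: /

User-agent: CCBot             # Common Crawl, a common training-data source
Disallow: /
```

This only covers crawlers that identify themselves and choose to comply; it does nothing about data already scraped.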

7 Likes

I think it’s important to note that this ban primarily affects the people who care about Bhante’s opinion. I can say with reasonable confidence that nobody at FAANG or the new generation of AI companies cares even slightly about Pali or Buddhism in general. It’s just data to them; they have no time to read posts on a forum like this.

I want to make clear that Dharmamitra is taking an open-source approach: we will make the model weights of our training runs available once they are stable enough, and we will also make datasets available wherever that is possible.

I have a tricky decision to make: will I take Bhante’s stance word for word? That would actually put my PhD project at risk, since part of it is to create a system that automatically maps Pali sentences to their Chinese and Sanskrit counterparts across the entire corpus. So now it seems that, according to the opinion of other people, I made an unethical decision two years ago when I started that project. Or will I ignore Bhante’s request and live with the fact that the very people creating this data are against my research?

I agree with your point, Richard, that cutting-edge research has long left academia. Calling Berkeley ‘big tech’ sounds funny to me, though of course many people here end up in big tech, but that is a different story.

4 Likes

Such an interesting issue, and such complex and interesting comments. At the end of the day, my sense is that AI will be amazingly beneficial to mankind: designing safer cars and cleaner energy systems, and providing technically accurate translations of volumes of texts in minutes of computing time. All of this will be beneficial.

Yet, AI won’t ever have “heart.” It might feign nuance, but it can’t truly capture a moment in time of poetry, or inspiration. It might mimic these qualities, but it won’t ever have the capacity to feel a moment of absorption in meditation, or feel Metta for a difficult person in a difficult moment, and be able to infuse these feelings and emotions into a translation. The problem with mimicry is that it is inauthentic; it lacks heart…a quality we need in the best translations.

Some years ago, I had a chance to participate in a very small way in a translation project, a book of great nuance being translated from Thai to English. Monastics and lay folks weighed in on the translation, and the different iterations were so beautiful, so heartfelt, and in the end so accurate in capturing the heart of the author, a respected and revered Bhikkhu. In theory, we might have given the translation assignment to an AI bot, and in a day had a translation. But, the end product would have lacked the heart and poetry and Metta of all of the men and women that worked on the translation. You’ll never convince me that AI can replicate a heartfelt endeavor like this, just as it will never experience a sense of letting go and release found in a samadhi absorption. For this reason, keep AI out of the Pali texts.

2 Likes

I don’t deny the existence of the elite class’s involvement in Pali translation. And that of course opens up a whole different form of bias in translations. However the three most prolific translators of Pali into English (Bhantes Bodhi, Thanissaro, and Sujato) are not a part of that. As far as I know all three are mostly self taught, although Bhante Bodhi would have had Sinhala teachers as a guide.

I’m not so sure I agree. But I recognize many people would feel this way.

I think on an individual basis the problem of consuming “wonder bread” translations is not a big deal. It’s more that when the market is saturated with one thing, the other doesn’t have a chance.

But to your point on democratizing Pali education, I think the better solution to the problem of access is increasing the amount of language learning materials. But that isn’t a problem that either of us can solve.

I sincerely hope you can find a way to work this out.

3 Likes