I have developed a simple RAG application that primarily focuses on question-answering based on the suttas. The data source I used is the translation published by Ajahn Sujato.
To facilitate efficient searching by the large language model, I ingested the entire content of the suttas into a vector database. The end result is a bot that can respond to user queries based on the content found within the suttas, providing answers along with the relevant sources.
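For readers curious what "ingesting into a vector database" involves, here is a minimal sketch of the ingest-and-retrieve loop. Everything in it is a stand-in: the passages and references are invented placeholders, and a bag-of-words count replaces the trained embedding model and real vector database a production RAG pipeline would use.

```python
import math
from collections import Counter

# Toy stand-in for a real embedding model: a bag-of-words word count.
# A real pipeline would use a trained sentence-embedding model and a
# vector database; the passages below are invented placeholders.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# "Ingest" each passage together with its source reference.
passages = [
    ("MN 10", "mindfulness of breathing established in front"),
    ("SN 56.11", "this is the noble truth of suffering"),
]
index = [(ref, embed(text)) for ref, text in passages]

def retrieve(query, k=1):
    # Rank stored passages by similarity to the query and return
    # the best-matching source references.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [ref for ref, _ in ranked[:k]]

print(retrieve("noble truth of suffering"))  # → ['SN 56.11']
```

In the real application, the retrieved passages would then be handed to the LLM as context for generating the answer; the retrieval step itself is just nearest-neighbour search over embeddings.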
However, it may still require substantial modifications to reach a practical level of usability. Currently, my main use case for this bot is illustrated in the image below. There are times when I can recall a portion of a scripture but remain uncertain about its exact location. In such situations, I can utilize this bot to search for and retrieve the specific sutta.
I am going to write more extensively on this point, but let me just ask that my translations are not used for any such project.
I have come to believe that AI agents are essentially creators of delusion and do not wish to support them.
My translations are released under a CC0 license, so there is no legal obstacle, but I also ask that people use them in accordance with the Buddhist tradition. And I believe it is fundamentally against the Buddhist tradition to allow scripture to be remixed by a machine, summarized, and presented as if it were the Buddha's words.
Yeah, the chatbot's summary is incorrect. I'm not sure why people are so excited about these things. Clearly you didn't bother to verify what the chatbot said, which is exactly the problem with these things. A much better reference for the simile is MN 54 Potaliya. And the simile is not about misapprehending the teachings. It's about sense pleasures.
These tools are not AI; they are Large Language Models whose goal is to create English that sounds like native speakers, not to provide accurate information.
I hope that folks will follow Bhante Sujato's wishes with regards to the use of his translations.
@HonkChou, I want to come back to say that I'm not accusing you of any bad intentions at all. It's natural that when new technology appears in the world, good people would look for ways to use it to help people learn the Buddha's teachings. I just think that in this case it's not a good use of the tool.
I noticed that this is your first time posting here. I hope that you feel welcome to participate.
Welcome to the D&D forum! We hope you enjoy the various resources, FAQs, and previous threads. You can use the search function for topics and keywords you are interested in. Forum guidelines are here: Forum Guidelines. May some of these resources be of assistance along the path.
If you have any questions or need further clarification regarding anything, feel free to contact the moderators by including @moderators in your post or a PM.
Thanks so much for that. I really appreciate your responsiveness. These are new domains and we are all figuring things out as we go.
One approach that might be more appropriate would be using the LLM as a natural language search engine. But the results would not be generated text, simply a list of links that might point to the answer. In this way the true nature of the LLM is preserved: it is a mashup of data, not an intelligent agent.
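That retrieval-only idea might look roughly like this. The index of links and descriptions is hypothetical, and plain word overlap stands in for the embedding similarity a real system would use; the point is only that the output is links, never generated prose.

```python
# Hypothetical index mapping SuttaCentral links to short descriptions.
# A real system would rank by embedding similarity; plain word overlap
# stands in for that here, to keep the sketch self-contained.
index = {
    "https://suttacentral.net/mn54": "potaliya simile of sense pleasures skeleton torch",
    "https://suttacentral.net/dn16": "the great discourse on the buddha's extinguishment",
    "https://suttacentral.net/sn56.11": "rolling forth the wheel of dhamma four noble truths",
}

def search_links(query, k=2):
    # Return links only: no generated text, so nothing is presented
    # as if it were the Buddha's words. The reader follows the link.
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda url: len(q & set(index[url].split())), reverse=True)
    return ranked[:k]

print(search_links("simile about sense pleasures"))
```

The LLM-as-mashup framing survives intact here: the model only measures closeness between the query and the texts, and a human does all the reading and interpreting.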
Having said this, if my work is to be used like this, it should be on a dedicated language model trained and used only for this purpose, not by building on top of ChatGPT or similar.
To be clear, I'm not advocating such uses, as I have reservations about the whole field. But my main problem is that such machines pretend to be human. So uses that do not imitate humanity are less problematic.
Currently, that's what I think is purported to be going on in SuttaCentral search, which states it's done by "The one-stop shop for AI search", Algolia. But I'm doubtful, because terms that should be very close in a vector embedding to words I know are in the suttas get 0 results.
Any functional specialized LLM is going to be built on top of pre-existing models, because essentially there isn't enough information in a smaller corpus to "learn" the basic relationships like "king - male = queen".
What works for certain applications is "transfer learning" or "fine-tuning", where you start with the weights from the original base model and then modify them based on the dataset. This is good for, for example, looking through a bunch of HR documents and learning that in the context of one company "MetLife" and "Dental" are synonyms but "vacation" and "holiday" are not.
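The "king - male = queen" relationship is literally vector arithmetic over word embeddings. A toy illustration with hand-made 2-d vectors (real embeddings are learned and high-dimensional; this only shows the arithmetic):

```python
# Hand-made 2-d "embeddings": dimension 0 encodes royalty, dimension 1
# encodes gender. Real embeddings are learned from huge corpora and
# have hundreds of dimensions; this only illustrates the arithmetic.
vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

def analogy(a, b, c):
    # Compute a - b + c, then return the nearest stored word.
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    def dist(word):
        return sum((t - v) ** 2 for t, v in zip(target, vectors[word]))
    return min(vectors, key=dist)

print(analogy("king", "man", "woman"))  # → queen
```

The catch, as noted above, is that these directions only emerge when the training corpus is enormous; a fine-tuned model inherits them from the base model rather than learning them from the small domain dataset.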
But clearly even that approach isn't considered appropriate for law or medicine or talking to those you truly love and cherish, so I think it makes sense for you to say you're opposed to your work being used in that way in the sensitive area of dhamma. I just want to share the difficulties with the alternative you mentioned.
Indeed, yes, it's always complicated in practice.
Huh, I hadn't even followed this. It seems they introduced their "AI" NeuralSearch in May 2023, and featured it prominently in branding from then on.
It's pretty unclear what it actually is.
TBH I'm not a huge fan of using Algolia (even tho it was my suggestion!) but it turned out to be the best solution from our dev point of view. I'd rather go back to a pure self-hosted search like Orama, but let's see how it goes.
Is that true? I know there is ongoing research in small and medium-sized models. For practical purposes, though, it sure does seem that they will be swamped for the reasons you mention.
I think we're all going to find out that the G in AGI stands for "Monopoly".
I'm hesitant to double down, because I certainly wasn't predicting the timeline for all the developments we've seen, but even "small" language models like phi-2 have billions of parameters and are more practical to fine-tune than to train from scratch. For reference, I believe your translation of the nikayas is about 700k total words and about 12k unique words. Maybe someone is going to shock us with a revolutionary technique, but right now I just don't think there are enough degrees of freedom in the corpus to get even a "small" language model.
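The degrees-of-freedom worry can be made concrete with back-of-the-envelope arithmetic. The corpus figures are the rough estimates quoted above; the parameter count for a "small" model is an illustrative assumption, not a measurement of any particular model.

```python
# Back-of-the-envelope version of the degrees-of-freedom worry,
# using the rough figures quoted in the post. The parameter count
# is an illustrative assumption for a very "small" language model.
corpus_words = 700_000    # total words in the nikaya translations (estimate)
unique_words = 12_000     # distinct vocabulary items (estimate)
model_params = 1_000_000  # tiny by today's standards, yet still > corpus

print(f"parameters per training word: {model_params / corpus_words:.1f}")
print(f"average occurrences per word: {corpus_words / unique_words:.1f}")
```

With more parameters than training words, a model fitted to this corpus alone can essentially memorize it, which is the "plagiarism machine" point made below.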
You can definitely do stuff with a model trained on the SuttaCentral corpus, I just don't think it's spooky enough to get the "AI" label right now. I haven't actually done this, so it's all speculative, but I imagine someone could train a neural network on the suttas and get something like, "The Parinibbānasuttas' SN6.15 and DN16 embeddings [behind-the-scenes representations as a bunch of numbers] are more similar to each other's than SN6.15 is to any other DN sutta," or with search you might be able to get it to where it realizes "bliss" and "rapture" are close to each other. But it wouldn't be possible to either generate or really process truly natural text. For example, I don't think you ever use the word "ecstasy", so a model trained only on your translations would have no idea that word is similar to "rapture" and dissimilar to "despair".

It also likely wouldn't be able to string words together with the illusion of coherency and flexibility. Or, if it did, it would be even more blatant as a "plagiarism machine", because of course a million-parameter model of a dataset of less than a million observations can effectively just contain and regurgitate the dataset. It might know which words come after "Prime Net", but if you truly start training the model from scratch, it'll have no idea how to do those AI-style intros and outros, and would have absolutely no idea what to say in response to a prompt like "Wordiest essay", because again those words simply don't appear in the training dataset, and so the model would be stuck with NA in, NA out.
I could be wrong, but that's my understanding at the current time.
Essentially they are using the large statistical model to automatically identify synonyms and search for those synonyms in addition to what you literally searched for. (The actual details are a lot more complicated, involving vector embeddings, but I'm trying to give a non-technical description.)
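In code, that synonym-expansion idea might look roughly like this. The synonym table is hand-written here purely for illustration; a real "AI search" derives it from embedding distances rather than a lookup table, and the documents and references are invented.

```python
# Hand-written synonym table standing in for one derived from
# embedding distances in a real "AI search" product.
synonyms = {
    "bliss": {"rapture"},
    "rapture": {"bliss"},
}

# Invented documents keyed by illustrative sutta references.
documents = {
    "SN 36.31": "spiritual rapture born of seclusion",
    "MN 54": "the danger in sense pleasures",
}

def expanded_search(query):
    # Expand the query terms with their known near-synonyms,
    # then fall back to ordinary keyword matching.
    terms = set(query.lower().split())
    for t in list(terms):
        terms |= synonyms.get(t, set())
    return [ref for ref, text in documents.items()
            if terms & set(text.lower().split())]

print(expanded_search("bliss"))  # matches SN 36.31 via the synonym "rapture"
```

This is also why missing synonym pairs produce the "0 results" behaviour complained about above: if the expansion step doesn't know two words are close, the search degrades to literal keyword matching.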
The Linguae Dharmae project used our Bilara texts/translations (for which we have multiple languages) as well as the whole commentarial corpus, all the Chinese texts, and so on. Well, that was the idea. But basically what they found was that training a small model on just these texts produced output that was no better than, and possibly inferior to, ChatGPT etc.
So I think this is an ongoing area of research, but currently, yeah, there seems to be little practical advantage in the smaller models. I think there's a good chance that GPT-5 will widen the gap even further.
Sebastian told me his model unexpectedly turned out to be able to translate from Sanskrit to Pali. Is that spooky enough?
That's very impressive, and I'd love to learn more if you could point me in that direction.
However, very few people were talking about "AI" when BERT was seen to have more or less cracked simultaneous translation of living languages (instead they chose the term NMT, neural machine translation), so, for whatever arbitrary reason, I think that's below the spookiness threshold to be labeled as AI. I feel like it's really only with image and text generation that the term exploded in use.
Another thing nobody calls AI but is honestly quite remarkable is content recommendation algorithms. Ajahn Brahmali has been submitting his dhamma talks at the BSWA to a black-box machine learning algorithm (YouTube) for years and years now, and I think it's actually worked out quite well at identifying candidates to be radicalized by his rhetoric.
I think it's all a hot, confusing mess and I look forward to seeing what you decide you approve / disapprove of.
O indeed, it's more of an uncanny wilderness than an uncanny valley. For sure, AI is mostly a marketing term. The things I have the most problems with are things that imitate humans. Like if you're on YouTube, and one video is open and another is recommended, no-one thinks that a human is behind the scenes doing that. It's still problematic, of course, because the recommendations somehow always push towards more radical and extreme content. But it's more problematic when, say, the comments section is swarmed by bots pretending to be human.
It gets even worse when there is a complex web of interactions between human and non-human entities. Here's a good example:
So, if I understand right, human actors are being paid to promote a scam without knowing it. The scam is derived from a genuine kind of scam, which is built upon another genuine kind of scam, crypto itself. But in this case it's a false scam, as the scam doesn't even take place: it just sends money from the "investor" straight to the scammers. It's like a triple-layered scamorama. Then the human actors are reinforced by bots in the comments. But inside the scam, the "investors" are warned not to fall for the kind of scam that they are actually falling for in that moment. Then other videos pop up exposing the scam. Obviously, they too are scams.
How much of the world is being burned to death to power this scamception?
This is the reason I left Quora. As a Q&A platform it was particularly susceptible. You ended up with the Quora "prompt generator" creating questions, then users cutting and pasting ChatGPT responses as answers, with the obligatory AI-generated image to spice things up.
The folk running Quora also replaced their excellent moderation team with an algorithm and introduced their own generative AI, "Poe". Plus they changed their default settings from "don't use my words to train an AI" to "yeah, let's do that".
The net result was a mass exodus and, even if the owners are looking to sell, a less valuable business.
Well, I seem to find interesting communities out there on the interwebs. It's certainly not all bad. As an aside, I had a look at Substack. There's certainly some good content there, but it can be hard to find. The endless feed is a dead end.