Simple RAG Application on Sutta Question Answering

I have developed a simple RAG application that primarily focuses on question answering based on the suttas. The data source I used is the version published by Ajahn Sujato.

To facilitate efficient searching by the large language model, I ingested the entire content of the suttas into a vector database. The end result is a bot that can respond to user queries based on the content found within the suttas, providing answers along with the relevant sources.

However, it may still require substantial modifications to reach a practical level of usability. Currently, my main use case for this bot is illustrated in the image below. There are times when I can recall a portion of a scripture but remain uncertain about its exact location. In such situations, I can utilize this bot to search for and retrieve the specific sutta.

Just thought someone might find this interesting.

3 Likes

fantastic idea!
I hope that this can be hosted online somewhere. I don't know if I can be bothered setting it up on my own machine.

I am going to write more extensively on this point, but let me just ask that my translations are not used for any such project.

I have come to believe that AI agents are essentially creators of delusion and do not wish to support them.

My translations are released under a CC0 license, so there is no legal obstacle, but I also ask that people use them in accordance with the Buddhist tradition. And I believe it is fundamentally against the Buddhist tradition to allow scripture to be remixed by a machine, summarized, and presented as if it were the Buddha's words.

11 Likes

Yeah, the chatbot's summary is incorrect. I'm not sure why people are so excited about these things. Clearly you didn't bother to verify what the chatbot said, which is exactly the problem with these things. A much better reference for the simile is MN 54 Potaliya. And the simile is not about misapprehending the teachings. It's about sense pleasures.

These tools are not AI; they are large language models whose goal is to produce English that sounds like a native speaker's, not to provide accurate information.

I hope that folks will follow Bhante Sujato's wishes with regards to the use of his translations.

4 Likes

I'll also point out that if you had just done a search on the SC website for dog AND bone, the Potaliya sutta would have been the first result.

4 Likes

I apologize for not considering this carefully enough.

Repackaging the Buddha's teachings using large language models is a highly inappropriate action, and I did not mean to do that.

I have removed the link to the vector database and taken the project down.

I apologize again for not using your translation for its proper purpose.

7 Likes

@HonkChou, I want to come back to say that I'm not accusing you of any bad intentions at all. It's natural that when new technology appears in the world, good people would look for ways to use it to help people learn the Buddha's teachings. I just think that in this case it's not a good use of the tool.

I noticed that this is your first time posting here. I hope that you feel welcome to participate.

7 Likes

Hi HonkChou,

Welcome to the D&D forum! We hope you enjoy the various resources, FAQs, and previous threads. You can use the search function for topics and keywords you are interested in. Forum guidelines are here: Forum Guidelines. May some of these resources be of assistance along the path.

If you have any questions or need further clarification regarding anything, feel free to contact the moderators by including @moderators in your post or a PM.

Regards,
trusolo (on behalf of the moderators)

Thanks so much for that. I really appreciate your responsiveness. These are new domains and we are all figuring things out as we go.

One approach that might be more appropriate would be using the LLM as a natural language search engine. But the results would not be generated text, simply a list of links that might point to the answer. In this way the true nature of the LLM is preserved: it is a mashup of data, not an intelligent agent.
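Sketched in miniature, such a retrieval-only search might look like the toy below. To be clear about what is invented here: the "embedding" is just a bag-of-words term-frequency count standing in for a real embedding model, and the indexed snippets are loose paraphrases used purely for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The "vector database": (source reference, vector) pairs. The snippets are
# loose paraphrases, for illustration only.
index = [
    ("MN 54", embed("a dog gnawing a fleshless bone smeared with blood, "
                    "sense pleasures give little gratification")),
    ("DN 16", embed("the Buddha's final extinguishment between the twin sal trees")),
]

def search(question, k=1):
    """Return only the k most similar source references: links, not generated text."""
    qv = embed(question)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [ref for ref, _ in ranked[:k]]
```

The essential property is that `search` returns nothing but pointers to the texts; a real embedding model would improve the ranking but would not change that.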

Having said this, if my work is to be used like this, it should be on a dedicated language model trained and used only for this purpose, not by building on top of ChatGPT or similar.

To be clear, I'm not advocating such uses, as I have reservations about the whole field. But my main problem is that such machines pretend to be human. So uses that do not imitate humanity are less problematic.

5 Likes

Currently, that is what I think is purported to be going on in SuttaCentral search, which states it's done by Algolia, "The one-stop shop for AI search". But I'm doubtful, because terms that should be very close in a vector embedding to words I know are in the suttas get 0 results.

Any functional specialized LLM is going to be built on top of pre-existing models, because essentially there isn't enough information in a smaller corpus to "learn" the basic relationships like "king - male = queen".

What works for certain applications is "transfer learning" or "fine tuning" where you start with the weights from the original base model and then modify them based on the dataset. This is good for, for example, looking through a bunch of HR documents and learning that in the context of one company "MetLife" and "Dental" are synonyms but "vacation" and "holiday" are not.
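As a toy illustration of that idea (everything below is made up for the sketch: the two-dimensional "pretrained" vectors, the miniature corpus, and the update rule; real fine-tuning updates the weights of a full network, not standalone word vectors), you can start from fixed base embeddings and nudge co-occurring domain terms toward each other:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (dot(a, a) ** 0.5 * dot(b, b) ** 0.5)

# Made-up "pretrained" 2-d embeddings: the base model has no reason to
# place "metlife" anywhere near "dental".
base = {
    "metlife": [1.0, 0.0],
    "dental": [0.0, 1.0],
    "vacation": [-1.0, 0.0],
}

# Tiny in-domain corpus: in this company's documents, MetLife and dental
# co-occur, while vacation appears on its own.
domain_docs = [["metlife", "dental"], ["metlife", "dental"], ["vacation"]]

def fine_tune(vectors, docs, lr=0.2, epochs=10):
    """Pull every pair of co-occurring words a small step toward each other."""
    v = {w: list(vec) for w, vec in vectors.items()}  # leave the base intact
    for _ in range(epochs):
        for doc in docs:
            for i in range(len(doc)):
                for j in range(i + 1, len(doc)):
                    a, b = v[doc[i]], v[doc[j]]
                    for d in range(len(a)):
                        step = lr * (b[d] - a[d])
                        a[d] += step  # a moves toward b...
                        b[d] -= step  # ...and b toward a
    return v

tuned = fine_tune(base, domain_docs)
```

After tuning, "metlife" and "dental" end up nearly identical while "vacation" is untouched: domain synonyms learned on top of a base representation, which is the HR-documents situation in miniature.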

But clearly even that approach isn't considered appropriate for law or medicine or talking to those you truly love and cherish, so I think it makes sense for you to say you're opposed to your work being used in that way in the sensitive area of dhamma. I just want to share the difficulties with the alternative you mentioned.

Indeed, yes, it's always complicated in practice.

Huh, I hadn't even followed this. It seems they introduced their "AI" neuralsearch in May 2023, and featured it prominently in branding from then on.

It's pretty unclear what it actually is.

TBH I'm not a huge fan of using Algolia (even tho it was my suggestion!) but it turned out to be the best solution from our dev point of view. I'd rather go back to a pure self-hosted search like Orama, but let's see how it goes.

Is that true? I know there is ongoing research in small and medium-sized models. For practical purposes, though, it sure does seem that they will be swamped for the reasons you mention.

I think weā€™re all going to find out that the G in AGI stands for ā€œMonopolyā€.

I'm hesitant to double down, because I certainly wasn't predicting the timeline for all the developments we've seen, but I believe even "small" language models like phi-2 have billions of parameters and are more practical to fine-tune than to train from scratch. For reference, I believe your translation of the nikayas is about 700k total words and about 12k unique words. Maybe someone is going to shock us with a revolutionary technique, but right now I just don't think there are enough degrees of freedom in the corpus to get even a "small" language model.

You can definitely do stuff with a model trained on the SuttaCentral corpus, I just don't think it's spooky enough to get the "AI" label right now. I haven't actually done this, so it's all speculative, but I imagine someone could train a neural network on the suttas and get something like, "The Parinibbānasuttas' SN6.15 and DN16 embeddings [behind-the-scenes representations as a bunch of numbers] are more similar to each other's than SN6.15 is to any other DN sutta", or with search you might be able to get it to where it realizes "bliss" and "rapture" are close to each other. But it wouldn't be possible to either generate or really process truly natural text. For example, I don't think you ever use the word "ecstasy", so a model trained only on your translations would have no idea that word is similar to "rapture" and dissimilar to "despair".

It also likely wouldn't be able to string words together with the illusion of coherency and flexibility. Or, if it did, it would be even more blatant as a "plagiarism machine", because of course a million-parameter model of a dataset of less than a million observations can effectively just contain and regurgitate the dataset. It might know which words come after "Prime Net", but if you truly start training the model from scratch, it'll have no idea how to do those AI-style intros and outros, and would have absolutely no idea what to say in response to a prompt like "Wordiest essay", because again those words simply don't appear in the training dataset, and so the model would be stuck with NA in, NA out.
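To make the regurgitation point concrete, here is a bigram model in miniature (the training sentences are invented, echoing the "Prime Net" example): it can only emit words in orders it has literally seen, and a prompt whose last word never occurs in the corpus leaves it with nothing to say at all.

```python
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Record which word follows which: a minimal 'language model'."""
    model = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            model[a].append(b)
    return model

def generate(model, prompt, length=5, seed=0):
    """Continue from the prompt's last word; None if that word was never seen."""
    rng = random.Random(seed)
    word = prompt.split()[-1]
    out = []
    for _ in range(length):
        followers = model.get(word)
        if not followers:  # dead end: nothing ever followed this word
            break
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out) if out else None

# Invented two-sentence "training corpus".
corpus = ["the prime net is the first discourse", "the prime net covers all views"]
model = train_bigrams(corpus)
```

Every word the model emits is copied straight out of the two training sentences ("prime" is always followed by "net"), while a prompt ending in an unseen word like "essay" yields nothing: NA in, NA out.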

I could be wrong, but thatā€™s my understanding at the current time.

1 Like

Essentially they are using the large statistical model to automatically identify synonyms and search for those synonyms in addition to what you literally searched for. (The actual details are a lot more complicated, involving vector embeddings, but I'm trying to give a non-technical description.)
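In code, that query-expansion idea might look something like the sketch below. The synonym table is hand-written purely for illustration; a real system would derive neighbours from learned embeddings, and Algolia's actual implementation is not public.

```python
# Hand-written synonym table, standing in for nearest neighbours
# in a learned embedding space.
synonyms = {
    "snake": ["serpent", "viper"],
    "bliss": ["rapture"],
}

# A miniature document collection (invented snippets).
documents = {
    "doc1": "the serpent king came to listen",
    "doc2": "he abides in rapture and joy",
    "doc3": "a dog gnawing on a bone",
}

def expand(query):
    """Add each query term's known synonyms to the literal terms."""
    terms = set(query.lower().split())
    for t in list(terms):
        terms.update(synonyms.get(t, []))
    return terms

def search(query):
    """Ordinary literal matching, but run over the expanded term set."""
    terms = expand(query)
    return sorted(doc for doc, text in documents.items()
                  if terms & set(text.split()))
```

A search for "snake" now matches doc1 through "serpent", even though "snake" itself never occurs anywhere in the collection.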

I guess we don't have that feature activated, because when I do the Algolia search for snake it doesn't suggest vipers or serpents.

1 Like

The Linguae Dharmae project used our Bilara text/translations (for which we have multiple languages) as well as the whole commentarial corpus, all the Chinese texts, and so on. Well, that's the idea. But basically what they found was that training a small model on just these texts produced output that was no better than, and possibly inferior to, ChatGPT etc.

So I think this is an ongoing area of research, but currently yeah, there seems little practical advantage in the smaller models at this point. I think there's a good chance that GPT-5 will widen the gap even further.

Sebastian told me his model unexpectedly turned out to be able to translate from Sanskrit to Pali. Is that spooky enough?

That's very impressive and I'd love to learn more if you could point me in that direction.

However, very few people were talking about "AI" when BERT was seen to have more-or-less cracked simultaneous translation of living languages (instead they chose the term NMT, neural machine translation), so, for whatever arbitrary reason, I think that's below the spookiness threshold to be labeled as AI. I feel like it's really only with image and text generation that the term exploded in use.

Another thing nobody calls AI, but that is honestly quite remarkable, is content recommendation algorithms. Ajahn Brahmali has been submitting his dhamma talks at the BSWA to a black-box machine learning algorithm (YouTube) for years and years now, and I think it's actually worked out quite well at identifying candidates to be radicalized by his rhetoric :stuck_out_tongue:

I think it's all a hot confusing mess and I look forward to seeing what you decide you approve / disapprove of.

2 Likes

O indeed, it's more of an uncanny wilderness than an uncanny valley. For sure, AI is mostly a marketing term. The things I have the most problems with are things that imitate humans. Like if you're on YouTube, and one video is open and another is recommended, no-one thinks that a human is behind the scenes doing that. It's still problematic, of course, because the recommendations somehow always push towards more radical and extreme content. But it's more problematic when, say, the comments section is swarmed by bots pretending to be human.

It gets even worse when there is a complex web of interactions between human and non-human entities. Here's a good example:

So, if I understand right, human actors are being paid to promote a scam without knowing it. The scam is derived from a genuine kind of scam, which is built upon another genuine kind of scam, crypto itself. But in this case it's a false scam, as the scam doesn't even take place; it just sends money from the "investor" straight to the scammers. It's like a triple-layered scamorama. Then the human actors are reinforced by bots in the comments. But inside the scam, the "investors" are warned not to fall for the kind of scam that they are actually falling for in that moment. Then other videos pop up exposing the scam. Obviously, they too are scams.

How much of the world is being burned to death to power this scamception?

3 Likes

This is the reason I left Quora. As a Q&A platform it was particularly susceptible. You ended up with the Quora "prompt generator" creating questions, then users cutting and pasting ChatGPT responses as answers, with the obligatory AI-generated image to spice things up.

The folk running Quora also replaced their excellent moderation team with an algorithm and introduced their own generative AI, "Poe". Plus they changed their default settings from "don't use my words to train an AI" to "yeah, let's do that".

The net result was a mass exodus and, even if the owners are looking to sell, a less valuable business.

2 Likes

It's an amazing time we are living in, isn't it?

Well, I seem to find interesting communities out there on the interwebs. It's certainly not all bad. As an aside, I had a look at Substack. There's certainly some good content there, but it can be hard to find. The endless feed is a dead end.

1 Like