Simple RAG Application on Sutta Question Answering

I have developed a simple RAG application that primarily focuses on question answering based on the suttas. The data source I used is the version published by Ajahn Sujato.

To facilitate efficient searching by the large language model, I ingested the entire content of the suttas into a vector database. The end result is a bot that can respond to user queries based on the content found within the suttas, providing answers along with the relevant sources.

However, it may still require substantial modifications to reach a practical level of usability. Currently, my main use case for this bot is illustrated in the image below. There are times when I can recall a portion of a scripture but remain uncertain about its exact location. In such situations, I can utilize this bot to search for and retrieve the specific sutta.

Just thought someone might find this interesting.

3 Likes

fantastic idea!
I hope that this can be hosted online somewhere. I don't know if I can be bothered setting it up on my own machine.

I am going to write more extensively on this point, but let me just ask that my translations are not used for any such project.

I have come to believe that AI agents are essentially creators of delusion and do not wish to support them.

My translations are released under a CC0 license, so there is no legal obstacle, but I also ask that people use them in accordance with the Buddhist tradition. And I believe it is fundamentally against the Buddhist tradition to allow scripture to be remixed by a machine, summarized, and presented as if it were the Buddha's words.

11 Likes

Yeah, the chatbot's summary is incorrect. I'm not sure why people are so excited about these things. Clearly you didn't bother to verify what the chatbot said, which is exactly the problem with these things. A much better reference for the simile is MN 54 Potaliya. And the simile is not about misapprehending the teachings. It's about sense pleasures.

These tools are not AI; they are large language models whose goal is to produce English that sounds like a native speaker's, not to provide accurate information.

I hope that folks will follow Bhante Sujato's wishes with regards to the use of his translations.

4 Likes

I'll also point out that if you had just done a search on the SC website for dog AND bone, the Potaliya sutta would have been the first result.

4 Likes

I apologize for not considering this carefully enough.

Repackaging the Buddha's teachings using large language models is a highly inappropriate action, and I did not mean to do that.

I have removed the link to the vector database and taken the project down.

I apologize again for not using your translation for its proper purpose.

7 Likes

@HonkChou, I want to come back to say that I'm not accusing you of any bad intentions at all. It's natural that when new technology appears in the world, good people would look for ways to use it to help people learn the Buddha's teachings. I just think that in this case it's not a good use of the tool.

I noticed that this is your first time posting here. I hope that you feel welcome to participate.

7 Likes

Hi HonkChou,

Welcome to the D&D forum! We hope you enjoy the various resources, FAQs, and previous threads. You can use the search function for topics and keywords you are interested in. Forum guidelines are here: Forum Guidelines. May some of these resources be of assistance along the path.

If you have any questions or need further clarification regarding anything, feel free to contact the moderators by including @moderators in your post or a PM.

Regards,
trusolo (on behalf of the moderators)

Thanks so much for that. I really appreciate your responsiveness. These are new domains and we are all figuring things out as we go.

One approach that might be more appropriate would be using the LLM as a natural language search engine. But the results would not be generated text, simply a list of links that might point to the answer. In this way the true nature of the LLM is preserved: it is a mashup of data, not an intelligent agent.
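Sketched in miniature, such a retrieval-only search might look like the toy below. To be clear about what is invented here: the "embedding" is just a bag-of-words term-frequency count standing in for a real embedding model, and the indexed snippets are loose paraphrases used purely for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The "vector database": (source reference, vector) pairs. The snippets are
# loose paraphrases, for illustration only.
index = [
    ("MN 54", embed("a dog gnawing a fleshless bone smeared with blood, "
                    "sense pleasures give little gratification")),
    ("DN 16", embed("the Buddha's final extinguishment between the twin sal trees")),
]

def search(question, k=1):
    """Return only the k most similar source references: links, not generated text."""
    qv = embed(question)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [ref for ref, _ in ranked[:k]]
```

The essential property is that `search` returns nothing but pointers to the texts; a real embedding model would improve the ranking but would not change that.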

Having said this, if my work is to be used like this, it should be on a dedicated language model trained and used only for this purpose, not by building on top of ChatGPT or similar.

To be clear, I'm not advocating such uses, as I have reservations about the whole field. But my main problem is that such machines pretend to be human. So uses that do not imitate humanity are less problematic.

5 Likes

Currently, that is what I think is purported to be going on in SuttaCentral search, which states it's done by Algolia, "The one-stop shop for AI search". But I'm doubtful, because terms that should be very close in a vector embedding to words I know are in the suttas get 0 results.

Any functional specialized LLM is going to be built on top of pre-existing models, because essentially there isn't enough information in a smaller corpus to "learn" the basic relationships like "king - male = queen".

What works for certain applications is "transfer learning" or "fine tuning" where you start with the weights from the original base model and then modify them based on the dataset. This is good for, for example, looking through a bunch of HR documents and learning that in the context of one company "MetLife" and "Dental" are synonyms but "vacation" and "holiday" are not.
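As a toy illustration of that idea (everything below is made up for the sketch: the two-dimensional "pretrained" vectors, the miniature corpus, and the update rule; real fine-tuning updates the weights of a full network, not standalone word vectors), you can start from fixed base embeddings and nudge co-occurring domain terms toward each other:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (dot(a, a) ** 0.5 * dot(b, b) ** 0.5)

# Made-up "pretrained" 2-d embeddings: the base model has no reason to
# place "metlife" anywhere near "dental".
base = {
    "metlife": [1.0, 0.0],
    "dental": [0.0, 1.0],
    "vacation": [-1.0, 0.0],
}

# Tiny in-domain corpus: in this company's documents, MetLife and dental
# co-occur, while vacation appears on its own.
domain_docs = [["metlife", "dental"], ["metlife", "dental"], ["vacation"]]

def fine_tune(vectors, docs, lr=0.2, epochs=10):
    """Pull every pair of co-occurring words a small step toward each other."""
    v = {w: list(vec) for w, vec in vectors.items()}  # leave the base intact
    for _ in range(epochs):
        for doc in docs:
            for i in range(len(doc)):
                for j in range(i + 1, len(doc)):
                    a, b = v[doc[i]], v[doc[j]]
                    for d in range(len(a)):
                        step = lr * (b[d] - a[d])
                        a[d] += step  # a moves toward b...
                        b[d] -= step  # ...and b toward a
    return v

tuned = fine_tune(base, domain_docs)
```

After tuning, "metlife" and "dental" end up nearly identical while "vacation" is untouched: domain synonyms learned on top of a base representation, which is the HR-documents situation in miniature.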

But clearly even that approach isn't considered appropriate for law or medicine or talking to those you truly love and cherish, so I think it makes sense for you to say you're opposed to your work being used in that way in the sensitive area of dhamma. I just want to share the difficulties with the alternative you mentioned.

Indeed, yes, it's always complicated in practice.

Huh, I hadn't even followed this. It seems they introduced their "AI" neuralsearch in May 2023, and featured it prominently in branding from then on.

It's pretty unclear what it actually is.

TBH I'm not a huge fan of using Algolia (even tho it was my suggestion!) but it turned out to be the best solution from our dev point of view. I'd rather go back to a pure self-hosted search like Orama, but let's see how it goes.

Is that true? I know there is ongoing research in small and medium-sized models. For practical purposes, though, it sure does seem that they will be swamped for the reasons you mention.

I think weā€™re all going to find out that the G in AGI stands for ā€œMonopolyā€.

I'm hesitant to double down, because I certainly wasn't predicting the timeline for all the developments we've seen, but I believe even "small" language models like phi-2 have billions of parameters and are more practical to fine-tune than to train from scratch. For reference, I believe your translation of the nikayas is about 700k total words and about 12k unique words. Maybe someone is going to shock us with a revolutionary technique, but right now I just don't think there are enough degrees of freedom in the corpus to get even a "small" language model.

You can definitely do stuff with a model trained on the SuttaCentral corpus, I just don't think it's spooky enough to get the "AI" label right now. I haven't actually done this, so it's all speculative, but I imagine someone could train a neural network on the suttas and get something like, "The Parinibbānasuttas' SN6.15 and DN16 embeddings [behind-the-scenes representations as a bunch of numbers] are more similar to each other's than SN6.15 is to any other DN sutta", or with search you might be able to get it to where it realizes "bliss" and "rapture" are close to each other. But it wouldn't be possible to either generate or really process truly natural text. For example, I don't think you ever use the word "ecstasy", so a model trained only on your translations would have no idea that word is similar to "rapture" and dissimilar to "despair".

It also likely wouldn't be able to string words together with the illusion of coherency and flexibility. Or, if it did, it would be even more blatant as a "plagiarism machine", because of course a million-parameter model of a dataset of less than a million observations can effectively just contain and regurgitate the dataset. It might know which words come after "Prime Net", but if you truly start training the model from scratch, it'll have no idea how to do those AI-style intros and outros, and would have absolutely no idea what to say in response to a prompt like "Wordiest essay", because again those words simply don't appear in the training dataset, and so the model would be stuck with NA in, NA out.
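To make the regurgitation point concrete, here is a bigram model in miniature (the training sentences are invented, echoing the "Prime Net" example): it can only emit words in orders it has literally seen, and a prompt whose last word never occurs in the corpus leaves it with nothing to say at all.

```python
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Record which word follows which: a minimal 'language model'."""
    model = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            model[a].append(b)
    return model

def generate(model, prompt, length=5, seed=0):
    """Continue from the prompt's last word; None if that word was never seen."""
    rng = random.Random(seed)
    word = prompt.split()[-1]
    out = []
    for _ in range(length):
        followers = model.get(word)
        if not followers:  # dead end: nothing ever followed this word
            break
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out) if out else None

# Invented two-sentence "training corpus".
corpus = ["the prime net is the first discourse", "the prime net covers all views"]
model = train_bigrams(corpus)
```

Every word the model emits is copied straight out of the two training sentences ("prime" is always followed by "net"), while a prompt ending in an unseen word like "essay" yields nothing: NA in, NA out.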

I could be wrong, but thatā€™s my understanding at the current time.

1 Like

Essentially they are using the large statistical model to automatically identify synonyms and search for those synonyms in addition to what you literally searched for. (The actual details are a lot more complicated, involving vector embeddings, but I'm trying to give a non-technical description.)
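In code, that query-expansion idea might look something like the sketch below. The synonym table is hand-written purely for illustration; a real system would derive neighbours from learned embeddings, and Algolia's actual implementation is not public.

```python
# Hand-written synonym table, standing in for nearest neighbours
# in a learned embedding space.
synonyms = {
    "snake": ["serpent", "viper"],
    "bliss": ["rapture"],
}

# A miniature document collection (invented snippets).
documents = {
    "doc1": "the serpent king came to listen",
    "doc2": "he abides in rapture and joy",
    "doc3": "a dog gnawing on a bone",
}

def expand(query):
    """Add each query term's known synonyms to the literal terms."""
    terms = set(query.lower().split())
    for t in list(terms):
        terms.update(synonyms.get(t, []))
    return terms

def search(query):
    """Ordinary literal matching, but run over the expanded term set."""
    terms = expand(query)
    return sorted(doc for doc, text in documents.items()
                  if terms & set(text.split()))
```

A search for "snake" now matches doc1 through "serpent", even though "snake" itself never occurs anywhere in the collection.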

I guess we don't have that feature activated, because when I do the Algolia search for snake it doesn't suggest vipers or serpents.

1 Like

The Linguae Dharmae project used our Bilara text/translations (for which we have multiple languages) as well as the whole commentarial corpus, all the Chinese texts, and so on. Well, that's the idea. But basically what they found was that training a small model on just these texts produced output that was no better than, and possibly inferior to, ChatGPT etc.

So I think this is an ongoing area of research, but currently yeah, there seems little practical advantage in the smaller models at this point. I think there's a good chance that GPT-5 will widen the gap even further.

Sebastian told me his model unexpectedly turned out to be able to translate from Sanskrit to Pali. Is that spooky enough?

That's very impressive and I'd love to learn more if you could point me in that direction.

However, very few people were talking about "AI" when BERT was seen to have more-or-less cracked simultaneous translation of living languages (instead they chose the term NMT, neural machine translation), so, for whatever arbitrary reason, I think that's below the spookiness threshold to be labeled as AI. I feel like it's really only with image and text generation that the term exploded in use.

Another thing nobody calls AI, but that is honestly quite remarkable, is content recommendation algorithms. Ajahn Brahmali has been submitting his dhamma talks at the BSWA to a black-box machine learning algorithm (YouTube) for years and years now, and I think it's actually worked out quite well at identifying candidates to be radicalized by his rhetoric :stuck_out_tongue:

I think it's all a hot confusing mess and I look forward to seeing what you decide you approve / disapprove of.

2 Likes

O indeed, it's more of an uncanny wilderness than an uncanny valley. For sure, AI is mostly a marketing term. The things I have the most problems with are things that imitate humans. Like if you're on YouTube, and one video is open and another is recommended, no-one thinks that a human is behind the scenes doing that. It's still problematic, of course, because the recommendations somehow always push towards more radical and extreme content. But it's more problematic when, say, the comments section is swarmed by bots pretending to be human.

It gets even worse when there is a complex web of interactions between human and non-human entities. Here's a good example:

So, if I understand right, human actors are being paid to promote a scam without knowing it. The scam is derived from a genuine kind of scam, which is built upon another genuine kind of scam, crypto itself. But in this case it's a false scam, as the scam doesn't even take place; it just sends money from the "investor" straight to the scammers. It's like a triple-layered scamorama. Then the human actors are reinforced by bots in the comments. But inside the scam, the "investors" are warned not to fall for the kind of scam that they are actually falling for in that moment. Then other videos pop up exposing the scam. Obviously, they too are scams.

How much of the world is being burned to death to power this scamception?

3 Likes

This is the reason I left Quora. As a Q&A platform it was particularly susceptible. You ended up with the Quora "prompt generator" creating questions, then users cutting and pasting ChatGPT responses as answers, with the obligatory AI-generated image to spice things up.

The folk running Quora also replaced their excellent moderation team with an algorithm and introduced their own generative AI, "Poe". Plus they changed their default settings from "don't use my words to train an AI" to "yeah, let's do that".

The net result was a mass exodus and, even if the owners are looking to sell, a less valuable business.

2 Likes

It's an amazing time we are living in, isn't it?

Well, I seem to find interesting communities out there on the interwebs. It's certainly not all bad. As an aside, I had a look at Substack. There's certainly some good content there, but it can be hard to find. The endless feed is a dead end.

1 Like