Exploring the Development of a ChatGPT Plugin for SuttaCentral

Is there a way to get an LLM to quote the source of the information associated with the output, every time a statement is made?
Especially when discussing EBTs (or any other specific topic), we notice how important precise references are. If the source of information is given, then at least someone can verify it.

3 Likes

Thank you Venerable @Snowbird for starting a really important conversation!

I work in a university, and we share similar concerns, esp. with regard to supporting proper student growth.

One analogy that has helped me sort through all the concerns is the parallel to the massive proliferation of content on the internet over the past two decades, and the relationship of end-users to all that content. History has many lessons for us (both things to worry about and potential solutions) that I hope we can take forward. The SuttaCentral community has been a critical part of managing and mediating this online relationship between dharma and student, and has a lot of institutional experience that is super valuable in this regard.

This is not a perfect parallel, but I think one particularly apt example for thinking through the human-centered part of the equation is wikipedia, because of its purported stature and authority. I came of age as wikipedia was just starting, and back then there were lots of questions that are still pertinent to ChatGPT-like tools today: How much trust should we place in it? How do we ensure quality? What purpose does it serve, and what kind of tool is it? How do we use it responsibly? How do we educate students in its use?

The long-term answer to these questions, for me, is that the responsible way to use wikipedia is as a map to ideas/content/vocabulary that are new to me, so that I can begin looking for better sources, and ideally, people to talk to.

I suspect that, once all the hype comes back down to reality, something similar will happen with ChatGPT. It will forever change the way people begin to access information (just as google search and wikipedia did). I think it is important to get ahead of the many problems that you have identified. Just as it became important to teach students to develop a measured approach to wikipedia (i.e. it is not “ground truth”), we as a community have an important role in guiding the responsible use of, relationship to, and improvement of ChatGPT-like tools.

Thank you for joining the discussion and guiding us!

1 Like

Yes, I believe some plugins will do that, @Ric. The web plugins show their URL sources, but not necessarily how they arrived at their conclusions.

2 Likes

But the ability to misrepresent information is a design feature, not a limitation. The purpose of these LLMs is to use language convincingly and fluently, not to provide correct information. That the information they present is sometimes correct is a side effect, not a feature.

I played around quite a bit with ChatGPT asking it questions about Buddhism. It was very impressive how convincingly it could present wrong information. But the number of actual suttas it could present wrong information about was quite limited. If your plugin is attempting to train ChatGPT on a larger body of texts, then it will eventually be able to very convincingly present wrong information about a larger body of texts. No matter how much you feed it, it will still not distinguish between true and false information. And the whole while it will be shockingly convincing.

But they don’t understand or know truth at all. That’s the issue. They only know patterns.

I’m sorry if I seem hyperbolic, but you can’t put poison in someone’s drink and absolve yourself by putting a label on it. Especially if you don’t know how that drink is going to be served. If you want my ethical advice, it would be to not put poison in people’s drinks. Don’t ask me about how you can ethically put poison in people’s drinks.

And besides, once someone feeds the texts into the model, they are no longer in a position to even put a warning on things. So then they’ve poisoned the drink but it won’t be served in a glass with the warning label.

I really don’t mean my criticisms personally. It’s a general issue. But I’m still not seeing any benefit in feeding suttas into these LLMs. I’m open to hearing one, but I haven’t so far.

Honestly, I don’t think this is a parallel at all. Encyclopedias could never be used as primary sources in academic situations. And honestly, Wikipedia has never been considered an authority. Moreover, the purpose of Wikipedia is in fact to present references to primary sources. And if there are inaccuracies, there is a method for correcting them. This is completely different from these LLMs, whose purpose is to sound fluent in language and hide the primary sources, all the while being impossible to correct.

There is no shortage of articles out there talking about the problems with these LLMs, but I found this one especially interesting:

From the article…
5 Likes

@Snowbird I agree with you 100% on all your concerns! And we have no illusions that LLMs are “intelligent.”

Encyclopedias could never be used as primary sources in academic situations. And honestly, Wikipedia has never been considered an authority. Moreover, the purpose of Wikipedia is in fact to present references to primary sources. And if there are inaccuracies, there is a method for correcting them.

Re: wikipedia, 20 years ago, when it was hard to find information on the internet and primary sources were hard to verify, the talking point and the fear among teachers was precisely that students would treat wikipedia like an encyclopedia and as authoritative. Of course we know that’s the wrong way to use wikipedia, but the concern was that people wouldn’t know…

We’re not talking about academic settings, where everybody knows that encyclopedias don’t serve as proper references for scholarship in and of themselves, but about the general population, your “lay person” so to speak, especially students, who need to look something up but are not necessarily building more scholarship on top of those resources and might get sloppy with their methods. I think these issues of misplaced trust are perennial topics. I think we all agree because we’re all concerned about people misusing and misplacing trust in LLMs.

There are lots of important topics to cover. Wikipedia had to create governance systems to maintain the quality of the knowledge. Similar things have to be done today.

I think one potential concern, like you said, is naively training LLMs on suttas and trusting their output. This is also something that we are wary about, and we are not currently looking to train on suttas. One potential strategy around this is to use ChatGPT only as a “natural language frontend” for interpreting user input, while on the backend routing that input to trusted, non-LLM resources like the primary reference analysis systems michaelh is building, or other translation tools from the SC community. I work in a scientific field where information integrity is similarly important; the strategy that many in the sciences are currently turning towards is the same: use ChatGPT only as a “natural language API”, and do all the critical information retrieval/processing work with more traditional, transparent, and better validated non-LLM systems.
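
To make that concrete, here is a minimal sketch of the “natural language frontend” pattern, assuming a hypothetical SuttaCentral search endpoint and the public OpenAI chat completions API; the endpoint path, parameters, and JSON fields are illustrative, not the actual plugin design.

```python
# Sketch: the LLM only turns a question into search keywords; the actual lookup
# and all citations come from a trusted, non-LLM backend.
import json
import os
import requests

OPENAI_URL = "https://api.openai.com/v1/chat/completions"

def question_to_keywords(question: str) -> str:
    """Ask the model ONLY to extract search keywords, never to answer directly."""
    resp = requests.post(
        OPENAI_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [
                {"role": "system",
                 "content": "Extract 2-5 search keywords from the user's question "
                            "about the Pali suttas. Reply with the keywords only."},
                {"role": "user", "content": question},
            ],
        },
        timeout=30,
    )
    return resp.json()["choices"][0]["message"]["content"]

def search_suttacentral(keywords: str) -> list:
    """Hypothetical call to a trusted, non-LLM search backend (endpoint assumed)."""
    resp = requests.get("https://suttacentral.net/api/search",
                        params={"query": keywords, "limit": 10},
                        timeout=30)
    return resp.json()

if __name__ == "__main__":
    q = "Which sutta compares the five aggregates to foam and bubbles?"
    print(json.dumps(search_suttacentral(question_to_keywords(q)), indent=2))
```

The point of the split is that the generative model never produces the “facts”; it only mediates between natural language and a conventional, auditable search system.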

One of our fears, like you have highlighted, is that the naive student out there will simply use ChatGPT and get all the wrong answers. We want to get ahead of the game and provide something that is as easy to use but much more trustworthy (e.g. because the backend is non-LLM, or is something that we actually validate extensively). Your concerns and this conversation are important in helping make an actually trustworthy tool. Building a tool that the community can control and can continuously validate, check, and fix is of the utmost importance, just as it is for maintaining wikipedia’s quality. The real work is not in the technology but in the community, and we are mindful that any tool needs to be developed, improved, and vetted with meticulous care. I hope you can continue to offer your wisdom and guidance.

Thank you!

3 Likes

The purpose of an LLM is to process input data into output data. An LLM cannot lie, because it cannot know and it cannot have intentions - there is nothing conscious in it: it is just a program.

The purpose of these LLMs is to convincingly imitate the use of language in a fluent way, but not to use it, because for a program there is no language, only data to process.

They don’t even have the ability to understand anything. They cannot know anything either; they simply transform input data into output data according to patterns.

In other words, an LLM is just a tool, like a hammer or a computer. How one uses this tool is not up to the tool, but only up to the user.

4 Likes

Hello everyone,

I wanted to provide a quick update on our progress with the ChatGPT plugin for SuttaCentral. We’ve made some significant strides and currently have a proof of concept (POC) in place. This is a big step forward and shows the potential of what we’re working on.

However, we’ve hit a bit of a waiting period. To move forward, we need plugin development access from ChatGPT. Unfortunately, this can take some time - we’ve heard it can be a matter of weeks. We’ve submitted our request and are eagerly awaiting their response.

While we wait, we’re not standing still. We’re going to start figuring out how to train a Large Language Model (LLM). This is a new challenge, but we’re excited to dive into it.

@SebastianN, we’ve seen the great work you’ve been doing with the Linguae Dharmae project. We’d love to collaborate with you on this if you’re interested.

We appreciate your patience and support during this time. We’re excited about the potential of this project and can’t wait to share more updates soon.

Best regards,
Jon

2 Likes

This video gets very close to my own view of AI as someone with a BS in IT and a translator of ancient texts.

The basic problem with it is that:

  1. It’s not intelligent, therefore,
  2. Someone intelligent has to babysit it at all times,
  3. But it can be useful for basic processing of big data sets.
3 Likes

In early July, OpenAI released Code Interpreter to all ChatGPT Plus users.

This could be an alternative method to the SC plugin. You can just upload an SQLite database of all the suttas and let ChatGPT query it, e.g. with full-text search. Then it will generate responses based on the query results.
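
For example, here is a rough sketch of how such a database could be built with SQLite’s FTS5 full-text index; the record contents are placeholders for however the translations get exported.

```python
# Build a small full-text-searchable SQLite file that could be uploaded to Code Interpreter.
import sqlite3

conn = sqlite3.connect("suttas.sqlite")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS suttas USING fts5(uid, title, body)")

# Placeholder records; in practice these would come from the exported translations.
records = [
    ("sn22.95", "Phena Sutta", "Form is like a lump of foam, feeling is like a bubble ..."),
]
conn.executemany("INSERT INTO suttas VALUES (?, ?, ?)", records)
conn.commit()

# Code Interpreter (or anything else) can then run full-text queries against the file:
for uid, title in conn.execute(
        "SELECT uid, title FROM suttas WHERE suttas MATCH ? ORDER BY rank", ("foam",)):
    print(uid, title)
```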

With this method it is all free - no need to pay for the API or ask permission to release the plugin.

1 Like

Code Interpreter is still a premium feature available only to paid subscribers

+1 for LangChain integration (#5)

I think having models that can retrieve suttas is more useful than training a Buddhist chat agent, as it would eventually hallucinate, adding more confusion to the path. Also, fine-tuning a base model with dhamma talks and suttas does not feel right either, because the foundation model can have all kinds of rubbish in it from data crawled in unethical ways.

We can also use vector embeddings to increase the relevance of search results; that’s a quick win.
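
This is only a sketch of what I mean, assuming an off-the-shelf sentence-embedding model and a couple of placeholder passages; the model name and texts are not from any existing SuttaCentral setup.

```python
# Rank passages by cosine similarity to the query instead of exact keyword overlap.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model would do

passages = {
    "sn22.95": "Form is like a lump of foam, feeling is like a bubble ...",
    "mn10":    "Mindfulness of breathing, postures, and the body in the body ...",
}
ids = list(passages)
doc_vecs = model.encode([passages[i] for i in ids], normalize_embeddings=True)

def semantic_search(query: str, top_k: int = 3):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:top_k]
    return [(ids[i], float(scores[i])) for i in best]

print(semantic_search("simile about foam and the aggregates"))
```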

Hey, I have been following this with interest.

Any word from ChatGPT about the Plugin Development Access request?

Hi thrasymachus,

Welcome to the D&D forum! We hope you enjoy the various resources, FAQs, and previous threads. You can use the search function for topics and keywords you are interested in. Forum guidelines are here: Forum Guidelines. May some of these resources be of assistance along the path.

If you have any questions or need further clarification regarding anything, feel free to contact the moderators by including @moderators in your post or a PM.

Regards,
trusolo (on behalf of the moderators)

1 Like

Nope, we’ve heard nothing. Plus I’ve been too busy working and studying to make any progress with a development version.

No matter what, people at large will be using ChatGPT for information about Buddhism. That information will be drawn from vast amounts of data outside of the suttas, thus incorporating wrong information into the output.

As a check and balance to that, the advantage of using only data from SuttaCentral (the suttas) is that the output may be significantly more free of misinformation. Then it’s up to

2 Likes

I was listening to a podcast today where a senior leader at Microsoft was describing / selling what sounded like a very simple and potentially helpful use case for LLMs - search.

The example he provided was an internal use case where Microsoft put its internal health insurance plan policy documents through a vectorization process so that terms with a lot of synonyms, e.g. eyeglasses, glasses, corrective lenses, etc., are all encoded as semantically close, and a search for any one of them would pull up them all (so someone could quickly find out which plans cover them).

This reminded me of my experience searching SuttaCentral, how I’ll frequently do several back-to-back searches trying to find a sutta I half remember, thinking “Was she called Migara’s mother in this sutta, or was she called Visakha?” Since, as has been pointed out, the “foundation” models are already pre-trained on Buddhist texts, I think they’d already place many of these semantically similar Buddhist terms close together in the vector space without requiring any fine-tuning or retraining.

He also said that, of course, sometimes the older methods work better, but because in the end you collapse the search results into a single ranking, you can easily use a combined approach.
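
As a toy illustration of that combined approach, here is a sketch that blends a keyword score (e.g. a full-text/BM25 rank) with a vector-similarity score; the scoring values and sutta IDs are stand-ins, not anything SuttaCentral’s search actually exposes.

```python
# Blend two score dictionaries (keyed by sutta uid) into one ranked list.
def hybrid_rank(keyword_scores: dict, vector_scores: dict, alpha: float = 0.5) -> list:
    def norm(scores):
        """Min-max normalise so the two score scales are comparable."""
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    kw, vec = norm(keyword_scores), norm(vector_scores)
    combined = {uid: alpha * kw.get(uid, 0.0) + (1 - alpha) * vec.get(uid, 0.0)
                for uid in set(kw) | set(vec)}
    return sorted(combined.items(), key=lambda kv: -kv[1])

# Example: one document scores well on keywords, another only on embedding similarity.
print(hybrid_rank({"mn107": 0.9, "an8.43": 0.4}, {"an8.43": 0.8, "ud8.8": 0.7}))
```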

I don’t have a real web / software development background (I do data analysis), so I don’t know how SuttaCentral search works right now, or how hard it would be to implement this in practice (probably harder than the Microsoft official doing PR claimed), but this is an example of how I think the technology could be used in an ethically neutral / positive way.

2 Likes

I don’t know where the development of a proper ChatGPT plugin is, but here’s a link to a custom GPT I made to help me with questions of the Dhamma: https://chat.openai.com/g/g-Dk2ZdcXe0-dhamma-guide

Here are its directions: Dhamma Guide embodies the qualities of both a knowledgeable scholar and an empathetic guide. When addressing scholarly topics, it demonstrates deep knowledge and insight, providing detailed and accurate explanations of early Buddhist texts and teachings. In matters of daily practice and meditation advice, it adopts an empathetic and supportive tone, understanding the challenges and queries of practitioners, and offering guidance that is both practical and rooted in early Buddhist principles. This blend of scholarly depth and empathetic guidance ensures that Dhamma Guide meets the diverse needs of its users, whether they seek academic understanding or practical application of Buddhist teachings.

I haven’t plugged any custom data into it or inserted any script, so it’s basically just GPT-4 with a prompt. Still, it’s pretty good. It helps me find suttas on the regular.

2 Likes

These aren’t actually directions; it’s a description, a description of what we all wish ChatGPT to be! Empathetic!

Do I have to spend US$20 to see what your prompt is, or will you share it here?

1 Like

Namo Buddhaya!

I think that the best possible LLMs that one could make now would be specialized, trained by a certain person, on certain translations & interpretations, and on content such as certain forum posts, all fully customized by the trainer.

I guess that this is the most worthwhile way to implement the LLMs.

My point is that the model will only be as good as the trainer can make it and if the trainer himself holds wrong views then the model will be an extension of that.

3 Likes

It would be cool if it did “…” expansion. I was able to get this to work some of the time using the web interface. It’s really nice to be able to expand a sutta that is a single sentence into a few paragraphs like in the SN.

I made a PDF of all the suttas, fed it into a PDF plugin, and asked ChatGPT questions. The results were so-so. For example, when I asked it about the tetralemma, I had to explain the tetralemma to it. It thought samadhi and immersion were separate things.

https://chat.openai.com/share/2b396db7-8462-42b6-a6e8-37886706bb54
https://chat.openai.com/share/c6d07f88-2ed5-41bd-9cd3-8084d81a33c9

This is the combined PDF with Ven. Sujato’s introductions stripped out. When I tried it with the original PDFs, ChatGPT preferred answering based on Ven. Sujato’s commentary over the suttas. suttas_nO-sujato.pdf - Google Drive

2 Likes