Exploring the Development of a ChatGPT Plugin for SuttaCentral

Hello everyone,

I am a software developer with a keen interest in Early Buddhism and the potential applications of AI in this field. I’ve been exploring the use of OpenAI’s ChatGPT model with SuttaCentral and have seen some promising results.

Currently, I’m using WebPilot, Link Reader, and MixerBox WebSearchG to facilitate this interaction. While these plugins have been helpful, I believe there’s potential for a more integrated solution: a dedicated ChatGPT plugin for SuttaCentral.

The idea is to leverage the capabilities of ChatGPT to analyze, summarize, and compare early Buddhist texts. This could be a powerful tool for personal study, making it easier to understand complex concepts and explore different interpretations. It could also be valuable for research, helping to identify relevant resources and suttas on less-understood topics.

Before I proceed further, I wanted to check in with this community. Is anyone else working on something similar? Would anyone be interested in collaborating on this project? I believe that by working together, we can create a tool that truly benefits the SuttaCentral community and beyond.

I look forward to hearing your thoughts and suggestions.

Best regards,
Jon

9 Likes

I think Ayya @Vimala may be interested! :slight_smile:
Also, of course, you may check with SC’s mastermind, bhante @sujato his thoughts on the idea and concept!
:anjal:

1 Like

Also tagging @michaelh

2 Likes

Hi @jgreen01, welcome to the forum!!

I have been working with the GPT-4 API, and now that ChatGPT plugins can be written by anyone skilled enough, it would be incredible to have a plugin for SC that fixes some of the current pitfalls. It would be wonderful if you are able to work on this; there are threads on this forum describing issues people have run into.

Would be good to craft it such that it always goes back to the text, knows segments, and is able to iterate through such large documents in a relevant way, using its Pāli knowledge only where it helps. I would love to collaborate, if that’s amenable!

I have around 2800 lines of Python that I used to produce what’s currently at GitHub - michaelh-sc/suttamap (sutta metadata), so if any of that is useful to you, I would be happy to share it. It goes some way toward splitting suttas into a good context length, if that is how you want to do it. I have limited myself to pulling things out of the text and making it parseable, with references back to the segment, but a plugin would need to be much broader.
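
In case it helps picture it, here’s a toy sketch of that segment-based splitting (not the actual suttamap code; the tiktoken tokeniser and the segment-id → text dict format are assumptions I’m making for illustration):

```python
# Toy sketch: pack bilara-style segments into chunks under a token budget,
# keeping segment IDs so every chunk can point back to the source text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokeniser family used by ChatGPT models

def chunk_segments(segments: dict[str, str], max_tokens: int = 3000) -> list[dict[str, str]]:
    chunks: list[dict[str, str]] = []
    current: dict[str, str] = {}
    used = 0
    for seg_id, text in segments.items():
        n = len(enc.encode(text))
        if used + n > max_tokens and current:
            chunks.append(current)  # close the chunk before it overflows
            current, used = {}, 0
        current[seg_id] = text      # keep the segment ID for back-references
        used += n
    if current:
        chunks.append(current)
    return chunks
```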

Some thoughts:

  • ChatGPT summarises between each question and answer to fit everything into its small context length (4096 tokens); if those summarisations could be customised via a plugin, that would be great. I wrote my own for supplying previous answers to the next block, to identify the speaker of things inside quotes (a sketch of the idea follows this list).
  • Questions that relate to multiple texts would obviously need many prompts. I suspect it has already seen a lot of SC in pre-training, but to answer questions across the corpus it would have to be presented with the whole corpus, which costs something like USD $17 per question for the four nikāyas, and each prompt is essentially starting afresh.
  • GitHub - openai/chatgpt-retrieval-plugin (“The ChatGPT Retrieval Plugin lets you easily find personal or work documents by asking questions in natural language.”) might work OK for cross-corpus questions and would be worth testing, I think; it might be great for narrowing the search targets to a single vector database!
  • It might be good if specific technical words could be defined before a prompt is sent; much Buddhist technical language is misinterpreted.
  • I wonder what kind of testing would be needed to find the pitfalls of the web-search plugins, or what kinds of hallucinations, or answers that are too “helpful” given the biases of the RLHF process. I have run into that a bit across many prompts: the model carries assumptions about who it is writing an answer for, even when I set the system prompt and say otherwise explicitly in the actual prompts.
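
Here is the sketch I promised: a minimal custom rolling summary, assuming the current openai Python package (openai.ChatCompletion) with an API key already configured; the prompts are placeholders, not what I actually use:

```python
# Sketch: a custom rolling summary so a long Q&A exchange fits the 4096-token
# window, preserving segment IDs and speaker attributions inside quotes.
import openai  # assumes openai.api_key is set in the environment or code

def summarise_history(history: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Condense this Q&A history. Preserve sutta/segment IDs "
                        "and note who is speaking inside any quotes."},
            {"role": "user", "content": history},
        ],
    )
    return resp.choices[0].message.content

def ask(question: str, history: str) -> str:
    context = summarise_history(history) if history else ""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer only from the quoted sutta text."},
            {"role": "user",
             "content": f"Summary so far:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```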

I can set up a private thread with others on the forum who might be interested in working out the feasibility of this too, if you like? Who is keen, I wonder?

3 Likes

Hello @michaelh,

Thank you for your enthusiastic response and offer to collaborate. Your insights are incredibly valuable. I agree with setting up a private thread for brainstorming and planning. Let’s start with a minimal pilot version and consider adding more advanced functionality later.

I’m looking forward to learning from your experience with the GPT-4 API and contributing to this project. Gabriel_L and Bhante Sujato, thank you again for your recommendations. I’m excited about potentially connecting with Ayya @Vimala.

I’m grateful for the support from everyone in this thread. Looking forward to our collaboration.

Metta,
Jon

1 Like

We’ve been working on an integration of GPT-4 and our language translation models on the Linguae Dharmae project, thanks to the great work that @SebastianN is doing at the moment. The results have been absolutely stunning, with much higher translation accuracy. On top of that, the system can output an explanation of the text it translates.

To give an example:

Input: 諸一切種諸冥滅 拔眾生出生死泥

敬禮如是如理師 對法藏論我當說

Output (Explain words and structure):

The English translation is: “Eliminating all kinds of darkness and pulling sentient beings out of the mud of cyclic existence, I pay homage to such a true master of reality. I shall properly expound the Abhidharma Collection.”

Explanation of structure and meaning of Chinese words:

  • 諸一切種諸冥滅 (Eliminating all kinds of darkness): 諸 (all), 一切種 (all kinds), 諸冥 (all darkness), 滅 (eliminate)
  • 拔眾生出生死泥 (Pulling sentient beings out of the mud of cyclic existence): 拔 (pull), 眾生 (sentient beings), 出 (out), 生死 (cyclic existence), 泥 (mud)
  • 敬禮如是如理師 (I pay homage to such a true master of reality): 敬禮 (pay homage), 如是 (such), 如理師 (master of reality)
  • 對法藏論我當說 (I shall properly expound the Abhidharma Collection): 對 (towards), 法藏 (Abhidharma Collection), 論 (discuss/expound), 我當說 (I shall explain)

The sentence structure follows a pattern of describing the actions and qualities of the master of reality, expressing reverence and homage, and then stating the intention to expound the Abhidharma Collection.

The main problem with this system, however, is the cost of the GPT-4 API, which is the main reason we cannot make it public yet.

What we are doing is just one application; there are many ways ChatGPT (cheaper) and GPT-4 could be used on SC for sure! I think it would be good to first discuss possible applications, without duplicating work that might have been done elsewhere.

10 Likes

Hello everyone,

I appreciate the ongoing discussion and the valuable insights shared. I’d like to clarify a bit about how I envision the ChatGPT plugin for SuttaCentral.

A ChatGPT plugin is a tool that currently only ChatGPT Plus subscribers have access to. Setting it up would involve configuring ChatGPT to communicate with the SuttaCentral API using an OpenAI plugin manifest file. This file would be hosted on one of the domains that SuttaCentral owns, and then we’d register it with OpenAI, making the plugin available for everyone to use. I see this as a straightforward way to drive more interactions with users and help people engage with early Buddhist texts.
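
To give a sense of what that manifest looks like, here’s a rough sketch; every value below is a placeholder of mine rather than a decision, and the authoritative field list is in OpenAI’s plugin documentation:

```json
{
  "schema_version": "v1",
  "name_for_human": "SuttaCentral",
  "name_for_model": "suttacentral",
  "description_for_human": "Search and read early Buddhist texts on SuttaCentral.",
  "description_for_model": "Fetch sutta texts, translations, and parallels from the SuttaCentral API. Always quote retrieved text; never invent sutta content.",
  "auth": { "type": "none" },
  "api": {
    "type": "openapi",
    "url": "https://suttacentral.net/.well-known/openapi.yaml"
  },
  "logo_url": "https://suttacentral.net/img/logo.png",
  "contact_email": "contact@example.org",
  "legal_info_url": "https://suttacentral.net/licensing"
}
```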

I imagine this plugin as a tool that allows ChatGPT to access SuttaCentral to answer questions. For example, a user might ask, “Can you summarize sutta MN10 and get a list of its parallels with their summaries as well?” or “Tell me about The Simile of the Saw sutta. What translations do you have available in Bengali and English?” The plugin would retrieve the information or text of the suttas from SuttaCentral, and ChatGPT could process it based on the user’s requests. I see this as more of a study tool.
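
Under the hood, the plugin would mostly make thin calls to SuttaCentral’s API and hand the JSON back to ChatGPT. A minimal sketch of what I mean; the endpoint paths here are my guesses from watching the site’s network requests and would need verifying against the real API:

```python
# Sketch of the plugin's backend for a request like "summarize MN10 and list
# its parallels". Endpoint paths are guesses and need checking against SC.
import requests

BASE = "https://suttacentral.net/api"

def get_suttaplex(uid: str) -> dict:
    """Metadata for a sutta: title, blurb, available translations."""
    return requests.get(f"{BASE}/suttaplex/{uid}", timeout=10).json()

def get_parallels(uid: str) -> dict:
    """Parallels recorded for a sutta."""
    return requests.get(f"{BASE}/parallels/{uid}", timeout=10).json()

if __name__ == "__main__":
    print(get_suttaplex("mn10"))   # ChatGPT would receive this JSON to summarize
    print(get_parallels("mn10"))
```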

Interestingly, ChatGPT has been trained on the Pali Canon and some Mahayana sutras. This means it might be able to do more than just respond to direct questions about specific suttas. It could potentially provide information or links to suttas where specific ideas are discussed.

Here’s a link to an article with a step-by-step guide on creating a ChatGPT plugin: How to Create a ChatGPT Plugin.

Ayya @Vimala, regarding your translation project, this plugin wouldn’t be creating ready-to-publish translations. However, if we configure it to use the API to fetch different languages, ChatGPT Plus subscribers could ask it to translate one of the suttas or sutras that hasn’t been translated into their language yet. It might be a helpful tool for a translator to use.

@michaelh, regarding your idea of making connections across the whole Canon, that might be better done with a custom-trained model on the Pali text and translations. As mentioned before, ChatGPT can’t translate Pali well, probably because it saw little or no Pali in training. In theory, we could take one of the open-source LLMs and train it on the Pali and/or Chinese and/or Tibetan Canons, producing an LLM that could translate, summarize, and find connections between them.

Here’s the HuggingFace leaderboard for open source LLMs: HuggingFace Open LLM Leaderboard

Many of these are nearly at ChatGPT’s level and might actually surpass it soon. However, training an LLM would require significant computational resources. Unless someone has a very powerful computer or a lot of credit with one of the cloud computing providers, this might not be feasible for us to do at the moment. However, it’s an exciting possibility to consider for the future.

Thank you all for your enthusiasm and support. I’m looking forward to our collaboration and the journey ahead.

Metta,
Jon

4 Likes

Hi everyone, really cool work here!

I would love to help out. I work in a computational field, though I’m not a software person; I’d love to pitch in however I can and learn along the way, whether on the translation or the plugin. One immediate thing I can contribute: I have partial access to some large computers, time permitting. I’m not too familiar with training models, but perhaps I can work with someone here to figure out the resource requirements/specs and train a custom model tailored for our use cases.

Additionally, one potentially interesting use case to explore: perhaps by using LLMs on different bodies of text (say, scholarly commentary vs average user commentary, or worse yet, what is in the rest of the internet/ChatGPT right now), we could use LLMs in a meta-analysis kind of way, to perhaps bring out common misperceptions, and different levels of understanding and language trip-ups. I would imagine that chatbots customized to draw from different bodies of text would probably give different answers, which could be illuminating.

3 Likes

Hi @StudentOfLife
Welcome to Discover and Discuss (D&D) forum. We hope you find the discussions here beneficial and enjoyable. Please make use of the various resources, FAQs, and previous threads; you can use the search function for topics and keywords you are interested in. Forum guidelines are here:
https://discourse.suttacentral.net/guidelines

If you have any questions or need further clarification regarding anything, feel free to contact the moderators by including @moderators in your post or a PM.
With Metta,
trusolo
On Behalf of the Moderators,

3 Likes

Hello @StudentOfLife,

Thank you for your interest in the project and for bringing up the question about the resources required for training a custom large language model (LLM) like GPT-3.

Training a model of this scale is a resource-intensive task that typically requires a high-performance computing environment. Here are the estimated hardware requirements:

Minimum Hardware Requirements:

  • A modern Linux OS (such as Ubuntu 20.04).
  • An NVIDIA Ampere architecture GPU or newer with at least 8 GB of GPU memory.
  • At least 16 GB of system memory.
  • Docker version 19.03 or newer with the NVIDIA Container Runtime.
  • Python 3.7 or newer with PIP.
  • A reliable Internet connection for downloading models and data.
  • Permissive firewall, if serving inference requests from remote machines.

Ideal Hardware Requirements:

  • Multiple high-end GPUs, such as the NVIDIA A100 GPUs, with NVLink for GPU-to-GPU communication.
  • A large amount of system memory (RAM), ideally more than 256GB.
  • High-speed storage (SSD or NVMe SSD) to store the training data and model checkpoints.
  • A high-speed internet connection for downloading datasets and models.

As for the time it would take, training a model like GPT-3 can take weeks or even months, depending on the specific configuration and scale of the model being trained. The machine would need to run more or less continuously for that period (with regular checkpointing to guard against interruptions).
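
For a very rough sense of the memory side, a common rule of thumb, which I’m using here only as an approximation, is about 16 bytes of GPU memory per parameter for full training with the Adam optimizer in mixed precision (fp16 weights and gradients, plus fp32 optimizer states and master weights), before counting activations:

```python
# Back-of-envelope VRAM for full training with Adam in mixed precision:
# 2 B (fp16 weights) + 2 B (fp16 grads) + 8 B (fp32 Adam moments)
# + 4 B (fp32 master weights) = ~16 bytes per parameter, before activations.
def training_vram_gb(n_params: float, bytes_per_param: int = 16) -> float:
    return n_params * bytes_per_param / 1e9

for n in (1.3e9, 13e9):
    print(f"{n / 1e9:>5.1f}B params -> ~{training_vram_gb(n):.0f} GB + activations")
# 1.3B -> ~21 GB; 13B -> ~208 GB, hence multi-GPU setups or LoRA-style methods.
```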

I hope this gives you a clearer picture of what would be involved in training a custom LLM. It’s a significant undertaking, but the potential benefits for understanding and analyzing early Buddhist texts could be immense.

For more detailed information on deploying a large-scale model like GPT-3, you might find this NVIDIA blog post helpful. It provides a comprehensive guide on the hardware and software requirements, as well as the process of deploying a 1.3 billion parameter GPT-3 model.

1 Like

@jgreen01 Thanks! I’ll go through that blog post; 1.3B is very modest.

I have access to an A100 (80GB VRAM) for development. If 8GB VRAM (like in the blog post) can train a 1.3B model, we could perhaps train up to a 13B model – I’ve seen some translation models that are 13B. From the public discussions, it seems that around 30B is when models acquire more depth and nuance, but 13B is still very functional.

Once we have a better sense of the mechanics, whether we want to train or simply to fine tune, perhaps we could have discussions about the cost-benefit analysis of using cloud servers to train larger models. If we’re really optimistic maybe we could even try to apply for a computing time grant.

In the meantime, perhaps we can brainstorm more concretely what we would like an LLM to do for us. It seems like there are a few directions/questions right now:

  1. Improved translation? We’d have to think about how this would complement Ayya Vimala’s and SebastianN’s work. It could also open up access to commentaries in other languages.
  2. Tools to correct misinterpretations.
  3. A study tool, as suggested by @jgreen01.
  4. How do we account for evolving knowledge/understanding? I imagine commentaries from scholars 100 years ago read quite differently from more recent scholarship. It’s not that newer is necessarily better, but it would be instructive to see how things evolve.
  5. From the discussion raised by @michaelh, it seems that what would make the tool most helpful is if it were really good at pointing back to the source material (and perhaps even providing some modest reasoning/explanation for its thought process, e.g. via langchain), instead of simply paraphrasing/summarizing (essentially what ChatGPT does). For this purpose, it may be that simply embedding the text into a vector database, a one-off cost, would be sufficient.
  6. Also from @michaelh, working on prompts to help correct for biases. (I would be very curious to hear about some of the problems @michaelh encountered.)

From the above, I think 3) and 6) are the lowest-hanging fruit, while 5) can be partially done via vector embedding, which is also relatively easy, with interesting questions about context; a toy sketch of that embedding step follows.
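
To make 5) concrete, here’s what I have in mind for the one-off embedding step; the model name, segment IDs, and snippet texts are all placeholders, and a real vector database would replace the in-memory arrays:

```python
# Toy sketch for 5): embed each segment once, then answer questions by cosine
# similarity against the stored vectors, returning segment IDs to cite.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

segments = {  # placeholder segment IDs and paraphrased snippets
    "mn10:1.2": "The four kinds of mindfulness meditation ...",
    "sn56.11:1.3": "These two extremes should not be cultivated ...",
}
ids = list(segments)
vectors = model.encode([segments[i] for i in ids], normalize_embeddings=True)

def nearest(query: str, k: int = 2) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since vectors are normalized
    return [ids[i] for i in np.argsort(scores)[::-1][:k]]

print(nearest("What are the four satipatthanas?"))
```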

Longer term, a better Pali model and translation/mistranslation model would be very interesting, but would need a lot of careful thought because words can be translated in many ways. How do we make sure we retain nuance, and not create a tool that misleads people?

5 Likes

Ah, I just came across this older thread that is relevant: Training AI models on the suttas - #6 by pjteh

Wondering how we can coordinate and avoid duplicating efforts. Is there another, more appropriate discussion forum for this work?

1 Like

Thank you StudentOfLife for your enthusiastic response and for offering to contribute your resources and ideas to this project. Your access to computational resources could indeed be a game-changer for us, especially in the early stages of development and testing.

I appreciate your thoughtful suggestions on the potential directions for the project. The idea of creating a study tool and working on prompts to correct biases resonates with me, and I believe it aligns well with the initial vision for this project.

Your point about accounting for evolving knowledge/understanding is also very insightful. It’s indeed a challenge we need to consider as we move forward.

As for coordinating efforts and avoiding duplicating work, I agree that it’s crucial. Perhaps we could start by mapping out the current landscape of similar projects and identifying areas where our project could add the most value.

I’m looking forward to discussing these ideas further and starting to shape the direction of this project. I’m also eager to hear the thoughts of @Vimala, @michaelh, and others in this discussion.

Thank you again for your valuable input.

Metta,
Jon

2 Likes

I am not a technical person or computer geek, but over the last two weeks my attention has been drawn to artificial intelligence and its scary implications (I watched a few YouTube videos on that). I just watched this video interview with Jack Kornfield and Sam Altman (more positive and encouraging) on mindful possibilities for AI to raise collective wisdom and intelligence on the planet, and wanted to share it here for all.

Youtube: ChatGPT CEO on Mindfulness, AI and the Future of Life.

peace & metta

3 Likes

It might be worth setting up a GitHub repo with a boilerplate ChatGPT plugin, just to see what kind of improvements are possible for English and Pāli. SuttaCentral’s content is CC-licensed and you can’t charge for it, which I think was your first concern @jgreen01. Since ChatGPT plugins require people to pay for a subscription, I wonder if that breaks SC’s licence. I tend to think people are already paying the subscription for ChatGPT rather than for SC content, a bit like paying for an Internet subscription.

One recent update to GPT-4, from a week ago, that might allow it to refer to MUIDs/segments easily is “function calling”, which allows output and input as JSON. Getting the plugin to answer an arbitrary question on a large sutta is a good start, I think; what do you all think? Playing with Tree of Thoughts prompting might help for accuracy reasons. When I first read about CoT-SC and ToT, it reminded me of some of my own mindfulness practices of re-thinking and reflecting before deciding things (not to lend weight to those kinds of claims from OpenAI’s chief scientist, or to that Sam Altman interview from @Nava; I hadn’t seen it, thanks! Some of his answers are very glib, but he says he is a meditation practitioner and asks Jack Kornfield for advice, so OK!).
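
A toy sketch of how function calling could keep answers anchored to segments; the functions/function_call parameters are from OpenAI’s announcement, while the get_segments function itself is a hypothetical of mine, not SC’s real API:

```python
# Sketch: describe a fetch-by-segment function so the model requests exact
# text as JSON instead of reciting from memory. get_segments is hypothetical.
import json
import openai  # assumes openai.api_key is set

functions = [{
    "name": "get_segments",
    "description": "Fetch sutta text by segment range, e.g. mn10 segments 1.1-1.5",
    "parameters": {
        "type": "object",
        "properties": {
            "uid": {"type": "string", "description": "Sutta UID, e.g. mn10"},
            "segment_range": {"type": "string", "description": "e.g. 1.1-1.5"},
        },
        "required": ["uid", "segment_range"],
    },
}]

resp = openai.ChatCompletion.create(
    model="gpt-4-0613",  # first model version with function calling
    messages=[{"role": "user", "content": "Quote the opening of MN 10."}],
    functions=functions,
    function_call="auto",
)
call = resp.choices[0].message.get("function_call")
if call:
    # We would fetch the segments, then send them back in a follow-up message.
    print(call["name"], json.loads(call["arguments"]))
```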

From my perspective, the first task is just to make answers accurate: let it traverse the suttas and answer questions more truthfully about Buddhist scriptures from various sources, and maybe inspire new dhamma understandings/talks, rather than being only a chat interface to SC. 6) is probably worth leaving until later, as it’s quite hard, so I’d focus on 5) and 3). 1) and 2) would follow; anything that reflects translated texts more accurately already assists translation, I think.

Translation is obviously its own specific ML task; GPT-4/ChatGPT must have done a lot to get its translations to their current quality. It might be better to inform actual translators than to try to be a better translator, imo. Just listen to some translation decisions from Bhante @Sujato or Bhikkhu Bodhi and it’s clear how incredibly complex a choice of translation is: context, history, decades of deliberation, living traditions, private debates and private research, use of alternate definitions that are usually ignored, style, consistency, etc. Even obscure texts outside the suttas can determine the only accurate translation, as with the word “bodhisatta”, for example.

On fine-tuning, I’ve looked at some possibilities, and so far I haven’t found many good domain-specific fine-tuned models built on large document sources; if you have seen some, please let me know! I won’t test until I see a working model in the wild. Much fine-tuning is not about knowledge bases or knowing texts accurately, but about instruction following. I think the best strategy would be to set up some fine-tuning datasets, or code to produce such training datasets, and keep it model-agnostic until a domain-specific model is shown to quote texts accurately and with the right licensing (a sketch of the dataset idea is below). There is already code on SC for ingesting the multiple languages into the ArangoDB search DB, so maybe that can be reused.
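
Something like this as a starting point; the instruction template and the hard-coded segment are placeholders, the point being only that the output format stays model-agnostic JSONL:

```python
# Sketch: turn segmented suttas into a model-agnostic instruction dataset
# (JSONL), aimed at training exact quoting rather than paraphrase.
import json

def make_examples(uid: str, segments: dict[str, str]):
    for seg_id, text in segments.items():
        yield {
            "instruction": f"Quote segment {seg_id} of {uid} exactly.",
            "input": "",
            "output": text,  # reward reproducing the text, not paraphrasing it
        }

segments = {"mn10:1.1": "So I have heard. ..."}  # placeholder segment
with open("sutta_instructions.jsonl", "w", encoding="utf-8") as f:
    for ex in make_examples("mn10", segments):
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```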

There’s a real art to pre-training base models specifically; it’s easy to overfit or otherwise get wrong (many domain-specific base models released by big tech fell flat), and it has specific requirements. It would be ideal, but as you mentioned @jgreen01, if a short test can be done to fine-tune a 1.3B or 13B model, even on a small section of the suttas(?), it would be good to see whether there are any hints about how a bigger model would go.

Suggested models for fine-tuning tests:

  • Falcon-40B Instruct model. I don’t think the licence allows supplying the model or its weights publicly, but maybe fine-tuned final layers are OK. The licence is weird in that if something makes over $10 million you have to give them a cut :/, and I’m unsure what other weirdness is in it. Uses ALiBi, which in theory allows training on very large context lengths.
  • Llama and its counterparts like Vicuna and, just this week, Orca. Licensed for “research” only; I don’t know what that means in practice, but I think we definitely could not distribute the model or its weights under that licence.
  • BigScience BLOOM-176B did not measure up so well. But the infrastructure and training setup to share GPUs over the Internet and use LoRA/QLoRA to train large models with consumer GPUs (petals.ml) might be a good way to train if the base model we choose is too big for your A100, or would take too long given your wonderful offer @StudentOfLife. If they ever get a Llama or Falcon model up, it could be a good alternative; they’ve mentioned on their Discord that other open-source models are in the works. I have some old hardware with 24GB VRAM and a new card with 16GB. The Petals devs also said they’ll be introducing a kind of compute-for-compute credit system, like torrents. BLOOM also uses ALiBi.
  • PrimeQA (transformer-based, but older). It still gets high benchmark scores for Q&A, which is a big use case here I think, and training is short. They keep adding models and it’s by MIT.

3 Likes

Yes, I agree that developing the plugin is the first step. As for the subscription, I see it more like an internet subscription. It’s important to remember that this technology is incredibly new, and OpenAI is just the first to offer it. In the future, it’s likely that users won’t need a subscription, or there will be free LLM options that support plugins like this.

Full disclosure, while I have a strong background in programming, my experience with AI/ML is limited. However, I’m about to start a Data Science Masters program, so I’m eager to learn more about these fields. Training a model is a new concept for me, but it’s one that I’m keen to explore.

The envisioned plugin would allow ChatGPT to access SuttaCentral to answer specific sutta-related queries. I see great potential in this project as a study and possibly research tool, and I’m excited to collaborate with you all.

1 Like

Hello everyone,

I’m excited to share that we’ve made some progress! We’ve set up a GitHub organization called suttaML, thanks to @michaelh. This will be our main workspace for the project.

We’ve also created two projects within the organization: one for the ChatGPT plugin and another for the custom LLM. The ChatGPT plugin project already has some stories added, based on our discussions in this thread; the custom LLM project is still empty, as we haven’t started working on that aspect yet.

The next steps involve setting up a dev environment and testing a proof-of-concept (POC) version of the plugin, so that we all understand how it works.

@Nava, thank you for sharing the interesting video with the ChatGPT CEO. It’s always insightful to hear from the people behind the technology we’re using.

Lastly, I want to express my deepest gratitude to everyone involved in this project. Your contributions, ideas, and support are truly invaluable. A special shout-out to our software development team, specifically @michaelh, @StudentOfLife, and @harishvs. Your expertise, dedication, and collaboration are driving this project forward. Here’s to our continued collaboration, success, and to the exciting journey ahead!

Best regards,
Jon

3 Likes

I’m not sure if you are interested in discussing the dangers/drawbacks of “AI” as it relates to your work, but I just found out about this issue raised over MDN’s recent addition of “AI” to explain code examples in their documentation.

I had always thought that one of the areas where these LLMs excelled was in things related to coding. However, you can see from the discussion that it’s doing a very bad job. And as with all LLM work, it gives false information quite shamelessly. Of course, software does not have shame, which I guess is the root of the problem.

I wonder how you are going to address these issues with your plugin.

I’m also unclear how an LLM is going to do a better job than a human at summarizing suttas. We already have summaries of suttas done by a (highly qualified) human in the Dictionary of Pali Proper Names, and there are very brief summaries already here on SC that were created by another (highly qualified) human.

If you aren’t interested in discussing these things in this thread, then perhaps it could be split off.

2 Likes

Hello @Snowbird,

Thank you for your thoughtful response. I understand your concerns, and I’d like to clarify a few things about the two projects we’re working on: the ChatGPT plugin and the custom large language model (LLM).

The ChatGPT plugin is a tool designed to help ChatGPT access up-to-date information. It’s essentially a way for ChatGPT to connect to data it was not trained on and to interact with other services. In our case, we’re aiming to use it to make accurate information from the SuttaCentral database available. Any translation or interpretation that someone may ask of ChatGPT would be outside the control of the plugin; the plugin is solely a conduit for information.

The custom LLM, on the other hand, is a more ambitious project that we haven’t started yet. It would likely take many months to create and more months to test. The goal of the custom LLM is a model trained specifically on the Buddhist Canons and related texts, which could potentially provide more accurate, context-aware translations and hopefully facilitate new research that would be too time-consuming today. However, the target audience would be translators, scholars, and monastics, who all have a high level of knowledge and could spot errors. It’s not intended to replace human translation or research, but rather to be a tool that assists those who are already deeply familiar with the texts.

I hope this clarifies our intentions with these projects. We’re very much in the exploratory phase and are open to feedback and suggestions!

Best,
Jon

2 Likes

And I think here is the root of my issue with using SC material for LLM training.

At its core, the purpose of an LLM is to lie effectively. They are not Large Knowledge Models. Their purpose is to make sure that the next word they say convincingly sounds like a human saying it. There is nothing in the software that tries to make sure the words they are saying are accurate.

It’s for this reason that I feel that these LLMs are actually antithetical to the Dhamma. Lying is not an accident, it’s the point.

By training ChatGPT on sutta material, all you are doing is teaching it how to lie more effectively when asked questions about the suttas. I think it’s kind of irresponsible to simply say that what people do with the tool is outside of your control.

I’m sorry to sound like such a downer.

2 Likes