Got any idea for a digital analysis of the EBTs?

michaelh · January 24, 2023, 4:08am

@jonas if you’d like to collaborate with @chaz and me, that’d be great.

I wrote a thing that passes suttas to OpenAI’s API and identifies the place name, if one is present in the sutta, and the speaker of all text inside quoted dialogue in Bhante Sujato’s translations. It also goes some way towards extracting where else a sutta took place if the interlocutors move to different locations, whether any other place names are mentioned, and whether there are lists. List extraction only works for text below the model’s max_tokens limit. I’m working on some ways around that now; at the moment a list is mostly generated only if it sits inside a specific sutta chapter. The list is often quoted directly from the text, so for now it’s hard to compare between suttas when a different term is used for the elements or the name of a list.
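One possible way around the max_tokens limit is to group a sutta's segments into chunks that each fit a token budget and send the chunks separately. The sketch below is only an illustration of that idea, not what my tool currently does. The function name `chunk_segments` is made up, and counting whitespace-separated words is a rough stand-in for the model's real tokenizer.

```python
def chunk_segments(segments, max_tokens, count_tokens=None):
    """Group consecutive segments into chunks within an estimated token budget.

    Assumption: whitespace word count approximates token count. Pass a real
    tokenizer-based counter as count_tokens for accurate budgeting.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    chunks, current, used = [], [], 0
    for seg in segments:
        cost = count_tokens(seg)
        # Start a new chunk when adding this segment would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        # A single segment larger than the budget still gets its own chunk.
        current.append(seg)
        used += cost
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = ["a b c", "d e", "f g h i", "j"]
    for chunk in chunk_segments(demo, max_tokens=5):
        print(chunk)
```

A list that straddles a chunk boundary would still be split by this approach, so overlapping the chunks by a segment or two might be needed in practice.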

Other locations and people mentioned inside a speech section are identified with Stanford NLP; otherwise it asks an LLM where the sutta is stated to have taken place.

Some examples of outputs are in the summary points in this post (click the triangles in the list). Note that the first example below is generated from JSON files of SuttaCentral MUID ranges, not from a prompt completion:

If I ran the speaker and place identification on all of DN to start, would that help populate your tables for places, or at least help you check them? Also, any pointers on what kind of output format might help, if you’re interested?

I love the way you’ve raised the question of identifying a sutta’s mode. I bet a set of terms for a Brahmanical technical-language mode, a set for more public debate with other public intellectuals, and sets for the language used in rural towns, with laypeople, royalty, sympathetic people, or even terms used in specific areas, might implicitly reflect the cultures of those areas. Awesome! Have you tried this for stemmed Pāli too, perhaps? Maybe certain Pāli synonyms might be more common in certain areas, or in suttas in various modes too.
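To make the term-set idea concrete, here is a minimal sketch that counts how often each mode's terms appear in a text. The mode names and English terms in `MODE_TERMS` are placeholders I made up for illustration, not a vetted vocabulary, and `mode_scores` is a hypothetical helper. For stemmed Pāli, the simple word tokenizer would presumably be swapped for a stemmer's output.

```python
import re
from collections import Counter

# Hypothetical term sets, for illustration only; a real study would need
# vocabularies chosen by someone who knows the texts.
MODE_TERMS = {
    "brahmanical": {"brahmin", "sacrifice", "vedas"},
    "royal": {"king", "chariot", "palace"},
}


def mode_scores(text, mode_terms=MODE_TERMS):
    """Return, for each mode, the total number of occurrences of its terms."""
    # Lowercase and split into word tokens (\w also matches letters like ā).
    words = Counter(re.findall(r"\w+", text.lower()))
    return {
        mode: sum(words[term] for term in terms)
        for mode, terms in mode_terms.items()
    }


if __name__ == "__main__":
    sample = "The king asked the brahmin about the Vedas. The brahmin replied."
    print(mode_scores(sample))
```

Raw counts favour long suttas, so dividing by the total word count would probably be a sensible next step before comparing texts.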