Got any idea for a digital analysis of the EBTs?

Honestly I can’t say. Some things have of course been discussed. But I’m not aware of too much in the area of digital forensics.

A few years ago, they used such tools to figure out that a book published under a pseudonym was in fact by JK Rowling. The holy grail would be to figure out the Buddha’s linguistic fingerprints …

This is a great idea that would be easy to implement!

Other interesting data could be linguistic analysis of some sort. Maybe trying to train the system to grow sensitive to various forms of Pāli (older vs. commentarial/later) and analyze metre types present. This could bring up something useful or inform us of larger patterns.

It would also be good to get surveys of formulas. An example would be the jhāna formulas but there are many other very minor ones that are perhaps more specific or less common but nevertheless recur; it may even be able to detect if something is ‘formulaic.’ Comparing a survey of those across the nikāyas, in comparison with the above data, with geography or people, etc. could potentially inform us of some things about the composition and compilation of suttas/formulas, their relative age, perhaps even some of the voices or geographic regions behind them? We may see slightly variant / conflicting forms of formulas that would otherwise be identical that could be interesting.
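As a rough illustration of the formula survey idea, one simple approach is to count word n-grams that recur across different texts. This is only a minimal sketch: the function name, the toy texts, and the thresholds `n` and `min_texts` are assumptions made up for the example, and real sutta text would need proper tokenization (punctuation, diacritics, variant spellings) first.

```python
from collections import defaultdict


def find_recurring_ngrams(texts, n=5, min_texts=2):
    """Return word n-grams that occur in at least `min_texts` different texts."""
    seen_in = defaultdict(set)  # n-gram -> ids of the texts containing it
    for text_id, text in texts.items():
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            seen_in[tuple(words[i:i + n])].add(text_id)
    return {
        " ".join(ngram): sorted(ids)
        for ngram, ids in seen_in.items()
        if len(ids) >= min_texts
    }


# Toy stand-ins for sutta texts; ids and wording are invented for illustration.
toy_texts = {
    "sutta_a": "quite secluded from sensual pleasures he enters the first absorption",
    "sutta_b": "then quite secluded from sensual pleasures she dwells at ease",
    "sutta_c": "the king asked about the fruits of the ascetic life",
}

recurring = find_recurring_ngrams(toy_texts, n=5, min_texts=2)
print(recurring)
```

Near-identical variants of a formula would not match exactly with this approach, so comparing slightly conflicting forms would need some fuzzy matching on top.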

Just some ideas :person_shrugging:

I‘ll try to get an appointment with the professor who did the AI-based analysis of the Agamas in the new year, maybe he can give a few pointers.

That would necessitate a neural network. Usually, you’d take a pre-made one trained on the language you’re trying to analyze, but I haven’t seen one for Pali yet, so it would probably take some rather in-depth knowledge both of Informatics and the Pali language to make a tool like that. Again, going off the Agama thing, it might be possible to retrain a network made for modern Bengali to read Pali instead, but that’s far beyond my skill level.
I know that there’s been a lot of scholarly attention to the “voices” of the Bible. For example, the young guy who appears near the end of the Book of Job was inserted by a later author, possibly some religious leader trying to justify his position. There’s probably been something similar going on for the Canon; it would be necessary to get a good overview of this type of literature before letting any neural network loose on the suttas.

If my understanding of the method is correct, topic modeling should be able to detect any formulas; another question I might be able to work on in this project.

This, on the other hand, would need a number of steps and mixed methods. Again, something to keep in mind for later projects or collaborations.

Since you’re using R, check out what @chaz put together:


Ajahn J.R.


Is there a version for Python as well?

From what I remember, he wanted to support that. I think it will be possible to import those file formats into Python (pandas, scikit-learn, etc.).

These are good.

I was thinking that I wanted to do a mind map of the concepts and the suttas which mention them. But if this can be done via some programming magic, it would be so cool.

That‘s very easy to do from the programming side, but you‘d have to come up with a list of all the stock phrases first. For example, if I remember correctly, some suttas don‘t mention the jhanas by name, but the phrase „secluded from greed and distress with reference to the world“ is code for jhana. The devil‘s in the details :wink:
Still, it would be much faster than going through thousands of pages manually.
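A minimal sketch of what the programming side could look like, assuming a hand-made list of stock phrases already exists. The mapping, the toy texts, and the function name are all hypothetical; the second phrase is simply the example given in the post above.

```python
from collections import defaultdict

# Hypothetical, hand-curated mapping from stock phrase to the concept it signals.
PHRASE_TO_CONCEPT = {
    "jhana": "jhana",
    "secluded from greed and distress with reference to the world": "jhana",
}


def build_concept_index(suttas, phrase_to_concept):
    """Map each concept to the sorted ids of suttas containing one of its phrases."""
    index = defaultdict(set)
    for sutta_id, text in suttas.items():
        lowered = text.lower()
        for phrase, concept in phrase_to_concept.items():
            if phrase in lowered:
                index[concept].add(sutta_id)
    return {concept: sorted(ids) for concept, ids in index.items()}


# Toy texts, invented for illustration only.
toy_suttas = {
    "toy_1": "He enters and remains in the first jhana.",
    "toy_2": "He dwells secluded from greed and distress with reference to the world.",
    "toy_3": "The king asked about the harvest.",
}

concept_index = build_concept_index(toy_suttas, PHRASE_TO_CONCEPT)
print(concept_index)
```

Adding a new phrase to the mapping and rebuilding the index is cheap, so the phrase list can grow as more stock phrases are identified.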

Thanks, that‘s a great resource! Working with the annotated files will save a lot of time :slight_smile:

Exactly, it’s not easy. It might take some time to look through and gather all the key phrases which refer to the same thing. I keep forgetting the key phrases for this or that; in my mind they are translated into this or that concept. So it’s hard to find the sutta to cite each time.

Perhaps one could just build the mind map for all phrases and words first, then key in that this phrase is equal to that one and see how the mind map changes as one manipulates the variables?

That should be doable for someone who has both the basic coding skills and advanced sutta knowledge or easy access to someone who can supply the missing part. For my project, I want to keep to something I can do with minimal outside help and additional research (as my Bachelor’s thesis will consume a lot of time).
We may find the time to cooperate after I have my degree or after I’m ordained, if it’s possible to fit it into the monastery schedule. You may also want to try your hand at it; my lecturer has a free online course which should provide you with everything you need. There may also be tutorials on some streaming service.


So we got a bit excited about this a while ago:


I‘m going with variation of modeled topics depending on place for the project. The modeling part is coming along nicely, but I‘m having some trouble with the places.
Specifically, I‘m making a table with town name, country name, government (republic, kingdom, independent), capital (true or false). However, on the Suttacentral map, many republics are also part of a mahajanapada, and I‘m unsure which should take precedence, i.e. what degree of independence conquered or allied republics enjoyed, and whether putting them down as republic, kingdom, or subjugated for the sake of analysis would yield the most insight. This is the case with Licchavi and Videha being part of Vajji, Anguttarapa being part of Anga, and Koliya and Sakya being part of Kosala. In the case of Patitthana, Assaka and Alaka overlap. Any opinions?

I’ve decided to go with countries only, following the convention in the Canon. Further steps can be done in the interpretation stage if needed.
As an appetizer, here’s the proportion of modeled topics per country in the DN:


Topics are still a little rough; toying with their number and common words to exclude can yield different results. For example, there’s usually an “introduction” topic for the parts where the speakers meet and exchange pleasantries (here it’s “sit side bow seat lie householder stand”), and in all previous models, poor Ananda had been secluded to those opening paragraphs, before the current model decided to move him to “fully awaken” instead for some reason.
Additionally, with only 30-something suttas in the DN, I’m surprised at how evenly distributed the topics are. As some countries only feature in one or two suttas, I would have expected the proportions to be way off just because of the low n.
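For readers wondering how proportions like these can be computed, here is a minimal sketch, assuming each text unit (a paragraph or a sutta) has already been assigned one dominant topic and one country. It is not the code used for the plot above; the function name and the toy rows are invented, and only the two topic labels are borrowed from the post.

```python
from collections import Counter, defaultdict


def topic_proportions_by_country(rows):
    """rows: iterable of (country, topic) pairs, one per text unit."""
    counts = defaultdict(Counter)
    for country, topic in rows:
        counts[country][topic] += 1
    proportions = {}
    for country, topic_counts in counts.items():
        total = sum(topic_counts.values())
        proportions[country] = {
            topic: count / total for topic, count in topic_counts.items()
        }
    return proportions


# Toy data, invented for illustration only.
toy_rows = [
    ("Magadha", "introduction"),
    ("Magadha", "fully awaken"),
    ("Magadha", "fully awaken"),
    ("Magadha", "introduction"),
    ("Kosala", "introduction"),
]

proportions = topic_proportions_by_country(toy_rows)
print(proportions)
```

The toy "Kosala" entry shows the low-n concern: with a single unit, one topic gets a proportion of 1.0, so reporting the counts next to the proportions is probably worthwhile.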

@jonas if you’d like to collaborate with @chaz and me, that’d be great.

I wrote a thing that passes suttas to OpenAI’s API and identifies the place name, if one is present in the sutta, and the speaker of all text inside quoted dialogue in Bhante Sujato’s translations. It also goes some way toward extracting where else a sutta took place if the interlocutors move to different locations, whether any other place names are mentioned, and whether there are lists. List extraction only works for text below the model’s max_tokens limit. I’m working on some ways around that now; at the moment a list is mostly generated only if it sits inside a specific sutta chapter. The lists are often quoted directly from the text, so at the moment it’s hard to compare between suttas if a different term is used for the elements or the name of a list.

Other locations and people mentioned inside a speech section are identified with Stanford NLP; otherwise it asks an LLM where the sutta is stated to have taken place.

Some examples of outputs are in the summary points in this post (click the triangles in the list). Note the first example of the below is generated from jsons of SuttaCentral MUID ranges, not from a prompt completion:

If I ran the speaker and place identification on all of DN to start, would that help populate your tables for places or at least help you to check it? Also any pointers on what kind of format output might help, if you’re interested?

I love the way you’ve raised the question of identifying a sutta’s mode. I bet a set of terms for Brahmanical technical language mode, a set for more public debate with other public intellectuals, language used in rural towns, with laypeople, royalty, sympathetic people, or even terms used in specific areas might reflect the cultures of the areas implicitly. Awesome! Have you tried this for stemmed Pāli too, perhaps? Maybe certain Pāli synonyms might be more common in certain areas or suttas in various modes too:


That‘s very flattering, but I think you overestimate my skills. Literally the only thing I‘m doing here is plugging my stuff into tutorial code and fine-tuning it. The stuff you guys are doing is so far beyond me I wouldn‘t be much help to you.

This is quite impressive!

Speaker identification would be super helpful for future projects. Having maybe a separate dataset with an added speaker column might open all sorts of possibilities.
My chosen research question can be answered with place only, though, and I‘m not doing the whole Canon, just MN and DN, which clock in at around 190 suttas or so. After some clumsy attempts at automation, I had a „screw it“ moment yesterday and just did the whole thing by hand. :sweat_smile: Turned out to be the right choice, too, since the opening paragraph only states where the people giving the discourse were staying at that time, but the discourse often took place at some other location.

Considering how similes in the discourses are geared toward the intended audience, I wouldn’t be surprised if the language differed as well. However, the Canon isn’t the literal word of the Buddha (and other speakers); we have more layers of interpretation, summary, and preparation for oral transmission here, so interpreting the results would be quite challenging and would require a lot of historical knowledge.
As I don’t speak even basic Pali and have put off learning until at least after my PoliSci degree, this is not something I’m competent at, even if there were a plug-and-play stemmer that didn’t require delving into SuttaCentral’s code :confused:

Well, I made the error of mentioning to my lecturer a suspicion that sutta theme variation would probably depend more on audience than place, and now I also have to make an analysis based on who each sutta was addressed to. :sleepy: Relevant categories might be monastics, ordinary laypeople, rulers, devas, and asuras.
If your algorithm can handle that (for MN and DN), it would be a great help. Otherwise I‘ll do it by hand :partying_face:


Categories of which type of people are being addressed by the Buddha or sangha for each sutta, or some inferred audience for a sutta? Hard to work out or put people into categories I would say. That custom of addressing the most senior person present could warp the results, would need a learned person to work out the implied audience of a sutta generally.

@chaz has this great strategy of searching the dictionary for each word in the canon and whether it’s a location or person, using R scripts. There may be ‘categories’ of person in that epic 1937 Pali proper name dictionary, unsure if that’s digitised. Or perhaps the current dictionaries on SC or DPD always mention someone as a monastic, king, queen, ascetic etc.

I am thinking of just running this script as it is on MN and DN, then assuming the Buddha as the implied speaker for much of SN and AN, where it’s not stated who is speaking. But who is the implied audience for many of these: the sangha? I have only just read that GPT-3 scores 90% on ‘coreference tasks’ like these, hence the imperfect results it’s producing, but Google’s latest models score 100%. Is 90% okay? I can run it this weekend on GPT-3 and then run it again when a model like that becomes available. It would be good to iron out issues and get this over and somewhat done with for me, hah! Or you could run Chaz’s script on the remaining suttas, which will give you data on all the person names in the suttas, but not whether they are talked about, talking, or postulated etc.

Whoa, hadn’t thought of that at all. That might complicate things immensely… Still, could we assume that if the most senior person addressed is some king or other, it would probably be a “king” teaching, and if the person addressed is Sariputta or Mahamoggallana, it would probably be a “monastic” teaching?

Glad I can help. :sweat_smile: Seriously, though, that would be awesome. 90% should be fine, Digital Humanities people are big on „distant reading“.

An early attempt at speaker identification for all speech in MN and DN, along with the locations where each sutta takes place or that it mentions, is available here, as promised! I will try to work on AN and SN this week. It will not be fully correct, but it is mostly okay. I have placed warning labels all around the files.

@jonas if you want to use the “who_mentioned” and “who_else_present” fields as well as the segment ID-only fields, you’ll need to split them by comma at the moment; there should be no commas inside names. I’ll keep using this format for the other two Nikayas so you should have some data for all 4 Nikayas.
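Splitting such fields could look roughly like this in Python. This is a sketch under assumptions: only the field names “who_mentioned” and “who_else_present” come from the post, while the record layout, the "sutta" key, and the helper name are made up for illustration.

```python
def split_field(value):
    """Split a comma-separated field into a list of trimmed, non-empty names."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


# Hypothetical record; the actual files may be laid out differently.
record = {
    "sutta": "toy_1",
    "who_mentioned": "Ananda, Sariputta",
    "who_else_present": "",
}

parsed = {
    key: split_field(record[key])
    for key in ("who_mentioned", "who_else_present")
}
print(parsed)
```

Returning an empty list for empty fields avoids ending up with a stray empty name when a sutta has nobody listed.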


The speakers and places inside SN are now up on the GitHub repository above, and other suttas in MN have been fixed up. I’ve done a fair amount of manual checking too. Much more difficult than the first two, that’s for sure!

It’s done! And just in time for the deadline, phew! Apologies for the quality - I should have realized from the get-go that doing this right would have taken an entire semester exclusively dedicated to coding lessons and reading up on the religious scholarship literature :sweat_smile: Please treat this as the uninformed dabbling that it is.

paper_anon.pdf (206.6 KB)