Got any idea for a digital analysis of the EBTs?

jonas · December 22, 2022, 4:14pm

So, there‘s this project I have to do at university. Had to fill an elective slot and instead of having another political theory seminar, I took Digital Humanities 101. What we have to do to get graded is some (basic) analysis of a large body of text (or related data) using R, and I was thinking that maybe some of you, particularly the Venerables, might have an idea of how to analyze the EBTs in a way that is actually of use, either to practitioners or even to research. The more specific your idea, the better! That means that if the research question we come up with is about one specific word (as is often the case in EBT scholarship), it has a good chance of being both doable and relevant.
Methods we‘re taught and may use in the project:

stylometry (comparing text, f.ex. a computer-based comparison of Bodhi, Sujato, and Thanissaro translations)
topic modeling (finding themes)
network analysis (relating themes to each other, f.ex. links between different factors of awakening found in suttas; particularly, I‘m thinking here of Thanissaro Bhikkhu‘s The Wings to Awakening, which goes in-depth on the „holographic“ nature and non-linear (spiral) patterns evident in the interrelation of factors, but, while intriguing, that is likely too complex a subject for the project)
geovisualization (showing stuff on maps, f.ex. the frequency of places mentioned in suttas)

I‘d prefer to work with translations, but in some special limited cases there might be a possibility to work with Pali.
If we find a useful and doable research question, I‘ll release the paper, scripts, and data here.

Khemarato.bhikkhu · December 22, 2022, 5:24pm

The first idea that pops into my head:

Take a (fairly) large list of doctrinal terms of interest and then do some network analysis based on their co-occurrence in the same suttas. It would be very interesting to see what that graph would look like. What terms occur together rarely? Are there clusters? What are they?

To do this, it would almost certainly be easiest to use Bhante Sujato’s translations as they are in JSON already and use a fairly consistent translation for doctrinal terms.
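A minimal sketch of the co-occurrence idea, in Python rather than R. The term list and the three texts are invented for illustration only; real input would be loaded from the translation JSON files, and the plain substring matching used here is an assumption that would need refining (it cannot tell "immersion" from a longer word containing it).

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: sutta ID -> translated text (invented sentences).
suttas = {
    "sutta_a": "Mindfulness leads to immersion, and immersion to wisdom.",
    "sutta_b": "With ethics as a basis, one develops restraint.",
    "sutta_c": "Ethics, immersion, and wisdom are praised together.",
}

# Hypothetical list of doctrinal terms of interest, all lowercase.
terms = ["ethics", "mindfulness", "immersion", "wisdom"]


def cooccurrence(texts, vocabulary):
    """Count, for each pair of terms, the number of texts containing both."""
    edges = Counter()
    for text in texts.values():
        lowered = text.lower()
        # Sort so that each pair is always stored in the same order.
        present = sorted(t for t in vocabulary if t in lowered)
        edges.update(combinations(present, 2))
    return edges


# Edge weights of the graph: pair of terms -> number of shared suttas.
edges = cooccurrence(suttas, terms)

# Pairs of terms that never appear in the same sutta.
never_together = [
    pair for pair in combinations(sorted(terms), 2) if edges[pair] == 0
]
```

The `edges` counter can be handed to any graph library as a weighted edge list; clusters would then show up as groups of terms with heavy edges among themselves.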

Anyway, just one idea! Best,

sujato · December 22, 2022, 7:39pm

While we’re on the topic, we could also try collocating various other kinds of data with geolocation.

First, why? Because geographical data, while limited, is often the single most precise and consistent form of information that connects the suttas with the “real world”.

Consider the recent discussions about Kosalan philosophy, the spread of brahmanical ideas, the movement to the south and so on. All these relate the teachings in some way with location.

We could try to see if patterns emerge from comparing location with:

people
caste/social status
political context (did the Buddha teach differently in republics than in kingdoms?)
doctrines (were certain doctrines favored in certain areas?)
literary forms (verse vs prose)
linguistic trends (length of words or word use frequency, which would require Pali)
collections (are certain texts associated with certain regions?)
religious beliefs (particularly the yakkhas and the like, which are likely to be based on local folk beliefs and are often geographically linked, eg. Alavaka is the yakkha of Alavi, Hemavata is the yakkha of the Himalayas.)

Within the geographic data we might compare:

rural vs urban settings
north/south, east/west
inner and outer regions
impact of geography on teachings, eg. the hills of Rajagaha or the flooding of the rivers

Allowing for the limitations of the data, even negative findings would be meaningful. If the Buddha did not change his teaching in different regions, this is significant!

jonas · December 22, 2022, 8:43pm

Great ideas, and a lot of them, too! I imagine that once the code framework is up for one of the easier questions (like caste/status), it wouldn‘t be too much of a hassle to answer a handful more (doctrines per area and @Khemarato.bhikkhu‘s doctrine network idea, verse/prose, linguistic trends, collections). Not promising anything, though, we‘ll have to see how it goes.

But I‘m curious, has none of this been asked and conclusively answered in the scholarly literature? If so, this would be a veritable gold mine! I‘d have to take some more classes and continue this type of research after ordaining.
There has been at least one bigger project my lecturer did with someone from the religious studies department where they retooled a neural network trained on modern Chinese to analyze the Agamas. No idea what they were looking for, though. I‘ll ask for the research paper and link it if it‘s been released.

sujato · December 22, 2022, 11:12pm

Honestly I can’t say. Some things have of course been discussed. But I’m not aware of too much in the area of digital forensics.

A few years ago, they used such tools to figure out that a book published under a pseudonym was in fact by JK Rowling. The holy grail would be to figure out the Buddha’s linguistic fingerprints …

Vaddha · December 23, 2022, 12:10am

This is a great idea that would be easy to implement!

Other interesting data could be linguistic analysis of some sort. Maybe trying to train the system to grow sensitive to various forms of Pāli (older vs. commentarial/later) and analyze metre types present. This could bring up something useful or inform us of larger patterns.

It would also be good to get surveys of formulas. An example would be the jhāna formulas but there are many other very minor ones that are perhaps more specific or less common but nevertheless recur; it may even be able to detect if something is ‘formulaic.’ Comparing a survey of those across the nikāyas, in comparison with the above data, with geography or people, etc. could potentially inform us of some things about the composition and compilation of suttas/formulas, their relative age, perhaps even some of the voices or geographic regions behind them? We may see slightly variant / conflicting forms of formulas that would otherwise be identical that could be interesting.
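One simple way to survey candidate formulas, sketched in Python under the assumption that a "formula" can be approximated as a word sequence of fixed length that recurs in more than one text. The function names, the window length of five words, and the toy texts are all made up for illustration; this does not catch formulas with small variations.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def recurring_ngrams(texts, n=5, min_texts=2):
    """Return n-grams that occur in at least min_texts different texts."""
    seen_in = defaultdict(set)
    for text_id, text in texts.items():
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            seen_in[" ".join(tokens[i:i + n])].add(text_id)
    return {gram: ids for gram, ids in seen_in.items() if len(ids) >= min_texts}


# Invented example texts, not quotations from any translation.
texts = {
    "text_a": "Then he sat down to one side and spoke to the teacher.",
    "text_b": "The king arrived. He sat down to one side and waited.",
    "text_c": "They walked along the river bank in the evening.",
}

formulas = recurring_ngrams(texts)
```

Overlapping hits such as "he sat down to one" and "sat down to one side" would still have to be merged into one longer formula in a later step.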

Just some ideas
Mettā

jonas · December 23, 2022, 8:34am

I‘ll try to get an appointment with the professor who did the AI-based analysis of the Agamas in the new year, maybe he can give a few pointers.

That would necessitate a neural network. Usually, you‘d take a pre-made one trained on the language you‘re trying to analyze, but I haven‘t seen one for Pali yet, so it would probably take some rather in-depth knowledge both of Informatics and the Pali language to make a tool like that. Again, going off the Agama thing, it might be possible to retrain a network made for modern Bengali to read Pali instead, but that‘s far beyond my skill level.
I know that there‘s been a lot of scholarly attention to the „voices“ of the Bible. For example, the young guy who appears near the ending of the Book of Job has been inserted by a later author, possibly some religious leader trying to justify his position. There‘s probably been something similar going on for the Canon; it would be necessary to get a good overview of this type of literature before letting any neural network loose on the suttas.

If my understanding of the method is correct, topic modeling should be able to detect any formulas; another question I might be able to work on in this project.

This, on the other hand, would need a number of steps, and mixed methods. Again, something to keep in mind for later projects or collaborations.

Jhanarato · December 23, 2022, 12:20pm

Since you’re using R, check out what @chaz put together:

Cheers,

Ajahn J.R.

kora · December 23, 2022, 3:27pm

Is there a version for Python as well?

Jhanarato · December 24, 2022, 12:48am

From what I remember, he wanted to support that. I think it will be possible to import those file formats into Python/pandas/scikit etc.

NgXinZhao · December 27, 2022, 5:35pm

These are good.

I was thinking that I wanted to do a mind map of the concepts and the suttas which mention them. But if this can be done via some programming magic, it would be so cool.

jonas · December 27, 2022, 10:11pm

That‘s very easy to do from the programming side, but you‘d have to come up with a list of all the stock phrases first. For example, if I remember correctly, some suttas don‘t mention the jhanas by name, but the phrase „secluded from greed and distress with reference to the world“ is code for jhana. The devil‘s in the details.
Still, it would be much faster than going through thousands of pages manually.
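To give a rough idea of the programming side, here is a Python sketch of a concept index built from a stock phrase list. The phrase list and the sutta texts are hypothetical placeholders; compiling the real list is the hard part and is not solved here.

```python
from collections import defaultdict

# Hypothetical mapping from stock phrase (lowercase) to the concept it signals.
stock_phrases = {
    "first absorption": "jhana",
    "secluded from sensual pleasures": "jhana",
    "noble eightfold path": "path",
}

# Invented example texts, not quotations from any translation.
suttas = {
    "sutta_a": "Quite secluded from sensual pleasures, they dwell at ease.",
    "sutta_b": "They enter and remain in the first absorption.",
    "sutta_c": "This is the noble eightfold path.",
    "sutta_d": "The king asked about the harvest.",
}


def concept_index(texts, phrases):
    """Map each concept to the sorted IDs of the texts that contain
    at least one of its stock phrases."""
    index = defaultdict(set)
    for sutta_id, text in texts.items():
        lowered = text.lower()
        for phrase, concept in phrases.items():
            if phrase in lowered:
                index[concept].add(sutta_id)
    return {concept: sorted(ids) for concept, ids in index.items()}


index = concept_index(suttas, stock_phrases)
```

Because several phrases can point to the same concept, adding a newly discovered phrase only means adding one line to `stock_phrases` and rebuilding the index.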

Thanks, that‘s a great resource! Working with the annotated files will save a lot of time.

NgXinZhao · December 28, 2022, 12:04am

Exactly, it’s not easy; it might take some time to look through and gather all the key phrases which refer to the same thing. I keep forgetting the key phrases for this or that, having translated them in my mind into this or that concept. So it’s hard to search for the sutta for citation each time.

Perhaps one could just do the mind map for all phrases and words first, then key in that this phrase is equal to that one and see the changes in the mind map as one manipulates the variables?

jonas · December 28, 2022, 4:42pm

That should be doable for someone who has both the basic coding skills and advanced sutta knowledge or easy access to someone who can supply the missing part. For my project, I want to keep to something I can do with minimal outside help and additional research (as my Bachelor‘s thesis will consume a lot of time).
We may find the time to cooperate after I have my degree or after I‘m ordained, if it‘s possible to fit it into the monastery schedule. You may also want to try your hand at it; my lecturer has a free online course which should provide you with everything you need. There may also be tutorials on some streaming service.

Jhanarato · December 29, 2022, 2:46am

So we got a bit excited about this a while ago:

jonas · January 21, 2023, 1:26pm

I‘m going with variation of modeled topics depending on place for the project. The modeling part is coming along nicely, but I‘m having some trouble with the places.
Specifically, I‘m making a table with town name, country name, government (republic, kingdom, independent), capital (true or false). However, on the SuttaCentral map, many republics are also part of a mahajanapada, and I‘m unsure which should take precedence, i.e. what degree of independence conquered or allied republics enjoyed, and whether putting them down as republic, kingdom, or subjugated for the sake of analysis would yield the most insight. This is the case with Licchavi and Videha being part of Vajji, Anguttarapa being part of Anga, and Koliya and Sakya being part of Kosala. In the case of Patitthana, Assaka and Alaka overlap. Any opinions?
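One way to avoid deciding on precedence up front is to record both levels and choose at analysis time. The Python sketch below assumes an extra `part_of` column that is not in the table described above; the three rows are illustrative and should be checked against the map before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    town: str
    country: str
    government: str                # "republic", "kingdom", or "independent"
    capital: bool
    part_of: Optional[str] = None  # assumed extra field: larger territory, if any


# Illustrative rows only.
places = [
    Place("Savatthi", "Kosala", "kingdom", True),
    Place("Rajagaha", "Magadha", "kingdom", True),
    Place("Kapilavatthu", "Sakya", "republic", True, part_of="Kosala"),
]


def grouping_key(place, merge_dependents):
    """Return the territory a place is counted under.

    With merge_dependents=True a dependent republic is folded into the
    larger territory; with False it stays a unit of its own.
    """
    if merge_dependents and place.part_of is not None:
        return place.part_of
    return place.country


merged = [grouping_key(p, merge_dependents=True) for p in places]
separate = [grouping_key(p, merge_dependents=False) for p in places]
```

Running the analysis once with each setting would show whether the choice changes the results at all.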

Edit:
I’ve decided to go with countries only, following the convention in the Canon. Further steps can be done in the interpretation stage if needed.
As an appetizer, here’s the proportion of modeled topics per country in the DN:

[Image: dn_topicbycountry]

Topics are still a little rough; toying with their number and common words to exclude can yield different results. For example, there’s usually an “introduction” topic for the parts where the speakers meet and exchange pleasantries (here it’s “sit side bow seat lie householder stand”), and in all previous models, poor Ananda had been secluded to those opening paragraphs, before the current model decided to move him to “fully awaken” instead for some reason.
Additionally, with only 30-something suttas in the DN, I’m surprised at how evenly distributed the topics are. As some countries only feature in one or two suttas, I would have expected the proportions to be way off just because of the low n.
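For reference, a Python sketch of how such proportions could be computed, assuming the topic model outputs one weight vector per sutta. The numbers are invented, not taken from the actual model; the function also returns the number of suttas per country so that a low n stays visible next to each proportion.

```python
from collections import defaultdict

# Hypothetical per-sutta topic weights, as a topic model might output them.
doc_topics = {
    "sutta_1": {"country": "Magadha", "weights": [0.6, 0.3, 0.1]},
    "sutta_2": {"country": "Magadha", "weights": [0.2, 0.5, 0.3]},
    "sutta_3": {"country": "Kosala", "weights": [0.1, 0.1, 0.8]},
}


def topic_share_by_country(docs):
    """Average the topic weights of all suttas set in the same country.

    Returns (shares, counts): the mean weight vector per country and the
    number of suttas each mean is based on.
    """
    sums = {}
    counts = defaultdict(int)
    for doc in docs.values():
        country, weights = doc["country"], doc["weights"]
        if country not in sums:
            sums[country] = [0.0] * len(weights)
        sums[country] = [s + w for s, w in zip(sums[country], weights)]
        counts[country] += 1
    shares = {
        country: [s / counts[country] for s in totals]
        for country, totals in sums.items()
    }
    return shares, dict(counts)


shares, counts = topic_share_by_country(doc_topics)
```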

michaelh · January 24, 2023, 4:08am

@jonas if you’d like to collaborate with @chaz and me, that’d be great.

I wrote a thing that passes suttas to OpenAI’s API and identifies the place name, if present in the sutta, and the speaker of all text inside quoted dialogue in Bhante Sujato’s translations. It also goes some way to extract where else a sutta took place if the interlocutors move to different locations, whether any other place-names are mentioned, and also whether there are lists. List extraction only works for text below the model’s max_tokens; I’m working on some ways around that now. At the moment each list is generated mostly if it’s a list inside a specific sutta chapter. The list is often quoted directly, so at the moment it’s hard to compare between suttas if a different term is used for the elements or name of a list.
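One common workaround for a token limit is to split a long text into paragraph-aligned chunks that each fit the budget. The Python sketch below is not the script described above; the function name is made up, and counting whitespace-separated words is only a crude stand-in for the model's real tokenizer.

```python
def chunk_paragraphs(paragraphs, max_tokens, count_tokens=lambda s: len(s.split())):
    """Group consecutive paragraphs into chunks of at most max_tokens each.

    A single paragraph that exceeds the budget on its own is still emitted
    as an oversize chunk rather than being cut mid-sentence.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        cost = count_tokens(para)
        # Start a new chunk when this paragraph would overflow the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Toy input: each "paragraph" is just a few placeholder words.
chunks = chunk_paragraphs(["a b c", "d e", "f g h i", "j"], max_tokens=5)
```

A list that straddles a chunk boundary would still be split in two, so the per-chunk results would need a merge step afterwards.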

Other locations and people mentioned inside a speech section are identified with Stanford NLP; otherwise it asks an LLM where the sutta is stated to have taken place.

Some examples of outputs are in the summary points in this post (click the triangles in the list). Note that the first example below is generated from JSONs of SuttaCentral MUID ranges, not from a prompt completion:

If I ran the speaker and place identification on all of DN to start, would that help populate your tables for places or at least help you to check it? Also any pointers on what kind of format output might help, if you’re interested?

I love the way you’ve raised the question of identifying a sutta’s mode. I bet a set of terms for Brahmanical technical language mode, a set for more public debate with other public intellectuals, language used in rural towns, with laypeople, royalty, sympathetic people, or even terms used in specific areas might reflect the cultures of the areas implicitly. Awesome! Have you tried this for stemmed Pāli too, perhaps? Maybe certain Pāli synonyms might be more common in certain areas or in suttas in various modes too:

jonas · January 24, 2023, 8:53am

That‘s very flattering, but I think you overestimate my skills. Literally the only thing I‘m doing here is plugging my stuff into tutorial code and fine-tuning it. The stuff you guys are doing is so far beyond me I wouldn‘t be much help to you.

This is quite impressive!

Speaker identification would be super helpful for future projects. Having maybe a separate dataset with an added speaker column might open all sorts of possibilities.
My chosen research question can be answered with place only, though, and I‘m not doing the whole Canon, just MN and DN, which clock in at around 190 suttas or so. After some clumsy attempts at automation, I had a „screw it“ moment yesterday and just did the whole thing by hand. Turned out to be the right choice, too, since the opening paragraph only states where the people giving the discourse were staying at that time, but the discourse often took place at some other location.

Considering how similes in the discourses are geared toward the intended audience, I wouldn‘t be surprised if the language differed as well. However, the Canon isn‘t the literal word of the Buddha (and other speakers); we have more layers of interpretation, summary, and preparation for oral transmission here, so interpreting the results would be quite challenging and would require a lot of historical knowledge.
As I don‘t speak even basic Pali and have put off learning until at least after my PoliSci degree, this is not something I‘m competent at, even if there was a plug and play stemmer that didn‘t require delving into SuttaCentral‘s code.

jonas · February 16, 2023, 1:56pm

Well, I made the error of mentioning to my lecturer a suspicion that sutta theme variation would probably depend more on audience than place, and now I also have to make an analysis based on who each sutta was addressed to. Relevant categories might be monastics, ordinary laypeople, rulers, devas, and asuras.
If your algorithm can handle that (for MN and DN), it would be a great help. Otherwise I‘ll do it by hand.

michaelh · February 16, 2023, 2:59pm

Do you mean categories for which type of people are being addressed by the Buddha or sangha in each sutta, or some inferred audience for a sutta? It’s hard to work out, or to put people into categories, I would say. The custom of addressing the most senior person present could warp the results; it would need a learned person to work out the implied audience of a sutta generally.

@chaz has this great strategy of looking up each word in the canon in the dictionary to see whether it’s a location or person, using R scripts. There may be ‘categories’ of person in that epic 1937 Pali proper name dictionary; I’m unsure if that’s digitised. Or perhaps the current dictionaries on SC or DPD always mention whether someone is a monastic, king, queen, ascetic etc.
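The lookup strategy can be sketched in a few lines of Python. This is not @chaz's R code; the small `name_types` table is a hypothetical stand-in for a digitised proper-name dictionary, and the sentence is invented.

```python
import re

# Hypothetical lookup table: lowercase name -> type of entity.
name_types = {
    "savatthi": "place",
    "rajagaha": "place",
    "ananda": "person",
    "pasenadi": "person",
}


def tag_names(text, lookup):
    """Return (word, type) pairs for every word found in the lookup table,
    in the order the words appear in the text."""
    words = re.findall(r"\w+", text.lower())
    return [(word, lookup[word]) for word in words if word in lookup]


# Invented example sentence.
tags = tag_names("Ananda went to Savatthi, where King Pasenadi was waiting.", name_types)
```

A lookup like this only says that a name occurs; it cannot tell whether the person is speaking, being addressed, or merely mentioned. If the dictionary entries carried a role such as king or monastic, that could be stored as a third field.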

I am thinking of just running this script as it is on MN and DN, then treating the Buddha as the implied speaker for much of SN and AN, where it’s not stated who is speaking. But who is the implied audience for many of these: the sangha? I have only just read that GPT-3 scores 90% on ‘coreference tasks’ like these, hence the imperfect results it’s producing, but Google’s latest models score 100%. Is 90% okay? I can run it this weekend on GPT-3 and then run it again when a model like that becomes available; it would be good to iron out issues and get this over and somewhat done with for me, hah! Or you could run Chaz’s script on the remaining suttas, which will give you data on all the person-names in the suttas, but not whether they are talked about, talking, or postulated etc.