Pali Canon Stylometric Analysis

Hello everyone,

I’m Massimiliano from Rome, Italy. Nice to meet you!

I’m new here. Background in Oriental Studies and programming, which led me to this project combining both.

I’ve been working on a stylometric analysis of the Pāli Canon - using machine learning to try to identify chronological layers in the texts. The model is trained on passages with broad scholarly consensus (early Suttanipāta vs. later Khuddaka material) and then scores segments across the entire Canon.

The model gets 0.859 AUC at segment level, 0.973 at block level. What’s interesting is what comes out of the terminological analysis - patterns in how tilakkhaṇa formulas, brahmavihāra, jhāna descriptions and anattā teachings score. Some of this aligns with existing scholarship, some might be new. If the methodology holds up, there’s potential to expand this quite a bit.

Paper: https://zenodo.org/records/18188407
Code: https://github.com/Kingelanci/graphysics/blob/main/pali_stylometry_v20_FINAL_CORRECT_(10).ipynb

Would love to hear from people who actually know these texts. Criticism welcome - probably missing things.

Thanks for your time!

13 Likes

Welcome to the forums! :kissing_face_with_closed_eyes:

And just what would that mean for the peasants like me? :sweat_smile: One of the jobs of scholars is to translate their works into a language that ordinary people can understand as well, after all!

This part in conclusion seems important:

What stylometry can offer is not definitive dating but testable hypotheses. If the patterns are genuine, they suggest
the earliest Buddhist teaching was more phenomenological, more practice-oriented, and less systematized than
later standard doctrine. The archaic stratum may have been: a pragmatic teaching centered on the cessation
of craving through non-establishment, expressed through paradox and phenomenological description rather
than systematic doctrine, practiced by solitary contemplatives who refused to take a stand on metaphysical
questions—not because such questions are meaningless, but because taking any stand is itself the disease that
the teaching aims to cure.

In other words, which parts do you hypothesise are earlier? It would be great if you could provide some examples of the different kinds of layers you’ve found, if any.

For example, what passages or formulas in the four principal Nikāyas are closer to early SNP voice?

It would be useful for the general public if you can talk about how such things actually line up. :slight_smile:

This work reminds me of something Bruce Linnell has done on Dao De Jing’s voices, might be of interest to you:

Anyway, sadhu for the hard work! Perhaps Venerable @Vimala might be interested in this stuff. :slight_smile:

5 Likes

I am very excited about this! Just one point so far: the Nālaka is thought to be early only in its second half; the introductory verses were probably added later. Did you remove these from your training data?

Also, you provide a “filtered by Nikāya” list in Table 6. Could you provide us with which suttas are actually on this list? Like, which 9 suttas from DN? Etc.

This is really interesting work!

2 Likes

Thanks for sharing this! Very similar to what I’m doing with the Pāli Canon! It’s kind of a geek job, but maybe it could be interesting.

So what’s this all about: I trained a computer model to distinguish early vs late Buddhist texts based purely on writing style - not meaning, just statistical patterns in letter sequences and other features.
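For the curious, the core idea can be sketched roughly like this: represent each segment as character n-gram frequencies and score it by similarity to “early” vs. “late” training profiles. This is a toy illustration only, with made-up stand-in texts (diacritics stripped) and a naive similarity score, not the actual pipeline from the notebook:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    # Pad with spaces so word boundaries show up as n-grams
    text = f" {text} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    # Cosine similarity between two n-gram frequency vectors
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def profile(texts):
    # Pool the n-gram counts of all training texts into one profile
    total = Counter()
    for t in texts:
        total += char_ngrams(t)
    return total

# Tiny made-up stand-ins for the labelled training segments
early_profile = profile(["tanhaya vippahanena", "sabbaso namarupasmim vitarago"])
late_profile = profile(["dvattimsa mahapurisalakkhanani",
                        "vipassi bhagava araham sammasambuddho"])

def p_early(segment):
    # Normalized score: closer to 1.0 means more "early-style"
    e = cosine(char_ngrams(segment), early_profile)
    l = cosine(char_ngrams(segment), late_profile)
    return e / (e + l) if (e + l) else 0.5

score = p_early("rupam attato samanupassati")
```

Of course, the real model uses a trained classifier and far richer features; the point here is just that no semantic understanding is involved, only letter statistics.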

Some possible findings:

  • Analyzed ~240,000 segments across the Canon;

  • ~4,600 segments pass strict multi-level filtering as “early-style”;

  • Suttas with parallels in the Chinese Āgamas, Sanskrit sources, and Gandhāran manuscripts consistently score higher than the Canon average;

  • Late markers (cosmological elaboration, lists of previous Buddhas, stupa veneration) are effectively 99.9% absent from the filtered corpus;

  • Interesting anattā pattern: “do not regard as self” (attato samanupassati) scores much higher than “X is non-self” - suggesting anattā may have originated as a practical instruction rather than a metaphysical claim. But anicca shows a different pattern!;

  • mettā and upekkhā in the brahmavihāra set score early, karuṇā and muditā score late - might the four-fold set be a later systematization?;

  • Phenomenological descriptions consistently score higher than categorical doctrinal assertions and systematic formulas.

The model has zero understanding of Pāli or Buddhism - yet it converges with what philology already told us: the earliest teaching was more practice-oriented and less doctrinal. Some results are independent confirmations and some may be new and interesting, hopefully! And if the methodology is valid it can be expanded a lot; this is just a kind of proof of concept.

If you want, you can skip the methodology part and take a look at the patterns. Thanks a lot again; I look forward to your feedback, but no expectations whatsoever!

4 Likes

How so? :slight_smile:

This is an interesting find. I wonder if the fourfold brahmavihāras are found in other Indic Dharmas (I’m woefully ignorant on this).

Again, I can kinda get what you’re saying here, and the rest is found in the paper; I’m just trying to point out the ways you could express these for people who might be confused with the paper, and need more direct expression of lists. :slight_smile:

Personally, I’ll try to jot down these items from your list later if you don’t already, but I’m also kinda busy these few days, so yeah. :slight_smile:

I mean, @Sphairos can elaborate better (or correct me on this), but it still hinges on the assumption that earlier parts of SNP are indeed earlier.

Now, that’s an assumption I hold as well, but not everyone actually believes that either. :slight_smile:

Take everything from here as my opinion, with a grain of salt!

Vectoral Analysis is a powerful tool, so it should be interesting to see how it can be used wisely.

If I’m right, your study does show how different layers are found comparatively, but we still need to apply context to deduce which of these layers might actually be earlier or later. Right now, as I understand it, it places the SNP against late Khuddaka material, and ranks them accordingly.

That’s a valuable start, but it can’t yet be anything definitive, can it? :slight_smile:

(The point about shared parallels scoring higher is definitely a point for the argument, though!)

And philological analysis is another powerful tool, but it’s not infallible either, since it can be kind of a circular argument: we say X is early, then Y is like X, so it’s early. But how can we definitely say X is early? Or can we disregard potentially anachronistic usages of language? Etc…

So yeah, a broader hermeneutics seems indispensable for such analyses. And it’s nice to see that you do that as well, unsurprising considering your Oriental Studies! :slight_smile:

I believe Vectoral Analysis like this is great for informing us of patterns, for us to ask more questions that would be informed with philological, hermeneutical, spiritual inquiries, rather than assuming that “machine tells us this is early / late” or so. So, yeah, some nuance and clear reading of data is needed!

There might be some people who find this kind of data useless, and if you jump hastily to conclusions / present them as facts, they’ll jump right to it (rightfully too, I might add!).

So, this kind of study clearly needs to understand and present its limitations, and acknowledge how pattern-matching can only raise new questions for experts to analyse anew. :slight_smile:

The ideal end-game for such tools, IMO, would be to raise such questions to which once we find some answers with spiritual, philological and historical analyses, in our final arguments, we won’t even have to reference the machine data. Does that make sense? :slight_smile:

Anyways, lots of exciting stuff. I’d been expecting someone to do something like this already, so now that it’s here, there’s quite the work to be done. :smiley:

:lotus:

1 Like

Thanks so much! Really glad you find it interesting :blush:

On Nālaka: Good catch! I used the full Snp 3.11 to avoid over-optimization, but ablation tests show that Snp IV-V alone (around 75% of the training segments) gives virtually identical results, even slightly higher - so those intro verses don’t really matter too much.

On the sutta list: Here you go:

  • DN (9): 1, 4, 5, 6, 7, 8, 10, 13, 15

  • MN (37): 4, 12, 16, 24, 27, 29, 30, 31, 41, 42, 53, 55, 59, 60, 63, 72, 74, 75, 76, 78, 94, 96, 97, 102, 105, 108, 109, 121, 122, 124, 126, 132, 135, 136, 138, 148, 150

  • SN (101): Heavily in SN 12, 22, 35 - the paṭiccasamuppāda, khandha, saḷāyatana material

  • AN (82): Spread around, clusters in AN 10-11

  • KN (38): Mixed bag - Udāna, Theragāthā, Sutta Nipāta bits

But honestly the sutta-level list is less interesting than looking inside suttas, block by block. The model distinguishes editorial frames from doctrinal cores pretty well - titles score ~0.03, but core teaching hits 0.95+. In SN 22.55: title gets 0.04, but “rūpaṁ attato samanupassati” hits 0.99!

What’s cool is that high-scoring stuff is doctrinally coherent - lots of paṭiccasamuppāda, khandha, sense-bases, craving/letting go. The model knows zero Buddhism, yet what it flags as “early” actually hangs together as a teaching. That’s reassuring!

Happy to share segment data if you want to dig deeper!

5 Likes

Thanks for the thoughtful feedback! Really appreciate the nuance :blush:

On anicca vs anattā: With anicca, ontological statements (“X is impermanent”) actually score higher than process descriptions. With anattā it’s the opposite - the practical instruction “do not regard as self” scores higher than “X is non-self”. Reversed patterns!

My guess: impermanence is something you can just see - everything changes, it’s obvious once you look. So “all this is impermanent” works as a direct observation. But anattā is trickier, more counterintuitive - so maybe it was originally presented more as a technique, an upāya: “try not regarding these things as yourself” rather than “these things are not-self.” Just speculation though!

Honestly, on the brahmavihāras in other traditions I don’t know enough - would love input from someone more knowledgeable!

You’re right that the model hinges on Snp IV-V being early. It’s not just scholarly consensus though - there’s independent stuff: archaic Vedic meters, pre-standard Pāli features, Āgama parallels, even Gandhāri manuscripts and Aśokan references for some texts. But yes, it’s still an assumption and the whole thing stands or falls with it.

The way I see it: the model can’t prove anything is early. It just says “if you accept Snp IV-V as early, here’s what else looks stylistically similar.” Whether similarity means chronological proximity - that’s for philologists to figure out! Interestingly, most of the high-scoring segments are not in the training set.

The parallels correlation is probably the best independent check - the model wasn’t trained on that, yet texts with external attestation score higher.

Also, stuff that most scholars agree is late - names of previous Buddhas, mahāpurisalakkhaṇa lists, cakkavatti mythology, stupa veneration - is virtually absent from the filtered corpus (99.9%+ excluded). I shuffled the “early/late” labels randomly 1000 times and retrained. The real model scores way higher than any shuffled version (p < 0.001). I also tested a model using only grammatical particles, not content words; it still discriminates well (AUC 0.84). So it’s not just detecting topics - there’s genuine stylistic signal, and the model is likely not picking up random noise. I did my best with my knowledge of machine learning.
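To make the shuffling check concrete, here is roughly what a label-permutation test looks like, on synthetic data with a naive AUC (purely illustrative; the actual study retrains the full model on each shuffle):

```python
import random

def auc_of(labels, scores):
    # AUC = probability that a random "early" (1) segment outscores a random "late" (0) one
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    return sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

random.seed(0)
labels = [1] * 50 + [0] * 50
# Synthetic scores: "early" segments drawn from a higher distribution
scores = [random.gauss(1.0, 1.0) for _ in range(50)] + \
         [random.gauss(0.0, 1.0) for _ in range(50)]

real_auc = auc_of(labels, scores)

# Null distribution: how well do randomly permuted labels do?
null = []
for _ in range(1000):
    shuffled = labels[:]
    random.shuffle(shuffled)
    null.append(auc_of(shuffled, scores))

# Empirical p-value: fraction of shuffles that match or beat the real AUC
p_value = sum(n >= real_auc for n in null) / len(null)
```

If the real AUC sits far outside the shuffled distribution, the signal is unlikely to be an artifact of label noise, which is the point of the check described above.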

Totally agree this raises more questions than answers. That’s why I asked for your feedback. Happy to be corrected so I can improve the paper! :folded_hands:

3 Likes

Right. And your willingness to engage with these matters frankly is inspiring. :slight_smile:

I’ll be honest - with the many problems and criticisms of AI, many people will hear “Machine Learning” and place this kind of analysis in the same boat. Or those who’re open to ML analysis might not necessarily be excited by early/late analysis per se. So that’s a tough job ahead; good luck in your endeavours. :slight_smile:

On fine tuning the model: As @josephzizys pointed out, not everything in SNP IV is the same, and SNP V doesn’t have parallels either. Perhaps cutting out Chapter V might give a different result.

On layers of SNP, Venerable’s paper gives a good insight on the general scholarly opinions:

Pre-institutional Buddhist Traditions in the Arthapada by Seongryong Lee:

His paper (and himself perhaps, if you can reach out) could be insightful for your work.

And also, seeing how you use the SC data for training, and given Bhante @Sujato’s stance on AI (He asks people not to use SC data to train AI), I’m not quite sure if this falls under the same “No AI category” for him, so paging him to give a clarity on the subject hopefully. :slight_smile:

Even so, seeing how you’re just relying on Pāli texts, it might be prudent to use data from somewhere else. Other projects like @abuddhistview’s Sutta Search that aren’t necessarily Gen AI but just using LLMs to decode user input and match it with sutta descriptions have also refrained from using SC data, so. :slight_smile:

1 Like

Welcome @maxro

Thanks for this. As @Dogen already said, I find it very interesting. I would have to take some time to study it and might get back to you after that.

Of course, this is just an additional indication of earliness / lateness and not an absolute. Some years ago, I did an analysis of word length in Pali texts, which showed, at least on the level of collections, a remarkable resemblance to what we know about the relative age of these collections. This study is able to look at it on the level of individual suttas.

3 Likes

Thanks so much for the thoughtful response and the paper - will definitely check it out!

Fair point on the fewer parallels for Snp V, but the Pārāyanavagga shares several key features with the Aṭṭhakavagga: archaic Vedic-style meters you don’t find elsewhere in the Canon, pre-standardized Pāli linguistic features, the emphasis on individual wanderers rather than an organized saṅgha, the question-answer format with a strong focus on letting go of views, and minimal doctrinal systematization. Plus it has its own Gandhāri attestation and the Niddesa commentary in the Canon itself (which suggests antiquity). Norman, Bodhi, and Wynne all treat IV-V together as the archaic core. But sure, IV is the stronger case. The training set needs a minimum quantity of segments, so I had to add the texts I believe are more archaic - but this is somewhat discretionary, I know.

Anyway - if anyone here can suggest an alternative training set, I’m happy to retrain and see what changes! The method is flexible.

Thanks for flagging the AI issue. To be clear: I’m not training a generative AI; this is classical ML. Other canons and texts have been analyzed through ML. This is just a kind of automatic and objective analysis - like the word-length study Vimala mentioned. No text generation, no chatbot, just pattern detection. But I totally understand the concern and am happy to hear Bhante Sujato’s take on it.

Thanks for engaging so thoughtfully - this is exactly the conversation I was hoping for, to find out whether this approach could be useful! :blush:

1 Like

Thanks so much, Ayya Vimala! Really appreciate it if you find the time to take a look :folded_hands:

Your word-length study sounds really interesting and relevant to me. If the two methods converge on similar patterns, that would be reassuring. Is it published somewhere I can read it?

3 Likes

The draft is here: (PDF) A statistical analysis of word-lengths in the Pali canon

But it is only very rudimentary, using a Python script to go over all the Pali texts and determine the average word length. It is not an exact science, of course. Verses use different language than prose, and headings are often added later (so I removed those).
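The core of such an average-word-length (AWL) measure can be as simple as this; a minimal illustrative sketch, not the actual script (which handles verse/prose and headings as described above):

```python
import re

def average_word_length(text):
    # Crude tokenization: everything that is not whitespace or punctuation is a word
    words = re.findall(r"[^\s.,;:!?\"'()]+", text)
    return sum(len(w) for w in words) / len(words) if words else 0.0

prose = "evam me sutam ekam samayam bhagava savatthiyam viharati"
print(average_word_length(prose))  # → 6.0
```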

6 Likes

Thank you, Ayya Vimala! Your AWL study is really interesting: a different methodology, yet with related patterns.

My model works at segment/block level, so I tried a quick test, semi-manually classifying SN 22.59 to see what happens. The result (if confirmed by other tests): doctrinal content scores a higher p_early (0.79) but has a lower AWL (6.14) than the frame narrative (AWL 7.63, p_early 0.50).

So within a single sutta, the model seems to pick up something that aligns with your AWL finding - the “early” material uses simpler vocabulary and may be the original material around which the suttas were elaborated.

I attached the classification in case it’s useful. (Note: I kept one complete formula per khandha where the text has …pe… abbreviations.)

sn22_59_classification.pdf (58.2 KB)

1 Like

Plato specialist here, happy to see a method that has been fruitfully applied in that field being applied here, with all the usual caveats so thoughtfully pointed out by the OP. Thank you!

2 Likes

Good work! How do you control for oral-formulaic effects? In a canon shaped by mnemonic constraints, stylistic signals often track formula density rather than chronology. For example, how do you prevent jhāna or brahmavihāra passages from clustering simply because of shared recitative structure?

Relatedly: have you tested robustness by down-weighting high-formula phrases or collapsing parallel passages across Nikāyas? If the signal survives that, the chronological claim becomes much stronger! :flexed_biceps:t2:

@Irene1 Thank you! This is exactly the right question to ask! If jhāna or brahmavihāra passages look similar just because monks memorized and repeated them the same way, that’s not chronology. That’s just oral tradition doing its thing.

Here’s what gives me some confidence the signal could catch something real:

I tried stripping out all the Buddhist vocabulary. I kept just the little grammatical words: the Pali equivalents of “and,” “but,” “thus,” “indeed.” No “jhāna,” no “mettā,” nothing doctrinal. The model still works at 84% accuracy. If shared formulas were driving everything, removing them should break it. It didn’t.

The formulas themselves weren’t in the training data. The brahmavihāra phrases, the numbered jhānas, the Eightfold Path formula: the model never saw any of these during training. So when it scores mettā high and karuṇā low, it’s not just recognizing something it memorized. It’s picking up on something in the surrounding language.
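A toy illustration of the particles-only check: strip everything except a whitelist of grammatical words before training or scoring. The whitelist here is a tiny illustrative subset, not the actual feature list:

```python
# Tiny illustrative subset of Pali grammatical particles (the real list is much larger)
PARTICLES = {"ca", "va", "pana", "hi", "kho", "evam", "iti", "tam", "idha"}

def particles_only(text):
    # Keep only whitelisted function words; all content words are discarded
    return " ".join(w for w in text.split() if w in PARTICLES)

print(particles_only("evam kho bhikkhave bhikkhu sampajano hoti iti ca"))
# → "evam kho iti ca"
```

If a model trained on such filtered text still discriminates early from late, the signal cannot be coming from doctrinal vocabulary.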

More context makes the signal even stronger. When I group five segments together instead of looking at them one by one, accuracy jumps from 86% to 97%. That’s the opposite of what you’d expect if I were just catching short repetitive phrases. Stock formulas are often relatively brief—adding more text around them should dilute the signal. Instead it gets clearer. That suggests the model is seeing more: how ideas connect, how sentences flow, not just isolated ritual chunks.
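The block-level aggregation can be sketched like this (the scores are made up for illustration):

```python
def block_scores(segment_scores, block_size=5):
    # Average consecutive segments into blocks; the mean of several noisy
    # per-segment scores is a more reliable signal than any single one.
    return [
        sum(segment_scores[i:i + block_size]) / len(segment_scores[i:i + block_size])
        for i in range(0, len(segment_scores), block_size)
    ]

# Made-up per-segment p_early scores for two 5-segment blocks
p_early = [0.9, 0.7, 0.95, 0.6, 0.85, 0.1, 0.3, 0.2, 0.4, 0.15]
blocks = block_scores(p_early)  # one score per block
```

Averaging smooths per-segment noise, which is one reason block-level accuracy can exceed segment-level accuracy.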

I also retrained with different technical settings (longer n-gram ranges), and seven of nine findings held up, including jhāna and brahmavihāra. If these patterns were flukes of one particular setup, they’d vanish. They didn’t.

I also quickly tried your nice suggestion (I still need to work on it): deliberately suppressing high-formula phrases to see if the segment-level patterns survive, and collapsing parallel passages across Nikāyas to check if the block-level stratification still holds. Preliminarily, the signal seems to survive at around 95%. Will let you know more about it!

2 Likes

Hi, I would be really interested in this more detailed segment data. Best regards!

Yes, I think so.

Choong Mun-keat, in “A comparison of the Pāli and Chinese Saṃyutta/Saṃyukta discourses on the housemaster Citta/Citra, a respected layman dhamma/dharma-teacher”, The Indian International Journal of Buddhist Studies, vol. 23 (2023), pp. 93-123 (published in 2024), suggests (p. 104):

“It is possible that the set of four immeasurables connected to appamāṇa-cetovimutti is an expanded version. The main teaching may have had just one focus, mettā. The Vinaya (at Cūḷavagga, Sattasatikakkhandhaka) only mentions mettā-vihāra, which is observed as kullaka-vihāra, a ‘family-meditative state’, whereas suññatā-vihāra is considered as mahāpurisa-vihāra, the ‘meditative state of great men’.”

1 Like

Hi @mfbilea, thanks for your interest. The data can currently be obtained by running the code on GitHub. I will also upload the segmented data files directly to the GitHub project within the next couple of days so that you can download them from there.

1 Like

Great, thanks for sharing! This is exactly the kind of independent confirmation that gives computer models like mine a reason to exist!

1 Like