Statistical analysis of Early Buddhist Texts

I am interested in how you conduct your statistical analysis. Could you please describe the steps ? Do you have a database of all the suttas in pali or English ?
Do you use software ? Does it make an output of texts or numbers in excel, etc ? Or is it simply through the search function of Sutta Central for example ?
Thank you


Are you speaking to somebody in particular @animitto, or referring to a particular analysis you’ve seen? A link would help. :slight_smile:


This is the original post On the authenticity of modern meditation methods as a reply to On the authenticity of modern meditation methods (for context).

I believe @animitto wished to split this question into a separate topic, but has created a new topic instead and this is why the context is missing?

It is possible to post an off-topic reply as a new topic like this:


I would also love to hear about what might be the best way to search the suttas for certain words, etc? I have all the suttas combined into one pdf, but it seems to be too clunky to efficiently search. I’m wondering if there’s a better way.


Voice has actually been built around this thought.

As a technological background, Voice relies on segmented texts. For English, right now Bhante Sujato’s translations are segmented according to the Pali root text, so Voice can use them with full functionality.

This means that if you search for a word or a term like “root of suffering” in Voice, it will search Bhante Sujato’s translations. As they are very consistent in terminology the results are usually meaningful, however not complete in several ways.

  • Firstly, results are incomplete as they leave out all other English translations.
  • Secondly, Voice only searches Suttas at the moment, no Vinaya, no Abhidhamma. And only Suttas that have already been translated by Bhante Sujato, i.e. many texts of the Khuddaka Nikaya are excluded from search.

This avoids being flooded by results, but you should be aware of what is searched and what isn’t.

(In order to avoid flooding, Voice only returns up to 5 results by default. If you wish more you can adjust this in the settings.)

Thank you, I learned a new trick! :smiley:


So what category does the question best belong in? Meta?

1 Like

Thank you! One issue I see is the max results it will return is 50. So, for example, I wanted to see how many suttas Brahma appears in. It maxed out at 50. That said, this feature will be useful for other purposes, and I’m very thankful for it.


Okay, in this case you could use a little hack:

  • Disable cookies first.

  • Then select search results “50”.

  • Search for “brahma”.

  • Look at the URL line:

    Screenshot from 2020-07-12 19-23-47

  • Manipulate the number of maxResults. Replace, for example, “50” by “100”.

  • Click “reload” in the left upper corner of your screen.

I tried it out with 100 and got 100 results, so you might have to try still a higher number. Not sure how far it will still work. Maybe 1000 will be a bit much … (The scope that has been translated by Bhante Sujato and is published are around 4000 Sutta files.)

You shouldn’t do this too often though; if many people do it often Voice will become v-e-r-y - - - s–l--o–w …


Thank you! I’ll try it.


Let us know the result. :smiley:

  1. I’m going to post this in the recent discussion about deities.

Wow! Glad it worked! :+1:


We haven’t heard back from Animitto, so I’ll just make some general remarks.

  • If anyone wants to do statistical analysis of Pali texts use GitHub - suttacentral/bilara-data: Content for Bilara translation webapp.
    • Segmented translations in English are also found there, as well as a growing collection of other languages.
  • For remaining texts, use sc-data/html_text at html-clean5 · suttacentral/sc-data · GitHub
  • If you want more precise or specialized information than a regular search engine provides, clone the git repo and search it locally using Sublime text or some other tool.
  • To export texts into a spreadsheet, use Bilara i/o. Bilara i/o
  • The main SC search uses elasticsearch, SC-Voice uses ripgrep, while our translation webapp Bilara uses ArangoDB. All these have advantages and disadvantages, so you may get somewhat different results.

This thread might interest you @animitto :slight_smile:


So far I have used the software cst4 (Chaṭṭha Saṅgāyana Tipiṭaka Version 4.0), available on It’s not the best of worlds, but it works. Has anyone worked both with cst4 and Github and describe the differences?

What I like about cst4 is that I can work with wildcards/asterisks both for word beginning and ending, and can search for two terms and define their maximum distance. The search is not for translations. I don’t think it’s possible to export anything to excel.

Tools for corpus analysis are getting more and more user friendly by the way.

Check out Orange (open source python software):

There’s also NLTK for python:

IMO, the next step of statistical analysis would maybe be to make use of these new tools that are becoming available :slight_smile:


Github is just the place where the texts are stored, it’s not an application. Or it is, but not that kind of application.

CST4 or DPR—or for that matter SC—offer an integrated package that facilitates certain kinds of search and analysis. These cover a lot of ordinary use cases and are fine for most people.

However, if you need some kind of specialized analysis not covered by these apps, one approach you could use is to clone the bilara-data repo locally. Then you can search or analyze it with any tool you like. Since it is pure and battle-tested JSON, it is easy to transform into any format, or just treat it as plain text.

For myself, my main tools are Sublime Text, which lets me do rich searching and regular expressions across the whole corpus, or a defined subset; and Libreoffice Calc, where I can import texts via bilara i/o and query or manipulate them in the various ways that a spreadsheet makes possible.

For example, i might want to search for cases where “dhamma” is translated as “thing”. Searches for “dhamma” or “thing” would be painful, but using bilara i/o I can do both at once.

(Incidentally, this is possible also in the Bilara webapp for translators, and we hope to bring to SC one day!)

More ambitiously, someone with some basic programming skills can use one of the tools mentioned below by Erik, to which I would add texthero;

More advanced still, neural nets offer new possibilities, as can be seen at Buddhanexus:

(Not to be a party-pooper, but neural nets in their current form have, in my view, over-promised and under-delivered, and we seem to be approaching the diminishing returns phase of their evolution. Still, they may yet make significant contributions to Buddhist studies.)

That’s cool. You can also do this using sublime text and regular expressions, but it’s more of a learning curve.


Could you please outline what the advantages are for using Sublime text over CST4? I understand that it can make use of the translations and also generate a result sheet. Maybe you could take a screenshot of a result page?

Here’s btw a screenshot from CST4

1 Like

Sorry everyone and @Gillian . I was away. Thank you to @musiko to put my post into context. Very interesting replies. Thank you all !


Thanks. I used cst4 before. Very interesting tool.

1 Like