Statistical analysis of Early Buddhist Texts

I would also love to hear about what might be the best way to search the suttas for certain words, etc? I have all the suttas combined into one pdf, but it seems to be too clunky to efficiently search. I’m wondering if there’s a better way.


Voice has actually been built around this thought.

By way of technical background, Voice relies on segmented texts. For English, Bhante Sujato’s translations are currently segmented according to the Pali root text, so Voice can use them with full functionality.

This means that if you search for a word or a term like “root of suffering” in Voice, it will search Bhante Sujato’s translations. Because their terminology is very consistent, the results are usually meaningful, but they are incomplete in several ways.

  • Firstly, results are incomplete as they leave out all other English translations.
  • Secondly, Voice currently searches only Suttas: no Vinaya, no Abhidhamma. And it covers only Suttas that Bhante Sujato has already translated, so many texts of the Khuddaka Nikaya are excluded from search.

This avoids being flooded by results, but you should be aware of what is searched and what isn’t.

(To avoid flooding, Voice only returns up to 5 results by default. If you want more, you can adjust this in the settings.)

Thank you, I learned a new trick! :smiley:


So what category does the question best belong in? Meta?


Thank you! One issue I see is that the maximum number of results it will return is 50. So, for example, I wanted to see how many suttas Brahma appears in, and it maxed out at 50. That said, this feature will be useful for other purposes, and I’m very thankful for it.


Okay, in this case you could use a little hack:

  • Disable cookies first.

  • Then select search results “50”.

  • Search for “brahma”.

  • Look at the URL line:

    [Screenshot of the URL line]

  • Change the value of maxResults. Replace, for example, “50” with “100”.

  • Click “reload” in the upper left corner of your screen.

I tried it out with 100 and got 100 results, so you might have to try a still higher number. I’m not sure how far it will still work. Maybe 1000 will be a bit much … (The scope that Bhante Sujato has translated and published is around 4000 Sutta files.)

You shouldn’t do this too often though; if many people do it often, Voice will become v-e-r-y s-l-o-w …
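
For anyone who would rather script the URL edit than type it by hand, here is a minimal Python sketch. The example URL is made up for illustration (the real one is only visible in the screenshot); the only thing taken from the steps above is the parameter name `maxResults`. It also assumes the parameter sits in the ordinary query string. If it turns out to live after a `#`, the same idea would have to be applied to the fragment instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_max_results(url, n):
    """Return `url` with its maxResults query parameter set to `n`."""
    parts = urlsplit(url)
    # Keep every other parameter as it is; drop any existing maxResults.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "maxResults"
    ]
    params.append(("maxResults", str(n)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical URL, for illustration only.
example = "https://example.org/search?search=brahma&maxResults=50"
print(set_max_results(example, 100))
```

The same caution applies as above: large values put load on the server, so use it sparingly.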


Thank you! I’ll try it.


Let us know the result. :smiley:

I’m going to post this in the recent discussion about deities.

Wow! Glad it worked! :+1:


We haven’t heard back from Animitto, so I’ll just make some general remarks.

  • If anyone wants to do statistical analysis of Pali texts, use the suttacentral/bilara-data repository on GitHub (content for the Bilara translation webapp).
    • Segmented translations in English are also found there, as well as a growing collection of other languages.
  • For the remaining texts, use the html_text directory on the html-clean5 branch of the suttacentral/sc-data repository on GitHub.
  • If you want more precise or specialized information than a regular search engine provides, clone the git repo and search it locally using Sublime Text or some other tool.
  • To export texts into a spreadsheet, use Bilara i/o.
  • The main SC search uses Elasticsearch, SC-Voice uses ripgrep, while our translation webapp Bilara uses ArangoDB. All these have advantages and disadvantages, so you may get somewhat different results.
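
As a rough illustration of searching a local clone with "some other tool", here is a minimal Python sketch. It assumes each JSON file in the clone is a flat object mapping segment IDs to text strings. The function name and the directory path in the usage note are my own guesses, so adjust them to what you actually see after cloning.

```python
import json
import re
from pathlib import Path


def files_matching(root, pattern):
    """Map each JSON file under `root` to its number of matching segments."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = {}
    for path in sorted(Path(root).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(data, dict):
            continue  # assumption: segment files are flat ID -> text objects
        count = sum(
            1 for text in data.values()
            if isinstance(text, str) and rx.search(text)
        )
        if count:
            hits[str(path.relative_to(root))] = count
    return hits
```

Something like `len(files_matching("bilara-data/translation/en", r"\bbrahm"))` would then give the number of files with at least one hit, with no cap on the number of results. The path here is hypothetical.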

This thread might interest you @animitto :slight_smile:


So far I have used the software cst4 (Chaṭṭha Saṅgāyana Tipiṭaka Version 4.0), available on tipitaka.org. It’s not the best of worlds, but it works. Has anyone worked with both cst4 and GitHub who could describe the differences?

What I like about cst4 is that I can work with wildcards/asterisks at both the beginning and the end of a word, and can search for two terms and define their maximum distance. The search does not cover translations. I don’t think it’s possible to export anything to Excel.
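
For comparison, a wildcard-plus-distance search of this kind can be approximated in a few lines of Python. This is only a sketch and not how cst4 works internally: the function name is made up, distance is counted in words (my assumption), and the sample is a short familiar opening phrase typed in by hand.

```python
import re
import unicodedata
from fnmatch import fnmatchcase


def near(text, pat_a, pat_b, max_distance):
    """Return (word_a, word_b) pairs that match the two wildcard patterns
    and lie at most `max_distance` words apart."""
    # NFC keeps letters such as ṁ as single characters so \w matches them.
    text = unicodedata.normalize("NFC", text).lower()
    words = re.findall(r"\w+", text)
    pos_a = [i for i, w in enumerate(words) if fnmatchcase(w, pat_a)]
    pos_b = [i for i, w in enumerate(words) if fnmatchcase(w, pat_b)]
    return [
        (words[i], words[j])
        for i in pos_a
        for j in pos_b
        if i != j and abs(i - j) <= max_distance
    ]


sample = "evaṁ me sutaṁ ekaṁ samayaṁ bhagavā sāvatthiyaṁ viharati"
print(near(sample, "bhagav*", "*viharati", 2))
```

The `*` wildcard works at the start or end of a pattern, as in `bhagav*` or `*viharati`.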

Tools for corpus analysis are getting more and more user friendly by the way.

Check out Orange (open-source Python software).

There’s also NLTK for Python:

https://www.nltk.org/

IMO, the next step of statistical analysis would maybe be to make use of these new tools that are becoming available :slight_smile:
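
Even before reaching for those packages, a basic word-frequency count needs only the Python standard library. A minimal sketch, using a few hand-typed sample strings rather than real corpus data:

```python
import re
import unicodedata
from collections import Counter


def word_frequencies(texts):
    """Count lower-cased word forms across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # NFC keeps diacritic letters as single characters for \w.
        text = unicodedata.normalize("NFC", text).lower()
        counts.update(re.findall(r"\w+", text))
    return counts


# Illustrative sample strings only.
segments = ["Evaṁ me sutaṁ.", "Ekaṁ samayaṁ bhagavā viharati.", "Evaṁ vutte"]
freqs = word_frequencies(segments)
print(freqs.most_common(3))
```

Note that this counts inflected forms separately; grouping them under one lemma is where the more specialized tools come in.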


GitHub is just the place where the texts are stored; it’s not an application. Or it is, but not that kind of application.

CST4 or DPR—or for that matter SC—offer an integrated package that facilitates certain kinds of search and analysis. These cover a lot of ordinary use cases and are fine for most people.

However, if you need some kind of specialized analysis not covered by these apps, one approach you could use is to clone the bilara-data repo locally. Then you can search or analyze it with any tool you like. Since it is pure and battle-tested JSON, it is easy to transform into any format, or just treat it as plain text.
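
To give a feel for how little code such a transformation takes, here is a minimal Python sketch that turns one segment file into CSV. It assumes a file is a flat JSON object of segment ID to text; the column names, the function name and the sample IDs are invented for the example.

```python
import csv
import io
import json


def segments_to_csv(json_text):
    """Convert one segment file (ID -> text) into two-column CSV text."""
    segments = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["segment_id", "text"])  # hypothetical column names
    writer.writerows(segments.items())
    return out.getvalue()


# Made-up sample in the assumed shape, not copied from the repo.
sample = '{"demo1:1.1": "Evaṁ me sutaṁ", "demo1:1.2": "ekaṁ samayaṁ, bhagavā"}'
print(segments_to_csv(sample))
```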

For myself, my main tools are Sublime Text, which lets me do rich searching and regular expressions across the whole corpus, or a defined subset; and LibreOffice Calc, where I can import texts via Bilara i/o and query or manipulate them in the various ways that a spreadsheet makes possible.

For example, I might want to search for cases where “dhamma” is translated as “thing”. Separate searches for “dhamma” or “thing” would be painful, but using Bilara i/o I can do both at once.
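
The same both-at-once query can also be sketched in Python, assuming the root text and the translation have each been loaded as a dict keyed by segment ID. The function name and the sample data are made up for illustration and are not taken from the repo.

```python
import re


def paired_hits(root, translation, root_pattern, trans_pattern):
    """Yield (segment_id, root_text, translated_text) for segments where
    the root matches one pattern and the translation matches the other."""
    rx_root = re.compile(root_pattern, re.IGNORECASE)
    rx_trans = re.compile(trans_pattern, re.IGNORECASE)
    for seg_id, pali in root.items():
        english = translation.get(seg_id, "")
        if rx_root.search(pali) and rx_trans.search(english):
            yield seg_id, pali, english


# Illustrative data only.
root = {"demo:1.1": "sabbe dhammā anattā", "demo:1.2": "dhammaṁ desesi"}
translation = {
    "demo:1.1": "all things are not-self",
    "demo:1.2": "taught the Dhamma",
}
for hit in paired_hits(root, translation, r"\bdhamm", r"\bthings?\b"):
    print(hit)
```

Because the two files share segment IDs, the join is just a dictionary lookup.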

(Incidentally, this is also possible in the Bilara webapp for translators, and we hope to bring it to SC one day!)

More ambitiously, someone with some basic programming skills can use one of the tools mentioned by Erik, to which I would add texthero.

More advanced still, neural nets offer new possibilities, as can be seen at Buddhanexus:

https://buddhanexus.net/

(Not to be a party-pooper, but neural nets in their current form have, in my view, over-promised and under-delivered, and we seem to be approaching the diminishing returns phase of their evolution. Still, they may yet make significant contributions to Buddhist studies.)

That’s cool. You can also do this using Sublime Text and regular expressions, but it’s more of a learning curve.


Could you please outline the advantages of using Sublime Text over CST4? I understand that it can make use of the translations and also generate a result sheet. Maybe you could take a screenshot of a result page?

Here’s btw a screenshot from CST4


Sorry everyone and @Gillian. I was away. Thank you to @musiko for putting my post into context. Very interesting replies. Thank you all!


Thanks. I used cst4 before. Very interesting tool.


Well, for a start, Sublime is cross-platform, so I can actually use it at all, so there’s that!

It’s apples and oranges, but for what it’s worth, I can trivially search for a segment ID and get the results of all the associated information.

Or search for Pali segments that have the phrases “ca” and “nigaṇṭhe”, separated by any number of characters.

Or anything else that a regex can do.
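
For readers new to regular expressions, here is what the “ca” … “nigaṇṭhe” search might look like as a small Python sketch. The segments are toy strings, not quotations, and the function name is invented. The second pattern adds word boundaries, which is my own refinement to avoid hits on words that merely contain “ca”.

```python
import re
import unicodedata


def search_segments(segments, regex):
    """Return the IDs of segments whose text matches `regex`."""
    # Normalize pattern and text alike so diacritics compare reliably.
    rx = re.compile(unicodedata.normalize("NFC", regex))
    return [
        seg_id
        for seg_id, text in segments.items()
        if rx.search(unicodedata.normalize("NFC", text))
    ]


# Toy strings for illustration only.
segments = {
    "demo:1": "ājīvake ca nigaṇṭhe ca",
    "demo:2": "nigaṇṭhe passati",
    "demo:3": "cattāro nigaṇṭhe",
}
loose = search_segments(segments, r"ca.*nigaṇṭhe")
strict = search_segments(segments, r"\bca\b.*\bnigaṇṭhe\b")
print(loose, strict)
```

The loose pattern also picks up "demo:3", because "cattāro" starts with "ca"; the version with `\b` matches only the standalone word.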

But it has limitations. Searching for multiple words in different segments is not trivial.

But it’s not so much about the specific tools, it’s about the clarity of the underlying data. Our JSON files allow you to convert and process the content of the suttas easily and reliably in ways that are just not possible in an XML-based system like CSCD. Here, try it out. I just ran bilara i/o on DN, and generated a CSV file.

dn.zip (816.3 KB)

Unzip it and open in a spreadsheet, or import it to a data-table, or open in a text editor, or run it through a NL program, or play with it in airtable, or whatever. The point is, the data is not bound to the application. I’m hoping that, over time, lots of people will take it and use it in fun ways!
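
As one example of taking the data elsewhere, a CSV like this can be read with Python's standard `csv` module. I have not checked the column names in the attached file, so the ones below are placeholders and the function name is hypothetical; print `reader.fieldnames` on the real file to see what is actually there.

```python
import csv
import io
import re


def count_matching_rows(csv_file, column, pattern):
    """Count rows whose `column` value matches `pattern` (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    reader = csv.DictReader(csv_file)
    # A missing column yields None, which is treated as empty text.
    return sum(1 for row in reader if rx.search(row.get(column) or ""))


# Stand-in data with invented column names.
demo = io.StringIO(
    "segment_id,root,translation\n"
    "demo:1.1,sabbe dhammā anattā,all things are not-self\n"
    "demo:1.2,dhammaṁ desesi,taught the Dhamma\n"
)
print(count_matching_rows(demo, "translation", r"\bdhamma\b"))
```

With the real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` stand-in.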


I think I get a vague idea of the freedom of this approach. Still, I think I speak for many people who are just mildly tech-savvy and see that it would take time to get into such an application-free data processing approach. Plus, I know the effect: even if I learned it once, I would probably forget it again if I didn’t use it for a while. For example, I was pretty good with SPSS back in 1997, and in 2003 I had to learn it all over again. Thanks for the examples!


Sure. But there are lots of programmers in the world, and now they have a new toy to play with!
