Statistical analysis of Early Buddhist Texts

Okay, in this case you could use a little hack:

  • Disable cookies first.

  • Then set the number of search results per page to “50”.

  • Search for “brahma”.

  • Look at the URL line:

    [screenshot: the browser URL line showing the search query parameters]

  • Change the number in maxResults; for example, replace “50” with “100”.

  • Click “reload” in the upper left corner of your screen.

I tried it out with 100 and got 100 results, so you might have to try a still higher number. I’m not sure how far it will keep working; maybe 1000 will be a bit much … (The scope that has been translated by Bhante Sujato and published is around 4000 sutta files.)
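As a sketch of the trick above: the URL here is hypothetical (the real parameter layout may differ), but the whole “hack” amounts to rewriting one query parameter before reloading.

```shell
# Hypothetical Voice search URL; the real one may look different.
# The edit is just bumping the maxResults query parameter.
url='https://voice.suttacentral.net/?search=brahma&maxResults=50'
echo "$url" | sed 's/maxResults=50/maxResults=100/'
```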

You shouldn’t do this too often though; if many people do it often Voice will become v-e-r-y - - - s–l–o–w …

5 Likes

Thank you! I’ll try it.

2 Likes

Let us know the result. :smiley:

2 Likes
I’m going to post this in the recent discussion about deities.
3 Likes

Wow! Glad it worked! :+1:

3 Likes

We haven’t heard back from Animitto, so I’ll just make some general remarks.

  • If anyone wants to do statistical analysis of Pali texts, use the suttacentral/bilara-data repository on GitHub (the content for the Bilara translation webapp).
    • Segmented translations in English are also found there, as well as a growing collection of other languages.
  • For remaining texts, use the html_text directory (html-clean5 branch) of the suttacentral/sc-data repository on GitHub.
  • If you want more precise or specialized information than a regular search engine provides, clone the git repo and search it locally using Sublime text or some other tool.
  • To export texts into a spreadsheet, use Bilara i/o.
  • The main SC search uses Elasticsearch, SC-Voice uses ripgrep, while our translation webapp Bilara uses ArangoDB. All these have advantages and disadvantages, so you may get somewhat different results.
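To make the “clone the repo and search it locally” suggestion concrete, here is a minimal sketch. The file name, segment IDs, and text are invented stand-ins for real bilara-data files (which also carry diacritics, simplified to plain ASCII here):

```shell
# Fake one bilara-data-style segment file; in a real clone the Pali
# root texts live under root/pli/ms/sutta/.
mkdir -p demo/root/pli/ms/sutta
cat > demo/root/pli/ms/sutta/dn1_root-pli-ms.json <<'EOF'
{
  "dn1:1.1.1": "evam me sutam.",
  "dn1:1.1.2": "ekam samayam bhagava antara ca rajagaham"
}
EOF
# grep -rn works everywhere; ripgrep (rg) does the same job faster
# on the full corpus, e.g.: rg "bhagava" root/pli/ms/sutta
grep -rn "bhagava" demo/root/pli/ms/sutta
```

Each hit comes back as file:line:segment, so the segment ID travels with the match.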
7 Likes

This thread might interest you @animitto :slight_smile:

6 Likes

So far I have used the software cst4 (Chaṭṭha Saṅgāyana Tipiṭaka Version 4.0), available on tipitaka.org. It’s not the best of worlds, but it works. Has anyone worked both with cst4 and GitHub and can describe the differences?

What I like about cst4 is that I can work with wildcards/asterisks both for word beginnings and endings, and can search for two terms and define their maximum distance. The search doesn’t cover translations. I don’t think it’s possible to export anything to Excel.

Tools for corpus analysis are getting more and more user friendly by the way.

Check out Orange (open source python software):

There’s also NLTK for python:

https://www.nltk.org/

IMO, the next step of statistical analysis would maybe be to make use of these new tools that are becoming available :slight_smile:

2 Likes

GitHub is just the place where the texts are stored; it’s not an application. Or it is, but not that kind of application.

CST4 or DPR—or for that matter SC—offer an integrated package that facilitates certain kinds of search and analysis. These cover a lot of ordinary use cases and are fine for most people.

However, if you need some kind of specialized analysis not covered by these apps, one approach you could use is to clone the bilara-data repo locally. Then you can search or analyze it with any tool you like. Since it is pure and battle-tested JSON, it is easy to transform into any format, or just treat it as plain text.

For myself, my main tools are Sublime Text, which lets me do rich searching and regular expressions across the whole corpus, or a defined subset; and LibreOffice Calc, where I can import texts via bilara i/o and query or manipulate them in the various ways that a spreadsheet makes possible.

For example, I might want to search for cases where “dhamma” is translated as “thing”. Separate searches for “dhamma” or “thing” would be painful, but using bilara i/o I can do both at once.
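Because root and translation files share segment IDs, this kind of dual search can also be scripted directly. A sketch with invented file contents (bilara i/o itself works differently; this just shows the idea of intersecting the two sides by segment ID):

```shell
# Paired root/translation files keyed by the same segment IDs.
# Contents are made up; one segment per line keeps grep/cut simple.
cat > root-pli.json <<'EOF'
{"mn1:2.1": "sabbadhammamulapariyayam vo, bhikkhave, desessami",
"mn1:3.1": "idha, bhikkhave, assutava puthujjano"}
EOF
cat > translation-en.json <<'EOF'
{"mn1:2.1": "I shall teach you the explanation of the root of all things.",
"mn1:3.1": "Take an unlearned ordinary person"}
EOF
# Grep each side, keep just the segment ID (first quoted field),
# then intersect the two sorted ID lists.
grep 'dhamma' root-pli.json | cut -d'"' -f2 | sort > pli-ids.txt
grep 'thing' translation-en.json | cut -d'"' -f2 | sort > en-ids.txt
comm -12 pli-ids.txt en-ids.txt   # IDs matching on both sides
```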

(Incidentally, this is possible also in the Bilara webapp for translators, and we hope to bring it to SC one day!)

More ambitiously, someone with some basic programming skills can use one of the tools mentioned above by Erik, to which I would add texthero.

More advanced still, neural nets offer new possibilities, as can be seen at Buddhanexus:

(Not to be a party-pooper, but neural nets in their current form have, in my view, over-promised and under-delivered, and we seem to be approaching the diminishing returns phase of their evolution. Still, they may yet make significant contributions to Buddhist studies.)

That’s cool. You can also do this using Sublime Text and regular expressions, but it has more of a learning curve.

5 Likes

Could you please outline what the advantages are of using Sublime Text over CST4? I understand that it can make use of the translations and also generate a result sheet. Maybe you could take a screenshot of a result page?

By the way, here’s a screenshot from CST4:

1 Like

Sorry everyone and @Gillian . I was away. Thank you to @musiko for putting my post into context. Very interesting replies. Thank you all!

2 Likes

Thanks. I used cst4 before. Very interesting tool.

1 Like

Well, for a start, Sublime is cross-platform, so I can actually use it at all, so there’s that!

It’s apples and oranges, but for what it’s worth, I can trivially search for a segment ID and get all the associated information.

Or search for Pali segments that have the phrases “ca” and “nigaṇṭhe”, separated by any number of characters.

Or anything else that a regex can do.
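The “ca … nigaṇṭhe” search is an ordinary regular expression; the same pattern that works in Sublime’s Find in Files works in grep. The sample line below is invented, with diacritics simplified to plain ASCII:

```shell
# Regex: the word "ca", then "nigantha" after any number of
# characters, tried against a made-up Pali line.
line='sace pana tumhe ca nigantha upasankameyyatha'
echo "$line" | grep -E ' ca .*nigantha'
```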

But it has limitations. Searching for multiple words in different segments is not trivial.

But it’s not so much about the specific tools, it’s about the clarity of the underlying data. Our JSON files allow you to convert and process the content of the suttas easily and reliably in ways that are just not possible in an XML-based system like CSCD. Here, try it out. I just ran bilara i/o on DN, and generated a CSV file.

dn.zip (816.3 KB)

Unzip it and open it in a spreadsheet, or import it to a data-table, or open it in a text editor, or run it through an NLP program, or play with it in Airtable, or whatever. The point is, the data is not bound to the application. I’m hoping that, over time, lots of people will take it and use it in fun ways!
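As one illustration of “the data is not bound to the application”: once the export is a CSV, even awk can query it. The column layout below is a guess for illustration, not the actual bilara i/o output format:

```shell
# Invented three-column CSV, roughly the shape a segment export
# might take: segment ID, Pali, English.
cat > dn-sample.csv <<'EOF'
segment_id,pli,en
dn1:1.1.1,evam me sutam,So I have heard.
dn1:1.1.2,ekam samayam,At one time
EOF
# Print the English column for one segment ID.
awk -F, '$1 == "dn1:1.1.2" {print $3}' dn-sample.csv
```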

7 Likes

I think I get a vague idea of the freedom of this approach. Still, I think I speak for many people who are only mildly tech-savvy when I say that it would take time to get into such an application-free data-processing approach. Plus, I know the effect: even if I learned it once, I would probably forget it again if I didn’t use it for a while. I was pretty good with SPSS back in 1997, and in 2003 I had to learn it all over again. Thanks for the examples!

1 Like

Sure. But there are lots of programmers in the world, now they have a new toy to play with!

1 Like

I am quite interested in this topic, discussed a bit here and in a few scattered threads like

and

recently there was

Inspired by this I asked:

and got no on-topic replies, so I thought I would bump this thread. I am going to try to use some of the info gleaned from some of @sujato’s posts in this thread to use

to work out a couple of things that I would have thought would be of quite general interest but which I can’t find any definitive information about, namely;

  1. How many letters long is each of the Nikāyas? This would give some frame to questions of how often a given phrase appears in, say, MN vs AN. Is AN much longer than SN? If so, how much longer? I was surprised that this info is not more easily available.

  2. How many different words are in each of the Nikāyas? I.e. what is the vocabulary of each Nikāya/Vagga/whatever? This is of similar interest, it seems to me, to @Vimala’s word length research.

About 20-odd years ago I was a web developer, hacking together things in ASP and ColdFusion, working with SQL, etc., but that was a long time ago. More recently I got it into my head to try to host a local copy of SuttaCentral and see if I could implement the new search engine idea. Well, just getting Docker and the site to actually run as-is took so much out of me that I have avoided anything technical ever since. But I have licked my wounds enough now and will try to get back to this thread with answers to my two questions and a description of how I did it.

Maybe if other people have pet projects, examples or other threads of interest they could post here and we could get some mutual support going?

Metta

OK, for question one I forked bilara-data, downloaded a zip, extracted the files, and navigated to root/pli/ms/sutta. Then I ran

cat * | wc -w

but SN and AN were in subfolders, so for those I first ran

find . -mindepth 2 -type f -print -exec mv {} . \;
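For what it’s worth, a variant of the same count that avoids moving any files out of their subfolders; the little tree here is a made-up stand-in for the sn/ and an/ layout:

```shell
# Stand-in tree for sn/ and an/, which keep files in subfolders.
mkdir -p demo2/sn/sn1 demo2/an/an1
echo '{"sn1.1:1.1": "evam me sutam"}' > demo2/sn/sn1/sn1.1.json
echo '{"an1.1:1.1": "evam me sutam"}' > demo2/an/an1/an1.1.json
# cat every file found at any depth, then count words in one pass.
find demo2 -type f -name '*.json' -exec cat {} + | wc -w
```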

My results were

DN: 173,906
MN: 294,643
SN: 357,384
AN: 388,889

Hooray!

For question 2 I did the same thing, but with

cat * | sort | uniq -c | wc -l

and got

DN: 16403
MN: 27197
SN: 43464
AN: 41843

Now I have no idea if this is in any way accurate; for starters, I haven’t removed any of the JSON segment numbers, so they will all come up as unique, I guess …
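One wrinkle worth noting: `sort | uniq -c | wc -l` counts unique *lines*, not unique words, on top of the segment-ID problem. A sketch of one way to strip the IDs and split into words before counting, on an invented sample file (real bilara-data values also carry diacritics, which this crude `[:alpha:]` split may handle differently depending on locale):

```shell
# One invented segment file standing in for the corpus.
cat > dn-demo.json <<'EOF'
{"dn1:1.1": "evam me sutam", "dn1:1.2": "evam vutte"}
EOF
# Drop the "id": keys, split on non-letters into one word per line,
# deduplicate, and count the non-empty lines.
sed -E 's/"[a-z0-9.:-]+"[[:space:]]*:[[:space:]]*//g' dn-demo.json \
  | tr -cs '[:alpha:]' '\n' | sort -u | grep -c .
```

Here “evam” appears twice but is counted once, so the pipeline reports 4 distinct words.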

Keep in mind that the concept of letters and words is not easy to map between Pali and English. Perhaps syllables would be more accurate.

For example, “dh” is one letter. Really, “dha” is one letter. I’m not sure what you are trying to prove by counting letters, though.

BTW, you can edit your post to include more information rather than replying to yourself multiple times.

1 Like