Statistical analysis of Early Buddhist Texts

I am quite interested in this topic, discussed a bit here and in a few scattered threads like


Recently there was

Inspired by this I asked:

and got no on-topic replies, so I thought I would bump this thread. I am going to try to apply some of the info gleaned from @sujato's posts in this thread, using

to work out a couple of things that I would have thought were of quite general interest, but about which I can’t find any definitive information, namely:

  1. How many letters long is each of the Nikayas? This would give some frame to questions of how often a given phrase appears in, say, MN vs AN. Is AN much longer than SN? If so, how much longer? I was surprised that this info is not more easily available.

  2. How many different words are in each of the Nikayas? I.e. what is the vocabulary of each Nikaya/Vagga/whatever? This seems to me to be of similar interest to @Vimala's word length research.
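To make the two questions concrete, here is a rough Python sketch of how both numbers could be computed for one chunk of text. It is only an illustration under my own assumptions, not anything taken from SuttaCentral: the function name `text_stats` is made up, a "letter" here means a Unicode letter (so aspirates like "kh" count as two), and a "word" is any unbroken run of letters after lowercasing, which means sandhi forms and quote-mark splits like "’ti" are counted as they appear in the text.

```python
import re
import unicodedata
from collections import Counter

# Runs of word characters, excluding digits and underscores.
WORD_RE = re.compile(r"[^\W\d_]+")


def text_stats(text):
    """Return letter count, word count and vocabulary size for a text.

    Hypothetical helper for illustration only.
    """
    # NFC makes sure letters with diacritics (e.g. ṁ, ā) are one code point,
    # so they are counted as a single letter.
    text = unicodedata.normalize("NFC", text).lower()
    words = WORD_RE.findall(text)
    counts = Counter(words)
    return {
        "letters": sum(len(word) for word in words),
        "words": len(words),
        "vocabulary": len(counts),
    }


if __name__ == "__main__":
    sample = "Eva\u1e41 me suta\u1e41. Eva\u1e41 me!"
    print(text_stats(sample))
```

Running the same function over the full text of each Nikaya, however that text ends up being obtained, would give the length and vocabulary figures side by side, and the `Counter` could be kept around to compare how often a given word appears in each.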

About 20-odd years ago I was a web developer, hacking together things in ASP and ColdFusion, working with SQL, etc., but that was a long time ago. More recently I got it into my head to try to host a local copy of SuttaCentral and see if I could implement the new search engine idea. Just getting Docker and the site to actually run as is took so much out of me that I have avoided anything technical ever since. But I have licked my wounds enough now, and will try to get back to this thread with answers to my 2 questions and a description of how I did it.

Maybe if other people have pet projects, examples or other threads of interest they could post here and we could get some mutual support going?
