SuttaCentral has 30,000,000 words of Dhamma

sujato · February 23, 2015, 2:10am

Each Chinese character was counted as one word (about 9,000,000 in total), and HTML markup was only partly excluded. The count includes both original texts and translations; for comparison, the Pali canon is about 2,000,000 words, of which about half are in the main nikayas.

It’s only a rough estimate, but that’s the ballpark.
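For anyone who wants to try a similar rough count, here is a minimal sketch of the counting rule described above: strip the HTML tags, count each Chinese character as one word, and count whitespace-separated tokens for everything else. This is not the script actually used for the SuttaCentral figure; the function name `rough_word_count`, the helper class, and the choice of Unicode ranges are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Assumption: "Chinese character" means the basic CJK Unified Ideographs
# block plus Extension A. Rarer extension blocks are ignored here.
CJK = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def rough_word_count(html):
    """Rough word count: one word per CJK character, plus one word per
    whitespace-separated token that contains at least one letter."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)

    cjk_words = len(CJK.findall(text))
    # Blank out the CJK characters so they are not counted twice.
    rest = CJK.sub(" ", text)
    other_words = sum(
        1 for token in rest.split() if any(ch.isalpha() for ch in token)
    )
    return cjk_words + other_words


print(rough_word_count("<p>Thus have I heard.</p>"))
print(rough_word_count("<p>如是我聞。</p>"))
```

Note that this only removes tags. Text inside elements such as `<script>` or `<style>`, navigation, and similar page furniture would still be counted, which is one way markup ends up only partly excluded.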

frankk · February 24, 2015, 5:14pm

Sadhu!

I presume the word count is based on the elided “peyyala” versions of the sutta pitaka.

It would be great one of these days to have non-elided versions of the suttas, so keyword searches across the canon could be more accurate.

sujato · February 24, 2015, 10:19pm

Yes, it’s a very rough count, and doesn’t expand the peyyalas. The four nikayas in Pali are around 1 million words, the whole Pali canon about 2 million. By comparison, the Chinese texts that we have, including the Vinaya (but not Abhidhamma, and lacking the Mulasarvastivada Vinaya), make about 9 million, counting each character as a word. Most of the remainder is the translations.

It would indeed be nice to have a fully expanded version, but it’s far from being a trivial exercise.

There are many cases when an expanded text would be obviously worthwhile. Consider, say, the “gradual training” suttas of the first chapter of DN. In these cases, most of the suttas appear in highly truncated form, which is merely a function of the fact that they appear later in the chapter. This gives an unnatural weight to the importance of DN2, where the passage appears in full. In a digital text, there would seem to be no reason not to expand the text fully so that each sutta could be read as it was intended, but freed of the limitations of the manuscript form.

On the other hand, consider the notorious “gangapeyyala” sections in SN. These have, presumably, never been fully written out or recited: they are just too much. They are not really “suttas”, just instantiations of a template. If we did, say, a statistical analysis of how frequently a particular word or idea appeared in the canon, would it be accurate to include these, or would it be preferable to use the canon as it has actually been maintained and used?

These are extreme cases, and the practical solution may be somewhere in between. But whatever approach is taken, the result will be guided by editorial decisions. There is, IMHO, no such thing as the archetypal “perfect” version. But digital texts do allow us to express the underlying text in more flexible ways.
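The statistical question raised above can be made concrete with a toy sketch. It compares term counts for an abbreviated series, where only the first instance is written out, against the same series with every instance of the template expanded. The template, the fillers, and the `...pe...` marker line below are invented placeholders, not canonical text, and the function names are hypothetical.

```python
from collections import Counter


def expand(template, fillers):
    """Instantiate the template once per filler (the fully expanded form)."""
    return [template.format(x=filler) for filler in fillers]


def term_frequency(passages):
    """Count lowercase whitespace-separated terms across all passages."""
    counts = Counter()
    for passage in passages:
        counts.update(passage.lower().split())
    return counts


# Invented placeholder data, for illustration only.
template = "factor {x} leads to the goal"
fillers = ["A", "B", "C"]

# Abbreviated form: the first instance in full, then a marker for the rest.
elided = [template.format(x=fillers[0]), "...pe..."]
expanded = expand(template, fillers)

print(term_frequency(elided)["goal"])
print(term_frequency(expanded)["goal"])
```

With three fillers the expanded count for a template word is three times the abbreviated one; for a long repetition series the multiplier is correspondingly larger. Which figure is the "right" one is exactly the editorial decision discussed above, so a frequency study should at least state which form of the text it counted.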