Gretil parallels visualization

Vimala · February 14, 2019, 6:34am

Ven. Anandajoti made me aware of this very interesting project:
http://buddhist-db.de/sanskrit-html/index.html
(I know, the visualization of the tables is dreadful!)

The GitHub repository for it is here:

This repository contains the code and input data for the calculation of possible quotations and similar passages within the GRETIL corpus based on SIF-weighted averages of word vectors. The output is both a set of tables and the visual representation above.
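For readers unfamiliar with the term, here is a minimal sketch of what a SIF-weighted average of word vectors looks like. It is not the repository's code: the function names, the toy two-dimensional vectors, the toy corpus and the smoothing constant `a` are all assumptions made for illustration, and the sketch leaves out the common-component removal step that the full SIF method also applies.

```python
import math
from collections import Counter

def sif_weights(corpus_tokens, a=1e-3):
    """Weight each word by a / (a + p(word)); frequent words get small weights."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

def sif_embedding(tokens, vectors, weights):
    """Weighted average of the word vectors of one segment."""
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    used = 0
    for tok in tokens:
        if tok in vectors and tok in weights:
            for i, x in enumerate(vectors[tok]):
                acc[i] += weights[tok] * x
            used += 1
    return [x / used for x in acc] if used else acc

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus and toy vectors (hypothetical, for illustration only).
corpus = "evam me sutam evam me sutam evam vivicceva kamehi sato".split()
vectors = {
    "evam": [1.0, 0.0], "me": [0.9, 0.1], "sutam": [0.8, 0.2],
    "vivicceva": [0.0, 1.0], "kamehi": [0.1, 0.9], "sato": [0.2, 0.8],
}
weights = sif_weights(corpus)

emb_a = sif_embedding(["evam", "vivicceva", "kamehi"], vectors, weights)
emb_b = sif_embedding(["me", "vivicceva", "kamehi"], vectors, weights)
emb_c = sif_embedding(["evam", "me", "sutam"], vectors, weights)

# Segments a and b share the rarer words, so they end up closer than a and c.
print(cosine(emb_a, emb_b), cosine(emb_a, emb_c))
```

Candidate parallels would then be pairs of segments whose embeddings have a high cosine similarity. Real word vectors would of course come from a model trained on the corpus, not from hand-written numbers like these.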

We can do the same thing for Pali texts, and possibly between Pali and Sanskrit texts. It would show a whole lot of possible parallels and connections between suttas.

This will be of substantial benefit to parallels-research. There is only one very big catch: it needs a whole lot of computer power, far more than I have. I can probably adapt the program to work with the pali sources, but running it would be a very different matter.

So I guess the first step here is to visit this guy in Hamburg.

Another very interesting project this guy has written is an alignment between Sanskrit and Tibetan texts:

Viveka · February 14, 2019, 8:04am

Regarding the image…

Where statistics and Dhamma intersect = beauty

Vimala · March 22, 2019, 8:58am

Just as an update: I’ve been working with the author of this software to develop something similar for the Pali texts. As a test, we have used the segments of the files used by our Pootle translation software for the suttas, the Abhidhamma and part of the Vinaya. Once the Vinaya segmentation is done we can run the program in full. We will also have to come up with a better visualization of the tables.

I do have one question for Bhante @Sujato. While taking some random samples of what the program outputs, I tried to compare these to existing suttas. We are currently trying to tweak the sensitivity of the program so that it does not output too much, by taking out standard passages that appear too often.

In doing so, I found that, for instance, MN10 segment 47.1 is matched with SN 47.1 segment 2.7 in our data. This is correct of course, but it is a rather standard passage that is also found elsewhere, for instance in SN 47.18 segment 1.3 and many other places (32 in total).
So how far do we go in showing such parallels in our data, both for this project and for our parallels-tables on SC?

Sebastian’s suggestion was to limit output to 10 in order to reduce the occurrence of standard passages in our output data, but that would eliminate this parallel, for instance. On the other hand, if we don’t, the dataset becomes completely huge.

Another suggestion is to use word-weights on the parallels. Word-weights are calculated by simply dividing the number of occurrences of a word in the corpus by the total number of words in the corpus, so we get a clear picture of how rare (or frequent) the word is. Based on this we can rank our parallels: if they consist mainly of stock material which occurs frequently, they will be ranked lower, and we might introduce a threshold to filter out parallels that consist only of very frequent words. Of course this is not so easy to do.
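A rough sketch of how such a word-weight ranking could look, just to make the idea concrete. Everything here is an assumption for illustration: the function names, the toy corpus, the candidate labels and the threshold value are made up, and taking the mean frequency of the words in a parallel is only one of several ways to score it.

```python
from collections import Counter

def word_frequencies(corpus_tokens):
    """Occurrences of each word divided by the total number of words."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def stock_score(tokens, freqs):
    """Mean word frequency of a passage; higher means more stock material."""
    if not tokens:
        return 0.0
    return sum(freqs.get(t, 0.0) for t in tokens) / len(tokens)

def rank_parallels(parallels, freqs, max_score=None):
    """Sort candidates from rarest to most frequent wording.

    If max_score is given, drop candidates whose score exceeds it.
    """
    scored = [(stock_score(tokens, freqs), label) for label, tokens in parallels]
    if max_score is not None:
        scored = [item for item in scored if item[0] <= max_score]
    return [label for _, label in sorted(scored)]

# Toy corpus and toy candidates (hypothetical, not real output of the program).
corpus = ("bhikkhave bhikkhave bhikkhave bhikkhave "
          "viharati viharati viharati kaye kaye ekayano maggo").split()
freqs = word_frequencies(corpus)

parallels = [
    ("stock-heavy", ["bhikkhave", "viharati", "kaye"]),
    ("rare-heavy", ["ekayano", "maggo", "bhikkhave"]),
]

ranked = rank_parallels(parallels, freqs)
filtered = rank_parallels(parallels, freqs, max_score=0.2)
print(ranked)    # candidates with rarer wording come first
print(filtered)  # candidates made up mostly of very frequent words are dropped
```

Where to put the threshold is exactly the hard part: too low and genuine parallels built from common vocabulary disappear, too high and the stock passages flood the tables again.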

So any feedback on this will be very much appreciated.

sujato · March 25, 2019, 11:44am

Oh, crikey, that is not an easy problem at all.

The flip side is just as relevant: passages may be abbreviated, hence only implied. That gives an over-weighting to certain things. For example, vivicceva kamehi appears at the start of the jhana formula, and it is going to appear much more commonly than, say, sato ca sampajano from the third jhana, but in principle both should appear the same number of times. Perhaps a neural net might be able to expand these.

Now, on the one hand, we might think that having lots of repetitious pericopes is wasteful and clogs up the data. Do we really need to expand all the ganga peyyala sections? On the other hand, the very existence of the pericopes does say something: clearly these were teachings regarded as important.

It is not really possible to draw clear-cut boundaries, but as a general rule it takes more than just a shared pericope to justify being listed as a parallel. There has to be something about the texts as a whole that suggests they are the same “thing”. In longer texts this is obvious, but in the shorter SN and AN texts it trails off to become arbitrary, and the most objective thing we can do is say that certain scholars considered them as parallels.

I suspect that there is no one right approach to this. Probably we will have to try some different methods and see what happens.