Towards a complete directory of Dhammapada Parallels

Vimala · February 26, 2020, 12:52pm

There are only a few people in this world who are crazy about Dhammapada parallels. My friend Ven. Anandajoti and I are two of them.

So when Ven. Anandajoti approached me last year to make me aware of the work of Sebastian Nehrdich, I was very enthusiastic because I could see its huge potential for helping us find parallels. Sebastian was using a neural network approach to find similarities within the Sanskrit texts of the Göttingen Register of Electronic Texts in Indian Languages (GRETIL). However, the display of his output was rudimentary and not easy to work with.

I approached Sebastian with the request to run the Pali texts through the neural network. The initial idea was to do this for just the EBTs on SuttaCentral. Sebastian agreed, and Bhante Sujato, Blake and I constructed the necessary input files for the neural network from SuttaCentral’s Bilara and HTML files in combination with SC’s hyphenation code.

To make the output easier to work with, I started building a UI for it, based on SuttaCentral’s system of web components. When I showed this to Sebastian, he became enthusiastic about it, so we decided to join forces. Now, almost a year later, we have a launch date for our new BuddhaNexus website at the end of June, and the project has become a full-fledged 3-tier website, with 4 developers working on it and backed by the Numata Centre of Buddhist Studies at the University of Hamburg. The site hosts all available digitized Buddhist texts in Pali, Tibetan, Chinese and Sanskrit as well as other Sanskrit texts like Vedic and Jaina. And most importantly, it shows the connections between texts.

In the first instance, the site will only be able to show parallels within the same language, but as we have recently been given access to a supercomputer at Hamburg University, we will also be able to run the neural network to find parallels between languages.

In the meantime, Ven. Anandajoti asked for my help in identifying Hindi parallels of the Dhammapada and I decided to test our BuddhaNexus project with this. It was a good test case that allowed me to test the system and identify bugs. After a few hours of work I had identified over 100 new and previously unknown parallels in the Sanskrit texts alone, and the results are listed in the spreadsheet below.

dhpparallels.zip (17.9 KB)

Note that this spreadsheet only lists entries that are not already listed in Ven. Anandajoti’s excellent book on this subject:

Some are listed on SuttaCentral, but most of these are not.

In addition to this, I plan to do the same for the Tibetan, Chinese and Pali texts to identify where Dhammapada parallels occur in the entire corpus.

The pie charts below show that there are a large number of parallels to be found in various texts.

I’ve also added a Sankey chart for the Dhammapada links to other Pali texts:

This small test already shows BuddhaNexus’ enormous potential; it will give researchers worldwide a very powerful tool for finding parallels and it will open up previously unknown possibilities in this field.

Of course none of this would be possible without the help and support of so many people. Bhante Sujato helped to create the Pali input files for the network and supported us with advice. Aminah helped set up a ZenBoard and management system. And not to forget my old friend from the SuttaCentral STXNext team, Hubert, who joined our fixed team of developers with much-needed advice and input on how to handle these huge amounts of data.

I also want to mention here the great work of @lemon, who has been working hard to correct SuttaCentral entries of the Udanavarga and compare these with Chinese and Tibetan texts. This material is extremely valuable for helping us to train our neural network to read and compare different languages, but also for me to identify new Dhammapada parallels.

When the site is launched I will post it here.

Javier · February 26, 2020, 1:29pm

Amazing, this is revolutionary! Sadhu sadhu sadhu !

ERose · February 26, 2020, 2:23pm

Thank you, Venerable. A gift of Dhamma to the world.

in a smaller way, also crazy, with friends, about Dhammapada parallels.

karl_lew · February 26, 2020, 3:18pm

Wowzer!

I’m especially intrigued by the use of Sankey diagrams, which describe flow relationships.

I wonder what the Sankey diagrams for the four pairs of noble ones might look like. In other words, imagine Sankey diagrams that end with each of the four stages. For example, left side could be fetters with nikaya subgroups and right side would be the four pairs. This would show how different suttas address different challenges for each pair.

Matthew333 · February 26, 2020, 3:38pm

Or even features of a stream entrant or an arahanth, it would produce. Great work.

Akaliko · February 27, 2020, 8:02am

Great work Ven @Vimala and beautiful too! Keep going!

Vimala · February 27, 2020, 8:11am

Thank you @Javier. I also feel this will revolutionize the way we look for parallels. Before, we depended on the knowledge of the various languages and texts held by just a few scholars, and it is impossible for anybody to keep all the information of all the collections in mind simultaneously. Moreover, I found that a large amount of such research on parallels is copyrighted or only available in expensive specialized magazines, and not freely available to everybody.

With BuddhaNexus we made it very clear from the start that everything should be open source, including all our input data (which we have in segmented form on our GitHub), the neural network’s output data and the source code for the frontend and backend. The site too should be open for everybody to use.

I will post more about it closer to the time.

Matthew333 · February 27, 2020, 9:56am

This neural network, is it an artificial intelligence or something like that?

Vimala · February 27, 2020, 10:41am

Yes, it’s a form of AI. It uses deep learning, a bit like DeepL translator.
The strategy currently used to calculate parallel passages within monolingual corpora is based on fastText word embeddings, a pooling strategy to represent phrases of a fixed length, and Approximate Nearest Neighbor Search (ANN) to efficiently retrieve possible parallel sequences at the scale of the entire corpus. The fastText word embedding [Bojanowski et al., 2016] is calculated via a lightweight neural network that does not require GPU processing.
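To make the three steps a bit more concrete, here is a minimal sketch in Python. It is not the BuddhaNexus code. The word vectors are a hash-based stand-in for trained fastText vectors, the pooling is a plain average (the actual pooling strategy is not specified here), and the search is an exact brute-force comparison where a real system would use an ANN index. All names and sizes (`toy_word_vector`, `DIM`, `WINDOW`) are assumptions for illustration.

```python
import hashlib
import math

DIM = 16     # toy embedding size (assumption)
WINDOW = 3   # fixed phrase length in tokens (assumption)

def toy_word_vector(word):
    """Deterministic stand-in for a trained fastText vector (hypothetical)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]

def mean_pool(tokens):
    """Represent a phrase as the average of its word vectors."""
    vectors = [toy_word_vector(t) for t in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def windows(tokens, size=WINDOW):
    """All consecutive token windows of a fixed length."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def nearest_window(query_tokens, corpus_tokens):
    """Exact nearest-neighbour search; a real system would use an ANN index."""
    query_vec = mean_pool(query_tokens)
    scored = [(cosine(query_vec, mean_pool(w)), w) for w in windows(corpus_tokens)]
    return max(scored, key=lambda pair: pair[0])

corpus = ("ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati "
          "jetavane anāthapiṇḍikassa ārāme").split()
score, match = nearest_window("bhagavā sāvatthiyaṃ viharati".split(), corpus)
print(match, round(score, 3))
```

With trained embeddings, similar but not identical phrases would also score highly, which is the point of the approach; the hash-based vectors here only recognise identical words.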

Javier · February 27, 2020, 2:57pm

Will the results be added to suttacentral’s list of parallels? Or better yet, is there a way to include a link to the Buddhanexus results from the sutta pages in suttacentral?

Matthew333 · February 27, 2020, 3:42pm

That sounds good!

Vimala · February 28, 2020, 9:32am

This is a complicated question for a couple of reasons.

So far, I had added a few Hindi and Jaina parallels to SuttaCentral. There were only a handful of them then. But with hundreds of new parallels, for a large part in commentarial texts and non-Buddhist texts, the question is what is still useful and wanted.

SuttaCentral’s focus is on the Early Buddhist texts but even within that, not all texts in languages other than Pali are on the site. This is still the aim for the future and will be easier to do once the Bilara system is completely up and running. BuddhaNexus uses segmented JSON files similar to Bilara’s (although slightly different in places) and these are available in our GitHub repo (only Sanskrit still needs a lot of work). Once those texts are added, it will be easier to refer to them with parallels in SC’s system, especially since BuddhaNexus uses the same numbering system for Pali and Chinese texts.

Another difference is that BuddhaNexus does not actually find “parallels”; it finds matches, i.e. similar parts of texts. Of course there is a large overlap, but whether or not a match can be considered a useful parallel is up to the user to determine. BuddhaNexus is just a powerful tool for that.

For instance, the words “evaṃ me sutaṃ” are in many suttas. Technically, the neural network would see this as a match. But of course it is not a parallel. So we have put a cap on the findings: if a match appears 50 times in different texts, it is discarded. In the frontend the default filter for such co-occurrences is 15 (with the option to go up to 50 if you really want to). The enormous amount of data from the neural network therefore has to be filtered, and for this reason every match comes with three values: similarity score in % (how similar the passages are), length in characters, and number of co-occurrences. The user can change these filters to get more or fewer matches.
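As a rough illustration of how the three values work together as filters, here is a small Python sketch. It is not the actual BuddhaNexus frontend or backend code. The `Match` class, its field names and the example data are hypothetical. Only the co-occurrence default of 15 and the cap of 50 come from the description above, and treating the limit as inclusive is my assumption.

```python
from dataclasses import dataclass

HARD_CAP = 50                 # matches above this are never shown
DEFAULT_CO_OCCURRENCES = 15   # default frontend filter

@dataclass
class Match:
    text_id: str         # hypothetical identifier of the matching passage
    similarity: float    # similarity score in percent
    length: int          # length of the match in characters
    co_occurrences: int  # number of texts in which the match appears

def filter_matches(matches, min_similarity=0.0, min_length=0,
                   max_co_occurrences=DEFAULT_CO_OCCURRENCES):
    """Keep matches that pass all three user-adjustable filters."""
    limit = min(max_co_occurrences, HARD_CAP)  # the user cannot exceed the cap
    return [m for m in matches
            if m.similarity >= min_similarity
            and m.length >= min_length
            and m.co_occurrences <= limit]

# Made-up example data, for illustration only.
matches = [
    Match("a", 95.0, 120, 2),   # long, rare match: a likely parallel
    Match("b", 70.0, 40, 12),   # shorter and less similar
    Match("c", 100.0, 14, 48),  # stock phrase found in many texts
]

default = filter_matches(matches)
strict = filter_matches(matches, min_similarity=80.0, min_length=50)
loose = filter_matches(matches, max_co_occurrences=100)
print([m.text_id for m in default])
```

Raising the similarity and length thresholds gives fewer but more reliable matches. Raising the co-occurrence limit lets stock phrases back in.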

An important question that remains is: what exactly is a parallel? For some texts it is of course very clear, but there is also a very large grey area that on SuttaCentral is addressed with the term “resembling parallel”. In practice, going over parallels for SuttaCentral, I found it extremely difficult to determine when something can be considered a parallel and how to note it down, and I found myself constantly having to make choices that I was unsure of. For verses like the Dhammapada it is easy enough: when all 4 lines of the verse are the same, it’s a full parallel; with 2-3 lines it’s resembling. But with prose I found it sometimes very difficult. BuddhaNexus can certainly be a tool to help address such issues, especially in languages I do not understand myself, like Chinese and Tibetan.
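The rule of thumb for four-line verses can be written down as a tiny Python sketch. This only illustrates the rule. It assumes both verses are already split into lines and that “the same” means identical after lowercasing and stripping punctuation, whereas in practice the comparison is a matter of judgement, especially across languages. The function names and the placeholder variant lines are hypothetical.

```python
def normalise(line):
    """Lowercase and strip basic punctuation so identical lines compare equal."""
    for mark in ",;.":
        line = line.replace(mark, " ")
    return " ".join(line.lower().split())

def classify_verse(verse_a, verse_b):
    """Classify two four-line verses by the number of lines they share."""
    lines_b = {normalise(line) for line in verse_b}
    shared = sum(1 for line in verse_a if normalise(line) in lines_b)
    if shared == 4:
        return "full"
    if shared >= 2:
        return "resembling"
    return "none"

dhp5 = [
    "Na hi verena verāni,",
    "sammantīdha kudācanaṃ;",
    "averena ca sammanti,",
    "esa dhammo sanantano.",
]
# Placeholder lines standing in for a verse that shares only two lines.
variant = dhp5[:2] + ["(hypothetical different line)", "(another different line)"]

print(classify_verse(dhp5, dhp5), classify_verse(dhp5, variant))
```

Prose has no such natural unit to count, which is where the similarity score and length of a match become more useful than a fixed rule.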

Vimala · March 13, 2020, 9:31am

Here is the new version of the book we compiled with the above research.