Yet another app for studying the suttas

georg · February 27, 2021, 10:09pm

Dear Venerable @sujato ,

I have been working this weekend on an experimental app that loads your translations and the Pali text from *.po files that I found in sc-data-master. Would it be okay to copy your files into the repo of the app (https://github.com/gboenn/suttas_text_and_audio)? If not, I will take them out and just put a link in the readme. Also, may I ask what happened to these files? I can no longer find them on GitHub.
It’s still very early days and my aim is to study for personal use and also to learn a bit more about text processing along the way. As a visual person I also enjoy exploring the imagery and art associated with early Buddhism.
I managed to parse the files and use p5js to display and p5js speech for text-to-speech. This is a sample screenshot:

Here is the link:
https://github.com/gboenn/suttas_text_and_audio

sujato · February 27, 2021, 10:20pm

Hi Georg, this sounds great.

But do not use PO files! These are deprecated; we do not use them any more and they will never be updated.

We are using JSON for our texts, I am sure you will find these much nicer to work with. The source for all texts is now in bilara-data:

You are most welcome to use this in any way you like. I would suggest that you fork the repo for yourself, and update it from time to time.
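For anyone following along, here is a minimal Python sketch of reading such files. It assumes each file is a flat JSON object that maps a segment key (such as "mn21:1.1") to a string, and that translation files end in `_translation-en-sujato.json`, as in the file names quoted later in this thread. The function names `load_segments` and `load_tree` are made up for illustration and are not part of any SuttaCentral tool.

```python
import json
from pathlib import Path


def load_segments(path):
    """Read one JSON file into a dict of segment key -> text."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_tree(root_dir, pattern="*_translation-en-sujato.json"):
    """Merge every file under root_dir whose name matches pattern.

    The default pattern is an assumption based on the file names
    shown in this thread; pass another pattern for other texts.
    """
    segments = {}
    for path in sorted(Path(root_dir).rglob(pattern)):
        segments.update(load_segments(path))
    return segments
```

Calling `load_tree` on a local fork of the data would give one dictionary for the whole translation, which is convenient for searching.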

@karl_lew may be interested in this too.

georg · February 28, 2021, 12:16am

Thank you, Venerable, for allowing me to use your translations. I will use the JSON files instead. Makes a lot of sense. There was an older thread on D&D about creating an API, but I could not find any further info. Thanks for the link.

With metta

karl_lew · February 28, 2021, 12:50am

Georg, as Venerable Sujato mentions, the JSON files are in bilara-data. Voice.suttacentral.net itself uses these JSON files to create audio files for personal study. The Voice website is designed for accessibility. My sight is quite impaired and I can no longer read print. Hearing the suttas has therefore become quite important to me.

I see from your website that you’re interested in voice navigation. So am I. As my ability to see text on a display lessens, it will become quite helpful to me to have a voice navigator for Voice itself. I use Google Home and it is quite helpful. Surely there must be a way for us to connect a voice navigator to Voice to hear Ven. Sujato’s translations?

georg · February 28, 2021, 3:50am

Hi Karl, the people who created p5.js-speech have done some research projects on accessibility. Of course, it’s Google, so everything one says after allowing the browser to access the microphone goes into some cloud somewhere, and who knows who is listening. But perhaps they might even benefit from hearing some dhamma snippets once in a while.
It is impressive how well it works. I am of course, more than happy to share code, and once I have something going I will let you know. There is this demo too:

karl_lew · February 28, 2021, 2:09pm

Yes. This is exactly what we need. I’ve been quite impressed with Google voice recognition and use it daily. I’ve added an issue for us to incorporate p5.speech into Voice.

Voice currently uses AWS Polly for TTS. But AWS Polly doesn’t have a Sinhala voice. Google does have a Sinhala voice, so we may be able to offer Voice listeners Sinhala as well in the future via p5.speech.

Oh, and if you could wrap up your wonderful image library as an npm library, then we could add it to Voice as well.

georg · February 28, 2021, 8:09pm

yes, that would be wonderful.

Please feel free to copy them. They are not mine, but they have a CC license. For now I’m only using the Jetavana image, which I have attributed with a link in the readme. (The person who took the photo used the tag ‘myself’, but it’s not me. I guess the pun was intended.)

BTW, I’ve rewritten my code to make use of the bilara JSON files. So much easier to work with.

sabbamitta · February 28, 2021, 9:04pm

The actual problem with your words being stored in a cloud is not so much that someone listens to what you say. It’s rather that training voice recognition systems will allow for using people’s voice to create fake recordings of things they never said and never would say. Yet, it’s their voice and it sounds 100% authentic.

This is why I am a bit apprehensive of using this sort of technology, although I also clearly see the advantages it has, especially to remove accessibility obstacles.

karl_lew · February 28, 2021, 10:35pm

Indeed.

But even with bilara-data, libraries can be helpful. Voice uses scv-bilara for all operations on bilara-data. It even has a search that searches multiple languages and provides a relevance score. SC itself uses ArangoDB for searching stuff, so that is another approach. If you have any questions, just let us know…

Thanks for the pictures!

georg · March 1, 2021, 4:38am

Hi Sabbamitta.

There are many valid concerns one can have with modern communication systems. For example, what happens to what we post on youtube, say in zoom meetings etc.? It all has valuable voice data stored on a server that can be used to train a machine-learning system.

Thanks, I agree.

georg · March 22, 2021, 12:14am

Hi Karl,

Much mudita for all your work on Voice, and I’m impressed by the search tool in scv-bilara.
Given all the wonderful new developments of SC, I am very humbled to give you an update from my end.

I have written a Python script to implement a search of bilara for my own local study and for getting an idea about how voice navigation could work in the future. The script is named text_analysis.py (link: text_analysis.py) and here are the features:

Searching arbitrary English words or sentences in the entire translation of the Tipiṭaka by Ven. Sujato.

For example, the shell command

python3.7 text_analysis.py "the five grasping aggregates"

will return an exhaustive list of results, as follows:

A. Returning the verse where the search string has been found together with its key, as in the output from the above:

"an10.60:4.4": "And so they meditate observing impermanence in the five grasping aggregates. ",

B. Returning underneath the corresponding Pali verse from the root text of the above:

an10.60:4.4: Iti imesu pañcasu upādānakkhandhesu aniccānupassī viharati.

C. Returns a list of all sutta numbers where the search string has been found

D. Returns an ordered list of suttas where the words have been found most frequently, as from the above example:

[(6, ‘sn22.122’), (6, ‘dn22’), (5, ‘sn22.89’), (5, ‘sn22.123’), (4, ‘sn22.82’), (4, ‘mn109’), (3, ‘sn22.48’), (3, ‘mn44’), (3, ‘mn149’), (3, ‘mn141’), (3, ‘mn10’), (2, ‘sn4.16’), (2, ‘sn22.100’), (2, ‘mn28’), (2, ‘mn23’), (2, ‘mn122’), (2, ‘an8.2’), (2, ‘an4.90’), (1, ‘sn56.11’), (1, ‘sn46.30’), (1, ‘sn45.178’), (1, ‘sn45.159’), (1, ‘sn35.245’), (1, ‘sn35.238’), (1, ‘sn22.79’), (1, ‘sn22.47’), (1, ‘sn22.105’), (1, ‘sn22.104’), (1, ‘sn22.103’), (1, ‘mn9’), (1, ‘mn75’), (1, ‘mn151’), (1, ‘mn112’), (1, ‘dn34’), (1, ‘dn33’), (1, ‘dn14’), (1, ‘an9.66’), (1, ‘an6.63’), (1, ‘an5.30’), (1, ‘an4.41’), (1, ‘an3.61’), (1, ‘an10.60’)]

E. Analyzes and orders output according to the frequency of the next following word after the search string:

[[‘’, 50], [‘are’, 11], [‘in’, 8], [‘I’m’, 3], [‘as’, 3], [‘for’, 3], [‘.’’, 1], [‘?’’, 1], [‘is’, 1], [‘that’, 1], [‘’’, 1]]
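Features A to D can be sketched roughly as follows. This is not the code of text_analysis.py, only an illustration under the assumption that the translation and the root text have each been loaded into a dictionary keyed by segment id. The names `search` and `rank_suttas` are hypothetical, and the two sample entries are copied from the outputs quoted in this post.

```python
from collections import Counter

# Tiny stand-in for the full dictionaries (entries taken from this post)
translation = {
    "an10.60:4.4": "And so they meditate observing impermanence in the five grasping aggregates. ",
    "dn1:0.2": "The Prime Net ",
}
root = {
    "an10.60:4.4": "Iti imesu pañcasu upādānakkhandhesu aniccānupassī viharati.",
    "dn1:0.2": "Brahmajālasutta",
}


def search(phrase, translation, root):
    """Return (key, translated text, root text) for each segment containing phrase."""
    return [
        (key, text, root.get(key, ""))
        for key, text in translation.items()
        if phrase in text
    ]


def rank_suttas(hits):
    """Count hits per sutta (the part of the key before ':'), most frequent first."""
    counts = Counter(key.split(":")[0] for key, _, _ in hits)
    return counts.most_common()


hits = search("the five grasping aggregates", translation, root)
for key, english, pali in hits:
    print(f"{key}: {english}\n{key}: {pali}")
print(rank_suttas(hits))
```

Note that `rank_suttas` returns (sutta, count) pairs, the reverse of the (count, sutta) order in the output shown above.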

Another example for this feature: Finding the names of all Venerables and their frequency is easy:

python3.7 text_analysis.py "Venerable"

returns

[[‘Ānanda’, 303], [‘Sāriputta’, 207], [‘sir’, 105], [‘Mahāmoggallāna’, 63], [‘Anuruddha’, 49], [‘Mahākaccāna’, 32], [‘Rādha’, 29], [‘’, 25], [‘Udāyī’, 25], [‘Mahākassapa’, 24], [‘Channa’, 20], [‘Vaṅgīsa’, 19], [‘Bhāradvāja’, 16], [‘Rāhula’, 15], [‘s’, 15], [‘Mahākoṭṭhita’, 14], [‘Nārada’, 14], [‘Kassapa’, 13], [‘Bakkula’, 12], [‘Dhammika’, 11], [‘Mahācunda’, 11], [‘Samiddhi’, 11], [‘Upavāṇa’, 11], [‘Anurādha’, 9], [‘Phagguṇa’, 9], [‘Khemaka’, 8], [‘Meghiya’, 8], [‘Aṅgulimāla’, 7], [‘Isidatta’, 7], [‘Māluṅkyaputta’, 7], [‘Puṇṇa’, 7], [‘Bāhiya’, 6], [‘Mahākappina’, 6], [‘Nandaka’, 6], [‘Uttara’, 6], [‘Vakkali’, 6], [‘Bhaddiya’, 5], [‘Citta’, 5], [‘Kimbila’, 5], [‘Koṇḍañña’, 5], [‘Kāmabhū’, 5], [‘Raṭṭhapāla’, 5], [‘Sāriputta’s’, 5], [‘Upāli’, 5], [‘Assaji’, 4], [‘Bhūmija’, 4], [‘Mahaka’, 4], [‘Moggallāna’, 4], [‘Nanda’, 4], [‘Nāgita’, 4], [‘Uttiya’, 4], [‘Visākha’, 4], [‘Abhiya’, 3], [‘Bhadda’, 3], [‘Bhaddāli’, 3], [‘Gavampati’, 3], [‘Girimānanda’, 3], [‘Godhika’, 3], [‘Gotama’, 3], [‘Kassapagotta’, 3], [‘Lomasakaṅgiya’, 3], [‘Migajāla’, 3], [‘Nāgadatta’, 3], [‘Nāgasamāla’, 3], [‘Phagguna’, 3], [‘Piṇḍola’, 3], [‘Pukkusāti’, 3], [‘Soṇa’, 3], [‘Upasena’, 3], [‘Vacchagotta’, 3], [‘Bhaddaji’, 2], [‘Brahmadeva’, 2], [‘Candikāputta’, 2], [‘Cundaka’, 2], [‘Gavesī’, 2], [‘Godatta’, 2], [‘Kappa’, 2], [‘Khema’, 2], [‘Kosiya’, 2], [‘Lomasavaṅgīsa’, 2], [‘Mahāmoggallāna’s’, 2], [‘Māgaṇḍiya’, 2], [‘Nanda—who’, 2], [‘Pilindavaccha’, 2], [‘Puṇṇiya’, 2], [‘Revata’, 2], [‘Sandha’, 2], [‘Saviṭṭha’, 2], [‘Saṅgāmaji’, 2], [‘Sela’, 2], [‘Seniya’, 2], [‘Subhadda’, 2], [‘Subhūti’, 2], [‘Surādha’, 2], [‘Susīma’, 2], [‘Tissa’, 2], [‘Udena’, 2], [‘Vidhura’, 2], [‘Yasoja’, 2], [‘Ambaṭṭha’, 1], [‘Anuruddha’s’, 1], [‘Ariṭṭha’, 1], [‘Bhagu’, 1], [‘Brahmadeva’s’, 1], [‘Bāhuna’, 1], [‘Cetaka’, 1], [‘Cūḷapanthaka’, 1], [‘Dabba’, 1], [‘Dāsaka’, 1], [‘Gotama.’’, 1], [‘Gotama?’’, 1], [‘Isidāsī’, 1], [‘Kaccāna’, 1], [‘Kaccānagotta’, 1], [‘Lakkhaṇa’, 1], [‘Mahākaccāna’s’, 1], [‘Mogharāja’, 1], [‘Musila’, 1], [‘Māluṅkya’, 1], [‘Nanda—the’, 1], [‘Nigrodhakappa’, 1], [‘Raṭṭhapāla’s’, 1], [‘Sabhiya’, 1], [‘Sañjīva’, 1], [‘Senior’, 1], [‘Sujāta’, 1], [‘Tissa—the’, 1], [‘Tāḷapuṭa’, 1]]
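The next-word count of feature E could look something like the sketch below. It is only an approximation: the function name `next_word_counts` is hypothetical, the sample sentences are invented for the example, and the handling of punctuation is simpler than whatever produced the lists above (here the empty string is counted only when nothing follows the search string).

```python
from collections import Counter


def next_word_counts(phrase, texts):
    """Count the whitespace-delimited token that follows each occurrence of phrase."""
    counts = Counter()
    for text in texts:
        start = 0
        while True:
            idx = text.find(phrase, start)
            if idx == -1:
                break
            rest = text[idx + len(phrase):].split()
            # Use '' when the phrase ends the segment
            counts[rest[0] if rest else ""] += 1
            start = idx + len(phrase)
    return counts.most_common()


# Invented sample sentences, not quotations from the translation
sample = [
    "Venerable Ānanda said to Venerable Sāriputta",
    "Then Venerable Ānanda left",
    "Yes, Venerable",
]
print(next_word_counts("Venerable", sample))
```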

Furthermore, it returns all verses associated with a key, for example:

python3.7 text_analysis.py "mn21"

"mn21:0.1": "Middle Discourses 21 ",

mn21:0.1: Majjhima Nikāya 21

"mn21:0.2": "The Simile of the Saw ",

mn21:0.2: Kakacūpamasutta

"mn21:1.1": "So I have heard. ",

mn21:1.1: Evaṁ me sutaṁ—

and so on for the rest of the sutta.

Returns a specific verse from a key:

python3.7 text_analysis.py "dn1:0.2"

"dn1:0.2": "The Prime Net ",

dn1:0.2: Brahmajālasutta
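The two kinds of key lookup above might be handled by a single function along these lines. This is a sketch, not the actual script: `lookup` is a made-up name, and the "mn2:0.1" entry is a placeholder added only to show why the colon matters when matching a whole sutta.

```python
segments = {
    "mn21:0.1": "Middle Discourses 21 ",
    "mn21:0.2": "The Simile of the Saw ",
    "mn21:1.1": "So I have heard. ",
    "dn1:0.2": "The Prime Net ",
    "mn2:0.1": "(placeholder text)",  # invented entry for the prefix check
}


def lookup(key, segments):
    """Return one segment for a full key ("dn1:0.2") or all segments of a sutta ("mn21")."""
    if ":" in key:
        return {key: segments[key]} if key in segments else {}
    # Appending ':' stops "mn2" from also matching "mn21:..."
    prefix = key + ":"
    return {k: v for k, v in segments.items() if k.startswith(prefix)}


print(lookup("mn21", segments))
print(lookup("dn1:0.2", segments))
```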

My idea is to let the program speak at least some of those search results given a voice nav input.

Finally, sadhu to everyone for making this resource available.

karl_lew · March 23, 2021, 12:34pm

It’s fascinating that Ananda is mentioned more than Sariputta and Mahamoggallana combined!

Sadly, my own favorite Dhammadinna is not in the list.

Congratulations! You have seen and implemented the heart of scv-bilara in Python!

And as you’ve surmised, the key may actually be the use of recurring phrases such as “the five grasping aggregates”. These recurring phrases bind the Dhamma together and allow us to bridge meaning, giving us paths to explore the Dhamma. If we memorized these phrases ourselves then all we would need to do is speak a phrase to review and study the Dhamma.

What do you think?

georg · March 23, 2021, 4:00pm

One has to be careful when interpreting the search results. Ven. Dhammadinnā is missing from the list because her name apparently never follows the word ‘Venerable’; instead she is always referred to as ‘bhikkhuni’, which Ajahn Sujato translates as ‘nun’. Therefore, if one searches for the word ‘ nun ’ (note the spaces), one finds the names of all bhikkhunis who are mentioned in this way in Ajahn’s translation:

[[‘who’, 37], [‘with’, 28], [‘disciples’, 12], [‘named’, 8], [‘Khemā’, 7], [‘Kajaṅgalikā’, 5], [‘Kisāgotamī’, 5], [‘’, 4], [‘Dhammadinnā’, 4], [‘Selā’, 4], [‘Somā’, 4], [‘Uppalavaṇṇā’, 4], [‘Vajirā’, 4], [‘Vijayā’, 4], [‘lives’, 4], [‘Āḷavikā’, 4], [‘Nandā’, 3], [‘Thullanandā’, 3], [‘flying’, 3], [‘Cālā’, 2], [‘Sukkā’, 2], [‘Sīsupacālā’, 2], [‘Thullatissā’, 2], [‘Upacālā’, 2], [‘has’, 2], [‘should’, 2], [‘would’, 2], [‘Cīrā’, 1], [‘Jaṭilagāhikā’, 1], [‘Jentā’, 1], [‘Muttā’, 1], [‘Puṇṇā’, 1], [‘Subhā’, 1], [‘Sumedhā’, 1], [‘accomplished’, 1], [‘addressed’, 1], [‘asked’, 1], [‘disciple’, 1], [‘in’, 1], [‘or’, 1], [‘out’, 1], [‘rose’, 1], [‘saw’, 1], [‘she’, 1], [‘was’, 1]]

If one searches for ‘nun named’ one finds also Ven. Asokā.

I suppose one could implement certain filters (e.g. stop words) to refine the searches.

That is the idea that I have in mind.
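As a rough illustration of the stop-word filter mentioned above, here is a small sketch. The stop list and the "starts with a capital letter" test are both assumptions made for this example, and `filter_names` is a hypothetical name. The sample pairs are a subset of the ‘ nun ’ output quoted above.

```python
# Assumed stop list for illustration; a real one would need tuning
STOP_WORDS = {"senior", "sir", "the"}


def filter_names(pairs, stop_words=STOP_WORDS):
    """Keep (word, count) pairs whose word is capitalised and not a stop word."""
    return [
        (word, count)
        for word, count in pairs
        if word[:1].isupper() and word.lower() not in stop_words
    ]


sample = [
    ("who", 37), ("with", 28), ("disciples", 12), ("named", 8),
    ("Khemā", 7), ("", 4), ("Dhammadinnā", 4), ("lives", 4), ("Āḷavikā", 4),
]
print(filter_names(sample))
```

A heuristic like this would still let through capitalised words that are not names, so the result would need checking by hand.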

georg · March 23, 2021, 4:19pm

Sorry, I forgot to mention that if anyone on this forum, especially the Venerables, finds any error or mistake in what I discussed here with Karl, please point it out to me. Thank you.

karl_lew · March 23, 2021, 4:28pm

Excellent.

We are currently generating this list by hand for all segmented languages. If an automated way existed to pre-populate lists of key phrases, it would be very helpful to Voice and similar applications.

For such automation, consistency of translation is absolutely critical. For this reason, we have restricted our own searches to the segmented translations maintained with Bilara. Bilara suggests previously used phrases for translators. In this manner, Bilara itself encourages the translator’s use of key phrases.

georg · March 23, 2021, 5:02pm

I’d be happy to think about this.
One way to start might be to take one of the great summaries of the teaching, e.g. MN117, or MN44 that you mentioned earlier, and compile a list of key words from there.
This sounds a bit like you are creating a concordance. Is this the next step of development?

karl_lew · March 23, 2021, 5:14pm

It’s not exactly a concordance, which is word-based.

Instead, we’re developing lists of examples that resolve to at most about 20-30 suttas when using ripgrep. Ripgrep is so fast that there’s no need for a concordance. We simply have a list of examples to present to the user (as you’ve seen in Voice).

Consider “root of suffering”, which occurs in only 8 documents (suttas:7, vinaya:1):

translation/en/sujato/sutta/dn/dn16_translation-en-sujato.json:1
translation/en/sujato/sutta/mn/mn105_translation-en-sujato.json:3
translation/en/sujato/sutta/mn/mn1_translation-en-sujato.json:2
translation/en/sujato/sutta/mn/mn66_translation-en-sujato.json:1
translation/en/sujato/sutta/mn/mn116_translation-en-sujato.json:1
translation/en/brahmali/vinaya/pli-tv-kd/pli-tv-kd6_translation-en-brahmali.json:1
translation/en/sujato/sutta/sn/sn42/sn42.11_translation-en-sujato.json:5
translation/en/sujato/sutta/sn/sn56/sn56.21_translation-en-sujato.json:1
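For readers who prefer Python to ripgrep, per-document counts like those above can be approximated as follows. This is not how Voice does it. The sketch assumes each document has already been loaded as a dictionary of segments, and it counts matching segments rather than matching lines. The document names and texts are invented sample data, and `document_counts` is a hypothetical name.

```python
def document_counts(phrase, documents):
    """Map each document name to the number of its segments that contain phrase."""
    counts = {}
    for name, segments in documents.items():
        n = sum(phrase in text for text in segments.values())
        if n:
            counts[name] = n
    return counts


# Invented sample data: document name -> {segment key: text}
documents = {
    "doc_a.json": {"a:1": "This is the root of suffering.", "a:2": "The root of suffering is craving."},
    "doc_b.json": {"b:1": "They sat at the root of a tree."},
    "doc_c.json": {"c:1": "What is the root of suffering?"},
}
counts = document_counts("root of suffering", documents)
print(counts, len(counts))
```

The length of the returned dictionary is the number of documents in which the phrase occurs, which is the figure used to judge how specific an example is.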

Examples are generated by trial and error. As we study and translate the suttas, we notice key phrases. Then we verify their specificity using ripgrep. Examples found in far fewer than 50 suttas are a good starting point for human editorial review. So the open research issue that suggests itself is this:

Is it possible to automate the discovery of new examples?

We have found examples to be absolutely critical as a mnemonic basis for studying the suttas. People have trouble remembering sutta identifiers. But everybody remembers the root of suffering.
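One naive starting point for that question, offered only as a sketch, is to collect every n-word phrase and keep those that occur in a limited number of documents. It does not judge whether a phrase is meaningful, so it would only produce candidates for human review. The function name, the thresholds `min_docs` and `max_docs`, and the sample data are all assumptions for this example.

```python
import re
from collections import defaultdict


def candidate_phrases(documents, n=3, min_docs=2, max_docs=50):
    """Return sorted (phrase, document count) pairs for n-word phrases
    found in at least min_docs and at most max_docs documents."""
    seen_in = defaultdict(set)
    for doc_id, segments in documents.items():
        for text in segments.values():
            words = re.findall(r"\w+", text.lower())
            for i in range(len(words) - n + 1):
                seen_in[" ".join(words[i:i + n])].add(doc_id)
    return sorted(
        (phrase, len(docs))
        for phrase, docs in seen_in.items()
        if min_docs <= len(docs) <= max_docs
    )


# Invented sample data: document id -> {segment key: text}
documents = {
    "a": {"a:1": "the root of suffering is craving"},
    "b": {"b:1": "This is the root of suffering."},
    "c": {"c:1": "at the root of a tree"},
}
print(candidate_phrases(documents, n=3, min_docs=2, max_docs=2))
```

In this toy data "the root of" appears in all three documents and is dropped by `max_docs=2`, while "root of suffering" survives as a candidate.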

sabbamitta · March 23, 2021, 5:34pm

I am not so convinced. I suppose a human review will always be required.

karl_lew · March 23, 2021, 5:52pm

Yes. And we need help finding examples. I believe that Georg may be able to help us.

georg · March 23, 2021, 6:33pm

Now I understand. It is possible to get the next word’s frequency with my script. My intention was to create something like a tree from a single word or phrase, thinking of auto-completion. For example, beginning a search for ‘root’ leads to ‘root of’ (234 matches) and then branches out to ‘root of suffering’ (14 matches), ‘root of arguments’ (12 matches), ‘root of misery’ (8 matches), ‘root of craving’ (4 matches), and ‘root of rebirth’ (2 matches), but also to the much more frequent phrase ‘root of a’ (162 matches, root of a tree for example). The key is to make informed decisions about what to filter and when. If one filters out the word ‘of’ too early, then the tree behind ‘root of’ is not found. I have seen libraries for detecting ‘parts of speech’ in text (noun, verb, etc. for the English language) that might be useful here.
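To make the branching idea concrete, here is a small sketch of one expansion step. It is not the code of my script: `tokenize` and `branches` are names chosen for this example, the sample sentences are invented, and matching is done on whole lower-cased words, so the counts it gives would not necessarily agree with the figures above.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def branches(prefix, texts):
    """Count each phrase formed by prefix plus the word that follows it."""
    head = tokenize(prefix)
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        for i in range(len(words) - len(head)):
            if words[i:i + len(head)] == head:
                counts[" ".join(head + [words[i + len(head)]])] += 1
    return counts


# Invented sample sentences
texts = [
    "the root of suffering",
    "at the root of a tree",
    "the root of suffering is craving",
    "deeply rooted",
]
print(branches("root", texts).most_common())
print(branches("root of", texts).most_common())
```

Calling `branches` again on each result grows the tree one level at a time, and nothing is filtered out until the caller decides to, so ‘of’ stays available as a branch.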