Yet another app for studying the suttas

Dear Venerable @sujato ,

I have been working this weekend on an experimental app that loads your translations and the Pali text from *.po files that I found in sc-data-master. Would it be okay to copy your files into the repo of the app (https://github.com/gboenn/suttas_text_and_audio)? If not, I will remove them and just put a link in the readme. Also, may I ask what happened to these files? I could not find them on GitHub anymore.
It’s still very early days and my aim is to study for personal use and also to learn a bit more about text processing along the way. As a visual person I also enjoy exploring the imagery and art associated with early Buddhism.
I managed to parse the files, and I use p5.js for display and p5.js-speech for text-to-speech. This is a sample screenshot:

Here is the link:
https://github.com/gboenn/suttas_text_and_audio

5 Likes

Hi Georg this sounds great.

But do not use PO files! They are deprecated: we no longer use them and they will never be updated.

We are using JSON for our texts, I am sure you will find these much nicer to work with. The source for all texts is now in bilara-data:

You are most welcome to use this in any way you like. I would suggest that you fork the repo for yourself, and update it from time to time.

@karl_lew may be interested in this too.

7 Likes

Thank you, Venerable, for allowing me to use your translations. I will use the JSON files instead. That makes a lot of sense. There was an older thread on D&D about creating an API, but I could not find any further info. Thanks for the link.

With metta
:pray:

4 Likes

Georg, as Venerable Sujato mentions, the JSON files are in bilara-data. Voice.suttacentral.net itself uses these JSON files to create audio files for personal study. The Voice website is designed for accessibility. My sight is quite impaired and I can no longer read print. Hearing the suttas has therefore become quite important to me.

I see from your website that you’re interested in voice navigation. So am I. As my ability to see text on a display lessens, it will become quite helpful to me to have a voice navigator for Voice itself. I use Google Home and it is quite helpful. Surely there must be a way for us to connect a voice navigator to Voice to hear Ven. Sujato’s translations?

7 Likes

Hi Karl, the people who created p5.js-speech have done some research projects on accessibility. Of course, it’s Google, so everything one says after allowing the browser to access the microphone will go into some cloud somewhere, and who knows who’s listening. But perhaps they might even benefit from hearing some dhamma snippets once in a while :grinning:
It is impressive how well it works. I am, of course, more than happy to share code, and once I have something going I will let you know. There is this demo too:

5 Likes

Yes. This is exactly what we need. I’ve been quite impressed with Google voice recognition and use it daily. I’ve added an issue for us to incorporate p5.speech into Voice.

Voice currently uses AWS Polly for TTS. But AWS Polly doesn’t have a Sinhala voice. Google does have a Sinhala voice, so we may be able to offer Voice listeners Sinhala as well in the future via p5.speech.

Oh, and if you could wrap up your wonderful image library as an npm library, then we could add it to Voice as well.

6 Likes

Yes, that would be wonderful.

Please feel free to copy them. They are not mine, but they have a CC license. For now I’m only using the Jetavana image, which I have attributed with a link in the readme. (The person who took the photo used the tag ‘myself’, but it’s not me. I guess the pun was intended.)

BTW, I’ve rewritten my code to make use of the bilara JSON files. So much easier to work with. :smiling_face_with_three_hearts:
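
For readers who want to try the same thing, here is a minimal sketch of reading bilara-style JSON files in Python. It assumes each file is a flat JSON object mapping segment keys such as `mn21:1.1` to text, and that translation files follow the `*_translation-en-sujato.json` naming seen elsewhere in this thread. The function names `load_segments` and `load_tree` are made up for illustration and are not part of any SuttaCentral tool.

```python
import json
import tempfile
from pathlib import Path


def load_segments(path):
    """Read one bilara-style JSON file into a dict of segment key -> text."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_tree(root_dir, pattern="*_translation-en-sujato.json"):
    """Merge every matching file below root_dir into a single segment dict."""
    segments = {}
    for path in sorted(Path(root_dir).rglob(pattern)):
        segments.update(load_segments(path))
    return segments


# Demo on a throwaway directory, so the sketch runs without a clone of bilara-data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "mn" / "mn21_translation-en-sujato.json"
    sample.parent.mkdir(parents=True)
    sample.write_text(json.dumps({"mn21:1.1": "So I have heard. "}), encoding="utf-8")
    demo = load_tree(tmp)

print(demo["mn21:1.1"])
```

With a local fork of bilara-data, `load_tree` would be pointed at the translation directory instead of the temporary one.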

4 Likes

The real problem with your words being stored in a cloud is not so much that someone might listen to what you say. It is rather that training voice recognition systems will make it possible to use people’s voices to create fake recordings of things they never said and never would say. Yet it is their voice, and it sounds 100% authentic.

This is why I am a bit apprehensive of using this sort of technology, although I also clearly see the advantages it has, especially to remove accessibility obstacles.

3 Likes

Indeed.

But even with bilara-data, libraries can be helpful. Voice uses scv-bilara for all operations on bilara-data. It even has a search that searches multiple languages and provides a relevance score. SC itself uses ArangoDB for searching stuff, so that is another approach. If you have any questions, just let us know…

Thanks for the pictures!

4 Likes

Hi Sabbamitta.

There are many valid concerns one can have with modern communication systems. For example, what happens to what we post on YouTube, or say in Zoom meetings, etc.? All of it contains valuable voice data, stored on a server, that can be used to train a machine-learning system.

Thanks, I agree.

3 Likes

Hi Karl,

Much mudita for all your work on Voice, and I’m impressed by the search tool in scv-bilara.
Given all the wonderful new developments of SC, I am very humbled to give you an update from my end.

I have written a Python script to implement a search of bilara for my own local study and for getting an idea about how voice navigation could work in the future. The script is named text_analysis.py and here are the features:

  1. Searching for arbitrary English words or sentences in Ven. Sujato’s entire translation of the Tipiṭaka.

For example, the shell command

python3.7 text_analysis.py "the five grasping aggregates"

will return an exhaustive list of results, as follows:

A. Returning the verse where the search string has been found together with its key, as in the output from the above:

"an10.60:4.4": "And so they meditate observing impermanence in the five grasping aggregates. ",

B. Returning underneath the corresponding Pali verse from the root text of the above:

an10.60:4.4: Iti imesu pañcasu upādānakkhandhesu aniccānupassī viharati.

C. Returns a list of all sutta numbers where the search string has been found.

D. Returns an ordered list of suttas where the words have been found most frequently, as from the above example:

[(6, ‘sn22.122’), (6, ‘dn22’), (5, ‘sn22.89’), (5, ‘sn22.123’), (4, ‘sn22.82’), (4, ‘mn109’), (3, ‘sn22.48’), (3, ‘mn44’), (3, ‘mn149’), (3, ‘mn141’), (3, ‘mn10’), (2, ‘sn4.16’), (2, ‘sn22.100’), (2, ‘mn28’), (2, ‘mn23’), (2, ‘mn122’), (2, ‘an8.2’), (2, ‘an4.90’), (1, ‘sn56.11’), (1, ‘sn46.30’), (1, ‘sn45.178’), (1, ‘sn45.159’), (1, ‘sn35.245’), (1, ‘sn35.238’), (1, ‘sn22.79’), (1, ‘sn22.47’), (1, ‘sn22.105’), (1, ‘sn22.104’), (1, ‘sn22.103’), (1, ‘mn9’), (1, ‘mn75’), (1, ‘mn151’), (1, ‘mn112’), (1, ‘dn34’), (1, ‘dn33’), (1, ‘dn14’), (1, ‘an9.66’), (1, ‘an6.63’), (1, ‘an5.30’), (1, ‘an4.41’), (1, ‘an3.61’), (1, ‘an10.60’)]

E. Analyzes and orders output according to the frequency of the next following word after the search string:

[[‘’, 50], [‘are’, 11], [‘in’, 8], [‘I’m’, 3], [‘as’, 3], [‘for’, 3], [‘.’’, 1], [‘?’’, 1], [‘is’, 1], [‘that’, 1], [‘’’, 1]]

Another example for this feature: Finding the names of all Venerables and their frequency is easy:

python3.7 text_analysis.py "Venerable"

returns

[[‘Ānanda’, 303], [‘Sāriputta’, 207], [‘sir’, 105], [‘Mahāmoggallāna’, 63], [‘Anuruddha’, 49], [‘Mahākaccāna’, 32], [‘Rādha’, 29], [‘’, 25], [‘Udāyī’, 25], [‘Mahākassapa’, 24], [‘Channa’, 20], [‘Vaṅgīsa’, 19], [‘Bhāradvāja’, 16], [‘Rāhula’, 15], [‘s’, 15], [‘Mahākoṭṭhita’, 14], [‘Nārada’, 14], [‘Kassapa’, 13], [‘Bakkula’, 12], [‘Dhammika’, 11], [‘Mahācunda’, 11], [‘Samiddhi’, 11], [‘Upavāṇa’, 11], [‘Anurādha’, 9], [‘Phagguṇa’, 9], [‘Khemaka’, 8], [‘Meghiya’, 8], [‘Aṅgulimāla’, 7], [‘Isidatta’, 7], [‘Māluṅkyaputta’, 7], [‘Puṇṇa’, 7], [‘Bāhiya’, 6], [‘Mahākappina’, 6], [‘Nandaka’, 6], [‘Uttara’, 6], [‘Vakkali’, 6], [‘Bhaddiya’, 5], [‘Citta’, 5], [‘Kimbila’, 5], [‘Koṇḍañña’, 5], [‘Kāmabhū’, 5], [‘Raṭṭhapāla’, 5], [‘Sāriputta’s’, 5], [‘Upāli’, 5], [‘Assaji’, 4], [‘Bhūmija’, 4], [‘Mahaka’, 4], [‘Moggallāna’, 4], [‘Nanda’, 4], [‘Nāgita’, 4], [‘Uttiya’, 4], [‘Visākha’, 4], [‘Abhiya’, 3], [‘Bhadda’, 3], [‘Bhaddāli’, 3], [‘Gavampati’, 3], [‘Girimānanda’, 3], [‘Godhika’, 3], [‘Gotama’, 3], [‘Kassapagotta’, 3], [‘Lomasakaṅgiya’, 3], [‘Migajāla’, 3], [‘Nāgadatta’, 3], [‘Nāgasamāla’, 3], [‘Phagguna’, 3], [‘Piṇḍola’, 3], [‘Pukkusāti’, 3], [‘Soṇa’, 3], [‘Upasena’, 3], [‘Vacchagotta’, 3], [‘Bhaddaji’, 2], [‘Brahmadeva’, 2], [‘Candikāputta’, 2], [‘Cundaka’, 2], [‘Gavesī’, 2], [‘Godatta’, 2], [‘Kappa’, 2], [‘Khema’, 2], [‘Kosiya’, 2], [‘Lomasavaṅgīsa’, 2], [‘Mahāmoggallāna’s’, 2], [‘Māgaṇḍiya’, 2], [‘Nanda—who’, 2], [‘Pilindavaccha’, 2], [‘Puṇṇiya’, 2], [‘Revata’, 2], [‘Sandha’, 2], [‘Saviṭṭha’, 2], [‘Saṅgāmaji’, 2], [‘Sela’, 2], [‘Seniya’, 2], [‘Subhadda’, 2], [‘Subhūti’, 2], [‘Surādha’, 2], [‘Susīma’, 2], [‘Tissa’, 2], [‘Udena’, 2], [‘Vidhura’, 2], [‘Yasoja’, 2], [‘Ambaṭṭha’, 1], [‘Anuruddha’s’, 1], [‘Ariṭṭha’, 1], [‘Bhagu’, 1], [‘Brahmadeva’s’, 1], [‘Bāhuna’, 1], [‘Cetaka’, 1], [‘Cūḷapanthaka’, 1], [‘Dabba’, 1], [‘Dāsaka’, 1], [‘Gotama.’’, 1], [‘Gotama?’’, 1], [‘Isidāsī’, 1], [‘Kaccāna’, 1], [‘Kaccānagotta’, 1], [‘Lakkhaṇa’, 1], [‘Mahākaccāna’s’, 1], [‘Mogharāja’, 1], [‘Musila’, 1], [‘Māluṅkya’, 1], [‘Nanda—the’, 1], [‘Nigrodhakappa’, 1], [‘Raṭṭhapāla’s’, 1], [‘Sabhiya’, 1], [‘Sañjīva’, 1], [‘Senior’, 1], [‘Sujāta’, 1], [‘Tissa—the’, 1], [‘Tāḷapuṭa’, 1]]

:pray:

  2. Furthermore, it returns all verses associated with a key, for example:

python3.7 text_analysis.py "mn21"

"mn21:0.1": "Middle Discourses 21 ",

mn21:0.1: Majjhima Nikāya 21

"mn21:0.2": "The Simile of the Saw ",

mn21:0.2: Kakacūpamasutta

"mn21:1.1": "So I have heard. ",

mn21:1.1: Evaṁ me sutaṁ—

and so on for the rest of the sutta.

  3. Returns a specific verse from a key:

python3.7 text_analysis.py "dn1:0.2"

"dn1:0.2": "The Prime Net ",

dn1:0.2: Brahmajālasutta
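
The three features above can be sketched roughly as follows. This is not the code of text_analysis.py, only an illustration under a few assumptions: segments are already loaded into dicts keyed like `an10.60:4.4`, matching is a plain case-sensitive substring test, and the ranking counts matching segments per sutta. The helper names (`sutta_id`, `search`, `rank_suttas`, `lookup`) are hypothetical, and the ranking is returned as (sutta, count) pairs rather than the (count, sutta) pairs shown in D.

```python
from collections import Counter


def sutta_id(key):
    """'an10.60:4.4' -> 'an10.60' (everything before the colon)."""
    return key.split(":", 1)[0]


def search(segments, phrase):
    """Return (key, text) pairs whose text contains phrase (case-sensitive)."""
    return [(key, text) for key, text in segments.items() if phrase in text]


def rank_suttas(hits):
    """Count matching segments per sutta, most frequent first."""
    return Counter(sutta_id(key) for key, _ in hits).most_common()


def lookup(segments, key):
    """One segment if key contains ':', otherwise every segment of that sutta."""
    if ":" in key:
        return {key: segments[key]} if key in segments else {}
    return {k: v for k, v in segments.items() if sutta_id(k) == key}


# Sample segments copied from the outputs quoted in this post.
translation = {
    "an10.60:4.4": "And so they meditate observing impermanence in the five grasping aggregates. ",
    "mn21:0.1": "Middle Discourses 21 ",
    "mn21:0.2": "The Simile of the Saw ",
    "mn21:1.1": "So I have heard. ",
}
root = {
    "an10.60:4.4": "Iti imesu pañcasu upādānakkhandhesu aniccānupassī viharati.",
}

hits = search(translation, "the five grasping aggregates")
for key, text in hits:
    print(f"{key}: {text}")
    # Print the matching Pali segment underneath, if it has been loaded.
    print(f"{key}: {root.get(key, '(no root text loaded)')}")
print(rank_suttas(hits))
print(lookup(translation, "mn21"))
```

Because translation and root text share the same segment keys, pairing a hit with its Pali verse is a single dictionary lookup.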

My idea is to let the program speak at least some of those search results given a voice nav input.

Finally, sadhu to everyone for making this resource available.
:smiling_face_with_three_hearts:

5 Likes

It’s so fascinating that Ananda is mentioned more than Sariputta and Mahamoggallana combined!

Sadly, my own favorite Dhammadinna is not in the list. :cry:

Congratulations! You have seen and implemented the heart of scv-bilara in Python!

And as you’ve surmised, the key may actually be the use of recurring phrases such as “the five grasping aggregates”. These recurring phrases bind the Dhamma together and allow us to bridge meaning, giving us paths to explore the Dhamma. If we memorized these phrases ourselves then all we would need to do is speak a phrase to review and study the Dhamma.

What do you think?

1 Like

One has to be careful when interpreting the search results. Ven. Dhammadinnā is missing from the list because her name apparently never follows the word ‘Venerable’; she is always referred to as ‘bhikkhuni’, which Ajahn Sujato translates as ‘nun’. Therefore, if one searches for the word ‘ nun ’ (note the spaces), one finds the names of all bhikkhunis who are mentioned this way in Ajahn’s translation:

[[‘who’, 37], [‘with’, 28], [‘disciples’, 12], [‘named’, 8], [‘Khemā’, 7], [‘Kajaṅgalikā’, 5], [‘Kisāgotamī’, 5], [‘’, 4], [‘Dhammadinnā’, 4], [‘Selā’, 4], [‘Somā’, 4], [‘Uppalavaṇṇā’, 4], [‘Vajirā’, 4], [‘Vijayā’, 4], [‘lives’, 4], [‘Āḷavikā’, 4], [‘Nandā’, 3], [‘Thullanandā’, 3], [‘flying’, 3], [‘Cālā’, 2], [‘Sukkā’, 2], [‘Sīsupacālā’, 2], [‘Thullatissā’, 2], [‘Upacālā’, 2], [‘has’, 2], [‘should’, 2], [‘would’, 2], [‘Cīrā’, 1], [‘Jaṭilagāhikā’, 1], [‘Jentā’, 1], [‘Muttā’, 1], [‘Puṇṇā’, 1], [‘Subhā’, 1], [‘Sumedhā’, 1], [‘accomplished’, 1], [‘addressed’, 1], [‘asked’, 1], [‘disciple’, 1], [‘in’, 1], [‘or’, 1], [‘out’, 1], [‘rose’, 1], [‘saw’, 1], [‘she’, 1], [‘was’, 1]]

If one searches for ‘nun named’ one finds also Ven. Asokā.

:pray:

I suppose one could implement certain filters (e.g. stop words) to refine the searches.
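
A rough sketch of such a filter, for anyone who wants to experiment: count the word that follows the search phrase and drop anything in a stop-word set. The stop-word set here is picked by hand from the output above, the function name `next_words` is made up, and the sample strings are toy text rather than quotations from the translation.

```python
import re
from collections import Counter

# Illustrative only: a real stop-word list would need editorial care.
STOP_WORDS = {"who", "with", "has", "should", "would", "in", "or", "out", "was", "she"}


def next_words(texts, phrase, stop_words=frozenset()):
    """Frequency of the word directly after phrase, skipping stop words."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\s+(\S+)")
    counts = Counter()
    for text in texts:
        for word in pattern.findall(text):
            # Trim surrounding punctuation so 'Khemā.' and 'Khemā' count together.
            word = word.strip(".,;:!?\"'“”‘’")
            if word and word.lower() not in stop_words:
                counts[word] += 1
    return counts.most_common()


# Toy strings, not quotations from any translation.
texts = [
    "the nun Khemā spoke",
    "the nun Dhammadinnā replied",
    "a nun who lives there",
    "then the nun Khemā.",
]

print(next_words(texts, "nun"))
print(next_words(texts, "nun", STOP_WORDS))
```

The second call leaves only the names, at the cost of hiding any phrase that starts with a filtered word.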

That is the idea that I have in mind.

2 Likes

Sorry, I forgot to mention that if anyone on this forum, especially the Venerables, finds any error or mistake in what I discussed here with Karl, please point it out to me. Thank you.
:pray:

1 Like

Excellent.

We are currently generating this list by hand for all segmented languages. If an automated way existed to pre-populate lists of key phrases, it would be very helpful to Voice and similar applications.

For such automation, consistency of translation is absolutely critical. For this reason, we have restricted our own searches to the segmented translations maintained with Bilara. Bilara suggests previously used phrases for translators. In this manner, Bilara itself encourages the translator’s use of key phrases.

:pray:

1 Like

I’d be happy to think about this.
One way to start might be to take one of the great summaries of the teaching, e.g. MN117 or MN44 that you mentioned earlier, and compile a list of key words from there.
This sounds a bit like you are creating a concordance. Is this the next step of development?

2 Likes

It’s not exactly a concordance, which is word-based.

Instead, we’re developing lists of examples that resolve to at most about 20-30 suttas when using ripgrep. Ripgrep is so fast that there’s no need for a concordance. We simply have a list of examples to present to the user (as you’ve seen in Voice).

Consider “root of suffering”, which occurs in only 8 documents (suttas:7, vinaya:1):

translation/en/sujato/sutta/dn/dn16_translation-en-sujato.json:1
translation/en/sujato/sutta/mn/mn105_translation-en-sujato.json:3
translation/en/sujato/sutta/mn/mn1_translation-en-sujato.json:2
translation/en/sujato/sutta/mn/mn66_translation-en-sujato.json:1
translation/en/sujato/sutta/mn/mn116_translation-en-sujato.json:1
translation/en/brahmali/vinaya/pli-tv-kd/pli-tv-kd6_translation-en-brahmali.json:1
translation/en/sujato/sutta/sn/sn42/sn42.11_translation-en-sujato.json:5
translation/en/sujato/sutta/sn/sn56/sn56.21_translation-en-sujato.json:1

Examples are generated by trial and error. As we study and translate the suttas, we notice key phrases. Then we verify their specificity using ripgrep. Examples found in far fewer than 50 suttas are a good starting point for human editorial review. So the open research issue that suggests itself is this:

  • Is it possible to automate the discovery of new examples? :wink:

We have found examples to be absolutely critical as a mnemonic basis for studying the suttas. People have trouble remembering sutta identifiers. But everybody remembers the root of suffering.
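
The specificity check described above can be approximated in Python for experimentation. This is only a sketch, not how Voice or scv-bilara does it: it counts matching segments per document, which resembles the per-file counts in the ripgrep listing above, and accepts a phrase when it occurs in at least one and at most `max_documents` documents. The names `document_counts` and `is_specific`, the default threshold of 30, and the toy documents are all assumptions.

```python
def document_counts(documents, phrase):
    """Map document name -> number of segments that contain phrase."""
    counts = {}
    for name, segments in documents.items():
        matches = sum(1 for text in segments.values() if phrase in text)
        if matches:
            counts[name] = matches
    return counts


def is_specific(documents, phrase, max_documents=30):
    """True if phrase occurs somewhere, but in no more than max_documents documents."""
    found = len(document_counts(documents, phrase))
    return 0 < found <= max_documents


# Toy documents, not real sutta text.
documents = {
    "doc_a": {"a:1.1": "the root of suffering", "a:1.2": "root of suffering again"},
    "doc_b": {"b:1.1": "at the root of a tree"},
    "doc_c": {"c:1.1": "the root of suffering"},
}

print(document_counts(documents, "root of suffering"))
print(is_specific(documents, "root of suffering"))
print(is_specific(documents, "root of", max_documents=2))
```

A candidate that passes this check would still go to human editorial review; the count only narrows the field.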

3 Likes

I am not so convinced. I suppose a human review will always be required. :thinking:

1 Like

Yes. And we need help finding examples. I believe that Georg may be able to help us.

1 Like

Now I understand. It is possible to get the next word’s frequency with my script. My intention was to create something like a tree from a single word or phrase, with auto-completion in mind. For example, a search for ‘root’ leads to ‘root of’ (234 matches), which then branches out to ‘root of suffering’ (14 matches), ‘root of arguments’ (12 matches), ‘root of misery’ (8 matches), ‘root of craving’ (4 matches), and ‘root of rebirth’ (2 matches), but also to the much more frequent phrase ‘root of a’ (162 matches, as in ‘root of a tree’). The key is to make informed decisions about what to filter and when. If one filters out the word ‘of’ too early, the tree behind ‘root of’ is never found. I have seen libraries for detecting parts of speech in text (noun, verb, etc. for the English language) that might be useful here.
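
To make the tree idea concrete, here is a small sketch that grows a phrase one word at a time and records how often each continuation occurs. It is an illustration under assumptions, not the script discussed in this thread: the names `continuations` and `grow_tree` are invented, words are matched with a simple regular expression, and the sample texts are toy strings.

```python
import re
from collections import Counter


def continuations(texts, prefix):
    """Count the word that directly follows prefix across all texts."""
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\s+(\w+)")
    counts = Counter()
    for text in texts:
        counts.update(pattern.findall(text))
    return counts


def grow_tree(texts, prefix, depth=2, min_count=1):
    """Nested dict: each longer phrase maps to its count and its own subtree."""
    if depth == 0:
        return {}
    tree = {}
    for word, count in continuations(texts, prefix).most_common():
        if count >= min_count:
            phrase = prefix + " " + word
            tree[phrase] = {
                "count": count,
                "next": grow_tree(texts, phrase, depth - 1, min_count),
            }
    return tree


# Toy strings, not quotations from any translation.
texts = [
    "the root of suffering",
    "the root of suffering is craving",
    "at the root of a tree",
    "the root of arguments",
]

tree = grow_tree(texts, "root", depth=2)
print(tree["root of"]["count"])
print(list(tree["root of"]["next"]))
```

Note that nothing is filtered while the tree is being grown: ‘of’ stays in as an intermediate node, and any stop-word or frequency filter is applied only to the finished branches, for example by raising `min_count`.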

2 Likes