Sorting translated texts by themes and characters

sabbamitta · December 4, 2020, 9:03am

This is similar to the idea we are developing with Voice.

We already have the “examples” feature, which links texts that share the same key phrase, like “root of suffering”. You can see our current list of example terms here. We are constantly adding new terms that we find interesting and that, ideally, lead to a reasonable number of results. There’s no point in adding “so have I heard” to the examples list.

Our idea is to list terms that lead to important doctrinal points or other interesting and distinctive features. One way of doing this is to use the similes that point to them. Try typing “water”, for example, and you’ll see, among others, these terms:

water overgrown with moss
water stirred by the wind
water that was cloudy, murky, and muddy
water that was heated by fire
water that was mixed with dye

which lead to Suttas that have the related similes for the five hindrances.

In the latest development steps we added a preview function: when you start typing “water”, as in this case, you can see which available phrases contain that word. Try typing “nun”, for example, and you’ll get this interesting choice of terms:

[Screenshot: preview list of example terms shown when typing “nun”]

(Note in particular the “pleasure of renunciation”!) Click on one of them and you’ll get a list of the search results that belong to this term.
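To make the preview idea concrete, here is a minimal sketch in JavaScript. It is not the actual Voice code: the function name `previewExamples` and the short phrase list are assumptions for illustration, and the matching is a plain case-insensitive substring test.

```javascript
// Hypothetical sketch, not the Voice implementation.
// A few example phrases taken from this post.
const examples = [
  "root of suffering",
  "pleasure of renunciation",
  "water overgrown with moss",
  "water stirred by the wind",
  "water that was heated by fire",
];

// Return every example phrase that contains the typed text (case-insensitive).
function previewExamples(typed, phrases = examples) {
  const needle = typed.trim().toLowerCase();
  if (needle === "") return []; // nothing typed yet, so nothing to preview
  return phrases.filter((phrase) => phrase.toLowerCase().includes(needle));
}

console.log(previewExamples("water")); // the three "water ..." phrases
console.log(previewExamples("nun"));   // [ 'pleasure of renunciation' ]
```

Because the match is on substrings rather than whole words, “nun” also finds “renunciation”, which is how a surprising term like “pleasure of renunciation” ends up in the preview.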

We still have some ideas for developing this further. For example, we’d like to put links into the Sutta texts themselves that highlight any term in that Sutta that occurs in the “examples” list. So if you see “pleasure of renunciation” highlighted in a Sutta, you can click on it and see a list of all Suttas that have this term. But that’s for the future …
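A rough sketch of how such highlighting could work, assuming the examples list is available as an array of phrases. The function names and the `/search?q=` URL pattern are invented for illustration; nothing here is taken from the Voice code base.

```javascript
// Hypothetical sketch; names and the URL pattern are assumptions.

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every occurrence of an example phrase in a link to its search results.
function linkExamples(segmentText, phrases) {
  if (phrases.length === 0) return segmentText;
  // Longest phrases first, so a longer phrase wins over a shorter one it contains.
  const sorted = [...phrases].sort((a, b) => b.length - a.length);
  const pattern = new RegExp(sorted.map(escapeRegExp).join("|"), "gi");
  return segmentText.replace(pattern, (match) => {
    const query = encodeURIComponent(match.toLowerCase());
    return `<a href="/search?q=${query}">${match}</a>`;
  });
}

console.log(
  linkExamples("For desire is the root of suffering.", ["root of suffering"])
);
```

The sketch does not HTML-escape the segment text itself; a real page would need to do that before inserting the links.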

All this is only possible, however, because we limit our search to segmented texts that are consistent in their terminology. Currently these are Bhante Sujato’s English translations and my own German ones, which are still gradually being created. For exactly this reason the Jatakas are currently outside the scope of Voice.

sujato · December 4, 2020, 9:47am

Right, that’s really great and we’ll be looking at this. The next step would be to separate the “keys” from the specific text so we can apply a key to any list of texts, regardless of whether they include that exact phrase. There’s a lot of manual work involved in that. However, if data already exist, they can be a start.

sabbamitta · December 4, 2020, 12:14pm

I am not exactly sure what you mean by this …

sujato · December 4, 2020, 10:45pm

Rather than relying on a specific wording, we can tag texts or passages by meaning. For example, if a passage has a simile of a quail, we can tag that, even if it doesn’t say “the simile of the quail”. Then, once the content is semantically marked like this, the tags can be applied across languages, so that any translation can be searched the same way. That’s the power of abstraction.
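As a sketch of what I mean, assuming tags are stored against segment IDs rather than against wording: the tag name below is made up, and the only data are the segment and translations quoted elsewhere in this thread. A quail-simile tag would work the same way.

```javascript
// Hypothetical sketch: tags point at segment IDs, not at any wording.
// The tag name is invented for illustration.
const tagIndex = new Map([
  ["theme:root-of-suffering", ["sn42.11:2.13"]],
]);

// Segment ID -> text, one map per language (texts as quoted in this thread).
const translations = {
  en: new Map([["sn42.11:2.13", "For desire is the root of suffering.’”"]]),
  de: new Map([["sn42.11:2.13", "Denn Sehnen ist die Wurzel des Leidens.‘“"]]),
};

// Look up a tag and return the tagged segments in the requested language.
function segmentsForTag(tag, lang) {
  const ids = tagIndex.get(tag) ?? [];
  const texts = translations[lang] ?? new Map();
  return ids
    .filter((id) => texts.has(id))
    .map((id) => ({ id, text: texts.get(id) }));
}

console.log(segmentsForTag("theme:root-of-suffering", "de"));
```

The German segment is found even though no search phrase was ever matched against the German text, because the tag is attached to the segment ID that all translations share.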

The thing is, it’s a lot of work, and we’d have to balance the effort, the cost of other things not done, and whether the process might be made obsolete by more advanced methods, such as neural nets like Buddhanexus.

sabbamitta · December 5, 2020, 10:00am

You are a true visionary, Bhante!

By contrast, my thoughts stay within a rather limited scope, following something I’ve seen you say elsewhere on this forum:

Life is short, Suttas are long. Note to self: don’t translate Jatakas!

I took this great piece of advice to mean, for me:

Life is short, Suttas are long. Don’t look too much at other projects, but stick to translating the Suttas!

I can’t help wandering a little astray every now and again, and in particular I can’t help being involved with Voice, as it has been deeply interlinked with my translation project from the very beginning. But what you propose seems huge, and, yes, so many other things would just not get done!

If anything, wouldn’t it still be more worthwhile, despite your “note to self” quoted above, to go for the Jatakas, and leave the “tagging passages by meaning” job to developments like Buddhanexus? I feel that would be the better way to get there. And it would be great to have Jataka translations that are consistent with the rest of the canon! (Even if I doubt that there would be enough time left for me to translate them too; but never mind, maybe someone else can take up what I will have to leave undone.)

Without visionaries like you, no new things would be developed. Without consistent workers (as I try to be), no work would be completed.

(I am not saying that you are not able to do consistent work; you’ve translated the canon, and that is pretty amazing! But you were able to do that within 2.5 years. For myself, I have to reckon with 10 years, if I am lucky. Keeping up consistent work over such a period of time is a somewhat different matter.)

sabbamitta · December 5, 2020, 1:10pm

Bhante—my goodness! Only now do I see that they did exactly this on the page you pointed to! That is incredible! How did they do this? You are so right that this is a huge amount of work. I am really impressed!

But the Jatakas are some 500 texts; the Suttas are some 4,000! We really need a Buddhanexus to get that done …

karl_lew · December 5, 2020, 5:37pm

Voice will rely on actual text from the translations. These are actually the examples in .helpers/examples. With trilingual search and segment correspondence we haven’t found a need for abstraction. For example, a search for “root of suffering” will actually return the associated German segments.

scv-bilara/scripts/js/search.js -l de -om3 -sl en root of suffering

SN42.11:2.13: For desire is the root of suffering.’”
SN42.11:2.13: Denn Sehnen ist die Wurzel des Leidens.‘“
SN42.11:2.17: Chando hi mūlaṁ dukkhassā’ti.
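The principle behind this can be sketched in a few lines of JavaScript. This is only an illustration, not the code in search.js: the function name and the in-memory data are assumptions, and only the English and German segments quoted above are included.

```javascript
// Hypothetical sketch of segment correspondence; not the search.js code.
// Each language maps the same segment IDs to its own text.
const segments = {
  en: { "sn42.11:2.13": "For desire is the root of suffering.’”" },
  de: { "sn42.11:2.13": "Denn Sehnen ist die Wurzel des Leidens.‘“" },
};

// Search for a phrase in one language, then fetch the matching
// segments in another language by their shared segment ID.
function searchWithCorrespondence(phrase, searchLang, showLang) {
  const needle = phrase.toLowerCase();
  return Object.entries(segments[searchLang] ?? {})
    .filter(([, text]) => text.toLowerCase().includes(needle))
    .map(([id, text]) => ({
      id,
      [searchLang]: text,
      [showLang]: segments[showLang]?.[id] ?? null, // null if not yet translated
    }));
}

console.log(searchWithCorrespondence("root of suffering", "en", "de"));
```

The phrase is only ever matched against the English text; the German text comes along because both translations use the same segment ID.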

sabbamitta · December 5, 2020, 5:39pm

But we have to admit, Karl, that it only works as long as we stay within the consistent segmented translations. It’s not something that can be applied to other texts.

karl_lew · December 5, 2020, 5:44pm

Quite right. And that is why I am absolutely enthusiastic about Bhante’s segmentation. Others may segment legacy text to match.