Searching translations using parallels

sujato · December 16, 2016, 12:51am

In the next-g translations by @Brahmali and myself, while we have similar aims, the translations of corresponding passages are not going to be the same. Hopefully we can review passages and harmonize them to some degree, but at the end of the day they are distinct translations, with different scopes. For example, in the Vinaya it is essential to use gendered terms such as “monk” and “nun”, whereas these can usually be avoided in the suttas.

I’m not worried about these differences as such, but it would be nice to be able to search across the two translations.

So if someone, say, encounters a phrase in a Sutta text, they should be able to search it across the corpus and get similar phrases. I’ve used “stilt longhouse” for pāsāda, for example. Suppose Brahmali uses “MacMansion”. I mean, let’s face it, it’s not too bad as a rendering, right? So if I search for “stilt longhouse” I won’t get any hits, even though there is useful info in the Vinaya on the topic. Sure, a Pali scholar will search the original text, but that’s not for most people.

I’m wondering if we could include the Sutta translation as an alias of the Vinaya translation, and vice versa, for search purposes. The relevant parallels will be marked in our data. So the search results will effectively pull relevant parallels, even if the translations don’t actually include the term in question.

@blake, @vimala, what do you think? Sound like a good idea?

Vimala · December 16, 2016, 7:35am

Using Standoff, it should be possible to have this in the long run, i.e. just as an alternative translation of part of the text.

blake · December 17, 2016, 4:32pm

In general, equivalent terms are a very standard thing in full-text search engines; it just requires manually generating synonyms. For example, with Elasticsearch right now this table of synonyms is used:

    "synonyms": [
        "bhikkhu,bhiksu,bhikksu,biksu,monk => bhikkhu",
        "bhikkhuni,bhiksuni,bhikksuni,biksuni,nun => bhikkhuni",
        "dhamma,dharma => dhamma",
        "kamma,karma => kamma",
        "nibbana,nirvana => nibbana"
    ]

Elasticsearch actually provides multiple ways of handling synonyms. It can get pretty twisty when multiple words are involved, for example if you want to make “dunny” (one word) a synonym for “stilt longhouse” (two words), but there are strategies for handling that described in the Elasticsearch docs. Of course, synonyms are not only about what words are used to translate a term, but also about what words users might use when searching.
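
To make the "a,b => c" rule format concrete, here is a minimal Python sketch of what such rules do to a piece of text. It is not how Elasticsearch implements synonyms internally; it only illustrates the mapping idea, including a multi-word entry. The "dunny" rule and the names `parse_rules` and `normalize` are assumptions invented for the example.

```python
import re

# Rules in the same "a,b => c" style as the table above.
# The "dunny" rule is hypothetical, added only to show a multi-word variant.
RULES = [
    "bhikkhu,bhiksu,bhikksu,biksu,monk => bhikkhu",
    "dhamma,dharma => dhamma",
    "dunny,stilt longhouse => stilt_longhouse",
]


def parse_rules(rules):
    """Turn 'a,b => c' rules into a {variant: canonical} dict."""
    mapping = {}
    for rule in rules:
        left, right = rule.split("=>")
        canonical = right.strip()
        for variant in left.split(","):
            mapping[variant.strip().lower()] = canonical
    return mapping


def normalize(text, mapping):
    """Lowercase the text and replace each known variant with its canonical form."""
    # Try longer variants first so a multi-word phrase is matched as a whole.
    variants = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text.lower())


mapping = parse_rules(RULES)
print(normalize("The monk entered the stilt longhouse", mapping))
print(normalize("Where is the dunny?", mapping))
```

If both the indexed text and the query are passed through the same normalization, a search for either variant lands on the same canonical token. The cost is the one already noted: someone has to write the rules by hand.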

sujato · December 18, 2016, 12:01am

I’m not thinking of equivalent terms as such, since generating the lists would be a major undertaking.

I’m thinking more of a process of inference.

Search for “longhouse”. This returns my translations, but not Brahmali’s, which use “MacMansion” for the same term. However, “longhouse” is found in a passage that is listed as a parallel with one of Brahmali’s translations. So the search results display the relevant text of Brahmali as a possible result.

Obviously this is far from perfect. The two parallels are not identical, so they may not, in fact, share the same search term. What the parallel term actually is may not be clear. There may be other passages that are not parallels that still include the term. And so on. Still, I was wondering whether this would be one, relatively easy, way to expand the scope of relevant hits.
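
The inference described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the text ids, the sample translations, the shape of the parallels table, and the function name `search_with_parallels` are all invented here and do not reflect the site's actual data or search code.

```python
# Hypothetical toy corpus: ids and wording are invented for illustration.
TEXTS = {
    "sutta-1": "He went up to the stilt longhouse.",
    "vinaya-1": "He went up to the MacMansion.",
    "vinaya-2": "The monk swept the yard.",
}

# Hypothetical parallels table: each id maps to the ids of its parallels.
PARALLELS = {
    "sutta-1": ["vinaya-1"],
    "vinaya-1": ["sutta-1"],
}


def search_with_parallels(term, texts, parallels):
    """Return (text_id, via) pairs.

    `via` is None for a direct hit, or the id of the directly matching
    text whose parallel pulled this result in.
    """
    term = term.lower()
    # Step 1: plain substring search over every translation.
    direct = [uid for uid, body in texts.items() if term in body.lower()]
    results = [(uid, None) for uid in direct]
    seen = set(direct)
    # Step 2: add parallels of the direct hits, even if they lack the term.
    for uid in direct:
        for other in parallels.get(uid, []):
            if other in texts and other not in seen:
                seen.add(other)
                results.append((other, uid))
    return results


for uid, via in search_with_parallels("longhouse", TEXTS, PARALLELS):
    label = "direct hit" if via is None else f"parallel of {via}"
    print(uid, "-", label)
```

Keeping the `via` field lets the results page mark inferred hits as "possible" rather than presenting them as exact matches, which fits the caveats above: the parallel passage may not actually contain a corresponding term.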

By the way, I came across this the other day. It is made of awesome: