SuttaCentral

On the cool bits of 84000.co that we might shamelessly (or respectfully) copy


#1

The 84000.co project for translating Tibetan Buddhist texts is in many ways a sister project to SC. It shares the aim of making Buddhist scriptures available in both original text and translation. The scope is different, of course, as it focuses on the Tibetan texts, most of which are either Abhidharma or Mahayana. There are some EBT translations, though, so there’s always some overlap!

But here I am having a look at some of their textual features to see what they’re doing, and to consider whether any of them might be worth putting on a 2-do list for SC. Word to the wise: don’t get your hopes up! There are plenty of things on our 2-do list, and only so many of us.

First thing is, their presentation is closer to a traditional Western book, whereas ours is closer to a traditional Buddhist manuscript. That is, they encase their translations in forewords, a table of contents, notes, and the like, none of which are found in traditional editions. We just present the text, and prefer to keep all other matter in the background, since traditionally a manuscript would contain only the text itself.

In line with this philosophical difference in approach, they display the reference numbers by default, like a book, whereas we hide them. As to implementation, they use a system where all the reference numbers have the same class, and are identified with a unique ID. I am not sure of the exact system, but the ID appears to be arbitrary, and does not correspond with the displayed number. So we have something like:

<a class="milestone from-tei" title="Bookmark this section" href="#UT22084-031-002-125" id="UT22084-031-002-125">1.3</a>

I’m also not sure whether the numbers correspond to any previous system, or whether they were simply introduced by 84000. Presumably these details are explained somewhere, but they’re not evident to me. In addition, they have other numbers wrapped in a ref class, which do correspond to earlier editions, but these are presented as plain text.

What this system does emphasize is that you can click on a link and “bookmark” it. The bookmarking system relies on setting cookies on your computer, which they politely tell you. Thank you! Once bookmarked, the links you have selected are available from a bar on the side. This system is handy, but it’s also fragile. If you delete your cookies, or use a different browser or device, the bookmarks are gone.

The SC system aims to preserve continuity with previous reference systems, and to enable multiple overlapping hierarchies if they apply. Thus we need to identify each type of reference, both for ourselves and for the user. A typical SC reference looks like this:

<a class="nya textual-info-paragraph" id="nya3" title="Nyanamoli section number." href="#nya3">3</a>

Thus the emphasis of the title is not to advertise a functionality, but to identify the kind of reference. However, if you click on any of these references, the ID is added to the URL in your address bar, and you can simply tell your browser to bookmark that page. You can save as many of these as you like.

Browser bookmarks aren’t sexy, but they are a robust and time-tested technology. They have some crucial advantages over the cookie-based approach: there is zero (0 as in none) development time, and you can easily share them between browsers and devices. Also, there are plenty of bookmark add-ons for browsers that let you organize them in various ways, whereas a site-based implementation offers only one way of doing it.

A cardinal rule in Sujato’s Theory of Web Development is:

Thou shalt not reduplicate functionality in thy website that is already present in thy browser.

So it seems to me that, while it appears like a nice bit of tech, the bookmarking feature adds little functionality, and introduces brittleness and complexity. Another issue to bear in mind is that almost all the texts on 84000 are very long, while most of the texts on SC are quite short. Obviously there’s not so much need for bookmarking in short texts.


Another nifty feature in 84000 is the terminology lookup. This is a little akin to the Pali or Chinese lookups we have for original language texts, but here it is applied to terms in the translated texts.

The implementation is nice and discreet: by default, no terms are highlighted, but when you hover over a term, a subtle underline appears, and you can click to show the “glossary”. The glossary is not pre-loaded, which is good, as it avoids downloading excess data. You get a drawer from the bottom of the page, which contains the English translation, Tibetan, Sanskrit, and a definition of the term. It also gives you a list of “passages that contain this term”.

To take the last feature first, what this appears to do is to list the various occurrences of the term in question, but only counts them once per “passage”, where “passage” is presumably one numbered section. This is therefore pretty similar to simply using your “find” function in a browser, except that it eliminates duplicates, which is presumably because of the large number of repetitions. Might this come in handy? I don’t know, I’m struggling to think of a case where I’d want it. It seems, again, to add only a little to a basic browser function.

Okay, so back to the glossary. This is implemented with the following markup:

<a href="#UT22084-031-002-3679" class="glossary-link pop-up">engage in union</a>

Once again, it appears as if they use an arbitrary ID, and a class to get the behavior. Presumably the IDs are mapped onto the terms somehow; we’d have to look closer to see how that is implemented. But this system can be nice in that if different texts use a different underlying Tibetan term, or a different English rendering, they can all simply be assigned the same arbitrary ID, and the term is uniquely tagged.

We have considered implementing something similar for SC, but it’s just so complex. 84000 deals with a much smaller range of source texts, and a single target language. Also, all the texts are, so far as I know, produced “in house”, so they can control the terminology at least to some degree. I’m not sure if the list of terms applies across all their translations, or on a per-translation basis.

How might we do something like that for SC? The classical approach would be to hand-code the terminology in the text files. This is technically simple, but a lot of work, and it’s really hard to implement across languages.

Another way might be to do it in the front end, leveraging the special qualities of our segmented texts.

The big problem in automating this process is the inherently loose nature of natural language. How can we determine that a given translation word or phrase is, in fact, representing a given underlying text?

Since we have (relatively) consistent matching of terminology between text and translation, perhaps we could achieve this with a double match: a terminology link is applied only where a given term appears both in the translation and in the original text, as determined by a pre-existing list of terms.

So, for example, assume we have a segment where the word “form” appears as a translation of the Pali word rūpa. The front-end parser would check that segment, find the word “form”, then consult a terminology list. There, the word “form” is given as a rendering of the Pali word rūpa, so it would go back and check the matching original text segment for that word. If it’s there, hooray!

Since we match both text and translation, it’s likely that there will be hardly any false positives. It may happen sometimes, but hey, no-one’s perfect. On the other hand, we are more likely to omit some matches, as for example if the text and translation segments don’t entirely agree. That’s okay, it’s not so important to get every single instance, so long as there are not many false positives.

Since this scanning has to be done only on a per-segment basis, it will (hopefully) not be too resource-hungry.

Once the identification has been made, we can create a span to wrap the term, and supply a glossary entry to define it.
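To make the idea concrete, here is a minimal sketch of the double match in JavaScript. Everything here is hypothetical, not actual SC code: the `terminology` list, the function name, and the `span` markup are all invented for illustration.

```javascript
// Hypothetical terminology list: translation term -> Pali term + gloss.
const terminology = {
  form: { pali: "rūpa", gloss: "material form, the first aggregate" },
};

// Check one segment pair: only wrap a term when it appears in the
// translation AND its Pali counterpart appears in the root segment.
function markTerms(rootSegment, translationSegment) {
  let html = translationSegment;
  for (const [term, entry] of Object.entries(terminology)) {
    const inTranslation = new RegExp(`\\b${term}\\b`, "i").test(translationSegment);
    const inRoot = rootSegment.toLowerCase().includes(entry.pali.toLowerCase());
    if (inTranslation && inRoot) {
      // Wrap the first occurrence in a span carrying the Pali term.
      html = html.replace(
        new RegExp(`\\b(${term})\\b`, "i"),
        `<span class="term" title="${entry.pali}">$1</span>`
      );
    }
  }
  return html;
}
```

If the root segment lacks the Pali term, the translation passes through untouched, which is exactly the false-positive protection described above.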

This approach has a number of advantages. It keeps the text markup clean, not cluttering it with endless arrays of spans. And it is relatively easy to adapt to other languages: all you need is a list of terms with text and translation. Say there are 500 terms to start with. For each language, you simply provide the 500 terms, and hopefully a brief definition for each. Once we have that, we can apply the same process to that language, so long as it has segmented translations.

Of course, this approach would not work with the legacy translations. But that is okay: going forward, I want to focus our efforts on the segmented texts, precisely because they offer this kind of promise.


#2

Certainly the “context feature” that pops up within the window is the most useful feature on the site.

Even in shorter suttas, being able to click on terms, names, or places and get definitions, references, and so on is awesome. There are multiple little tweaks that could be done to enhance that feature. Maps would be neat as well, if possible. Also, audio pronunciation would be a beloved feature for new students coming to the teachings.

A repeat of a previous proposal, just keeping the conversation up-to-date.


#3

If the place names are tagged in the text, getting a map is easy. It’s preparing the data that’s hard!

I agree, for Pali/Sanskrit words this would be very cool. It should be possible to implement this on the front end, I think. But we’ll see!


#4

You can have a JSON file with all the terms (like placenames) and their respective map references, and then simply tell it to scan for those placenames and wrap each one in an <a> tag linking to the map reference or dictionary entry. We do the same already for dictionary words in the lookup: when you click on a lookup word, it takes you to the dictionary entry for that word.
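As a rough sketch of what that scan could look like (the placename data and URLs here are made up, and real Pali inflection handling would need more care):

```javascript
// Hypothetical placename table, as might be loaded from a JSON file.
const placenames = {
  campā: { map: "/map#campa" },
  rājagaha: { map: "/map#rajagaha" },
};

// Wrap each known placename (stem plus any trailing inflectional
// ending, e.g. campāyaṃ) in an <a> tag pointing at its map entry.
function linkPlacenames(text) {
  for (const [name, entry] of Object.entries(placenames)) {
    const re = new RegExp(`(${name}[a-zāīūṃṁ]*)`, "gi");
    text = text.replace(re, `<a href="${entry.map}">$1</a>`);
  }
  return text;
}
```

Matching on the stem is what lets an inflected form like campāyaṃ resolve to campā’s map entry.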

I tried this out with Pali words like rājagahaṃ, but the lookup divides it into rāja and gaha and refers to those words separately. When you look up campāyaṃ, however, it correctly refers to the map reference for campā.

In any case, this should be fairly easy to implement in the front end for placenames.

This would work on both legacy translations and segmented texts. You could also extend it to specific words, with an array of possible translations for each word, e.g. ["form", "forma", "objekt"]. However, I see a problem here with various languages. For instance, the Norwegian translation of “form” is “objekt”, but the word “objekt” has a different meaning in German and Dutch.
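One way around the cross-language collision would be to key the renderings by language code, so each list is only consulted for its own language. A small sketch (the data and function are invented for illustration):

```javascript
// Hypothetical per-language rendering lists for each Pali term.
// Keying by language code keeps "objekt" (Norwegian) from colliding
// with the unrelated German/Dutch word "objekt".
const renderings = {
  rūpa: { en: ["form"], no: ["form", "objekt"], de: ["Form"] },
};

// Find which Pali term (if any) a rendering stands for, consulting
// only the list for the requested language.
function paliTermFor(word, lang) {
  for (const [pali, byLang] of Object.entries(renderings)) {
    const terms = byLang[lang] || [];
    if (terms.some((t) => t.toLowerCase() === word.toLowerCase())) {
      return pali;
    }
  }
  return null;
}
```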


#5

Right, for place names and other proper names in Pali, we should be able to implement something for all texts. There are, of course, complications, as the spelling of proper names is quite inconsistent, so we’d need to create a set of aliases. In addition, the same name is sometimes used for different people, so I guess we’d have to simply list all the people of that name.
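The alias set could be as simple as a table mapping each canonical headword to its variant spellings. A sketch, with the spellings invented for illustration:

```javascript
// Hypothetical alias table: canonical dictionary headword -> the
// variant spellings and inflected forms found in the texts.
const aliases = {
  rājagaha: ["rājagaha", "rājagahaṃ", "rājagahe", "rajagaha"],
  sāvatthī: ["sāvatthī", "sāvatthi", "sāvatthiyaṃ"],
};

// Resolve any spelling back to its canonical headword, or null if
// the name is unknown.
function canonicalName(spelling) {
  const needle = spelling.toLowerCase();
  for (const [canonical, forms] of Object.entries(aliases)) {
    if (forms.includes(needle)) return canonical;
  }
  return null;
}
```

For names shared by several people, the canonical headword would then point at a dictionary entry listing all of them.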

The problem I foresee is when trying to identify terms, not in Pali/Sanskrit, but in the translation language. Words are simply used in a variety of different senses, and it will, I fear, prove virtually impossible to identify technical terms reliably. This is especially the case when multiple translators (or even the same translator) use different renderings of a technical term in the same language, or when different terms are rendered with the same word. That is to say, we will frequently face both one term to many translations, and many terms to one translation.

This is why I suggest, for the technical terms, we restrict it to segmented texts only, and do a double check in the original text and translation to ensure that the correct technical term is being identified. This is the only way I can see to avoid false positives.


#6

Yes, this is simple: just put all the possible spellings in an array.

That’s no problem, because all the different people are mentioned in the same dictionary entry. See for instance: https://suttacentral.net/define/vimala

Yes, this will be easiest.

I suggest making a GitHub issue of this (or two: one for places/people as in the DPPN, and one for technical terms). Maybe somebody would be interested in making a list of placenames/peoplenames together with their possible aliases in the different texts. We already have the list; all we need is the aliases as they are used in the various texts, as well as their different spellings (like rājagahaṃ above). Then I can do the JS, which should be fairly easy.

(tagging in @Aminah on this)


#7

I’ve made tickets for each of the mentioned lists and put them under the larger epic of a new “Terminology lookup” feature.

It might be good if one of you filled in the “[Clear, concise description of feature details to be edited in]” placeholder I’ve left (with perhaps bulleted specs, or in any case a short summary), defining what this whole feature should look like when it can, roughly speaking, be considered done. (Naturally, more individual issues can be added to the epic as we go along.)


#8

Cheers, thanks to both of you.