The 84000.co project for translating Tibetan Buddhist texts is in many ways a sister project to SC. It shares the aim of making texts and translations of Buddhist scriptures freely available. The scope is different, of course, as it focuses on the Tibetan texts, most of which are either Abhidharma or Mahayana. There are some EBT translations, though, so there’s always some overlap!
But here I am having a look at some of their textual features to see what they’re doing, and to consider whether any of them might be worth putting on a 2-do list for SC. Word to the wise: don’t get your hopes up! There are plenty of things on our 2-do list, and only so many of us.
First thing is, their presentation is closer to a traditional western book presentation, whereas ours is closer to a traditional Buddhist manuscript. That is, they encase their translations in Forewords, Table of Contents, notes, and the like, none of which are found in traditional editions. We just present the text, and prefer to keep all other matter in the background. Traditionally, of course, a manuscript would contain only the text itself.
In line with this philosophical difference in approach, they display the reference numbers by default, like a book, whereas we hide them. As to implementation, they use a system where all the reference numbers have the same class, and are identified with a unique ID. I am not sure of the exact system, but the ID appears to be arbitrary, and does not correspond with the displayed number. So we have something like:
<a class="milestone from-tei" title="Bookmark this section" href="#UT22084-031-002-125" id="UT22084-031-002-125">1.3</a>
I’m also not sure whether the numbers correspond to any previous system, or whether they are simply introduced by 84000. Presumably these details are explained somewhere, but they’re not evident to me. In addition, they have other numbers wrapped in a ref class, which do correspond to earlier editions, but are presented as simple plain text.
What this system does emphasize is that you can click on a link and “bookmark” it. The bookmarking system relies on setting cookies on your computer, which they politely tell you. Thank you! Once bookmarked, the links you have selected are available from a bar on the side. This system is handy, but it’s also fragile. If you delete your cookies, or use a different browser or device, the bookmarks are gone.
The SC system aims to preserve continuity with previous reference systems, and to enable multiple overlapping hierarchies if they apply. Thus we need to identify each type of reference, both for ourselves and for the user. A typical SC reference looks like this:
<a class="nya textual-info-paragraph" id="nya3" title="Nyanamoli section number." href="#nya3">3</a>
Thus the emphasis of the title is not to advertise a functionality, but to inform the user of the kind of reference. However, if you click on any of these references, it adds the fragment ID to the page URL, and you can simply tell your browser to bookmark that page. You can save as many of these as you like.
Browser bookmarks aren’t sexy, but they are a robust and time-tested technology. They have some crucial advantages over the cookie-based approach, namely that there is zero (0 as in none) development time, and you can easily share them between browsers and devices. Also, there are a bunch of bookmark addons for browsers that let you organize them in various ways, whereas if implemented on the site there is only one way of doing it.
A cardinal rule in Sujato’s Theory of Web Development is:
Thou shalt not reduplicate functionality in thy website that is already present in thy browser.
So it seems to me that, while it appears like a nice bit of tech, the bookmarking feature adds little functionality, and introduces brittleness and complexity. Another issue to bear in mind is that almost all the texts on 84000 are very long, while most of the texts on SC are quite short. Obviously there’s not so much need for bookmarking in short texts.
Another nifty feature in 84000 is the terminology lookup. This is a little akin to the Pali or Chinese lookups we have for original language texts, but here it is applied to terms in the translated texts.
The implementation is nice and discreet: by default, no terms are highlighted, but when you hover over a term, a subtle underline appears, and you can click to show the “glossary”. The glossary is not pre-loaded, which is good, as it doesn’t download excess data. You get a drawer from the bottom of the page, which contains the English translation, Tibetan, Sanskrit, and a definition of the term. It also gives you a list of “passages that contain this term”.
To take the last feature first, what this appears to do is to list the various occurrences of the term in question, but only counts them once per “passage”, where “passage” is presumably one numbered section. This is therefore pretty similar to simply using your “find” function in a browser, except that it eliminates duplicates, which is presumably because of the large number of repetitions. Might this come in handy? I don’t know, I’m struggling to think of a case where I’d want it. It seems, again, to add only a little to a basic browser function.
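To make the “once per passage” idea concrete, here’s a minimal sketch of that kind of deduplicated listing. This is not 84000’s actual code; the function and data shapes are invented for illustration, assuming each numbered section is a passage with an id and its text:

```javascript
// List the passages that contain a term, counting each passage at most once,
// no matter how many times the term repeats within it.
// `passages` is assumed to be an array of { id, text } objects.
function passagesContaining(term, passages) {
  const needle = term.toLowerCase();
  return passages
    .filter(p => p.text.toLowerCase().includes(needle))
    .map(p => p.id);
}

const passages = [
  { id: "1.1", text: "Form is impermanent. Form is suffering." },
  { id: "1.2", text: "Feeling is impermanent." },
  { id: "1.3", text: "What is form? Form is the four great elements." },
];

console.log(passagesContaining("form", passages)); // ["1.1", "1.3"]
```

Note that passage 1.1 appears once in the result even though the term occurs in it twice, which is exactly what distinguishes this from a plain “find”.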
Okay, so back to the glossary. This is implemented with the following markup:
<a href="#UT22084-031-002-3679" class="glossary-link pop-up">engage in union</a>
Once again, it appears as if they use an arbitrary ID, and a class to get the behavior. Presumably the IDs are mapped onto the terms somehow; we’d have to look closer to see how that is implemented. But this system has a nice property: if different texts use a different underlying Tibetan term, or a different English rendering, they can all simply be assigned the same arbitrary ID, and the term is uniquely tagged.
We have considered implementing something similar for SC, but it’s just so complex. 84000 deals with a much smaller range of source texts, and a single target language. Also, all the texts are, so far as I know, produced “in house”, so they can control the terminology at least to some degree. I’m not sure if the list of terms applies across all their translations, or on a per-translation basis.
How might we do something like that for SC? The classical approach would be to hand-code the terminology in the text files. This is technically simple, but a lot of work, and it’s really hard to implement across languages.
Another way might be to do it in the front end, leveraging the special qualities of our segmented texts.
The big problem in automating this process is the inherently loose nature of natural language. How can we determine that a given translation word or phrase is, in fact, representing a given underlying text?
Since we have (relatively) consistent matching of terminology between text and translation, perhaps we could achieve this with a double match: a terminology link is applied only where a given term appears in both the translation and the text, as determined by a pre-existing list of terms.
So, for example, assume we have a segment where the word “form” appears as a translation of the Pali word rūpa. The front-end parser would check that segment, find the word “form”, then consult a terminology list. There, the word “form” is given as a rendering of the Pali word rūpa, so it would go back and check the matching original text segment for that word. If it’s there, hooray!
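The double-matching step described above might be sketched like so. All names and data shapes here are hypothetical, assuming a glossary list pairing each English rendering with its Pali counterpart:

```javascript
// Double matching: a term is linked only if the English rendering appears in
// the translation segment AND its Pali counterpart appears in the matching
// root-text segment. This filters out coincidental uses of the English word.
const glossary = [
  { translation: "form", root: "rūpa" },
  { translation: "feeling", root: "vedanā" },
];

function matchTerms(translationSegment, rootSegment, glossary) {
  const trans = translationSegment.toLowerCase();
  const root = rootSegment.toLowerCase();
  return glossary.filter(
    entry =>
      trans.includes(entry.translation.toLowerCase()) &&
      root.includes(entry.root.toLowerCase())
  );
}

const matched = matchTerms(
  "Form is impermanent, feeling is impermanent.",
  "Rūpaṁ aniccaṁ, vedanā aniccā.",
  glossary
);
console.log(matched.map(e => e.translation)); // ["form", "feeling"]
```

A real implementation would need to be smarter about word boundaries and inflected forms (rūpaṁ, rūpassa, …), but the simple substring check already shows the shape of the idea.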
Since we match both text and translation, it’s likely that there will be hardly any false positives. It may happen sometimes, but hey, no-one’s perfect. On the other hand, we are more likely to omit some matches, as for example if the text and translation segments don’t entirely agree. That’s okay, it’s not so important to get every single instance, so long as there are not many false positives.
Since this scanning has to be done only on a per-segment basis, it will (hopefully) not be too resource-hungry.
Once the identification has been made, we can create a span to wrap the term, and supply a glossary entry to define it.
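As a sketch of that wrapping step (the class and attribute names are invented, not an actual SC convention), the front end could rewrite the segment’s HTML once a match is confirmed:

```javascript
// Wrap the first occurrence of a matched term in a span carrying the
// glossary definition, which the UI can then display on click or hover.
function wrapTerm(segmentHtml, term, definition) {
  // Escape regex metacharacters, then match case-insensitively,
  // replacing only the first occurrence.
  const re = new RegExp(term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "i");
  return segmentHtml.replace(
    re,
    match => `<span class="term" data-definition="${definition}">${match}</span>`
  );
}

console.log(wrapTerm("Form is impermanent.", "form", "rūpa: material form"));
// <span class="term" data-definition="rūpa: material form">Form</span> is impermanent.
```

Keeping the replacement to the first occurrence per segment avoids visually cluttering repetitive passages.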
This approach has a number of advantages. It keeps the text markup clean, not cluttering it with endless arrays of spans. And it is relatively easy to adapt to other languages. All you need is a list of terms with text and translation. Say there’s 500 terms to start with. For each language, you simply provide the 500 terms, and hopefully a brief definition for each. Once we have that, we can apply the same process to that language, so long as it has segmented translations.
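A per-language terminology list along these lines might look something like the following. The format is invented purely for illustration: one entry per root term, with a rendering and brief definition for each translation language:

```javascript
// Hypothetical shape for the shared terminology list: keyed by the Pali
// root term, with per-language renderings and short definitions.
const terms = [
  {
    root: "rūpa",
    renderings: {
      en: { term: "form", definition: "Material form; the first of the five aggregates." },
      de: { term: "Form", definition: "Materielle Form; das erste der fünf Aggregate." },
    },
  },
];

// Looking up the rendering for a given language is a simple property access.
console.log(terms[0].renderings.en.term); // form
```

Adding a new language then means supplying just the renderings and definitions; the matching and wrapping machinery stays the same.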
Of course, this approach would not work with the legacy translations. But that is okay: going forward, I want to focus our efforts on the segmented texts, precisely because they offer this kind of promise.