Data structures, IDs, and headings

sujato · February 8, 2016, 10:00am

@blake, @Vimala and I have been discussing data structures for parallels recently and I wanted to raise a few long-term issues.

The treatment of internal ID, and hence the way of marking such parallels, is not as clean as it should be, and it would be good to improve it. Here I will just mention a few points.

Segmenting: For the Pali text we have segmented it by punctuation, and this is the basis of the new translation, which will be segmented on similar lines. It would be good to consistently segment our texts on the site in a similar way, and use this as a basis for both IDs and translations. However this is not as simple as it sounds.

Chinese texts already have a granular reference system, i.e. the line numbers of the Taisho edition. This is so widely used that it is a de facto standard, and our reference system should use it. However, it is useless when it comes to segmenting text for translations, as it has no bearing on the semantic structure of the text. In fact the current text is poorly punctuated and, unless we can address this, it is hard to know how it could be segmented. Regardless, even if it is segmented, the Taisho line numbers will always remain the primary reference.

With Sanskrit texts, there is a wide variety, derived from different sources. In some cases it may be useful to segment these for translation, especially in the case of longer texts like the Mahaparinirvana Sutra. It would be good to make such segments subsets of the existing references; however, this will often be tricky, as the references are vol/page rather than semantic. Still, we could treat the segments as subsets of the existing vol/page system. In fact a similar principle could be applied to the Chinese texts, except based on line numbers. Doing this would reduce the number of independent reference systems we use, and make the references more useful and robust. Here’s what I mean. (For the example, existing vol/page references are in [square brackets], added segment IDs are in {curly}, and segments are assumed to apply to sentences.)

[V/P1]{V/P1a} A sentence that starts at the start of a page. {V/P1b} Another sentence. {V/P1c} A sentence that [V/P2] continues across a page break. {V/P2a} The first sentence that starts on the next page.
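
A minimal sketch of how the numbering in this example could be generated. The input shape (a list of sentences, each a list of strings and `("ref", id)` markers) and the function name `assign_segment_ids` are assumptions made for illustration, not an existing SC format. A sentence takes the reference in force where it starts, and the letter suffix restarts after every reference.

```python
from string import ascii_lowercase


def assign_segment_ids(sentences):
    """Assign IDs such as 'V/P1a' to sentences.

    Each sentence is a list of parts: plain strings, or ("ref", id) tuples
    marking where an existing vol/page reference falls in the text.
    """
    current_ref = None
    counter = 0
    result = []
    for parts in sentences:
        # A reference at the very start of a sentence applies to that sentence.
        if parts and isinstance(parts[0], tuple):
            current_ref = parts[0][1]
            counter = 0
        if current_ref is None:
            raise ValueError("text before the first reference has no base ID")
        # Sketch limitation: at most 26 segments per reference (a to z).
        segment_id = f"{current_ref}{ascii_lowercase[counter]}"
        counter += 1
        text = "".join(p for p in parts if isinstance(p, str))
        result.append((segment_id, text))
        # A reference inside the sentence only takes effect for later sentences.
        for part in parts[1:]:
            if isinstance(part, tuple):
                current_ref = part[1]
                counter = 0
    return result


sentences = [
    [("ref", "V/P1"), "A sentence that starts at the start of a page."],
    ["Another sentence."],
    ["A sentence that ", ("ref", "V/P2"), "continues across a page break."],
    ["The first sentence that starts on the next page."],
]
ids = [segment_id for segment_id, _ in assign_segment_ids(sentences)]
print(ids)  # ['V/P1a', 'V/P1b', 'V/P1c', 'V/P2a']
```

Because the suffix hangs off the existing reference, dropping the suffix still leaves a valid vol/page reference.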

Headings: There’s a basic principle in how documents are structured in HTML: the structure is inferred from the headings. This is very useful as it allows us to list parallels by matching the headings. I’ve found myself wanting to do this in the Mahavastu frequently. However it is a blunt instrument, as headings are not always consistent, and not everything has a heading (verses being a case in point). To be consistent, we would have to segment each text, assign each segment a unique ID, and identify parallels by referring to an ID or a range of IDs. This would replace both “embedded parallels” and parallels based on headings.

blake · February 8, 2016, 10:24am

I strongly agree on the usefulness of revisiting segmenting - probably every section of a text (normally starting with a heading) should be wrapped in a <section> or some other markup such as <div class="segment">. Most minimally, any block-level element with an id can be considered a segment.

Incidentally that last idea is related to something I brought up before - that I wanted to replace <p><a class="sc" id="1"> with <p id="1">. The basic idea would be to move away from ids referring to a point in the text, to ids referring to a block of the text. When an id is explicitly attached to a block you know exactly what is being referred to by that id, and you can also build ad hoc ranges (like block 1-3) and know exactly what is being referred to - it refers to blocks 1, 2 and 3 - and can’t be confused with referring to the text only between anchor 1 and 3.
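
As a rough illustration of block-scoped ids and ad hoc ranges, the sketch below collects every `<p id="...">` with Python's standard `html.parser` and resolves a range to whole blocks. The class and function names (`BlockCollector`, `resolve_range`) and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect the text of every <p> that carries an id, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = {}      # id -> text; dicts keep insertion order
        self._current = None  # id of the <p> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = dict(attrs).get("id")
            if self._current is not None:
                self.blocks[self._current] = ""

    def handle_endtag(self, tag):
        if tag == "p":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.blocks[self._current] += data


def resolve_range(html, start_id, end_id):
    """Return the (id, text) pairs for every block from start_id to end_id inclusive."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    ids = list(collector.blocks)
    first, last = ids.index(start_id), ids.index(end_id)  # ValueError if an id is missing
    return [(block_id, collector.blocks[block_id]) for block_id in ids[first:last + 1]]


sample = '<p id="1">One.</p><p id="2">Two.</p><p id="3">Three.</p><p id="4">Four.</p>'
blocks_1_to_3 = resolve_range(sample, "1", "3")
print(blocks_1_to_3)  # [('1', 'One.'), ('2', 'Two.'), ('3', 'Three.')]
```

The range always resolves to complete blocks, which is the unambiguity being argued for here.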

Vimala · February 8, 2016, 12:38pm

I agree with moving away from the sc tags to <p id="1">. Right now I refer to the sc IDs in the parallels for the Dhp verses that I’m working on, but it would be easier if I could refer to the paragraph.
Also, I wonder about the usefulness of all the various references in some texts. Do we need them all and if so, do we need them all to show up when Textual Information is clicked? Do we for instance need WP tags in the pali texts?
In some texts there are so many different references like the English pi-tv-kd1. Is this really useful?

There just does not seem to be much of a system of referring to specific parts of the text so it would be good to have id’s on every paragraph.

sujato · February 9, 2016, 4:16am

Are we talking at cross purposes here? I’m using “segment” to refer to the inline segments as done in pootle, and considering how to apply this to create a granular reference system site-wide.

Okay, this is a complex issue, and I’m not persuaded—yet. The advantage is that it makes the scope of elements explicit: an ID scope = the scope of that HTML element.

However we run into problems when it comes to the nitty gritty. Consider, say, vol/page IDs. These typically occur in the middle of a paragraph. If we are to explicitly define the scope of these it is a complex matter indeed, as we run up against the old “<p> tags are an absolute barrier to inline tags” problem in HTML. We would have to do something like:

<p>Some text. <span class="vp" id="vp1" some-other-identifying-thing="X"> Some more text. and some more text.</span></p>
<p><span class="vp" some-other-identifying-thing="X">Some more text. And more</span></p>
<p><span class="vp" some-other-identifying-thing="X">Some more text. And more</span></p>
<p><span class="vp" some-other-identifying-thing="X">Some more text.</span><span class="vp" id="vp2" some-other-identifying-thing="Y"> And more</span></p>

And so on. I don’t know about you, but I’m not feeling it. Or, we could do as we do now:

<p>Some text. <a class="vp" id="vp1"></a> Some more text. and some more text.</p>
<p>Some more text. And more.</p>
<p>Some more text. And more.</p>
<p>Some more text. <a class="vp" id="vp2"></a>And more</p>

Clean and simple. This is a common use case, but it is just one of the complexities that we would rapidly encounter.

We could, of course, adopt a hybrid system, with IDs defined by the scope of the HTML element in some cases, and in other cases not. Yeah, I don’t think so.

It seems to me that the current system works basically fine, and, importantly, it doesn’t fight against the nature of HTML. It is what it is. HTML simply doesn’t offer a native way of defining things across multiple paragraphs like that: that’s not how it works. You can define things as a block level, a span inside a block, or a point, and that’s about it. Of course you can do all sorts of complicated things to work around this, but you need a really good reason.

Think about how HTML normally defines document structure. A section of a document can be explicitly marked as such, but normally it is just by heading level, as I mention above. The scope of a section headed by <h2> is not defined by the </h2> but by the start of the next <h2>. Document structure is inferred, not explicit. Of course you can make it explicit, and the <section> tag is there for that, but it is not needed unless there is a special reason.

By defining IDs as points, we are working in a similar way. The scope of <a class="sc" id="sc1"></a> is inferred to be up to the beginning of <a class="sc" id="sc2"></a>. As long as this is understood consistently, it shouldn’t be a problem.
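
A sketch of that inference, using Python's standard `html.parser`: each empty anchor opens a scope that runs until the next anchor of the same class, across paragraph boundaries. The names `AnchorScopes` and `anchor_scopes` and the sample markup are assumptions for illustration only.

```python
from html.parser import HTMLParser


class AnchorScopes(HTMLParser):
    """Gather the text that follows each point anchor, up to the next anchor."""

    def __init__(self, anchor_class):
        super().__init__()
        self.anchor_class = anchor_class
        self.scopes = {}      # anchor id -> list of text fragments
        self._current = None  # id of the most recent anchor seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == self.anchor_class and "id" in attrs:
            self._current = attrs["id"]
            self.scopes[self._current] = []

    def handle_endtag(self, tag):
        # Keep a space where a paragraph ends so words do not run together.
        if tag == "p" and self._current is not None:
            self.scopes[self._current].append(" ")

    def handle_data(self, data):
        if self._current is not None:
            self.scopes[self._current].append(data)


def anchor_scopes(html, anchor_class="sc"):
    """Map each anchor id to the whitespace-normalised text in its inferred scope."""
    parser = AnchorScopes(anchor_class)
    parser.feed(html)
    parser.close()
    return {key: " ".join("".join(parts).split()) for key, parts in parser.scopes.items()}


sample = (
    '<p>Some text. <a class="sc" id="sc1"></a> Some more text.</p>'
    '<p>Next paragraph. <a class="sc" id="sc2"></a>And more.</p>'
)
scopes = anchor_scopes(sample)
print(scopes)  # {'sc1': 'Some more text. Next paragraph.', 'sc2': 'And more.'}
```

Text before the first anchor belongs to no scope in this sketch, and the last scope simply runs to the end of the document.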

Agreed, the potential for confusion is there, but if we’re clear and careful I can’t see a problem.

Note also that this is not just in line with how HTML document structure works, but also with how our source texts work. The scope of a page number is defined by the next page number. You don’t explicitly write “This page ends, new page begins”. There is a point, and the scope is inferred.

Actually this was something I was going to mention. The “sc” numbers are essentially a fallback. In general, they should only be used in the case where there are no suitable references in the source text. The idea is that we should avoid adding our own referencing system but should, so far as is possible, inherit the system developed by others. Occam’s principle of referencing: thou shalt not multiply reference systems unnecessarily. We haven’t always been as clear with this as we should, but that’s the general idea.

This is why, in my original post, I suggested we treat text segments as subsets of an existing reference system. This is something I’ve discussed with Blake previously. On Pootle, the segments are simply numbered sequentially from the beginning of the text. However, I am suggesting we restart them after each occurrence of our main reference in that particular text. That way, the reference is still meaningful even if the segments aren’t there.

Our system needs to not only work in and of itself, but, so far as is possible, to be consistent and interoperable with other works.

With regard to the Dhammapadas, the IDs should be assigned a suitable class, and additional information given in the “title” attribute. Eg the Gdhp is from the Brough edition, so <a class="brough" id="brough1" title="Verse number in the Brough edition."></a>

Absolutely, these are critical. We need to display ways of getting to the text. It is an awesome feature of SC: not only do we supply the information, we tell you what it means when you need it. No other site does this, so far as I know, and it makes searching for references much, much easier.

Again, absolutely! These were hand-added by Ven KB for SC, and they are awesome. If you’re reading the English translation, you can go straight to the Pali passage. Again, nowhere else does this, so far as I know. Also, these numbers are used in many of our other languages, so they are useful for more than English.

Again, yes, it certainly is. Consider, for example, the referencing used in the English translation itself. Even there, when IB Horner is referring to the same text she is translating, she uses multiple systems: the volume/page of the Pali, the volume/page of the English, the rule number, or the section number of the Pali. It is a nightmare. Our SC text smooths the path by defining each reference at each occurrence, so you always know what is being referred to.

Yes, as a minimum, if a text does not have paragraph numbers, we should add them. (But as <a> tags, unless we decide to adopt Blake’s suggestion).

Vimala · February 10, 2016, 6:10am

The Dharmapadas I have already changed - I only used sc tags there as temporary markers (mc and jb were not yet in the paragraph-listings in css and js files). And as you say, we always use the IDs of the original documents when they exist. But in this case I was actually talking about the IDs for the parallels, like in the Mahavastu you just did: I added the sc tags to your documents in order to be able to link the verse-parallels to them.

sujato · February 10, 2016, 7:48am

Here’s another ID issue to think about.

We have our parallel tables, and one of the foundations of SC is the interconnection of the texts.

However, there is a massive source of parallel information that we do not exploit, and that is the ID tags in the texts.

There are many texts that share inline IDs with other texts. Whether this be the various translations of MN and DN, which include the “WP” tags; or, say the translation of the Vinaya that includes V/P references for the Pali.

I’m wondering whether we can develop a widget that will scan for such equivalent IDs, and make them available when “Textual Information” is exposed. Perhaps we’d add a or something, and say “go to original text” or something like that, eg:

<a class="wp" id="wp12" title="Go to original text or other translation">WP 12 ▶</a>

Then a click to call up a bunch of options as appropriate, perhaps like the Translations widget in the Division pages.
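
One way the scanning step might work is an inverted index from each inline ID to the texts that contain it. This is only a sketch: the text identifiers (`pi/mn1`, `en/mn1`, `de/mn1`), the ID values, and both function names are made up for illustration.

```python
from collections import defaultdict


def build_reference_index(texts):
    """Invert {text uid: [inline reference ids]} into {reference id: {text uids}}."""
    index = defaultdict(set)
    for text_uid, ref_ids in texts.items():
        for ref_id in ref_ids:
            index[ref_id].add(text_uid)
    return index


def other_texts_with(index, ref_id, current_uid):
    """List every other text that carries the same inline reference."""
    return sorted(index.get(ref_id, set()) - {current_uid})


# Hypothetical data: which inline IDs were found in which text.
texts = {
    "pi/mn1": ["wp12", "wp13"],
    "en/mn1": ["wp12"],
    "de/mn1": ["wp12", "wp14"],
}
index = build_reference_index(texts)
targets = other_texts_with(index, "wp12", "en/mn1")
print(targets)  # ['de/mn1', 'pi/mn1']
```

The list returned for a given ID is what the click would offer as options.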

LXNDR · February 10, 2016, 10:45am

if i’m understanding the idea correctly that it’s about browsing between specific sections of equivalent texts rather than between whole texts, then it would be a massive feature

sujato · February 10, 2016, 11:01pm

That’s right. Just an idea so far!

sujato · February 21, 2016, 3:04am

I’m just thinking through some applications for our new data structure, and would like to throw out some ideas. This is not concerning the handling of parallels per se, but the hierarchy of texts.

Consider the Pali texts. They have a well defined hierarchy, which unfortunately we don’t represent well on SC. This is because we started with the “sutta” as the fundamental entity, and have no complete representation of the hierarchy.

So, fine, we do that, it’s not difficult. But how to represent it on the site?

One thing we can do is make more of a “breadcrumb” style navigation. Many sites, such as AtI, do this, and in a hierarchical setting it makes sense.

Let’s take a simple example.

MN → Mulapannasa → Mulapariyayavagga → Mulapariyaya Sutta

We put this on the page or the sidebar or wherever. Note that this overlaps with the current system of putting extra information above the main <h1>, so that would have to change.

This serves two purposes.

It gives the user a visual indication of where they are in the collections.
It aids navigation since the elements are clickable.

Fine: but what do they click through to? Well, the page for that thing. This implies that the views we provide will be far more flexible. You can see a vagga, a pannasa, or whatever. Obviously we need to have a URL pattern for these.

So far, SC’s URLs are designed for maximum brevity. This works well in things like the nikayas, but it starts to break down in the Vinayas, where the URLs, while still meaningful, are far from intuitive.

My point here is that if we try to assign URLs to these that are as concise as we have elsewhere, it will rapidly become so confusing as to lose the point. So I suggest that for this purpose we assign URLs that simply give the plain text name of whatever the thing is, as defined in the JSON hierarchy data. This won’t make for easily writable URLs, but at least you can read them and know what they mean.

So the four entities we have in the list above would have the following URLs:

majjhimanikaya
majjhimanikaya/mulapannasa
majjhimanikaya/mulapannasa/mulapariyayavagga
majjhimanikaya/mulapannasa/mulapariyayavagga/mulapariyayasutta

Obviously we are going to have to implement synonyms, especially for the main levels that we already show on SC, such as division, subdivision, and sutta. I’m not suggesting that majjhimanikaya/mulapannasa/mulapariyayavagga/mulapariyayasutta becomes the default URL for MN1!
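
A small sketch of how the long URLs and their synonyms could both be derived from one nested hierarchy. The structure of the data (`name`, `uid`, `children` keys) is an assumption, not the actual JSON hierarchy format, and `build_urls` is a hypothetical helper.

```python
def build_urls(node, parent_path=""):
    """Yield (path, node) for a node and all of its descendants."""
    path = f"{parent_path}/{node['name']}" if parent_path else node["name"]
    yield path, node
    for child in node.get("children", []):
        yield from build_urls(child, path)


# Assumed shape of the hierarchy data; "uid" stands for an existing short URL.
hierarchy = {
    "name": "majjhimanikaya",
    "uid": "mn",
    "children": [{
        "name": "mulapannasa",
        "children": [{
            "name": "mulapariyayavagga",
            "children": [{"name": "mulapariyayasutta", "uid": "mn1"}],
        }],
    }],
}

long_urls = [path for path, _ in build_urls(hierarchy)]

# Synonyms: each existing short URL maps to the long descriptive URL.
synonyms = {node["uid"]: path for path, node in build_urls(hierarchy) if "uid" in node}

print(long_urls[-1])    # majjhimanikaya/mulapannasa/mulapariyayavagga/mulapariyayasutta
print(synonyms["mn1"])  # the same long path, reachable from the short URL
```

Levels without a short URL, such as the pannasa and vagga here, would only be reachable through the long form.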

In different collections we should use the various structures as appropriate: samyutta, nipata, khandhaka, vastu, juan, and so on.

Here comes the cool part: we could also implement parallels! There are lots of higher-level structures with parallels. This is a way of exposing a richness of intertextuality that is not captured when considering only sutta-sutta parallels.

majjhimanikaya
↳madhyamagama
majjhimanikaya/mulapannasa
majjhimanikaya/mulapannasa/mulapariyayavagga
↳madhyamagama/avaggathatparallelsmulapariyayavaggaifsuchathingexists
majjhimanikaya/mulapannasa/mulapariyayavagga/mulapariyayasutta
↳madhyamagama/avaggathatparallelsmulapariyayavaggaifsuchathingexists/paralleltomulapariyayasutta

Something to bear in mind is that higher level structures, like suttas, don’t always have neat parallels. Consider the SA. In the Chinese text, this has become disordered, so that the sequence of suttas does not always fit into the correct samyutta. This has been noted and corrected by a number of scholars, including our own Rod Bucknell. By including such data, we could display SA by default as it is now, with the disordered texts as found in the Taisho edition. But we could also display it as parallel samyuttas to the Pali, with the texts in their reconstructed sequence. (In this case, incidentally, the reconstruction is done with a high level of confidence. It was done by Yinshun on the basis of the Pali and Chinese alone, while meanwhile, unbeknown to him, an actual Sanskrit text was discovered that confirmed his findings.) This has a definite practical advantage, since it means that someone studying a particular samyutta can do so much more easily.

As a further detail: each URL should have a description. As I’m doing my translations, I’m writing a short description of each sutta. Ultimately we can aim to have such a description for each text on SC, in each language. But there’s no need to restrict descriptions to suttas. On AtI, for example, they have a description of the nikaya. And for many of the entities we have, a description can be meaningful and useful.

Obviously we will never really have descriptions for everything in every language, and anyway there are many things that wouldn’t really benefit from a description (for example, each sutta in a repetition series). So the design will have to allow for a description to be present or not.

Equally obviously, we will have to think carefully about how these are to be used. My point here is simply to indicate some of the possibilities that our new approach opens up for us.

sujato · February 26, 2016, 10:50am

I thought some more about this, and I think it would be better to simply use the new data structure to replace the existing heading structure, rather than adding anything new.

The reason I was hesitant was because I didn’t want to lose any special information that might be found in those headings. But really there isn’t much to worry about.

This makes the whole thing simpler, more consistent, and richer, and doesn’t break with our current view. Another important advantage is that the current system lists the texts vertically, whereas in a breadcrumb they’re horizontal. A horizontal layout would be problematic, because these sometimes get quite long, and we have to squeeze them onto a mobile screen. Also, the length varies a lot between languages, so this also becomes a problem. Vertical is easy, though!

So we nix the existing heading structure and supply it from the JSON data. Where the data doesn’t exist or hasn’t been translated, we’d have to fall back to the existing hard-coded headings.
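
The fallback rule could look something like the sketch below. The two dictionaries, their keys, and the function name `resolve_heading` are assumptions, and the English title is a placeholder rather than a real translation.

```python
def resolve_heading(uid, lang, json_headings, hardcoded_headings):
    """Return (heading, source): the JSON heading if translated, else the hard-coded one."""
    translated = json_headings.get(uid, {}).get(lang)
    if translated:
        return translated, "json"
    return hardcoded_headings.get(uid), "hardcoded"


# Hypothetical data: headings from the JSON hierarchy, keyed by entity and language.
json_headings = {
    "mulapariyayavagga": {"en": "Placeholder English title"},
    "mulapannasa": {},  # entity exists, but nothing has been translated yet
}
# Headings as currently hard-coded in the text files.
hardcoded_headings = {
    "mulapariyayavagga": "Mūlapariyāyavagga",
    "mulapannasa": "Mūlapaṇṇāsa",
}

print(resolve_heading("mulapariyayavagga", "en", json_headings, hardcoded_headings))
print(resolve_heading("mulapannasa", "en", json_headings, hardcoded_headings))
```

Returning the source alongside the heading lets the template decide whether the heading can be rendered as an active link.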

As described above, these JSON-ified headings should all be active links that lead to their appropriate view. Perhaps the descriptions can be supplied here as well, but that might be overkill.

mikenz66 · February 27, 2016, 12:05am

I’d certainly like more headings/data. For example, the books of the SN, the titles of the samyuttas and suttas in one’s native language, and so on.

I often find that I’ll go to ATI to find a sutta, because I know the structure, and I have the translations of the titles. I might remember that I want one of those “elephant’s footprints” suttas in the MN, but can’t remember the Pali for “elephant”…

ATI also has abstracts for the suttas, but that may be a bit too much to implement.

sujato · February 27, 2016, 12:29am

Yes, it would be nice, wouldn’t it? Because our original texts come from so many varied sources, we just use the lines above the main heading for whatever was convenient based on the original source. This is a great chance to make this more consistent and useful.

This is in the pipeline. We’re calling them “descriptions”, but it’s the same thing. I’m doing them for all the suttas as I go with the translations.

We will make this part of the translation pipeline. In some cases these already exist in the various languages, in some cases they can be made anew.

mikenz66 · February 27, 2016, 2:07am

Hi Bhante,

That’s all good to know. Yes, descriptions. I’m in academic mode here today…