How to indicate ranges of bilara segments?

sujato · December 5, 2019, 10:04am

A question just came up in discussion with Ven Brahmali, and I wanted to bring @blake, @hongda, @chansik_park, and @karl_lew and any other interested parties in on it.

This is a situation that we have not yet considered, and we should do so. We need an internal standard for referring to ranges of segments in a consistent way. This would be used in such places as:

Here in the context of notes.
“text highlighting” for the text pages, based on sectional parallels
Links in dictionaries and essays

The first thing is that the # means a “link to something inside the page or document”. It is how URLs specify internal links. We used to use this for our IDs, but it is ugly, so now we use : and transform it when needed. Or at least, that’s the theory!

Bottom line is, in a Bilara ID, everything that specifies a particular text (i.e. the part of the ID that is the same for every segment in a text) is followed by a colon. Everything after the colon identifies segments within the text.

So it should be:

pli-tv-bu-vb-np1:3.2.1
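
A minimal sketch of that rule in Python, purely for illustration: the function name `split_segment_id` is hypothetical and not part of Bilara. It assumes the text part of an ID never contains a colon, so the first colon is the boundary.

```python
from typing import Tuple


def split_segment_id(segment_id: str) -> Tuple[str, str]:
    """Split an ID into (text part, segment part) at the first colon."""
    text_uid, sep, segment_number = segment_id.partition(":")
    if not sep or not text_uid or not segment_number:
        raise ValueError(f"not a segment ID: {segment_id!r}")
    return text_uid, segment_number


# The hyphens in the text part are untouched; only the colon matters here.
print(split_segment_id("pli-tv-bu-vb-np1:3.2.1"))
```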

Since whatever follows the colon specifies the segments in the page, that can be used to indicate a range. The most obvious way to do this would be to make a rule:

Ranges of segments are defined using the full number following the colon.

pli-tv-bu-vb-np1:3.2.5-3.2.25

Question 1

Now, sometimes within segments we use a hyphen. This indicates a different kind of range. Not a range of segments, but where a segment captures a range within the original text (typically where a text is so abbreviated that it is impossible to specify each item).

dn11:9-66.1

Is this a problem? Do we need to disambiguate the two kinds of ranges? To be clear, the range of segments proposed here will never appear as a segment ID. It is a canonical way to refer to a range of segments, for example, in a note.

Question 2

Another question. It is common in similar cases to present only as much detail in the range as is necessary. Thus we might have:

pli-tv-bu-vb-np1:3.2.5-25

Or in another case:

pli-tv-bu-vb-np1:3.2.5-3.3.4

This allows more concision, but I think it adds complexity and should be avoided. What do you think?

Any other thoughts?

karl_lew · December 5, 2019, 1:46pm

Voice today uses ranges for selections at the sutta or segment level:

ID	Meaning	Notes
MN1-3	MN1 MN2 MN3	Nikaya id range
AN1.9-11	AN1.9 AN1.10 AN1.11	Subnikaya id range
MN1:9-10	MN1:9.1-4 MN1:10.1-4	Major segment id range
MN1:9.1-4	MN1:9.1 MN1:9.2 MN1:9.3 MN1:9.4	Minor segment id range

Range selection is implemented using sutta-central-id.js, which is used to match suttas and/or segment ranges.

Voice does not distinguish between root and user ranges. They use the same syntax. AN1.1-10 and AN1.2-11 have the same semantic user meaning. Voice automatically chooses the correct files to display to the user. I.e., for AN1.2-11, Voice would choose AN1.1-10 and AN1.11-20.

Voice implements this simple range paradigm, where the hyphen defines a range between the atomic numbers it joins (i.e., 5 through 25).
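
To make the "hyphen joins two atomic numbers" idea concrete, here is a rough Python sketch. It is not the code in sutta-central-id.js; the function name `expand_tight_range` is made up, and it only enumerates integers without consulting any data.

```python
import re
from typing import List


def expand_tight_range(ref: str) -> List[str]:
    """Expand a trailing 'N-M' into one reference per integer from N to M.

    References that do not end in 'N-M' are returned unchanged.
    """
    match = re.fullmatch(r"(.*?)(\d+)-(\d+)", ref)
    if match is None:
        return [ref]
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    if last < first:
        raise ValueError(f"descending range: {ref!r}")
    return [f"{prefix}{n}" for n in range(first, last + 1)]


for ref in ["MN1-3", "AN1.9-11", "MN1:9.1-4"]:
    print(ref, "->", " ".join(expand_tight_range(ref)))
```

Note that a major segment range such as MN1:9-10 would come out of this sketch as MN1:9 and MN1:10 only. Expanding those further into minor segments, as in the table, requires knowing which segments actually exist.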

This would really confuse Voice. Voice hyphens bind tighter than period or colon. The semantic organization of segments itself argues that one would rarely reference fragmented groups of segments. I.e., either of the following doesn’t require complicated syntax and is easier to implement:

pli-tv-bu-vb-np1:3.2-3
pli-tv-bu-vb-np1:3.2.5-9, pli-tv-bu-vb-np1:3.3.1-4

In other words, hyphens bind tightly and commas bind loosely. Let’s not get too fancy with rarely used syntax.

sujato · December 6, 2019, 6:39am

Okay, excellent.

How would you indicate, say: dn11:9-66.1 to dn11:9-66.7? Would you be okay with dn11:9-66.1-7?

To avoid such ambiguities, one option would be to use -- for ranges of segments. Thus:

One or more hyphens indicates a range.
Exactly one hyphen indicates a hard-coded ID range
Two hyphens indicate a range of segments

dn11:9-66.1--9-66.7
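
A small Python sketch of how the double-hyphen rule could be read by code. The names `SegmentRange` and `parse_reference` are hypothetical. It assumes the text part contains no colon and that a reference contains at most one double hyphen.

```python
from typing import NamedTuple


class SegmentRange(NamedTuple):
    text_uid: str
    start: str
    end: str


def parse_reference(ref: str) -> SegmentRange:
    """Parse 'uid:start--end' or a plain 'uid:segment' reference."""
    text_uid, sep, rest = ref.partition(":")
    if not sep or not text_uid or not rest:
        raise ValueError(f"no segment part in {ref!r}")
    if "--" in rest:
        start, _, end = rest.partition("--")
        if not start or not end:
            raise ValueError(f"incomplete range in {ref!r}")
        return SegmentRange(text_uid, start, end)
    # A single segment is treated as a range of one.
    return SegmentRange(text_uid, rest, rest)


print(parse_reference("dn11:9-66.1--9-66.7"))
print(parse_reference("dn11:9-66.1"))
```

Single hyphens are never interpreted here, so a hard-coded ID range like 9-66.1 passes through as an opaque segment number.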

No, this isn’t a safe assumption. Sectional parallels may target any arbitrary range of text.

It’s not rarely used. On a quick count, there are over 4,000 sectional parallels. We need a bulletproof method of identifying these in all cases.

We can’t use commas: it must be JSON-friendly.

But they fail to specify the needed range in a usable form.

The issue really is that Voice only handles a subset of the cases that SC does. If we can learn from your system, then great, but we can’t assume it.

karl_lew · December 6, 2019, 1:25pm

Ahhh. Thanks for explaining. The alignment would be broken since the translation itself would introduce a slightly different conceptual organization. We’ve seen that recently in an earlier thread about “The Buddha said” vs. “The mendicant said”.

Voice implements range comparison, so the above could be unambiguously understood as “everything between dn11:9.1 through dn11:66.7”. Range comparison sidesteps the ambiguity of enumeration with very simple rules. We may not know exactly what segments lie within a range, but we can certainly assert that something is or is not within a range. Perhaps we might not need the “--”?
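
Roughly what range comparison means, as a Python sketch rather than Voice's actual code. The helper names `segment_key` and `in_range` are invented, and the sketch only handles plain dotted numbers; a hard-coded ID range such as 9-66.1 would need its own rule and raises an error here.

```python
from typing import Tuple


def segment_key(number: str) -> Tuple[int, ...]:
    """Turn '3.2.5' into (3, 2, 5) so numbers compare numerically."""
    return tuple(int(part) for part in number.split("."))


def in_range(segment: str, start: str, end: str) -> bool:
    """True if segment lies between start and end, inclusive."""
    return segment_key(start) <= segment_key(segment) <= segment_key(end)


# No enumeration needed: we never list the segments between the endpoints.
print(in_range("3.2.10", "3.2.5", "3.2.25"))
print(in_range("3.2.30", "3.2.5", "3.2.25"))
```

Comparing tuples of integers matters: as plain strings, "3.2.10" would sort before "3.2.5".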

Yes, it would indeed be friendlier to use arrays rather than embed commas in strings. If disjoint references are required, we could simply have an array of individual references for the many-to-1 or 1-to-many references that may (?) arise. JSON properties can freely switch between string and array values. However, before we take that step, all software that reads those references would have to handle both strings and arrays. The advantage of embedding commas is that less software needs to understand what’s in the string.

My concern is implementation. Siri is usable but horribly complicated to implement.

blake · December 6, 2019, 4:00pm

I am still musing on this matter, but I’ll give my thoughts so far.

So one way to differentiate would be to query “is this an actual segment ID?” and, if it doesn’t exist in the data, proceed to treat it as a segment range reference. One potential hiccup is that a segment ID may not exist in the root text for another reason, an example being the modern headings that only appear in the translation.

A question then: is it desirable to know what a segment range reference is without having to query the data at all? Probably. So then, I find the idea of the double hyphen attractive. In fact, exclusively using double hyphens for ranges would simplify code: if a segment ID contains a double hyphen, it’s a segment range.

I’m not sure why you think commas are not JSON friendly; in fact, we already use a colon in the segment ID despite it being a part of JSON syntax. Within a string in JSON the only two characters that have to be escaped are double-quote and backslash; anything else (other than control characters like newline) is fair game. I’m not saying that commas are the right approach to disjointed ranges, but they should be an option.
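
A quick check of the JSON point using Python's standard `json` module. The reference strings are only examples.

```python
import json

# Commas and colons inside a JSON string need no escaping.
refs = ["~sn46.33", "~an5.23:1,~an5.23:2", "an3.101"]
encoded = json.dumps({"parallels": refs})
assert json.loads(encoded)["parallels"] == refs
assert json.dumps("a,b:c") == '"a,b:c"'

# Double-quote and backslash are the two printable characters that get escaped.
assert json.dumps('say "hi"') == '"say \\"hi\\""'
assert json.dumps("back\\slash") == '"back\\\\slash"'

print(encoded)
```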

Anyway, so far I think I’d favor double-hyphen and no abbreviation. Abbreviation might not be a big deal, but then you can end up with weird things like dn1:1.1--3, which probably means 1.1-1.3 but could also be interpreted as 1.1-3.x. Of course, computer code can be programmed to always interpret it in a particular way, but if the reference is being used by a meat brain, who knows.

Now to be fair, in the majority of cases sense can be made of an abbreviated form, and also single-hyphen forms. There is an argument for their use in organic text in that non-abbreviated forms can be unwieldy.

So perhaps I would favor the “canonical” form being pli-tv-bu-vb-np1:3.2.5--3.2.29: the code is written with that form in mind, and we use that form in data (i.e. parallel data). However, we also support an abbreviated form, such as pli-tv-bu-vb-np1:3.2.5-29. Code which deals with references (say, a reference linkifier), upon discovering that this doesn’t make sense as a segment reference, can try converting it into the canonical form and treating it as a segment range reference instead. Then if a user runs into a weird case where the abbreviated form doesn’t do what they expect (say the linkifier highlights the wrong segments), they can be told to use the canonical form.
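
A sketch of that fallback in Python, under stated assumptions: `to_canonical` and `known_ids` are hypothetical names, `known_ids` stands in for whatever lookup tells us a segment ID really exists, and the abbreviated end is assumed to replace the same number of trailing components of the start.

```python
from typing import Set


def to_canonical(ref: str, known_ids: Set[str]) -> str:
    """Convert an abbreviated single-hyphen range to the double-hyphen form.

    Real segment IDs and references already in canonical form are returned as-is.
    """
    if ref in known_ids or "--" in ref:
        return ref
    text_uid, sep, rest = ref.partition(":")
    start, hyphen, end = rest.rpartition("-")
    if not sep or not hyphen or not start or not end:
        return ref  # nothing that looks like a range
    start_parts = start.split(".")
    end_parts = end.split(".")
    # Keep as much of the start's prefix as the abbreviated end leaves out.
    keep = max(0, len(start_parts) - len(end_parts))
    full_end = ".".join(start_parts[:keep] + end_parts)
    return f"{text_uid}:{start}--{full_end}"


known = {"dn11:9-66.1", "dn11:9-66.7"}  # made-up stand-in for the real data
print(to_canonical("pli-tv-bu-vb-np1:3.2.5-29", known))
print(to_canonical("dn11:9-66.1", known))
```

The sketch resolves dn1:1.1-3 as dn1:1.1--1.3, which is only one of the two readings mentioned above, so the canonical form remains the safer choice in data.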

karl_lew · December 6, 2019, 8:03pm

Note: for those following this thread in D&D, we have drafted a preliminary proposal for SuttaCentral Identifiers and References. This issue broadly affects translators, engineers and everyday users. Please RSVP with thoughts, concerns and recommendations.

sujato · December 7, 2019, 8:27am

Fine: but dn11:9 and dn11:66, as well as all the numbers in between, don’t exist. These are abbreviated in the text, so the bilara-data numbers only have dn11:9-66.

Yikes, that seems pretty expensive. These references might be hundreds of segments long. We might want to expand them for application purposes, but surely the basic data should be more succinct.

Right, that’s my thinking too.

Sure. I guess I just feel like a comma indicates distinct things, not a single thing (that happens to span a range of segments). Take for example the current entry in parallels.json:

"parallels": ["~sn46.33","~an5.23#1-#2","an3.101"]

Assuming we apply the double-hyphen proposal this would be cleaner:

"parallels": ["~sn46.33","~an5.23:1--2","an3.101"]

Using commas, something like

"parallels": ["~sn46.33","~an5.23:1,~an5.23:2","an3.101"]

Which doesn’t seem great.
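
For what it’s worth, rewriting the old # form into the double-hyphen form looks mechanical. A Python sketch, where the pattern is inferred only from the single "~an5.23#1-#2" example above and the name `convert_legacy` is made up; other legacy shapes may exist that it does not cover.

```python
import re

# base, then '#start', then an optional '-#end'
LEGACY = re.compile(r"([^#]+)#(.+?)(?:-#(.+))?")


def convert_legacy(ref: str) -> str:
    """Rewrite 'uid#a-#b' as 'uid:a--b'; leave anything else unchanged."""
    match = LEGACY.fullmatch(ref)
    if match is None:
        return ref
    base, start, end = match.groups()
    if end is None:
        return f"{base}:{start}"
    return f"{base}:{start}--{end}"


old = ["~sn46.33", "~an5.23#1-#2", "an3.101"]
print([convert_legacy(ref) for ref in old])
```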

I mean, okay, but the older and grumpier I get, the more I appreciate having strict data. Coding loose logic into applications adds complexity and technical debt. If we’re going to anticipate variant forms, where do we stop? I’d rather put the work in to ensure that all our data is consistent. We are in the rather privileged situation of having a well-curated and specific set of data, so we can afford that luxury. That said, it’s also a pragmatic matter, so I guess if a particular part of the app requires it, such a fallback could be used.

Awesome, thanks, I will check it out.

One thing that is at the back of my mind is the rather disturbing notion that in a fully segmentized world we can get rid of the concept of a “sutta”. A sutta becomes nothing more than a set of segments. The fact that a sutta is contained in a single file is purely a convenience.

While in the bulk of cases this is irrelevant, for texts like the Dhammapada or the Anguttara Ones, the notion of a “sutta” is barely adequate to structure the text.

In principle we could have all the text for the entire canon in one file, and all the structure in a separate JSON file.


karl_lew · December 7, 2019, 2:43pm

Works for me.

Pragmatically, files are expedient as editing and translation units. However, from a presentation and search point of view, the SC EBTs are basically an ordered list of segments in multiple language dimensions. The EBTs therefore inhabit the dimension of infinite segment space. Providing the world direct access to the infinite segment space of the EBTs nurtures the infinite consciousness of the EBTs. And that perhaps is a step towards global equanimity.

sujato · December 8, 2019, 6:44am

blake · December 9, 2019, 4:07pm

Coincidentally (or partly related to this thread, partly related to implementing new features in Bilara) I’ve been thinking the same with Bilara. At the moment Bilara operates in terms of suttas. I am seriously considering changing it to operate in terms of “sequences of segments”, as it will make certain features like global find-replace much easier to implement, and I need to make other internal changes anyway to support things like trilinear translation.

So as an example of how this would work: in the sutta translation view the server will send the sequence of segments that pertains to that sutta (so you won’t really notice anything different), in a global search replace view it sends the sequence of segments that match the search query. And more-or-less the same code can be used for both.

It actually already almost works this way when sending the translations back to the server, as it sends back just segments rather than entire texts.
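
A toy Python sketch of the “sequences of segments” idea, not Bilara’s actual code. The store, its placeholder strings, and the function names are all invented for illustration. The point is that both views produce the same shape of data, so one renderer serves both.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (segment ID, text)

# Made-up placeholder data; insertion order stands in for canonical order.
STORE: Dict[str, str] = {
    "mn1:1.1": "placeholder alpha",
    "mn1:1.2": "placeholder beta",
    "mn2:1.1": "placeholder alpha again",
}


def segments_for_text(store: Dict[str, str], text_uid: str) -> List[Segment]:
    """Sequence for the sutta view: every segment whose ID belongs to the text."""
    return [(sid, s) for sid, s in store.items() if sid.partition(":")[0] == text_uid]


def segments_matching(store: Dict[str, str], query: str) -> List[Segment]:
    """Sequence for the search view: every segment containing the query."""
    return [(sid, s) for sid, s in store.items() if query in s]


def render(sequence: List[Segment]) -> str:
    """One renderer, whichever way the sequence was selected."""
    return "\n".join(f"{sid}\t{s}" for sid, s in sequence)


print(render(segments_for_text(STORE, "mn1")))
print(render(segments_matching(STORE, "alpha")))
```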

karl_lew · December 9, 2019, 4:13pm

Yay streams of segments!

sujato · December 9, 2019, 5:28pm

What he said, but in a more dignified and monklike manner.