SuttaCentral sins: we made a mistake in designing URLs

sujato · December 12, 2019, 4:37pm

(I wasn’t proposing that, I was merely avoiding using a colon. But if it makes you happy!)

My brain hurts, I will digest this for a bit.

karl_lew · December 12, 2019, 5:46pm

Mine does as well.

The scope of change proposed here is rather massive. I wonder what small steps we might take to evaluate the utility of the change. Perhaps we should step back and ask what problem we are trying to solve.

eroux · December 13, 2019, 8:26am

Hi,

I think semantic versioning of namespaces is a very very bad idea, please don’t do that! I still stand by the idea that non-semantic IDs solve all these problems but I’ve made my point already, I’ll shut up.

Best,
Elie

sujato · December 13, 2019, 9:00am

The problem I am trying to solve is that there is no elegant way of expressing a link to a particular segment or segment range in current URLs, as the sutta ID is at the start of the URL.

But I definitely think we should restrict any changes to the minimum necessary to solve user-facing problems.

Umm, that is surely a big assumption. It literally scrolls to the things called id in HTML. I see no reason why it should not continue to be used as we have always used it, as an in-page ID. The bit of the URL before the hash is the sutta ID, the bit after is the segment or segment range ID.

Again, why though? This seems like a convoluted logic. We currently go “show MN 1 and go to segment 171”. This uses the fundamental page id logic built in to HTML.

Of course, and we appreciate that. One of the most valuable things about your input is the recognition that we should be careful to construct IDs so as to be RDF-friendly in the future.

I am still looking for an answer to this:

eroux · December 13, 2019, 9:39am

I’m not really sure what you mean, but here is some info that might be useful:

- RDF per se doesn’t really care what’s in a URI
- RDF serializations (XML, TTL, JSON-LD, etc.) will work with any kind of URI but will look ugly because you won’t be able to use prefixes (think of them as XML namespaces)

But generally speaking, if you think of your URLs as user-friendly things to put in a browser address bar, I don’t think RDF compatibility is relevant, as it lives in a different conceptual framework.

sujato · December 13, 2019, 10:22am

Okay, thanks! I had some chat with Blake about this, as well, and he explained to me that the colon is commonly used to indicate namespacing. Anyway, it seems this is not an urgent issue.

I just had a meeting with Blake and discussed these issues.

He pointed out one practical consideration, which applies to the so-called “debaked” suttas. In these cases the file ID is a range (eg an1.1-10) but the segment IDs are to individual suttas (eg an1.1:123).

In such cases, it is imperative that the entire segment ID be included in a URL. This changes the way I was thinking about this, as I had assumed we would follow the current practice of only including the latter part of the ID in the HTML segment.

In other words, we can no longer think of having a URL like this:

/en/sujato/mn1#123

Instead it must be

/en/sujato/mn1#mn1:123

But in that case it is not problematic, and in fact may even be preferable, to retain the current form of URL:

/mn1/en/sujato#mn1:123

This is a little more verbose, but it gains a lot in terms of flexibility as it means that any segment is always represented by the truly global UID. It also means that we need only make minimal changes to the current setup.
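
As a rough sketch of how a client might take such a URL apart (Python; the function name `parse_sutta_url` and the returned keys are made up for illustration and are not part of SuttaCentral's code):

```python
from urllib.parse import urlsplit


def parse_sutta_url(url):
    """Split a URL such as /mn1/en/sujato#mn1:123 into its parts.

    Assumes the path always has exactly three components:
    file ID, language, translator.
    """
    parts = urlsplit(url)
    # The path carries the file ID (a sutta or a range), language, and translator.
    file_id, lang, translator = parts.path.strip("/").split("/")
    # The fragment, if present, is the full global segment ID, e.g. "mn1:123".
    segment_id = parts.fragment or None
    return {
        "file_id": file_id,
        "lang": lang,
        "translator": translator,
        "segment_id": segment_id,
    }


print(parse_sutta_url("/an1.1-10/en/sujato#an1.1:123"))
```

Because the fragment holds the whole segment ID, the range-file case works without special handling: the file ID is `an1.1-10` while the segment ID still names `an1.1`.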

In the future we may end up making use of this to provide different “views” of texts.

dhp/en/sujato#dhp8

= View the whole Dhammapada file and go to verse 8

dhp-vagga1/en/sujato#dhp8

= View the first chapter of the Dhammapada and go to verse 8.

We also might consider using this in the presentation of long suttas. Long ago we made the decision to keep it simple by presenting one sutta per page, but in things like the Mahaparinibbana Sutta this is not ideal.

(Even more future facing, there is a medium-term proposal to incorporate a searchable virtual “infinite list” into the browser API, which would enable a much more flexible approach to presenting suttas.)

In addition, we made another couple of decisions:

We both prefer to retain semantic URLs, given that:
- the rate of change is slow, and content is limited and well-defined;
- we can use redirects;
- we don’t want to take on the additional work of a radically different system;
- and we like having semantic URLs so when shared you can see what it is; this is useful and aids security.
While we are interested in Karl’s + notation, we think this should be in the “for future consideration” basket, focussing on the current use case.
We don’t think versioning URLs is necessary, especially given that the proposed changes will be minimal.

Anyway, so I think the path forward at this point is decided, thanks to all for your contributions, I have learned a lot. The good news is that we won’t have to have redirects for sutta IDs!

karl_lew · December 13, 2019, 3:42pm

spec updated

What about:

/an1.2/en/sujato#an1.2:1.1

I’ve never liked requiring the use of /an1.1-10/en/sujato, which constrains us to knowing the containerization of the sutta.
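
One way such a lookup could work, purely as a sketch: the server keeps a list of range-file IDs and maps an individual sutta ID onto the file that contains it. The `RANGE_FILES` list and the `containing_file` helper below are hypothetical names for illustration, not anything in the current codebase.

```python
import re

# Hypothetical list of range-file IDs; the real set would come from the data.
RANGE_FILES = ["an1.1-10", "an1.11-20"]


def containing_file(sutta_id, range_files=RANGE_FILES):
    """Return the range file that contains sutta_id, or sutta_id itself."""
    # Split e.g. "an1.2" into the prefix "an1." and the number 2.
    m = re.fullmatch(r"(.+\.)(\d+)", sutta_id)
    if not m:
        return sutta_id  # e.g. "mn1" has no dotted number to look up
    prefix, number = m.group(1), int(m.group(2))
    for file_id in range_files:
        # Match files with the same prefix followed by "start-end".
        r = re.fullmatch(re.escape(prefix) + r"(\d+)-(\d+)", file_id)
        if r and int(r.group(1)) <= number <= int(r.group(2)):
            return file_id
    return sutta_id  # not part of any known range


print(containing_file("an1.2"))
```

With something like this in place, a request for `/an1.2/en/sujato` could be answered from the `an1.1-10` file without the user needing to know how the suttas are containerized.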

Aminah · December 13, 2019, 4:24pm

I haven’t read the spec yet (by the way, thanks for creating it) nor followed the meat of the conversation, but if this is all roughly settled, it is probably a suitable point to clarify whether it has any bearing on the legacy texts.

Will an1.1-10 still persist alongside whatever newfangled wotnot? Some legacy texts can be broken in this way, I don’t think others can (see the Great HTML reform thread). Those that can have been wrapped in articles and given individual IDs, will they perhaps need to be amended?

karl_lew · December 13, 2019, 4:26pm

Bhante is currently pondering this exact question in chat.
Voice actually allows an1.2 in the search box, so things get interesting here…

Aminah · December 13, 2019, 4:26pm

LOL

Khemarato.bhikkhu · December 13, 2019, 10:23pm

What do you think of using DOIs for this? Seems to be the standard solution in the academic publishing world.

karl_lew · December 14, 2019, 2:31am

The DOI stuff looks good. It appears to be a universal registry that aggregates individual identification systems into a shared namespace.

It’s really hard to beat an ID such as MN1 or AN8.63 for mnemonic brevity. Interestingly, these primary IDs are actually non-semantic in general use. Admittedly, the initial letters are meaningful for those who have studied the sutta collections deeply, but to many of us they are about as meaningful as area codes for phone numbers.

In contrast, the secondary qualifiers such as en or sujato are in fact semantic, as are version numbers. However, these secondary qualifiers serve to identify individual facets of what are truly multilingual resources. They aren’t primary identifiers. Indeed, as SuttaCentral segmented content grows it will primarily grow via translation as individual translators take on the task of extending the segmented EBTs into their own language. For example, here is a trilingual excerpt of SN42.11:

scid: sn42.11:5.6
 pli: Chando hi mūlaṃ dukkhassā’”ti. 
  en: For desire is the root of suffering.’” 
  de: Denn Sehnen ist die Wurzel des Leidens.‘“

From this perspective, the language/translator qualifiers manifest as facets of the single multilingual resource that is SN42.11:5.6. In other words, SuttaCentral IDs identify unique multilingual segments. Language and translator are selectors. Language and translator are not primary identifiers of resources.
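
As a minimal sketch of that model (Python; the `SEGMENTS` dictionary and `select` function are illustrative names only, and a fuller version would presumably also select by translator):

```python
# One entry per segment ID; each language is a facet of the same resource.
SEGMENTS = {
    "sn42.11:5.6": {
        "pli": "Chando hi mūlaṃ dukkhassā’”ti.",
        "en": "For desire is the root of suffering.’”",
        "de": "Denn Sehnen ist die Wurzel des Leidens.‘“",
    }
}


def select(scid, lang, segments=SEGMENTS):
    """Look up a segment by its ID, then pick one language facet of it."""
    return segments[scid][lang]


print(select("sn42.11:5.6", "en"))
```

The segment ID is the only key into the data; the language is applied afterwards as a selector.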

eroux · December 14, 2019, 7:34am

DOIs are good, yes. It’s another example of identifiers that are non-semantic in order to be persistent. It seems to me like a sledgehammer solution, but why not… How would you register them? (As far as I can tell it’s a centralized authority.) You could upload your texts to Zenodo (which provides versioned DOIs), but that seems like a lot of work… custom persistent identifiers would be much simpler.

I’m not sure why you would say that MN1 is non-semantic? It seems to me that there’s an expectation that it precedes MN2, and that it is part of MN…