SuttaCentral sins: we made a mistake in designing URLs

eroux · December 11, 2019, 9:26am

Sure, there are different goals, but if you want your resources to be linked to, it’s best to have persistent URLs, and semantic ones are prone to change as this thread exemplifies. The common strategy is to have both.

karl_lew · December 11, 2019, 1:41pm

Consider the persistent IDs for:

{ 
  document:"mn1", 
  segment:171.4, 
  language:"en",
  author:"sujato",
}

From this semantic specification, one might easily generate a unique persistent ID such as:

bW4xOjE3MS40L2VuL3N1amF0bw==

Indeed, the above is actually decodable using any BASE64 decoder as:

mn1:171.4/en/sujato
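
For anyone who wants to reproduce the round trip, here is a minimal Python sketch using only the standard library. The variable names are just for illustration.

```python
import base64

semantic_id = "mn1:171.4/en/sujato"

# b64encode works on bytes, so go through UTF-8 and back to text.
encoded = base64.b64encode(semantic_id.encode("utf-8")).decode("ascii")

# Decoding recovers the original reference: the encoded form hides nothing.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)                          # bW4xOjE3MS40L2VuL3N1amF0bw==
print(decoded)                          # mn1:171.4/en/sujato
print(len(semantic_id), len(encoded))   # 19 28
```

Note that the encoded form is longer than the semantic reference it encodes (28 characters against 19).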

What we can understand from this exercise is that there is a hidden requirement that persistent IDs should be compact enough for humans to transcribe, and the above is copy-able but not really friendly for speaking aloud. Another requirement for SuttaCentral Identifiers is that they should be easy to read and write by humans. Short, assigned alphanumeric identifiers such as SC123456.789 unfortunately don’t meet that requirement.

What does meet all the requirements is a human readable (and writable), URL-safe canonical form such as:

sc:mn1+171.4+en+sujato

Indeed, the above proposed canonical form has an RDF prefix and even supports abbreviated forms such as:

sc:mn1

Such an identifier is URL-safe, semantic, persistent and human readable/writable. All it takes is a + to combine document, segment, language and author/translator. Additionally, for unambiguous use within SuttaCentral applications, the RDF prefix sc: can be omitted to yield:

mn1+171.4+en+sujato
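
As a rough sketch of how such identifiers could be assembled, assuming the four components are always given in the order document, segment, language, author (the function name build_sc_id is hypothetical, not an existing SuttaCentral API):

```python
def build_sc_id(document, segment=None, language=None, author=None, prefix="sc:"):
    """Join the leading components with '+', stopping at the first missing one.

    Hypothetical helper: abbreviated forms such as sc:mn1 simply omit
    the trailing components.
    """
    parts = [document]
    for component in (segment, language, author):
        if component is None:
            break
        parts.append(component)
    return prefix + "+".join(parts)


print(build_sc_id("mn1", "171.4", "en", "sujato"))             # sc:mn1+171.4+en+sujato
print(build_sc_id("mn1"))                                      # sc:mn1
print(build_sc_id("mn1", "171.4", "en", "sujato", prefix=""))  # mn1+171.4+en+sujato
```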

eroux · December 11, 2019, 2:06pm

I’m afraid I’m not following: can you describe how you went from

bW4xOjE3MS40L2VuL3N1amF0bw== is base 64 for mn1:171.4/en/sujato

to

there is a hidden requirement that persistent IDs should be compact enough for humans to transcribe

using

[your understanding] from this exercise

?

BTW, bW4xOjE3MS40L2VuL3N1amF0bw== is a semantic identifier (just in a different form), so in my view of the world, it’s not suitable for a persistent ID. I’m not sure you disagree with that, I just wanted to clarify.

karl_lew · December 11, 2019, 2:23pm

My first thought for persistent IDs was a hash of a semantic object. Such IDs are commonly used in systems such as git, which generate large hashes with generally unique prefixes. Such persistent IDs are one-way in that they are easy to generate from a semantic representation but the opposite is not true. We cannot derive the semantics from a git identifier.

Then I started thinking about why git identifiers are so useful. And then I realized that they are useful because they can be mnemonically shortened. One often only needs 5-7 characters to uniquely refer to a single change in any git repository.

We could certainly emulate git identifiers. They are quite unique since they are guids. But what I realized upon further thought is that their human value is that they can be abbreviated. And that is quite important for transcription. Persistent IDs such as driver license IDs need to be voice transcribable. Git identifier abbreviations are indeed voice transcribable.

That realization came from tinkering with BASE64. I had expected it to be shorter than the semantic reference, but it turned out to be longer. That in fact was what clued me in to the importance of transcription. I’ll admit that insight was idiosyncratic but I think that the requirement of transcription is still relevant and critical.

It’s just awful, isn’t it?

eroux · December 11, 2019, 2:44pm

Well, if you like git IDs, you can take the first 7 characters of the md5 hash, so mn1:171.4/en/sujato would become 5e6999f.

My point, though, is that this should be persistent and should remain the same through time (and thus be explicitly coded in the database). The reason is that if at some point you want to add something to your semantics, it shouldn’t change the persistent ID.

For instance if at some point you want to have a new version of the translation and thus you have mn1:171.4/en/sujato/v1 and mn1:171.4/en/sujato/v2, then mn1:171.4/en/sujato/v1 should still have the same persistent ID (5e6999f), which is not derivable from the semantics anymore.
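
To make the "minted once, then stored" idea concrete, here is a minimal sketch. The PersistentIdRegistry class and its methods are hypothetical, and an in-memory dict stands in for the database. The ID is derived from an md5 prefix only the first time a resource is seen; after that it is looked up, never recomputed.

```python
import hashlib


class PersistentIdRegistry:
    """Hypothetical registry: IDs are minted once, then stored."""

    def __init__(self, length=7):
        self.length = length
        self._key_to_id = {}  # current semantic key -> persistent ID
        self._id_to_key = {}  # persistent ID -> current semantic key

    def mint(self, semantic_key):
        """Return the stored ID, minting one from an md5 prefix on first sight."""
        if semantic_key in self._key_to_id:
            return self._key_to_id[semantic_key]
        digest = hashlib.md5(semantic_key.encode("utf-8")).hexdigest()
        # Lengthen the prefix if a shorter one is already taken.
        n = self.length
        while digest[:n] in self._id_to_key and n < len(digest):
            n += 1
        pid = digest[:n]
        self._key_to_id[semantic_key] = pid
        self._id_to_key[pid] = semantic_key
        return pid

    def rename(self, old_key, new_key):
        """Change the semantic key of a resource; its persistent ID is kept."""
        pid = self._key_to_id.pop(old_key)
        self._key_to_id[new_key] = pid
        self._id_to_key[pid] = new_key
        return pid


registry = PersistentIdRegistry()
pid = registry.mint("mn1:171.4/en/sujato")
# Later the semantics grow a version component; the stored ID does not change.
registry.rename("mn1:171.4/en/sujato", "mn1:171.4/en/sujato/v1")
v2_pid = registry.mint("mn1:171.4/en/sujato/v2")
print(registry.mint("mn1:171.4/en/sujato/v1") == pid)  # True
print(v2_pid != pid)                                   # True
```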

karl_lew · December 11, 2019, 3:21pm

Excellent point. Semantic bases vary and we would not want persistent ids generated from an ever-changing semantic basis since they would be brittle. A case in point would be splitting area codes. Oh what a headache!

Another property of high value is that a persistent ID should be globally generatable. GUIDs have this property that allows any generator to create GUIDs that are statistically unique globally. If we rely instead on a single generator, a “magic number generator”, then we have another drawback. The single ID generator becomes a bottleneck. In contrast, git ids can be generated universally and in parallel. Git ids are wonderful. But I think we can do better and still be robust in the face of changes in semantics. Let’s take a look at semantics.

Currently, four semantic components are in play: document, segment, language and author/translator. Those semantic components are quite locked down at this point and not prone to change. We have:

sc:mn1 identifies the SuttaCentral resource bundle (aka. “sutta card”) associated with Majjhima Nikaya 1, The Root of All Things.
sc:mn1+171.4 identifies the SuttaCentral multilingual segment associated with ‘Nandī dukkhassa mūlan’ti—
sc:mn1+171.4+en identifies the SuttaCentral English segment bundle associated with ‘Nandī dukkhassa mūlan’ti—
sc:mn1+171.4+en+sujato identifies the SuttaCentral English text most recently translated as “Because he has understood that relishing is the root of suffering,”

These are all different resources that use a consistent persistent ID system that is, as you point out, semantic. They are not the same resource. They are not even necessarily hierarchical. They are however, quite precisely defined and invariant in usage.

Let’s see what happens with the introduction of additional semantic components. Your example of versions is excellent.

sc:mn1 no change in definition
sc:mn1+171.4 no change in definition
sc:mn1+171.4+en no change in definition
sc:mn1+171.4+en+sujato no change in definition

What’s interesting is that adding semantic bases doesn’t break our prior definitions. They remain consistent, applicable and useful. What we have is a four-axis coordinate system that is unambiguously and rigorously defined. Thinking this way, we simply add another axis. Just as we go from two dimensions to three, we can add another dimension for versioning. Essentially, our proposed persistent ids are composite, not hierarchical. They are composite in the (latitude,longitude) sense. And we can formally add another dimension as we would for (latitude, longitude, altitude). We can add a version dimension with its own rigorous and formal semantics (e.g., semantic versioning)

sc:mn1+171.4+en+sujato+v1.2.3
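
A small sketch of the "add another axis" idea, assuming the axes are positional and a fifth version axis is appended (AXES and parse_sc_id are hypothetical names). Identifiers written before the version axis existed parse exactly as before; the new axis is simply unset.

```python
AXES = ("document", "segment", "language", "author", "version")


def parse_sc_id(identifier):
    """Split a '+'-joined identifier into named axes; missing axes are None."""
    body = identifier[3:] if identifier.startswith("sc:") else identifier
    values = body.split("+")
    if len(values) > len(AXES) or not all(values):
        raise ValueError(f"not a valid identifier: {identifier!r}")
    parsed = dict.fromkeys(AXES)
    parsed.update(zip(AXES, values))
    return parsed


old = parse_sc_id("sc:mn1+171.4+en+sujato")
new = parse_sc_id("sc:mn1+171.4+en+sujato+v1.2.3")
print(old["version"])  # None
print(new["version"])  # v1.2.3
# The four original axes are unchanged by the addition of the fifth.
print(all(old[axis] == new[axis] for axis in AXES[:4]))  # True
```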

eroux · December 11, 2019, 3:47pm

This is quite reasonable, yes. I think I’m thinking more in terms of very big collections where concepts combine in complex ways and typos in names are unavoidable… When you have a very scoped project and you’re 100% sure that your semantics are correct (i.e., you didn’t misspell sujato, you won’t change en to en_US vs. en_UK in the future, etc.), then why not.

When databases start to grow, you can’t make these kinds of assumptions anymore though, as people’s names change, have typos, etc. In your case it’s very theoretical, but what if the person behind the identifier sujato wants to be referred to as something else and finds URLs containing sujato offensive? What if instead of needing to combine 4 dimensions you need 150? etc.

But if you’re 100% confident in the scope of your project and the persistence of the semantics, the IDs you describe are reasonable.

karl_lew · December 11, 2019, 3:51pm

Bhante @sujato, we hereby require that you be happy in this very life with sujato. Future rebirths would permit sujato2, sujato3, etc.

Aminah · December 11, 2019, 4:06pm

Just to throw in a note…

First off, a very warm hello @eroux, lovely to hear from you again! Not long ago I had you in mind as I was trying to get some kind of handle on linked data and, more specifically how SC might well relate to it.

In all honesty, it’s still rather above my head, but at the same time, I have quite a strong instinct that there’s great relevance for SC in this area; both in terms of what it can give to the world of data/knowledge and how it might benefit and become more relevant/fortify its reach from having a presence in the broader net of linked knowledge (there are some very obvious datasets—the British Library and many other of the world’s libraries, DBpedia etc and, of course, BDRC, just thinking off the top of my head—that can be mutually enriched).

Following further (but still wobbly-legged) research on json-ld after making tentative comments in this thread earlier in the year, again by instinct I very much have the impression that json-ld is likely to be the right fit for SC (and certainly, whenever it can venture the matter of RDF, should be preferred over Turtle). Alas, I had to set poking into the issue aside for the time being so can’t yet present the solid case I’d want to, were I to take up advocacy.

Though reluctant to put my mind on gritty detail that gives me jam-brain, I thought to add comment here to pick up on this point:

Yes, to the limited extent that I do understand things, this is also what I believe to be the case.

I think the point of speaking to quite different purposes in some ways could perhaps do well to be drawn out more clearly. The best way I can think to do that is point to a nicely written article on linked data and in particular what it has to say on URIs:

It might yet further help to highlight this:

When we wanted to create URIs for the entities described by the ‘Tobias’ project, we chose a URL-like structure, and chose to use our institutional webspace, setting aside data.history.ac.uk/tobias-project/ as a place dedicated to hosting these URIs. By putting it at data.history.ac.uk rather than history.ac.uk , there was a clear separation between URIs and the pages of the website. For example, one of the URIs from the Tobias project was http://data.history.ac.uk/tobias-project/person/15601. While the format of the abovementioned URIs is the same as a URL, they do not link to web pages

karl_lew · December 11, 2019, 4:15pm

That is a great point. The SuttaCentral Identifiers and References precisely identify specific resources that can be viewed as web pages but which themselves are not web pages. For example a query for sc:mn1 could be satisfied with the suttaplex JSON for a sutta card. Or it could be used to identify a generated PDF for download.

For JSON-LD, we might have:

{
  "@context": {
    "sc": "https://github.com/suttacentral/suttacentral/wiki/SuttaCentral-Identifiers-and-References",
    "sc:homepage": {"@type": "@id"}
  },
...
}

The above declares the context for the sc prefix in a JSON-LD document. Here the context is defined loosely by the link to github. We should probably have a formal definition as a page on SuttaCentral instead.

eroux · December 11, 2019, 4:30pm

I didn’t know this book, it really looks like a very good introduction, thanks for pointing it to me!

I think going LOD is a very exciting adventure and I’m really happy to provide some support for that! SuttaCentral indeed has a role to play, and could connect to BDRC. For instance we’re starting to have a very big quantity of images from the Fragile Palm Leaves collection (example), I haven’t looked closely but I’m pretty sure many are witnesses of texts that are on SC. If you have persistent identifiers and if we agree on what concepts we manipulate (which most of the time requires a surprising amount of effort), we could use LOD to point to your pages from our corresponding records, and you could use our images on your site directly using IIIF (all our images are already accessible in this form, as well as the Taisho Canon).

As an example of a commonly difficult conceptual reconciliation (at least in my experience with partnerships), the bibliographical entities that are manipulated in different databases are often conceptually different… The model we’re following (or copying) is the most widespread model for bibliographical resources, BIBFRAME 2.0.

karl_lew · December 11, 2019, 5:13pm

For the links to SC, perhaps the document/segment form would suffice? I’m guessing that the image correspondence might be partial so we might need to use segment ranges as in:

sc:mn1+148-170.25--171.6

mn1:148-170.25: Why is that?
mn1:148-170.26: Because the Realized One has completely understood it to the end, I say.
mn1:148-170.27:
mn1:171.1: The Realized One, the perfected one, the fully awakened Buddha directly knows earth as earth.
mn1:171.2: But he doesn’t identify with earth, he doesn’t identify regarding earth, he doesn’t identify as earth, he doesn’t identify that ‘earth is mine’, he doesn’t take pleasure in earth.
mn1:171.3: Why is that?
mn1:171.4: Because he has understood that relishing is the root of suffering,
mn1:171.5: and that rebirth comes from continued existence; whoever has come to be gets old and dies.
mn1:171.6: That’s why the Realized One—with the ending, fading away, cessation, giving up, and letting go of all cravings—has awakened to the supreme perfect Awakening, I say.

The above uses the double hyphen -- as a reference to all segments in the range:

[ sc:mn1+148-170.25, … , sc:mn1+171.6 ]
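
Expanding such a range needs the document’s segment order, since the IDs are not plain numbers. Here is a minimal sketch, assuming the ordered segment list is available from somewhere (MN1_SEGMENTS below is just the excerpt listed above, and expand_range is a hypothetical name):

```python
# Segment order for the excerpt of MN 1 listed above.
MN1_SEGMENTS = [
    "148-170.25", "148-170.26", "148-170.27",
    "171.1", "171.2", "171.3", "171.4", "171.5", "171.6",
]


def expand_range(reference, ordered_segments):
    """Expand 'sc:doc+start--end' into the list of single-segment references."""
    body = reference[3:] if reference.startswith("sc:") else reference
    document, _, segment_part = body.partition("+")
    # A single hyphen can occur inside a segment ID; only '--' marks a range.
    start, separator, end = segment_part.partition("--")
    if not separator:
        end = start  # a plain segment reference is a range of one
    first = ordered_segments.index(start)
    last = ordered_segments.index(end)
    if last < first:
        raise ValueError(f"range runs backwards: {reference!r}")
    return [f"sc:{document}+{seg}" for seg in ordered_segments[first:last + 1]]


segments = expand_range("sc:mn1+148-170.25--171.6", MN1_SEGMENTS)
print(len(segments))  # 9
print(segments[0])    # sc:mn1+148-170.25
print(segments[-1])   # sc:mn1+171.6
```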

For the links going back, SC would use the IIIF URL of choice (e.g., thumbnails would be a different IIIF URL than a full image).

On the image database side, you would have the choice of storing just the ID with the “sc:” prefix or a full SC URL without the prefix:

sc:mn1+171.4
https://suttacentral.net/mn1+171.4

NOTE: the latter is not yet implemented since we’re designing the protocol in this thread.
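
Converting between the two stored forms is a simple prefix swap. A sketch, assuming the sc: prefix stands for the base URL in the example above (SC_BASE, to_url and to_prefixed are hypothetical names, and the URL form itself is still only a proposal):

```python
SC_BASE = "https://suttacentral.net/"  # assumed expansion of the sc: prefix
SC_PREFIX = "sc:"


def to_url(identifier):
    """sc:mn1+171.4 -> https://suttacentral.net/mn1+171.4"""
    if not identifier.startswith(SC_PREFIX):
        raise ValueError(f"missing sc: prefix: {identifier!r}")
    return SC_BASE + identifier[len(SC_PREFIX):]


def to_prefixed(url):
    """https://suttacentral.net/mn1+171.4 -> sc:mn1+171.4"""
    if not url.startswith(SC_BASE):
        raise ValueError(f"not a SuttaCentral URL: {url!r}")
    return SC_PREFIX + url[len(SC_BASE):]


print(to_url("sc:mn1+171.4"))                            # https://suttacentral.net/mn1+171.4
print(to_prefixed("https://suttacentral.net/mn1+171.4")) # sc:mn1+171.4
```

Either form can be stored, since each can be recovered from the other without loss.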

eroux · December 11, 2019, 5:38pm

Sure, I think it is way too soon to discuss that kind of detail, but the usual partnership workflow in LOD is:

agree on a conceptual framework
agree on a vocabulary
reconcile entities semi-automatically (OpenRefine is a tool made for that kind of task for instance)
share the data on both sides

The platform then automatically does the linking, the images, etc. There’s a fairly large amount of clerical work on our end first, to catalog our records better… If some people who read Burmese, and Pali written in Burmese, are interested in helping, I can provide some tools.

sujato · December 11, 2019, 5:54pm

This is a bit of a digression, but from a philosophical point of view I would be cautious about regarding permanent URLs as an unalloyed good.

Consider SC itself. We used to link to external sites for links, and rather predictably suffered from linkrot, losing thousands of URLs. That was a hassle. But it forced us to reevaluate the project and begin hosting texts ourselves, with the outcome that we are all here enjoying a pleasant chat together.

“Linkrot” is an interesting term, actually. Rot isn’t a bad thing: it’s part of nature. From decay springs new life. And that is exactly what happened with us.

So I’m not saying we shouldn’t try to maintain stable URLs. I’m just saying, the web works in mysterious ways.

karl_lew · December 11, 2019, 6:16pm

Wot thou we let links rot and links killeth we not?

But seriously, I think of SC as a reference site such as Wikipedia.

Let others’ links rot, here we shall kill links not. (With apologies to MN8)

eroux · December 11, 2019, 6:31pm

I agree yes, two remarks:

that’s why I’m cautiously using persistent instead of permanent: in my mind the distinction is that for persistent URI, there is an intentional long term effort to keep them working as long as possible, and not depend on change of technology, etc.
for pre-modern stuff, there are philosophical and practical problems with ontologies that refer to “real” things, while what we really have most of the time is contradictory traces (see this I suppose very French article)

That doesn’t make these concepts not worth using though as they function and bring benefits.

Coemgenu · December 11, 2019, 6:56pm

When I use this site, I just put “suttacentral.net/” and then I write in my sutta number. Will this still be doable with a system that lists author first, bhante?

sujato · December 12, 2019, 8:36am

Sure, suttacentral.net/mn136 would not change.

sujato · December 12, 2019, 9:05am

Thank you so much to all who have contributed to this thread, it is very constructive and interesting. There are a number of things that I still have questions about, if you don’t mind!

Remember the whole “stream of segments” thing? Conceptually, the notion of a “sutta” (which for us is the same as a “document”) is rather poorly defined. As just one example, the Pali Dhammapada currently exists on SC as both a single file and as a set of chapters. Which one is the real “sutta”? Another case comes up when comparing different editions of the canon, which in several instances define suttas differently. (BTW, bear in mind that in addition to the concept of “author” for translations, we also have “edition” for root texts.)

If we have everything as a “segment”, we can treat the notion of a “sutta” as a set of segments (which may be defined in different ways). Thus the fundamental ID is a segment ID, and a sutta ID is an abstraction for a set of segment IDs.

Thus I am not entirely happy with treating document ID as separate from segment ID, and treating the binding between them as equivalent to the binding with language and author. In other words it seems to me that instead of:

sc:mn1+171.4+en+sujato

It should be something like

sc:mn1.171.4+en+sujato

(Or however we decide to structure the segment IDs)

Further, I am unclear how the proposed URL works in HTML.

https://suttacentral.net/sujato/mn1:171.4

Surely we must use a # to identify document fragments?

https://suttacentral.net/sujato/mn1#171.4

This is the basic reason why the sutta ID must come at the end of the URL.

To return to the form of the basic segment ID, can someone please clarify the status of the colon in relation to RDF for me? Is this use a standard, or are there alternatives?

For the basic segment ID forms, the colon works well, as it marks the boundary in the segment between the document (sutta) and the segment. Notwithstanding the qualifications about the notion of “sutta” expressed above, in general it is still a useful concept.

However it is not really necessary. Currently we use # for this purpose in our data, and it does the same job, just not as pretty. And of course it does not have to be transformed for URLs. Perhaps after all we should retain it?

I would like, so far as is possible, to retain the same form everywhere, as transforming IDs is always buggy and laborious.

karl_lew · December 12, 2019, 3:58pm

Hmm. So we have author editions as well as root editions. That’s a double system of versioning.

Perhaps we can address that with namespaces. A namespace is a self-consistent collection of names. That is what the sc: is. It is a namespace. This convention is used in XML, RDF or JSON-LD.

The interesting thing about namespaces is that in local use, the namespace is conventionally omitted. Up until now we’ve been discussing sc: as the “SuttaCentral Identifier and Reference Namespace”. If we wish, we can have versioned namespaces to handle editions and versions:

sc1.2.0: Namespace for translation version 2 based on root source version 1
sc1.2.1: Namespace for translation version 2.1 (minor typographic edits) based on root source version 1
sc1.3.1: Namespace for translation version 3.1 (major conceptual edits) based on root source 1
sc2.1.1: Namespace for translation version 1.1 based on root source 2
sc: Namespace for “latest and greatest at SuttaCentral”. Ideal for third-party links to SC.

In other words, by applying semantic versioning to namespaces, we can address both root and translation versioning. In common use, namespaces would be omitted. Even third party links would probably only use HTTPS urls or the sc: namespace. Semantically versioned namespaces would probably only be needed for academic referential precision and integrity.
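
Reading the list above as sc&lt;root&gt;.&lt;translation major&gt;.&lt;translation minor&gt;:, a versioned namespace could be picked apart as in the following sketch. The pattern and the parse_namespace name are hypothetical, and this reading of the three numbers is only my interpretation of the examples.

```python
import re

# 'sc:' alone, or 'sc' followed by three dot-separated numbers and a colon.
NAMESPACE = re.compile(r"^sc(?:(\d+)\.(\d+)\.(\d+))?:$")


def parse_namespace(namespace):
    """Return root and translation versions, or None for the bare 'sc:'."""
    match = NAMESPACE.match(namespace)
    if match is None:
        raise ValueError(f"not an sc namespace: {namespace!r}")
    if match.group(1) is None:
        return None  # bare sc: means "latest and greatest"
    root, major, minor = (int(number) for number in match.groups())
    return {"root": root, "translation": (major, minor)}


print(parse_namespace("sc1.2.1:"))  # {'root': 1, 'translation': (2, 1)}
print(parse_namespace("sc2.1.1:"))  # {'root': 2, 'translation': (1, 1)}
print(parse_namespace("sc:"))       # None
```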

Yes. That would be a great relief. The inconsistency of : vs . is no longer required. Thank you.

Actually, it may not be necessary. Conceptually, we want mn1.171.4+en+sujato

Contextually we might want that segment displayed in different ways:

https://suttacentral.net/sutta/mn1.171.4+en+sujato
https://suttacentral.net/segments/mn1.171.4+en+sujato
https://suttacentral.net/whatever-we-think-of/mn1.171.4+en+sujato

For case 1, the #mn1.171.4 would be appended automatically by the web page itself to scroll down to segment 171.4 when loading the web page. The # is a presentation command, not an identifier. It’s a local link.

For case 2, we propose a new API that would simply show designated segments alone. Note that the designated segment would have a link back to the full sutta of case 1. Such a web page would only show the segments referenced, in this case, it is just one:

MN1:171.4: Because he has understood that relishing is the root of suffering,

For case 3 and beyond, we would permit future representations that would display other context (e.g., parallels) for a given segment. I.e., we would not be bound to the display of a single sutta. We would display a segment in whatever context made sense.

To summarize, the # is an annotation like bold or italic. It is used for emphasis and focus, not identification. Indeed, consider the following example, which is “show the sutta containing mn1.171.4 and display all of segment 171 on the page”.

https://suttacentral.net/sutta/mn1.171.4+en+sujato#mn1.171
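
To see the split between identifier and presentation, here is a short sketch with Python’s standard urllib.parse. The identifier travels in the path, while the part after # is the fragment, which a browser keeps to itself and does not send to the server.

```python
from urllib.parse import urlsplit

url = "https://suttacentral.net/sutta/mn1.171.4+en+sujato#mn1.171"
parts = urlsplit(url)

# First path element is the chosen view, the rest is the identifier.
view, _, identifier = parts.path.lstrip("/").partition("/")

print(view)            # sutta
print(identifier)      # mn1.171.4+en+sujato
print(parts.fragment)  # mn1.171
```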