SuttaCentral sins: we made a mistake in designing URLs

sujato · December 8, 2019, 6:58am

@blake @karl_lew @HongDa @chansik_park @Aminah

When building the current PWA site, we introduced the new concept of “author/edition” into the URL, thus enabling multiple translations for the same sutta (or editions of the root text).

This was a major change that unfortunately broke a bunch of third party work and applications that assumed we had a stable URL structure. At the time, we just had to bite the bullet and bear the cost.

But, and I’m really embarrassed and frustrated by this, we dropped the ball on designing the new URLs. Or rather, I dropped the ball: getting good URLs has been a passion of mine since the beginning, and I just let this one through.

Currently we have:

https://suttacentral.net/mn136/en/sujato

That is: sutta ID/language/author

The problem is, we need to be able to add section numbers to the URL, and that becomes bad:

https://suttacentral.net/mn136/en/sujato#7

O-yo! The section number is split apart from the sutta ID, and tagged awkwardly at the end. That is ugly and unintuitive, and worse, it makes it really hard to write a link for a section or segment.

Consider too, the case with the segmented texts. If I have a consistent segment, I can easily swap out the language and/or author and get the same segment in a different edition. But if the ID is broken, this becomes ugly and hard.

What to do?

The URLs should have a form like this:

https://suttacentral.net/en/sujato/mn136#7

So:

Should we make this change?
Is it possible to write a redirect so that URLs of the old form will not break?
Are there other options?
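
On the redirect question, here is a minimal Python sketch of a purely positional rewrite from the old form to the proposed one. It is an illustration only, not SuttaCentral's actual code: the function name `old_to_new` is made up, and it assumes the input really is an old-form URL (it does not try to tell a language from an author). Note that a server never sees the `#7` fragment; browsers normally carry it over to the redirect target themselves, so keeping it here mainly matters for client-side rewriting.

```python
from urllib.parse import urlsplit, urlunsplit


def old_to_new(url: str) -> str:
    """Rewrite .../<uid>/<lang>/<author> to .../<lang>/<author>/<uid>.

    Assumption: the caller only passes old-form URLs. Anything that does
    not have exactly three path components (for example the bare /mn136
    suttaplex link) is returned unchanged.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 3:
        return url
    uid, lang, author = segments
    new_path = "/" + "/".join((lang, author, uid))
    # Query string and fragment are passed through untouched.
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


print(old_to_new("https://suttacentral.net/mn136/en/sujato#7"))
```

With this, the section number ends up directly after the sutta ID, which is the point of the proposed form.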

sabbamitta · December 8, 2019, 7:53am

Bhante, I find your thoughts on the new URLs most logical, and this would also represent the structure of the bilara-data repository. And especially given the wish—and necessity—to be able to specify segments this change would in principle be very desirable. It has of course to be well considered because it would have just so many implications!

There may also be some positive points to the old URLs; it’s not just rubbish and badly thought out. One thing that comes to mind is that you can just type https://suttacentral.net/mn136, and you get to the suttaplex card of the respective sutta. You don’t need to know who has translated it, nor in which languages it is available, but you can go to the overview card and find all the available options.

That wouldn’t perhaps be possible with the new URLs, would it?

sujato · December 8, 2019, 8:29am

I think that should be still possible. The bare sutta UID points to the suttaplex card. It’s only the next level “down” that would change.

blake · December 8, 2019, 1:28pm

This is certainly technically possible, because each component in a URL can be unambiguously identified as a language, author or sutta UID. In fact the author and language uids are so few in number that this logic could even be performed on the client side.
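
A rough Python sketch of that idea: if the sets of language and author uids are small and known, each path component can be classified on its own, so the order of components stops mattering. The sets below are illustrative placeholders, not the real lists, and `classify` / `parse_path` are hypothetical names.

```python
# Placeholder sets for illustration only; the real uid lists would be
# loaded from wherever the site keeps them.
LANGUAGES = {"en", "de"}
AUTHORS = {"sujato", "bodhi"}


def classify(segment: str) -> str:
    """Label one path component as language, author, or sutta uid."""
    if segment in LANGUAGES:
        return "language"
    if segment in AUTHORS:
        return "author"
    # Assumption: anything that is not a known language or author is a uid.
    return "uid"


def parse_path(path: str) -> dict:
    """Parse a path into its labelled components, whatever their order."""
    result = {}
    for segment in path.split("/"):
        if not segment:
            continue
        kind = classify(segment)
        if kind in result:
            raise ValueError(f"two {kind} components in {path!r}")
        result[kind] = segment
    return result


# Old and new forms parse to the same thing.
print(parse_path("/mn136/en/sujato") == parse_path("/en/sujato/mn136"))
```

This only works as long as no sutta uid collides with a language or author uid, which would need checking against the real data.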

Maybe? Can you remember why we decided against this in the past? Because I know it’s come up at least twice before.

Aminah · December 8, 2019, 2:31pm

It was discussed in the meeting on 14/6/18. The points considered were that:

it might require work on the frontend, which wasn’t a high enough priority at the time.
the current scheme has benefits in its logical structure and the only fail is in the elegance of section references [I’ve no idea what the benefits are, I just made the note of the point raised]
a redirect is easy to do, and ultimately that,
it should be revisited as part of the Polymer 3 upgrade

So in short, it fell off the table.

karl_lew · December 8, 2019, 3:11pm

One of the reasons I never use Google APIs is because they can never make up their minds about what is best. They’re always getting “better” and breaking everything. Other companies support legacy APIs and URLs while introducing new ones. Those are the companies I trust and rely on.

We can easily introduce new URLs by adding new paths while still supporting old URLs. E.g. the following scheme is an example of how to provide endless API possibilities with little extra cost overall:

https://suttacentral.net/v2/whatever-we-wish-version2
https://suttacentral.net/v3/whatever-we-wish-version3
https://suttacentral.net/v4/whatever-we-wish-version4
…etc…
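
To make that concrete, a small Python sketch of how a version prefix could be peeled off before routing, with unprefixed (legacy) paths treated as version 1. This is only an assumption about how such a scheme might be wired up; `split_version` is a hypothetical name, and it assumes no sutta UID has the form `v` followed by digits.

```python
import re

# Matches "/v2", "/v2/", "/v2/anything"; does not match "/vv1/..." or "/v2x".
VERSION_RE = re.compile(r"^/v(\d+)(/.*)?$")


def split_version(path: str) -> tuple[int, str]:
    """Return (version, remaining path); legacy paths count as version 1."""
    match = VERSION_RE.match(path)
    if match is None:
        return 1, path
    return int(match.group(1)), match.group(2) or "/"


print(split_version("/mn136/en/sujato"))
print(split_version("/v2/en/sujato/mn136"))
```

Each version number can then be handed to its own routing table, and the old one is never touched.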

SuttaCentral has grown to be very much connected with the world. Let’s not break that connection as we add new stuff.

sujato · December 8, 2019, 5:48pm

Okay, so I think the consensus is that we can shift to a new system so long as we continue to support the old.

Not sure about the versioning proposal, Karl, but happy to discuss it further.

Indeed. But I also don’t want to carry the burden of past mistakes.

HongDa · December 9, 2019, 1:41am

I think that while adopting new URLs, old URLs must also be fully supported, at least for a period of time. Redirection is technically possible, but comprehensive testing is required.

The advantage of Karl’s approach is that there is no need to test whether the redirection of old URLs is correct, which will save some time.

karl_lew · December 9, 2019, 3:43am

Yes. That is the heart of my concern. There are many many references to SuttaCentral throughout the internet. We should honor them and not treat them as mistakes.

sujato · December 9, 2019, 6:30am

Can someone explain to me in baby language exactly why this is so?

sabbamitta · December 9, 2019, 8:23am

As an IT baby, I do understand that if we keep the old URLs and build new ones according to Karl’s proposal, there is no need to redirect stuff and hence no need to test if the redirection works.

What do the adults say?

sujato · December 9, 2019, 8:40am

We should redirect things: there must be one and only one canonically correct URL. If someone goes to an old URL, it should work, but redirecting informs them that they are using an old feature.

Snowbird · December 9, 2019, 11:06am

In practice, is this really true? Were you planning on having a “You will be directed to the new page in 5…4…3…2…” message?

With any of the standard redirects, there is nothing that the user sees other than the url is not what they typed in (or more likely clicked on).

As I understand, even a fastidious web page owner would have to work hard to know that their links were being redirected.

karl_lew · December 9, 2019, 3:31pm

As long as we don’t break all the existing URLs, I am quite happy to explore the brave new world of revisiting SC url structure.

One way to think about this is that we can just add the new system in parallel so that we support all the following:

https://suttacentral.net/mn1 => sutta card
https://suttacentral.net/mn1/en/sujato redirects to
https://suttacentral.net/en/sujato/mn1
https://suttacentral.net/en/sujato/mn1:171.4 displays MN1 highlighted at “root of suffering” with a client side URL rewrite to:
https://suttacentral.net/en/sujato/mn1:171.4#mn1:171.4 (i.e. the # is added by the web page itself)
https://suttacentral.net/en/bodhi/mn1 => non-segmented legacy translation
https://suttacentral.net/nya/nya1 redirects to
https://suttacentral.net/mn1
https://suttacentral.net/json/en/sujato/mn1 => JSON segmented MN1
etc.

I.e., there are multiple right views. Each serves a different purpose. Thinking this way, we first ask what is needed, then we add what serves the need.
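
A minimal Python sketch of how a dispatcher for the list above might look. It is a sketch under assumptions, not a proposal for the actual implementation: `route` and the action names are invented, the language set is a placeholder, and the /nya/nya1 alias case is left out because it would need an alias table that isn't described here.

```python
LANGUAGES = {"en"}  # placeholder; the real set of language uids is assumed


def route(path: str):
    """Map a request path to an (action, payload) pair."""
    segs = [s for s in path.split("/") if s]
    if len(segs) == 1:
        # /mn1 => sutta card
        return ("card", segs[0])
    if len(segs) == 3 and segs[1] in LANGUAGES:
        # /mn1/en/sujato => redirect to /en/sujato/mn1
        uid, lang, author = segs
        return ("redirect", f"/{lang}/{author}/{uid}")
    if len(segs) == 3 and segs[0] in LANGUAGES:
        # /en/sujato/mn1 or /en/sujato/mn1:171.4 => text, optional segment
        lang, author, ref = segs
        uid, _, segment = ref.partition(":")
        return ("text", (lang, author, uid, segment or None))
    if len(segs) == 4 and segs[0] == "json":
        # /json/en/sujato/mn1 => JSON for the same text
        return ("json", tuple(segs[1:]))
    return ("not_found", path)


for p in ("/mn1", "/mn1/en/sujato", "/en/sujato/mn1:171.4", "/json/en/sujato/mn1"):
    print(p, "=>", route(p))
```

The client-side step of appending `#mn1:171.4` would happen in the page itself and is not modelled here.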

sujato · December 9, 2019, 5:32pm

Sure, that’s enough. It is not for average users, but for developers: if someone is linking from an external resource, they should know what the canonical form is.

https://i.imgur.com/8zS4uHH.mp4

karl_lew · December 9, 2019, 5:43pm

Restraint should clearly be exercised on problem definition so that solutions themselves do not become a problem.

AN5.272:3.3: They don’t make decisions prejudiced by favoritism, hostility, stupidity, and cowardice. And they know if a meal has been assigned or not.

eroux · December 10, 2019, 7:24pm

Hi!

I’m the lead developer of BDRC. I haven’t made contact in a long time, but I’m still following this in the hope that we can collaborate in the future!

As part of our move to Linked Data, an important aspect was to design URLs that are going to be persistent and will “never” change, or at least not for motives such as change of technology, ingestion of new data, SEO, etc.

It seems that the consensus is that the only way to achieve that is to have non-semantic URLs on just one level of path (no/sub/path/). I’ve gathered a few discipline instructions and links in “Choosing persistent IDs” (Google Docs) if you want to know more!

Best,

Elie

karl_lew · December 11, 2019, 5:41am

Elie, thanks for pointing this out. Looks like SuttaCentral Identifiers are a bit problematic with the use of hierarchies and the colon, which are both pre-existing. For the hierarchies, we would at least be allowing the original hierarchy (mn1/en/sujato) as well as supporting a new hierarchy (en/sujato/mn1).

The colon is a bit trickier, since it’s likely that we might need an RDF prefix to indicate that “this is a SuttaCentral ID”. In this case we might make use of an optional alternate character such as underscore. This would allow sc: as the RDF prefix for “SuttaCentral”:

sc:mn1_171.4

Would this be an effective compromise?
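
In Python, the mapping between the two spellings could be sketched roughly like this. The function names are invented, and the sketch assumes that a segment ID contains at most one colon and that sutta UIDs never contain an underscore; both would need checking against the real IDs.

```python
PREFIX = "sc:"  # the RDF prefix suggested above


def to_prefixed(segment_id: str) -> str:
    """'mn1:171.4' -> 'sc:mn1_171.4' (first colon becomes an underscore)."""
    return PREFIX + segment_id.replace(":", "_", 1)


def from_prefixed(name: str) -> str:
    """'sc:mn1_171.4' -> 'mn1:171.4' (inverse of to_prefixed)."""
    if not name.startswith(PREFIX):
        raise ValueError(f"not an sc: name: {name!r}")
    return name[len(PREFIX):].replace("_", ":", 1)


print(to_prefixed("mn1:171.4"))
print(from_prefixed("sc:mn1_171.4"))
```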

eroux · December 11, 2019, 8:45am

Karl, thanks a lot for your answer!

sc:mn1_171.4 looks reasonable yes!

Another possible strategy is to have two URL schemes. For instance if you assign non-semantic IDs to things in your database, let’s say you have:

ID39847:
   - lang: en
   - author-id: sujato
   - text-id: mn1

Then you can have:

http://purl.suttacentral.net/ID39847

that you will never have to change and that you can use for your persistent URIs (that you could encourage as links from the outside). And in addition you can have any kind of user friendly URL scheme using the en, sujato, mn1, etc. parts. And this can change, the persistent URI can redirect the html to the latest version.
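
As a rough illustration in Python, the persistent ID is just a key into a table, and the user-friendly path is derived from the record at request time. The record is the example above; the function names and the choice of lang/author/text ordering are assumptions for the sake of the sketch.

```python
# One record, copied from the example; a real system would use a database.
RECORDS = {
    "ID39847": {"lang": "en", "author-id": "sujato", "text-id": "mn1"},
}


def friendly_path(record: dict) -> str:
    """Build the current human-readable path for a record.

    This is the only place that knows the friendly URL scheme, so the
    scheme can change without touching the persistent IDs.
    """
    return f"/{record['lang']}/{record['author-id']}/{record['text-id']}"


def resolve(persistent_id: str):
    """Return the path a persistent ID should redirect to, or None."""
    record = RECORDS.get(persistent_id)
    if record is None:
        return None
    return friendly_path(record)


print(resolve("ID39847"))
```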

This is a strategy I’ve seen applied in many platforms (not only Linked Data), and this is what BDRC will do. I think it’s quite solid and future proof…

Best,
–
Elie

Khemarato.bhikkhu · December 11, 2019, 9:05am

On the other hand, human readable URLs have a number of benefits: they can be guessed (for example, this site programmatically generates SC links), they are transparent (where does this link go?), and they are better for SEO