Specifications for UIDs for the texts

Specifications for UIDs (DRAFT!)

  • UIDs are all lowercase
  • UIDs will only have ASCII characters
  • UIDs may contain a dash (-)
    • to indicate an existing range (not just ad hoc ranges)
    • in the Vinaya, as part of the UID text itself, e.g. pli-tv-bu-vb-pc1 indicates Pali, Theravada, monks’, Vibhanga, pācittiya 1.
  • UIDs may contain a single period (.) to indicate a chapter and sutta number.
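
As a rough illustration of the draft rules above, here is a minimal Python sketch of a sanity check. The function name `looks_like_uid` is made up for this example, and allowing digits is an assumption based on the sample UIDs (the rules do not mention them explicitly). Passing this check only means a string is shaped like a UID, not that the UID actually exists.

```python
import re

# Assumed character set, inferred from the draft rules and the sample UIDs:
# lowercase ASCII letters, digits, dashes, and periods.
_ALLOWED = re.compile(r"[a-z0-9.-]+")


def looks_like_uid(candidate: str) -> bool:
    """Return True if the string is shaped like a UID under the draft rules.

    This is a necessary check only; it cannot tell whether the UID exists.
    """
    if not _ALLOWED.fullmatch(candidate):
        return False
    # The draft allows a single period (chapter and sutta number).
    return candidate.count(".") <= 1
```

An en-dash or an uppercase letter makes the check fail, which matches the quirks listed further down.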

Other quirks:

  • In a regular URL you can create a UID for an individual sutta within a range sutta (e.g. https://suttacentral.net/dhp1/en/sujato) and you can successfully get that sutta. However, dhp1 is not really a valid UID because using it in the API (e.g. https://suttacentral.net/api/bilarasuttas/dhp1/sujato?lang=en) will fail.
  • In a regular URL the UID is not case sensitive; however, it’s not a good idea to rely on this.
  • The Vinaya UIDs are the only UIDs that contain language, school, etc.
  • In range suttas like the Dhammapada, the range is written with an en-dash in human-readable areas, e.g. “Dhp 1–20” not “Dhp 1-20”. Using the en-dash in a UID will break it.
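
To make the two URL shapes in these quirks concrete, here is a small Python sketch that only builds the strings (it makes no network request). The helper names are hypothetical, and the URL patterns are simply copied from the two examples above, so treat them as observed behaviour and not a documented contract.

```python
from urllib.parse import quote, urlencode

BASE = "https://suttacentral.net"
EN_DASH = "\u2013"  # the character used in "Dhp 1–20"


def normalize_uid(uid: str) -> str:
    """Lowercase the UID and refuse an en-dash, per the quirks listed above."""
    if EN_DASH in uid:
        raise ValueError("UIDs use a plain hyphen, not an en-dash")
    return uid.lower()


def page_url(uid: str, lang: str, author: str) -> str:
    """Regular page URL, following the pattern of the dhp1 example."""
    return f"{BASE}/{quote(normalize_uid(uid))}/{quote(lang)}/{quote(author)}"


def bilara_api_url(uid: str, author: str, lang: str) -> str:
    """API URL, following the pattern of the bilarasuttas example."""
    query = urlencode({"lang": lang})
    return f"{BASE}/api/bilarasuttas/{quote(normalize_uid(uid))}/{quote(author)}?{query}"
```

Lowercasing up front avoids depending on the case-insensitivity of regular URLs.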

I’m wondering about edge cases for UIDs so I can make sure that my apps can handle them. In the OP I’ve added what I know so far to be true. It’s a wiki post so anyone can edit.

If you have time, I’m wondering if Bhante @Sujato, @HongDa, or anyone else in the know can offer corrections or additions. If you just want to reply I’ll add them to the OP.


You probably want to also read Blake’s definition of an MUID (you’ll find it on our Github.)

Yep.

Yep.

In which case it would always be selected by \d-\d? I think?

Yes, but is this only Vinaya? I can’t recall, but I don’t think it was a specific design constraint. Anyway, this would be selected [a-z]-[a-z], I think.
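
If it helps, the two patterns mentioned here can be tried out with a few lines of Python. This is only a sketch of the idea: the function name and the labels "range" and "structural" are invented for the example.

```python
import re

# A dash between two digits: a range such as dhp1-20.
RANGE_DASH = re.compile(r"\d-\d", re.ASCII)
# A dash between two letters: part of the UID text, as in pli-tv-bu-vb-pc1.
TEXT_DASH = re.compile(r"[a-z]-[a-z]")


def dash_kinds(uid: str) -> dict:
    """Report which kinds of dash appear in a UID."""
    return {
        "range": bool(RANGE_DASH.search(uid)),
        "structural": bool(TEXT_DASH.search(uid)),
    }
```

For example, `dash_kinds("dhp1-20")` reports only a range dash, while `dash_kinds("pli-tv-bu-vb-pc1")` reports only a structural one.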

Right.

Indeed, a questionable design choice, but forced by the massive parallelism of the Vinaya, which is quite unlike the suttas.

That’s correct, but it speaks to the more general distinction between the UID and the acronym.

The UID is the jeans-&-T-shirt version, the acronym is suit-&-tie. Acronyms are case sensitive, and they have spaces. The most important thing, which I have had to drill into developers’ heads again and again, is that the meaning of acronyms is context-dependent and therefore they cannot be automatically derived from UIDs. Instead we use a lookup table, otherwise we will get capitalization wrong.
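
A minimal Python sketch of the lookup-table idea, for anyone building an app around this. The table contents and the function name are illustrative only (a real table would come from SuttaCentral's own data), but it shows why deriving the acronym mechanically goes wrong.

```python
from typing import Optional

# Illustrative entries only; a real table would be loaded from project data.
ACRONYMS = {
    "dn1": "DN 1",
    "dhp1-20": "Dhp 1\u201320",  # en-dash in the human-readable form
}


def acronym_for(uid: str) -> Optional[str]:
    """Look the acronym up; return None when the table has no entry."""
    return ACRONYMS.get(uid)


def naive_acronym(uid: str) -> str:
    """What NOT to do: guess the acronym by uppercasing the UID."""
    return uid.upper()
```

Here `naive_acronym("dhp1-20")` gives "DHP1-20", which has the wrong capitalization, no space, and a hyphen where the table has an en-dash.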


So far you’ve addressed sutta UIDs; are you wanting to define segment IDs as well?

Hmm. Is this in documentation somewhere? I did a search and found it in several places in the code. Is the MUID the part of the file name after the _? Until now I haven’t messed with those much, but it would be great to document them.

Sure. I haven’t spent much time working with them other than just as an array to build the suttas.

Is there already documentation on all this in the repository? I couldn’t find any.

We break the file into what I call the “uid” (“dn1”) and the “muid” (“root-pli-ms”). By convention in SuttaCentral we use “uid” to refer to a particular sutta, in this case “dn1”. In Bilara I came up with “muid” (meta-uid), which refers to a collection: for example, “root-pli-ms” is the Pali root texts and “translation-en-sujato” is the English translations by Sujato.
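
Going by that description, the split could be sketched in Python like this. The function name is hypothetical, and both the ".json" extension and the exact file name layout (uid, underscore, muid) are assumptions based on the question above about the part after the _.

```python
def split_bilara_name(filename: str) -> tuple:
    """Split a name like 'dn1_root-pli-ms.json' into (uid, muid).

    Assumes the layout '<uid>_<muid>' with an optional '.json' extension.
    """
    name = filename.rsplit("/", 1)[-1]  # drop any directory part
    if name.endswith(".json"):
        name = name[: -len(".json")]
    # Split at the first underscore only; UIDs themselves may contain a period.
    uid, sep, muid = name.partition("_")
    if not sep or not uid or not muid:
        raise ValueError(f"expected '<uid>_<muid>', got {filename!r}")
    return uid, muid
```

The extension is stripped explicitly instead of by splitting on the last period, so a UID that contains a period is not cut short.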

BTW, I’m loving Github’s new search, it’s really well done!
