I have updated this proposal following some suggestions in the comments.
SC is moving to a JSON-based system for maintaining data. Each kind of data is maintained in a separate file, and the text segments are coordinated by the universal IDs. Here is the initial implementation.
The question arises as to how we should name our files. Traditionally on SC we have had minimal file names and have inferred further details from context, i.e. folder names. For example, we might have the file
mn1.html included in the path
/en/sujato and can therefore conclude that this is a translation by sujato of the text mn1.
This has the advantage of being fairly DRY (although not completely; we still have
/mn/mn1). However, it means that everything has to be inferred, which makes the files less portable and less resilient: any change to the folder structure can mess things up. And if someone wants to, say, use the files in another project, they have to maintain or adapt the same folder structure.
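To make the fragility concrete, here is a minimal Python sketch of context-based inference. The function name `infer_from_path` and the simplified layout `/en/sujato/mn1.html` are assumptions for illustration only, pieced together from the example above; the real folder structure has more levels.

```python
from pathlib import PurePosixPath


def infer_from_path(path: str) -> dict:
    """Infer language, author and text UID purely from folder position."""
    p = PurePosixPath(path)
    # Assumption: the two folders directly above the file are language/author.
    author = p.parent.name
    language = p.parent.parent.name
    return {"uid": p.stem, "language": language, "author": author}


print(infer_from_path("/en/sujato/mn1.html"))
# Moving the file to another folder silently changes the inferred details.
print(infer_from_path("/archive/old/mn1.html"))
```

Nothing in the file itself says who translated it or into which language, so the second call reports whatever folders the file happens to sit in.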
Proposed JSON naming conventions
For the JSON files, building on suggestions by Blake, I propose we use fully articulated file names, eg.
Here, there are three main sections:
- The initial part, before the underscore, is the text UID.
- Between the underscore and the period we have the MUID (meta-UID), consisting of several components.
- The file-type extension. This will usually be
.json, except for markup files, which take the extension of the markup, eg.
Text UIDs and file extensions are unproblematic, but the MUID is a new concept, so let us consider it further. Following is an initial proposal for discussion.
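As a rough illustration of the three sections, a file name could be split like this in Python. This is only a sketch: `split_filename` and `ParsedName` are hypothetical names, and it assumes that the text UID never contains an underscore and that the extension is whatever follows the last period.

```python
from typing import NamedTuple


class ParsedName(NamedTuple):
    uid: str        # text UID, e.g. "mn1"
    muid: str       # meta-UID, e.g. "comment-en-sujato"
    extension: str  # file-type extension without the dot, e.g. "json"


def split_filename(filename: str) -> ParsedName:
    """Split '<uid>_<muid>.<ext>' into its three sections."""
    # Split at the LAST period, so a period inside the UID is tolerated.
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {filename!r}")
    # Split at the FIRST underscore (assumes UIDs contain none).
    uid, underscore, muid = stem.partition("_")
    if not underscore or not uid or not muid:
        raise ValueError(f"expected '<uid>_<muid>' in {filename!r}")
    return ParsedName(uid, muid, extension)


print(split_filename("mn1_comment-en-sujato.json"))
```

Because everything needed is in the name, the same call works wherever the file is stored.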
The MUID must always include a type, of which we have the following:
- root (= fundamental text source in ancient language)
- translation (into any language)
- variant (variant readings for the root text)
- reference (reference details aligned per segment)
- comment (various notes etc. by modern translators or editors)
- markup (currently HTML, may be others)
Other elements of the MUID will vary. In some cases, only the type is needed, in other cases we also need language and edition.
Consider, for example, the markup type. The same markup is applied to all versions of the text, whether root or any translated language. Thus we have:
Similarly, the same set of references apply across every edition. So for the reference type we have:
Type, language, edition/author
For texts, comments, and variants, we need to additionally specify the edition or author.
For the Mahasangiti edition of the Pali text of MN1:
For the sujato translation of the same text:
In the case of variant readings, the edition element indicates the edition from which the variant was sourced. Currently this is always the MS edition:
Comments, always a tricky area, quickly get complicated. I propose we use the same form:
Then we know “This is a comment by sujato in English on MN 1”. That way we can keep the form of the file names exactly parallel to that of the translation and root texts, and keep the main info there.
Now, the comment might in principle need to be further qualified. It might apply, say, to all texts of MN 1 (like say a discussion of Pali syntax) or only to this particular translation (like a note on English rendering), or some other scope. In addition it might want to be qualified by type or purpose. Probably this will all get too complex for a mere file name.
So we accept that comments are more loosely coupled to their source, and in addition, are more complex and open-ended than other kinds of data.
mn1_comment-en-sujato already gives us some useful basic info. Further details can be added to the relevant
This makes a nice distinction between critical data (in the file name) and useful data (in
No matter what happens, the comment is on this text; it is by sujato, and it is in English. This data is all critical and is in the file name.
On the other hand, the comment may be recommended for translators or for general readers; it may be intended to apply to a specific translation edition (such as a discussion on a terminological choice), or to a specific original text (such as a mistake in punctuation).
But there is nothing definitive about this: a non-technical reader may still enjoy reading a technical comment, while a non-Pali scholar may still be interested to know that differences in editorial choices of the Pali edition can affect the translation. Individual readers or clients may handle these things differently. Such data can be retained in an open-ended form in separate JSON files.
Summary of proposed MUID rules
- The MUID must begin after the underscore and end before the file-name extension.
- Elements of a MUID must be separated by hyphens.
- Every MUID must include a type.
- By default, data applies to every root and translation of its UID. The scope may be constrained by specifying language and edition/author in the MUID.
- MUIDs for comments, root, translation, and variants must follow the sequence: type-language-author/edition
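The rules above can be sketched as a small Python check. This is a hedged illustration rather than a settled implementation: `parse_muid` is a hypothetical name, and it assumes the type always comes first, that a MUID has at most three elements, and that language and author/edition values contain no hyphens.

```python
# Types listed in the proposal.
TYPES = {"root", "translation", "variant", "reference", "comment", "markup"}
# Types whose MUID must follow type-language-author/edition.
FULL_FORM_TYPES = {"root", "translation", "variant", "comment"}


def parse_muid(muid: str) -> dict:
    """Check a MUID against the proposed rules and return its elements."""
    parts = muid.split("-")  # elements are separated by hyphens
    if any(not part for part in parts):
        raise ValueError(f"empty element in MUID {muid!r}")
    kind = parts[0]  # assumption: the type is always the first element
    if kind not in TYPES:
        raise ValueError(f"unknown type {kind!r}")
    if kind in FULL_FORM_TYPES and len(parts) != 3:
        raise ValueError(f"{kind} MUIDs need type-language-author/edition")
    if len(parts) > 3:
        raise ValueError(f"too many elements in MUID {muid!r}")
    # A missing element means no constraint: the data applies to every
    # root and translation of the UID.
    language = parts[1] if len(parts) > 1 else None
    edition = parts[2] if len(parts) > 2 else None
    return {"type": kind, "language": language, "edition": edition}


print(parse_muid("comment-en-sujato"))
print(parse_muid("markup"))
```

A bare `markup` passes because only the type is required, whereas something like `translation-en` is rejected for lacking an author.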
Various, perhaps obscure, scenarios may arise in the future. It’s probably not a good idea to overthink it, but it is worth bearing in mind that new requirements may appear.
- Multiple translations of the same text by the same author
- Translations into ancient languages
- New types? (can’t think of any TBH!)
If the file names are well-articulated, folder structure becomes less important, and chiefly serves for convenience. I propose we keep the idea of a single attribute per folder, and start with the type: