File and folder name conventions for JSON texts

sujato · May 30, 2019, 12:28am

I have updated this proposal following some suggestions in the comments.

Background

SC is moving to a JSON-based system for maintaining data. Each kind of data is maintained in a separate file, and the text segments are coordinated by the universal IDs. Here is the initial implementation.

The question arises as to how we should name our files. Traditionally on SC we have had minimal file names and have inferred further details from context, i.e. folder names. For example, we might have the file mn1.html included in the path /en/sujato and can therefore conclude that this is a translation by sujato of the text mn1.

This has the advantage of being fairly DRY (although not completely; we still have /mn/mn1). However, it means that everything has to be inferred, which makes the files less portable and resilient: any change to the folder structure can mess things up. And if someone wants to, say, use the files in another project, they have to maintain or adapt the same folder structure.

Proposed JSON naming conventions

For the JSON files, building on suggestions by Blake, I propose we use fully articulated file names, eg.

mn1_root-pli-ms.json

Here, there are three main sections:

The initial part, before the underscore, is the text UID.
Between the underscore and the period we have the MUID (meta-UID), consisting of several components.
The file-type extension. This will usually be .json, except for markup files, which take the extension of the markup, eg. .html
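
As a rough illustration of the three sections, here is a minimal Python sketch that splits a file name at the first underscore and the last period. The names `ParsedName` and `parse_file_name` are made up for this example and are not part of any existing SC code; it also assumes that a text UID never contains an underscore.

```python
from typing import NamedTuple


class ParsedName(NamedTuple):
    uid: str        # text UID, e.g. "mn1"
    muid: str       # meta-UID, e.g. "root-pli-ms"
    extension: str  # file type without the dot, e.g. "json"


def parse_file_name(file_name: str) -> ParsedName:
    """Split a name such as 'mn1_root-pli-ms.json' into UID, MUID and extension."""
    # The extension is whatever follows the last period.
    stem, dot, extension = file_name.rpartition(".")
    # Assumption: the UID ends at the first underscore; the rest is the MUID.
    uid, underscore, muid = stem.partition("_")
    if not (dot and underscore and uid and muid and extension):
        raise ValueError(f"not a UID_MUID.extension file name: {file_name!r}")
    return ParsedName(uid, muid, extension)


print(parse_file_name("mn1_root-pli-ms.json"))
print(parse_file_name("mn1_markup.html"))
```

Splitting on the last period means a UID that itself contains a period would still parse under this sketch.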

Text UIDs and file extensions are unproblematic, but the MUID is a new concept, so let us consider it further. Following is an initial proposal for discussion.

MUID

The MUID must always include a type, of which we have the following:

root (= fundamental text source in ancient language)
translation (into any language)
variant (variant readings for the root text)
reference (reference details aligned per segment)
comment (various notes etc. by modern translators or editors)
markup (currently HTML, may be others)

Other elements of the MUID will vary. In some cases, only the type is needed, in other cases we also need language and edition.

Type only

Consider, for example, the markup type. The same markup is applied to all versions of the text, whether root or any translated language. Thus we have:

mn1_markup.html

Similarly, the same set of references apply across every edition. So for the reference type we have:

mn1_reference.json

Type, language, edition/author

For texts, comments, and variants, we additionally need to specify the language and the edition or author.

For the Mahasangiti edition of the Pali text of MN1:

mn1_root-pli-ms.json

For the sujato translation of the same text:

mn1_translation-en-sujato.json

In the case of variant readings, the edition indicates the edition from where the variant was sourced. Currently these are always the MS edition:

mn1_variant-pli-ms.json

Comments, always a tricky area, quickly get complicated. I propose we use the same form:

mn1_comment-en-sujato.json

Then we know “This is a comment by sujato in English on MN 1”. That way we can keep the form of the file names exactly similar to the translation and root texts, and keep main info there.

Now, the comment might in principle need to be further qualified. It might apply, say, to all texts of MN 1 (like say a discussion of Pali syntax) or only to this particular translation (like a note on English rendering), or some other scope. In addition it might want to be qualified by type or purpose. Probably this will all get too complex for a mere file name.

So we accept that comments are more loosely coupled to their source, and in addition, are more complex and open-ended than other kinds of data. mn1_comment-en-sujato already gives us some useful basic info. Further details can be added to the relevant _mn-info.json.

This makes a nice distinction between critical data (in the file name) and useful data (in _info.json).

No matter what happens, the comment is on this text; it is by sujato, and it is in English. This data is all critical and is in the file name.

On the other hand, the comment may be recommended for translators or for general readers; it may be intended to apply to a specific translation edition (such as a discussion on a terminological choice), or to a specific original text (such as a mistake in punctuation).

But there is nothing definitive about this: a non-technical reader may still enjoy reading a technical comment, while a non-Pali scholar may still be interested to know that differences in editorial choices of the Pali edition can affect the translation. Individual readers or clients may handle these things differently. Such data can be retained in an open-ended form in separate json files.

Summary of proposed MUID rules

The MUID must begin after the underscore and end before the file-name extension.
Elements of a MUID must be separated by hyphens.
Every MUID must include a type.
By default, data applies to every root and translation of its UID. The scope may be constrained by specifying language and edition/author in the MUID.
MUIDs for comments, root, translation, and variants must follow the sequence: type-language-author/edition
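
To make the rules concrete, here is a minimal Python sketch of a MUID checker. It is only one possible reading of the rules above: `parse_muid`, `MUID_TYPES` and `SCOPED_TYPES` are hypothetical names, and the sketch assumes that language codes and author/edition slugs never contain hyphens themselves.

```python
from typing import Dict, Optional

# The six types listed in the proposal.
MUID_TYPES = {"root", "translation", "variant", "reference", "comment", "markup"}
# Types that must follow the sequence type-language-author/edition.
SCOPED_TYPES = {"root", "translation", "variant", "comment"}


def parse_muid(muid: str) -> Dict[str, Optional[str]]:
    """Validate a MUID such as 'root-pli-ms' and return its elements."""
    parts = muid.split("-")  # elements are separated by hyphens
    muid_type = parts[0]
    if muid_type not in MUID_TYPES:
        raise ValueError(f"unknown MUID type: {muid_type!r}")
    # Type only: the data applies to every root and translation of the UID.
    if len(parts) == 1 and muid_type not in SCOPED_TYPES:
        return {"type": muid_type, "language": None, "edition": None}
    # Type, language, edition/author: the scope is constrained.
    if len(parts) == 3 and all(parts):
        return {"type": muid_type, "language": parts[1], "edition": parts[2]}
    raise ValueError(f"malformed MUID: {muid!r}")


print(parse_muid("translation-en-sujato"))
print(parse_muid("markup"))
```

Under this reading, a bare `translation` is rejected because a translation must name its language and author/edition, while `markup` and `reference` are accepted with or without a scope.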

Complications

Various perhaps obscure scenarios may arise in the future. It’s probably not a good idea to overthink it, but worth bearing in mind that new requirements may appear.

Multiple translations of the same text by the same author
Translations into ancient languages
New types? (can’t think of any TBH!)

Folder structure

If the file names are well-articulated, folder structure becomes less important, and chiefly serves for convenience. I propose we keep the idea of a single attribute per folder, and start with the type:

/root/pli/ms/mn/mn1_root-pli-ms.json
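
Because every attribute is already in the file name, the folder path can in principle be derived from it. The Python sketch below shows one way to do that; `folder_path` is a hypothetical helper, and taking the collection folder (here `mn`) from the leading letters of the UID is an assumption made for this example, not something the proposal specifies.

```python
import re
from pathlib import PurePosixPath


def folder_path(file_name: str) -> PurePosixPath:
    """Derive a one-attribute-per-folder path from a fully articulated file name."""
    stem = file_name.rsplit(".", 1)[0]
    uid, _, muid = stem.partition("_")
    # Assumption: the collection is the leading alphabetic part of the UID.
    collection = re.match(r"[a-z]+", uid)
    if not muid or collection is None:
        raise ValueError(f"cannot derive a folder path for {file_name!r}")
    # One folder per MUID element, starting with the type, then the collection.
    return PurePosixPath("/", *muid.split("-"), collection.group(0), file_name)


print(folder_path("mn1_root-pli-ms.json"))
print(folder_path("mn1_translation-en-sujato.json"))
```

The reverse also holds: if a file is moved out of its folder, nothing is lost, since the path can be rebuilt from the name alone.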

@blake @Aminah @karl_lew @michaelh @anders @HongDa

Khemarato.bhikkhu · May 30, 2019, 3:22am

Linux and most command-line utilities can handle colons in file names, but not every operating system can, and even on e.g. Linux there are some gotchas when file names contain colons. I therefore generally recommend against a file naming convention that uses colons (or semicolons, slashes, ampersands, etc.).

sujato · May 30, 2019, 5:45am

Thanks, would you recommend something else instead? We can of course just use a hyphen.

I have done a little research and found that this means “Windows”. So … bug or feature?

Khemarato.bhikkhu · May 30, 2019, 6:30am

Generally I would use hyphens for the first level and underscores for the higher-level delimiter (and dots if a third-level separator is necessary, which is almost never).

But that’s true. If your project is already Windows-hostile then ¯_(ツ)_/¯ “Yassadāni tvaṃ, mahārāja, kālaṃ maññasī”

Which is to say, I’ll go back to minding my own business

Robbie · May 30, 2019, 11:19am

Could there be benefits in indicating whether or not a translation is segmented in the MUID?

For example,

mn1_translation-segmented-en-sujato.json

Or by using a new MUID type:

mn1_segmentation-en-sujato.json

(in which “segmentation” takes on the new meaning of “segmented translation”)

blake · May 30, 2019, 11:47am

That mostly looks reasonable, here are my thoughts:

markup

I am considering using JSON for markup too. This would be basically like a dump from the PO. The markup tail would be handled by assigning it to “~”, the reason being that in ASCII order “~” comes after the alphanumeric characters. It is one of only four printable characters that do; the others are {, } and |.
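
The ordering argument can be checked with a short Python sketch. The segment IDs and markup strings below are invented purely for illustration; only the use of “~” as the key for the tail comes from the idea above.

```python
# Printable ASCII characters that sort after "z" (code points 123 to 126).
after_z = [chr(code) for code in range(ord("z") + 1, 127)]
print(after_z)

# Hypothetical markup file: segment IDs map to markup, "~" holds the tail.
markup = {
    "~": "</article>",
    "mn1:1.1": "<p>{}</p>",
    "mn1:0.1": "<h1>{}</h1>",
}

# A plain string sort puts the "~" key after every alphanumeric key.
ordered_keys = sorted(markup)
print(ordered_keys)
```

Note that a plain string sort is not a natural sort of segment numbers (for example, "10" sorts before "2"), so this only demonstrates where the “~” key ends up.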

Comments

I was thinking of using a second level of JSON inside the file,

mn1_comment-en.json

{
  "mn1:1.1": {
    "captain_obvious": "the first paragraph",
    "sujato": "...something insightful..."
  }
}

I don’t see any serious disadvantages to doing this. It’s even quite trivial to search by author as long as the JSON formatting follows the normal human-readable convention of putting the key and value on the same line, i.e. you could use a regex “sujato.*insightful”.

The exact format might be different though, for example:

{
  "mn1.1": [
    {
      "author": "sujato",
      "comment": "...something insightful..."
    }
  ]
}

Using an array instead of an object means multiple comments from the same author are possible, including comment chains (i.e. discussions between two translators). We should decide if this is desirable.
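
As a rough sketch of how the array form could be consumed, the Python below regroups a shared comment file by author. The function name `comments_by_author` and the second sample entry are made up for this example; the field names "author" and "comment" follow the structure shown above.

```python
import json
from collections import defaultdict
from typing import Dict, List

raw = """
{
  "mn1.1": [
    {"author": "sujato", "comment": "...something insightful..."},
    {"author": "captain_obvious", "comment": "the first paragraph"}
  ]
}
"""


def comments_by_author(data: Dict[str, List[dict]]) -> Dict[str, Dict[str, List[str]]]:
    """Regroup {segment: [entries]} as {author: {segment: [comments]}}."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for segment_id, entries in data.items():
        for entry in entries:
            # A list per segment keeps several comments by the same author.
            grouped[entry["author"]][segment_id].append(entry["comment"])
    return {author: dict(segments) for author, segments in grouped.items()}


by_author = comments_by_author(json.loads(raw))
print(by_author["sujato"])
```

This shows that a per-author view can be derived from a shared file; whether that is preferable to one file per author is a separate design question.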

A second level of structure in the json seems beneficial for certain classes of data, such as comments.

author vs project

One challenge when it comes to translations is there isn’t necessarily a single translator, as such I would lean towards the concept of a project or edition. It would be okay if we label sujato’s translations as the “sujato” project. It would also be acceptable for translations to be projectless which means internally they use a default unnamed project.

I would suggest having no reference whatsoever to author in the folder or file names. Author would only appear in the json data, as in my proposed structure for comments.
A project can share a name with an author but does not have to.
A project is not required.

defining MUIDs

It is important to clearly define what MUIDs refer to. In the current bilara prototype I use meta files, starting with an underscore for this. I believe this is a useful concept.

/_pli.json

{
  "pli": {
    "type": "language",
    "name": "Pali",
    "description": "The language of the Pali Canon"
  }
}

The meta data should at a minimum be type and name, but can include literally anything, for example a project might contain an extended description.

In the current prototype a MUID should be defined before or when it is encountered while exploring the folder structure and is then inherited by subfolders, so for example it might be located at /pli/_pli.json or /_pli.json, but if the code encounters a folder called pli but hasn’t previously encountered the definition nor can find the definition in the pli folder it should throw an error saying “I don’t know what pli is”. This scheme allows a definition to be overridden in a subfolder or alternative branch, for example “mn” might have the name “Majjhima Nikaya” in root/pli/mn/_mn.json but the name “Middle Discourses” in translation/en/mn/_mn.json.
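
The lookup described here might be sketched as follows in Python. This is not the bilara prototype's actual code: `load_meta` and `resolve` are hypothetical names, the top-level file name `_meta.json` and the "type" values other than "language" are invented, and the demo builds a throwaway folder tree in a temporary directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def load_meta(folder: Path) -> Dict[str, dict]:
    """Merge every meta file (name starting with an underscore) in one folder."""
    definitions: Dict[str, dict] = {}
    for meta_file in sorted(folder.glob("_*.json")):
        definitions.update(json.loads(meta_file.read_text(encoding="utf-8")))
    return definitions


def resolve(folder: Path, inherited: Optional[Dict[str, dict]] = None) -> Dict[str, Dict[str, dict]]:
    """Return {folder path: definitions visible there}, inheriting downwards."""
    # Local definitions override whatever was inherited from parent folders.
    scope = {**(inherited or {}), **load_meta(folder)}
    scopes = {str(folder): scope}
    for child in sorted(p for p in folder.iterdir() if p.is_dir()):
        # A folder name is a MUID: it must be defined above it or inside it.
        if child.name not in scope and child.name not in load_meta(child):
            raise KeyError(f"I don't know what {child.name} is")
        scopes.update(resolve(child, scope))
    return scopes


def write_json(path: Path, data: dict) -> None:
    path.write_text(json.dumps(data), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    pali_mn = base / "root" / "pli" / "mn"
    english_mn = base / "translation" / "en" / "mn"
    pali_mn.mkdir(parents=True)
    english_mn.mkdir(parents=True)
    write_json(base / "_meta.json", {
        "root": {"type": "type", "name": "Root"},
        "translation": {"type": "type", "name": "Translation"},
        "pli": {"type": "language", "name": "Pali"},
        "en": {"type": "language", "name": "English"},
    })
    write_json(pali_mn / "_mn.json", {"mn": {"type": "collection", "name": "Majjhima Nikaya"}})
    write_json(english_mn / "_mn.json", {"mn": {"type": "collection", "name": "Middle Discourses"}})

    scopes = resolve(base)
    pali_name = scopes[str(pali_mn)]["mn"]["name"]
    english_name = scopes[str(english_mn)]["mn"]["name"]

    # An undefined folder name is an error.
    (base / "lzh").mkdir()
    try:
        resolve(base)
        error = None
    except KeyError as exc:
        error = exc.args[0]

print(pali_name, "|", english_name, "|", error)
```

Because each branch carries its own scope, the same MUID can resolve to different metadata in different branches, which is the override behaviour described above.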

UIDs could also be given definitions in meta files, for example the file _mn.json might contain the description or blurb for every mn sutta.

There is an alternative scheme: the code could first explore the entire structure, examine every meta file, and then apply that knowledge globally, although that doesn’t allow for a clean way to override the metadata in a specific context. Alternatively, everything could be defined at the top level, but that also doesn’t leave a clean way to override in very specific contexts.

In any case I haven’t concluded what the best naming scheme for meta definition files would be, though I think the general concept that they start with an underscore is sound.

MUIDs must be defined in a meta file before or when they are encountered; failing to do so is an error. MUIDs include folder names and whatever follows an underscore in non-meta files.
UIDs may be defined in a meta file, but it is not required, as that metadata can be contained in the file itself.
Meta files must start with an underscore; inside, they should contain an object of key:value pairs, where keys are MUIDs.
The code will probably ignore the name following the underscore but we might want a convention.

karl_lew · May 30, 2019, 2:32pm

The markup folder is brilliant and the language independence breathtaking.

The continued and pervasive use of a long filename component such as “translation” will become irritating over time. Suggest “t” or “tr” or “trans” for brevity. Perhaps even “t9n” for the whimsical.

Colons are divisive speech used by Windows as drive name syntax. Avoiding them would be inclusive. A hyphen suffices.

Over-burdening filenames with meaning can get quite cumbersome, especially with the unbounded nature of “comments about”. Suggest using a simple date code such as “20190530” for each comment regardless of purpose. Prolific commentators may suffix date with a single letter (e.g., “20190530A”). Extraordinarily prolific commentators should…practice restraint.

It would be interesting to extend MUIDs to voice recordings…
For example, we might have the Vinaya spoken by either Ajahn Brahmali or Bhante Sujato on a given date.

Since this is a Git project, the entire corpus could be released with all prior versions available to posterity:

bilara-data v1.0.0
bilara-data v1.1.0
etc.

Adoption of semantic versioning would even permit resegmentation between major revisions (i.e., “not backwards compatible”)

sujato · May 31, 2019, 12:35am

I’d have to see this first. But I have to say, I really like the current HTML form. It just makes it really clear exactly what is going on.

Umm, I do! I really don’t want lots of text by different authors in the same file. Sure you can use keywords to search it or whatever, but that’s not the point: 1 file should be 1 thing. I can guarantee you people will want something like “a set of brahmali’s notes”. How do I give that to them? Heck, I’ll want to do that all the time, I want to read and edit and search my notes, and not have to sort through someone else’s. Sure, I know I can filter the searches, I just don’t want to!

Moreover, it’s just asking for trouble down the road. I am still thinking of enabling a user-interface for note-taking at some point. We’ll heap up multiple sets of comments, imagine 100s of thousands or millions of notes by dozens or hundreds of authors, all mashed up in the same files. That’s exactly what we’re trying to get away from.

It also just makes it harder to see what’s what. I can’t just scan the files and see who did what, I have to look at the JSON. And of course, it is less portable.

It’s not necessary. The scope of a segment is small enough that you can simply write separate comments in the same comment field, which will happen occasionally.

What may be more useful, as I mentioned above, is different sets of comments for different purposes. But we’re not there yet.

Note too that I am assuming we do not want to use block level formatting inside comments. Inline formatting uses nilakkhana.

Right, so let me think it through. Say we have two translators collaborating on Bilara, Fred and Wilma. Each edit they make is recorded on Github under their names. But the translation itself must have a separate name for their collaboration.

What about this: The name of a translation cannot be the same as the Github user name (because people have silly user names!). So in any case, we are going to have to get the authors to input a distinct “author” name. Currently on SC we have two forms of this, the slug and the full form. So Fred and Wilma both set up their accounts on Bilara under their Github names, but use the same author name:

slug form: flintstone
full form: Fred & Wilma Flintstone

This is all going to have to happen anyway, and it fits in with the current SC model, where we have multiple authors on several texts. Of course the author name could, in principle, also be an impersonal name, “Paleolithic Translation Project”.

So I’m not sure what problems introducing the idea of “projects” at this level is going to solve.

For the same reason as above, I don’t like this idea: it is not portable or resilient. I want to be able to copy a file, plunk it anywhere on my system, or email it to someone, and have it be immediately clear exactly what it is just by looking at the name. And I’m not sure what problem this is solving.

This all sounds good. Again I would encourage keeping the definitions local as far as possible, having due regard also for DRY.

This whole discussion is for segmented texts only. It doesn’t apply to legacy HTML texts. As a general rule, the legacy HTML format is deprecated. We are currently working to clean and improve the markup, and we will continue to add texts as they become available, but there will be no new functionality developed for them; everything is segmented going forward.

I agree, I had a little thrill of rapture when I saw it!

Maybe, but we already use a bunch of small IDs, so disambiguating them becomes difficult. And lots of small IDs rapidly become unreadable.

Perhaps we could use:

root
trans
var
ref
note
markup

Or make them all four letters?

- root
- tran
- vars
- refs
- note
- mark

Lol, okay!

True, but I am not sure that we have reached that point quite yet. I don’t really see the advantage of using dates: as I mentioned to Blake, what I want is to see the file name and know what is in it.

Right, the audio dimension could definitely extend this concept.

I’m not really thinking of versions here, but of purposes. For example, I may want to make a literal translation of the Dhammapada for Pali students, and a poetic one for general readers.

Having said which, it may be a good idea to introduce semantic versioning. So far we just rely on Git, which records everything, but it’s obviously not always convenient to check.

The problem with introducing a semantic versioning system is that it becomes an additional overhead for the translators. It’s not easy to figure out. It’s the kind of thing that, if you’ve done half a dozen projects, you can get familiar with the different stages and have some idea where it’s going. But for most of us, when we translate, we’ve never done anything like this before, so we just start and keep going. We don’t really think in terms of versions or stages. This is why I have always thought it’s best to just rely on Git to keep records.

sujato · June 7, 2019, 5:36am

Just so y’all know, I have updated the OP with my latest thinking. I have gotten rid of colons, and keep the comment syntax the same as the other types, see post for reasoning. This eliminates the verbosity problem.