New Structural Data Implementation

So this is the thread about the new data model implementation which, because the acronyms have gone too far, is called the Forest, as it contains trees of data.

It should be noted that while the loader code is implemented, I have not yet done much rearranging of the actual texts; aspects of that are still up for discussion.

Basic principles:

  1. Folder structure is used to form a hierarchy
  2. There is no requirement for uniqueness of uids
  3. Addressing can be done by “bucket of uids” URIs
  4. “Property Files” allow context-sensitive naming
  5. Alternative way of assigning properties
  6. The Client knows nothing of the above

An example folder structure might look something like this:


Folder Structure is used to define a hierarchy

This one is pretty self-explanatory. If you need another level of division, you add another level of folders. This way it’s super clear what you’re going to get.

There is no requirement for uniqueness of uids

Handling editions in a sane way pretty much mandated this, and so does the handling of translations even as we do it now. When I realized this, I realized there’s also no need to give each vagga a unique uid. Apart from what is outright forbidden by how filesystems work (that is, folders or files with the same name in the same folder), the one exception to the “no requirement for uniqueness” is that a uid cannot be shared between ancestors and descendants, so “part1/part1” would not be allowed. Also, as usual, just because you can doesn’t mean you should: the data loader and API permit this, but that ability should be used wisely.
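The ancestor/descendant rule is easy to check mechanically. Here is a minimal sketch in Python, assuming a node's position is written as a slash-separated path of uids; the function name `repeated_uids` is made up for illustration and is not part of the loader.

```python
def repeated_uids(path: str) -> list[str]:
    """Return the uids that occur more than once along one root-to-node path.

    An empty list means the path respects the rule that a uid may not be
    shared between a node and any of its ancestors.
    """
    seen = set()
    repeated = []
    for uid in path.strip("/").split("/"):
        if uid in seen and uid not in repeated:
            repeated.append(uid)
        seen.add(uid)
    return repeated
```

With this, `repeated_uids("mn/part1/part1")` reports `part1`, while "dn/vagga1" and "mn/vagga1" both pass, even though the two share the uid `vagga1`, because they sit in different branches.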

Addressing can be done using “bucket of uids”.

Following on from the above, as uids are not unique there needs to be a way to uniquely address a node. This is handled by what I call a “bucket of uids”: you can take a bunch of uids and throw them at the API, and it’ll return whatever matches based on the uid of the node and the uids of its ancestors. This makes it easy to construct URIs:

  • “dn” or “dn1” - returns a unique result, so a bucket can just have a single uid. That’s fine.
  • “dn/vagga1” or “mn/vagga1” - This uniquely refers to the first vagga of the DN or MN respectively.
  • “mn10/sujato” or “mn10/bodhi” - Another example of “bucket of UIDs” - the API doesn’t care what order the uids are in.
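As a rough sketch of how this matching could work (this is not the actual loader or API code): the `Node` class, the `resolve` function and the sample data are all made up for illustration. The assumption is that a node matches when its own uid is in the bucket and every other uid in the bucket is one of its ancestors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    uid: str
    ancestors: tuple[str, ...]  # uids from the top of the tree down to the parent


def resolve(bucket_uri: str, nodes: list[Node]) -> list[Node]:
    """Return every node matched by a slash-separated bucket of uids."""
    # A set throws away the order, so "dn/vagga1" and "vagga1/dn" are equivalent.
    bucket = set(bucket_uri.strip("/").split("/"))
    return [
        node
        for node in nodes
        if node.uid in bucket and bucket <= {node.uid, *node.ancestors}
    ]


# Hypothetical sample data: two branches that both contain a "vagga1".
NODES = [
    Node("dn", ("root", "pi", "su")),
    Node("vagga1", ("root", "pi", "su", "dn")),
    Node("dn1", ("root", "pi", "su", "dn", "vagga1")),
    Node("mn", ("root", "pi", "su")),
    Node("vagga1", ("root", "pi", "su", "mn")),
]
```

On this data, `resolve("vagga1", NODES)` returns both vaggas, while `resolve("dn/vagga1", NODES)` and `resolve("vagga1/dn", NODES)` both return only the one under dn.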

A big advantage of this approach is that you end up with a cleaner structure: there is no need to have dn- everywhere, as in “/dn/dn-vagga1/dn1.html” (of course, you could if you wanted to).

Whatever is used can potentially end up in a URI, so while the data loader doesn’t care, we’d need to consider whether we prefer “mn/mulapannasapali” or “mn/part1” or something else like “mn/first-fifty”. One criterion for deciding which is better is that a highly unique uid allows addressing that thing with a short or single-uid URI. A multi-part URI should generally be considered an exceptional way to address nodes (for example alternative editions, or a weird kind of parallel relationship); important nodes should be referenceable by a single uid. So on the whole it doesn’t matter too much what the uids of these “intermediate” nodes are, as they won’t be seen very often anyway. They only form part of the URL when it is completely necessary; otherwise they mainly just become something to hang a name off of.

My idea is that while it won’t normally be needed, these kinds of URIs could also be used in relationships, so you could define a relationship between one vagga and another, or something else if you need to, using a URI-style reference like “an1/vagga1” (the slash is not used in uids, so this is nicely unambiguous).

Taken together, this also means that in principle you could arrange the vinaya with a structure like “pi/tv/bu-vb/pj1”. There might still be good reasons to use a hyphenated composite uid though, including “if it’s not broke, don’t fix it”. For the most part the design here is intended to work seamlessly with the current URL structure, while accommodating what are currently fringe cases but will become more important in the future, like multiple editions or versions.

A final thing to note here is that when a “bucket of uids” resolves not to a single node but to several, that is basically a case to be handled by the view. Say, for example, the URI is “en/mn10”, which resolves to both a sujato and a bodhi version. The view might show the sujato version of the text by default, and then it could either just plain ignore the bodhi version, or create a link saying “Alternative translation by Bodhi” pointing to the more specific “en/mn10/bodhi”, or it could even be fair and create a link to both. For the most part this is of no concern to the backend API; its job is to deliver everything that matches. However, the API should have the ability to construct “minimal unambiguous URIs”, that is, it should be able to tell the client the most concise way to refer to a node, with as many intermediate uids removed as possible, which is useful for constructing a URL. It is also probably useful if some defaulting works at a pretty deep level, for example a match from the /root tree taking absolute precedence over a match from the /translation tree.
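One way the “minimal unambiguous URI” idea could be sketched is a brute-force search: try the node's own uid alone, then add one ancestor, then two, and stop at the first bucket that resolves to exactly that node. Everything here is a hypothetical illustration rather than the real API: the `UidPath` alias, both function names and the sample paths are assumptions, and the matching rule is the one assumed above (own uid in the bucket, the rest among the ancestors).

```python
from itertools import combinations
from typing import Optional

UidPath = tuple[str, ...]  # uids from the top of the tree down to the node itself


def resolve(bucket: set[str], paths: list[UidPath]) -> list[UidPath]:
    """All nodes whose own uid is in the bucket and whose path covers the bucket."""
    return [p for p in paths if p[-1] in bucket and bucket <= set(p)]


def minimal_uri(target: UidPath, paths: list[UidPath]) -> Optional[str]:
    """Shortest bucket (as a URI) that resolves to the target and nothing else."""
    own_uid, ancestors = target[-1], target[:-1]
    for extra in range(len(ancestors) + 1):  # fewest extra uids first
        for chosen in combinations(ancestors, extra):
            if resolve({own_uid, *chosen}, paths) == [target]:
                return "/".join([*chosen, own_uid])
    return None  # even the full path is ambiguous


# Hypothetical sample paths.
PATHS = [
    ("root", "pi", "su", "dn"),
    ("root", "pi", "su", "dn", "vagga1"),
    ("root", "pi", "su", "dn", "vagga1", "dn1"),
    ("root", "pi", "su", "mn"),
    ("root", "pi", "su", "mn", "vagga1"),
]
```

Here the text dn1 needs only its own uid, while the first vagga of the DN needs one extra uid and comes out as “dn/vagga1”. When several buckets of the same size would work, this sketch simply returns the first one it finds; a real implementation might prefer a particular ancestor.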

“Property Files” allow context-sensitive naming

A general problem with this scheme is how to give properties like names to things like “su” or “vagga1”, which are otherwise only defined as a folder; a file system does not really provide a way to attach additional properties to a folder.

This introduces the concept of a “Property File”, which is a JSON file containing a mapping of uids to properties.

For example “root/pitaka.json”:

  {
    "su": "Sutta",
    "vi": "Vinaya",
    "ab": "Abhidhamma"
  }

When the loader encounters a uid, say “su”, it wonders “whatever could this be?” It then searches the parent and ancestors for a property file that defines “su”; in this case it’ll bubble up until it encounters “pitaka.json”. As it finds su in a file called “pitaka”, this informs it that su has the type of pitaka and the name of Suttas. However, this is context sensitive. For example, in the Chinese subtree it might bubble up to “lzh/pitaka.json”, which calls it “Sutras”, so “pi/su” can be “Pali -> Suttas” while “lzh/su” can be “Chinese -> Sutras”, although in both cases the type is “pitaka”. Types would not be too important on the whole, but they would be used for consistent formatting. It should be noted here that the system is satisfied by the first property file it finds, so a default could be used at a high level, which can be overridden at lower levels.

This is also the way that vaggas would get their context-appropriate names, for example “en/pi/su/dn/vagga.json” provides the specific names relevant to the Digha Nikaya’s vaggas.
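The bubbling lookup can be sketched roughly as follows. This is not the actual loader code: the function name `lookup_properties` and its parameters are invented for illustration, and it assumes that any JSON file in a folder holding an object keyed by uid counts as a property file. It accepts a uid mapped to a plain name string, and also a uid mapped to an object of several properties.

```python
import json
from pathlib import Path
from typing import Optional


def lookup_properties(uid: str, start_dir: Path, tree_top: Path) -> Optional[dict]:
    """Search start_dir, then each ancestor up to tree_top, for a property
    file that defines uid. The first (nearest) definition wins."""
    current = start_dir
    while True:
        for prop_file in sorted(current.glob("*.json")):
            mapping = json.loads(prop_file.read_text(encoding="utf-8"))
            if isinstance(mapping, dict) and uid in mapping:
                value = mapping[uid]
                # A bare string is taken as the name; an object is taken as-is.
                props = {"name": value} if isinstance(value, str) else dict(value)
                # The type comes from the file name, e.g. pitaka.json -> "pitaka".
                return {"uid": uid, "type": prop_file.stem, **props}
        if current == tree_top or current.parent == current:
            return None  # reached the top without finding a definition
        current = current.parent
```

Starting the search in the folder that contains the node means a property file lower in the tree is found before a default higher up, which gives the override behaviour described above.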

If more properties than just type and name need to be applied, the extended form is like this:

  {
    "pi": {
      "name": "Pāli",
      "priority": "3",
      "iso_code": "pi"
    },
    "skt": {
      "name": "Sanskrit",
      "priority": "4",
      "iso_code": "sa"
    }
  }

Alternative way of assigning properties (standoff properties)

[Not Implemented (yet)]

It would probably also be desirable to have another way of bulk-assigning properties. An example is volpage references: while primary references can be read directly from the text files, it would also be desirable to add other volpage entries, for instance from other manuscript editions. This would probably involve a third kind of mapping, which I’ve not fully thought through. This is very closely related to relationships: a volpage mapping is an awful lot like a parallel, so it is likely that this could also be managed through a relationship mapping.

One practical consideration here is keeping the data tree lean and mean for offline and client-side use. You don’t necessarily want to integrate a bunch of data into the main tree, especially if it is bulky and of no interest to 99.9% of users. Having it as standoff properties would be more practical, as it could then be downloaded only if the user expresses an interest in it.

The Client knows nothing of the above

The Loader constructs a JSON structure and parts or all of that structure can be delivered to the Client. But the Client does not know about property files and such, that all gets baked into the JSON which is delivered.

Say for example the client asks for dn, it will get something like this:

  {
    "dn": {
      "uid": "dn",
      "type": "division",
      "name": "Dīgha Nikāya",
      "children": [
        {
          "uid": "vagga1",
          "type": "vagga",
          "name": "Sīlakkhandha Vagga",
          "children": [
            {
              "uid": "dn1",
              "type": "text",
              "name": "Brahmajāla"
            }
          ]
        }
      ],
      "ancestors": [
        {
          "uid": "su",
          "type": "pitaka",
          "name": "Suttas"
        },
        {
          "uid": "pi",
          "type": "language",
          "name": "Pali"
        },
        {
          "uid": "root",
          "type": "tree"
        }
      ]
    }
  }

One of the consequences of this is that if for example the server side functions are replicated in a service worker, that code running in the service worker doesn’t need to know about how the file reading and property file side of things works.

This also means that the folder structure + property file loader isn’t the only way to make the baked JSON. In fact, another way is to include “prebaked” JSON directly; for example, this is how collections which we don’t have texts for are most easily done.
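To illustrate the delivery step described in this section, here is a rough Python sketch that works purely on an already baked tree, whether it came from the loader or was prebaked. The function name `deliver` and the shape of `TREE` are assumptions made for the example. It only handles a single uid and returns the first match it finds.

```python
from typing import Optional

# Hypothetical baked tree, as the loader might hold it in memory.
TREE = {
    "uid": "root", "type": "tree",
    "children": [{
        "uid": "pi", "type": "language", "name": "Pali",
        "children": [{
            "uid": "su", "type": "pitaka", "name": "Suttas",
            "children": [{
                "uid": "dn", "type": "division", "name": "Dīgha Nikāya",
                "children": [{
                    "uid": "vagga1", "type": "vagga", "name": "Sīlakkhandha Vagga",
                    "children": [{"uid": "dn1", "type": "text", "name": "Brahmajāla"}],
                }],
            }],
        }],
    }],
}


def deliver(node: dict, uid: str, ancestors: tuple = ()) -> Optional[dict]:
    """Depth-first search for uid; return its subtree plus its ancestors,
    nearest ancestor first, keyed by the uid that was asked for."""
    if node["uid"] == uid:
        return {uid: {**node, "ancestors": list(ancestors)}}
    # Ancestors are passed along without their children to keep the payload small.
    own_fields = {k: v for k, v in node.items() if k != "children"}
    for child in node.get("children", []):
        found = deliver(child, uid, (own_fields, *ancestors))
        if found is not None:
            return found
    return None
```

Nothing in `deliver` touches folders or property files, so the same logic could in principle run wherever the baked JSON is available.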


Thanks, that all seems clear. One question only.

I didn’t entirely understand the point about a volpage mapping being like a parallel. They are different in that parallels refer to different things, while vol/page is an alternative way of referring to the same thing. In addition, vol/page does not always map 1-to-1 onto the semantic divisions: many suttas can be on one page, etc.

But anyway, the basic idea is clear enough: a vol/page mapping can be developed as standoff JSON.

What I mean is that the mapping between the same sutta in one manuscript (say Thai) and another (say Burmese) is just a much stronger equivalence than the mapping between the same sutta in Pali and Chinese; they have just had fewer forces causing divergence and/or more forces preventing divergence. The point at which you can say they have actually diverged into being different things rather than being variations on the same thing is fairly arbitrary.

The mapping between our sutta uid and the mahasangiti “p_” numbers is the only truly special case, since it is truly exact. From there, there are mappings between the p_ numbers and page numbers in other manuscripts/editions, from which you can infer where a sutta or paragraph is in those other manuscripts/editions. Since pages are a “flattened” structure (devoid of hierarchy), this dataset also ends up being just a big table of “this is equivalent to that”, which in form is much more like a list of parallel relationships.
