New structural data format

blake · August 3, 2016, 5:45pm

This is essentially the design document for the new structural data.

## Why a Tree doesn’t work

A tree starts with a root and then branches out. For our purposes this almost works, almost, but not quite.

On closer examination, while the basic structure of, say, the Digha Nikaya does form a nice tidy tree, when we consider the larger structure of the collections we realize a tree doesn’t work:

The Digha Nikaya is in the Sutta Pitaka, the Pali Language and the Theravada Tradition

However, it would be completely wrong to say that Theravada comes under Sutta, or that Sutta comes under Theravada. When we extend this to other languages, we cannot say that Sarvāstivāda comes under Chinese when it also appears under Sanskrit.

So instead we use a Graph where a node can have multiple parents, and multiple children, for instance starting with DN:

      su  pi  tv     parents
       \  |  /
          dn
        /  |  \
      dn1 dn2 dn3    children

In our case, the Graph is still pretty Tree-like, there’s a pretty clear flow. It’s still essentially a top-to-bottom hierarchy. In lay speak you could even get away with calling it a tree (i.e. as in a family tree) but in computer science lingo it’s a Graph.

In fact it’s worth noting that when you start at any one node, traversal downwards (or upwards) results in an ordinary tree: each node has a tree of descendants and a tree of ancestors. This is really handy for rendering views because it keeps things simple. Trees of descendants are really straightforward to work with, so for practical purposes the data looks like a tree when you’re working with it.
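As a rough illustration of the diagram above, here is a minimal Python sketch of a node with multiple parents and multiple children. The `Node` class and the `link`, `descendants` and `ancestors` helpers are names made up for this example, not part of any existing code.

```python
class Node:
    """Hypothetical graph node: any number of parents and children."""

    def __init__(self, uid):
        self.uid = uid
        self.parents = []
        self.children = []


def link(parent, child):
    """Record the edge in both directions."""
    parent.children.append(child)
    child.parents.append(parent)


def descendants(node):
    """Yield every node below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def ancestors(node):
    """Yield every node above `node`, depth first."""
    for parent in node.parents:
        yield parent
        yield from ancestors(parent)


nodes = {uid: Node(uid) for uid in ["su", "pi", "tv", "dn", "dn1", "dn2", "dn3"]}
for parent_uid in ["su", "pi", "tv"]:
    link(nodes[parent_uid], nodes["dn"])
for child_uid in ["dn1", "dn2", "dn3"]:
    link(nodes["dn"], nodes[child_uid])

print([n.uid for n in descendants(nodes["dn"])])        # ['dn1', 'dn2', 'dn3']
print(sorted(n.uid for n in ancestors(nodes["dn1"])))   # ['dn', 'pi', 'su', 'tv']
```

Starting from dn and walking down only ever touches dn's own descendants, which is why a view can treat them as a plain tree.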

Implementing this in JSON

JSON cannot natively represent graphs. Binary formats such as Python’s pickle can, and YAML kind of can. But we can still use JSON if we use uids as references:

"dn": {
  "type": "division",
  "name": "Dīgha Nikāya",
  "children": ["dn-vagga1", "dn-vagga2", "dn-vagga3"],
  "parents": ["pi", "su", "tv"]
}

These “children” and “parents” properties are the glue that fits the tree together. Either children or parents can be used to define these relationships, that is you could take the “su” object:

"su": {
    "type": "pitaka",
    "name": "Sutta Pitaka",
    "children": ["dn", "mn", "sn", "an", "kn", "da", "ma", ...]
}

And the result would be precisely identical to adding “su” to the parents property of each division. This means that whatever is superior from an organizational perspective can be used.

Note that the uids in parents/children function as placeholders, so you can place the real object directly instead of the placeholder. This doesn’t make much sense for parents, but it is useful for children, as you can include a large sub-tree in a single JSON file.
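To make the "children or parents, whichever is more convenient" point concrete, here is a small Python sketch that collects the edges declared by a set of records. The `edges` function and the two sample dictionaries are illustrative assumptions; the only thing taken from the format is that "children" and "parents" hold lists of uids.

```python
def edges(records):
    """Return the set of (parent_uid, child_uid) pairs the records declare."""
    found = set()
    for uid, record in records.items():
        for child_uid in record.get("children", []):
            found.add((uid, child_uid))
        for parent_uid in record.get("parents", []):
            found.add((parent_uid, uid))
    return found


# The same relationships, declared once from above and once from below.
by_children = {"su": {"children": ["dn", "mn"]}, "dn": {}, "mn": {}}
by_parents = {"su": {}, "dn": {"parents": ["su"]}, "mn": {"parents": ["su"]}}

print(edges(by_children) == edges(by_parents))  # True
```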

# Directory Structure is for Organizational Purposes Only

I had contemplated using directory structure to build the tree. Now that I’ve decided that a Graph is better than a Tree, this idea goes out the window, since a file system can barely represent a Graph (it can with symlinks, but just because you can…). The JSON data can be divided up into as many files and folders as desired; ultimately it is the uids in the parents and children properties that unify the data.

Note that there can be special files, such as “props.json” which defines property names. These are loaded first and are used to configure the loading process.

Verbosity Reduction Syntax

Since verbosity is bad and since a pre-processing step is mandatory I have come up with a shortcut syntax:

First of all, here is what a relatively full json tree would look like:

{
    "uid": "dn",
    "type": "division",
    "name": "Dīgha Nikāya",
    "parents": ["su", "pi", "tv"],
    "acronym": "DN",
    "children": [
        {
            "uid": "dn-vagga1",
            "type": "vagga",
            "name": "Sīlakkhandha Vagga",
            "children": [
                {
                    "uid": "dn1",
                    "type": "sutta",
                    "name": "Brahmajāla"
                },
                {
                    "uid": "dn2",
                    "type": "sutta",
                    "name": "Sāmaññaphala"
                },
                {
                    "uid": "dn3",
                    "type": "sutta",
                    "name": "Ambaṭṭha"
                },
                ...
            ]
        },
        ...
    ]
}

And this is what the shortcut (DRY) syntax looks like:

{
    "dn": {
        "type": "division",
        "name": "Dīgha Nikāya",
        "parents": ["su", "pi", "tv"],
        "acronym": "DN",
        "dn-vagga1": {
            "type": "vagga",
            "name": "Sīlakkhandha Vagga",
            "_children_type": "sutta",
            "dn1": "Brahmajāla",
            "dn2": "Sāmaññaphala",
            "dn3": "Ambaṭṭha",
            ...
        }
    }
}

### Features:

It’s not necessary to have an explicit “children” property, instead children can be included as key: value pairs. The key becomes the uid and if the value is a string (rather than an object) it becomes the name. Finally a special underscored property _children_type can be used to set a property automatically for all children, basically saying “all these children are suttas”.
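The expansion from shortcut to long-hand could look something like the following Python sketch. It is only a sketch under assumptions: the `expand` function is hypothetical, and the `KNOWN_PROPS` set stands in for whatever property names a file like props.json would supply. Explicit "children" ordering arrays are out of scope here.

```python
# Assumption: in practice these names would be loaded from props.json.
KNOWN_PROPS = {"type", "name", "parents", "acronym", "children"}


def expand(uid, value, inherited_type=None):
    """Expand one shortcut entry into a long-hand object."""
    if isinstance(value, str):
        value = {"name": value}          # a bare string is the child's name
    node = {"uid": uid}
    if inherited_type is not None:
        node["type"] = inherited_type    # set by the parent's _children_type
    children_type = value.get("_children_type")
    children = []
    for key, item in value.items():
        if key == "_children_type":
            continue
        if key in KNOWN_PROPS:
            node[key] = item             # ordinary property, copied as-is
        else:
            children.append(expand(key, item, children_type))  # key is a child uid
    if children:
        node["children"] = children
    return node


shortcut = {
    "dn": {
        "type": "division",
        "name": "Dīgha Nikāya",
        "dn-vagga1": {
            "type": "vagga",
            "name": "Sīlakkhandha Vagga",
            "_children_type": "sutta",
            "dn1": "Brahmajāla",
            "dn2": "Sāmaññaphala",
        },
    }
}
long_form = [expand(uid, value) for uid, value in shortcut.items()]
print(long_form[0]["children"][0]["children"][0])
# {'uid': 'dn1', 'type': 'sutta', 'name': 'Brahmajāla'}
```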

This shortcut syntax makes for much shorter and more human-scannable data, being between 1/3rd and 1/5th as long, fitting much more readily on the page of a text editor.

The shortcut syntax does have a few limitations. Because JSON object keys are unordered by default, the loader sorts them using an alpha-numeric sort. 99% of the time this results in a correct ordering, but if the children cannot be sorted by uid (for example subdivisions in the kn), it is necessary to include explicit ordering information using the children array. It is sufficient to simply list the uids in the correct order.

Other effort savers: The “acronym” property defines a relationship between the uid and the value of acronym, so in the above case it adds the rule dn → DN, there is no need to specify that “dn1” expands to “DN 1” because that can be inferred.
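As a sketch of how that inference might work, the following Python function applies a uid-to-acronym rule to a longer uid. The function name, the regular expression, and the assumption that a uid is a lowercase prefix followed by a number are all illustrative guesses.

```python
import re


def expand_acronym(uid, rules):
    """Turn 'dn1' into 'DN 1' given a rule such as {'dn': 'DN'}."""
    match = re.fullmatch(r"([a-z]+)([0-9].*)?", uid)
    if match is None or match.group(1) not in rules:
        return None                      # no rule applies to this uid
    prefix, number = match.groups()
    acronym = rules[prefix]
    return f"{acronym} {number}" if number else acronym


rules = {"dn": "DN"}                     # what "acronym": "DN" on dn would add
print(expand_acronym("dn", rules))       # DN
print(expand_acronym("dn1", rules))      # DN 1
print(expand_acronym("dn-vagga1", rules))  # None
```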

## Construction of an in-memory graph:

First, the shorthand syntax is expanded into the long-hand form.
Then, once everything has been loaded, the uid strings in the parents/children arrays are converted to live references and the inverse relationships are created.
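The second step could be sketched in Python roughly as follows. This assumes the long-hand data has already been flattened into one dictionary keyed by uid; `Node`, `connect` and `build_graph` are hypothetical names, and a uid that was never loaded would simply raise a KeyError here.

```python
class Node:
    def __init__(self, uid, props):
        self.uid = uid
        self.props = props
        self.parents = []
        self.children = []


def connect(parent, child):
    """Create the edge and its inverse, skipping edges that already exist."""
    if child not in parent.children:
        parent.children.append(child)
    if parent not in child.parents:
        child.parents.append(parent)


def build_graph(records):
    """Turn {uid: record} into {uid: Node} with live parent/child references."""
    nodes = {}
    for uid, record in records.items():
        props = {k: v for k, v in record.items() if k not in ("parents", "children")}
        nodes[uid] = Node(uid, props)
    # Second pass: every uid now has a node, so the strings can be resolved.
    for uid, record in records.items():
        for child_uid in record.get("children", []):
            connect(nodes[uid], nodes[child_uid])
        for parent_uid in record.get("parents", []):
            connect(nodes[parent_uid], nodes[uid])
    return nodes


records = {
    "su": {"type": "pitaka", "children": ["dn"]},
    "pi": {"type": "language"},
    "dn": {"type": "division", "parents": ["su", "pi"]},
}
graph = build_graph(records)
print([p.uid for p in graph["dn"].parents])   # ['su', 'pi']
print([c.uid for c in graph["pi"].children])  # ['dn']
```

Note that the su/dn edge is declared from both sides in the sample data but only stored once.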

## Usage of the in-memory nodes:
Presently, if you want to find the language of, say, a subdivision, you need to go subdivision.division.collection.lang. A subdivision does not know what lang it has; these things are only attached to certain object types, even though they are common to all descendants.

In the new model something like language is inheritable. If you say “subdivision.language”, the code first checks whether there is a language property on subdivision. If there is no direct property by that name, it propagates up through the ancestors looking for a node of type “language”. For example, starting from dhp, it will propagate up to kn, then up to “pi”; “pi” is of type “language”, so dhp.language will return the “pi” node. This is designed to be super friendly for templates: effectively a high-level object becomes a property on all descendants.
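One way this lookup might be written in Python is with `__getattr__`, which is only called when ordinary attribute lookup fails. The class below is a sketch under assumptions: the `Node` layout, the breadth-first search order, and the sample types and names given to kn, dhp and pi are all illustrative.

```python
from collections import deque


class Node:
    def __init__(self, uid, type, **props):
        self.uid = uid
        self.type = type
        self.props = props
        self.parents = []

    def __getattr__(self, name):
        # Only reached when normal attribute lookup has already failed.
        if name.startswith("_"):
            raise AttributeError(name)
        props = self.__dict__.get("props", {})
        if name in props:
            return props[name]           # direct property on this node
        # Otherwise search upwards; breadth first, so the nearest match wins.
        queue = deque(self.__dict__.get("parents", []))
        while queue:
            ancestor = queue.popleft()
            if ancestor.type == name:
                return ancestor
            queue.extend(ancestor.parents)
        raise AttributeError(name)


pi = Node("pi", "language", name="Pali")
kn = Node("kn", "division")
dhp = Node("dhp", "subdivision")
kn.parents.append(pi)
dhp.parents.append(kn)

print(dhp.language.uid)    # pi
print(dhp.language.name)   # Pali
```

Because a missing property raises AttributeError, `hasattr(dhp, "era")` is False until some ancestor of type "era" exists.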

Other possibilities of Parents/Children:

As previously noted, a node can have any number of parents, and requesting a property will search up through the ancestors. This can be used to easily attach a property to many different uids without needing to edit those files.

For example we might decide we want to be able to group suttas by speaker:

{
    "uid": "speaker-sariputta",
    "type": "speaker",
    "name": "Sariputta",
    "children": ["dn33", "dn34", ...]
}

If we were to attempt this by putting a property on sutta objects, we run into a few problems. First is that the entries which need to be changed are all over the place. Secondly, while most suttas have a single speaker, a few suttas have two or more speakers. With this method there is no problem at all with having multiple speakers for the same uid.

Another example would be handling Vol/Page data in this manner, instead of attaching it directly to a sutta, we could define an entry along the lines of:

{
    "uid": "pts-dn-book1",
    "type": "book",
    "name": "Digha Nikaya PTS edition",
    "children_data": {
        "dn1": "DN i 1",
        "dn2": "DN i 47",
        "dn3": "DN i 87",
        "dn4": "DN i 111",
        ...
    }
}

Now if you check the “book” property of dn1, it propagates up and finds that PTS book which contains the relevant data. Nice thing is, if you then add more printed/manuscript editions all of them will be properly associated.
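How the reference is actually read back is not spelled out, so the following Python sketch is only one possible approach: it scans "book" records for a children_data entry matching the uid. The `book_refs` helper is hypothetical; the sample record repeats the first three entries from the example above.

```python
pts_book = {
    "uid": "pts-dn-book1",
    "type": "book",
    "name": "Digha Nikaya PTS edition",
    "children_data": {"dn1": "DN i 1", "dn2": "DN i 47", "dn3": "DN i 87"},
}


def book_refs(uid, records):
    """Collect (book uid, reference) pairs from every book that mentions `uid`."""
    return [
        (record["uid"], record["children_data"][uid])
        for record in records
        if record.get("type") == "book" and uid in record.get("children_data", {})
    ]


print(book_refs("dn2", [pts_book]))  # [('pts-dn-book1', 'DN i 47')]
```

Adding another edition would just mean adding another record to the list; nothing on the sutta itself changes.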

Another practical example:

{
  "uid": "ebt",
  "name": "Early Buddhist Texts",
  "type": "era",
  "children": [...]
}

Then that could be used as an entry point for rendering a list of EBT suttas, or it could be used as sutta.era to appear as a column in a division table.

The important thing is this flexibility is baked into the data format and conventions, practically arbitrary groupings can be defined in the data. Want to additionally add “script” information to books? You could create a new object with type “script” which has the relevant books as children.

More generally, something like the ebt example becomes wonderfully centralized: you don’t need to hunt through a lot of different files to find where to make changes.

Views on data:

When the data is loaded into memory, it’s put into a Graph structure. Any uid can be used as the entry point for creating a view. As noted previously, there are non-interacting separate “roots”, like pitaka, language and tradition. Often you want these things to form a single tree. Happily there is a powerful feature called “groupby” available in both Python (itertools.groupby) and Jinja (the groupby filter), so using standard template features you can display texts by pitaka, grouped by language and grouped by tradition.
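On the Python side, grouping could be sketched with `itertools.groupby`, which needs its input sorted by the grouping key first. The `group_by` helper and the flat sample rows are assumptions for illustration; the "zh" language uid in particular is made up.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative rows only; real nodes would resolve these via their ancestors.
texts = [
    {"uid": "dn", "pitaka": "su", "language": "pi"},
    {"uid": "da", "pitaka": "su", "language": "zh"},
    {"uid": "mn", "pitaka": "su", "language": "pi"},
]


def group_by(items, key):
    """Group rows by `key`; groupby only merges adjacent rows, hence the sort."""
    keyfunc = itemgetter(key)
    ordered = sorted(items, key=keyfunc)
    return {value: [row["uid"] for row in rows]
            for value, rows in groupby(ordered, key=keyfunc)}


print(group_by(texts, "language"))  # {'pi': ['dn', 'mn'], 'zh': ['da']}
print(group_by(texts, "pitaka"))    # {'su': ['dn', 'da', 'mn']}
```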

The power of this data model

Previously I used an example of adding an “EBT” group. Now here’s where things get really neat: with that example, you only need to add that entry to the data, then you can go to the template and use {{sutta.era}}. Adding a feature like this requires no changes to the Python code at all; it can be done purely at the level of the data and the templates. That’s a big effort saver and reliability improvement.

The distinction between the structural data and relationships data

There are some superficial similarities between the two formats. For example both use lists of uids.

The main distinction is the structural data is for creating hierarchical trees. Descending (to build views) and ascending (to get properties) both form neat and tidy trees.

The parallels data is about cross-linking. If you tied all the parallels data together it’d create a great big ball of interconnectivity with no discernible structure at all.

sujato · August 3, 2016, 11:00pm

Wow, sounds great. As usual I have to do some research to catch up, but I think it makes sense!

Indeed. This is one of the niggly things that I’ve never had the conceptual language to express precisely.

Would it be worthwhile to use one of the JSON styles that have been developed specifically for Graphs? Eg. JSON Graph Format or JSON Graph.

What about cases, of which there are now many, where we deal with relations between things that aren’t suttas, i.e. sutta ranges, or sutta sections? Or is this a separate issue, confined to the relationship data?

Which sounds like it will be perfect for i18n.

Perhaps not directly relevant, but how can we handle “in-between” data? For example, in the PTS dict, the refs are in vol/page form. But typically they don’t point to the first page of the text, especially in long suttas. So we want to not just associate dn1 with DN i 1 but also DN i 4 and so on. To complicate this we can also associate this with chapter/section numbering, and potentially other systems.

The Pali Vinaya has four sets of hierarchies! (pts-vp-en, pts-vp-pi, chapter/section, and by rule number. Five if you include the bhanavaras.) We should be able to universally and seamlessly convert between any of these.

Very nice. Hopefully we can add the other scanned images of printed editions at some point.

Vimala · August 4, 2016, 10:20am

I had been thinking about the tree structure and came across the same issues. I was wondering how you were going to solve that. This sounds very logical and practical to work with.

More related to another discourse post on parallels: Display strategies for new data tables - #35 by sujato
The same issue applies to the representation of the parallels. The Taisho numbers also don’t always correspond directly with the exact text you want to mark as parallel, and neither do the PTS references.

sujato · August 4, 2016, 10:24am

Indeed, there needs to be some kind of fuzzy logic to deal with such imperfections. Of course, the most widely available and sophisticated fuzzy logic machine is the human brain. If we can at least bring people close to the right place, hopefully people can figure out if they’ve landed a sutta too early or too late. Having said which, I use the vol/page refs for the PTS page images, and most of the time it works great.