This is essentially the design document for the new structural data.
##Why a Tree doesn’t work
A tree starts with a root and then branches out. For our purposes this almost works, almost, but not quite.
On closer examination, while the basic structure of, say, the Digha Nikaya does form a nice tidy tree, the larger structure of the collections does not:
The Digha Nikaya is in the Sutta Pitaka, the Pali Language and the Theravada Tradition.
However, it would be completely wrong to say that Theravada comes under Sutta, or that Sutta comes under Theravada. When we extend this to other languages, we cannot say that Sarvāstivāda comes under Chinese when it also appears under Sanskrit.
So instead we use a Graph, where a node can have multiple parents and multiple children. For instance, starting with DN:
 su  pi  tv      parents
   \  |  /
     dn
   /  |  \
dn1  dn2  dn3    children
In our case the Graph is still pretty Tree-like: there’s a clear flow, and it’s still essentially a top-to-bottom hierarchy. In lay speak you could even get away with calling it a tree (as in a family tree), but in computer science lingo it’s a Graph.
In fact it’s worth noting that when you start at any one node, traversal downwards (or upwards) results in an ordinary tree: each node has a tree of descendants and a tree of ancestors. This is really handy for rendering views because it keeps things simple. Trees of descendants are straightforward to work with, so for practical purposes the data looks like a tree when you’re working with it.
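As a minimal sketch of that idea, assume the graph is held as a plain dict mapping each uid to its list of child uids. The `CHILDREN` table and both helper names are hypothetical, not part of the actual loader. Walking down from any node gives a nested tree, and walking up gives the set of ancestors:

```python
# Hypothetical adjacency table: uid -> list of child uids.
# "dn" sits under three separate roots, so the whole thing is a graph.
CHILDREN = {
    "su": ["dn"],
    "pi": ["dn"],
    "tv": ["dn"],
    "dn": ["dn1", "dn2", "dn3"],
}


def descendant_tree(uid, children=CHILDREN):
    """Return the descendants of uid as a nested dict (an ordinary tree)."""
    return {
        child: descendant_tree(child, children)
        for child in children.get(uid, [])
    }


def ancestors(uid, children=CHILDREN):
    """Return the set of every uid from which uid can be reached."""
    direct = {parent for parent, kids in children.items() if uid in kids}
    found = set(direct)
    for parent in direct:
        found |= ancestors(parent, children)
    return found
```

Starting from "su", "pi" or "tv" gives the same subtree under "dn", which is why a view rendered from any one entry point can treat the data as a tree.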
##Implementing this in JSON
JSON cannot natively represent graphs. Binary formats such as Python's pickle can, and YAML kind of can (through anchors and aliases). But we can still use JSON if we use uids as references:
"dn": {
    "type": "division",
    "name": "Dīgha Nikāya",
    "children": ["dn-vagga1", "dn-vagga2", "dn-vagga3"],
    "parents": ["pi", "su", "tv"]
}
These “children” and “parents” properties are the glue that holds the structure together. Either children or parents can be used to define a relationship. For example, you could take the “su” object:
"su": {
    "type": "pitaka",
    "name": "Sutta Pitaka",
    "children": ["dn", "mn", "sn", "an", "kn", "da", "ma", ...]
}
And the result would be identical to adding “su” to the parents property of each division. This means that whichever form is better from an organizational perspective can be used.
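To make the equivalence concrete, here is a small sketch. The `edges` helper is a hypothetical name, and the entries are cut down to the fields that matter. It reduces both styles of declaration to the same set of (parent, child) pairs:

```python
def edges(entries):
    """Collect (parent, child) pairs from 'children' and 'parents' lists alike."""
    pairs = set()
    for uid, entry in entries.items():
        for child in entry.get("children", []):
            pairs.add((uid, child))
        for parent in entry.get("parents", []):
            pairs.add((parent, uid))
    return pairs


# The relationship declared once, on the parent...
declared_on_parent = {
    "su": {"type": "pitaka", "children": ["dn", "mn"]},
    "dn": {"type": "division"},
    "mn": {"type": "division"},
}

# ...or declared on each child instead.
declared_on_children = {
    "su": {"type": "pitaka"},
    "dn": {"type": "division", "parents": ["su"]},
    "mn": {"type": "division", "parents": ["su"]},
}
```

Both dicts produce the same edge set, so a loader built along these lines would not care which side the relationship was written on.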
Note that the uids in parents/children function as placeholders: you can place the real object directly where its uid would go. This doesn’t make much sense for parents, but it is useful for children, as you can include a large sub-tree in a single JSON file.
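One way to treat inline objects and uid placeholders uniformly is to flatten the nested form back into one entry per uid. This is only a sketch: the `flatten` helper is hypothetical, and it assumes every inline object carries its own "uid" property.

```python
def flatten(entry, flat=None):
    """Replace inline child objects with their uids, collecting all entries by uid."""
    if flat is None:
        flat = {}
    copy = dict(entry)  # leave the caller's data untouched
    if "children" in copy:
        child_uids = []
        for child in copy["children"]:
            if isinstance(child, dict):
                flatten(child, flat)  # inline object: register it, keep its uid
                child_uids.append(child["uid"])
            else:
                child_uids.append(child)  # already a uid placeholder
        copy["children"] = child_uids
    flat[copy["uid"]] = copy
    return flat


nested = {
    "uid": "dn",
    "type": "division",
    "children": [
        {
            "uid": "dn-vagga1",
            "type": "vagga",
            "children": ["dn1", {"uid": "dn2", "type": "sutta"}],
        }
    ],
}
flat = flatten(nested)
```

After flattening, every children list holds only uid strings, whether the child was written inline or referenced.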
##Directory Structure is for Organizational Purposes Only
I had contemplated using the directory structure to build the tree. Now that I’ve decided that a Graph is better than a Tree, this idea goes out the window, since a file system can barely represent a Graph (it can with symlinks, but just because you can…). The JSON data can be divided up into as many files and folders as desired; ultimately it is the uids in the parents and children properties that unify the data.
Note that there can be special files, such as “props.json” which defines property names. These are loaded first and are used to configure the loading process.
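A loading pass along these lines might look like the sketch below. It makes several assumptions: each file holds one JSON object keyed by uid, “props.json” is also a plain JSON object, and `load_directory` is a hypothetical name. The point is only that the folder layout carries no meaning and the special files are read first.

```python
import json
from pathlib import Path


def load_directory(root):
    """Load every .json file under root, reading props.json files first."""
    paths = sorted(Path(root).rglob("*.json"))
    special = [p for p in paths if p.name == "props.json"]
    ordinary = [p for p in paths if p.name != "props.json"]

    props = {}
    for path in special:
        props.update(json.loads(path.read_text(encoding="utf-8")))

    entries = {}
    for path in ordinary:
        # Which folder a file sits in is ignored; only the uids inside matter.
        entries.update(json.loads(path.read_text(encoding="utf-8")))
    return props, entries
```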
##Verbosity Reduction Syntax
Since verbosity is bad, and since a pre-processing step is mandatory anyway, I have come up with a shortcut syntax.
First of all, here is what a relatively full JSON tree would look like:
{
    "uid": "dn",
    "type": "division",
    "name": "Dīgha Nikāya",
    "parents": ["su", "pi", "tv"],
    "acronym": "DN",
    "children": [
        {
            "uid": "dn-vagga1",
            "type": "vagga",
            "name": "Sīlakkhandha Vagga",
            "children": [
                {
                    "uid": "dn1",
                    "type": "sutta",
                    "name": "Brahmajāla"
                },
                {
                    "uid": "dn2",
                    "type": "sutta",
                    "name": "Sāmaññaphala"
                },
                {
                    "uid": "dn3",
                    "type": "sutta",
                    "name": "Ambaṭṭha"
                },
                ...
            ]
        },
        ...
    ]
}
And this is what the shortcut (DRY) syntax looks like:
{
    "dn": {
        "type": "division",
        "name": "Dīgha Nikāya",
        "parents": ["su", "pi", "tv"],
        "acronym": "DN",
        "dn-vagga1": {
            "type": "vagga",
            "name": "Sīlakkhandha Vagga",
            "_children_type": "sutta",
            "dn1": "Brahmajāla",
            "dn2": "Sāmaññaphala",
            "dn3": "Ambaṭṭha",
            ...
        }
    }
}
###Features:
It’s not necessary to have an explicit “children” property; instead, children can be included as key: value pairs. The key becomes the uid, and if the value is a string (rather than an object) it becomes the name. Finally, a special underscored property, _children_type, can be used to set the type automatically for all children, basically saying “all these children are suttas”.
This shortcut syntax makes for much shorter and more human-scannable data: it is between a third and a fifth as long, and fits much more readily on the page of a text editor.
The shortcut syntax does have a limitation. Because JSON object keys are unordered by default, the loader sorts them with an alphanumeric sort. 99% of the time this results in the correct ordering, but if the children cannot be sorted by uid (for example, the subdivisions in the kn), it is necessary to include explicit ordering information by using the children array. It is sufficient to simply list the uids in the correct order.
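The expansion rules described above can be sketched roughly as follows. This is not the real loader. `PROPERTY_NAMES` stands in for whatever “props.json” would define, `expand` and `natural_key` are hypothetical names, and reading “alphanumeric sort” as a natural sort (so that dn2 comes before dn10) is my assumption.

```python
import re

# Assumed property names; in the design these would come from props.json.
PROPERTY_NAMES = {"type", "name", "parents", "acronym"}


def natural_key(uid):
    """Sort key comparing digit runs as numbers, so dn2 sorts before dn10."""
    parts = re.split(r"(\d+)", uid)
    # re.split with a capture group puts the digit runs at the odd indices.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


def expand(uid, value, inherited_type=None):
    """Expand one shortcut entry into longhand with an explicit children list."""
    if isinstance(value, str):
        value = {"name": value}  # a bare string is shorthand for the name
    node = {"uid": uid}
    if inherited_type is not None:
        node["type"] = inherited_type  # from the parent's _children_type
    inline = {}
    for key, item in value.items():
        if key in PROPERTY_NAMES:
            node[key] = item
        elif key != "children" and not key.startswith("_"):
            inline[key] = item  # any other key is treated as a child uid
    # An explicit children array fixes the order; otherwise sort by uid.
    order = value.get("children") or sorted(inline, key=natural_key)
    children_type = value.get("_children_type")
    children = [
        expand(child, inline[child], children_type)
        if isinstance(child, str) and child in inline
        else child
        for child in order
    ]
    if children:
        node["children"] = children
    return node
```

Calling `expand("dn", shortcut["dn"])` on the shortcut example turns it back into the long form. Where a "children" array is present, its order wins and the inline definitions are slotted into it.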
Other effort savers: the “acronym” property defines a relationship between the uid and the value of acronym, so in the above case it adds the rule dn → DN. There is no need to specify that “dn1” expands to “DN 1”, because that can be inferred.
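How exactly the inference works is not spelled out here, so the following is just one plausible reading. The hypothetical `display_acronym` helper looks for the longest rule whose uid is a prefix and treats whatever follows as the number, provided it starts with a digit.

```python
def display_acronym(uid, rules):
    """Expand a uid such as 'dn1' to 'DN 1' using rules such as {'dn': 'DN'}."""
    for prefix in sorted(rules, key=len, reverse=True):
        if not uid.startswith(prefix):
            continue
        rest = uid[len(prefix):]
        if rest == "":
            return rules[prefix]
        if rest[0].isdigit():
            return f"{rules[prefix]} {rest}"
    return None  # no rule applies, e.g. 'dn-vagga1'


rules = {"dn": "DN"}  # the rule added by "acronym": "DN" on the dn entry
```

With this reading, `display_acronym("dn1", rules)` gives "DN 1", while "dn-vagga1" gets no acronym because what follows the prefix is not a number.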
##Construction of an in-memory graph:
First, the shorthand syntax is expanded into the longhand.
Then, once everything has been loaded, the uid strings in the parents/children arrays are converted to live references and the inverse relationships are created.
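The second step could be sketched like this, assuming the longhand entries have already been collected into one flat dict keyed by uid. The `Node` class and `build_graph` function are hypothetical names for illustration only.

```python
class Node:
    """A graph node holding live references to its parents and children."""

    def __init__(self, uid, props):
        self.uid = uid
        self.props = props
        self.parents = []
        self.children = []


def build_graph(entries):
    """Turn a flat {uid: longhand entry} dict into linked Node objects."""
    structural = ("uid", "parents", "children")
    nodes = {
        uid: Node(uid, {k: v for k, v in entry.items() if k not in structural})
        for uid, entry in entries.items()
    }

    def link(parent, child):
        # Store each relationship once, in both directions.
        if child not in parent.children:
            parent.children.append(child)
            child.parents.append(parent)

    for uid, entry in entries.items():
        # A uid that was never loaded raises KeyError here.
        for child_uid in entry.get("children", []):
            link(nodes[uid], nodes[child_uid])
        for parent_uid in entry.get("parents", []):
            link(nodes[parent_uid], nodes[uid])
    return nodes
```

Because `link` always fills in both sides, it does not matter whether a relationship was declared through "children", through "parents", or through both.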
##Usage of the in-memory nodes:
Presently, if you want to find the language of, say, a subdivision, you need to go subdivision.division.collection.lang. A subdivision does not know what lang it has: these things are only attached to certain object types, even though they are common to all descendants.
In the new model, something like language is inheritable. If you say “subdivision.language”, the code first checks whether there is a language property on the subdivision itself. If there is no direct property by that name, it propagates up through the ancestors looking for a node of type “language”. For example, starting from dhp, it propagates up to kn and then up to “pi”; “pi” is of type “language”, so dhp.language returns the “pi” node. This is designed to be super friendly for templates: effectively, a high-level object becomes a property on all of its descendants.
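A rough sketch of that lookup follows. It uses a hypothetical `Node` class whose `__getattr__` checks the node's own properties first and then searches breadth-first up through the ancestors for a node of the requested type. The constructor signature and the search order are assumptions.

```python
from collections import deque


class Node:
    def __init__(self, uid, node_type, parents=(), **props):
        self.uid = uid
        self.type = node_type
        self.parents = list(parents)
        self.props = props

    def __getattr__(self, name):
        # Python only calls this when normal attribute lookup has failed.
        props = self.__dict__.get("props", {})
        if name in props:
            return props[name]  # a direct property wins
        queue = deque(self.__dict__.get("parents", []))
        seen = set()
        while queue:
            node = queue.popleft()
            if id(node) in seen:
                continue
            seen.add(id(node))
            if node.type == name:
                return node  # an ancestor of the requested type
            queue.extend(node.parents)
        raise AttributeError(name)


pi = Node("pi", "language", name="Pāli")
kn = Node("kn", "division", parents=[pi])
dhp = Node("dhp", "division", parents=[kn])
```

Here `dhp.language` walks dhp → kn → pi and returns the `pi` node, so `dhp.language.name` is available without dhp knowing anything about languages.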
##Other possibilities of Parents/Children:
As previously noted, a node can have any number of parents, and requesting a property searches up through the ancestors. This can be used to attach a property to many different uids without needing to edit the files that define them.
For example we might decide we want to be able to group suttas by speaker:
{
    "uid": "speaker-sariputta",
    "type": "speaker",
    "name": "Sariputta",
    "children": ["dn33", "dn34", ...]
}
If we were to attempt this by putting a property on the sutta objects, we would run into a few problems. First, the entries which need to be changed are all over the place. Second, while most suttas have a single speaker, a few have two or more. With this method there is no problem at all with having multiple speakers for the same uid.
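To illustrate the multiple-speaker case, here is a sketch with a made-up second speaker entry. “speaker-example” is purely hypothetical, as is the `parents_of_type` helper, and only direct parents are checked to keep it short.

```python
def parents_of_type(uid, node_type, entries):
    """Return the uids of every entry of node_type that lists uid as a child."""
    return [
        key
        for key, entry in entries.items()
        if entry.get("type") == node_type and uid in entry.get("children", [])
    ]


entries = {
    "speaker-sariputta": {
        "type": "speaker",
        "name": "Sariputta",
        "children": ["dn33", "dn34"],
    },
    # Hypothetical entry, only here to show two speakers on one uid.
    "speaker-example": {
        "type": "speaker",
        "name": "Example speaker",
        "children": ["dn33"],
    },
}
```

Asking for the speakers of "dn33" returns both entries, and neither sutta file had to be touched to set that up.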
Another example would be handling Vol/Page data in this manner. Instead of attaching it directly to a sutta, we could define an entry along the lines of:
{
    "uid": "pts-dn-book1",
    "type": "book",
    "name": "Digha Nikaya PTS edition",
    "children_data": {
        "dn1": "DN i 1",
        "dn2": "DN i 47",
        "dn3": "DN i 87",
        "dn4": "DN i 111",
        ...
    }
}
Now if you check the “book” property of dn1, it propagates up and finds the PTS book which contains the relevant data. The nice thing is that if you then add more printed or manuscript editions, all of them will be properly associated.
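A lookup over “children_data” could be sketched as below. The `lookup_child_data` helper is a hypothetical name, and the second edition entry, including its reference strings, is invented purely to show several editions being picked up at once.

```python
def lookup_child_data(uid, node_type, entries):
    """Map each entry of node_type that carries data for uid to that data."""
    return {
        key: entry["children_data"][uid]
        for key, entry in entries.items()
        if entry.get("type") == node_type and uid in entry.get("children_data", {})
    }


entries = {
    "pts-dn-book1": {
        "type": "book",
        "name": "Digha Nikaya PTS edition",
        "children_data": {"dn1": "DN i 1", "dn2": "DN i 47", "dn3": "DN i 87"},
    },
    # Hypothetical second edition with made-up references.
    "example-edition": {
        "type": "book",
        "name": "Example edition",
        "children_data": {"dn2": "EX 2"},
    },
}
```

Looking up "dn2" returns one reference per edition that knows about it, so adding an edition is just a matter of adding one more entry of type “book”.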
Another practical example:
{
    "uid": "ebt",
    "name": "Early Buddhist Texts",
    "type": "era",
    "children": [...]
}
Then that could be used as an entry point for rendering a list of EBT suttas, or it could be used as sutta.era to appear as a column in a division table.
The important thing is that this flexibility is baked into the data format and conventions: practically arbitrary groupings can be defined in the data. Want to additionally add “script” information to books? You could create a new object with type “script” which has the relevant books as children.
More generally, something like the ebt example becomes wonderfully centralized: you don’t need to hunt through a lot of different files to find where to make changes.
##Views on data:
When the data is loaded into memory, it’s put into a Graph structure. Any uid can be used as the entry point for creating a view. As noted previously, there are separate, non-interacting “roots”, like pitaka, language and tradition. Often you want these things to form a single tree. Happily, there is a powerful feature called “groupby” available in both Python and Jinja: using standard template features you can display texts by pitaka, grouped by language and grouped by tradition.
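On the Python side this can be sketched with `itertools.groupby`. The records are illustrative only: the "zh" and "sarv" codes are made up for this example, and the hypothetical `nest` helper assumes each text already has its pitaka, language and tradition resolved to uids. Jinja's `groupby` filter plays the same role inside templates.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative records; "zh" and "sarv" are invented codes.
texts = [
    {"uid": "ma", "pitaka": "su", "language": "zh", "tradition": "sarv"},
    {"uid": "dn", "pitaka": "su", "language": "pi", "tradition": "tv"},
    {"uid": "mn", "pitaka": "su", "language": "pi", "tradition": "tv"},
]


def nest(rows, keys):
    """Group rows by each key in turn, giving nested dicts that end in uid lists."""
    if not keys:
        return [row["uid"] for row in rows]
    key = itemgetter(keys[0])
    # groupby only merges adjacent items, so sort by the same key first.
    ordered = sorted(rows, key=key)
    return {
        value: nest(list(group), keys[1:])
        for value, group in groupby(ordered, key=key)
    }


tree = nest(texts, ["pitaka", "language", "tradition"])
```

The separate roots end up folded into a single tree, and changing the order of the keys gives a different view of the same data.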
##The power of this data model
Previously I used an example of adding an “EBT” group. Now here’s where things get really neat: you only need to add that entry to the data, and then you can go to the template and use {{sutta.era}}. Adding a feature like this requires no changes to the Python code at all; it can be done purely at the level of the data and the templates. That’s a big effort saver and reliability improvement.
##The distinction between the structural data and relationships data
There are some superficial similarities between the two formats. For example both use lists of uids.
The main distinction is that the structural data is for creating hierarchical trees. Descending (to build views) and ascending (to get properties) both form neat and tidy trees.
The parallels data is about cross-linking: if you tied all the parallels data together, it’d create a great big ball of interconnectivity with no discernible structure at all.