I’ve been unhappy with the plans for the new data model, with a nagging sense that something was off. I think I’ve figured out what it is: the plans amount to doing a mediocre approach better, rather than embracing a better approach. Understanding why requires a brief overview of the under-the-hood architecture of the server.
The IMM (In Memory Model)
When we first transitioned from PHP to Python, the core sutta data was managed by the In Memory Model. The IMM is built up from table data, presently in CSV format, and it remains the core of the server; it is some of the oldest code we have. The IMM consists of a table data loader and a monolithic “god object”, a classic anti-pattern. The IMM is a poor approach and it should be dismantled rather than improved.
The TIM (Text Info Model)
Later, as our body of texts expanded, I added the Text Info Model. The TIM examines the texts and composes a useful summary of data: for example, it pulls out sutta names, author information, volume/page info and more from the texts. The IMM uses the TIM as a supplementary source of this data. The TIM is also the primary source for text and translation info. That is to say, there is no table which says what texts and translations we have; the TIM dynamically discovers them and then informs the IMM of what is available. The TIM has proven to be an excellent and reliable approach with a focused role to perform. To add texts you just add texts and the TIM takes care of everything. How cool is that?
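To make that concrete, here is a minimal sketch of the kind of discovery the TIM does, assuming the texts are HTML files in a folder tree with the uid as the file name; the function and field names are illustrative, not the actual TIM code.

from pathlib import Path
import re

def scan_texts(root):
    """Walk the text folders and build a {uid: summary} index."""
    info = {}
    for path in Path(root).rglob('*.html'):
        html = path.read_text(encoding='utf-8')
        # Pull the sutta name out of the first <h1>, if there is one.
        match = re.search(r'<h1[^>]*>(.*?)</h1>', html, re.S)
        info[path.stem] = {
            'path': str(path),
            'name': match.group(1).strip() if match else None,
        }
    return info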
The Evolution
In the PHP days we had table data exclusively, and all texts were hosted externally. Then we started adding texts, but suttas with texts were very much the minority. In the time before the TIM, we would actually add a new text and then edit a table to add the URL of the text on our own server.
Over time there has been a shift in importance. In the PHP days the table data was the only thing we had, so it was of absolute importance and everything was built around it. Gradually the body of texts has grown to the point where most suttas now have texts; it is almost exceptional not to have one.
Texts as the foundation of the data structure
My thinking is that the texts should be of primary importance in defining structure, with the table data being supplementary. The TIM then becomes the central structure which provides data, and it understands structure by examining the folders and files. Eliminating this role of the IMM, and eliminating a separate structure definition, would not only simplify things in general, it would also have specific advantages in areas such as search: at the moment the table data has to be searched separately from the text, but if they are unified there is only one body of data to search.
What it comes down to is a few basic principles: Don’t Repeat Yourself (DRY) and the Principle Of Least Astonishment (POLA), perhaps also the principle that “there should be one, and preferably only one, obvious way to do it”. DRY is simple: try to avoid having to define the same thing in two different places, because that is fraught with problems. In this case I’m referring mainly to defining the hierarchy both in tables and in the text folders. My suggestion is that the hierarchy should be defined primarily by the texts themselves, and that no additional structural information should be required at all unless the data is not actually complete in the texts, either because we don’t have the texts or because the information is in some manner external to the text rather than belonging to it.
Using the folder structure to define structure has benefits. File systems are super mature, super reliable, super robust, super well understood, and there is fantastic tooling we use every day. Everything works with folders and files, from your file browser to git to zip archives. So we should leverage the file system as much as possible.
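As a rough sketch of what deriving structure from the folders could look like, assuming a layout along the lines of /pi/tv/su/dn/dn1.html (the helper and field names are hypothetical):

from pathlib import Path

def build_tree(folder):
    """Recursively turn a folder of texts into a nested structure:
    every subfolder becomes a node, every HTML file becomes a leaf."""
    node = {'uid': folder.name, 'children': []}
    for entry in sorted(folder.iterdir()):
        if entry.is_dir():
            node['children'].append(build_tree(entry))
        elif entry.suffix == '.html':
            node['children'].append({'uid': entry.stem})
    return node

tree = build_tree(Path('/pi/tv/su'))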
When auxiliary data is required
The TIM approach of pulling names and the like out of texts works well for suttas we host, but divisions and the like will generally appear as just a folder, and you can’t give a folder any attributes. So there is essentially just a uid, such as “dn”. There are also “stubs” for which we don’t have texts.
That means there needs to be a way to add attributes to these textless nodes, and it should be done in a reasonably unsurprising way.
A good option would be to use JSON:
/pi/tv/su/dn.json
{
    "name": "Dīgha Nikāya",
    "type": "division"
}
That is a pretty clear approach. As mentioned in previous documents on data format, it would also be acceptable to write a tree in JSON:
/pi/tv/su/sn.json
{
    "name": "Saṃyutta Nikāya",
    "type": "division",
    "children": {
        "sn1": "Devatā Saṃyutta",
        "sn2": "Devaputta Saṃyutta",
        "sn3": "Kosala Saṃyutta",
        ...
    }
}
One of the solemn duties of the data loader would be to complain loudly if you try to define the same thing twice but in different ways. In fact these JSON files should be kept pretty short and not too deep.
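For illustration, the duplicate check might look something like this (the function name and data shape are hypothetical):

def merge_attributes(existing, new, uid):
    """Merge one source of attributes into another, refusing to load if the
    same key is defined twice with different values."""
    for key, value in new.items():
        if key in existing and existing[key] != value:
            raise ValueError(
                '{}: "{}" is defined twice with different values ({!r} vs {!r})'
                .format(uid, key, existing[key], value))
        existing[key] = value
    return existing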
Another example of required auxiliary data is explicit ordering information:
/pi/tv/su.json
{
    "ordering": ["dn", "mn", "sn", "an", "kn"]
}
or perhaps:
/pi/tv/su/ordering.json
["dn", "mn", "sn", "an", "kn"]
Standardization of data in texts
At the moment the TIM performs a feat of quasi-intelligent data mining to pull out info like name and author. While convenient, it’s definitely not ideal. It would be better to use a standard way to make this data quick to retrieve. One way would be to have a standard header:
<meta name="author" content="sujato">
<meta name="name" content="Brahmajala sutta">
<meta name="license" content="cc">
Another more DRY way would be to use attributes to clearly state what values are to be used:
<h1 data-name>Brahmajāla Sutta</h1>
...
<span class="author" data-author="sujato">Bhikkhu Sujato</span>
The HTML data attribute has a specific role: it says “this isn’t for humans to read, it’s not used for presentation, it’s for computer programs”, so this would be quite an appropriate use.
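As a sketch of how retrieval could work on that markup, assuming lxml is available (the attribute names follow the example above):

import lxml.html

def extract_info(html):
    """Pull declared values straight out of the markup instead of guessing."""
    root = lxml.html.fromstring(html)
    info = {}
    name_el = root.xpath('//*[@data-name]')
    if name_el:
        info['name'] = name_el[0].text_content().strip()
    author_el = root.xpath('//*[@data-author]')
    if author_el:
        info['author'] = author_el[0].get('data-author')
    return info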
Overarching architecture
At the moment the TIM examines all the texts and generates an index of data. Instead it will examine the texts and generate a hierarchy, so it not only knows things about the texts, it also knows how everything fits together.
The data service and the presentation should be more strongly decoupled. In the future it is quite likely that the templates will end up getting rendered at least partially in the browser, with data being cached in the browser for offline use.
At the moment the flow goes:
Data Manager -> Template Rendering -> HTML in Browser
Instead it will go:
Data Manager -> JSON -> Template Rendering -> HTML in Browser
This means that templates aren’t communicating directly with the data model; instead they are just delivered what they need. While in-browser rendering isn’t something we’ll do immediately, this will make it much easier in the future.
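As a minimal illustration of that flow, using Flask purely as an example framework (the route and the stand-in data are hypothetical):

from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for whatever the data manager would actually provide.
SUTTA_DATA = {'dn1': {'uid': 'dn1', 'name': 'Brahmajāla Sutta', 'author': 'sujato'}}

@app.route('/api/suttas/<uid>')
def sutta_data(uid):
    # Templates never touch the data model directly: they receive JSON,
    # whether rendered on the server or in the browser (where it can also be cached).
    return jsonify(SUTTA_DATA.get(uid, {}))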