Latest on Data Model

I’ve been feeling unhappy with the plans for the new data model, sensing that something was off. I think I’ve figured it out: the plans involve doing a not-so-great approach better, rather than embracing a better approach. Understanding this requires a brief overview of the under-the-hood architecture of the server.

The IMM (In Memory Model)

When we first transitioned from PHP to Python, the core sutta data was managed by the In Memory Model. The IMM is built up from table data, presently in CSV format. The IMM remains the core of the server and is some of the oldest code. It consists of a table data loader and a classic example of a monolithic “god object” (which is a bad thing). The IMM is a not-so-great approach and it should be dismantled rather than improved.

The TIM (Text Info Model)

Later, as our bodies of texts expanded, I added the Text Info Model. The TIM examines the texts and composes a useful summary of data: for example, it pulls out sutta names, author information, volume/page info and more from the texts. The IMM uses the TIM as a supplementary source of this data. The TIM is also the primary source for text and translation info. That is to say, there is no table which says what texts and translations we have; the TIM dynamically discovers what texts and translations we have and then informs the IMM of what is available. The TIM has proven to be an excellent and reliable approach with a focused role to perform. To add texts you just add texts and the TIM takes care of everything, how cool is that?
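
As a rough illustration, and not the actual TIM code, that discovery amounts to something like the following sketch. The field names and the <h1>-based name extraction are simplifications for illustration only.

from pathlib import Path
import re

def scan_texts(root):
    # Walk every HTML file under the text root and record what can be
    # discovered from the file itself: its uid, its path and its name.
    info = {}
    for path in Path(root).glob('**/*.html'):
        html = path.read_text(encoding='utf-8')
        match = re.search(r'<h1[^>]*>(.*?)</h1>', html, re.DOTALL)
        info[path.stem] = {
            'uid': path.stem,
            'path': path,
            'name': match.group(1).strip() if match else None,
        }
    return info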

The Evolution

In the PHP days we had table data exclusively, and all texts were hosted externally. Then we started adding texts, but suttas with texts were in the extreme minority. In the time before the TIM, whenever we added a new text we would then edit a table to add the URL of the text on our own server.

Over time there has been a shift in importance. In the PHP days the table data was the only thing we had, so it was of absolute importance and everything was built around it. Gradually the body of texts has grown to the point where most suttas now have texts, so it’s almost exceptional for a sutta not to have a text.

Texts as the foundation of the data structure

My thinking is that the texts should be of primary importance in defining structure, with the table data being supplementary. As such, the TIM becomes the central structure which provides data, and it understands structure by examining the folders and files. Eliminating this role of the IMM and eliminating the separate structure definition would not only simplify things in general, it would also have specific advantages in areas such as search: at the moment the table data has to be searched separately from the text, but if they are unified there is only one body of data to search.

What it comes down to is a few basic principles: Don’t Repeat Yourself (DRY) and the Principle Of Least Astonishment (POLA), perhaps also the principle “There should be one — and preferably only one — obvious way to do it”. DRY is simple: try to avoid having to define the same thing in two different places, because that is fraught with problems. In this case, I’m referring mainly to defining the hierarchy both in tables and in the text folders. My suggestion is that hierarchy should be defined primarily by the texts themselves, and that no additional structural information should be required at all unless that data is not actually complete in the texts, either because we don’t have texts or because the information is in some manner external to the text rather than belonging to it.

Using the folder structure to define structure has benefits. File systems are super mature, super reliable, super robust, super well understood, and there is fantastic tooling we use every day. Everything works with folders and files, from your file browser to git to zip archives. So we leverage the file system as much as possible.
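
To make that concrete, a hierarchy derived purely from folders and files could be built with something as simple as this sketch. The node shape here is illustrative, not a settled format.

from pathlib import Path

def build_tree(folder):
    # Derive a structural node from a folder: its uid is the folder name,
    # its children are the subfolders and HTML files it contains.
    node = {'uid': Path(folder).name, 'children': []}
    for child in sorted(Path(folder).iterdir()):
        if child.is_dir():
            node['children'].append(build_tree(child))
        elif child.suffix == '.html':
            node['children'].append({'uid': child.stem})
    return node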

When auxiliary data is required

The TIM approach of pulling out names and such from texts is straightforward for the suttas we actually host, but divisions and such will generally appear as just a folder, and you can’t give folders any attributes. So there is essentially just a uid, such as “dn”. Also there are “stubs” we don’t have texts for.

That means there needs to be a way to add attributes to these textless nodes, and this should be done in a reasonably unsurprising way.

A good option would be to use JSON:

/pi/tv/su/dn.json

{
    "name": "Dīgha Nikāya",
    "type": "division"
}

That is a pretty clear approach. As mentioned in previous documents on data format, it would also be acceptable to write a tree in JSON:

/pi/tv/su/sn.json

{
  "name": "Saṃyutta Nikāya",
  "type": "division",
  "children": {
    "sn1": "Devatā Saṃyutta",
    "sn2": "Devaputta Saṃyutta",
    "sn3": "Kosala Saṃyutta",
    ...
  }
}

One of the solemn duties of the data loader would be to complain loudly if you try to define the same thing twice but in different ways. In fact these JSON files should be kept pretty short and not too deep.
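
A sketch of that duty, assuming the loader merges each folder-derived node with its auxiliary JSON:

def merge_aux(node, aux):
    # Merge auxiliary JSON attributes into a folder-derived node and
    # complain loudly if the same thing is defined twice in different ways.
    for key, value in aux.items():
        if key in node and node[key] != value:
            raise ValueError(
                '{} is defined twice with different values: {!r} vs {!r}'
                .format(key, node[key], value))
        node[key] = value
    return node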

Another example of auxiliary data required is explicit ordering information:

/pi/tv/su.json

"ordering": ["dn", "mn", "sn", "an", "kn"]

or perhaps:

/pi/tv/su/ordering.json

["dn", "mn", "sn", "an", "kn"]

Standardization of data in texts

At the moment the TIM performs a feat of quasi-intelligent data mining to pull out info like name and author. While convenient, it’s definitely not ideal. It would be better to use a standard way to make this data quick to retrieve. One way would be to have a standard header:

<meta name="author" content="sujato">
<meta name="name" content="Brahmajala sutta">
<meta name="license" content="cc">

Another more DRY way would be to use attributes to clearly state what values are to be used:

<h1 data-name>Brahmajāla Sutta</h1>
...
<span class="author" data-author="sujato">Bhikkhu Sujato</span>

The HTML data attribute has a specific role: it says “this isn’t for humans to read, it’s not used for presentation, it’s for computer programs”, so it would be quite an appropriate usage.
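
Either convention would make extraction trivial. For instance, a sketch using the lxml library to read the attribute-based variant above (whether we’d use lxml or something else is an open choice; the attribute names follow the example):

from lxml import html as lxml_html

def extract_info(html_text):
    # Pull the explicitly marked name and author out of a text, rather
    # than data-mining them heuristically.
    doc = lxml_html.fromstring(html_text)
    name_el = doc.xpath('//*[@data-name]')
    author_el = doc.xpath('//*[@data-author]')
    return {
        'name': name_el[0].text_content().strip() if name_el else None,
        'author': author_el[0].get('data-author') if author_el else None,
    }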

Overarching architecture

At the moment the TIM examines all the texts and generates an index of data. Instead it will examine the texts and generate a hierarchy so it not only knows stuff about texts, but it also knows how everything fits together.

The data service and the presentation should be more strongly decoupled. In the future it is quite likely that the templates will end up getting rendered at least partially in the browser, with data being cached in the browser for offline use.

At the moment the flow goes:

Data Manager -> Template Rendering -> HTML in Browser

Instead it will go:

Data Manager -> JSON -> Template Rendering -> HTML in Browser

This means that templates aren’t communicating directly with the data model but instead are just delivered what they need. While in-browser rendering isn’t something we’ll do immediately, this will make it much easier in the future.
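
In miniature, and purely as an illustration (Jinja2 shown, with made-up data), the JSON-in-the-middle flow looks like this:

import json
from jinja2 import Template

# The data manager emits plain JSON...
division_json = json.dumps({'uid': 'dn', 'name': 'Dīgha Nikāya'})

# ...and the template layer renders whatever it is handed, with no access
# to the data model itself.
template = Template('<h1>{{ division.name }}</h1>')
page = template.render(division=json.loads(division_json))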


Hey Blake,

Thanks so much, this sounds awesome as always. I’d love to hear Vimala’s perspective too; she probably understands this better than I do.

Anyway, here’s some questions and comments.

It is very cool indeed. I understand the principle here, but I’m vague as to what this actually is. Am I right to think that the IMM and TIM are both data structures held in RAM? What form is the data in?

I’m concerned about reusability and lastingness. If the overall data only exists in RAM, well, that’s not very stable. And what if someone else wants to use our data for their project? Can the TIM be saved as a file?

Fine. But this would mean that we’d have to reorganize our texts, putting them into vaggas, pannasas, and so on.

To play Mara’s advocate, you are suggesting that the hierarchical data be a combination of folder structure and JSON. It seems that we will never be able to completely eliminate stubs. (Some of our texts have no digital form, while others, especially SHT fragments and similar, are available only as copyrighted images.) We could perhaps represent them with empty files and folders. But that still leaves us needing to supply auxiliary data.

Would it not be simpler, DRY-er, and POLA-er to take the opposite approach? Folders exist purely for human convenience. Even files, too. The data structure ignores them and keeps everything in JSON. That way all hierarchy is in one readable and human-editable form. No stubs or auxiliary data to worry about.

The real issue here is that the data being mined is loose. Obviously, you’ve done a great job of extracting it, but at the end of the day, such data is always semi-structured at best. Consider a situation like the English translation of the Sutta Nipata. The overall author is “Khantipalo, including several sections from other translators, with corrections and additions by Sujato”. Getting all this straight, especially for all the languages, is no mean feat. Nevertheless, you’re obviously right, we should strive to make this as clear as possible.

This raises another very un-DRY (or is it WET?) aspect of our texts. The metadata is included in each file, sometimes repeated verbatim hundreds or thousands of times. I don’t know if this is really a problem, but it sure ain’t DRY.

In the spirit of decoupling (or stand-offing) would it not be better to extract the meta, and preserve this in a separate set of data files, eliminating repetition? Then the texts would just be the texts. On the other hand, I do like the simplicity and robustness of keeping the metadata and text in the same HTML file.


I completely agree with @Blake’s reasoning here. This will simplify the structure and eliminate having data in two or more places.

Right now we also have 2 different structures for representing suttas. Some, like DN, MN, etc., have only one sutta per file, and each file has the correct filename within that folder. Then others, like the DHP, have all the verses in just one file, while they are represented in the list as if they were separate suttas. The tables make sure that this is represented correctly.
By using attributes in the texts itself, things like the DHP can also be represented fairly easily.

And while we’re at it, maybe we can also eliminate the need to have paragraph numbers defined in 4 different places in 3 different files.

I don’t agree with that. You would again have a structure where you have data in more than one place, and you keep open the possibility of storing files in the wrong place. Especially when working with the Chinese vinaya texts, I have come across this often: texts stored in the wrong folders. But the TIM happily retrieves them when the language and id are correctly set within the html. I’d rather eliminate the possibility of storing files wrongly.
We need to keep the folders for human convenience anyway, so we might as well use them.

I agree. We could for instance have a file called meta.html with the metadata in every (sub)division folder and retrieve that. In cases where files have different metadata, this could still be stored in the file itself.
So
IF (meta data in file): use that
ELSE: find the nearest file called meta.html in the nearest (sub)division folder
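
In code, that rule might look something like this sketch (meta.html and the metadata div as discussed in this thread; the exact marker check is simplified):

from pathlib import Path

def find_metadata(text_path):
    # Use the metadata embedded in the file if it has any; otherwise walk
    # up the folder tree to the nearest meta.html.
    text_path = Path(text_path)
    if 'id="metadata"' in text_path.read_text(encoding='utf-8'):
        return text_path
    for folder in text_path.parents:
        candidate = folder / 'meta.html'
        if candidate.exists():
            return candidate
    return None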


I hope I haven’t given you the impression that any idea gets passed around here without me arguing about it! But like I said, you guys know this stuff much better than me.

That sounds like an excellent idea.

Have these been fixed?

Sure.

What kinds of cases are you thinking of here?

I like to pretend I know … Blake is the real expert here :slight_smile:

I fix those when I come across them.

It happens often in the translations that not all suttas in a (sub)division are by the same author or even the same source. For instance look at the French translations that I took from canonpali. They have different authors and different meta data for many of them. The German SN is divided in 3 distinct parts, each with its own author and history. Then there are various files where there are hyperlinks to the same sutta on an external site. So each meta-area is different for each sutta within that (sub)division because the hyperlinks are different for each.

Thanks.

Okay. But I’m not sure this justifies treating them differently. If we’re going to have separate text and metadata, would it not be simpler to just do this for all of them?

Of course, in such cases we don’t get the benefit of not having to repeat ourselves. But that doesn’t mean we can’t do it. For example, we may well end up with cases where a text is originally unique, but later others are added from the same author.

The case of hyperlinks is, I admit, more difficult.

I don’t have any strong views about this, just trying to be clear.

Ideally we would like to have all the texts translated in Pootle from the same author per (sub)division. But we are not there yet. Most of our languages still have large gaps.

The TIM is essentially an index: it analyzes the texts and generates an index of what it finds. For example, if I ask the TIM what it knows about SA 1 texts, this is what it returns:

{'lzh': {'author': '',
         'bookmark': None,
         'cdate': '2016-06-10',
         'file_uid': 'sa1',
         'lang': 'lzh',
         'mdate': '2016-06-10',
         'name': 'SA 1(一) 無常',
         'next_uid': 'sa2',
         'path': PosixPath('lzh/su/sa/sa1-100/sa1.html'),
         'prev_uid': None,
         'uid': 'sa1',
         'volpage': None},
 'en': {'author': 'Bhikkhu Anālayo',
        'bookmark': None,
        'cdate': '2016-10-18',
        'file_uid': 'sa1',
        'lang': 'en',
        'mdate': '2016-10-18',
        'name': 'Discourse on Impermanence',
        'next_uid': 'sa2',
        'path': PosixPath('en/lzh/su/sa/sa1-100/sa1.html'),
        'prev_uid': None,
        'uid': 'sa1',
        'volpage': None},
 'vn': {'author': 'Tuệ Sỹ,Thích Đức Thắng',
        'bookmark': None,
        'cdate': '2016-05-18',
        'file_uid': 'sa1',
        'lang': 'vn',
        'mdate': '2016-05-18',
        'name': 'KINH 1. VÔ THƯỜNG',
        'next_uid': 'sa2',
        'path': PosixPath('vn/lzh/su/sa/sa1-100/sa1.html'),
        'prev_uid': None,
        'uid': 'sa1',
        'volpage': None}}

The TIM literally knows nothing more than can be discovered by examining the texts.

The IMM in a similar way builds up an in memory model from the table data stored on disk.

Both are entirely unlike databases: they store nothing and know nothing other than what is read from the data files.

I’ve actually argued this case before, much as Vimala said. Actually the TIM already looks for files called “metadata.html” and reads information from them, though this ability isn’t really used anywhere at the moment, except perhaps to populate the author field.

I do favor separating license information; in my first post I mentioned the idea of a standard header or standardized ways to link.

I looked into it in a little more detail. One way would be to use the <link> element: in HTML5 you can have <link rel="author" href="..."> and <link rel="license" href="...">. So we could have an explicit link in each file, which can be very short, and what it points to can be updated.


Okay, so it sounds like we have an agreement to go ahead with this approach.

And presumably, we could include the source link, where needed.

What format does this file need to be? I.e. how does the metadata need to be represented inside the file?
I tried making one but it gave me some errors, so can you give me an example of the correct markup?

It would need to be the same as the metadata in the texts, that is, contain a <div id=metadata>. But as I said, right now it doesn’t do very much except pull out author information and try to generate a “translated by…” tooltip from it.

Here are some additional requirements I’ve been working out at an architectural level:

Strict separation of components.

  1. Data loading into JSON format
  2. Serving JSON data as an API
  3. Rendering data into templates
  4. Webserver

All of these components would be capable of functioning independently, although naturally they wouldn’t be able to do much on their own. For example, all the API could do without data is say “ain’t got no data”, and the webserver would only be able to say “shit, looks like everything else is broken”.

There are a few reasons for this separation. The first is testability: when everything depends on everything else, it is hard to isolate a component to test it. At the moment we can’t just test whether the data loads okay without also loading all the cherrypy stuff and so on. That complicates things.

It also allows mixed programming languages. For example perhaps everything is done in Python, except the template rendering which can be done in Javascript in the browser, accessing the JSON serving API to get the data it needs to populate the templates.

Data Loading into JSON

Data would come from a number of sources:

  1. Texts, including folder structure, html files and auxiliary json files. This would all be compiled together into unified JSON trees.
  2. Relationship data, which would basically just be JSON and require no real conversion, maybe just some consolidation.
  3. Localization Data, which would likely be converted from .po to JSON.

All the source data would thus be converted into JSON as a common intermediate, from where it is easily processed in various standardized ways as there are many tools that use JSON data.
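
As an illustration of the conversion step, the localization case might be handled roughly like this (using the polib library; the flat output shape is just an assumption):

import json
import polib

def po_to_json(po_path):
    # Convert a .po localization file into a flat JSON mapping of
    # message id to translated string, skipping untranslated entries.
    po = polib.pofile(po_path)
    return json.dumps({entry.msgid: entry.msgstr
                       for entry in po if entry.msgstr},
                      ensure_ascii=False)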

Data Loading is not performance critical in any way because it only has to be done once.

Data Serving API

Data loaded in the previous stage can be registered with the JSON Serving API. The API doesn’t necessarily need to be public or forward-facing. It could just be a Python module used internally. But it would probably make sense to expose it as an HTTP REST API, both for use by our own Javascript and also so third parties can access our data in a formalized and convenient way.

The data will come in the form of subtrees; an overview of the subtrees would look something like this:

base
relationship
root_text
    pi
    lzh
    skt
    ...
translation
    en
    de
    fr
    ...
locale
    en
    de
    fr
    ...

Additional subtrees can be added as required.

The API will provide a few basic functions. First it’ll be able to provide the data for a uid.

Secondly it’ll be able to provide an entire subtree.

Thirdly it’ll be able to facilitate synchronization: a hypothetical service worker might keep copies of the data for offline usage. That service worker will be able to contact the API to discover if it needs to update its data. Most likely this synchronization will occur at the subtree level, so if a single Pali file is changed, that would invalidate the root_text/pi subtree and cause it to be re-downloaded. (As a rough estimate, “everything about Pali texts” would weigh in at about 500kb compressed, which is a very acceptable amount to download; trying to be even more granular would be overkill.)

Fourthly it will either directly serve or facilitate the serving of texts in the form of HTML documents, and also facilitate synchronization of these texts, so the hypothetical service worker would be able to know if its cached copy of en/dn2.html is up to date or not. Perhaps it may also be able to pack many HTML files into a zip for bulk delivery.
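
Put together, the non-HTTP core of such an API could be as small as this sketch (class and method names are illustrative only, and the node shape assumes a 'children' list):

class DataAPI:
    """A minimal in-process version of the JSON serving API."""

    def __init__(self):
        self.subtrees = {}   # e.g. 'root_text/pi' -> nested dict

    def register(self, name, tree):
        self.subtrees[name] = tree

    def get_subtree(self, name):
        return self.subtrees[name]

    def get_uid(self, uid):
        # Search every registered subtree for a node with this uid.
        for tree in self.subtrees.values():
            node = self._find(tree, uid)
            if node is not None:
                return node
        return None

    def _find(self, node, uid):
        if isinstance(node, dict):
            if node.get('uid') == uid:
                return node
            for child in node.get('children', []):
                found = self._find(child, uid)
                if found is not None:
                    return found
        return None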

All synchronization will use MD5 hashing to know whether something is correct or not, as this is a very tried and true technique and is already used internally by the TIM for its caches. In fact the subtrees and synchronization described above are all very similar to what the TIM does internally right now; for example, in suttacentral/db you find files like text-info-model-pi_9a51ab9521.pklz, which is everything the TIM knows about Pali texts, except in Python pickle rather than JSON format.
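
A sketch of how a subtree’s hash might be produced for synchronization (deterministic JSON serialization is an assumption, not a decided detail):

import hashlib
import json

def subtree_hash(tree):
    # Serialize the subtree deterministically and hash it; a client
    # compares this value against its cached copy to decide whether to
    # re-download the subtree.
    data = json.dumps(tree, sort_keys=True, ensure_ascii=False).encode('utf-8')
    return hashlib.md5(data).hexdigest()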

The API is performance-critical: it should return results quickly. It could be partially delegated to Nginx; for example, each subtree is really just a JSON file, and the texts are HTML files. These can be directly served by Nginx from the filesystem, reducing the load on the backend server and lowering the scope for DOS attacks (as Nginx is harder to overwhelm than a “smarter” server which does more work for each query).

Template Rendering

Template Rendering would be done by Python as currently happens. There would need to be a small amount of data-wrangling prior to rendering.

It is probable that in the future this task will be able to be delegated to a Service Worker, to enable offline functionality and reduced latency when online, although rendering on the webserver would still be used as a fallback in the case that Service Workers are unavailable. Since Service Workers are implemented in Javascript, we could then use Nunjucks templates, as they are interoperable with Jinja2 templates.

Web Server

The Web Server is responsible for serving web pages; it may also serve the data API.

The Web Server should be able to start up even if everything else fails horribly; in that case it should give informative error messages. This is a significant difference from what happens now, where the web server remains silent until the TIM and IMM have finished loading, making it sometimes hard to know what went wrong.

The webserver would need to provide the hypothetical Service Worker with what it needs to take on page rendering duties, namely assets and templates.


That all sounds good. I’m reading it slowly and looking things up as I go, so I have a reasonably good grasp of the issues. I’m particularly happy to see the proposed support for a RESTful API.

Some remarks as I go:

The MD5 wikipedia page says things like “MD5 is a broken hash function” and that “no one should be using MD5 anymore”. :thinking:

Good, this needs to be fundamental. We must assume that in the future we will be subject to massive DDoS attacks. The Buddhist/Muslim cyber war is real, and it is not going away.

Sounds like Go is off the table?

This sounds awesome, I have just checked out Nunjucks and it looks perfect. How does it work in practice? Do we write separate (but parallel) code in Python and JS? Or does it just work, i.e. the templating systems translate for us?

This brings me to another, perhaps completely irrelevant question. As far as I understand it, our server, as well as our communications (gmail, hangout, skype), are all based in Silicon Valley. Now Google has probably the best backup systems you can have, but still, with the current political shifts in the US, how safe is this? Should we not at least have a backup elsewhere (or do we have this?) and alternative communication methods?

I agree, multiple backups are great. Still, with git there are already several backups. It would be prudent, I think, to ensure current backups in geographically distinct and stable areas.

In the longer term, I’m keeping my eye on things like IPFS. But they aren’t really mature yet.


Both Blake and I have our local backups, and I’ve been trying to get you to clone the git repository for ages but without much success :slight_smile:

This requires understanding the context: MD5 is broken as a cryptographic hash, and as such it should not be used to verify untrusted data. For example, say you download an Ubuntu ISO using bittorrent: you can generate an MD5 hash of the downloaded ISO using md5sum and compare it with the MD5 hash published by Canonical to verify the download completed correctly. Hypothetically, someone could create an Ubuntu ISO with a hidden backdoor which, through careful manipulation of the data, shares the same MD5 hash as the official ISO, making MD5 invalid as a means of identity verification.

On the other hand, for purposes like cache-busting there is nothing wrong with using MD5, or even stupider approaches like the file modification time. In fact you normally make it even less secure to get shorter filenames: an MD5 hash is 32 characters long, but normally you slice it to 8 characters, for a roughly 1 in 3 million chance of a random collision. In this case the data is sent over a secure channel, namely HTTPS, so there’s no trust issue.

Even more critically, the “dumbness” of the system makes the content easier to offload onto edge servers like Cloudflare’s. Taking our server down would be easy, taking down Cloudflare would be a much greater effort.

Hypothetically a Service Worker javascript client should be able to just download basic JSON files and not need an intelligent webserver at all, so the edge server can do all the work.

Consider it off the table but on the radar. There are some services which could be more easily implemented in Go than Python, such as certain kinds of custom data processing. And Go generally makes about 500x better use of hardware than Python, while being the most natural upgrade path from Python. Although for the immediate goals there is no need to use Go, and effort would probably be better invested into Service Workers.


Basically there is a common subset between Jinja2 and Nunjucks so the templates would be written using that common subset.

There would need to be a small amount of parallel code in a Service Worker to prepare data for the templates; for example, even with well-formed JSON data for relationships, there still needs to be at least one step to calculate what relationships a uid participates in. This code is very straightforward and easy to reproduce in different languages.
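
That step is small enough to show in full; a Python sketch (the 'uids' field is an assumed shape for relationship entries, and the same logic ports directly to Javascript):

def relationships_for(uid, relationships):
    # Collect every relationship entry that the given uid participates in.
    return [rel for rel in relationships if uid in rel.get('uids', [])]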


That’s the nice thing about git: there’s no master repository. It’d be possible to push directly from my computer to the suttacentral server without using github at all.

The only thing we don’t have a proper redundancy solution for is the text images, as they are too big for github and not generally suitable for git.

Sure, but that has only the site itself, not Discourse or whatever else there is. You’d still need to set up a whole new server and install the software on it in the worst-case scenario.

But both of us have a local copy of those too.

However the synchronization is very poor.