Moving the IMM from CSV to YAML

Jhanarato · March 29, 2015, 4:32am

OK, now for some plumbing…

At the centre of our Python code is a module called the In Memory Model - the IMM. This reads in data from a bunch of CSV files and stores it in RAM on our server. Other code can then navigate and read the data very simply and quickly. This is “meta data” - the name of a sutta, its parallels, the available translations and so forth. It is kept separate from the actual content, say the text of a sutta in Russian or a Burmese font.

A while back we talked about moving the IMM data source from CSV format to YAML. One of the key advantages here is the hierarchical nature of YAML. The IMM is basically a hierarchy with Pitakas at the top and Suttas at the bottom (correct me if I’m wrong here, it has been a while since I looked at the code).

One way to move formats would be to write some Python code to load the data into the IMM and write the whole lot out again to YAML. The next step would be to modify the IMM to read YAML but present the same API to the rest of the system.
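A rough sketch of that first step, for illustration only. The column names (`pitaka`, `collection`, `uid`, `name`) and the sample rows are made up and are not the real IMM CSV layout. It groups flat CSV rows into a nested dict, which is the shape a hierarchical format would store. The standard library's `json` module is used for the write-out step only because it needs no extra install; any serializer that handles nested dicts would do.

```python
import csv
import io
import json

# Hypothetical flat CSV; the real IMM files have their own columns.
SAMPLE_CSV = """pitaka,collection,uid,name
sutta,dn,dn1,Brahmajala Sutta
sutta,dn,dn2,Samannaphala Sutta
sutta,mn,mn1,Mulapariyaya Sutta
vinaya,vb,pj1,Parajika 1
"""


def rows_to_tree(csv_text):
    """Group flat rows into {pitaka: {collection: {uid: {"name": ...}}}}."""
    tree = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # setdefault creates each level of the hierarchy the first time it is seen.
        collection = tree.setdefault(row["pitaka"], {}).setdefault(row["collection"], {})
        collection[row["uid"]] = {"name": row["name"]}
    return tree


tree = rows_to_tree(SAMPLE_CSV)

# Write the whole hierarchy out again as text.
text = json.dumps(tree, indent=2, ensure_ascii=False)
```

A reader for the new format would then only need to hand back the same nested structure, so the rest of the system could keep using the same API.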

I’m sure @blake has already given this some thought but may have other priorities. If you are interested in learning to work on the core Python code this might be a nice project to start with.

sujato · March 29, 2015, 7:24am

We have moved forward in our thinking on this, and we’ve decided to use JSON rather than YAML.

This is something to discuss with @blake. It’s on our 2do list, along with about a million other things. But it definitely sounds like it would be a good idea.

FYI, here are @blake’s comments on this from an email some time ago. @vimala has also been involved in these discussions, as she’s interested in contributing on the programming side, too.

By the way, the final conclusion is that we will use JSON. There are two reasons for this. The first is that Elasticsearch uses JSON, Discourse uses JSON, and JSON is the standard data exchange format of the internet.
The second is that JSON is about 100x faster than YAML. That is a big enough difference to matter. On my laptop, loading the complete sutta data as JSON takes about 50ms, while YAML would take about 3s. While YAML is by no means unbearably sluggish, the speed advantages of JSON allow for simpler code.
As it happens, a while back I actually wrote most of the code required for converting our current data into hierarchical form. My earlier code converted to XML, as due to some lapse of reason I had somehow thought this could be a good idea.
I’ve updated this to dump JSON instead. Both the sutta data (all languages) and the parallels data are dumped in JSON form.
One of the major requirements of the data form is that the data should be damn near ready to consume. Through clever design, the data essentially comes ‘precompiled’. This is where JSON shines, because it is loaded directly into native Python data types.
The scheme I have devised results in reasonably elegant JSON. I’ve attached a zip containing the basic JSON data in tree form. One of the measures of a good format should be how comprehensible it is of course! Please take a look at the data and say what you think.
By the way, it is very pertinent to what I am doing right now, because I find myself increasingly working with JSON data. For example, because Discourse practically breathes JSON, it only makes sense to request sutta data in JSON form.
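To illustrate the point in the email above about JSON loading directly into native Python data types, here is a minimal sketch using the standard `json` module. The document and its keys (`uid`, `name`, `translations`, `volpage`) are invented for the example and are not the scheme in the attached zip. The timing at the end uses synthetic data, so the number it gives depends on the machine and says nothing about the real sutta data.

```python
import json
import time

# Invented example document, not the actual scheme from the zip.
DOC = (
    '{"uid": "dn1", "name": "Brahmajala Sutta", '
    '"translations": [{"lang": "en"}, {"lang": "ru"}], "volpage": null}'
)

data = json.loads(DOC)

# JSON objects become dicts, arrays become lists, strings become str,
# and null becomes None, so the result can be used without further parsing.
langs = [t["lang"] for t in data["translations"]]

# Rough load timing on synthetic data: 10,000 copies of the same record.
big_text = json.dumps([data] * 10000)
start = time.perf_counter()
big = json.loads(big_text)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"loaded {len(big)} records in {elapsed_ms:.1f} ms")
```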

Jhanarato · March 29, 2015, 8:35am

In any case, both JSON and YAML easily support hierarchical data. Since JSON is widely used and faster to import, it makes sense.