I managed to scrounge up the old work I did on generating JSON data from last year.
Here is the JSON exactly as it had been dumped at that time.
json-dump.zip (1.7 MB)
One thing I experimented with was sectioning the data - it’s included in the zip. The basic idea is that instead of putting the whole thing in one file, you split it into a folder/file structure; a file system is absolutely ideal for representing and working with hierarchies. But a file system is also a perfect example of an unordered data structure (as is the JSON object): files within a folder have no inherent order, and it’s up to the file manager to make one up. Because of that limitation it is necessary to have explicit ordering information. For example, in the pi folder there is a folder called ‘pi-su’ for the Pali suttas, and alongside it a file called ‘pi-su.json’ which explicitly defines the order in which DN, MN, SN, AN and KN come. You do need to define the explicit ordering in some way; it is unavoidable.
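Incidentally, in Python the key order of the JSON file itself can supply that explicit ordering - this is a minimal sketch of the idea, with illustrative file content rather than the real pi-su.json:

```python
import json

# json.loads preserves the key order found in the file, since Python
# dicts keep insertion order. So listing the divisions in canonical
# order in the file is itself the explicit ordering information.
# (Illustrative content - the real file also carries other attributes.)
text = '{"dn": {}, "mn": {}, "sn": {}, "an": {}, "kn": {}}'
print(list(json.loads(text)))  # ['dn', 'mn', 'sn', 'an', 'kn']
```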
The way I designed the sectioning scheme, you are free to section at whatever depth you want. The loader is given an entry point, say “root”; it sees there is a folder called root and a file called root.json, reads root.json, and creates a preliminary structure:
"root": {
"pi": {},
"lzh": {},
"skt": {},
"bo"" {},
"gr": {},
"oth": {}
}
Next it examines the content of the root folder: it sees there is a pi.json and a folder called pi, reads pi.json, and populates the pi object with the data it finds there:
"root": {
"pi": {
"name": "Pāli",
"pi-su": {},
"pi-vi": {},
"pi-ab": {}
},
...
Note: The way I built that tree, if there is a folder called “pi”, it’ll look for a file called “pi.json” for additional attributes. This is only one way of doing it. Another way would be to have a special JSON file, called something like ‘props.json’, inside the pi folder, and the loader would read props.json to populate the pi object. Both work equally well, although the ‘props.json’ concept is not all that text-editor friendly: you can end up with several ‘props.json’ files open at once and can’t tell at a glance which folder each refers to.
Then it opens the pi folder and populates the pi-su object from pi-su.json:
"root": {
"pi": {
"name": "Pāli",
"pi-su": {
"name": "Pali Suttas",
"dn": {},
"mn": {},
"sn": {},
"an: {},
"kn": {}
},
...
It opens the pi-su folder, and finds only json files, one each for dn, mn, sn, an and kn. It reads each json file and populates the appropriately keyed object:
"root": {
"pi": {
"name": "Pāli",
"pi-su": {
"name": "Pali Suttas",
"dn": {
"name": "Dīgha Nikāya",
"type": "division",
"dn-vagga-1": {
"name": "Sīlakkhandha Vagga",
"type": "vagga",
"dn1": {
"name": "Brahmajāla"
},
"dn2": {
"name": "Sāmaññaphala"
},
"dn3": {
"name": "Ambaṭṭha"
},
"dn4": {
"name": "Soṇadaṇḍa"
},
...
Note: “dn-vagga-1” could have been made a folder, and that folder could contain dn1.json, dn2.json and so on. The loader is completely agnostic about whether structure is defined in the file system or in a JSON file: you can use a single monolithic JSON file, a bazillion teeny-tiny files, or anything in between. So if you have a small collection you don’t have to use a division/subdivision folder structure; you can just represent it in a single compact JSON file. Or if a collection is vast, with enormous sub-sub-divisions, the folder nesting can go deeper.
Another note: dn1, dn2 and so on do not require explicit ordering information. Why? Because in the absence of explicit ordering information, natural sort is used, and if natural sort already produces the correct order, explicit ordering information is simply not required.
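For anyone unfamiliar with it, a natural sort can be sketched in Python like this - a generic recipe, not necessarily the loader’s exact implementation:

```python
import re

def natural_key(s):
    # Split into digit and non-digit runs; digit runs compare as integers,
    # so "dn10" sorts after "dn2" rather than between "dn1" and "dn2".
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', s)]

names = ["dn10", "dn2", "dn1"]
print(sorted(names))                   # plain sort:   ['dn1', 'dn10', 'dn2']
print(sorted(names, key=natural_key))  # natural sort: ['dn1', 'dn2', 'dn10']
```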
But anyway, the loader recurses through the entire structure until it has created a single big JSON object containing all the data, which amounts to 4.3 MB if dumped to disk (and that’s the small version, sans translation, volpage information and some other data; the complete version amounts to 11.5 MB).
It might sound pretty complicated, but the loader I wrote was only 13 lines of Python code. The exact parallel between JSON objects and file systems makes it very simple.
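For the curious, a loader along these lines can be sketched like so - a fresh sketch of the idea rather than the original 13 lines, making my own assumptions about the layout:

```python
import json
from pathlib import Path

def load(path):
    # For a node "pi" it reads pi.json for that node's attributes, then
    # merges in the contents of a sibling pi/ folder, recursing until
    # the whole tree is assembled into one dict.
    path = Path(path)
    meta = path.with_suffix('.json')
    data = json.loads(meta.read_text()) if meta.exists() else {}
    if path.is_dir():
        # Each child name is handled once, whether it appears as
        # <name>.json, a <name>/ folder, or both.
        for stem in sorted({c.stem for c in path.iterdir()
                            if c.is_dir() or c.suffix == '.json'}):
            data.setdefault(stem, {}).update(load(path / stem))
    return data
```

Called as load('root'), it returns the fully assembled dict. Keys already listed in the parent’s JSON file keep their position there (dicts preserve insertion order), so the explicit ordering wins; children not listed are appended in sorted order, where a natural sort key could be substituted.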
Okay so the pros and cons of segmenting the data:
Pros:
- Much more text-editor friendly. 4.3 MB for the basic skeleton is pushing it, and 11.5 MB for the fleshed-out skeleton even more so. Text editors can handle files of these sizes, just sluggishly.
- Much more human friendly when navigating to a particular place in the tree, you get a good “overview” when descending through the folder system - a JSON file doesn’t provide an overview because “pi” and “lzh” are lightyears apart.
- Indentation is kept at a comfortable level.
- Much more git-friendly - at a glance you have a much better idea of what has changed (consider the difference between seeing that “dn.json” has been modified vs “data.json”). Git also struggles with indentation changes, and it loves lots of small files but doesn’t love big ones.
- Loader can also handle a single big JSON file with no modification whatsoever.
Cons:
- Not search-friendly (though any good advanced text editor can search across multiple files)
- The need to have explicit ordering information (although this is somewhat unavoidable, and explicit can be good)
- The need to compile into a single monolithic structure for consumption.
- Violation of DRY: you have both the “pi” folder and “pi.json” to describe its properties.
Anyway that is the current state of the data - it is basically good enough to proceed with. I’ve also uploaded the scripts into the git repository.