Consumable JSON data

I managed to scrounge up the old work I did on generating JSON data from last year.

Here is the JSON exactly as it had been dumped at that time.
json-dump.zip (1.7 MB)

One thing I experimented with was sectioning the data - it’s included in the zip. The basic idea is that instead of putting the whole thing in one file, you split it into a folder/file structure; a file system is absolutely ideal for representing and working with hierarchies. But a file system is also a perfect example of an unordered data structure (as is the JSON object): files are not ordered within a folder, and it’s up to the file manager to make up an ordering. Because of that limitation it is necessary to have explicit ordering information. For example, in the pi folder there is a folder called ‘pi-su’ for pali suttas, and also a file called ‘pi-su.json’ which explicitly defines the order in which DN, MN, SN, AN and KN come. You do need to define the explicit ordering in some way; it is unavoidable.
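Concretely, the ordering lives in the key order of the json file. The contents of pi-su.json would look something like this (a sketch for illustration, consistent with the populated tree shown further down):

{
  "name": "Pali Suttas",
  "dn": {},
  "mn": {},
  "sn": {},
  "an": {},
  "kn": {}
}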

The way I designed the sectioning scheme, you can feel free to section at whatever depth you want. The loader is given an entry point, say “root”; it sees there is a folder called root and a file called root.json, reads root.json, and creates a preliminary structure:

"root": {
  "pi": {},
  "lzh": {},
  "skt": {},
  "bo"" {},
  "gr": {},
  "oth": {}
}

Next it examines the contents of the root folder. It sees there is a pi.json and a folder called pi; it reads pi.json and populates the pi object with the data it finds in it:

"root": {
  "pi": {
    "name": "Pāli",
    "pi-su": {},
    "pi-vi": {},
    "pi-ab": {}
  },
  ...

Note: The way I built that tree, if there is a folder called “pi”, it’ll look for a file called “pi.json” for additional attributes. This is only one way of doing it. Another way would be to have a special json file, called something like ‘props.json’, inside the pi folder, and the loader would read props.json to populate the pi object. Both work equally well, although the ‘props.json’ concept is not all that text-editor friendly, because you can end up with several ‘props.json’ files open and can’t see at a glance which folder each refers to.

Then it opens the pi folder and populates the pi-su object from pi-su.json:

"root": {
  "pi": {
    "name": "Pāli",
    "pi-su": {
       "name": "Pali Suttas",
       "dn": {},
       "mn": {},
       "sn": {},
       "an: {},
       "kn": {}
   },
  ...

It opens the pi-su folder, and finds only json files, one each for dn, mn, sn, an and kn. It reads each json file and populates the appropriately keyed object:

"root": {
  "pi": {
    "name": "Pāli",
    "pi-su": {
       "name": "Pali Suttas",
       "dn": {
         "name": "Dīgha Nikāya",
         "type": "division",
         "dn-vagga-1": {
           "name": "Sīlakkhandha Vagga",
           "type": "vagga",
           "dn1": {
               "name": "Brahmajāla"
           },
           "dn2": {
               "name": "Sāmaññaphala"
           },
           "dn3": {
               "name": "Ambaṭṭha"
           },
           "dn4": {
               "name": "Soṇadaṇḍa"
           },
           ...

Note: “dn-vagga-1” could have been made a folder, and that folder could contain dn1.json, dn2.json and so on. The loader is completely agnostic about whether structure is defined in the file system or in a JSON file: you can use a single monolithic json file, a bazillion teeny-tiny files, or anything in between. So if you have a small collection you don’t have to use a division/subdivision folder structure; you can just represent it in a single compact JSON file. Or if a collection is vast, with enormous sub-sub-divisions, the folder nesting can go deeper.
Another note: dn1, dn2 and so on do not require explicit ordering information. Why? Because in the absence of explicit ordering information, natural sort is used, and if natural sort gets the order right, explicit ordering information is simply not required.

But anyway, the loader recurses through the entire structure until it has created a single big JSON object containing all the data, which amounts to 4.3 MB if dumped to disk (and that’s the small version, sans translation, volpage information and some other data; the complete version comes to 11.5 MB).

It might sound pretty complicated, but the loader I wrote was only 13 lines of Python code. The exact parallel between JSON objects and file systems makes it very simple.
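To give a flavour of it, a loader along these lines can be sketched roughly as follows (not the literal 13 lines, and the exact merging and sorting behaviour here is only an approximation of the rules described above):

import json
import re
from pathlib import Path

def natural_key(s):
    # Split out the digit runs so that "dn2" sorts before "dn10".
    return [int(p) if p.isdigit() else p for p in re.split(r'(\d+)', s)]

def load(path):
    path = Path(path)
    # A folder "pi" is described by a sibling file "pi.json";
    # the key order in that file supplies the explicit ordering.
    sibling = path.parent / (path.name + '.json')
    node = json.loads(sibling.read_text(encoding='utf-8')) if sibling.exists() else {}
    if path.is_dir():
        # Children not mentioned in the json file fall back to natural sort.
        for child in sorted(path.glob('*.json'), key=lambda f: natural_key(f.stem)):
            node.setdefault(child.stem, {})
        for key, value in node.items():
            if isinstance(value, dict) and (
                    (path / key).is_dir() or (path / (key + '.json')).exists()):
                # Merge whatever is defined on disk over any inline attributes.
                node[key] = {**value, **load(path / key)}
    return node

tree = {'root': load('root')}

The point is that the recursion mirrors the folder hierarchy one-to-one, so a single monolithic JSON file and a deeply sectioned tree go through exactly the same code path.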

Okay so the pros and cons of segmenting the data:

Pros:

  • Much more text-editor friendly. 4.3 MB for the basic skeleton is pushing it, and 11.5 MB for the fleshed-out skeleton pushes it even more; text editors can handle files of these sizes, just sluggishly.
  • Much more human friendly when navigating to a particular place in the tree: you get a good “overview” when descending through the folder system, whereas a JSON file doesn’t provide an overview because “pi” and “lzh” are lightyears apart.
  • Indentation is kept at a comfortable level.
  • Much more git friendly - at a glance you have a much better idea of what has changed (consider the difference between seeing that “dn.json” has been modified vs that “data.json” has been modified). Git also struggles with indentation changes. And git loves lots of small files and doesn’t love big files.
  • Loader can also handle a single big JSON file with no modification whatsoever.

Cons:

  • Not search friendly (though any good advanced text editor will search over multiple files).
  • The need to have explicit ordering information (although this is somewhat unavoidable, and explicit can be good).
  • The need to compile into a single monolithic structure for consumption.
  • Violation of DRY: you have both the folder “pi” and the file “pi.json” to describe its properties.

Anyway that is the current state of the data - it is basically good enough to proceed with. I’ve also uploaded the scripts into the git repository.

I’ll have a look at it in more detail later, but for now can I just ask what is actually included in this?

I think we have the following sources:

  1. sutta parallels
  2. Vinaya parallels
  3. embedded parallels
  4. inferred parallels
  5. pali text cross-references
  6. verse parallels

Are all of these included in the JSON?

There is a parallels.json which includes basically exactly the same data as the csv (I did validate that it produces identical results). As a straight transformation, it’s less sophisticated than what Vimala is working on.

But the bulk of the data is the hierarchical data required to build division pages and so on.

Some data isn’t included - mainly the stuff which is “apart from” rather than “a part of” the sutta data, stuff like bibliography entries and such.

There is also the TIM data, which is derived from indexing the texts; it includes translation titles, volpage and so on. It is dynamic data (i.e. not to be committed to git), but since the TIM code is reasonably complex, the first iteration of the nodejs version will probably just use a JSON dump from Python. The TIM data as JSON comes to 15.6 MB (1.9 MB gzipped). Anything in the TIM data doesn’t really need to be in the main hierarchical data - that is, things like titles and volpage info.

Actually, when I think about it, volpage data in general should probably be separate from the main hierarchical data. It is definitely “apart from” rather than “a part of”: it belongs to a manuscript, print or digital edition, and does not belong to the sutta. Anyway, this is just something to think about and work out - what stuff should be part of the main data tree, and what should be separated off. For example, it might make sense to have an “edition/pts1.json” type deal going on, where we can define an edition, its name, and a uid:volpage mapping. Then we can map a sutta to any number of editions - much more flexible and precise than the volpage/alt_volpage setup. These files could usually be dynamically generated from the texts as well.
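To make that concrete, such an edition file might look something like the following (the field names and the exact volpage strings are just illustrative):

{
  "uid": "pts1",
  "name": "Pali Text Society, 1st edition",
  "volpages": {
    "dn1": "DN i 1",
    "dn2": "DN i 47",
    "dn3": "DN i 87"
  }
}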

Thanks for thinking of those of us who mostly work in a text editor!

Actually, not so much; it’s clear enough.

I just regexed something on the full_data_complete.json in Sublime Text: less than a second to do 11,000 replacements. So I don’t think this is such a problem.

Okay. So of the types I mentioned above it seems we have:

  1. sutta parallels :smile:
  2. Vinaya parallels :smile:
  3. embedded parallels :cry: (I just noticed that the parallels for the arthaviniscaya are absent from the site; this is an oversight)
  4. inferred parallels :smile:
  5. pali text cross-references :cry:
  6. verse parallels :cry:

Yes, but it is nice to have it clearly spelled out in a text file, even if only for luddites like me.

Yes, I agree.

[quote=“blake, post:4, topic:2621”]
it might make sense to have an “edition/pts1.json” type deal going on, where we can define an edition, its name, and a uid:volpage mapping. Then we can map a sutta to any number of editions
[/quote]

Sounds perfect.

Some miscellaneous comments.

In parallels.json, the way the partials are handled seems clumsy. (Also, “partial” is misspelled as “patrial”!) Instead of:

{
  "uids": ["dn2", "ea43.7", "t22", "da27", "sf1", "sbv80", "sf288", "dq70"]
},
{
  "uids": ["dn2", "sa154-163"],
  "patrial": true
},
{
  "uids": ["dn2", "t1442.13"],
  "patrial": true
},
{
  "uids": ["dn2", "t1450.10"],
  "patrial": true
},
{
  "uids": ["dn2", "t1444.1"],
  "patrial": true
},

Could we not do something along these lines (with the spelling fixed)?

{
  "uids": [
    {"dn2": {"partial": ["sa154-163", "t1442.13", "t1450.10", "t1444.1"]}},
    "ea43.7", "t22", "da27", "sf1", "sbv80", "sf288", "dq70"
  ]
},

I’ve compared the data for this text with that on the site, and it has exposed a couple of problems.

  1. The data includes “sbv80”, but this does not appear on the site. This text is the Sanghabhedavastu of the Mula-sarv Vinaya, which we have recently reorganized, so it looks like something got broken here.
  2. There is an issue with the nature of the partial parallel data. Now, since these texts are listed as parallels of dn2, we rightly do not infer that they are parallels of the other texts in the list. However, while it is correct not to assume this, it is still the case that often these texts will, in fact, be parallels or partial parallels. It would be desirable to check this in the future and determine these relationships more exactly, so the data structure needs to allow for this. My sample JSON above doesn’t really cut it; perhaps we need another level of abstraction. We might want to say: “X,Y,Z” are full parallels; “1,2,3” are partial parallels of X; “1,2” are partial parallels of Y; “1,3” are partial parallels of Z. (A sketch of what this could look like follows this list.)

     Another thing we might consider is to introduce what we might call “related” suttas. Take the case of DN2. It has two main components, the gradual training and the narrative with Ajatasattu. In some versions, the narrative is presented without the gradual training; in others, the gradual training without the narrative. These should properly not be considered partial parallels, for they have little or nothing in common as texts. Yet the two together make up what in other versions is the full sutta. So we could consider these “inferred partial” texts as “related”. This happens quite a few times in the Mahaparinibbana, for example.
  3. We should clarify the way the SHT fragments are handled. They currently appear as full parallels, but in fact they are fragments. They’re also listed separately in the JSON file. In a sense these are “partial”, but not in the same way as the rest of them: they are presumably from full parallels, but we only have a fragment of the manuscript. They could be considered “fragmentary parallels”.
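To give a rough idea of what this would require (folding in the “related” and “fragmentary” categories from points 2 and 3), the structure might be shaped something like the following - the field names are only suggestions, and X/Y/Z and 1/2/3 stand for uids as in the description above:

{
  "full": ["X", "Y", "Z"],
  "partial": {
    "X": ["1", "2", "3"],
    "Y": ["1", "2"],
    "Z": ["1", "3"]
  },
  "related": { "X": ["..."] },
  "fragmentary": { "X": ["..."] }
}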

Here are some other issues that have come up.

  1. I’ve checked the data for the problem case of MN 36, which was flagged by a user some time ago. This lists MN85 as a partial parallel. On the site, MN85 appears as both full and partial; it is inferred. Now, the JSON, unlike the csv, explicitly mentions MN85 as both a full and a partial parallel. This is good, in that the data is more fully expressed: presumably the JSON must include the inferred parallels. However, obviously they should not duplicate, so we need to check for and correct this error. Partial obviously takes precedence.
  2. The data in this entry also shows a number of peculiarities. I thought one of the main points of the JSON was to have “sets” of parallels, but here we have a whole list of entries. Why is sf64 in a separate entry, while sf5 is part of a set with mn85? Why does each of the Sanskrit texts get its own entry? It seems like the data is only half-digested.
  3. Our IDs still use the “dq” acronym, which I loathe. It stands for “Derge/Peking”, which are two separate editions of the Tibetan canon. Unlike the Chinese Taisho canon, there is no single edition that has become the de facto standard for reference. But this is no excuse for having such a weird hybrid as a basic element of our data. We should use the Derge edition as our basis, and list the relation with the Peking edition separately. I have discussed a better way of handling Tibetan texts here; we should take this opportunity to implement it.
  4. Along similar lines, we sometimes have alternate titles listed for texts, e.g. DN 11 Kevaddha [Kevaṭṭa]. This also should not be part of the basic data. We should choose one, based on the spelling used in our Pali text. The alternates should be listed separately and used in appropriate places. In fact, we really should do this:
  • generate a new list of titles extracted from our text,
  • filter out duplicates with existing titles,
  • include non-duplicating titles as variants. In most cases these will be identical with the actual variants as found in the MS text. However, it still makes sense to display them in the division etc. views, as these variant titles were included specifically because they are actually often used. Most people know MN 26 as “Ariyapariyesana”, not as “Pasarasi”.
  5. The hierarchical structure is still not fully represented in the JSON. Mostly we just have vaggas, with samyuttas and nipatas for SN and AN respectively. We need to include the following for the 4 nikayas (a sketch of the SN case follows this list). I haven’t checked the other texts yet; this needs to be done throughout. Levels that are not always present are bracketed. Note that with SN we need to distinguish “big” vagga and “little” vagga.
  • DN: vagga
  • MN: pannasa, vagga
  • SN: vagga, samyutta, (pannasa), vagga
  • AN: nipata, (pannasa), vagga
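For example, the SN case would nest something like this (names and uids shown for illustration only, following the same conventions as the dumps above; the optional pannasa level is omitted):

"sn": {
  "sn-vagga-1": {
    "name": "Sagāthā Vagga",
    "type": "vagga",
    "sn1": {
      "name": "Devatā Saṃyutta",
      "type": "samyutta",
      "sn1-vagga-1": {
        "name": "Naḷa Vagga",
        "type": "vagga",
        "sn1.1": {
          "name": "Oghataraṇa"
        }
      }
    }
  }
}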

I’ll be offline for 2 weeks, so here is a new json dump of the parallel data we have at present, including the new dhp, avs, mvu and other recently added parallels.

https://github.com/suttacentral/suttacentral-data/blob/master/table/parallels.json