JSON parallels, thoughts and ideas

sujato · February 22, 2016, 11:30pm

From @blake, moved here from email, discussion with him and @vimala.

here are my thoughts, which are particularly bent on eliminating sutta uids created only to express parallel relationships.

First early on was mentioned “supersets”, or when there are multiple uids which are parallel to one id.

I suggest this could be done with commas or hyphens:

{
   "uids": ["ea9.7", "t792", "an1.1,an1.2,an1.3,an1.4,an1.5"]
}

Commas are not normally used in uids, so putting all the uids in one string makes it clear that, as a group, they are parallel to the others. It requires a parsing step to recognize that a single string holds multiple uids, but that’s okay.

And as for ranges:

{
   "uids": ["ea9.7", "t792", "an1.1-an1.5"]
}

With fully qualified uids this is viable.

In terms of parallel calculation logic, hyphen ranges would for the most part be substituted with commas (a range could be considered an abbreviated way to express a comma-separated list of uids), but the use of a hyphen could serve as a hint to the presentation.

Is this complicated to display? Kind of, but it’s not too bad. When displaying the parallels for ea9.7, it can just show an1.1-an1.5 as a parallel. When displaying the parallels for an1.1 through an1.5, it shows ea9.7 as a superset parallel, while technically tricky it’s possible using rowspan to do something like this (quick mockup - not exactly what it’d need to look like):

Which I think could be pretty cool - and yes overlapping ranges can be handled using a combination of extra cells, rowspan, colspan and black magic.

The other kind of relationship is a “subset” relationship defined by ids within a text, and I suggest the same syntax be allowed: Use commas or a hyphen and always prepend an id with a hash even if it’s part of a range or comma group:

["dn2#wp11-#wp20"]

Or in the case of commas:

["dn2#wp11,#wp21"]

While you could fully fully qualify all uid#id range references as in ["dn2#wp11-dn2#wp20"] I feel this is overkill, simply prepending all ids with a hash should suffice as the '-#' and ',#' combinations are impossible to interpret in any other way, they must belong to the preceding uid. Another reason to dislike fully fully qualified is that overwhelmingly the use case for this will be multiple ids from the same sutta, I wouldn’t even like to support something like: "dn1#wp20-dn2#wp146" even if it’s not technically impossible it should probably be illegal just to simplify things. Final reason to dislike it is that it will be closer or identical to the form used in the URL: We definitely aren’t going to say /pi/dn2#wp11_dn2#wp21

So what I suggest is the following:

### Superset Relationship

When several things are parallel to one thing:

["ea9.7", "t792", "an1.1,an1.2,an1.3,an1.4,an1.5"]

or

["ea9.7", "t792", "an1.1-an1.5"]

as an abbreviated form.

This should be displayed where possible as being a connected group. Where not possible (or if it’s clunky) to visually represent this relationship we just do our best.

Uids should always be fully qualified, instead of an1.1-5, it should be an1.1-an1.5, although presentation logic is allowed to collapse this down to the prior form.

### Subset relationship

When a part of a thing is parallel to a thing

["uid#id"]

or

["uid#id1-#id4"]

or

["uid#id1,#id2,#id3,#id4"]

ids in a range or comma list should always be written in full and be prefixed with a ‘#’, this is to clearly disambiguate them from uids.
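The subset syntax above could be parsed along these lines. This is only a sketch in Python: the function name `parse_subset_ref` and the tuple shapes it returns are made up for illustration, and it assumes that "#" never occurs inside a uid or an id.

```python
def parse_subset_ref(ref):
    """Split 'uid#id1-#id4' or 'uid#id1,#id2' into (uid, parts).

    Each part is ('id', name) or ('range', first, last).
    """
    uid, sep, rest = ref.partition("#")
    if not sep:
        return uid, []  # a plain uid with no subset
    parts = []
    # ',#' and '-#' can only separate ids, so splitting on them is safe
    # even when an id itself contains a hyphen.
    for chunk in rest.split(",#"):
        first, dash, last = chunk.partition("-#")
        if dash:
            parts.append(("range", first, last))
        else:
            parts.append(("id", first))
    return uid, parts


print(parse_subset_ref("dn2#wp11-#wp20"))
print(parse_subset_ref("dn2#wp11,#wp21"))
```

Because every id carries its own "#", a mixed form such as `#1-#46,#190-#240` falls out of the same two splits without any extra rules.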

It gets a bit nutty to combine a superset and subset relationship (i.e. parts of multiple suttas, are parallel to another sutta) and it might be better to explicitly forbid it and only allow 3 types: The straightforward uid to uid, the superset and subset relationship, and not allow a superset of subsets.

Please note all of this is about representation in data, not in URLs or presentation.

Since the issue of hyphens in ids came up: it is easier if they go, but it is technically possible to deal with them. You can use a regular expression to identify the parts of the string which are text ids. Such a regular expression can be generated automatically by reading all the ids from the texts and finding the common alphabetical-hyphen prefixes, and it can then be used in the parsing step.

Hyphens in uids are not a problem, the server at any time knows every uid in use so it can trivially and accurately identify uids within a string regardless of their component characters.
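As a rough illustration of why hyphens in uids need not be a problem, here is a Python sketch that expands a comma or hyphen group into explicit uids. It assumes the server can supply every known uid in collection order; `ordered_uids` and `expand_uid_group` are hypothetical names, not existing code.

```python
def expand_uid_group(group, ordered_uids):
    """Expand 'a,b' or 'a-b' (or a mix) into an explicit list of uids.

    ordered_uids: every uid the server knows, in collection order
    (an assumption made for this sketch).
    """
    index = {uid: i for i, uid in enumerate(ordered_uids)}
    result = []
    for item in group.split(","):
        if item in index:  # a single uid, hyphens and all
            result.append(item)
            continue
        # Try each hyphen as the range separator and accept the split
        # where both sides are known uids.
        for pos, char in enumerate(item):
            if char != "-":
                continue
            start, end = item[:pos], item[pos + 1:]
            if start in index and end in index and index[start] <= index[end]:
                result.extend(ordered_uids[index[start]:index[end] + 1])
                break
        else:
            raise ValueError(f"cannot interpret {item!r}")
    return result


uids = ["an1.1", "an1.2", "an1.3", "an1.4", "an1.5", "sa-2.2", "sa-2.3", "sa-2.4"]
print(expand_uid_group("an1.1-an1.5", uids))
print(expand_uid_group("sa-2.2-sa-2.4", uids))
```

The range form is reduced to the comma form before anything else happens, so the rest of the parallel logic only ever sees explicit uids.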

sujato · February 23, 2016, 12:08am

Is there any reason for this ambiguity? Would it not be better to simply use hyphens always? To put it another way, is there any use case where hyphens couldn’t be used?

This could get complicated very quickly: table display is not for the faint of heart! But okay, it sounds like a great way to display such complex relations, let’s see if we can work it.

In the image, it seems as if the Translations column is kind of orphaned. I wonder whether logically speaking it doesn’t belong to the left of the Parallels. It is an expression of the same text, whereas the Parallels are pointing to different texts. But anyway, a minor detail.

Throughout here, just to check I’m understanding correctly, you’re using “id” for the id tags inside texts, and uid for the text id, i.e. what is in the ID column of the tables.

So you’re suggesting that only hashtagged ids within texts should not be fully qualified. Everything else is fully qualified, is that right?

I agree. I can’t think of a case where this happens. But we should keep our eyes out, we may be missing something. There’s no reason in principle it shouldn’t occur, but thanks to the semantic organization of texts on SC it will hopefully be rare or non-existent. On somewhere like CBETA, for example, where the texts are divided by juan, i.e. arbitrary manuscript division, this will occur more often.

I suddenly thought: how on earth do you express a range of ids in a URL?

Did it? I thought we were talking about hyphens in uids.

Okay, as far as the computer is concerned, but they are still visually confusing, e.g. sa-2.2-sa-2.4. One way to get around this would be to always use a double hyphen for ranges: sa-2.2--sa-2.4. Then we can consistently convert -- to an en-dash where appropriate.

blake · February 23, 2016, 9:11am

There are, actually: these two parallel relationships each combine two non-adjacent suttas into one superset:

ea-2.41,an2.19_an3.29,,
ea-2.3,an3.29_sn22.57,,

The need for commas is AFAIK restricted to those two cases, so it is rare, but it exists. If it weren’t so darn clumsy to define in our existing/old setup perhaps it might have occurred more often, I don’t know.

Indeed, not for the faint of heart. It involves adding extra columns as required and then colspanning away the “excess” when not needed. But even if we only deal gracefully with the simplest cases (such as when there is no overlap) it’ll probably handle it well 95% of the time. And a simple but not quite so graceful fallback such as just listing the other parallel multiple times in the column in the row for each sutta should still leave it clear enough what is going on.

I immediately thought the same thing upon looking at my mockup, that perhaps the parallels column should be the rightmost column.

That is right. I’m using “uid” to refer only to a reference to a text, and “id” to refer only to HTML id attributes. I suppose technically the combination of uid and id would be a URI.

You can’t as a HTML bookmark. But technically you can put anything you darn well like after the hash and let your own javascript make sense of it. It’s a very common technique these days because it permits bookmarking and sharing of links of page states too complex to be expressed as a mere location within a static document.

t-linehead came up as an example, I think there are others too, but perhaps only in translations.

Double hyphens is certainly acceptable

sujato · February 23, 2016, 9:49am

It seems like overkill to program this ambiguity into the whole system for these two obscure cases. I wonder if there’s another solution.

Let’s try this with our upcoming versions.

Okay, so you could have something like mn2#3?range=#3--#6

Then it would open at the right ID, and we could javascript a highlight for the range.

blake · February 23, 2016, 10:03am

In a sense it’s not really an ambiguity because the first thing the code will do is try to convert the range into a list of uids so it knows exactly what uids are participating in the relationship, the comma list is thus “close to the metal” and explicit and the hyphenated form is a user friendly abstraction on top of it which the software has to convert into a form it can work with.

There are a lot of things comma form can do in principle, one example is if a bunch of extraneous stuff is inserted into the middle of a sutta, you could exclude the inserted part by using commas, something like #1-#46,#190-#240, the exact part which is parallel could then be marked up by the server or javascript (that’s an example of commas in ids, but the same could apply to uids, like if a single long sutta in another place is found split up, but not purely sequentially).

I say it can be done in principle - obviously someone would have to do it. But it’s not more complex to implement (arguably the hyphenated form is more complex) and I would like if the possibility of that level of precision exists.

It wouldn’t open at the right ID because the browser will try to interpret the fragment identifier as a whole as the bookmark (i.e. it would try to find an id or name equal to “#3?range=#3--#6”). But it is trivial in JavaScript to scroll the window to the correct location. The dumbest algorithm is to slice off one character at a time from the right until what remains matches an actual id, then scroll to that location. You could easily run such a simple snippet of JavaScript before loading other JavaScript, so the browser would open at the right location. JavaScript can run long before the page has even started to render.
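The slicing algorithm is short enough to sketch. It is shown in Python here purely to illustrate the logic; in the browser it would be JavaScript doing the same loop against the ids in the document. `find_scroll_target` and `ids_on_page` are illustrative names, and the fragment is assumed to have its leading "#" already stripped.

```python
def find_scroll_target(fragment, ids_on_page):
    """Return the longest prefix of `fragment` that is an actual id, or None."""
    candidate = fragment
    while candidate:
        if candidate in ids_on_page:
            return candidate
        candidate = candidate[:-1]  # slice one character off the right
    return None


print(find_scroll_target("3?range=#3--#6", {"3", "4", "5", "6"}))
```

Since the loop starts from the full string and only shortens it, the longest matching id wins, so an id such as `wp11` is preferred over `wp1`.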

blake · March 7, 2016, 2:00pm

While I’ve been working with JSON parallels I’ve come to realize our approach is seriously flawed, or at least very limited and we’re kind of bumping into that limit.

What we’re dealing with and want to define is fundamentally relationships between suttas. Our current methodology is only able to deal with symmetrical relationships. If A is a full parallel of B, then B is a full parallel of A. If X is a partial parallel of A, then A is a partial parallel of X. And that’s okay if symmetrical relationships are all we want. But I reckon it’s not all we want.

The idea of the set can deal just fine with symmetrical relationships: the set [A, B, C] means that A, B and C are all full parallels of each other. We can say they are all equal partners in the parallel relationship; none of them is at the center of it.

But how on Earth can that deal with asymmetrical relationships? It simply can’t! In an asymmetrical relationship we don’t just have a bunch of objects hanging out together, we have an object and a subject.

I’ll just make up a concrete but possibly useful example for someone studying suttas: “Mentions”. From time to time, a sutta is mentioned by name in another sutta. So we can express this by saying “A mentions B”, and this is a perfect example of an asymmetrical relationship: if A mentions B, it doesn’t stand to reason, and is practically an impossibility, that B mentions A (one of them had to come first, after all!). But this relationship can be expressed in an equivalent inverse form: “B is mentioned in A” (in grammarian language, I guess it’s like active vs passive).
But along with relationships like “mentions” there are plenty of parallel-centric concepts, like one sutta being an abridged version of another, such a relationship is also clearly asymmetrical. One sutta is the abridged one, the other the unabridged. Or one sutta being a fragment and the other complete.

So that is issue #1: We have symmetrical relationships which are handled gracefully by sets, and we have asymmetrical relationships which really cannot be handled by our sets model at all.

The second thing is looking at it from the user perspective (in this case, the person defining the relationships). Sometimes that person thinks “I want to define a group of suttas which are parallel to each other”, but other times they think “There is this important sutta, and I want to define relationships between this sutta and other suttas”. If you’re studying A, you probably also want to check out X, the latter case is not gracefully handled by the concept of “a group of suttas which are equal partners” because in the thought process of the person defining the relationship, one sutta really is taking center-stage, and we can say that is okay, that it’s fine to define relationships in the context of one sutta.

So what I see, is we need two basic forms. The first form is “group-centric”, and the second form is “sutta-centric”.

{
  full_parallel: [A, B, G],
  title: "The Sutta on Foos",
  abridged: B,
  expanded: G,
  see_also: [Q, X],
  notes: "A, B and G are about foos. B appears to be an abridged purely doctrinal version of A. The doctrinal content of G is substantially identical, but with an extended narrative about a past life journey to the mystical land of foos, Q is about bars but the narrative structure is strikingly similar"
}

The above would define a group of parallels, by using the “full_parallel” key. You can further refine relationships by adding extra clauses, these relationships are with the group as a whole and not any individual member of the group per-se. The see_also applies to the parallel group as a whole as do notes. You can add a descriptive title which may be used in display.
A key like “full_parallel” would be special and act as the object of the entry or the “primary key”, others like “see_also” are what could be called “subordinate clauses”, they can’t define an entry by themselves (another subordinate might be “fragment” to include parallels which are only fragments). Asymmetrical relationships could only be defined by subordinate clauses. Other “primary keys” might be “cross_reference”, “verse_parallel” and such, they intrinsically define the relationship between two or more suttas.

{
  sutta: A,
  see_also: X,
  mentions: [F, H],
  notes: 
}

The above would be the sutta-centric form, it’s not really any different to using parallel: [A] but if we aren’t defining a parallel relationship we shouldn’t use the key “parallel”.

Note that the above could also be equally expressed by the inverse relationships, along the lines of:

{
  sutta: X,
  see_also: A
},
{
  sutta: F,
  mentioned_in: A
},
{
  sutta: H,
  mentioned_in: A
}

You might wonder, couldn’t we do something like this:

{
  sutta: [F, H],
  mentioned_in: [A]
}

Well we could. But it’s important to clearly distinguish between things being grouped together because they have a relationship with each other, and things being grouped together for convenience, the fact that F and H are both mentioned in A, does not imply that F and H have a special relationship with each other. This is one reason to use an explicitly meaningful key name like “full_parallel” if we really are defining a group of parallels, if we are merely grouping things for convenience another key could be used.

So anyway, what we end up with is a bunch of relationships and their inverse, the inverse relationships would be automatically generated in code.

A parallel_to B, B parallel_to A
A mentions B, B mentioned_in A
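Generating the inverses in code might look something like the following Python sketch. The `INVERSE` table and the function name are assumptions for illustration; the relation names are the ones used in the examples above, with "see_also" treated as its own inverse.

```python
# Hypothetical table: each relation mapped to its inverse.
INVERSE = {
    "parallel_to": "parallel_to",  # symmetrical: its own inverse
    "see_also": "see_also",
    "mentions": "mentioned_in",
    "mentioned_in": "mentions",
}


def as_list(value):
    return value if isinstance(value, list) else [value]


def pairwise(entry):
    """Return (subject, relation, object) triples for a sutta-centric entry,
    including the automatically generated inverse of each one."""
    subject = entry["sutta"]
    triples = set()
    for relation, targets in entry.items():
        if relation not in INVERSE:
            continue  # "sutta", "notes" and other non-relationship keys
        for target in as_list(targets):
            triples.add((subject, relation, target))
            triples.add((target, INVERSE[relation], subject))
    return triples


entry = {"sutta": "A", "see_also": "X", "mentions": ["F", "H"], "notes": ""}
for triple in sorted(pairwise(entry)):
    print(triple)
```

Only one direction ever has to be entered by hand; the other is derived, so the two can never disagree.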

Now as for display, we have two basic options, the first is to digest the data down to a bunch of pair-wise relationships like the above and display them in a list like we already do.

The other way would be to try and preserve the parallel groups in display, so instead of just showing a list, the parallels page for A would show that it is part of a group of parallel suttas: A, B and G, it could show the title for that group and any notes for that group and any other relationships which pertain to that group (such as see also), after that would come any relationships unique to A.
If you go to the page for B, you’ll see exactly the same group you saw for A, followed by any special relationships unique to B.
This IMO is a superior approach, even if the relationships are ultimately digestible down to a huge number of pairwise relationships, the form the data is entered in should provide a strong hint on how to display it.
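To preserve groups in display, the server would need to look up the groups a given uid belongs to. A minimal Python sketch, assuming group entries shaped like the “full_parallel” example above (`index_groups` is a hypothetical helper):

```python
from collections import defaultdict


def index_groups(entries):
    """Map each uid to the group entries it belongs to."""
    by_uid = defaultdict(list)
    for entry in entries:
        for uid in entry.get("full_parallel", []):
            by_uid[uid].append(entry)
    return by_uid


groups = [
    {"full_parallel": ["A", "B", "G"], "title": "The Sutta on Foos", "see_also": ["Q", "X"]},
]
by_uid = index_groups(groups)
# The pages for A and B are handed the very same group object.
print(by_uid["A"][0] is by_uid["B"][0])
```

The title, notes and see_also stay attached to the group rather than being copied onto each member, which is what lets every member's page show an identical group.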

In fact ideally I would love to see each parallel group having its own URI, so parallel groups become promoted to things in their own right. You could go to a “tradition agnostic” page which is for the “Buddha’s life story” suttas, or parajika 1.

So in this post I’ve brought up two basic concepts for consideration:

The idea of supporting asymmetrical relationships, or inequalities. Clearly we can live without this feature, because we have so far, but it opens a whole host of possibilities. Along with my example of “mentions”, things like specifying that A is an expanded version of B or that C is a fragment are simple, well-defined and easily understood concepts that can’t be represented with our current data model.
The idea of promoting parallel groups to “first class citizens” with special display logic and possibly a URI.

LXNDR · March 7, 2016, 3:30pm

does the system of parallels display envisage explicit indication of the shared portion of the text?

sujato · March 7, 2016, 11:31pm

This all sounds good, give me a little while to digest it. Just a couple of questions initially, though.

I assume all this is incremental? We can enrich data as we go along?
There are a number of other cases of types of parallels: incremental parallels (mostly in AN), also “template” parallels (in SN), where different topics are treated through the same template.
If we are to have “clusters” as a first class entity, then they need a proper URL structure. Marcus Bingenheimer used the idea of a “cluster”, but he still keyed it off one text. I guess a text-agnostic cluster would have to have a completely arbitrary ID. Which is easy enough. But what is the use case, I wonder? Normally someone uses a text, then they want to find parallels. Why would you look for a cluster? Maybe if you’re searching for a topic, you get results back as a cluster? But I’m not sure that is useful: a cluster is a more abstract entity, and harder for a user to get their head around. Or would the “cluster” idea simply be for the sake of back-end logic?

sujato · March 7, 2016, 11:36pm

Yes. Currently we do this on a small scale. For example if you see the parallels for AN 1.39

And you click on the link for SF 180 you get sent to here, which has the parallel portion highlighted.

This is what we call an “embedded” parallel.

Currently it requires that we insert special tags to handle such cases. Under the new system, however, we could define this purely in data and apply it much more widely.

blake · March 8, 2016, 2:32pm

Agreed. We should only use the word “parallel” when we are actually defining parallels.

Fair point. Could be relevant if we want to define relationships between clusters (do we? is that going too complex?). For the end user it’s only relevant if we have a URL pointing to the cluster and we want that URL to remain the same.

I am in total agreement with doing away with partials. Personally I’m not a fan of abbreviations; we should use full words unless they are absurdly long. Using full words makes things more readable.

I’m afraid you don’t understand the limitations of JSON collection types. In the examples you have used something called a ‘Set’ in Python. In Python you can say {1,2,3}; it is an unordered collection which you can do cool things like intersections and differences with. JSON only has two collection types: the Array, which is an ordered list, and the Object, which is a key/value mapping. You have the choice of:
["A", "B", "C"]
or:

{
A: null,
B: null,
C: null
}

The key value mapping has to have a key and a value, you can set the value to nothing, or something, or whatever. It’s not cleaner than arrays. Something to bear in mind though if you actually want a key/value mapping (like you want to assign some value or some parameters to each uid), a common pattern is something like:

{
  A: {...properties of A},
  B: {...properties of B},
  C: {...properties of C}
}

This could be useful for us, for the sake of argument something like:

{
  A: "primary",
  B: "abridged",
  G: "expanded",
}

or:

{
  A: "primary",
  G: "primary",
  H: "embedded"
}

That’s not too different to the concept of using a list of strings with an ‘*’ for a partial parallel, but allows using string types or even full objects.

If we were to go this route, we could use mixed keys and uids, this would be by convention but it’s not too different to what you do in Python and Javascript with objects, the idea would be to do something like this:

{
  relationship_type: "parallel",
  note: "you can put anything you like here",
  A: "primary",
  G: "primary",
  H: "embedded"
},
{
  id: 79653764539,
  relationship_type: "parallel",
  thing_type: "verse",
  T: "primary",
  Yab: "primary",
  U: "resembles",
  Fcd: "primary",
  Q: "fragment"
}

I think that way of using objects also has potential. Also, while a Set cannot handle asymmetrical relationships, a Mapping can because the value can specify a type.

sujato · March 8, 2016, 11:55pm

Otherwise known as “downloading it from github.”

Sarcasm aside (yes, that was sarcasm!), I wonder whether this would be useful for visualization or some statistical analysis. Maybe we’d do something like this.

Finally a persuasive use case! Elsewhere I developed the idea of “things and metathings”. If clusters were first class entities it would allow us to make metathings for them.

sujato · March 9, 2016, 2:15am

Here’s some more (hopefully) considered thoughts on the original post.

First up, the idea of asymmetric relations is important, and I definitely think we should include it. But I’m not persuaded by all the details of the model so far.

I can’t understand why we need the two separate models for “sutta” and “parallel”.

And, BTW, I think we should get used to using “ID” rather than sutta, as all our data going forward should be based on the more abstract notion of IDs.

I suggest that as well as abstracting from “sutta” to “ID” we should also abstract from “parallel” to “relation”. That is to say, our data should express relations between IDs, rather than parallels between suttas.

Also, I realized an additional advantage in using the notion of an explicit cluster with an ID. It allows the cluster to continue even if the members of the cluster change. So each cluster is a set of relations between IDs.

Then “full parallel” becomes merely one form of relation.

I’m hoping to do away with the vague notion of “partial parallel”, so let’s try using just “parallel”. As for “see_also”, in fact we usually use cf. or cp. for this, from “compare”, so why not just use that?

One detail: should we not use the inherent JSON logic of objects vs arrays to indicate whether a relation is symmetrical or asymmetrical?

One more detail: I think it would be better to restrict the use of “subordinate clauses” to those IDs that are already present in the “primary key”. The reason is that if the ID is not present in the primary key, we run into the kind of problem which you tried to solve by having “sutta” and “parallel” kinds of entities. The last relation in the following examples has to be called a “see_also” (because there is no full parallel). But that means “see_also” ends up being both a primary relation type and a subordinate clause type. Maybe this is okay, but anyway see what I’ve done.

{
  cluster_ID: 97635479654,
  relation_type: parallel, 
  thing_IDs: {A, B, G}, //Since this is unordered use object rather than an array.
  title: The Sutta on Foos,
  sub_abridged: B, //Perhaps we should use sub_ to define subordinate types.
  sub_expanded: G,
  notes: "A, B and G are about foos. B appears to be an abridged purely doctrinal version of A. The doctrinal content of G is substantially identical, but with an extended narrative about a past life journey to the mystical land of foos, Q is about bars but the narrative structure is strikingly similar"
}

{
 cluster_ID: 564286548654,
 relation_type: compare,
 thing_IDs: {{A,B,G},{Q,X}},//"Compare this set of things with that set of things".
 sub_abridged: Q
}

{
  cluster_ID: 6852438547954,
  relation_type: mention,
  thing_IDs: [A, F], //In this case the array specifies the direction of the mention, i.e. "A mentions F"
}

One thing I’m not sure about when it comes to “mentions”. Consider the following cases:

1. The Buddha said this in the Foo Sutta: “All foos are bars”.
2. The Buddha said this: “All foos are bars”.

Number 1 is a mention. Is number 2?

Anyway, to proceed with your examples:

{
  cluster_ID: 68527597556,
  relation_type: mention,
  thing_IDs: [A, H], 
}

{
  cluster_ID: 2654376594,
  relation_type: compare,
  thing_IDs: {A, X}
 }

Here are a few other use cases.

Three IDs are parallel. One is found embedded in a larger text.

{
  cluster_ID: 4579654063,
  relation_type: parallel,
  thing_IDs: {A,G,H}, 
  sub_embedded: H,
}

Five verses are parallel. In Y and F, the parallel only applies to half the verse. However, in Y, the half-verse includes half the material of the other verses, while in F it includes the full text, just in two lines. Moreover, U differs in phrasing from the others. Q is just a fragment.

{
  cluster_ID: 79653764539,
  relation_type: parallel,
  thing_type: verse,
  thing_IDs: {T,Yab,U,Fcd,Q}, 
  sub_section: Yab,
  sub_resemble: U,
  sub_fragment: Q
}

Perhaps we could use “thing_type” to define things whose nature is not obvious from the ID. A straight sutta or Vinaya text doesn’t need this, as the thing type can be inferred from the ID. Another possible thing type might be a simile.

sujato · March 10, 2016, 2:39am

Sure, yes. It’s not something that I can see a need for right now, but there’s no reason in principle why we shouldn’t do it. We could, for example, analyze degrees of closeness, sets of kinds of relations (eg. suttas that are mentioned more than once, or mentioned verses), and so on.

Okay, I’m just getting my head around this. Why, I am wondering, doesn’t JSON have a simple unordered list?

Vimala · March 10, 2016, 10:14am

I’m trying to get my head around all this because I have not looked at this topic for the last 3 weeks.
I am looking at the data we currently have and the discussion here, which is where we ideally would want to go to. But our current data-set is much too limited at present to achieve this, so yes, somebody would have to go over the whole lot and make adjustments.

When working with the dhp parallels, I noticed a couple of things that might be helpful for the current conversation:

First @blake mentioned about using commas or hyphens to denote subsets.
Maybe both should be allowed, not just one or the other.

1. Hyphens would be easier in most cases, because if subsets are very large, especially if you are working with parts of the text based on ID, a comma list would become a very large string (for instance, uid#id3-#id50 written out in full).
2. But there are cases where this does not work, because part of the relationship is between one verse and a combination of two non-adjacent verses.
I found at least a few cases in the DHP parallels. For instance dhp106 with gdhp310 and gdhp320. Right now the current system is happy to denote gdhp310 as a parallel of gdhp320, because both are mentioned as a parallel of dhp106. So denoting it as [dhp106,"gdhp310,gdhp320"] could be a solution. But on the other side, this issue also has to do with the vague notion of “partial”.

Then there are the cases that we now denote with a * or with a ,1, in the csv. Which is partial of which? Is text1 fully inside text2 or the other way around? Right now we just denote both with a *, which is incorrect. I’m also in favor of doing away with the vague notion of “partial” but we still have to deal with the consequences that remain in our current data-set.

I have made a lot of changes to correspondence.csv lately, but many of these won’t show up on the site yet because I have included the id or ranges where needed so I commented out those entries in order for the system to keep functioning.

So where do we go from here? The biggest bottleneck I see is that our current data-set needs checking and updating with the correct #id and other necessary data.

blake · March 10, 2016, 10:56am

Mainly because javascript doesn’t have one. If for example there was a PYON data format then it probably would have an unordered list.

Actually even ECMAScript 2015/2016 does not have an unordered list. It has a Set class, but it remembers the order you insert things into it. I think the reason is that the overhead of remembering the order is minimal to nil (depending on the underlying algorithm), especially in the context of modern hardware and software. There are some use cases where you want to remember the order, and you can do anything with an OrderedSet which you could do with an UnorderedSet. For example, with an UnorderedSet you always have to sort it before displaying it if you want a predictable and consistent order; with an OrderedSet you don’t need to sort it, but Set operations like intersections and membership tests are completely in common. Having an enforced unorderedness is more ideological than practical (back in, say, 2004 there were practical performance reasons; not so in 2016).
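In practice the lack of a set type in JSON costs very little, because the consuming code can treat an array as a set once it is loaded. A small Python sketch (`load_members` is a made-up helper name):

```python
import json


def load_members(text):
    """Parse a JSON array of uids and treat it as an unordered set."""
    return frozenset(json.loads(text))


a = load_members('["A", "B", "G"]')
b = load_members('["G", "A", "B"]')

print(a == b)                              # order in the file is ignored
print(a & load_members('["A", "Q"]'))      # set operations work as usual
print(sorted(a))                           # sort only when displaying
```

The array stays the storage format; unorderedness becomes a property of how the data is read rather than of how it is written.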

blake · March 14, 2016, 8:48am

Further thoughts on mappings:

I alluded to the idea of including both uids and properties as keys, there are some conventions surrounding this in REST APIs. I’ll use Elasticsearch as an example. When you submit a document to Elasticsearch to be indexed, you do it as a JSON object, say the document looks like this:

{
  "title": "Foobar",
  "content": "A Foo which Bars"
}

You can actually submit it to elasticsearch just like that if you include the type information externally, but you can include the type and id in this way:

{
  "_type": "nonsense",
  "_id": "foobar",
  "_source": {
     "title": "Foobar",
     "content": "A foo which bars"
  }
}

_type, _id and _source are what might be called meta properties, starting with an underscore means they are reserved. There is also a shorthand form which omits _source!

{
  "_type": "nonsense",
  "_id": "foobar",
  "title": "Foobar",
  "content": "A foo which bars"
}

When elasticsearch receives an entry like that, it takes everything which doesn’t start with an underscore to be a property of the document.

So getting back to defining relationships between groups.

The group may have a “type”, however this could be implicit in the participants.

A shorthand form could look like this:

{
  "_id": 324,
  "_description": "",
  "A": "primary",
  "B": "primary",
  "Q": "compare"
}

This group has three members, A and B are “primary”, Q is subordinate to the primary members.

What might a longhand form look like?

{
  "_id": 324,
  "_description": "",
  "A": {"_type": "primary"},
  "B": {"_type": "primary"},
  "Q": {"_type": "compare", "_note": "The narrative structure is very similar"}
}

In that form, we have an object for each member and can attach arbitrary information to that object, perhaps even more detailed relationship data. Having shorthand forms and the ability to take shortcuts is pretty common in JSON, the reason being that since JSON can define pretty detailed relationships, it can get real verbose real fast. It’s not a good thing for clarity when you have many nested objects.
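If both forms are allowed, the code would presumably normalize shorthand to longhand as its first step. Here is a Python sketch of that idea; `normalize_group` is a hypothetical name, and the rule that every underscore-prefixed top-level key is metadata is borrowed from the Elasticsearch convention described above rather than from any agreed spec.

```python
def normalize_group(group):
    """Split a group into (meta, members), expanding shorthand members."""
    meta, members = {}, {}
    for key, value in group.items():
        if key.startswith("_"):
            meta[key] = value                # reserved meta property
        elif isinstance(value, str):
            members[key] = {"_type": value}  # shorthand -> longhand
        else:
            members[key] = dict(value)       # already longhand
    return meta, members


shorthand = {"_id": 324, "_description": "", "A": "primary", "B": "primary", "Q": "compare"}
meta, members = normalize_group(shorthand)
print(meta)
print(members)
```

After this step the rest of the code only ever deals with one shape, so the shorthand is purely a convenience for whoever writes the data.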

We need to consider vinaya parallels. In our system vinaya parallels are unique in that they can have “negated parallels” and “indeterminate parallels”, where we explicitly say that a rule doesn’t exist in a collection, or that we don’t know if it exists in a collection due to damage to an incomplete manuscript.

The mapping is suitable for this, as we could use symbols for shorthand, consider a contracted example:

["-pi-tv-bu-pm", "pi-tv-bi-pm#pj5", "lzh-dg-bi-vb-pj5", "?skt-sarv-bi-vb"]

The “-” indicates “does not exist”, the “?” indicates “unknown”.

Converting this data to a mapping is very simple:

{
  "pi-tv-bu-pm": "-", 
  "pi-tv-bi-pm#pj5": "=", 
  "lzh-dg-bi-vb-pj5": "=", 
  "skt-sarv-bi-vb": "?"
}

Here I used symbols. I used “=” to indicate equivalency, “-” for negation and “?” for unknown. Symbols have potential because they are concise and often used to define relationships between things in a way which is clear and unambiguous. There are some other useful symbols like “>” for “greater than” and “<” for “less than”; also “~” can stand for weak equivalence where “=” isn’t correct.

In the case where the meaning of a symbol isn’t fairly self-evident for someone who is broadly familiar with mathematical or logical notation it would probably be better to use a word instead, we probably don’t want to assign arbitrary meanings to symbols.
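As a sketch of the list-to-mapping conversion described above, something like the following Python would do. It assumes that a leading “-” or “?” is always a marker and never part of the uid itself (uids contain hyphens, but none of the examples start with one); list_to_mapping is a hypothetical name.

```python
def list_to_mapping(uids):
    """Convert a prefixed uid list into a uid -> symbol mapping."""
    mapping = {}
    for item in uids:
        if item[:1] in ("-", "?"):
            # The first character is the marker, the rest is the uid.
            mapping[item[1:]] = item[0]
        else:
            # No marker means an ordinary (equivalent) parallel.
            mapping[item] = "="
    return mapping


contracted = ["-pi-tv-bu-pm", "pi-tv-bi-pm#pj5", "lzh-dg-bi-vb-pj5", "?skt-sarv-bi-vb"]
assert list_to_mapping(contracted) == {
    "pi-tv-bu-pm": "-",
    "pi-tv-bi-pm#pj5": "=",
    "lzh-dg-bi-vb-pj5": "=",
    "skt-sarv-bi-vb": "?",
}
```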

sujato · March 14, 2016, 9:59am

That all makes sense.

So these could be used for:

a>b: a includes b
a<b: a is included in b
a~b: resembling (or "see also") parallel
a=b: full parallel

blake · March 14, 2016, 10:54am

Yes, that would be an acceptable usage, I think. So far I’ve been showing relationships with the group as a whole, so:

{
  "A": "=",
  "B": "=",
  "C": ">"
}

Would mean that A and B are approximately equal in content, and C has “more” content than the “average” or “prototypical” member of the group.

But we could easily imagine being able to refine relationships within a group, imagine something like this:

{
  "mn10": "=",
  "ea12.1": "=",
  "ma98": "=",
  "dn22": {
    "_type": "=",
    ">": "mn10"
  }
}

Which expresses “All these things are equivalent enough to be called full parallels, and DN22 is a longer version of MN10”. One of the ultimate tests here is readability: When we look at the data, is it clear what it is expressing?

For instance I also considered:

  "dn22": {
    "_type": "=",
    "mn10": ">"
  }

The obvious problem being that it doesn’t follow any natural grammatical principles, is it saying that dn22 > mn10 or mn10 > dn22 ? We’d be relying on an arbitrary convention and it’d be easy to make mistakes.
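To show why the first form reads more naturally, here is a rough Python sketch that pulls the refined relationships out of a group as (subject, operator, object) triples. It assumes the first form, where the operator is the key and the other uid is the value; relations is a hypothetical name.

```python
def relations(group):
    """Collect (subject, operator, object) triples from member objects."""
    triples = []
    for uid, value in group.items():
        if uid.startswith("_") or not isinstance(value, dict):
            # Skip group-level meta properties and shorthand members.
            continue
        for operator, target in value.items():
            if not operator.startswith("_"):
                # Key order mirrors reading order: uid, operator, target.
                triples.append((uid, operator, target))
    return triples


group = {
    "mn10": "=",
    "ea12.1": "=",
    "ma98": "=",
    "dn22": {"_type": "=", ">": "mn10"},
}
assert relations(group) == [("dn22", ">", "mn10")]
```

Each triple can be read left to right as “dn22 > mn10”, which is the same order in which the data is written.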

Vimala · April 4, 2016, 6:56am

Would it not logically follow that if dn22 > mn10, and dn22 = ma98 = ea12.1, then ma98 > mn10 and ea12.1 > mn10? The mapping example does not read like that.

Is it not easier just to use the group notation as:

"uids": ["dn22", "ma98", "ea12.1", ">mn10"]

meaning that dn22, ma98 and ea12.1 are equivalent parallels and mn10 is a shorter version of the same.

Now suppose that there is a sutta, call it qt12.1, that quotes (mentions) dn22 at sc-id 4. This can be written as:

"uids": ["dn22", "ma98", "ea12.1", ">mn10"],
"quoted": ["dn22", "qt12.1#4"],

Because these are ordered lists, the first item can always be the one that is quoted so this would mean that dn22 is quoted at qt12.1#4, which can be an embedded parallel notation at qt12.1#4 and mentioned in the dn uid list at dn22 as “quoted in qt12.1#4”.
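A small Python sketch of how such an ordered list might be read, assuming the first item is always the quoted text and every later item is a place that quotes it. The name quote_links is hypothetical, and qt12.1 is the made-up sutta from the example above.

```python
def quote_links(quoted):
    """Return (quoted_uid, quoting_uid, quoting_id) for each quoting reference."""
    if not quoted:
        return []
    # By convention the first item is the text being quoted.
    source, *quoters = quoted
    links = []
    for ref in quoters:
        # "qt12.1#4" splits into uid "qt12.1" and id "4"; the id may be absent.
        uid, _, fragment = ref.partition("#")
        links.append((source, uid, fragment or None))
    return links


assert quote_links(["dn22", "qt12.1#4"]) == [("dn22", "qt12.1", "4")]
```

The same triple serves both displays: at qt12.1#4 it gives the embedded parallel to dn22, and at dn22 it gives the “quoted in qt12.1#4” entry.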

?, ~ and - signs can also be used to denote a certain relationship with the main group. For instance if there is something that needs further explanation because it is not a sutta on SC:

"uids": ["dn22", "ma98", "ea12.1", ">mn10", "?suttax"],
"quoted": ["dn22", "qt12.1#4"],
"remarks": "suttax parallel is to be found at reference xyz"

sujato · April 4, 2016, 7:55am

Unfortunately, no.

In this example, MN10, MA 98, and EA 12.1 are of comparable length, and DN 22 is longer than all of them. But in truth each has quite a different length as compared to the others: DN22 > MA 98 > MN 10 > EA 12.1.

But “length” is a vague criterion, given the complexities of repetitions, different languages, and so on. MA 98, if my memory serves me well, has more sections than MN10, but each section is less verbose. So is it longer or shorter?

Only in the case of MN10 and DN 22 can we straightforwardly say that one text is precisely an expanded version of the other. That’s why we need to keep the length criterion isolated to these texts, not applied to the whole group.

Oh, and by the way, we should add the abhidhamma texts to this. The details are:

Satipaṭṭhānavibhaṅga, Vb7, 193–202
Dharmaskandha, T 1537.5–6, T xxvi 475c25–479b24
Sāriputrābhidharma, T 1548.28–29, T xxviii 705a28–711a01