Publication metadata for SC translations

sujato · September 25, 2019, 10:39pm

So far we haven’t kept very good records for the publication of our texts. Let’s change that! I’m proposing that we keep a central record in our main Bilara-data repo that will record all publications made on SC, primarily those done on Bilara.

The basic idea is that we keep all such info in a single file, where it can be called upon and applied where needed.

This will complement the project for collecting author information. That project is about author information, this is about publication.

Feedback is welcome!

@blake @Aminah @sabbamitta @Robbie @sgns

Aminah · September 26, 2019, 7:42pm

Metadata rocks! A big hoorah for generating good metadata!

Beyond cheers and party streamers at the general level, in terms of the specifics, isn’t there likely to be quite some room for overlap with the above and:

Might it be good to really think about how to maximize efforts in this area, reducing duplication and getting as much bang for the buck as possible from whatever potential applications the metadata has? In addition to the uses mentioned when the RDF ticket was drafted (compatibility with the BDRC project and maybe being searchable on worldcat.org are the ones I can remember off the top of my head) I’m also wondering if there’s some possible overlap with the structured data that tickles the tummy of the Google Monster (and other search engines) and also makes Bruce Lawson happy.

So. eg. maybe json-ld is something worth considering.

karl_lew · September 27, 2019, 3:12pm

A publication.json file in the Bilara repo would be versioned and regularly updated with current information. Those interested in current metadata would see current data. Those interested in historical metadata would be able to browse through the historical changes in Github.

I look forward to having publication metadata and would like to read about:

semantic versioning. Documents and their structure are an API. Some changes break things, and semantic versioning conveys the downstream impact for all users.
scope of change (i.e., which nikayas)
type of change (e.g., corrections vs. translation glossary changes vs. major new additions)
sources of change (e.g., perhaps newly discovered EBT material)
names of contributors and editors for academic reference (vs. GitHub logins)
date of release (vs. document change date)
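
As a rough illustration of the semantic versioning point, the type of change could drive which part of the version number is bumped. The change-type names and the mapping below are assumptions made up for this example, not an agreed SC convention:

```javascript
// Hypothetical mapping from type of change to the semantic-version part to bump.
// The category names are illustrative assumptions, not an SC convention.
const BUMP_FOR_CHANGE = {
  correction: "patch",  // typo fixes; nothing downstream should break
  glossary: "minor",    // translation glossary changes; structure is unchanged
  addition: "minor",    // new material added alongside existing texts
  restructure: "major", // e.g. segment ids change, so downstream users may break
};

// Return the next "major.minor.patch" string for a given change type.
function nextVersion(current, changeType) {
  const bump = BUMP_FOR_CHANGE[changeType];
  if (!bump) throw new Error(`Unknown change type: ${changeType}`);
  const [major, minor, patch] = current.split(".").map(Number);
  if (bump === "major") return `${major + 1}.0.0`;
  if (bump === "minor") return `${major}.${minor + 1}.0`;
  return `${major}.${minor}.${patch + 1}`;
}

console.log(nextVersion("1.2.3", "correction"));  // 1.2.4
console.log(nextVersion("1.2.3", "glossary"));    // 1.3.0
console.log(nextVersion("1.2.3", "restructure")); // 2.0.0
```

Someone consuming the texts could then tell from the version alone whether an update is safe to pull in without checking their own code.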

Thank you.

Aminah · October 1, 2019, 5:57pm

In fact, having poked a little further into this, I’d bump this up to “json-ld is really worth considering” if we want to live up to our moral duties as respectable citizens of the internet (and, of course, reap, or more to the point offer, the benefit of significantly improved searchability).

Why’s that? 'Cos Tim Berners-Lee said! The specific words he used were:

if you’re responsible – if you know about some data in a government department, often you find that these people, they’re very tempted to keep it – Hans [Rosling] calls it database hugging. You hug your database, you don’t want to let it go until you’ve made a beautiful website for it. Well, I’d like to suggest that rather – yes, make a beautiful website, who am I to say don’t make a beautiful website? Make a beautiful website, but first give us the unadulterated data, we want the data. We want unadulterated data. OK, we have to ask for raw data now.

(The Next Web, Tim Berners-Lee, 2009)

Now, of course, it would be entirely off the mark to call SC a database-hugger (quite the opposite); nevertheless, it does need to take a step or two further in order to contribute to the construction of the Semantic Web in the way that it should. Using json-ld is a pretty straightforward means of doing so and we’ll fare better in search engines as a result.

JSON-LD, you say?

“The JSON-LD data model allows for a richer set of resources, based on the RDF data model … JSON-LD is a concrete RDF syntax as described in RDF11-CONCEPTS. Hence, a JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model. However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize generalized RDF Datasets.” (JSON-LD 1.1: A JSON-based Serialization for Linked Data)

Further, certainly with respect to the kind of metadata file outlined in the ticket given in the OP, implementation looks quite easy. It’s perfectly possible to build a SuttaCentral vocabulary and keep the exact value names given in the sample json. That said, however, on the face of it, it would seem a lot more sensible to just use the already existing, extensive and widely used schema.org vocabulary for most things and only use self-generated terms for properties that don’t already exist elsewhere.
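
To make the idea concrete, here is a minimal sketch of a record that uses schema.org terms where they exist and a single self-generated term where they don’t. The input field names, the `sc` namespace URL and the `sc:textUid` property are all hypothetical, chosen only for illustration; `CreativeWork`, `name`, `url`, `translator` and `Person` are existing schema.org terms.

```javascript
// Sketch: describe a translation with schema.org terms plus one custom term.
// The "sc" namespace URL and the "sc:textUid" property are hypothetical.
function toJsonLd(record) {
  return {
    "@context": {
      "@vocab": "https://schema.org/",      // unprefixed terms resolve to schema.org
      "sc": "https://suttacentral.net/ns#", // assumed namespace, for illustration only
    },
    "@type": "CreativeWork",
    "name": record.title,
    "url": record.url,
    "translator": record.translators.map((name) => ({
      "@type": "Person",
      "name": name,
    })),
    "sc:textUid": record.textUid, // custom term: assumed to have no schema.org equivalent
  };
}

const example = toJsonLd({
  title: "Verses of the Senior Monks",
  url: "https://github.com/suttacentral/bilara-data/tree/master/translation/en/sujato/kn/thag",
  translators: ["Bhikkhu Sujato", "Jessica Walton"],
  textUid: "thag",
});
console.log(JSON.stringify(example, null, 2));
```

Whether `CreativeWork` or a more specific type such as `Book` fits best would be something to decide when the vocabulary is actually worked out.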

Now, although it stretches the scope of the thread a little (as it goes beyond specifically SC translations), as it’s directly related, it’s also worth noting that there’s no good reason why this shouldn’t be applied to legacy texts too; at the very least in some stripped down version that extracts data given in these files via a few additional classes.

My reading around the subject suggests that implementation here should also be reasonably straightforward; that it would very naturally tie in with the legacy text page upgrade; and further that it would be a better avenue to go down than what’s proposed in Update SC's social sharing information · Issue #1486 · suttacentral/suttacentral · GitHub (although it may not resolve the issue at the heart of the ticket, as popular social networking sites have their own markup they like folks to use, too, so we’d have to see the results, but in any case it would be the first step).

karl_lew · October 8, 2019, 8:28pm

Aminah, thanks for the link to JSON-LD. I’ll be using that in my other projects. It’s great to see this kind of standardization. As you’ve mentioned above, we can use standard names when applicable and simply add new ones as required.

Cheers,

{
   "@context": "https://schema.org",
   "@type": "Person",
   "name": "Karl Lew"
}

Aminah · October 9, 2019, 8:02am

Marvellous; glad it was of use!

The more I’ve looked into it, the more the generation of linked data seems quite an involved undertaking; neat 'n all, but something that definitely requires a bit of thought and investment.

Whatever the case, though, using json-ld (and already existing vocabularies) seems to fit the needs of the OP, is straightforward enough, offers immediate SEO benefits, and lays a foundation for dataset production should the moment arise at some point.

sujato · October 9, 2019, 8:29am

I agree we should use JSON-LD. I have made an initial commit of a publications file, and will look at adapting the form:

https://github.com/suttacentral/bilara-data/blob/master/_publication.json

[
  {
    "publication_number": "scpub1",
    "source_url": "https://github.com/suttacentral/bilara-data/tree/master/translation/en/sujato/kn/thag",
    "author_uid": "sujato-walton",
    "collaborator": [
      {
        "collaborator_uid": "sujato",
        "author_name": "Bhikkhu Sujato",
        "author_github_handle": "sujato"
      },
      {
        "collaborator_uid": "walton",
        "author_name": "Jessica Walton",
        "author_github_handle": ""
      }
    ],
    "text_uid": "thag",
    "text_title": "Verses of the Senior Monks",
    "is_published": "true",

This file has been truncated.

karl_lew · October 9, 2019, 11:22am

In the JSON-LD world, perhaps all the JSON we are doing together belongs in the shared context of suttacentral.net. For example, we could use:

"@context": "https://suttacentral.net/json-ld"

And a Bilara text file would simply be:

{
   "@context": "https://suttacentral.net/json-ld",
   "@type": "bilara-text",
   "dn33:0.1": "...",
   ...
}

By availing ourselves of a shared context that defines each specific document type (e.g., “bilara-text”), we can specify the semantics and structure of every single SC document consistently and thoroughly. And we can do so with full json-ld compatibility and with complete fidelity to SC guidelines.

karl_lew · October 9, 2019, 3:47pm

As I look into bilara-data schema, I have come to realize that we may be misunderstanding JSON. The JSON specification clearly states:

An object is an unordered set of name/value pairs.

Therefore the following are equivalent:

{"a": 1, "b": 2}
{"b": 2, "a": 1}

Where this manifests in bilara-data is an implicit assumption that text segments are ordered. They are certainly NOT ordered by JSON semantics. They are ordered in a way that is hidden from JSON. Note that SCID comparison is quite subtle:

an1.2:2.3 is less than an1.10:0.1 (different than string compare)
dn33:1.2.31 is less than dn33:1.10.1 (different than string compare)
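
A comparison along these lines can be sketched by splitting each id into runs of digits and non-digits and comparing the digit runs as numbers. This is a simplified illustration, not the code used in any SC project; the function names are made up here, and ranges or letter suffixes in ids are not given any special treatment.

```javascript
// Split an id such as "dn33:1.2.31" into alternating text and number tokens:
// ["dn", 33, ":", 1, ".", 2, ".", 31]
function tokenize(scid) {
  const parts = scid.match(/\d+|\D+/g) || [];
  return parts.map((t) => (/^\d/.test(t) ? Number(t) : t));
}

// Return a negative number if a sorts before b, positive if after, 0 if equal.
function compareScid(a, b) {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const n = Math.min(ta.length, tb.length);
  for (let i = 0; i < n; i++) {
    const x = ta[i];
    const y = tb[i];
    if (x === y) continue;
    // Digit runs compare as numbers, so 2 sorts before 10.
    if (typeof x === "number" && typeof y === "number") return x - y;
    // Text or mixed tokens fall back to plain string comparison.
    return String(x) < String(y) ? -1 : 1;
  }
  // One id is a prefix of the other: the shorter one sorts first.
  return ta.length - tb.length;
}

const ids = ["an1.10:0.1", "an1.2:2.3", "an1.1:0.1"];
console.log(ids.slice().sort(compareScid)); // numeric-aware order
console.log(ids.slice().sort());            // plain string order, for contrast
```

With the default string sort, "an1.10:0.1" lands before "an1.2:2.3"; with `compareScid` it lands after, as the two examples above require.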

What this means is that any program that reads bilara-text files as JSON can only treat the segments as unordered. Indeed, I have had to write special code to restore SCID order when reading bilara-data translations as JSON. It takes about 50 lines of JavaScript to order SCIDs properly. The array is the only JSON structure that preserves order. To store bilara text as proper JSON, we would need to do something like:

[
  "an1.1:0.1=Numbered Discourses 1 ",
  "an1.1:0.2=1. Sights, Etc. ",
  ...
]

or more verbosely as JSON-LD:

[
   {
      "@id": "an1.1:0.1",
      "text": "Numbered Discourses 1 "
   },
   {
      "@id": "an1.1:0.2",
      "text": "1. Sights, Etc. "
   }
]

In other words, the current Bilara text file format is not really JSON. Nor is it JSON-LD. I’m actually fine with the current Bilara text file format and just wanted others to understand the implications of our schema. It is not standard. But it is practical.
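
For anyone who wants the array form without changing the stored files, the conversion can be sketched in a few lines. Note the assumption it leans on: JavaScript hands back non-integer string keys such as "an1.1:0.1" in insertion order, which is a property of JavaScript objects and not something JSON itself promises. The function name is made up for this example.

```javascript
// Convert a segment object, as parsed from a Bilara text file, into the
// array-of-objects form shown above. Relies on JavaScript (not JSON) keeping
// non-integer string keys in insertion order.
function toSegmentArray(segments) {
  return Object.entries(segments).map(([id, text]) => ({ "@id": id, text }));
}

const parsed = JSON.parse(
  '{"an1.1:0.1": "Numbered Discourses 1 ", "an1.1:0.2": "1. Sights, Etc. "}'
);
const asArray = toSegmentArray(parsed);
console.log(JSON.stringify(asArray, null, 2));
```

A reader in another language, or one that does not keep key order, would still need to sort by segment id after parsing.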

sujato · October 9, 2019, 10:34pm

This sounds great, but I really don’t understand the issues well enough to know whether it is worth doing.

Ahh, yes, I recommended that we use arrays for this exact reason. We would get correct JSON semantics at the expense of a very minor increase in verbosity. However I was overridden by our developers at the time! @blake, any thoughts?

Looking closer at the JSON-LD spec, it is too complicated for me to do; there are too many concepts I am unfamiliar with. If anyone wants to take it up, that would be great! Currently the publications data is unused, so there is no problem in changing it however you wish.

karl_lew · October 10, 2019, 12:54am

After more reading, my enthusiasm is somewhat tempered. JSON-LD seems to be primarily a data-exchange format. I.e., it is intended for the exchange of information about common objects (e.g., cars, contacts) amongst a large community having historically idiosyncratic conventions. It basically is a language and methodology for creating an understandable superset of all those conventions. So I’d only recommend JSON-LD for those SC documents that warrant such verbose treatment. For example, it might make sense to use JSON-LD for representing information about a translator, but I wouldn’t use JSON-LD for representing a sutta segment.

I agree. Let’s only use JSON-LD where it provides value to us and others. I’m not going to use it in Voice, but I think I might use JSON-LD in another project that uses common objects.

sujato · October 10, 2019, 6:14am

Thanks for that.

We do have a long-term goal to use RDF, probably best implemented as JSON-LD, for metadata. But since it is a somewhat complex task, we’d like to assign an experienced dedicated volunteer to do it systematically. So far it hasn’t happened, but if it does we can revisit these ideas.

milkii · October 2, 2020, 11:19pm

If anyone is interested in learning about the semantic web in general, there is a very long and good YouTube tutorial playlist, from the basics to advanced: Knowledge Engineering with Semantic Web Technologies by Dr. Harald Sack, made before JSON-LD came along.

Thinking also about transclusion and Ted Nelson …