Volunteer wanted! Help collect author/translator information

Well, the first of these things is already in the files above. But what we should make sure of is that, in the case of collaborative translations, each author is also listed independently. So we should have:

sujato-walton
sujato
walton

Okay, well that would be great.

I’m sure there will be a lot of constraints in the source material: some things hard to find or impossible, different kinds of information for each author, and so on. We shall probably have to proceed by bearing in mind:

  1. What we want, what is useful for us
  2. What is actually available
  3. What is idiomatic in something like RDF

There are degrees of structure. We could, at the loosest, simply dump each bio in a field. Or at the other extreme, every individual piece of information could be assigned to a key/value pair. We’ll probably end up somewhere in the middle: important and standard bits of information (DoB, name, etc.) end up as specific data fields, more general things have a general “biography” field.
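
As a rough illustration of that middle ground, here is a minimal Python sketch. The field names (`uid`, `name`, `born`, `died`, `biography`, `sources`) are assumptions made for the example, not an agreed schema, and the sample values are placeholders.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional


@dataclass
class AuthorRecord:
    """Hypothetical record: a few specific fields plus one free-text bio."""
    uid: str                        # e.g. "sujato"
    name: str                       # display name
    born: Optional[int] = None      # year of birth, if it can be found
    died: Optional[int] = None      # year of death, if applicable
    biography: str = ""             # short free-text bio for everything else
    sources: List[str] = field(default_factory=list)  # where the info came from


def to_json(record: AuthorRecord) -> str:
    """Serialise a record as a JSON object; missing values become null."""
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)


example = AuthorRecord(
    uid="example-author",           # placeholder, not a real SuttaCentral uid
    name="Example Author",
    born=1900,
    biography="Placeholder bio text.",
)
print(to_json(example))
```

Standard items get their own keys, and anything that does not fit goes into `biography`, so hard-to-find information can simply be left empty.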

2 Likes

Hi Bhante @sujato and @Robbie
Great. I don’t have a good sense of where to begin re: structure, key/value pairs, etc., but I’m sure I can learn whatever system we decide upon. I’m fine with Google Sheets for now.

What you write about structure makes sense, Bhante. Sounds like we want to gather info that includes but also extends beyond what Ven. @Vimala posted above (i.e. DOB and death). Perhaps we want to know what primary regions a person taught/wrote in? Will we need to compile a short bio blurb as well as a longer version?

Bhante @sujato do you have an example of a really good author info page/entry we can use as our ideal benchmark?

2 Likes

Great, the only thing we need do for now is to ensure that the right data is entered in the right column in the spreadsheet. But we don’t need to go nuts, the main thing is to have a nicely written short bio. Edit for clarity and spelling, etc.

Sure, a place would be good.

No, keep the bios somewhat short, and link to a longer version (like Wikipedia) if necessary. The length of the bio should be comparable to the length of the “blurbs” that accompany each sutta: a tweet, not an article.

I’m not sure, but you can have a look at this Tibetan site:

https://www.tbrc.org/#!persons

Here is an example from the Internet Archive:

We are going to find that the hard part will be tracking down the info. This will be especially hard for languages other than English. So we’re going to need to put on our detective hats and do some sleuthing. At the end of the day, anything is better than nothing.


I have started a Google sheet. @robbie, you’ll need a gmail address to sign in and edit.

2 Likes

Great Bhante @sujato, that all sounds clear.

Just a few more questions… is this the complete list of authors we should compile info on? sc-data/author_edition.json at master · suttacentral/sc-data · GitHub

and in the above github link, is the short_name the same as what we’ll use for the slug?

In terms of our own process, and SC norms and preferences: is it best for @robbie and me to communicate on here/some other SC channel/platform, or is personal email ok?

2 Likes

Yes.

No, the one you want is called uid. (The “short name” is used in buttons and similar contexts where space is at a premium. Actually I think that due to changes earlier this year we may not even use the short name at all.)

Note that author_edition includes both “authors” (= translators mainly) and also editions, which applies to root texts in Pali, etc. We are primarily looking for author info at the moment, but any edition info that you find would be great, too. Maybe keep a separate spreadsheet for that, if you come up with anything.
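
To keep author and edition entries apart, something like the following Python sketch might help. It assumes each entry carries a `type` key with values such as `author` or `edition`; the key name, the `long_name` key, and the second sample entry are assumptions for illustration and have not been checked against the real file.

```python
from typing import Dict, List

# Made-up sample shaped like entries in author_edition.json (assumed keys).
sample: List[Dict[str, str]] = [
    {"type": "author", "uid": "aung-rhysdavids",
     "long_name": "Shwe Zan Aung, C.A.F. Rhys Davids"},
    {"type": "edition", "uid": "example-edition",   # hypothetical edition uid
     "long_name": "Example Edition"},
]


def split_by_type(entries: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group entries by their 'type' value; entries without one go under 'unknown'."""
    groups: Dict[str, List[Dict[str, str]]] = {}
    for entry in entries:
        groups.setdefault(entry.get("type", "unknown"), []).append(entry)
    return groups


groups = split_by_type(sample)
for kind, entries in groups.items():
    print(kind, [entry["uid"] for entry in entries])
```

Each group could then be pasted into its own spreadsheet.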

3 Likes

Perfect! I sent an edit request.

By the way, I developed something of a pipeline (in R) to convert SuttaCentral’s JSON data into a data frame (so that the data can be copy-pasted into a Google spreadsheet). I will add the author_edition data as soon as I can edit.

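# One-time setup: install the JSON parser
install.packages("RJSONIO")
library(RJSONIO)
# Read and parse the author/edition list from the sc-data repository
json_data <- fromJSON("https://raw.githubusercontent.com/suttacentral/sc-data/master/additional-info/author_edition.json", encoding="UTF-8")
# Replace NULL fields with NA so that unlist() does not silently drop them
json_data <- lapply(json_data, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})
# Stack the records into a single data frame, one row per entry
df <- as.data.frame(do.call("rbind", json_data))
View(df)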
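
For anyone not using R, roughly the same flattening could be sketched in Python with only the standard library. This is not the pipeline above, just an equivalent under stated assumptions: it works on a made-up inline sample rather than fetching the file, the key names are guesses, and it emits tab-separated text that can be pasted into a spreadsheet.

```python
import csv
import io
from typing import Dict, List, Optional


def to_tsv(records: List[Dict[str, Optional[str]]]) -> str:
    """Flatten a list of JSON-like records into tab-separated text."""
    # Collect every key in first-seen order so all rows share the same columns.
    columns: List[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            restval="NA", lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Write "NA" for null values, much as the R snippet replaces NULL with NA.
        writer.writerow({k: ("NA" if v is None else v) for k, v in record.items()})
    return buffer.getvalue()


# Made-up sample; the key names are assumptions about the file's structure.
sample = [
    {"type": "author", "uid": "aung-rhysdavids",
     "long_name": "Shwe Zan Aung, C.A.F. Rhys Davids"},
    {"type": "author", "uid": "walton", "long_name": None},
]
print(to_tsv(sample))
```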

in the case of collaborative translations, each author is also listed independently

I recently learned about tidy data from Wickham (2014): https://vita.had.co.nz/papers/tidy-data.pdf.

This paper tackles a small, but important, component of data cleaning: data tidying.
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table.

It seems like there might be some value in distinguishing singular slugs from compound slugs (that is, considering them different observational units). Fields like DoB and DoD are not relevant for a compound slug (they are only relevant for the associated individual translators). In the current author_edition dataset the information of multiple authors is put in one cell (which makes the data messy), e.g.

33 | author | aung-rhysdavids | Aung … | Shwe Zan Aung, C.A.F. Rhys Davids

I think this could be prevented by having one table which lists the associated authors for a compound slug, e.g.

1 | aung-rhysdavids | aung | rhysdavids

and then another table (based on the Google spreadsheet) with the author data, e.g.

1 | aung | Shwe Zan Aung | pli | en | . .  .
2 | rhysdavids | C.A.F. Rhys Davids | pli | en | . . .
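
To show how the two tables would work together, here is a small Python sketch that resolves a compound slug to its individual author rows. The dictionary layout and the `lang_from`/`lang_to` key names are assumptions for illustration; the values come from the example rows above.

```python
from typing import Dict, List

# Table 1: compound slug -> individual author slugs
collaborators: Dict[str, List[str]] = {
    "aung-rhysdavids": ["aung", "rhysdavids"],
}

# Table 2: one row per individual author (key names are assumed)
authors: Dict[str, Dict[str, str]] = {
    "aung": {"name": "Shwe Zan Aung", "lang_from": "pli", "lang_to": "en"},
    "rhysdavids": {"name": "C.A.F. Rhys Davids", "lang_from": "pli", "lang_to": "en"},
}


def authors_for(slug: str) -> List[Dict[str, str]]:
    """Return the author rows behind a slug, whether singular or compound."""
    member_slugs = collaborators.get(slug, [slug])
    return [authors[s] for s in member_slugs if s in authors]


print([row["name"] for row in authors_for("aung-rhysdavids")])
print([row["name"] for row in authors_for("aung")])
```

Fields such as DoB and DoD then live only on the individual rows, and the compound slug carries nothing but the links.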
2 Likes

@Robbie I don’t follow 100% but it sounds great :slight_smile:

If I understand, you will populate the “slug name” and “long name(s)” columns, and identify/populate whatever other info we are pulling from the current author_edition dataset.

Once that’s completed, will you tag me in this thread? Then I can start helping with the research.

1 Like

And I’ve just okayed it.

Okay so that’s awesome. Note that Bilara i/o offers this natively, but the relevant data is not yet on Bilara, so I’m not sure how exactly it would work.

Indeed, that sounds great. We should, in fact, end up with distinct JSON files:

translator.json
edition.json
translation_collaborators.json

Translation collaborators should have only slug, short name, and long name. (Sometimes the long name is not inferrable from the individual names, eg. “T.W. & C.A.F. Rhys Davids”)
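
A minimal Python sketch of what one such entry might look like, and why the long name is stored rather than derived. The key spellings, the slug, and the short name are hypothetical placeholders; only the long name is taken from the example above.

```python
import json
from typing import Dict

# Assumed key spellings for an entry in translation_collaborators.json
ALLOWED_KEYS = ("slug", "short_name", "long_name")


def collaborator_entry(slug: str, short_name: str, long_name: str) -> Dict[str, str]:
    """Build an entry holding only the three agreed fields."""
    return {"slug": slug, "short_name": short_name, "long_name": long_name}


entry = collaborator_entry(
    slug="example-slug",            # hypothetical placeholder, not a real slug
    short_name="Example",           # hypothetical placeholder
    long_name="T.W. & C.A.F. Rhys Davids",
)

# Joining the individual names mechanically gives a different string,
# which is why the long name is stored explicitly.
naive = " & ".join(["T.W. Rhys Davids", "C.A.F. Rhys Davids"])

print(json.dumps([entry], ensure_ascii=False, indent=2))
print(naive == entry["long_name"])
```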

Of course this will only affect a few cases.

2 Likes

@sgns I have imported the data from the author_edition dataset! I created new UIDs (“collaborator UIDs”) for individual authors who previously only appeared in a compound UID (e.g. walton for Jessica Walton). All new UIDs are marked with an asterisk * in the A-column. (I have backed up this table, so there’s no risk of data loss. I also have made a table which lists the collaborator UIDs for each compound UID.)
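
For reference, the step of finding individuals who appear only inside a compound UID could be sketched in Python as below. This is an illustrative reconstruction under assumptions, not the procedure actually used on the spreadsheet; the function name and data layout are made up.

```python
from typing import Dict, List, Tuple


def missing_collaborator_uids(existing_uids: List[str],
                              collaborators: Dict[str, List[str]]) -> List[str]:
    """List member UIDs that have no entry of their own, in first-seen order."""
    known = set(existing_uids)
    missing: List[str] = []
    for members in collaborators.values():
        for uid in members:
            if uid not in known and uid not in missing:
                missing.append(uid)
    return missing


existing = ["sujato", "sujato-walton"]
collab = {"sujato-walton": ["sujato", "walton"]}

new_uids = missing_collaborator_uids(existing, collab)

# Mark newly created UIDs with an asterisk in the first column.
rows: List[Tuple[str, str]] = [
    ("*" if uid in new_uids else "", uid) for uid in existing + new_uids
]
for marker, uid in rows:
    print(f"{marker}\t{uid}")
```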

4 Likes

Great @Robbie
I will dig in soon!

2 Likes

Hi friends,

I just updated the spreadsheet with the links provided by Phra @Dhammanando
I will return to this in November…

Sorry to miss you and the Sutta Central work party in California, Bhante @sujato. I will be in Placerville this coming week, but leaving the day before you arrive! I hope your travels go well.

3 Likes

Sorry to miss you. Oh well, next time. Today spent with the group in Santa Barbara with meditation and suttas, and a sprinkling of mythology: just the way I like it.

2 Likes

Hi everybody,
Just bumping this in case another volunteer comes out of the woodwork :slight_smile:
I’ve been trying to enlist a couple folks.
@Robbie are you still up for this? And Venerable @Dhammanando might you help as well?

2 Likes

Also, just thinking here… The sheer enormity is a bit daunting!
But two shortcuts come to mind:

(1) Can we pull bios from the Access to Insight page? How do we need to credit that, just note in the source?

(2) Could we ask the authors to submit info on themselves? I.e. send out a BCC email to every author we have contact info for. After that’s been tried, we go out and search for the rest manually. What do you think Bhante @sujato?

1 Like

Indeed you can. However, as a rule, it is fine to extract data, but actual sentences may be subject to copyright, whether from AtI or elsewhere. So to be completely confident that we are free of copyright, it’s better to rewrite things. Nonetheless, it’s always best to record the sources.

Honestly, I hadn’t even thought of that, it’s certainly a possibility. But many of them are dead or otherwise untraceable, so perhaps better to work concurrently. @Aminah can help with the contacts.

2 Likes

Bhante @sujato, the Access to Insight bios are under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licensing. So, we can copy and paste without rewriting, provided we attribute properly, no? Will each bio have a space to note sources, licensing specs, etc.?

@Aminah that would be great to connect re: contacting authors. Send me a PM and we can work out the details?

2 Likes

Yes, technically you can, but all our information is under CC0, which means there is no legal requirement for attribution. When you mix material under different licenses, it gets complicated fast. You have to ensure that every system you build forever will always display the right license for the right thing. When there are hundreds of items, with different licenses, you can see how complex it will get. Much better to simply invest the time in the beginning, rewrite and adapt from the sources, and ensure that this is simply not an issue at all in the future.

2 Likes

Most definitely! My schedule is pretty much full until Dec 17, but I can work on it on Dec 18 and subsequent days. :slightly_smiling_face:

4 Likes

Finished lang-from (ISO), lang-to (ISO), and role! :smile: (For a few slugs I couldn’t find any data on SuttaCentral; I left those open.)

By the way, would someone with the requisite powers like to add Jessica Walton to the English Translator List?

5 Likes

Hey thanks, great to hear some good news on this.

Hmm, that list is auto-generated, so the best thing will be to wait until the new system is in place and make sure everyone is there.

2 Likes