Volunteer wanted! Help collect author/translator information

sgns · September 23, 2019, 2:43am

Great Bhante @sujato, that all sounds clear.

Just a few more questions… is this the complete list of authors we should compile info on? sc-data/author_edition.json at master · suttacentral/sc-data · GitHub

And in the GitHub link above, is the short_name the same as what we’ll use for the slug?

In terms of our own process, and SC norms and preferences: is it best for @robbie and me to communicate on here/some other SC channel/platform, or is personal email ok?

sujato · September 23, 2019, 7:13am

Yes.

No, the one you want is called uid. (The “short name” is used in buttons and similar contexts where space is at a premium. Actually I think that due to changes earlier this year we may not even use the short name at all.)

Note that author_edition includes both “authors” (= translators mainly) and also editions, which applies to root texts in Pali, etc. We are primarily looking for author info at the moment, but any edition info that you find would be great, too. Maybe keep a separate spreadsheet for that, if you come up with anything.

Robbie · September 23, 2019, 8:41am

Perfect! I sent an edit request.

By the way, I developed something of a pipeline (in R) to convert SuttaCentral’s JSON data into a data frame (so that the data can be copy-pasted into a Google spreadsheet). I will add the author_edition data as soon as I can edit.

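# Install RJSONIO only if it is missing, then load it
if (!requireNamespace("RJSONIO", quietly = TRUE)) install.packages("RJSONIO")
library(RJSONIO)
# Read the author/edition records straight from the sc-data repository
json_data <- fromJSON("https://raw.githubusercontent.com/suttacentral/sc-data/master/additional-info/author_edition.json", encoding = "UTF-8")
# Replace NULL fields with NA so that unlist() does not silently drop them
json_data <- lapply(json_data, function(x) {
  x[vapply(x, is.null, logical(1))] <- NA
  unlist(x)
})
# Stack the records into one data frame, one row per record
df <- as.data.frame(do.call(rbind, json_data), stringsAsFactors = FALSE)
View(df)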
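
For anyone who wants to try the conversion without network access, here is a minimal sketch in base R of the same NULL-to-NA step on two inline records. The records and their field names (type, uid, short_name, long_name) are assumptions made for illustration, not the actual contents of author_edition.json. Unlike the snippet above, it also aligns fields by name, so a record that lacks a field still lands in the right column.

```r
# Hypothetical records standing in for parsed JSON entries (assumed field names).
records <- list(
  list(type = "author", uid = "aung", short_name = NULL,
       long_name = "Shwe Zan Aung"),
  list(type = "author", uid = "rhysdavids",
       long_name = "C.A.F. Rhys Davids")
)

# Collect every field name that occurs in any record.
fields <- unique(unlist(lapply(records, names)))

# Build one named character vector per record, using NA for NULL or absent fields.
rows <- lapply(records, function(rec) {
  vapply(fields, function(f) {
    value <- rec[[f]]
    if (is.null(value)) NA_character_ else as.character(value)
  }, character(1))
})

# Stack the vectors into a data frame with one row per record.
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(df)
```

Looking fields up by name rather than by position is a design choice here: it keeps the columns aligned even if the records do not all carry the same set of fields.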

in the case of collaborative translations, each author is also listed independently

I recently learned about tidy data from Wickham (2014): https://vita.had.co.nz/papers/tidy-data.pdf.

This paper tackles a small, but important, component of data cleaning: data tidying.
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table.

It seems like there might be some value in distinguishing singular slugs from compound slugs (that is, considering them different observational units). Fields like DoB and DoD are not relevant for a compound slug (they are only relevant for the associated individual translators). In the current author_edition dataset the information of multiple authors is put in one cell (which makes the data messy), e.g.

33 | author | aung-rhysdavids | Aung … | Shwe Zan Aung, C.A.F. Rhys Davids

I think this could be prevented by having one table which lists the associated authors for a compound slug, e.g.

1 | aung-rhysdavids | aung | rhysdavids

and then another table (based on the Google spreadsheet) with the author data, e.g.

1 | aung | Shwe Zan Aung | pli | en | . . .
2 | rhysdavids | C.A.F. Rhys Davids | pli | en | . . .
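
A rough sketch of how the two tables could work together, in base R. The column names (compound_uid, author_uid, uid, long_name, lang_from, lang_to) are assumptions for illustration, not an agreed schema, and the collaborator table is written in long form (one row per compound slug and author pair) rather than the wide form shown above.

```r
# Table 1: which individual authors belong to which compound slug
# (hypothetical column names, one row per pairing).
collaborators <- data.frame(
  compound_uid = c("aung-rhysdavids", "aung-rhysdavids"),
  author_uid   = c("aung", "rhysdavids"),
  stringsAsFactors = FALSE
)

# Table 2: one row per individual author, holding the per-person fields.
authors <- data.frame(
  uid       = c("aung", "rhysdavids"),
  long_name = c("Shwe Zan Aung", "C.A.F. Rhys Davids"),
  lang_from = c("pli", "pli"),
  lang_to   = c("en", "en"),
  stringsAsFactors = FALSE
)

# Join the tables to recover the individual records behind a compound slug.
joined <- merge(collaborators, authors, by.x = "author_uid", by.y = "uid")

# Collapse the names back into one display string per compound slug.
combined <- aggregate(long_name ~ compound_uid, data = joined,
                      FUN = function(x) paste(x, collapse = ", "))
print(combined)
```

With this layout, person-level fields such as DoB and DoD would live only in the author table, and the multi-author cell can be derived when needed instead of being stored.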

sgns · September 25, 2019, 12:44am

@Robbie I don’t follow 100%, but it sounds great.

If I understand, you will populate the “slug name” and “long name(s)” columns, and identify/populate whatever other info we are pulling from the current author_edition dataset.

Once that’s completed, will you tag me in this thread? Then I can start helping with the research.

sujato · September 25, 2019, 8:35am

And I’ve just okayed it.

Okay so that’s awesome. Note that Bilara i/o offers this natively, but the relevant data is not yet on Bilara, so I’m not sure how exactly it would work.

Indeed, that sounds great. We should, in fact, end up with distinct JSON files:

translator.json
edition.json
translation_collaborators.json

Translation collaborators should have only slug, short name, and long name. (Sometimes the long name is not inferable from the individual names, e.g. “T.W. & C.A.F. Rhys Davids”.)

Of course this will only affect a few cases.
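
The three-way split described above might look something like the following in base R. This is only a sketch: the type values ("author", "edition"), the column names, the short_name values, and the "example-edition" row are assumptions for illustration, and actually writing the JSON files is left out because it would need a JSON package.

```r
# Hypothetical combined table, one row per entry (assumed columns and values).
entries <- data.frame(
  type       = c("author", "author", "author", "edition"),
  uid        = c("aung", "rhysdavids", "aung-rhysdavids", "example-edition"),
  short_name = c("Aung", "Rhys Davids", "Aung & Rhys Davids", "Example"),
  long_name  = c("Shwe Zan Aung", "C.A.F. Rhys Davids",
                 "Shwe Zan Aung, C.A.F. Rhys Davids", "Example Edition"),
  stringsAsFactors = FALSE
)

# Compound slugs are listed explicitly rather than guessed from hyphens,
# since a hyphen in a uid need not imply a collaboration.
compound_uids <- c("aung-rhysdavids")

is_edition  <- entries$type == "edition"
is_compound <- entries$uid %in% compound_uids

# One data frame per target file; collaborators keep only the three name fields.
files <- list(
  "translator.json" = entries[!is_edition & !is_compound, ],
  "edition.json" = entries[is_edition, ],
  "translation_collaborators.json" =
    entries[is_compound, c("uid", "short_name", "long_name")]
)
print(lapply(files, nrow))
```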

Robbie · September 26, 2019, 6:57pm

@sgns I have imported the data from the author_edition dataset! I created new UIDs (“collaborator UIDs”) for individual authors who previously only appeared in a compound UID (e.g. walton for Jessica Walton). All new UIDs are marked with an asterisk * in the A-column. (I have backed up this table, so there’s no risk of data loss. I also have made a table which lists the collaborator UIDs for each compound UID.)
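
For reference, the step of finding individuals who only appear inside a compound UID can be sketched in base R roughly as follows. The tables here are made up for illustration: "coauthor" and "coauthor-walton" are hypothetical placeholders, and the column names are assumptions, not the spreadsheet's actual headers.

```r
# Hypothetical lookup of compound UIDs and their individual collaborators.
collaborators <- data.frame(
  compound_uid = c("aung-rhysdavids", "aung-rhysdavids",
                   "coauthor-walton", "coauthor-walton"),
  author_uid   = c("aung", "rhysdavids", "coauthor", "walton"),
  stringsAsFactors = FALSE
)

# UIDs already present in the imported dataset (hypothetical).
existing_uids <- c("aung", "rhysdavids", "coauthor",
                   "aung-rhysdavids", "coauthor-walton")

# Individuals that occur only inside a compound UID need a new UID of their own.
new_uids <- setdiff(unique(collaborators$author_uid), existing_uids)

# Build the sheet, flagging newly created UIDs with an asterisk.
all_uids <- c(existing_uids, new_uids)
sheet <- data.frame(
  flag = ifelse(all_uids %in% new_uids, "*", ""),
  uid  = all_uids,
  stringsAsFactors = FALSE
)
print(sheet)
```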

sgns · September 28, 2019, 4:45am

Great @Robbie
I will dig in soon!

sgns · October 19, 2019, 8:33pm

Hi friends,

I just updated the spreadsheet with the links provided by Phra @Dhammanando
I will return to this in November…

Sorry to miss you and the Sutta Central work party in California, Bhante @sujato. I will be in Placerville this coming week, but leaving the day before you arrive! I hope your travels go well.

sujato · October 20, 2019, 12:06am

Sorry to miss you. Oh well, next time. Today spent with the group in Santa Barbara with meditation and suttas, and a sprinkling of mythology: just the way I like it.

sgns · December 4, 2019, 1:46am

Hi everybody,
Just bumping this in case another volunteer comes out of the woodwork
I’ve been trying to enlist a couple folks.
@Robbie are you still up for this? And Venerable @Dhammanando might you help as well?

sgns · December 4, 2019, 1:51am

Also, just thinking here… The sheer enormity is a bit daunting!
But two shortcuts come to mind:

(1) Can we pull bios from the Access to Insight page? How do we need to credit that, just note in the source?

(2) Could we ask the authors to submit info on themselves? I.e. send out a BCC email to every author we have contact info for. After that’s been tried, we go out and search for the rest manually. What do you think Bhante @sujato?

sujato · December 4, 2019, 4:50am

Indeed you can. However, as a rule, it is fine to extract data, but actual sentences may be subject to copyright, whether from AtI or elsewhere. So to be completely confident that we are free of copyright, it’s better to rewrite things. Nonetheless, it’s always best to record the sources.

Honestly, I hadn’t even thought of that, it’s certainly a possibility. But many of them are dead or otherwise untraceable, so perhaps better to work concurrently. @Aminah can help with the contacts.

sgns · December 4, 2019, 6:19am

Bhante @sujato, the Access to Insight bios are under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licensing. So, we can copy and paste without re-writing, provided we attribute properly, no? Would each bio have a space to note sources, licensing specs, etc.?

@Aminah that would be great to connect re: contacting authors. Send me a PM and we can work out the details?

sujato · December 4, 2019, 8:09am

Yes, technically you can, but all our information is under CC0, which means there is no legal requirement for attribution. When you mix material under different licenses, it gets complicated fast. You have to ensure that every system you build forever will always display the right license for the right thing. When there are hundreds of items, with different licenses, you can see how complex it will get. Much better to simply invest the time in the beginning, rewrite and adapt from the sources, and ensure that this is simply not an issue at all in the future.

Robbie · December 5, 2019, 4:22pm

Most definitely! My schedule is pretty much full until Dec 17, but I can work on it on Dec 18 and subsequent days.

Robbie · December 22, 2019, 4:26pm

Finished lang-from (ISO), lang-to (ISO), and role! (For a few slugs I couldn’t find any data on SuttaCentral; I left those open.)

By the way, would someone with the requisite powers like to add Jessica Walton to the English Translator List?

sujato · December 23, 2019, 8:49am

Hey thanks, great to hear some good news on this.

Hmm, that list is auto-generated, so the best thing will be to wait until the new system is in place and make sure everyone is there.