Perfect! I sent an edit request.
By the way, I developed something of a pipeline (in R) to convert SuttaCentral’s JSON data into a data frame (so that the data can be copy-pasted into a Google spreadsheet). I will add the author_edition data as soon as I can edit.
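install.packages("RJSONIO")  # only needed once
library(RJSONIO)

# Read the author_edition records from the sc-data repository
json_data <- fromJSON("https://raw.githubusercontent.com/suttacentral/sc-data/master/additional-info/author_edition.json", encoding = "UTF-8")

# Replace NULL fields with NA so that unlist() does not drop them
json_data <- lapply(json_data, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

# Stack the records into a data frame, one row per record
df <- as.data.frame(do.call("rbind", json_data))
View(df)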
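One thing to watch for with this approach: `rbind` on plain vectors recycles the shorter ones, so if a record were missing a field entirely, its values could end up in the wrong columns. Below is a minimal sketch of a name-aligned variant. The two sample records and their field names are made up for illustration (they are not actual sc-data entries), `records_to_df` is a hypothetical helper, and it assumes every field holds a single value.

```r
# Hypothetical sample records standing in for the parsed JSON;
# the second one lacks the "long_name" field on purpose.
records <- list(
  list(uid = "a", short_name = "A", long_name = "Author A"),
  list(uid = "b", short_name = NULL)
)

# Hypothetical helper: align every record on the union of field names,
# filling NULL or absent fields with NA.
records_to_df <- function(records) {
  fields <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(x) {
    vapply(fields, function(f) {
      if (is.null(x[[f]])) NA_character_ else as.character(x[[f]])
    }, character(1))
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

df_aligned <- records_to_df(records)
```

If all records already share the same fields, this should give the same table as the shorter version above.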
“in the case of collaborative translations, each author is also listed independently”
I recently learned about tidy data from Wickham (2014): https://vita.had.co.nz/papers/tidy-data.pdf.
“This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”
It seems like there might be some value in distinguishing singular slugs from compound slugs (that is, treating them as different observational units). Fields like DoB and DoD are not relevant for a compound slug (they are only relevant for the associated individual translators). In the current author_edition dataset, the information for multiple authors is put in one cell (which makes the data messy), e.g.
33 | author | aung-rhysdavids | Aung … | Shwe Zan Aung, C.A.F. Rhys Davids
I think this could be prevented by having one table which lists the associated authors for a compound slug, e.g.
1 | aung-rhysdavids | aung | rhysdavids
and then another table (based on the Google spreadsheet) with the author data, e.g.
1 | aung | Shwe Zan Aung | pli | en | . . .
2 | rhysdavids | C.A.F. Rhys Davids | pli | en | . . .
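A rough sketch of how the two tables could work together in R, using only the example rows above. The column names (`slug`, `name`, `compound_slug`, `author_slug`) are my own placeholders, and I have assumed a long link table (one row per compound slug and member) instead of one column per co-author, so that any number of co-authors fits without adding columns.

```r
# Author table: one row per individual translator.
authors <- data.frame(
  slug = c("aung", "rhysdavids"),
  name = c("Shwe Zan Aung", "C.A.F. Rhys Davids"),
  stringsAsFactors = FALSE
)

# Link table: one row per (compound slug, member author) pair.
compound_members <- data.frame(
  compound_slug = c("aung-rhysdavids", "aung-rhysdavids"),
  author_slug = c("aung", "rhysdavids"),
  stringsAsFactors = FALSE
)

# Look up the individual authors behind each compound slug.
joined <- merge(compound_members, authors,
                by.x = "author_slug", by.y = "slug")

# Rebuild the combined display name only when it is needed.
combined <- aggregate(name ~ compound_slug, data = joined,
                      FUN = paste, collapse = ", ")
```

With this layout, fields like DoB and DoD would live only in the author table, and the combined name for a compound slug is derived from it instead of being stored in a single cell.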