Would there be interest in a "export suttas as spreadsheet" web service?

sujato · January 25, 2022, 10:31pm

One of the things I really want to support is the idea that our data—mainly the translations—can be very portable and used by different people in different ways.

Now, one of the features we’ve built to do this is called “bilara i/o”. Essentially, it takes all of the relevant data for a given text or collection, and exports it as a tsv file. (Other formats are supported if needed.) Tsv stands for “tab separated values”; it’s just plain text with columns separated by tabs. It can be natively opened in any spreadsheet or text editor.

Then you have a spreadsheet with all the data for, let’s say, the Majjhima Nikaya. Pali text, different translations, markup, variant readings, references, comments, and so on. You can then filter out what you don’t want, and organize the rest how you like. You could use it for various things:

personal study
a website
EPUBs or PDFs for a specific purpose
class materials
and so on

Bilara i/o is awesome, but it does require a degree of setup. If we are losing you at “clone the repo” then you’re probably going to struggle!

So I’m wondering if there would be interest in offering this as a push-button web service?

Essentially, you would just enter the text or collection you want, then press go, and it would produce a tsv file. We could fancy it up a bit, but that’s the basic idea.

If you want to know what it would look like, here’s a tsv file for dn.

dn.zip (1016.4 KB)

chaz · January 25, 2022, 10:56pm

Well there is now my repo that kiiiind of does this.

I needed the data on demand as a backbone for projects that I envision for the future, so I put that together in the hope that it may benefit others as well. Check it out.

It’s still well in development, so happy to accommodate any changes if others are going to use it too.

Snowbird · January 26, 2022, 7:24am

Since, as it is, one has to install Python to get it to work, I’d say there’s no question that having a web service (which I assume is just a fancy word for a web page, not an api or something like that) that could generate these files would open up your data to a much larger pool of people. You were very kind to walk me through all the steps to get it working on my device, but I doubt if I could do it all over again without assistance.

There is the added issue of cross platform comparability. Remember I had problems getting it to work on Windows. So an online version would solve that problem.

I’ve always felt that the lack of ability to view page source was a huge blow for open data. So really at the moment it is only open in theory since you need to have a level of technical skill that most people don’t have. Of course if you have those skills, then the data is not only open but almost instantly usable in many different ways.

That’s my long winded way of giving a big "yes!"

stu · January 26, 2022, 8:31am

Bilara i/o is on my list of things to do, but there’s many things in front of that, and it keeps getting pushed back. I haven’t got a use case for this yet, but it is already beginning to stir some thoughts, so another "yes!" from me please.

sujato · January 26, 2022, 11:07pm

Nice! TIL what is Rda!

True. Although you can use the Inspect tool for the same thing. Not as nice tho.

Ok!

Hmm, I’m wondering whether the easiest way to do this would be to add it as a Github Action. We already run an extensive series of tests whenever a text is “published” in Bilara data. We could simply add bilara i/o to this, and export the top-level divisions (dn, mn, etc.) to a folder in Github.

That way there’s no UI to develop. A user just clones the repo or downloads it, boom! It’s kept automatically up to date. The only cost would be the time to run the GA, which is an issue. But if it’s added to an already-existing GA, and there are no new dependencies, it might not be too bad.

If you want something more specialized—say, to export a particular subset of data—you can still just set up bilara i/o and use it locally. In practice, though, it’s probably just as easy to start with the full set and just delete what you don’t need.

666tomanderson · January 27, 2022, 1:40am

I tried to set up Bilara i/o and even with help from @Snowbird I couldn’t manage it.

It’s great that you are turning your attention to exporting collections at the click of a button and giving users their choice of formats

I downloaded your sample DN and it presents some challenges such as formatting the Pali text. This may be included in the “we could fancy it up a bit”

dntsv

For the majority of users with little or no knowledge of regex or even Excel, they would probably give up at this point like I did (and I know my way around Excel).

I’m not sure what the proposed “web service” would look like, but for me it would be wonderful to see a web interface which precludes the need to “clone the repo” and works the magic behind the scenes so that I could choose what source material and language I wanted (All the suttas, a single book such as DN, a single book such as DN with Pali side by side with option line by line or paragraph by paragraph, Sutta Nipata, Vinaya, checkboxes to include variant readings, paragraph numbers etc.) Once I’ve selected the source material it would be wonderful to select the output: Web page(s), PDF, ePub, Text, .tsv etc.

In the meantime, I’d be very happy with up to date versions of the translations in Epub format as that format provides everything that I need for now and the links to the ePub downloads have been broken for yonks.

sujato · January 27, 2022, 6:10am

Well for a start you’re not using a Unicode font. It’s 2022, not 1997.

This is two rather different things. Our “publications” format will export text in a variety of usable formats, including standalone HTML, EPUB, and PDF. This is intended for end users.

What I am proposing here is for developers. It is for raw data, and it assumes a degree of technical know-how. Basically, the more assumptions you make, the more complex the app becomes, the more bugs it will have, and the more marginal the end-use cases become. We’re not going to be able to cater for every possible outcome, but we might be able to do a few useful things.

Snowbird · January 27, 2022, 9:48am

It probably has nothing to do with not using a Unicode font. If they weren’t using a Unicode font, either their OS would substitute or it would show tofu. I imagine it is because of selecting the wrong encoding format when importing it into excell/libreoffice.

You can see the first option is to select the character set. Must be UTF-8

sujato · January 27, 2022, 10:05pm

Okay, but the point remains:

How is utf-8 not default? (It’s a rhetorical question, don’t worry!)

Subharo · January 28, 2022, 2:32am

The popular geek thing to do these days is to have it in .json format, for programmatic manipulation. I’m not a spreadsheet guy. Just saying.

Snowbird · March 29, 2023, 9:38pm

Since this post, there has been the announcement of the SC Editions project:

Now raw html versions are available.

And I have created a web app that takes the json files and builds html files that gives you some flexibility in manipulating the headings as well as offers Pali-English versions.