Bilara i/o

sujato · September 20, 2019, 10:14pm

For all geeks and devs out there, we just rolled out a cool feature for Bilara. You can now export the data from Bilara in a range of file formats, and re-import it again!

This is a pre-pre-alpha feature. It will break and destroy everything! Use only if you do not value your own life or that of your loved ones!!!

Only kind of kidding! Actually, in my very limited testing it seems to work fine, but best be cautious. I’m sharing it with you in the hope that someone might want to test it, play around, and find some bugs.

This is not a user-facing feature. It is developed for two main use cases:

Internal SC work, if we need to make bulk or complex changes to the bilara-data
Consumption of bilara-data in external apps

How does it work? First, clone bilara-data:

Go to the .scripts folder. Change the python version to 3.7.2. (Other versions may work if you have a different version installed.) Run something like:

pip3 install -r requirements.txt

Ready to go, let’s export dn1 as a Libreoffice spreadsheet!

./sheet_export.py dn1 dn1.ods

Edit it, save, and run:

./sheet_import.py dn1.ods

Et voila, your changes appear in the bilara data file.

You can easily do something like this, too:

./sheet_export.py dn dn.tsv --include root, translation+en

“Export the whole of DN as a tsv file, including only the root text and English translation”.

It uses Pyexcel under the hood, so you have a wide range of formats to choose from.

http://docs.pyexcel.org
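For anyone consuming an exported tsv from another program, here is a minimal sketch using only Python's standard `csv` module. The column names (`segment_id`, `root-pli-ms`, `translation-en-sujato`) and the sample rows are assumptions made up for illustration; check the header row of your own export before relying on them.

```python
import csv
import io

# Hypothetical sample of an exported tsv; real column names and text may differ.
SAMPLE_TSV = (
    "segment_id\troot-pli-ms\ttranslation-en-sujato\n"
    "dn1:0.1\tDīgha Nikāya 1 \tLong Discourses 1 \n"
    "dn1:0.2\tBrahmajālasutta \tThe Prime Net \n"
)


def read_rows(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


rows = read_rows(SAMPLE_TSV)
for row in rows:
    # Trailing spaces are stripped here for display only.
    print(row["segment_id"], "|", row["translation-en-sujato"].strip())
```

To read a real export, open the file with `open("dn.tsv", encoding="utf-8", newline="")` and pass the file object to `csv.DictReader` in place of the `io.StringIO` wrapper.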

@karl_lew @jared @anon31486827 @Vimala

karl_lew · September 22, 2019, 2:12pm

Bhante, this is quite fantastic! Thank you!

In a week or two I had planned to revisit the Voice sutta storage in order to support i18n. Indeed the structure that Anagarika Sabbamitta and I had settled on looks remarkably like the Bilara translation folder. We had needed separation by language/nikaya rather than nikaya/language, so the Bilara structure is actually perfect.

What this means is that Voice can use Bilara data directly for all its sutta needs. Content updates will then be exquisitely simple: git pull.
The Bilara data is so cleanly structured that Voice will just read the Bilara data directly:

JSON.parse(fs.readFileSync(suttapath).toString());

…?
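For comparison, here is a rough Python equivalent of that direct read. It assumes each Bilara file is a flat JSON object mapping segment IDs to strings; the file name, segment IDs, and text below are made up so the sketch can run without a bilara-data checkout.

```python
import json
import tempfile
from pathlib import Path


def load_segments(path):
    """Load one Bilara-style JSON file as a dict of segment ID -> text."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample content standing in for a real translation file.
sample = {"mn1:0.1": "Middle Discourses 1 ", "mn1:0.2": "The Root of All Things "}

with tempfile.TemporaryDirectory() as tmp:
    sutta_path = Path(tmp) / "mn1_translation-en-sujato.json"
    sutta_path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    segments = load_segments(sutta_path)

for segment_id, text in segments.items():
    print(segment_id, text)
```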

As I scan the Bilara repository, I notice a small omission. I can’t find the blurbs that Voice needs to display. Will the blurbs also live here?

The ending of defilements comes only when the truth is seen. But seeing the truth comes about due to a vital condition. In this way, twelve factors leading to freedom are united with the twelve factors leading to suffering.

sujato · September 22, 2019, 11:20pm

No worries!

Excellent! I spoke with @michaelh about this yesterday, and he will be using the bilara data too.

Eventually, yes. They are already in JSON, so there won’t be any change in the format, but the repo will shift to bilara. We’ll do this at some point in the future when we want to start translating them.

kora · September 25, 2019, 4:03pm

I followed the steps and made them into a Colab notebook.

https://colab.research.google.com/drive/1-dGdBJmSF-3O7_64fEGQx66OrT1T0v7a

It may not be very useful, but anyone can play with the notebook.

sujato · September 25, 2019, 9:58pm

Thanks! Have you managed to run it yourself?

kora · September 25, 2019, 11:48pm

Yes, I ran them too, although I didn’t change the ods; I just exported it and imported it back.

It may also be useful for some analysis and visualization, but I haven’t done that yet.

karl_lew · October 28, 2019, 4:07pm

@Blake, @Sujato, @HongDa

One very peculiar thing that Anagarika and I noticed about bilara-data is the inclusion of trailing space throughout. Semantically, such spaces provide nothing and are actually little landmines that explode in odd ways. I would actually recommend removing those spaces, especially since there seems to be a convention of exactly one space at the end of each line. If we are indeed writing code that expects exactly one space there, then we should rewrite any such code to be robust and space agnostic. These spaces become a bit of a nightmare as we all start to use bilara-data in different ways.

sujato · November 7, 2019, 10:20pm

The convention is that each segment ends with exactly one space, except when the final character in the segment is an em-dash (—).

So basically we have to either:

Keep the spaces, or
Ensure that user agents respect that rule.

Of course it’s not going to matter for an audio app, but for print or screen it does.
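A consumer that wants to be space agnostic while still respecting this rule could normalize each segment before joining. This is only a sketch of one possible approach, not code from Bilara, and the sample segments are made up.

```python
EM_DASH = "\u2014"


def normalize_segment(text):
    """Return text with exactly one trailing space, or none after a final em-dash."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith(EM_DASH):
        return stripped
    return stripped + " "


def join_segments(segments):
    """Join normalized segments and drop the space after the very last one."""
    return "".join(normalize_segment(s) for s in segments).rstrip()


# Hypothetical segments: no space, two spaces, a final em-dash, one space.
sample = ["So I have heard.", "At one time  ", "he said\u2014", "and left. "]
print(join_segments(sample))
```

Because the function strips before re-adding the space, it gives the same result whether or not the stored data keeps its trailing spaces.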

sabbamitta · November 8, 2019, 8:10am

In German typography we use en-dashes surrounded by spaces – instead of an em-dash – so that rule wouldn’t apply for German. Rather, after an en-dash, it would be the same space at the end as everywhere else.

Snowbird · March 17, 2022, 6:49am

Has there been a change affecting the sheet_export.py script?

Now when I run this:

sheet_export.py dn dn.tsv --include root, translation+en

I get the following:

usage: sheet_export.py [-h] [--include INCLUDE] [--exclude EXCLUDE] uid [uid ...] out
sheet_export.py: error: unrecognized arguments: translation+en

The only thing I’m doing differently from before is that I seem to have to run the command in the \.scripts\bilara-io folder instead of just the \.scripts\ folder.

Running without the translation+en is successful, however the translation is not included.

chaz · March 17, 2022, 7:43am

Don’t put a space between the arguments passed to the include flag.

--include root,translation+en
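To see why the space matters: the shell splits on it, so the script receives `root,` and `translation+en` as two separate arguments. Here is a small sketch with an argparse parser rebuilt from the usage message above. It is an approximation, and the real script may define its arguments differently.

```python
import argparse


def build_parser():
    # Reconstructed from the usage message in the thread; the real script may differ.
    parser = argparse.ArgumentParser(prog="sheet_export.py")
    parser.add_argument("--include")
    parser.add_argument("--exclude")
    parser.add_argument("uid", nargs="+")
    parser.add_argument("out")
    return parser


parser = build_parser()

# With a space, "root," is taken as the --include value and the rest is left over.
with_space, leftover = parser.parse_known_args(
    ["dn", "dn.tsv", "--include", "root,", "translation+en"]
)
print(with_space.include, leftover)

# Without the space, the whole comma-separated list arrives as one value.
without_space = parser.parse_args(["dn", "dn.tsv", "--include", "root,translation+en"])
print(without_space.include.split(","))
```

`parse_known_args` is used in the first call only so the leftover argument can be inspected; plain `parse_args` would exit with the "unrecognized arguments" error instead.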

Snowbird · March 17, 2022, 8:09am

That fixed it. So strange. I know it worked with the space previously because I would just copy and paste from instructions I had written.

Thanks!!!

Snowbird · May 31, 2022, 2:22am

I’m getting a new error when I try to generate a file.

Running this:

sheet_export.py mn mn-pl-en.tsv --include root,translation+en,html

gives me this:

Reading html\pli\ms\sutta\mn\mn1_html.json
Reading html\pli\ms\sutta\mn\mn2_html.json

... [etc]

Reading html\pli\ms\sutta\mn\mn151_html.json
Reading html\pli\ms\sutta\mn\mn152_html.json
C:\GitHub\bilara-data\html\pra\pts\sutta\pdhp\pdhp1-13.json
Traceback (most recent call last):
  File "C:\GitHub\bilara-data\.scripts\bilara-io\sheet_export.py", line 52, in <module>
    rows = get_data(repo_dir, uids, include_filter, exclude_filter)
  File "C:\GitHub\bilara-data\.scripts\bilara-io\get_data.py", line 70, in get_data
    uid, muids_string = file.stem.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Bhante @Sujato, any ideas? It seems odd that it is looking for C:\GitHub\bilara-data\html\pra\pts\sutta\pdhp\pdhp1-13.json. That’s a Prakrit Dhammapada, eh?

sujato · May 31, 2022, 10:43pm

Tagging @blake in on this one.

blake · June 2, 2022, 5:51am

The export script performs a quick inspection of the file system to figure out which files to include in the export. In this case it found malformed filenames and aborted, since it couldn’t make sense of them.
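To illustrate the failure: the traceback shows `file.stem.split('_')` being unpacked into two names, and a stem like `pdhp1-13` has no underscore, so the split yields only one value. Here is a sketch of a more tolerant parse. `split_stem` is a hypothetical helper and not necessarily how the script was actually fixed.

```python
from pathlib import Path


def split_stem(path):
    """Return (uid, muids) for names like 'mn1_html.json', or None if malformed."""
    # partition never raises; anything after the first underscore goes to muids.
    uid, sep, muids = Path(path).stem.partition("_")
    if not sep or not uid or not muids:
        return None
    return uid, muids


names = ["mn1_html.json", "mn152_html.json", "pdhp1-13.json"]
parsed = {name: split_stem(name) for name in names}

# Collect the names that could not be parsed so they can be reported or skipped.
malformed = [name for name, result in parsed.items() if result is None]
print(malformed)
```

Whether a malformed name should be skipped with a warning or should stop the export is a design choice; this sketch only shows how to detect it.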

blake · June 2, 2022, 6:23am

This issue should be resolved now.

Snowbird · June 2, 2022, 7:03am

It’s working now. Thanks!