Bilara publication model: some refinements

sujato · June 7, 2020, 12:02am

This suggestion follows on from previous discussion here:

That proposal envisages a UI on Bilara for managing publication. We are still some way off doing that, so I would like to propose a more minimal solution for now. This can be a stepping-stone towards a fuller solution.

Currently in _publication.json we have a Boolean field is_published. This creates a potential issue, as “published” status is also (in principle) indicated by the git branch.

More subtly, it confuses the perspective of the data. A translator should not decide when a text is published: that is for a publisher to decide, i.e. the apps that consume bilara-data. A translator, rather, should indicate when a translation is ready for publication. That should then trigger an action on behalf of an app.

In addition, we still do not have a clear model for representing incremental translations. A translator may work either by translating a whole text at a time and then publishing it (as I did), or by publishing as they go (as Sabbamitta is).

To keep the logic manageable, let us require that incremental translation is on a per-sutta basis. Thus you can either publish the “Anguttara Nikaya” or “AN 1.7”, but you can’t publish “the second vagga of the AN Twos”.

So my proposal is this. In bilara-data we have two fields. Please let me know if my JSON is incorrect!

is_ready_for_publication: true/false,
is_completed: "true/false/[list, of, completed, sutta, IDs]"
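On the JSON question: strict JSON needs quoted keys, and is_completed would more naturally be a boolean or an array than a string. A minimal sketch of that shape and a check for it, assuming those types; the helper name isValidEntry is made up for illustration and is not part of Bilara:

```javascript
// Hypothetical example entry; the field names come from the proposal above.
const exampleEntry = {
  "is_ready_for_publication": true,
  "is_completed": ["mn1", "mn2", "mn3"]
};

// Assumption: is_completed is either a boolean or an array of sutta ID strings.
function isValidEntry(entry) {
  const ready = entry.is_ready_for_publication;
  const completed = entry.is_completed;
  const completedOk =
    typeof completed === "boolean" ||
    (Array.isArray(completed) && completed.every((id) => typeof id === "string"));
  return typeof ready === "boolean" && completedOk;
}
```

Calling isValidEntry(exampleEntry) accepts the entry, while a string such as "true" in either field is rejected.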

Let us consider each combination of these.

is_ready_for_publication: false,
is_completed: false,

In this case, all translations are kept on the unpublished branch.

is_ready_for_publication: true,
is_completed: true,

In this case, once is_ready_for_publication is set to true, it sends a message to the developers on bilara-data (Blake and myself) that this translation project (usually a nikaya or similar) is ready to be promoted to the published branch. This will be handled manually for the time being, and possibly automated in due course. Subsequent edits will be to the published branch.

Apps should consume data from the published branch. This can be automated or manual; that is up to the app developer.

So much for the simple cases!

What happens when:

is_ready_for_publication: false,
is_completed: true,

In this case, nothing need be done. We wait for a signal from the translator that they are ready.

What about:

is_ready_for_publication: true,
is_completed: false,

This should trigger a UI change on Bilara. The translator sees a “Publish” button on each sutta in the translation project. When they click that button, the relevant ID is added to publication.json.

is_ready_for_publication: true,
is_completed: ["mn1", "mn2", "mn3"],

This sends a signal to bilara-data developers that the stated texts are ready to be promoted to the published branch.
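The combinations above can be read as a small decision table. A rough sketch, assuming is_completed is a boolean or an array of IDs; the function name and the action labels are invented for illustration only:

```javascript
// Sketch only: maps the two proposed fields to the outcome described in the post.
// The action names are hypothetical labels, not an existing API.
function publicationAction(entry) {
  const ready = entry.is_ready_for_publication;
  const completed = entry.is_completed;

  if (!ready) {
    // Not ready: nothing is promoted, whether or not the work is complete.
    return { action: "keep_unpublished", ids: [] };
  }
  if (completed === true) {
    // Whole project is ready: signal the developers to promote everything.
    return { action: "promote_project", ids: [] };
  }
  if (Array.isArray(completed) && completed.length > 0) {
    // Incremental publishing: only the listed texts are promoted.
    return { action: "promote_listed", ids: completed.slice() };
  }
  // Ready but nothing marked complete yet: Bilara shows per-sutta Publish buttons.
  return { action: "show_publish_buttons", ids: [] };
}
```

Both "not ready" cases collapse into the same outcome here, since in each of them nothing moves to the published branch.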

One disadvantage of this approach is that we will end up with thousands of sutta IDs in publication.json. I’m not sure if there is a more elegant solution to this, but this is the best I can think of. At least it is simple and clear!

I am envisaging that for the time being, publication.json is still hand-edited, except for this one function. But we should move to GUI editing sooner rather than later.

Slightly related: Bilara should indicate the publication status of each sutta.

karl_lew · June 7, 2020, 4:02am

The classical solution to this is the exception list. In other words:

MN is_ready_for_publication
Except MN3

If [mn1, mn2, mn3] are the only files present in bilara-data, then this is very economical for an app to understand: “publish MN1 and MN2 but not MN3”.

The idea of the exception list was the potential publication.json change Anagarika and I had discussed separately. It works well for Anagarika because she rarely has more than one or two suttas in progress and the exception list is usually just one sutta. We never got around to discussing what the JSON might be, but perhaps this might be the time to think about such things.

p.s., I like your suggestion is_ready_for_publication. It expresses the semantics of intention and matches exactly what Voice needs. Voice actually hasn’t had a need for a publication branch and works well with existing publication.json. Go ahead and add the new field but please leave the existing field intact since production Voice actively uses _publication.json. We can remove the is_published field after the upcoming Voice release.

sabbamitta · June 7, 2020, 6:44am

Thank you for this proposal, Bhante, it sounds very good to me.

A few thoughts on these points:

If I understand this correctly, I think the exception list concept won’t work well once all translation stubs for a project are created. You would have to add thousands of suttas as “exceptions” in the first place and then gradually remove them. You still have the thousands of suttas.

I have already been thinking about this point, and one possibility that seems doable would be to publish by book instead of by Nikaya for AN, SN, and KN. MN has only 152 suttas, which is still somewhat manageable, and DN has even fewer. But if we make AN1 one publication unit, then AN2, AN3, etc. (they have a few hundred suttas each, but not thousands), and SN1, etc., and for the KN we have Dhp, Snp, Ud, etc., the number of suttas per publication unit would be more manageable, and the number of publication units itself still wouldn’t get out of hand (as it would, for example, if we did it per vagga).

Once a book is completed, is_completed could simply be set to true, and all the ID numbers for that book would fall away. For example for the AN for Sabbamitta, currently AN1, AN2, and AN3 would just be true, and AN4 would have the numbers from 1–198. All other books have either very few suttas or none at all, so could be simply false. In this way only the book that is currently being mainly worked on would create a somewhat bigger number of Sutta IDs in the file. Once the book of the Fours is completed it would simply go to true, and the number of Sutta IDs in the Fives would start to grow.
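To make the per-book idea concrete, here is a rough sketch of what such data could look like and how it might be queried. The object layout, the sample IDs, and both helper names are assumptions for illustration, not the actual file format:

```javascript
// Hypothetical per-book status for one translator's AN project.
// true = whole book complete, false = nothing complete, array = listed suttas only.
const anBooks = {
  an1: true,
  an2: true,
  an3: true,
  an4: ["an4.1", "an4.2", "an4.3"], // illustrative IDs, not real status
  an5: false
};

function isCompleted(books, book, suttaId) {
  const status = books[book];
  if (status === true) return true;
  if (Array.isArray(status)) return status.includes(suttaId);
  return false; // false, or a book that is not listed at all
}

// Number of sutta IDs that have to be spelled out in the file.
function listedIdCount(books) {
  return Object.values(books)
    .filter(Array.isArray)
    .reduce((sum, ids) => sum + ids.length, 0);
}
```

Only the book currently in progress contributes IDs, so listedIdCount stays bounded by the size of one book rather than the whole nikaya.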

Would that provide a better overview and make things clearer?

sujato · June 7, 2020, 9:01am

Okay, that sounds good, let me think through the logic of it. We could implement an “exceptions” field instead of is_completed.

When fully published:

is_ready_for_publication: true
exceptions: none

Now, if there is incremental publishing, we might have:

is_ready_for_publication: true
exceptions: ["mn3"]

The logic for pushing files to published would then be:

If is_ready_for_publication: true
Then check what files exist in project.

For existing files:

If file is empty, leave as unpublished.
Else if file has content, check whether it is in exceptions.
- If it is in exceptions, leave as unpublished.
- if it is not in exceptions, promote to published.
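The promotion rule above could be sketched roughly as follows. The in-memory files object (text ID mapped to its segment strings) is a stand-in I am assuming for the real project files, and selectForPromotion is a made-up name:

```javascript
// Sketch of the promotion rule. `files` maps a text ID to an object of
// segment ID -> translated string (a hypothetical stand-in for files on disk).
function selectForPromotion(entry, files) {
  if (!entry.is_ready_for_publication) return [];
  // Assumption: anything that is not an array (null, "none") means no exceptions.
  const exceptions = Array.isArray(entry.exceptions) ? entry.exceptions : [];
  return Object.keys(files).filter((id) => {
    const segments = Object.values(files[id]);
    const hasContent = segments.some((s) => String(s).trim() !== "");
    // Empty files stay unpublished; so do files listed as exceptions.
    return hasContent && !exceptions.includes(id);
  });
}
```

A file with only blank strings is treated as empty here, which is one possible reading of "file is empty".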

There would also have to be additional logic at the Bilara side in order to populate the exceptions field:

If is_ready_for_publication: true AND not exceptions: none
Then display publish button on texts in project.

For texts in project:

If no translation strings have been entered, do nothing.
If translation strings have been entered AND publish has not been pressed, add text ID to exceptions.
If translation strings have been entered AND publish has been pressed, remove text ID from exceptions.
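The bookkeeping on the Bilara side might look something like this. The record shape { id, hasStrings, publishPressed } and the function name are assumptions for the sake of the sketch:

```javascript
// Sketch of how Bilara could keep the exceptions list up to date.
// `texts` is a hypothetical list of { id, hasStrings, publishPressed } records.
function updateExceptions(exceptions, texts) {
  const result = new Set(exceptions);
  for (const text of texts) {
    if (!text.hasStrings) continue; // nothing entered: do nothing
    if (text.publishPressed) {
      result.delete(text.id);       // published: no longer an exception
    } else {
      result.add(text.id);          // started but not yet published
    }
  }
  return Array.from(result);
}
```

Every text a translator has started but not explicitly published ends up in the list, which is where the brittleness comes from: the list changes as a side effect of ordinary editing.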

Does that seem right? Something like that anyway. It seems complicated and brittle, but maybe that is just my slow old mind. I guess at the end of the day there is a certain amount of complexity and that can either be dealt with by explicitly stating the content or by logically inferring it. Either way, the complexity doesn’t just vanish.

In favor of the “additive” approach, it makes it super-clear and explicit to anyone exactly what is published. You can just look at publication.json and either the whole project is published, or the listed texts. With exceptions, it’s a lot less clear. To work out what is actually published, you have to do:

files in project minus empty files minus exceptions = published texts

Perhaps we need to clarify exactly what publication.json is for. In my mind, it is the canonical authority of reference for all SuttaCentral publications. If anyone wants to know what we have published, when, or any other central details, it’s all there. I imagine this would be used by apps consuming bilara-data, but also it can be used as the data source to show what is published, or as a reference for researchers or interested parties, etc.

I’m thinking of, say, someone who is interested to make a print edition of an SC translation. Let’s say they want to publish the MN in French. They check publication.json and what does it tell them? It is “ready for publication”. Great! There’s just one sutta listed as an exception. Okay, not really a problem, that should be done soon. They check back a week later: a different sutta is listed as exception. Hmm.

Anyway, I think you see the point. The file should be clear, explicit, and human-readable.

Yes, unless you filter out the empty texts first, as I propose above. But that creates its own problems.

Hmm. In fact we already do this for KN, so it wouldn’t be hard to apply it to AN and SN too.

I like it, it would keep the simple logic and explicit data of adding things rather than listing exceptions, but would avoid letting the number of IDs grow out of control.

I’d like to hear from Karl and Blake on this idea.

Right, I thought of using “publishable”, but then I thought that for ESL speakers the semantics of “publish”, “published”, and “publishable” is probably a bit obscure.

Okay, no problems.

sabbamitta · June 7, 2020, 10:09am

My mind is still older and obviously slower too, for …

… such a possibility didn’t even occur to it!

Well, we’ll work out the best way of doing it.

karl_lew · June 7, 2020, 1:15pm

Yes. That suits Voice quite well.

We would use Javascript null or empty array [] instead of "none", but that’s a detail.

Given a choice between algorithmic and data complexity, keep the data as simple as possible. Algorithms are just consciousness hand waving, impermanent and unsatisfactory. Data should match exactly what people ask for and need. No more. No less.

In this way many apps can offer different views of the same data. Apps are highly ephemeral. Data is the deva essence that lasts longer (but also evaporates). Soon the FB pages of dead people will outnumber the FB pages for live people. FB itself changes daily (and annoyingly). One day FB itself goes into the dimension of infinite space and complexity vanishes on its own…giggles.

Exceptions are more precise, but if adding books works better for all, then that is programmable. Voice already deals with ranges, so it’s a special case of that existing code. The exception code is actually simpler, however, and is actually one line of code:
files = files.filter(f => !exceptions.includes(f))
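Since the field might arrive as null, an empty array, or a list of IDs, a slightly more defensive version of that filter could look like this. It is a sketch under the assumption that exceptions is meant to be an array of file IDs; both function names are invented:

```javascript
// Assumption: exceptions may be null, undefined, [] or the string "none";
// anything that is not an array is treated as "no exceptions".
function normalizeExceptions(value) {
  return Array.isArray(value) ? value : [];
}

function withoutExceptions(files, exceptions) {
  const skip = new Set(normalizeExceptions(exceptions));
  return files.filter((f) => !skip.has(f));
}
```

For example, withoutExceptions(["mn1", "mn2", "mn3"], ["mn3"]) keeps mn1 and mn2, and passing null keeps all three.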

sabbamitta · June 26, 2020, 7:12am

Have any decisions been made with regards to this?

The background of my question: Any new updates for Voice production depend on a publication model being established. We still have a few issues to work on for the current release—and if they are finished we have more issues for more releases—, but just so we can plan a bit …

sujato · June 27, 2020, 8:14am

I discussed it with Blake last week and it is the number 1 priority for Bilara right now; hopefully there will be some concrete steps very soon. But the basic implication for Voice, as for all apps including SC itself, is that we should consume data from the published branch, so you can proceed with that understanding.

sabbamitta · June 27, 2020, 8:26am

Yes, thank you. Good news that this is coming soon!

karl_lew · June 27, 2020, 1:24pm

Voice uses _publication.json to display only what is deemed to be published.

blake · June 30, 2020, 6:47pm

I want to outline my plan for implementing publication for SuttaCentral.

Fundamentally this would be based on three branches:

published is the main branch which people would see on github and clone by default. (“stable” in software terms)
unpublished is the working branch which is in constant flux. (“unstable” in software terms - currently master)
ready_to_publish is a staging branch

In the Browse view there will be a button that translators can click which stages translations from unpublished to ready_to_publish. This can be done at the level of an individual translation or a subtree (for example, you could stage sn, sn1 or sn1.1). This merely records the translator’s intent.

When a translation is staged, it is copied into ready_to_publish. From the ready_to_publish branch files can be copied to the published branch by administrators/“publishers” using an interface which is to be determined. This might involve a Review view.

It is important to note here that these will use a file copy mechanism rather than a branch merge; the branches will have independent histories that do not share commits. The alternative would be to cherry-pick all the commits which pertain to that file, but we (the end users) aren’t really interested in the minutiae of the translation history for a particular file; the published branch commit history records just the major events in the publication history of the file. Of course the detailed translation history is preserved in the unpublished branch. Note that in a way this is nearly identical to a git squash workflow, where a bunch of original commits are squashed together into a brand new commit that only appears in the master branch, causing the branches to have independent commit histories.
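As a rough illustration of the file copy idea (not the actual Bilara implementation), staging could amount to copying a file or subtree from one branch checkout directory into another. The directory layout and the function name are assumptions:

```javascript
const fs = require("fs");
const path = require("path");

// Sketch: copy one file or a whole subtree (e.g. "sn" or "sn/sn1") from one
// branch checkout to another. Paths and layout are hypothetical.
function copyBetweenCheckouts(fromCheckout, toCheckout, relPath) {
  const src = path.join(fromCheckout, relPath);
  const dest = path.join(toCheckout, relPath);
  // Make sure the parent directory exists in the target checkout.
  fs.mkdirSync(path.dirname(dest), { recursive: true });
  // Overwrites whatever is in the target; no git merge is involved.
  fs.cpSync(src, dest, { recursive: true, force: true });
  return dest;
}
```

After the copy, the target checkout would be committed and pushed as its own commit, which is why the histories of the branches never need to share commits. Note that fs.cpSync needs a reasonably recent Node.js (16.7 or later).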

As a practical matter, ready_to_publish and published should be treated as strictly read only for any user other than the machine user sc-translatatron, that is to say ALL changes should propagate from unpublished, to ready_to_publish, to published and that is the only sort of action sc-translatatron will perform. We do not want fixes made directly to the ready_to_publish branch, because they would then be clobbered by any future updates from the unpublished branch. Hence Github Permissions should be used to lock down these branches absolutely for all users except sc-translatatron.
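The one-way flow can be stated as a tiny check. This is only a sketch of the rule as described; the constant and function names are invented:

```javascript
// Order of branches as described above; changes only ever move one step along.
const PIPELINE = ["unpublished", "ready_to_publish", "published"];

function canPropagate(from, to) {
  const i = PIPELINE.indexOf(from);
  // Allowed only if `from` is known and `to` is the very next branch.
  return i !== -1 && PIPELINE.indexOf(to) === i + 1;
}
```

So unpublished to ready_to_publish is allowed, while skipping a stage or moving backwards is not.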

Also, any pull requests from a branch or fork should only ever be made into bilara-data/unpublished.

The Justifications:

First is that end user applications can simply clone the repository; they do not need to parse and understand _publication.json to know what is published and what is not, only to get the attribution and licensing details. If it is there, it is published. It would also be possible (and actually easy) to check out the bleeding edge translations, but that would be an extra step for users who know that is what they want.
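From an app's point of view, "if it is there, it is published" reduces to a file existence check against a clone of the published branch. A minimal sketch, where the checkout directory and relative path are whatever the app happens to use:

```javascript
const fs = require("fs");
const path = require("path");

// Sketch: with a checkout of the published branch, presence of the file is
// the whole test. `checkoutDir` and `relPath` are supplied by the caller.
function isPublished(checkoutDir, relPath) {
  return fs.existsSync(path.join(checkoutDir, relPath));
}
```

The publication file would then only be read for attribution and licensing details, not for status.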

Of course it also allows for a separate working space and review space, so a translation can exist in a published state, and also the next edition can be being prepared.

Avoiding a branch merge model is primarily to avoid merge conflicts and other funny business associated with merges. There can be no merge conflicts with a simple file copy. Earlier versions of Bilara used more branches, but this proved to be merge conflict nightmare land. Also commits don’t always map perfectly to files (i.e. a bunch of files might be added in a single commit).

Implementation Detail

As an implementation detail, Bilara server will actually maintain a separate checkout of each branch on the filesystem, and will push each branch separately to Github. This contrasts with the alternative which would be to checkout and push on the fly, using file locks to prevent the repository being modified until the operation is complete. This alternative is also a buggy nightmare, and checkouts are fairly slow.

Separate checkouts are fast and efficient (especially with special git options to share state between git repo instances).

sujato · June 30, 2020, 11:10pm

Okay, thanks Blake, that looks clear.

One question I have: what happens when a text is published and a translator makes changes? They edit the unpublished branch on Bilara, then what? Do we have to review each typo that a translator fixes?

Probably better to say, once it is in published, review is not necessary, let it go upstream automatically. Or make it optional.

Generally, my idea would be that, since we cannot personally review every translation, the aim of the “publisher” review would be to ascertain on a case-by-case basis that the project is handled well, texts have been proofread, and we are satisfied that the work is of sufficient standard for SC. We might make a checklist (proofreading, punctuation, spellcheck, etc.). Once a text has been accepted as published, I think it is generally okay to assume that further changes will be acceptable. Usually it will be simple corrections, standardization, and so on.

sabbamitta · July 1, 2020, 7:28am

How do you plan to handle the proofreading? I find it difficult to find a proofreader. There is one person who would be willing to read two or three Suttas for me, but they don’t have the time to do more.

I have nevertheless decided to publish my translation at this stage in Voice. There will be many years to go until the canon is completed, and I am hoping to find someone to help with a thorough proofreading over time. It should be understood that the translations are work in progress.

Would that be something SC can live with?

sujato · July 1, 2020, 7:38am

Sure. The thing is, we know you, and we know that you are very careful and thorough.

Having said which, it would be fantastic to get some proofers for you! I had four proofreaders, and I still find mistakes.

sabbamitta · July 1, 2020, 8:46am

I keep looking around.

The problem is perhaps, my standards are too high to easily accept someone as proofreader.

When I went to school I was told that newspapers are a reliable medium to find correct German language, with proper spelling and grammar and a good style. Unfortunately these times are long gone …

blake · July 1, 2020, 7:47pm

One approach to this would be to allow a user to “self-publish” revisions, like they hit the button once to send it to ready_to_publish, and a second time to publish it (but only if it’s already published), basically the review step becomes optional at the discretion of the translator.

sujato · July 1, 2020, 10:31pm

Having skilled helpers is invaluable; but don’t underestimate the value of unskilled readers; after all, that will constitute the bulk of your readership. An amateur might notice things that a professional misses.

I dunno, it sounds funky. How about:

If text is unpublished, click Ready to publish

If text is published, click Publish changes
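In code terms the two labels might be chosen like this. The targets for "Publish changes" are my guess based on the self-publish suggestion and are not a settled design; the function name is made up:

```javascript
// Sketch of the two-label idea. The branch targets are an assumption,
// not something that has been decided.
function publishButton(textIsPublished) {
  return textIsPublished
    ? { label: "Publish changes", targets: ["ready_to_publish", "published"] }
    : { label: "Ready to publish", targets: ["ready_to_publish"] };
}
```

That way the translator only ever sees one button, and the review step applies to first publication only.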

sabbamitta · July 2, 2020, 9:36am

And then it goes both to “ready_to_publish” and to “published”?

karl_lew · July 3, 2020, 2:35pm

Please don’t rename master. It would disrupt a lot of code. Transitions take time.

sujato · July 4, 2020, 12:19am

I think so. But let’s wait and see what Blake comes up with.

Master is going to be renamed, I’m afraid. This proposal was made in January, and will happen sooner rather than later; I am hoping next week. If this affects your code, I would encourage you to begin refactoring to published. If there is anything we can do to smooth the transition for you, let us know, but to be clear, all apps will be asked to pull data from published only.