SuttaCentral

Bilara-data folder structure proposal


#1

As we move forward to adopt bilara-data as the core data structure for SuttaCentral, we’ve come across some inconsistencies and opportunities that should be discussed, especially with regard to supporting multiple root sources and multilingual translations of those sources. To understand the challenges, let’s take a look at where the Anguttara Nikaya (AN) files are stored today in bilara-data.

Today, bilara-data is structured somewhat inconsistently as follows:

  • comment/an/…
  • html/an/…
  • reference/an/…
  • root/pli/ms/an/…
  • translation/en/sujato/an/…
  • variant/pli/ms/an/…

Automation thrives on consistency, and that is part of the motivation for proposing a new structure for bilara-data. Engagement with different root sources is a further motive for restructuring:

  • comment/pli/ms/sutta/an/…
  • html/pli/ms/sutta/an/…
  • reference/pli/ms/sutta/an/…
  • root/pli/ms/sutta/an/…
  • translation/en/sujato/sutta/an/…
  • variant/pli/ms/sutta/an/…

Specifically, two changes are proposed for consistency and breadth:

  1. introduce the use of language and authoring-entity folders throughout
  2. introduce folders for sutta, vinaya, abhidhamma
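To make the two changes concrete, here is a minimal sketch of how an old path could be rewritten into the proposed layout. It is not existing bilara-data tooling: the function name `toNewPath`, the lookup table, and the example file name are hypothetical, and the default `pli/ms` edition for comment, html and reference is simply read off the example paths above.

```javascript
// Hypothetical lookup: which top-level folder (sutta, vinaya, abhidhamma)
// a collection belongs to. Only illustrative entries are listed.
const PITAKA_BY_COLLECTION = {
  an: 'sutta',
  kv: 'abhidhamma', // assumption for illustration
};

// Types that have no language/author folders in the old layout.
const TYPES_WITHOUT_EDITION = new Set(['comment', 'html', 'reference']);

// Edition folders inserted for those types, following the proposal's
// examples (e.g. comment/pli/ms/sutta/an/...).
const DEFAULT_EDITION = ['pli', 'ms'];

function toNewPath(oldPath) {
  const [type, ...rest] = oldPath.split('/');
  let edition;
  let tail;
  if (TYPES_WITHOUT_EDITION.has(type)) {
    edition = DEFAULT_EDITION;
    tail = rest;
  } else {
    // root, translation and variant already carry language/author.
    edition = rest.slice(0, 2);
    tail = rest.slice(2);
  }
  const collection = tail[0];
  const pitaka = PITAKA_BY_COLLECTION[collection];
  if (!pitaka) {
    throw new Error(`No folder known for collection "${collection}"`);
  }
  return [type, ...edition, pitaka, ...tail].join('/');
}

console.log(toNewPath('comment/an/example.json'));
console.log(toNewPath('translation/en/sujato/an/example.json'));
```

A lookup table keeps the sutta/vinaya/abhidhamma decision in one place, so an unknown collection fails loudly instead of being filed in the wrong folder.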

ATTN: @blake, @sujato, @brahmali, @sabbamitta, @michaelh, @Aminah, @hongda + any others who should be included

Please share your thoughts on the implications of this proposal with considerations of value and cost. This is a wiki post, so feel free to edit our emerging shared consensus here.

:pray:


#2

Some things to bear in mind here, as per our discussion:

  • Some kinds of things are intrinsically bound to an edition, others are not.
  • In the first group are, obviously, translations, variants, comments, and root.
  • In the second are references and HTML (usually).
  • applications can nevertheless consume data as they wish. For example an Italian translation may want to display English comments (because they haven’t been translated)
  • HTML should by default be applied universally, but bear in mind it MAY be useful to supply different sets of HTML (for example, a language may have different conventions for paragraphing dialogue). But this is not something to implement as yet.
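The point about applications consuming data as they wish can be sketched as a simple language fallback. This is only an illustration under assumptions: `pickComments`, the shape of the `available` index, and the segment id used as a key are all hypothetical and not part of Bilara.

```javascript
// Pick the comment set to show for a reader, falling back through a
// list of preferred languages when none exist in the first choice.
// `available` is a hypothetical index: language code -> comment map.
function pickComments(available, preferredLangs) {
  for (const lang of preferredLangs) {
    const comments = available[lang];
    if (comments && Object.keys(comments).length > 0) {
      return { lang, comments };
    }
  }
  // Nothing found in any preferred language.
  return { lang: null, comments: {} };
}

// Only English comments exist, so an Italian reader falls back to them.
const available = {
  en: { 'an1.1:1.1': 'An example English comment.' },
};
const result = pickComments(available, ['it', 'en']);
console.log(result.lang); // en
```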

#3

Here’s a proposal for the rollout of the new structure:

  1. Karl adds new folders and complete new imports (e.g., abhidhamma)
  2. Karl copies (not moves) existing content into sutta and vinaya
  3. @Blake and Karl change all code to use new folder structure (this will take a while)
  4. Karl updates new folder structure with any remaining updated content from old folder structure
  5. Karl removes old folder structure when software is done

Ajahn Brahmali and Anagarika @Sabbamitta will be affected by steps 2 and 3, since there is some ambiguity about where their edited content goes while two copies of every sutta exist. We can probably manage that separately according to codebase. Ajahn Brahmali can coordinate with Bhante Sujato and Blake. I’ll work with Anagarika Sabbamitta. Aminah can update content in the new or old folder structure as required. Step 4 ensures that all changes will be migrated into the new folder structure.

This rollout will not be instantaneous. Adding the new folder structure first and delaying removal of the old folder structure gives us some flexibility with timing and execution at the cost of some ambiguity due to the existence of multiple copies.
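Step 4 amounts to finding old-structure files whose copy in the new structure is missing or out of date. A rough sketch of that check, assuming modification times are already collected per path, might look like this. The function name, the data shapes, and the path mapper passed in are all hypothetical.

```javascript
// Given modification times (ms since epoch) keyed by path for both
// layouts, list old files whose new-layout copy is missing or older.
// `toNewPath` is supplied by the caller and maps old paths to new ones.
function findStaleCopies(oldFiles, newFiles, toNewPath) {
  const stale = [];
  for (const [oldPath, oldMtime] of Object.entries(oldFiles)) {
    const newPath = toNewPath(oldPath);
    const newMtime = newFiles[newPath];
    if (newMtime === undefined || newMtime < oldMtime) {
      stale.push({ from: oldPath, to: newPath });
    }
  }
  return stale;
}

// Illustrative data only: one file edited in the old structure after
// the copy was made, one file that is up to date.
const oldFiles = {
  'translation/en/sujato/an/a.json': 200,
  'translation/en/sujato/an/b.json': 100,
};
const newFiles = {
  'translation/en/sujato/sutta/an/a.json': 150,
  'translation/en/sujato/sutta/an/b.json': 150,
};
const addSutta = (p) => p.replace('/an/', '/sutta/an/');
const stale = findStaleCopies(oldFiles, newFiles, addSutta);
console.log(stale.map((s) => s.from));
```

Comparing timestamps is the simplest option; comparing file contents or git history would be more reliable if edits can land in either copy.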


#4

Thanks for taking good care of us translators! :grin: :pray:


#5

We are currently executing the first step of the rollout plan in the import-pali branch of bilara-data. Notice that this branch currently has a mixture of old and new folders. In the html folder, the new structure is here.

Once import-pali is completed, we’ll merge that branch into master and proceed to step 2.


#6

Hey Karl, thanks so much for your continued incredible work!

With respect to this,

I have to confess to taking the encouragement to let go of things much more seriously these days, and in all honesty I’ve no clue about the work that’s been undertaken here! :grin:

More or less only serving as legacy text caretaker these days, I’m delightfully privileged to not even try to understand all this fancy pants stuff, so I mostly don’t follow a conversation unless it’s obviously relevant to the legacy texts.

With respect to your proposal for bilara-data, what implications do you envisage for sc-data/html_text or anything else that I might need to take care of?


#7

No implications at all, which is a relief. We are only changing bilara-data. Thanks for the update!


#8

Status update 2/7: we’re slightly over 2/3 done with the import of Bhante’s Pali documents into bilara-data. That’s almost 10,000 new files. :open_mouth:
The good news is that there are at most 5,000 more new files to import. :grimacing:

Once the import is completed, I’ll start copying the existing files into the new folder structure. This will allow us to update all software dependencies. After @blake and I update all software dependencies, I’ll remove the files in the old bilara-data folder structure. That’s about 24,000 files.


#9

Status update 2/12: Bhante and I are wrapping up our work on the “import-pali” branch. I will merge that branch into master when we are done.

Today I created the sutta-vinaya-abhidhamma branch, which will have all current work moved to the new folders and can be used to test any software changes. Please review this branch. Even now as I look at it, questions are emerging:

  • Should blurbs really be in root/en/blurb? The blurb structure and file names don’t quite match the new folder structure. :see_no_evil: Bhante @sujato?

#10

Okay great, checking out sutta-vinaya-abhidhamma now. I’m not sure exactly what I should be seeing.

  • The pli-tv folder is still misnamed; if we are going to do this it must be vinaya.
  • In root everything seems in place.
  • In translation there are bunches of empty files.
  • In reference and html not everything is where it should be: the old files are still there, and pli-vi is outside pli.
  • variant has similar problems.
  • comment lacks sutta

I think it’s okay? The blurbs do not, in fact, match the structure of the suttas. Each sutta, for example, is a separate file. But the blurbs are all in one file per collection or whatever. And there are blurbs for things that are not suttas, e.g. nikayas or whatever. So I think it can stay as-is.

At some point we will also import the static pages to Bilara (Home page, essays, and the like). But we don’t need to worry about that right now!


#11

Thanks, Bhante. I have updated the sutta-vinaya-abhidhamma branch of bilara-data and completed the proposed folder restructure.

I also added a new script (.script/rm-blank-translations.js) that removes blank translations. The script removed 3000+ files.
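For readers wondering what counts as a blank translation, here is a minimal sketch of one possible test. It is not the code in .script/rm-blank-translations.js; it assumes a translation file is a JSON object mapping segment ids to strings, and the function name `isBlankTranslation` is hypothetical.

```javascript
// Returns true when a translation file contains no translated text.
// Assumption: the file is a JSON object of segment id -> string.
// An empty object is also treated as blank.
function isBlankTranslation(jsonText) {
  const segments = JSON.parse(jsonText);
  return Object.values(segments).every(
    (text) => typeof text === 'string' && text.trim() === ''
  );
}

// Segment ids below are made up for illustration.
console.log(isBlankTranslation('{"x1:1.1": "", "x1:1.2": "  "}')); // true
console.log(isBlankTranslation('{"x1:1.1": "Some text"}'));        // false
```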


#12

I spoke with Blake about this last night. He is happy to proceed, and he thinks the impact on Bilara will not be great.

I’m checking the files now!


#13

Alright, so checking the folders only, not the file content.

  • Seem good: root, translation, reference, comment, variant
  • HTML: kv is duplicated in both sutta and abhidhamma (it belongs in abhidhamma)

I’m pushing a couple of minor changes to the metadata.


#14

Thank you, Bhante. This is a great relief.

I deleted sutta/kv since it was older.

I will now start updating Voice using the sutta-vinaya-abhidhamma branch.

NOTE: my head is exploding slightly and to simplify matters, I’ve merged in Anagarika @Sabbamitta’s German translations as well as @Kaz’s Japanese translations into the sutta-vinaya-abhidhamma branch. The Voice automated tests rely on this content. The merged content is in the new folder format.


#15

Okay, great. Now, I will check things more thoroughly, but we need some testing and so on for the next steps.

I don’t want to explode your head further, but on the 2-do, once all the Pali texts are present and accounted for:

  • run text-content diff of bilara pali texts vs. original mahasangiti files, to make sure we haven’t accidentally lost any texts. When to do this? It should definitely be done as a final thing, but maybe do interim tests as well?
  • Redo segment numbering of previously-segmented texts to ensure consistency across the whole corpus. This will mainly affect things like headings.
  • Merge in reference data from all_concordance.
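As a starting point for the first item, a text-content check could join the segment texts of a file and compare them with the original text after collapsing whitespace. This is a rough sketch under assumptions: the function names are hypothetical, the segment file is taken to be a JSON object of segment id to string, and extracting plain text from the original Mahasangiti files is left out entirely.

```javascript
// Collapse runs of whitespace so segmentation does not affect the
// comparison.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Join segment texts in key order and compare with the source text.
// Returns -1 when they match, otherwise the index of the first
// differing character in the normalized strings.
// Note: key order is insertion order for ids such as "x1:1.1"; keys
// that look like plain integers would be reordered by JavaScript.
function firstDifference(segments, originalText) {
  const joined = normalize(Object.values(segments).join(' '));
  const original = normalize(originalText);
  if (joined === original) return -1;
  const limit = Math.min(joined.length, original.length);
  for (let i = 0; i < limit; i++) {
    if (joined[i] !== original[i]) return i;
  }
  // One string is a prefix of the other: text is missing at the end.
  return limit;
}
```

Returning a position rather than a yes/no answer makes it easier to see where text went missing. Joining with a space assumes segment boundaries never fall inside a word.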

Shall we have a skype to chat about these?


#16

We should have issues to track each of these until I can tackle them. Right now Voice is on the operating table and I’m performing a bit of brain surgery, moving slowly and carefully. After that, I’ll need to sit down with Anagarika Sabbamitta to address our delayed release. We’ve delayed all our releases substantially to accommodate the bilara-data imports and restructuring. If we have one issue for each of the above then we can slowly integrate progress across all the issues. Each one of the issues will require a chat to understand what needs to be done in detail.

BTW, I just chatted with Anagarika Sabbamitta. We both recommend strongly that these three issues should be done AFTER the merge of sutta-vinaya-abhidhamma into master. The bilara-data folder structure divergence should be minimized and the merge should be completed as soon as @Blake and I are ready. I plan on being done with Voice software changes next week. Blake, let us know if you can’t complete your own software changes by next week. Beyond that software change, there’s a fair amount of testing that will need to be done in staging as well.


#17

Update 2/16: scv-bilara v1.2.0 is now available. The scv-bilara library is the search engine used by Voice. The v1.2.0 version automatically detects bilara-data folder structure and is happy with both new and old folder structures. Starting tomorrow I’ll integrate the latest scv-bilara into Voice itself for testing in our staging environment next week. Once that testing is completed Voice will be ready for merging sutta-vinaya-abhidhamma branch into bilara-data master.
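For anyone curious how folder structure detection can work in principle, here is a minimal sketch. It is not the scv-bilara implementation or its API; the function name `detectLayout` and the rule it uses (a sutta, vinaya or abhidhamma folder directly under root/language/author) are assumptions for illustration.

```javascript
// Folder names that only appear in the new structure.
const PITAKAS = new Set(['sutta', 'vinaya', 'abhidhamma']);

// Guess the layout from a list of repository-relative paths.
// New structure: root/<lang>/<author>/<sutta|vinaya|abhidhamma>/...
// Old structure: root/<lang>/<author>/<collection>/...
function detectLayout(paths) {
  for (const p of paths) {
    const parts = p.split('/');
    if (parts[0] === 'root' && PITAKAS.has(parts[3])) {
      return 'new';
    }
  }
  return 'old';
}

// File names are made up for illustration.
console.log(detectLayout(['root/pli/ms/sutta/an/example.json'])); // new
console.log(detectLayout(['root/pli/ms/an/example.json']));       // old
```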

@Blake, please advise if you need any more time to complete your own transition for the sutta-vinaya-abhidhamma folder structure. We might be able to merge the new folder structure into bilara-data as early as next week.

@Sabbamitta, large earthquake merge approaching! Date is not fixed but we are discussing merge date. Please advise if you need more time to merge all your own translation branches into master.

@Kaz, this won’t affect you since you’re essentially working sutta by sutta offline.

@Sujato, the import-pali branch is actually the basis of sutta-vinaya-abhidhamma branch. Please make any changes to sutta-vinaya-abhidhamma branch instead of import-pali branch, which will eventually be deleted.


split this topic #18

2 posts were merged into an existing topic: There’s blank pages where there weren’t any before


#20

Thanks for the heads up! No, I don’t need extra time to merge translation branches. I am not working on more than one branch at a time right now, and the suttas in this section are all fairly short, so there is no problem.


#21

Update 2/17: Voice v2.1.21 is ready for staging. This release of Voice works with either bilara-data folder structure.

At this point I would like to propose that we merge sutta-vinaya-abhidhamma branch of bilara-data into master on Wednesday 2/19. Please advise if this merge would disrupt anything. The merge will change all existing content in root, translation and other folders. ATTN: @blake, @sujato, @brahmali, @sabbamitta