Specifications for UIDs for the texts

Snowbird · June 17, 2023, 11:26pm

Specifications for UIDs (DRAFT!)

UIDs are all lowercase
UIDs will only have ASCII characters
UIDs may contain a dash (-)
- to indicate an existing range (not just ad hoc ranges)
- in Vinaya texts, as part of the UID itself, e.g. pli-tv-bu-vb-pc1 for the Pali Theravada monks’ Vibhanga, pācittiya 1.
UIDs may contain a single period (.) to indicate a chapter and sutta number.
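
A rough sketch of how an app could check a string against the draft rules above. The function name `looks_like_uid` is made up for illustration, and the allowed alphabet (lowercase letters, digits, dash, period) is my assumption based on the examples in this thread, not an official definition.

```python
import re

# Assumption: a UID uses only lowercase ASCII letters, digits, "-" and ".".
_UID_CHARS = re.compile(r"[a-z0-9.-]+")


def looks_like_uid(candidate: str) -> bool:
    """Return True if `candidate` passes the draft rules listed above.

    This only checks the shape of the string. It cannot tell whether the
    UID actually exists (dhp1 has a valid shape but fails in the API).
    """
    # Rejects uppercase, spaces, en-dashes and any other non-ASCII character.
    if not _UID_CHARS.fullmatch(candidate):
        return False
    # At most a single period (chapter and sutta number).
    return candidate.count(".") <= 1
```

So `looks_like_uid("dhp1-20")` and `looks_like_uid("pli-tv-bu-vb-pc1")` pass, while a string with an uppercase letter or an en-dash does not.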

Other quirks:

In a regular URL you can use a UID for an individual sutta inside a range sutta (e.g. https://suttacentral.net/dhp1/en/sujato) and the page will load. However, dhp1 is not really a valid UID, because using it in the API (e.g. https://suttacentral.net/api/bilarasuttas/dhp1/sujato?lang=en) will fail.
In a regular URL the UID is not case sensitive; however, it’s not a good idea to rely on this.
The Vinaya UIDs are the only UIDs that contain language, school, etc.
In range suttas like the Dhammapada, the range is written with an en-dash in human-readable text, e.g. “Dhp 1–20” not “Dhp 1-20”. Using the en-dash in a UID will break it.

Snowbird · June 17, 2023, 11:28pm

I’m wondering about edge cases for UIDs so I can make sure that my apps can handle them. In the OP I’ve added what I know so far to be true. It’s a wiki post so anyone can edit.

If you have time, I’m wondering if Bhante @Sujato, @HongDa, or anyone else in the know can offer corrections or additions. If you just want to reply I’ll add them to the OP.

sujato · June 19, 2023, 12:42am

You probably also want to read Blake’s definition of an MUID (you’ll find it on our GitHub).

Yep.

In which case it would always be selected by \d-\d? I think?

Yes, but is this only Vinaya? I can’t recall, but I don’t think it was a specific design constraint. Anyway, this would be selected [a-z]-[a-z], I think.
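
A small sketch of the two patterns mentioned above (`\d-\d` and `[a-z]-[a-z]`) used to tell the two kinds of dash apart. The helper name `dash_kinds` and the labels "range" and "name" are invented for this example, and as the replies say, the patterns themselves are only an "I think".

```python
import re

RANGE_DASH = re.compile(r"\d-\d")       # dash between digits, as in "dhp1-20"
NAME_DASH = re.compile(r"[a-z]-[a-z]")  # dash between letters, as in "pli-tv-bu-vb-pc1"


def dash_kinds(uid: str) -> set[str]:
    """Report which kinds of dash occur in a UID (hypothetical helper)."""
    kinds = set()
    if RANGE_DASH.search(uid):
        kinds.add("range")
    if NAME_DASH.search(uid):
        kinds.add("name")
    return kinds
```

With the UIDs from this thread, `dash_kinds("dhp1-20")` gives `{"range"}` and `dash_kinds("pli-tv-bu-vb-pc1")` gives `{"name"}`.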

Right.

Indeed, a questionable design choice, but forced by the massive parallelism of the Vinaya, which is quite unlike the suttas.

That’s correct, but it speaks to the more general distinction between the UID and the acronym.

The UID is the jeans-&-T-shirt version, the acronym is suit-&-tie. Acronyms are case sensitive, and they have spaces. The most important thing, which I have had to drill into developers’ heads again and again, is that the meaning of acronyms is context-dependent and therefore they cannot be automatically derived from UIDs. Instead we use a lookup table, otherwise we will get capitalization wrong.
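
To illustrate the lookup-table point, a minimal sketch. The table contents and the names `ACRONYMS` and `acronym_for` are illustrative only; this is not the actual SuttaCentral table or how it is stored.

```python
# Illustrative entries only; the real table is maintained by hand.
ACRONYMS = {
    "dn1": "DN 1",
    "dhp1-20": "Dhp 1–20",  # en-dash and mixed case in the acronym
}


def acronym_for(uid: str) -> str:
    """Look the acronym up instead of deriving it from the UID."""
    return ACRONYMS[uid]  # raises KeyError for UIDs missing from the table


# A naive derivation gets capitalization, spacing and the dash wrong:
naive = "dhp1-20".upper()  # "DHP1-20"
```

Here `acronym_for("dhp1-20")` returns “Dhp 1–20”, which no simple string transformation of the UID would produce.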

So far you’ve addressed sutta UIDs; are you wanting to define segment IDs as well?

Snowbird · June 19, 2023, 1:03am

Hmm. Is this in documentation somewhere? I did a search and found it in several places in the code. Is the MUID the part of the file name after the _? Until now I haven’t messed with those much, but it would be great to document them.

Sure. I haven’t spent much time working with them other than just as an array to build the suttas.

Is there already documentation on all this in the repository? I couldn’t find any.

sujato · June 19, 2023, 5:28am

We break the file name into what I call the “uid” (dn1) and “muid” (“root-pli-ms”). By convention in SuttaCentral we use “uid” to refer to a particular sutta, in this case “dn1”; in Bilara I came up with “muid” (meta-uid), which refers to a collection, for example “root-pli-ms” for the Pali root texts and “translation-en-sujato” for English translations by Sujato.
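
Going by that description, the split could be sketched like this. The function name `split_uid_muid` is hypothetical, and it assumes every Bilara file name has the shape `<uid>_<muid>.json`, as in the example path from the issue.

```python
from pathlib import PurePosixPath


def split_uid_muid(path: str) -> tuple[str, str]:
    """Split a Bilara file path into (uid, muid) at the first underscore."""
    stem = PurePosixPath(path).stem  # e.g. "dn1_root-pli-ms"
    uid, sep, muid = stem.partition("_")
    if not sep or not uid or not muid:
        raise ValueError(f"not a <uid>_<muid>.json file name: {path!r}")
    return uid, muid


uid, muid = split_uid_muid("root/pli/ms/sutta/dn/dn1_root-pli-ms.json")
parts = muid.split("-")  # the muid sub-components: type, language, edition
```

For the path above this yields `uid == "dn1"` and `muid == "root-pli-ms"`, with `parts` being `["root", "pli", "ms"]`.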

github.com/suttacentral/bilara

Bilara Server and Models

opened 02:55PM - 29 May 23 UTC

blake-sc

stx major change back end

# An overview

## Git First.

Bilara stores its data in a git repository and the git repo is the ultimate source of truth. This contrasts strongly with most translation software, which uses a database for data. This means that in Bilara "import" and "export" don't really exist: it's enough to just update the git repository on GitHub, and that is expected to be promptly reflected in Bilara. Of course an index to the data is still required for performance reasons. Bilara also uses GitHub for login, though in developer mode login can be masqueraded.

## The way the current code works

Presently when Bilara starts up some Python code examines the file system and creates an in-memory model (basically a big mess of dicts - in fact let's just call it the BMOD) that is then used for the API. The BMOD gets updated as translation happens. The actual translation strings always reside on the file system, except of course they are loaded into a database (ArangoDB) for the purpose of search.

This architecture would likely remain the same, that is to say the text data itself wouldn't be in the database except to facilitate search, but all the greater structure (projects, users, translation progress etc.) would be tracked in the database.

The BMOD approach actually worked really well in the first iteration of Bilara and reduced the need to keep things synchronized, but as scope expanded it became a big mess, to the point where maintaining synchronization with the database would likely be much less work than maintaining the BMOD code. Also Bilara currently uses _[config].json files, which is a bit of a mess.

The proposition, then, is that Bilara would basically work like this: The root and translation texts live in a git repository. An index to the data is stored in the database. The database also manages projects, sessions, users, and some transient data like proofreader suggestions.
The database is used as the source for view generation, except that the actual files would be read for populating source and translation strings.

# Models:

## Project

Conceptually a Project defines a "source dir" and a "destination dir". These paths can be to a file or folder within the git repo; if pointing to a folder, any files within that folder or subfolders are the texts to be translated. The translation files are created as the translation takes place: in Bilara creating a project should be nearly zero cost regardless of how much text is encompassed (even thousands of files and hundreds of thousands of strings), comparable to how creating a branch in a git repo is nearly instant. This can be contrasted with Pootle, which could easily take a quarter of an hour to add a language to a large project as it would have to generate many translation units, but for Bilara creation of a project mainly just involves a quick scan of the folder to see what files are encompassed. Because projects are quick to create, they can also be "re-created" quickly, such as if new root files are added.

As a note: by convention we tend to use "root_path" and "translation_path", truth be told though they don't need to be root or translation. Bilara should support for instance translation to translation for a translation of a translation, or root to root for segmentation of a new root edition, and in the general concept of standoff data the "destination" could in principle be any kind of standoff data, such as hypothetically translating HTML to Markdown. In that sense it might be better to use source_path and dest_path, with the understanding that it is usually the root path and translation path.

A project also defines a set of translators and proofreaders.

```
Model Project:
    id string: "translation-en-bob"
    source_path string: "root/pli/ms/sutta"
    dest_path string: "translation/en/bob/sutta"
    translators []string: ["bob"]
    proofreaders []string: []
    files []string: [...]
```

A project contains a list of the files encompassed by the project. Importantly, these files may or may not exist in the git repo: Bilara does not create empty files in the git repo, it creates them lazily as translation takes place. However even non-existent (empty) translation files need to be shown in views, such as selecting them to be translated.

Translators and proofreaders should be a list of usernames or other user identifiers. A project should also support access for anyone, and perhaps even access for users who aren't logged in; this perhaps could be supported by a special username like "*".

## User:

A User is derived from GitHub auth.

```
Model User:
    login string: "bob"
    name string: "Bob M."
    email string: bob@example.com
```

At the moment we use the GitHub username. Though users can change their GitHub username, so perhaps there is a more permanent ID from the GitHub API that can be used.

### Permissions

Users have permissions presently defined by the projects. We may want other permissions independently of projects, such as global admin status.

## Strings File:

A JSON file containing key:value pairs, where the key is a segment ID and the value is a string. Generally speaking standoff data has to be combined with other data to make total sense. We (conceptually) don't want to store the entire contents of the file in the database but just some information about it.

```
Model StringsFile:
    path string: "root/pli/ms/sutta/dn/dn1_root-pli-ms.json"
    uid string: "dn1"
    muid string: "root-pli-ms" (or perhaps []string: ["root", "pli", "ms"] whatever is more convenient for indexing)
    segment_ids []string: []
    unpublished_hash string:
    published_hash string:
```

An array (logically a sorted set) of the segment_ids contained within the file is stored in the DB for ease of analyzing completion. It also stores a hash of the file in the unpublished branch and published branch or empty string.
Comparing these two identifies whether the file is unpublished, modified or fully published, and the hashes can also be used to help indicate whether the database and git are synchronized.

As a note, the published branch does not share history with the unpublished branch. Basically when a file is published it's just copied bitwise to the published branch; this means the published branch has a record of the publication updates to the file.

Note that the database entry may be required for files that don't exist yet because translation has not begun. This would depend on the particulars of the view generation, but we want the lazy translation file creation to be totally transparent, like you can select or search for a file for translation even when it hasn't been created in the file system yet.

### Completion

The completion of a translation is basically defined as follows: The root text contains a set of segment ids. The translation contains a set of segment ids. The union of the two sets is compared with the root text segment ids to determine the completion. It is inadequate to simply count the number of segment ids that exist in the respective texts. However due to the repetitive nature of SuttaCentral's source material it is permissible for a translation string to be left empty and still be considered complete; this is represented as an empty string. So that is to say, only non-existence counts against completion.

### Implicitly Related Files

A project explicitly defines a relationship between source (root) and translation (destination), however there can also be implicit relationships.
For example say there's a root file "root/pli/ms/sutta/dn/dn1_root-pli-ms.json"

And there's a translation "translation/en/sujato/sutta/dn/dn1_translation-en-sujato.json"

Then there's another translation "translation/de/sabbamitta/sutta/dn/dn1_translation-de-sabbamitta.json"

Sabbamitta, as an individual translator, is free to add a column for "translation-en-sujato" so that she can refer to Sujato's translation instead of relying just on the Pali. This relationship should not need to be explicitly defined in the project; it's the translator's free choice what columns they have, other than the mandatory source and dest. If the next translation file Sabbamitta opens also has a "translation-en-sujato", it should be automatically included.

How does this work? We break the file into what I call the "uid" (dn1) and "muid" ("root-pli-ms"). By convention in SuttaCentral we use "uid" to refer to a particular sutta, in this case "dn1", and in Bilara I came up with "muid" (meta-uid), which refers to a collection, for example "root-pli-ms" being the Pali root texts and "translation-en-sujato" being English translations by Sujato. Therefore we want to index the uid and muids of a text so we can identify the implicit relationships. When the translation view is generated, the list of all muids associated with that uid should be included in the "column picker" interface.

Note that the muid sub-components are also important for identifying language and type and stuff, so they should be indexed separately, maybe. It's actually more important for search filtering; Bilara mostly treats the muid part as a blob.

## Segment Attached Data (Naming things is hard)

For advanced features, we sometimes need data to be attached to particular segments. While this is what standoff data basically is, the SAD can be considered to be generally very sparse and transient: it only exists until its purpose has been fulfilled.

Examples of SAD:

* Proofreader suggestions.
* Changes to root string have possibly invalidated translation, requiring flagging the translations.
* Translator "notes to self" separate from translator comments

```
Model SegmentAttachedData:
    file_path string:
    segment_id string:
    creator string:
    ctime date:
    type string:
    value string: ? (or perhaps payload JSON object)
```

Whenever a translation file is opened, or possibly when a segment is navigated to, any associated SAD should be attached to it.

Note: SAD does not exist in the current Bilara code base, so exact requirements are a bit hazy.

It might be desirable to have project and user fields on it. For example one view we want to have would show all proofreader suggestions applicable to a project/user; while it would be possible to use JOINs with the project, it might be better to have some denormalization.

## Translation Suggestions/Matches

Basically "Translation Memory", though it's a bit of a misnomer. The idea in Bilara, like other translation engines, is that translation memory is a separate service that can be given a string and returns similar translation units. Presently in Bilara it's essentially a fuzzy text search finding similar strings, but we would like to have machine translation options too, which would offer up machine translation that could be used by the translator.

Interface details should be worked out, but the translation matches service basically takes a source language, source string and destination language. Finding matches is relatively expensive and is done on the fly as the translator selects the next string, with a little pre-fetching to keep things responsive on slow connections. Currently in Bilara both full text search and translation matches are powered by ArangoDB ArangoSearch.
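
Coming back to the Completion section of the issue quoted above, here is one possible reading of that rule as code. It takes "only non-existence counts against completion" literally: a segment counts as done when its ID exists in the translation, even if the string is empty. The function name `completion` and the segment IDs in the usage note are hypothetical, and this is not taken from the Bilara code base.

```python
def completion(root: dict[str, str], translation: dict[str, str]) -> float:
    """Fraction of root segment IDs that also exist in the translation.

    An empty translation string still counts as present; only a missing
    segment ID counts against completion. Segment IDs that appear only
    in the translation are ignored.
    """
    root_ids = set(root)
    if not root_ids:
        return 1.0  # assumption: nothing to translate means complete
    present = root_ids & set(translation)
    return len(present) / len(root_ids)
```

For a root file with three segments, say "dn1:0.1", "dn1:0.2" and "dn1:1.1.1", and a translation holding the first with text and the second as an empty string, the result is two thirds.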

BTW, I’m loving GitHub’s new search, it’s really well done!