Desmond Schmidt proposes a mechanism to overcome the shortcomings of XML-based markup; worth taking a look at.
Towards an Interoperable Digital Scholarly Edition
Standoff properties as an alternative to XML
for digital historical editions
Very interesting and useful, thanks. Just to note, the PDF file isn't working. Can you upload it here?
I've looked at TEI to some degree, and, in my own uninformed way, was groping towards some of the same conclusions in the article. But it's good to clarify it. @Blake and @Vimala, could you have a look at this when you get the chance?
On SC we are moving in some respects in this direction right now. For example, currently we have inherited cross-references, which are embedded as metadata in the Pali text. These have been extracted by @Vimala and will be encoded as JSON. One of the aims of this is to ensure that all our parallel data is maintained in a single consistent form, which will benefit both us and anyone who wants to use our data.
In addition, our text is being translated via PO files, which extract the HTML markup and set it to one side, as it were. The key feature going forward will be the segments of the text, each of which has a unique ID in the system. We can subsequently enrich or enhance the text based on these segments. The PO text will be exported both back to HTML and to TEX for printing.
Thanks so much. I've read through the paper, and frankly it's a godsend. It seems to solve at a stroke so many of the problems that we face: annotations, notes and variants, text/image mapping, rich markup, and most of all, interoperability.
The idea of defining relative positions is genius.
May I amend my previous request for @blake and @vimala to look at this when you have some time. Please read this immediately! I'd like to discuss adopting this approach right away, and making use of it for our ongoing revision. It seems to address so many of the issues we've been having, and solve them in a comprehensive way.
And no, it's not just because it's an Australian invention!
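If it helps to make the relative-positions idea concrete, here is a minimal sketch in Python (my own illustration, not Schmidt's actual STIL encoding): each property's start is stored as a delta from the previous property's start, so an edit to the text only disturbs the deltas near the change, rather than every absolute offset after it.

```python
# Convert absolute standoff ranges to relative offsets and back.
# Each range is (start, length); the "relative" form stores each start
# as a delta from the previous range's start.

def to_relative(ranges):
    out, prev = [], 0
    for start, length in sorted(ranges):
        out.append((start - prev, length))
        prev = start
    return out

def to_absolute(rel_ranges):
    out, pos = [], 0
    for reloff, length in rel_ranges:
        pos += reloff
        out.append((pos, length))
    return out

ranges = [(0, 4), (5, 2), (8, 5)]
rel = to_relative(ranges)  # [(0, 4), (5, 2), (3, 5)]
assert to_absolute(rel) == ranges
```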
Iāve read through the linked articles and I can readily see the merits.
The question is how far do you want to go? We already try to define things in terms of IDs instead of inline markup tainting the text (a good example is how I think that embedded parallels are evil and need to be purged from the system in favor of "standoff" definitions).
The idea where you have the texts as plain text, and then have standoff markup: well, it's a cool idea, and computers are definitely fast enough these days. My concern is extremely immature tool support. There is a reference to STIL and a program called STRIPPER; more info can be found at http://ecdosis.net and the tools in the GitHub repositories at https://github.com/ecdosis
From what I can tell we'd be on the bleeding edge of using such a technique; the point of most bleeding would likely be in editing documents, as you clearly need some kind of friendly editing utility. Just converting between formats is probably not going to be difficult.
Exactly my thoughts. I've briefly glanced at the GitHub repositories, although ecdosis.net is down for me.
Probably the way to go is to start implementing some things that we are working on already, and see how we go.
Ultimately I'm thinking we should go the full monty.
We can probably get in touch with the developers; as academics they may well be interested to assist a project like ours.
Do you? What about using Git to keep track of versions? The STIL data could be updated via Git rather than directly from an editor. Obviously, as JSON, it could be hand-edited as well.
I am now working on a standoff markup editor based on CodeMirror and React.js; I'll let you know when a working prototype is ready.
You need a way of seeing offsets in the text, and ideally you want an easy way to assign some markup to a particular range in the text. Sure, JSON is hand-editable, but this use case isn't particularly friendly to hand-editing.
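For what it's worth, the read-only half of the problem is simple; here is a toy sketch (my own shape for the property records, not the actual STIL schema) of applying non-overlapping standoff properties to a plain text:

```python
import html

text = "evam me sutam"
# Hypothetical standoff properties: codepoint offsets into `text`.
props = [
    {"name": "em", "start": 0, "len": 4},
    {"name": "strong", "start": 8, "len": 5},
]

def apply_props(text, props):
    """Render non-overlapping standoff properties as HTML tags."""
    out, pos = [], 0
    for p in sorted(props, key=lambda p: p["start"]):
        out.append(html.escape(text[pos:p["start"]]))
        body = html.escape(text[p["start"]:p["start"] + p["len"]])
        out.append("<%s>%s</%s>" % (p["name"], body, p["name"]))
        pos = p["start"] + p["len"]
    out.append(html.escape(text[pos:]))
    return "".join(out)

print(apply_props(text, props))
# <em>evam</em> me <strong>sutam</strong>
```

The hard part, as noted above, is the inverse: an editor that keeps those offsets valid while the text changes.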
Thanks so much. Are you in contact with the people at Ecdosis?
Okay, I sat down and read the docs and carefully examined the example STIL data. The biggest surprise is that the algorithm appears to work on UTF-8 byte offsets, not Unicode codepoint offsets.
The difference is this. Let's take the sentence:
evaṃ me sutaṃ
How long is that sentence in characters?
If you're using Unicode codepoints, that sentence has a length of 13 characters; on the other hand, if you're using UTF-8, it has a length of 17 bytes, because the character ṃ takes 3 bytes to encode in UTF-8. The first word is 4 Unicode codepoints, but 6 UTF-8 bytes.
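These counts are easy to check in Python 3, whose strings are sequences of codepoints (assuming the precomposed character ṃ, U+1E43):

```python
s = "evaṃ me sutaṃ"

print(len(s))                       # 13 codepoints
print(len(s.encode("utf-8")))       # 17 bytes: each ṃ (U+1E43) takes 3
print(len("evaṃ"))                  # 4 codepoints
print(len("evaṃ".encode("utf-8")))  # 6 bytes
```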
Why would it be implemented in terms of UTF-8? Well, that's because of the whole nest of snakes which is Unicode support in varying languages. Let's take "nirodha" in Brahmi:
𑀦𑀺𑀭𑁄𑀥
How long is it? To a human it looks like 3 characters; in Unicode it is 5 codepoints, because the vowel signs i and o are combining marks. But different languages will give different lengths for the same string:
Python (unicode string): 5
Javascript (unicode string): 10
UTF-8 (byte string): 20
Python is pretty much unique in that it says it is 5 Unicode codepoints long, although this is only true of wide builds of Python, and Python >= 3.3. JavaScript and many other languages (including "narrow builds" of Python < 3.3) will say it is 10 characters long. This is because Brahmi is outside the BMP Unicode range, and in a language like JavaScript, which uses UTF-16 internally, codepoints within the BMP range use 1 character, and codepoints outside the BMP range use 2 characters (so-called surrogate pairs).
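A quick Python check of the three counts (writing the word as escapes; the counts hold for any five supplementary-plane codepoints):

```python
# "nirodha" in Brahmi: five codepoints, all outside the BMP.
s = "\U00011026\U0001103A\U0001102D\U00011044\U00011025"

print(len(s))                           # 5: codepoints (Python 3 str)
print(len(s.encode("utf-16-le")) // 2)  # 10: UTF-16 code units (what JS .length counts)
print(len(s.encode("utf-8")))           # 20: UTF-8 bytes (4 per supplementary codepoint)
```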
For much more on the fun that is divergent Unicode handling across languages, see https://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
This means that a standoff markup defined in terms of character offsets is not as straightforward or universal as one might hope, and very easy to get wrong, thanks to the great many programming languages which get Unicode almost, but not quite, right. Using UTF-8 makes a lot of sense from a systems programming perspective, as UTF-8 is utterly consistent across all platforms. On the other hand, it's still a binary encoding, and defining offsets in terms of bytes looks archaic; (a few) modern programming languages have graceful and correct Unicode handling in terms of codepoints. Using byte offsets kind of makes sense now, and really made sense 10 years ago, but I don't think it'll make sense in 30 years' time. UTF-8 will surely stick around for a long time, but programming languages will move towards text processing in Unicode rather than bytes.
I feel quite strongly that offsets should be defined in terms of Unicode codepoints and not UTF-8 byte offsets. Also, this post serves to emphasize that doing this right is tricky; the risks of making a fragile system are high.
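Storing codepoint offsets doesn't preclude talking to byte-oriented tools, since the two offset schemes are cheap to convert between; a sketch (assuming the offset falls on a character boundary):

```python
def cp_to_byte_offset(text, cp_offset):
    """Map a codepoint offset into `text` to a UTF-8 byte offset."""
    return len(text[:cp_offset].encode("utf-8"))

def byte_to_cp_offset(text, byte_offset):
    """Map a UTF-8 byte offset back to a codepoint offset.

    Assumes byte_offset lands on a character boundary; decode() will
    raise if it splits a multi-byte sequence.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))

text = "evaṃ me sutaṃ"
print(cp_to_byte_offset(text, 4))  # 6
print(byte_to_cp_offset(text, 6))  # 4
```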
I intend to implement the algorithm in JavaScript so it can be run in a browser, which introduces all sorts of interesting possibilities; running a Java service seems heavyweight, and the basic algorithm is not too difficult to implement.
Agreed, it is much easier to use Unicode codepoint offsets than byte offsets, e.g., codePointAt in JavaScript (note that charAt and charCodeAt work on UTF-16 code units, not codepoints).
In Java's "modified UTF-8", a surrogate pair needs 6 bytes (3 bytes each for the high and low surrogate), while standard UTF-8 encodes the same codepoint in only 4 bytes.
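This is the "modified UTF-8" used by Java's DataInput/DataOutput (essentially CESU-8): each UTF-16 surrogate is encoded as its own 3-byte sequence. It can be reproduced in Python for illustration:

```python
ch = "\U00011026"  # a supplementary-plane codepoint (a Brahmi letter)
print(len(ch.encode("utf-8")))  # 4: standard UTF-8

# Split into UTF-16 surrogates, then encode each surrogate on its own,
# which is what Java's "modified UTF-8" (CESU-8) does.
units = ch.encode("utf-16-be")
hi = chr(int.from_bytes(units[:2], "big"))
lo = chr(int.from_bytes(units[2:], "big"))
cesu = hi.encode("utf-8", "surrogatepass") + lo.encode("utf-8", "surrogatepass")
print(len(cesu))  # 6: 3 bytes per surrogate
```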
Huh. Well, learn something new.
If I understand the linked article correctly, the latest Python is one of those languages, so that's good.
Yes, this is definitely something that needs considering. The few examples of standoff that I've seen are small, and limited to just one text in one language. Whether it can scale to what we need is TBD, I guess.
Indeed. I saw "java" and "drupal" and was like, "way to kill the mood, thanks guys."
We should stay in touch with Yap, he's also working on this in JavaScript.
Bonus! Double points for using Brahmi as an example!
Heh. The reason I used Brahmi as an example is that at present we use absolutely nothing from outside the BMP, so poor Unicode handling is presently a non-issue: a naive implementation using a programming language's native Unicode strings would produce consistent results in Python, JavaScript, Java, and numerous other languages. BUT it would then produce inconsistent results with Brahmi, which, as an obscure historical script, is outside the BMP.
So weāve got cuneiform covered. Excellent!
But somewhat more usefully, there is a non-trivial chance that at some point we'll want to handle obscure CJK ideographs, which are being assigned to the Supplementary Ideographic Plane. As time goes on, it's likely that more precise handling of these will be implemented by the Chinese text projects, and we will want to take advantage of this.
Hi. I'm working on the Kangyur, the Tibetan Tripitaka, and I'm trying to represent various editions in standoff markup. Did you guys figure out a way to implement this? Thanks for your great work!
Hi Trinley, this sounds awesome. I am very interested in standoff, but we are focusing on other areas of development at the moment. However, Brother @yap is working on standoff tools for the Chinese texts, so maybe he could give us an update?
It would be amazing to have the Pali, Chinese, and Tibetan texts all using standoff properties!
As for what we are doing, while we have not implemented standoff properties, we are moving in that direction to some degree. For example, the texts I am working on for translation have the HTML extracted; it will be reinserted later on.
Can you share some more details about your project with us?
Hi all,
I came across this topic thread when looking into standoff property editors. I have developed an open-source JavaScript-based real-time standoff property editor which allows the user to modify the text without breaking existing annotation character indexes. It uses a different method from Desmond Schmidt's. At present it isn't built around Unicode, although I'm working with a contributor to add Arabic script rendering support.
If anyone is interested, Iāve published the editor as a standalone module on GitHub.
This module is integrated into my other project, called "The Codex", which is a kind of text-as-a-graph system that interleaves texts with graph entities via standoff property nodes. As I am only an English speaker, I didn't build it with Unicode in mind, but I'd be interested in exploring the possibility of extending it to handle Unicode. I'm not sure how possible this is, as text ranges are referenced via character indexes, but perhaps some kind of byte-length mapping could be stored to translate between indexes and codepoints? This is new to me, so I don't really know what I'm talking about here. But the standoff property editor works well for my purposes in Latin-based languages so far, with a complete edition of Michelangelo's letters entered in English, and a large portion of the Diary of Luca Landucci (a Florentine contemporary of Michelangelo).
Happy to correspond further if anyone finds this helpful.
Best regards,
Iian
Thanks so much for posting this @argimenes. This looks great!
@sujato and @blake - can you have a look at this?
I noticed you use Bootstrap and jQuery. Our current setup for SuttaCentral uses neither. We are in the process of moving to lit-element, which basically uses vanilla JS, so implementing it in our current structure would not be straightforward and would need some adaptations.
Hi Vimala,
Thanks for your feedback, glad that you find this approach interesting! I should just clarify that the editor module itself is written in vanilla JavaScript. Bootstrap, jQuery and KnockoutJS are only used by the code that makes use of the editor, and only to capture click events, fetch standoff property JSON files, etc.
So you could use the editor module itself in any web project without being tied to other dependencies.
Best regards,
Iian
Hi Iian,
Wow, this is looking great. Thanks so much for sharing this here. I think the potential for standoff properties is so powerful!
Currently on SC we are moving towards standoff markup as opposed to standoff properties. We divide the texts into semantically meaningful segments, then key the text, translation, notes, variants, references, markup, and so on off those segments.
You can see the data here:
This is all a work in progress, but the basic idea should be clear enough. Currently @blake is working on an online editor (Bilara) for working with this data; it is essentially a thin client layered on top of Github APIs.
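To illustrate the shape of that data (the segment IDs and strings below are invented for the example, not taken from the actual repository): each kind of data is a flat JSON object keyed by segment ID, so combining layers is a simple join:

```python
# Hypothetical segment-keyed data in the style described above.
root = {"mn1:1.1": "evaṃ me sutaṃ", "mn1:1.2": "ekaṃ samayaṃ bhagavā"}
translation = {"mn1:1.1": "So I have heard.", "mn1:1.2": "At one time the Buddha"}
markup = {"mn1:1.1": "<p>{}", "mn1:1.2": "{}</p>"}

# Rendering a layer is just pouring its segments into the markup templates.
html_out = "".join(markup[k].format(translation[k]) for k in root)
print(html_out)  # <p>So I have heard.At one time the Buddha</p>
```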
With this system we lose the granularity of standoff properties, but gain a lot of ease of use and tooling. For our purposes so far it has been good enough. Ultimately we'd still like to employ standoff properties, once the tooling is there.
We deal with large numbers of texts, so we donāt have incredibly detailed text-critical apparatus for specific passages, such as one might find in a smaller and more densely-studied corpus.
However, for cases where we do need more granularity than the standoff markup, we employ a markdown-like system for annotating within the segment. This is mainly for the ease of typists, and would be converted to HTML etc. for publishing.
Like I said, work in progress, but this is an area where standoff properties would be great.