Desmond Schmidt proposes a mechanism to overcome the shortcomings of XML-based markup. Worth taking a look.
Very interesting and useful, thanks. Just to note, the PDF file isn’t working. Can you upload it here?
I’ve looked at TEI to some degree, and, in my own uninformed way, was groping towards some of the same conclusions in the article. But it’s good to clarify it. @Blake and @Vimala, could you have a look at this when you get the chance?
On SC we are moving in some respects in this direction right now. For example, currently we have inherited cross-references, which are embedded as metadata in the Pali text. These have been extracted by @Vimala and will be encoded as JSON. One of the aims of this is to ensure that all our parallel data is maintained in a single consistent form, which will benefit both us and anyone who wants to use our data.
In addition, our text is being translated via PO files, which extract the HTML markup and set it to one side, as it were. The key feature going forward will be the segments of the text, each of which has a unique ID in the system. We can subsequently enrich or enhance the text based on these segments. The PO text will be exported both back to HTML and to TEX for printing.
Thanks so much. I’ve read through the paper, and frankly it’s a godsend. It seems to solve at a stroke so many of the problems that we face: annotations, notes and variants, text/image mapping, rich markup, and most of all, interoperability.
The idea of defining relative positions is genius.
May I amend my previous request for @blake and @vimala to look at this when you have some time? Please read this immediately! I’d like to discuss adopting this approach right away, and making use of it for our ongoing revision. It seems to address so many of the issues we’ve been having, and solve them in a comprehensive way.
And no, it’s not just because it’s an Australian invention!
I’ve read through the linked articles and I can readily see the merits.
The question is how far do you want to go? We already try to define things in terms of ids instead of inline markup tainting the text (a good example is how I think that embedded parallels are evil and need to be purged from the system in favor of “standoff” definitions).
The idea where you have the texts as plain texts, and then have standoff markup, well, it’s a cool idea and computers are definitely fast enough these days. My concern is the extremely immature tool support. There is a reference to STIL and a program called STRIPPER. More info can be found at http://ecdosis.net and the tools are in the GitHub repositories at https://github.com/ecdosis
From what I can tell we’d be on the bleeding edge of using such a technique. The point of most bleeding would likely be in editing documents, as you clearly need some kind of friendly editing utility. Just converting between formats is probably not going to be difficult.
Exactly my thoughts. I’ve briefly glanced at the Github, although ecdosis.net is down for me.
Probably the way to go is to start implementing some things that we are working on already, and see how we go.
Ultimately I’m thinking we should go the full monty.
We can probably get in touch with the developers; as academics they may well be interested to assist a project like ours.
Do you? What about using Git to keep track of versions? The STIL data could be updated via Git rather than directly from an editor. Obviously as JSON it could be hand edited as well.
I am now working on a standoff markup editor based on CodeMirror and React.js, and will let you know when a working prototype is ready.
You need a way of seeing offsets in the text, and ideally you want an easy way to assign some markup to a particular range in the text. Sure, JSON is hand-editable but this use case isn’t particularly friendly to hand-editing.
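To make the point about offsets and ranges concrete, here is a minimal sketch in Python. The property format (`start`, `end`, `type`) is hypothetical, invented for illustration, and is not the STIL format; offsets are assumed to be half-open ranges counted in codepoints, and ranges are assumed not to overlap.

```python
import json

# "\u1e43" is the precomposed letter m with dot below.
text = "eva\u1e43 me suta\u1e43"

# Hypothetical standoff properties: half-open [start, end) codepoint ranges.
properties = json.loads("""
[
  {"start": 0, "end": 4, "type": "em"},
  {"start": 8, "end": 13, "type": "strong"}
]
""")


def show_offsets(text):
    """Return one line per character, labelled with its codepoint offset."""
    return [f"{i:>3} {ch!r}" for i, ch in enumerate(text)]


def render(text, properties):
    """Wrap each (non-overlapping) range in an HTML-like tag."""
    out = []
    cursor = 0
    for prop in sorted(properties, key=lambda p: p["start"]):
        out.append(text[cursor:prop["start"]])
        span = text[prop["start"]:prop["end"]]
        out.append(f"<{prop['type']}>{span}</{prop['type']}>")
        cursor = prop["end"]
    out.append(text[cursor:])
    return "".join(out)


print("\n".join(show_offsets(text)))
rendered = render(text, properties)
print(rendered)
```

Even in this toy case, working out that the last word starts at offset 8 by hand is tedious, which is why an editor that shows offsets matters.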
Thanks so much. Are you in contact with the people at Ecdosis?
Okay, I sat down and read the docs and carefully examined the example STIL data. The biggest surprise is that the algorithm appears to work on (UTF-8) byte offsets, not Unicode codepoint offsets.
The difference is this. Let’s take the sentence:
evaṃ me sutaṃ
How long is that sentence in characters?
If you’re using Unicode codepoints, that sentence has a length of 13 characters; if you’re using UTF-8, it has a length of 17 bytes, because the character ṃ takes 3 bytes to encode in UTF-8. The first word is 4 Unicode codepoints, but 6 UTF-8 bytes.
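These counts can be checked with a few lines of Python. This is only a sketch, and it assumes the ṃ is stored as the single precomposed codepoint U+1E43; if the text were stored in decomposed form the codepoint count would differ.

```python
import unicodedata

# "\u1e43" is the precomposed letter m with dot below (assumed here).
sentence = "eva\u1e43 me suta\u1e43"

codepoints = len(sentence)
utf8_bytes = len(sentence.encode("utf-8"))
first_word = sentence.split()[0]
first_word_counts = (len(first_word), len(first_word.encode("utf-8")))

print(codepoints)         # 13
print(utf8_bytes)         # 17
print(first_word_counts)  # (4, 6)

# Decomposed form (NFD): m followed by a combining dot below.
# The codepoint count rises to 15; the byte count happens to stay at 17.
decomposed = unicodedata.normalize("NFD", sentence)
decomposed_counts = (len(decomposed), len(decomposed.encode("utf-8")))
print(decomposed_counts)  # (15, 17)
```

The last two lines show a further wrinkle: the same visible text can have different codepoint offsets depending on normalization, so any offset scheme also needs to fix a normalization form.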
Why would it be implemented in terms of UTF-8? Well, that’s because of the whole nest of snakes which is Unicode support in varying languages. Let’s take ‘nirodha’ in Brahmi:
How long is it? To a human it looks like 3 characters; in Unicode it is 5 codepoints, because the vowel signs for i and o are combining marks. But in different languages a string will give different lengths:
Python (unicode string): 5
UTF-8 (byte string): 20
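A quick way to reproduce those numbers in Python is sketched below. The five codepoints are my reconstruction of the spelling (NA, vowel sign I, RA, vowel sign O, DHA from the Brahmi block) and should be treated as an assumption; the counts hold for any five codepoints outside the Basic Multilingual Plane. The UTF-16 figure is included because it is what languages with UTF-16 strings, such as JavaScript, report as the string length.

```python
# Assumed spelling: NA, VOWEL SIGN I, RA, VOWEL SIGN O, DHA (Brahmi block).
nirodha = "\U00011026\U0001103A\U0001102D\U00011044\U00011025"

codepoints = len(nirodha)                         # Python 3 str length
utf8_bytes = len(nirodha.encode("utf-8"))         # 4 bytes per codepoint
utf16_units = len(nirodha.encode("utf-16-le")) // 2  # surrogate pairs

print(codepoints, utf8_bytes, utf16_units)  # 5 20 10
```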
For much more on the fun that is inconsistent Unicode handling, see https://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
This means that standoff markup defined in terms of character offsets is not as straightforward or universal as one might hope, and it is very easy to get wrong thanks to the great many programming languages which get Unicode almost, but not quite, right. Using UTF-8 makes a lot of sense from a systems programming perspective, as UTF-8 is utterly consistent across all platforms. On the other hand, it’s still a binary encoding, and defining offsets in terms of bytes looks archaic; (a few) modern programming languages have graceful and correct Unicode handling in terms of codepoints. Using byte offsets kind of makes sense now and really made sense 10 years ago, but I don’t think it’ll make sense in 30 years’ time. UTF-8 will surely stick around for a long time, but programming languages will move towards processing text as Unicode rather than bytes.
I feel quite strongly that offsets should be defined in terms of Unicode codepoints and not UTF-8 byte offsets. Also, this post serves to emphasize that doing this right is tricky; the risks of making a frail system are high.
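If existing data does use UTF-8 byte offsets, converting to codepoint offsets is mechanical. The following is a minimal sketch, not taken from the STIL tools; the function names are my own.

```python
def byte_to_codepoint_offset(text, byte_offset):
    """Convert a UTF-8 byte offset into a codepoint offset."""
    encoded = text.encode("utf-8")
    if not 0 <= byte_offset <= len(encoded):
        raise ValueError("byte offset out of range")
    # Decoding the prefix raises UnicodeDecodeError if the offset
    # falls in the middle of a multi-byte character.
    return len(encoded[:byte_offset].decode("utf-8"))


def codepoint_to_byte_offset(text, cp_offset):
    """Convert a codepoint offset into a UTF-8 byte offset."""
    if not 0 <= cp_offset <= len(text):
        raise ValueError("codepoint offset out of range")
    return len(text[:cp_offset].encode("utf-8"))


sentence = "eva\u1e43 me suta\u1e43"
print(byte_to_codepoint_offset(sentence, 6))  # 4
print(codepoint_to_byte_offset(sentence, 4))  # 6
```

A byte offset that lands inside a character is simply an error, which is one of the ways a byte-based scheme can go wrong silently if nothing validates it.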
In Java’s modified UTF-8, a surrogate pair needs 6 bytes (3 bytes each for the high and low surrogates), while most systems take only 4 bytes.
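If I have this right, the 6-byte figure comes from encoding each UTF-16 surrogate separately rather than encoding the codepoint as a whole. The sketch below imitates that in Python for comparison; the function name is my own, and it ignores other details of Java’s modified UTF-8 such as its special treatment of NUL.

```python
def surrogate_wise_length(text):
    """Bytes needed if each UTF-16 code unit is encoded separately."""
    utf16 = text.encode("utf-16-be")
    total = 0
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python encode a lone surrogate as 3 bytes.
        total += len(chr(unit).encode("utf-8", "surrogatepass"))
    return total


char = "\U00011026"  # any codepoint outside the Basic Multilingual Plane
standard = len(char.encode("utf-8"))
surrogate_wise = surrogate_wise_length(char)
print(standard, surrogate_wise)  # 4 6
```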
Huh. Well, learn something new.
If I understand the linked article correctly, the latest Python is one of those languages, so that’s good.
Yes, this is definitely something that needs considering. The few examples of standoff that I’ve seen are small, and limited to just one text in one language. Whether it can scale to what we need is TBD, I guess.
Indeed. I saw “java” and “drupal” and was like, “way to kill the mood, thanks guys.”
Bonus! Double points for using Brahmi as an example!
Problems with marking up references for CBETA texts
So we’ve got cuneiform covered. Excellent!
But somewhat more usefully, there is a non-trivial chance that at some point we’ll want to handle obscure CJK ideographs, which are being assigned to the Supplementary Ideographic Plane. As time goes on, it’s likely that more precise handling of these will be implemented by the Chinese text projects and we will want to take advantage of this.
Hi. I’m working on the Kangyur, the Tibetan Tripitaka, and I’m trying to represent various editions in standoff markup. Did you guys figure out a way to implement this? Thanks for your great work!
Hi Trinley, this sounds awesome. I am very interested in standoff, but we are focusing on other areas of development at the moment. However, Brother @yap is working on standoff tools for the Chinese texts, so maybe he could give us an update?
It would be amazing to have the Pali, Chinese, and Tibetan texts all using standoff properties!
As for what we are doing, while we have not implemented standoff properties, we are moving in that direction to some degree. For example, the texts I am working on for translation have the HTML extracted, and will be reinserted later on.
Can you share some more details about your project with us?
If anyone is interested, I’ve published the editor as a standalone module on GitHub.
This module is integrated into my other project, called ‘The Codex’, which is a kind of text-as-a-graph system that interleaves texts with graph entities via standoff property nodes. As I am only an English speaker, I didn’t build it with Unicode in mind, but I’d be interested in exploring the possibility of extending it to handle Unicode. I’m not sure how possible this is, as text ranges are referenced via character indexes, but perhaps some kind of byte length mapping could be stored to translate between indexes and code points? This is new to me, so I don’t really know what I’m talking about here. But the standoff property editor works well for my purposes in Latin-based languages so far, with a complete edition of Michelangelo’s letters entered in English, and a large portion of the Diary of Luca Landucci (a Florentine contemporary of Michelangelo).
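One possible shape for such a mapping is sketched below in Python. It assumes the editor’s character indexes are UTF-16 code units, as in JavaScript strings, which is a guess about the module rather than something stated here; the function name is hypothetical.

```python
def utf16_starts(text):
    """For each codepoint index, the UTF-16 code-unit index where it starts.

    The final entry is the total length in code units.
    """
    starts = []
    unit = 0
    for ch in text:
        starts.append(unit)
        # Codepoints above U+FFFF take two UTF-16 code units.
        unit += 2 if ord(ch) > 0xFFFF else 1
    starts.append(unit)
    return starts


sample = "a\U00011026b"  # one non-BMP character between two ASCII letters
starts = utf16_starts(sample)
print(starts)  # [0, 1, 3, 4]

# Codepoint index -> code-unit index is a lookup;
# the reverse direction is a search in the same list.
codepoint_index = starts.index(3)
print(codepoint_index)  # 2
```

For text entirely within the Basic Multilingual Plane, which covers Latin scripts with diacritics, the two kinds of index coincide and no mapping is needed.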
Happy to correspond further if anyone finds this helpful.
Thanks so much for posting this @argimenes. This looks great!
I noticed you use Bootstrap and jQuery. Our current setup for SuttaCentral uses neither. We are in the process of moving to lit-elements, which basically uses vanilla JS, so implementing it in our current structure would not be straightforward and would need some adaptations.
So you could use the editor module itself in any web project without being tied to other dependencies.
Wow, this is looking great. Thanks so much for sharing this here. I think the potential for standoff properties is so powerful!
Currently on SC we are moving towards standoff markup as opposed to standoff properties. We divide the texts into semantically meaningful segments, then key text, translation, notes, variants, references, markup, and so on off that.
You can see the data here:
This is all a work in progress, but the basic idea should be clear enough. Currently @blake is working on an online editor (Bilara) for working with this data; it is essentially a thin client layered on top of Github APIs.
With this system we lose the granularity of standoff properties, but gain a lot of ease of use and tooling. For our purposes so far it has been good enough. Ultimately we’d still like to employ standoff properties, once the tooling is there.
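As a rough illustration of the segment-keyed approach, here is a small Python sketch. The segment IDs, layer names, sample text, and the `{}` template convention are all invented for this example and are not a description of the actual SuttaCentral data files.

```python
# Each layer is a separate mapping keyed by the same segment IDs.
# All IDs and contents below are illustrative placeholders.
root = {
    "ex1:1.1": "eva\u1e43 me suta\u1e43",
    "ex1:1.2": "(placeholder root text)",
}
translation = {
    "ex1:1.1": "So I have heard.",
    "ex1:1.2": "(placeholder translation)",
}
markup = {
    "ex1:1.1": "<p>{} ",
    "ex1:1.2": "{}</p>",
}


def merge(layers):
    """Combine layers into {segment_id: {layer_name: value}}."""
    merged = {}
    for name, layer in layers.items():
        for seg_id, value in layer.items():
            merged.setdefault(seg_id, {})[name] = value
    return merged


def render_html(text_layer, markup_layer):
    """Reinsert the set-aside markup around each segment's text."""
    return "".join(
        markup_layer.get(seg_id, "{}").format(text)
        for seg_id, text in text_layer.items()
    )


merged = merge({"root": root, "translation": translation})
html = render_html(translation, markup)
print(html)
```

The trade-off is visible here: nothing can point at a range inside a segment, but every layer stays a flat, diff-friendly mapping that is easy to edit and to track in Git.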
We deal with large numbers of texts, so we don’t have incredibly detailed text-critical apparatus for specific passages, such as one might find in a smaller and more densely-studied corpus.
However, for cases where we do need more granularity than the standoff markup, we employ a markdown-like system for annotating within the segment. This is mainly for the ease of typists, and would be converted to HTML etc. for publishing.
Like I said, work in progress, but this is an area where standoff properties would be great.