Should we use a dedicated glyph (null?) for segmenting texts?

On Pootle we segment texts based on major punctuation. This is very useful as an initial approximation, as it breaks the text into (mostly) semantic segments.

It’s far from perfect, though. The punctuation is not consistent, for example. I have been recording such instances, and intend, when my translation is done, to go back over the whole corpus and resegment. Obviously this is not urgent, but I thought I’d record my thoughts here.

This gives us an opportunity to create segments using any means we like. One important criterion is that the texts should be reversible; that is, we should be able to resegment them. Here are some options.

  1. Change the punctuation (in the original text) to make it consistent.
  • Advantages: simple, and it improves the text. (I will be doing this to some extent regardless.)
  • Disadvantages: It won’t work every time. In particular, it breaks on ellipses, which should sometimes indicate a segment and sometimes not. Also, the punctuation might be corrected at some point, which would ruin everything!
  2. Just do it by hand in the PO files.
  • Advantages: pure coding, no changes to the text required.
  • Disadvantages: loses reversibility.
  3. Use some other markup: add an HTML span or something to indicate segments.
  • Advantages: umm …
  • Disadvantages: makes the code crappy.
  4. Use standoff.
  • Advantages: super clean, nothing is in the text.
  • Disadvantages: it doesn’t exist.
  5. Insert some dedicated glyph at segmentation points.
  • Advantages: simple, unambiguous, robust. We could use the NULL character, which, if I am not mistaken, has a similar delimiting use in C and elsewhere.
  • Disadvantages: might bug out? If the text is reused, people might not know the markers are there. Maybe use the visible ␀ instead? (See the sketch below.)
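
To make option 5 concrete, here’s a minimal sketch of what the round trip might look like (the marker character and the function names are just placeholders for illustration, nothing we’ve settled on):

SEGMENT_MARK = '\u2400'  # "␀" SYMBOL FOR NULL; could equally be the real NUL, '\x00'

def split_segments(text):
    # Split the text into segments at each marker, discarding the markers.
    return text.split(SEGMENT_MARK)

def join_segments(segments):
    # Reverse operation: re-insert the markers, restoring the original text.
    return SEGMENT_MARK.join(segments)

text = 'Sukhaṃ supāhi therike,\u2400Katvā coḷena pārutā;'
assert join_segments(split_segments(text)) == text  # the round trip is lossless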

I’m leaning to the last one. @blake, @vimala, any thoughts?

Tangentially related, when I made the thig/thag po files with translations I started by making an intermediate segmented form like this:

<blockquote class="gatha">
<p><a class="sc" id="2"></a><a class="verse-num-sc" id="verse-num-sc1"></a><a class="verse-num-pts" id="verse-num-pts1"></a><a class="pts" id="pts123"></a><a class="ms" id="p_19Th2_2"></a><a class="msdiv" id="msdiv1"></a><span data-msgstr="Sleep happily Therī,">“Sukhaṃ supāhi therike,</span><br>
<span data-msgstr="Wrapped in the cloth you've made">Katvā coḷena pārutā;</span><br>
<span data-msgstr="Your desire is now at peace">Upasanto hi te rāgo,</span><br>
<span data-msgstr="Like dried vegetables in a pot">Sukkhaḍākaṃ va kumbhiyan”ti.</span></p>
</blockquote>

Advantages of 3:

  • Trivial for computers to process because you can simply use XML/HTML libraries
  • You can attach other data to segments

There is a significant amount of work involved in segmenting based on characters or on HTML void elements like <br>, because it can be done cleanly with neither an XML library nor regex/text processing. I just do it messily at the moment, but wrapping segments in an actual element is by far the cleanest approach for computers to deal with; extracting the segments then becomes trivial (see the sketch below).
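
To illustrate (this is just a sketch, not our actual build code, and the file name is made up), pulling the source/translation pairs out of markup like the blockquote above takes only a few lines with lxml:

import lxml.html

# Hypothetical file containing the segmented gatha markup shown above.
root = lxml.html.parse('thig1.1.html').getroot()

pairs = []
for span in root.iter('span'):
    msgstr = span.get('data-msgstr')
    if msgstr is not None:
        # text_content() returns the Pali inside the span, e.g. "Sukhaṃ supāhi therike,"
        pairs.append((span.text_content(), msgstr))

for pali, translation in pairs:
    print(pali, '→', translation)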

So you’d do it so that the HTML is transformed into the PO msgstr, rather than (as I was thinking) the markup living as a span in the PO text (yuk!). That would be fine; but could you then include extra data? Where would it actually live in the PO file? Unless, as you have it, it’s just the translated segment.

It still complicates the HTML, but maybe we could live with that.

I was thinking, though, that such spans (like our reference data) would be points (i.e. just <span id="xyz"></span>) rather than ranges, to avoid nesting problems.

If you want to do anything interesting with segments, you need to clearly and unambiguously identify them.

The first thing is that I think it’s essential that segments have an explicit id. Although this makes the text longer, it has robustness advantages: you can resegment part of a text without having a cascade of numbering changes throughout the rest of the text. It also means that when an XML processing library (i.e. a browser) loads the text, it can immediately start working with the ids instead of needing to do an additional slow, error-prone segmenting step.

The second thing is the choice of how to mark segments. Basically you can mark the start, the end, the boundary between two segments, or both the start and the end. If you use a span to wrap the segment, then you have marked the start and end in a way which any XML loader understands.

Standoff

The principle of standoff is good. I suggest that as much meta-content as possible be included separately and referenced by id.

For example a file which adds PTS page numbers could look like this:

"pts-1st-ed": {
  "dn1:1.1": "1",
  "dn1:2.1": "2",
  "dn1:5.1": "3",
  ...
}

or perhaps if you want to be more explicit:

"pts-1st-ed": {
   "1": ["dn1:1.1"],
   "2": ["dn1:1.2", "dn1:1.2", "dn1:1.3", "dn1:1.4" ],
   "3": ["dn1:1.5", "dn1:1.6", ...]
  ...
}

In principle almost anything could be attached to segments in this way, such as variant notes, commentary, or translation strings. Sub-segments could also potentially be defined, such as dn1:1.1:3, meaning “the 3rd word of the 1st segment of the 1st paragraph of dn1”.

So what I’m suggesting overall is to use HTML markup for the basic root content, and then standoff (based on ids) to attach extra non-essential content. This keeps the basic text in a form which web browsers (and other things which understand HTML) can easily work with, and the code for merging the root content with the standoff content can leverage functionality baked into HTML like getting elements by id.

Although this will involve wrapping segments in spans with an id, because it would allow removing most other markup with ids (such as paragraph numbers, notes, etc.) the code would not be significantly more ugly, and in many cases would be cleaner.
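
As a rough sketch of that merging step (the file names are made up, the JSON layout is the “explicit” PTS example above, and none of this is existing SC code), attaching standoff page numbers to the root HTML could look like this:

import json
import lxml.html
from lxml import etree

# Root content: segments wrapped in spans with explicit ids like "dn1:1.1".
root = lxml.html.parse('dn1.html').getroot()

# Standoff content: the "explicit" PTS mapping sketched above.
with open('dn1-pts.json') as f:
    standoff = json.load(f)

for page, segment_ids in standoff['pts-1st-ed'].items():
    # Drop a PTS page-number anchor in just before the first segment of that page.
    first_segment = root.get_element_by_id(segment_ids[0])
    anchor = etree.Element('a', {'class': 'pts', 'id': 'pts' + page})
    first_segment.addprevious(anchor)

print(lxml.html.tostring(root, pretty_print=True).decode())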


Great, that sounds perfect. I realize now that this is the way to go, for all kinds of reasons.

It also jibes with the standoff approach being developed by @yap, in that he anchors the standoff ranges to an explicit paragraph ID. We can do the same with the segment IDs.

The outcome of this is that the punctuation-based approach we’ve used so far is just an initial approximation. Once I’ve gone through and fixed all (!) the quirks in the current approach, it’ll be IDs all the way down.

I’m still hoping for an all-standoff approach in the long term, but we’re getting there! There are some problems we can’t solve with segments: variant readings and multiple editions, overlapping hierarchies, and so on. But it is certainly a big step forward.

Having variant readings that cross over segments would be an issue, although, as it happens, I don’t think there are any. If we come across them we’ll have to handle it case by case, I guess.

Actually all those things could be solved with segments, if we allow defining a word offset in a segment.

For example, maybe a variant note could be defined as applying to the 5th word of a certain segment. When rendering the variant note, the code could either attach it to the segment, or it could perform the more difficult task of identifying the 5th word and attaching it to that word specifically. (This obviously has some of the downsides of segmenting based on characters, but it is dealing with much shorter and simpler units, and if there is a bug in how it segments by character, that failure won’t cascade to the rest of the document. Also, a hypothetical alternative client using the data could decide to skip that difficult and error-prone step, just work at the segment level, and discard the word offsets.)
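
A naive sketch of that word-offset resolution (the function name and the whitespace-based notion of a “word” are just assumptions for illustration):

def wrap_nth_word(segment_text, n, note):
    # Attach the note to the nth word of the segment, or fall back to wrapping
    # the whole segment if the offset can't be resolved.
    words = segment_text.split()
    if not 1 <= n <= len(words):
        return '<span class="var" title="{}">{}</span>'.format(note, segment_text)
    words[n - 1] = '<span class="var" title="{}">{}</span>'.format(note, words[n - 1])
    # Note: rejoining with single spaces is lossy; a real version would preserve
    # the original spacing.
    return ' '.join(words)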

An overlapping block could be defined as pertaining to multiple segments, or as running from a word offset in one segment to a word offset in another segment. There are straightforward techniques for creating the markup required to represent that; for example, Discourse’s markup can correctly figure out what is intended by <b>foo <i>bar</b> baz</i> (foo bar baz) (edit: this works in the preview). It basically turns it into something like <b>foo <i>bar</i></b><i> baz</i>. Messy, but literally any level of overlap can be handled with enough slicing, dicing and wrapping.

Even an entire alternative edition could be defined as a set of operations which replace segments, remove segments, or insert segments (between existing segments); this would obviously be sensible only for closely related editions.
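
Purely as a sketch of the shape such data could take (the edition name, segment ids, and operation keys below are all invented, and the “…” placeholders stand in for segment text):

"edition-x": {
  "replace": { "dn1:1.2": "…" },
  "remove": ["dn1:1.6"],
  "insert": { "dn1:1.3": ["…", "…"] }
}

Here “insert” would mean “add these new segments after the named segment”.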

I keep using this word “could”; maybe not everything we could do is something we should do. But it is possible.

Surely character offsets are the way to go? All the standoff systems do this. Anyway, in Chinese there is no other option.

The real problem is the very idea of a variant note, which is an archaic, print-based notion. We should really keep different editions in layers, and offer a diff view.

In fact, as far as current variants go, there’s little practical advantage to doing anything else, really. There might be a few edge cases, but almost always simply displaying a variant as applying to a segment would be fine. I mean, the idea that a variant is restricted to such and such a word is itself just an approximation; in fact it’s usually just a character or some characters. Anyone reading it can usually see at a glance what is meant. Here’s a case I just worked on:

Seyyathāpi, bhikkhave, sāmuddikāya nāvāya vettabandhanabandhāya cha māsāni udake <span class="var" title="pariyātāya (bj, mr) | pariyenāya (pts1) | pariyāhatāya (?)" id="note53">pariyādāya</span> hemantikena thalaṃ ukkhittāya vātātapaparetāni bandhanāni tāni pāvussakena meghena abhippavuṭṭhāni appakasireneva paṭippassambhanti, pūtikāni bhavanti;

Here, and this is a typical case, the variants apply to just a few letters in the variant word. If we just attached the note to the segment, we’d lose a little precision, it’s true, but nothing serious. There would, of course, be other cases where it might make more difference, but they would, I think, be few.

Anyway, I guess it just depends on how hard the programming is. Maybe we could roll out a simple version with the notes attached to the segment, and see how it goes. If there’s any need, we could enhance it later.

So you keep the text clean, and the JSON clean, and handle the messiness in the programming. What about overlapping hierarchies as applied to multiple files? This is still a major problem, for example converting CBETA texts to SC. Or else managing radically different structures for the same text, as with the three versions of SA.

I mean, I’m not expecting to solve all these problems right away, just keeping the big picture in mind.

[quote=“sujato, post:7, topic:3102”]
So you keep the text clean, and the JSON clean, and handle the messiness in the programming. What about overlapping hierarchies as applied to multiple files? This is still a major problem, for example converting CBETA texts to SC. [/quote]

I don’t see that being a major problem. If we get used to working with segments, a file can be defined as a range of segments.

As above, if we’re working at a segment level, we can consider a file to be a range of segments. Normally very linear, but possibly rearranged, with some removed or added.
Of course if versions are so different as to have little in the way of shared segments it would be best to treat them as separate texts rather than trying to shoehorn them all into one mould.
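
As a sketch of what that could look like (the layout is invented, and the segment ids just reuse the dn1 scheme from above), a “file” might simply be declared in JSON as an ordered list of segment ids:

"dn1": ["dn1:1.1", "dn1:1.2", "dn1:1.3", ...]

A rearranged or abridged version of the same text would then just be a different list over the same pool of segments.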

Sure, but the problem here is that the files are different. Sometimes many CBETA texts are one SC text, sometimes (more often) the other way around. It’s not that the texts differ, but their organization. The way this happens is complex and varies in pretty much every case. This is a long-term problem, as it makes it pretty much impossible for us to keep our texts updated as the CBETA project advances.

To be clearer, what I’m thinking is something like this.

Imagine a perfect world, where daisies bloom all year round :sunflower:, children laugh and play :dancers:, and no-one uses Windows :penguin:.

In this perfect world, there’s a complete separation between text and markup. Text files are purely plain text. Markup lives in JSON, and all standoff apps work perfectly and reliably.

Now, the very idea of a “file” that corresponds with a “text” is, of course, just another convention, and one that does not reflect our sources. Different sources, while containing much the same text, have very different divisions of “texts” and hence “files”. We just impose these ideas for our own convenience.

So instead of defining a basic “text” or “file” as “what is most useful for display and organization purposes in most cases”, which is what we do today, we define it as:

The smallest clearly defined and unambiguous range as agreed by all sources.

Now, in the core cases this would in effect be a nikaya. There’s plenty of variation in the different editions about where to divide the suttas (and even vaggas) in SN and AN, but everyone agrees on the boundaries of the nikayas.

Okay, so our basic entity is no longer a “sutta” but a “nikaya”. (In the Vinaya, this would have to be “Ubhatovibhanga” and “Khandhakas”, each taken as a whole.)

This is kept in one plain text file. If we have multiple Pali editions, each one is a separate file. There’s no such thing as “variant readings”. There’s just readings.

So when it comes time to present it, we slice and dice. If we want to read any one edition, we can, and variations in other editions can be viewed as diffs of that. The numbering of suttas is no longer coded into file names, but, like everything else, in JSON. If one edition presents a certain range of text as one sutta, while another has it as two, so be it. If one edition counts a peyyala series as a hundred texts, while another counts it as two hundred, so be it.

If corrections are made to the source texts, they are automatically updated. If a new edition is added to the mix, it’s matched with existing sources purely on intelligent text matching, and markup can be applied automatically.

And, it goes without saying, all these things apply to translations just as to the basic texts.


I think I see what you mean. So essentially, we would want a way to segment the CBETA texts, and use those segments to build the texts we display.

CBETA texts -> bunch of segments -> Our suttas

So we no longer store Chinese texts at all; we just have code which fetches the CBETA texts, chops them up into segments, then reassembles those segments according to a skeletal structure defined in JSON.
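
Roughly like this (the file names, ids, and JSON layout are all hypothetical, just to show the shape of the pipeline):

import json

# Hypothetical skeleton: an SC sutta defined as an ordered list of segment ids,
# where the segments were produced by chopping up the fetched CBETA text.
with open('sa1-skeleton.json') as f:
    skeleton = json.load(f)   # e.g. {"sa1": ["t02n0099:1.1", "t02n0099:1.2", ...]}

with open('cbeta-segments.json') as f:
    segments = json.load(f)   # segment id -> segment text

def assemble(uid):
    # Reassemble a display text from its constituent segments.
    return ''.join(segments[seg_id] for seg_id in skeleton[uid])

print(assemble('sa1'))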

Not just CBETA, any text at all. Basically there is a common store of texts in plain text, and different applications use it in their own way. CBETA assembles the text into a replica of the Taisho edition. We assemble it into entities corresponding to Pali suttas. And so on.

This is, if I understand it correctly, how @yap is thinking these days.

In this way, we could finally make good on the promise of digital texts, to progressively correct and enrich them, without having to start from scratch every time. Doing a new digital edition of a manuscript would be solely a matter of typing and proofreading.

As to the exact manner in which this happens, again, that would be for each application to determine, according to what’s useful for them. For us, segmenting at the phrase level is useful, while for someone who simply wanted to, say, print texts for reading, there would be no need to do this.

Since the Taisho is addressable at the character level (e.g. T25p100a1203 means Taisho volume 25, page 100, column a, 12th line, 3rd character), and this notation is used in the CBETA XML as the paragraph id, we can define a text segment as a Taisho character range. A range can be stored in a compact form if its two ends share a common prefix:

  • t25p100a1203~15: to the 15th character (same page, same column, same line)
  • t25p100a1203~1301: to the 13th line, 1st character (same page, same column)
  • t25p100a1203~c0105: to column c, 1st line, 5th character (same page)
  • t25p100a1203~101b0206: to page 101, column b, 2nd line, 6th character (crossing pages)

In memory, a Taisho character pointer can be packed into a 30-bit number (volume 7 bits, page 11 bits, column 2 bits, line 5 bits, character 5 bits). If we don’t allow ranges that cross volumes (not likely to be needed), a range can be packed into 53 bits (JavaScript can represent integers exactly up to 53 bits).
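
As a sketch of that packing (written in Python here just to show the bit layout; the field order is one possible choice, not something already fixed):

def pack_pointer(vol, page, col, line, char):
    # 30 bits total: volume 7 | page 11 | column 2 | line 5 | character 5
    return (vol << 23) | (page << 12) | (col << 10) | (line << 5) | char

def pack_range(start, end):
    # 53 bits: the full 30-bit start pointer, plus the end pointer with its
    # 7-bit volume dropped, which is why cross-volume ranges can't be packed.
    assert (start >> 23) == (end >> 23), 'range must stay within one volume'
    return (start << 23) | (end & ((1 << 23) - 1))

# T25p100a1203: volume 25, page 100, column a (0), 12th line, 3rd character
start = pack_pointer(25, 100, 0, 12, 3)
end = pack_pointer(25, 100, 0, 12, 15)   # t25p100a1203~15
r = pack_range(start, end)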

The Taisho pointer is very space efficient: each paragraph is only a 64-bit value, the fastest unit a modern computer can process, and a sutta is just an array of numbers. If the text SC decides to use is slightly different from the Taisho, we can apply a diff patch after fetching the text by Taisho pointer; still very efficient.

The Taisho pointer is friendly for both computers and humans: it is durable, easy to locate and verify, and we can build higher-level constructs on top of it.

