Text-critical information styling

sujato · November 10, 2017, 1:38am

In the old SC, we introduced a method of displaying and clarifying text-critical information, mainly by using CSS styling. This must be adapted for the new site. @vimala, @blake, thoughts and opinions please!

Depending on how difficult it is, we can decide whether this is a requirement for the new site.

Background

In text studies, there are a variety of notations developed for handling the complex kinds of issues encountered when dealing with old manuscripts. These include such things as unclear letters, variants in different manuscripts, changes made by an editor, and so on. In print editions these are typically indicated with conventions like [square brackets], <angles>, and so on. Not only are these unsightly, but they are not very informative.

In addition, there are cases where one wants to indicate different parts of a text that are not specified in HTML alone, such as “term” and “gloss”.

Starting from the late 1980s, a loose consortium of academics has developed a specification called TEI (Text Encoding Initiative), which aims to handle such matters with markup. TEI is currently implemented as an XML specification, and many of our sister projects such as CBETA use it. However, TEI is complex, and we have not adopted it. These days, plain old HTML works just fine.

Still, where possible we adopt the naming and markup conventions of TEI.

The Job

Over time, our text-critical system has become somewhat over-complicated, and it’s not always implemented consistently. (This is, incidentally, a problem that has bedeviled TEI and prevented its wider adoption.) We should try to simplify it where possible and ensure it is correctly implemented on the new site. Because it is somewhat complex, we should not make it a requirement for the launch of the new site, but an enhancement.

Resources

Most of the relevant styles are in common.scss. This also gives explanations for the various classes.

Notes

I have not made a detailed study, but here are some corrections that are required.

There’s a confusion in the application of .add and .pe. .pe is used to indicate an abbreviation and to tell the reader where to read the full version. It is used in this way in en/ds. In en/vb and (I think) the Vinaya translations, .add is used for the same thing. We should use .pe for this, and reserve .add for text actually added.
Note the markup and CSS for choice/corr/sic. This applies a CSS popup in the old version. We should probably make this consistent with our “var” markup.
The markup for .lem and .rdg duplicates the normal .var. In fact .lem and .rdg are the standard terms in TEI for “main” reading and “variant” reading. We should replace all instances of lem/rdg with .var.

Styles

Here is a rough guide to the style transformation, mostly dealing with the colors. It also gives an example of a text where the markup is used.

class	old display	found in	new display
.term, .gloss	color: pastel-color(dirty-green)	san-lo-bi-vb-np11	accent-color
.supplied	color:#B3A72D	san-lo-bi-vb-np11	primary-color
.supplied2	color:#9B8D00	san-lo-bi-vb-np11	primary-color (different shade)
.suttainfo, .xu, .w	color: misc-color(dark-medium-gray); background-color: misc-color(off-white)	/da1/lzh	secondary-text-color;tertiary-background-color
.cr	color:#4805A2	en/vb8	default link style
.add	color:#767676;@include sans-serif;	en/vb8	secondary-text-color
various classes	#767676		secondary-text-color
.pe	font-style:italic; color:#9B8D00;	en/ds2.1.1	secondary-text-color, italic
.metre	color:#9B8D00	pra/pdhp	accent-color
.expanded	color: misc-color(light-medium-gray)	dn10/rhys-davids	secondary-text-color
.sic	color: pastel-color(bright-red);	san-lo-bi-vb-np11	accent-color
.del	color:#EE1700	gdhp	text-decoration:line-through
.t-gaiji	color:darkcyan	lzh-mu-bi-pm	accent-color
.var		/pli/kv17.1	secondary-accent-color
.scribe	italic	gdhp
.gap	color:#B2AF8C	g2dhp	secondary-text-color
.unclear	color:#767676	g2dhp	secondary-text-color
.del-scribe	none?	g3dhp	same as "del", but with title clarifying.

Definitions

Here are working definitions of these classes, taken where possible from TEI.

class	definition
.term	contains a single-word, multi-word, or symbolic designation which is regarded as a technical term
.gloss	identifies a phrase or word used to provide a gloss or definition for some other word or phrase
.supplied	signifies text supplied by the transcriber or editor for any reason; for example because the original cannot be read due to physical damage, or because of an obvious omission by the author or scribe.
.cr	(cross-reference phrase) contains a phrase, sentence, or icon referring the reader to some other location in this or another text. The TEI term is actually "xr". Should we adopt this?
.add	contains letters, words, or phrases inserted in the source text by an author, scribe, or a previous annotator or corrector.
.pe	Used frequently in the Pali Abhidhamma to indicate a span of text describing how an abbreviation is meant to be filled in. Note that it shouldn't be used to indicate a mere abbreviation with ellipsis.
.metre	Represents the metrical form of a line of verse. We don't follow TEI usage
.expanded	Text that is abbreviated in the original but expanded in translation. Used solely for old DN translations that have been filled in. Not TEI.
.sic	contains text reproduced although apparently incorrect or inaccurate
.del	contains a letter, word, or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, or a previous annotator or corrector
.t-gaiji	Indicates Chinese characters not in the Unicode set as used in our CBETA texts.
.var	indicates a variant reading. Similar to TEI rdg, except that refers only to a single reading, whereas for us var is used a little more loosely.
.scribe	a remark by or about a translator or scribe that is found in the original text, usually recording their work or commenting on it in some way. (see discussion below)
.gap	indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, invisible, or inaudible
.unclear	contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source.
.del-scribe	the same as del, except deleted by scribe. However, this is included under the TEI definition for del.

Notes on usage

.scribe

Scribe is intended to note a remark by or about a translator or scribe that is found in the original text, usually recording their work or commenting on it in some way.

It is encountered occasionally in the Chinese texts. In Sanskrit (san-mg-bu-pm), for example, we find śākyabhikṣu śrīvijayabhadralikhitamidam “This was written by the Buddhist monk Śrīvijayabhadra”.

In san-mu-kd6, however, it is used incorrectly, eg.:

MS adds jānantaḥ pṛcchanti ajānanto na pṛcchanti |
Manuscript adds “knowing, he asks, not knowing he does not ask”.

What’s happening here is a well-known embarrassment in later Buddhism, where they felt the need to explain away the evident fact that the Buddha had to ask questions about things, whereas we all know he’s really omniscient and never really needs to ask about anything. The Pali commentaries make similar remarks, but here the remark has intruded on the text. The modern editor (Wille) identifies it as an alien intrusion on the text.

Let’s look at the original source for this at:

http://gretil.sub.uni-goettingen.de/gretil/1_sanskr/4_rellit/buddh/vinv_06u.htm

There are a number of problems here:

Text marked as scribe is indicated in the source with {} not ⟪ ⟫.
However, {} does not mean “deleted text”.
Text marked as supplied is not in fact supplied text, but is text attested in the manuscript. Supplied is only for text missing in the manuscript.
{} is used in a variety of ways:
- Indicating apparent mistakes in text (= sic)
- Describing mistakes in previous editions (“the following first two lines of fol. 94r have not been transliterated by Dutt”)
- Random textual notes (“Tib. inserts the whole Vairambhasūtra; cf. AN II 54-57”)

So it would be best to mark these as add, not scribe, and replace supplied with .

In san-mu-kd7 and san-mu-kd8 scribe is used yet another way, to mark up numbers in the text.

http://gretil.sub.uni-goettingen.de/gretil/1_sanskr/4_rellit/buddh/vinv07_u.htm

Since these are simple editorial additions, just use add.

Things that are missing

Apart from the classes mentioned here, there are a number of classes used in the Vinaya. They need to be assessed as well.

In addition, there are various classes used to indicate the end of texts, uddanas, and so on.

Method

At the moment, the implementation of these things is haphazard, as many of them were introduced little by little over time. Also, we are learning as we go, so sometimes old and new approaches are mixed up. We can take this chance to clean things up and make them nice and consistent.

I propose that we handle all this non-standard markup in a single web component. Ooh, can we call it sc-hypercritical.html? That would be cool.

So basically standard HTML stuff is handled in one place: paragraphs, headings, lists, quotes, tables. Then all the stuff that’s specifically pertaining to text-critical markup is in one component, keeping all the JS and CSS tidy.

Things to consider

Should we go all-in on the custom elements approach, and replace eg. <span class="add"> with <sc-add>? It would be more work, as we would have to consider things like the interaction of block and inline elements. But it would create a cleaner and more meaningful markup.
Speaking of, we should determine which of these is used for block-level elements, which for inline only, and which for both.
We should also determine the scope of each class, so that we don’t end up, eg., using the same styling to indicate two different things in the same text. Generally speaking, though, it would be best for each class of thing to have a unique style, so far as this is reasonable.

Vimala · November 10, 2017, 4:14am

Promise to have a look tomorrow. I seem to have a meeting at 4 am so I’d better get some sleep now

sujato · November 10, 2017, 7:16am

Yes, you had better do that!

Vimala · November 10, 2017, 2:00pm

This is the whole list I’ve got for this and the span tags I’ve been using:

uncertain reading → unclear
(* ) editorial restoration of lost text → supplied
⟨* ⟩ editorial addition of omitted text → add
⟪ ⟫ scribal insertion → scribe
{ } editorial deletion of redundant text → del
{{ }} scribal deletion → del-scribe
. lost part of an akṣara → gap
? illegible akṣara → gap
+ lost akṣara → gap
/// textual loss at left or right edge of support → gap

This has not yet been imported in next-sc. I noticed it the other day because title-attributes obviously use the browser’s own markup rather than ours. But more than the styles, there is also the JS for some of these things which has not yet been imported in next-sc. (and I just noticed that that has broken on SC when the pali-lookup is activated but never mind, won’t happen in next-sc).
For instance, the code bhagavā in the html translates to:

bhagavā<table class="deets"><tbody><tr><td>bhagavāti </td><td><ul><li>Chulachomklao Pāḷi Tipiṭaka 2436 (1893)-3</li></ul></td></tr></tbody></table>

So it’s not just a simple css issue, but there is a lot of js in the background too. For most of the others, the js adds a title-tag explaining what this is.
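As a rough illustration of the title-adding step, here is a minimal Python sketch. It is not the site's JS: it does the same job as a preprocessing pass over the HTML, which is an assumption on my part. The function name `add_titles` and the `TITLES` table are hypothetical; the wording of each title is taken from the list of sigla above, except for `gap`, where the four cases are summarized in one phrase.

```python
import re

# Tooltip text per class, taken from the list of sigla above.
# The wording for "gap" is a summary of four cases and is an assumption.
TITLES = {
    "unclear": "uncertain reading",
    "supplied": "editorial restoration of lost text",
    "add": "editorial addition of omitted text",
    "scribe": "scribal insertion",
    "del": "editorial deletion of redundant text",
    "del-scribe": "scribal deletion",
    "gap": "lost or illegible text",
}

# Matches only spans whose sole attribute is class,
# so spans that already carry a title are left alone.
BARE_SPAN = re.compile(r'<span class="([^"]+)">')


def add_titles(html_text):
    """Add an explanatory title attribute to known text-critical spans."""

    def replace(match):
        cls = match.group(1)
        title = TITLES.get(cls)
        if title is None:
            # Unknown class: return the tag unchanged.
            return match.group(0)
        return f'<span class="{cls}" title="{title}">'

    return BARE_SPAN.sub(replace, html_text)
```

Spans with more than one class, or with other attributes, are deliberately skipped here; they would need a real HTML parser rather than a regex.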

How to go about correcting this? Any ideas?

I noticed this just applies a color, but does not add a title tag like with most others so the user never knows why the color is different.

No it doesn’t. It only duplicated the css, not the js.

On a separate note, if this is an enhancement, I could do that when the guys are finished with the main bulk of the site. The JS for such things is pretty easy, but it might be a little tricky with the shadow-dom now (old site does not use shadow-dom). Will see.

sujato · November 11, 2017, 12:07am

So it looks like I missed out unclear, scribe, and gap. Obviously we need a better way of organizing this stuff!

Right.

I think we’ll have to do it by hand, check each volume and see what’s used. If it’s consistent in the volume it won’t be hard to fix, but if different uses of “add” are found it’ll be tricky. Still, regex should help us out.

I’m thinking that we should rejig this markup so it is like “var”. Basically it does the same thing: shows one reading and supplies another as alternative.

The difference is the semantics. In var, the alternate is actually found in the manuscript, in corr it is corrected by the editor. So we can do the same thing, color the text and show the alternate in a tooltip, but indicate in the tooltip what is going on. Probably best to show the corrected text by default, and in the tooltip: “Corrected by editor from ‘xyz’”.

The use of the choice/sic/corr structure is inherited from TEI and it’s not necessary to keep it.

I meant that it duplicates the meaning: they both show variant readings. There’s no need for two separate markup conventions to indicate the same thing.

That sounds like a good idea. Anyway, it looks like the main text pages will be done soon!

Vimala · November 11, 2017, 12:21am

Guess that is my job for next week then after Hubert is done.

Vimala · November 13, 2017, 12:30am

Here’s a file with a python dump of all the various span classes we have on the site, just in case we are forgetting one. Maybe you can go over them and see if there are any that can be removed?

spanclasses.txt.zip (515 Bytes)

This is an interesting one also, in avs10.html: itvā ca

Looks like the roman numerals also need some cleaning up. There are about 4 different classes for that.

sujato · November 13, 2017, 2:09am

Excellent, thanks so much. I’ll have a look at these this morning.

There’s a few issues I’d like to discuss with yourself and @blake, this being one of them, so maybe we can talk after the scrum this evening. Or tomorrow, you know what I mean!

sujato · November 13, 2017, 2:54am

I was just starting work on this, when I realized we need more. Not all the relevant classes are applied to spans: some are on <div> and some are on regular text markup like <p> or <hX>. Can you see about extracting these into the list as well? And we should retain the markup context, so we know what HTML the classes are applied to. Something like:

supplied; span
add; span, p, h1, h2
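A list in that shape could be produced along these lines. This is only a sketch using Python's standard `html.parser`; the names `ClassContextCollector`, `class_contexts`, `format_report`, and the `skip_tags` option are all hypothetical, and it has not been run over the actual text files.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ClassContextCollector(HTMLParser):
    """Record, for every class name, the set of tags it appears on."""

    def __init__(self, skip_tags=()):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.contexts = defaultdict(set)

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            return
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                for cls in value.split():
                    self.contexts[cls].add(tag)


def class_contexts(html_text, skip_tags=()):
    """Return {class name: sorted list of tags it is applied to}."""
    collector = ClassContextCollector(skip_tags)
    collector.feed(html_text)
    collector.close()
    return {cls: sorted(tags) for cls, tags in sorted(collector.contexts.items())}


def format_report(contexts):
    """Format as 'class; tag, tag' lines, one class per line."""
    return "\n".join(f"{cls}; {', '.join(tags)}" for cls, tags in contexts.items())
```

Running `class_contexts` over each file and merging the results would give one report for the whole collection; `skip_tags` is there in case some elements should be left out of the survey.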

Vimala · November 13, 2017, 3:25am

Sure, no probs. Will do 2morrow after scrum.

Vimala · November 13, 2017, 2:35pm

I talked to Hubert and he made a very good point: webcomponent-markup cannot be imported into innerHTML. So changing the classes to wc-markup is not going to work. So for now we will keep this in regular html-markup and I’ll make the stylesheets and javascript for them. If we have time left over in Feb or so, we can see if we can find a way around it but it looks like html-markup is the best way forward for now.

I’ve attached another 2 files. One is a run over the old html files and one over nextdata. The differences are those things that are in the pali files of the first 4 nikayas, which were removed from nextdata because they moved to pootle. I will go over this again in detail and remove obvious errors.
classes.zip (2.6 KB)

I will make a separate file for the a-tags because that would be helpful for Hubert’s paragraph stuff as well.

What does this have to become? I take it that an empty class is not what you had intended here (you made this file)?

With the a-tags I’ve come across something peculiar:

<a class="bjt3"></a> text ......
<a class="bjt4"></a> text ......

Every paragraph has a separate class in the Vietnamese translations like cp1.html. You also made those. Is this a mistake and should it be something like this:

<a class="bjt" id="bjt3"></a> text ......
<a class="bjt" id="bjt4"></a> text ......
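If that is the intended form, the change could be made with a regex along these lines. This is a minimal sketch: the function name `split_numbered_class` and the `prefixes` argument are hypothetical, and it only touches empty a-tags whose sole attribute is the numbered class, so tags that already have an id are left alone.

```python
import re


def split_numbered_class(html_text, prefixes=("bjt",)):
    """Turn <a class="bjt3"></a> into <a class="bjt" id="bjt3"></a>.

    Only empty a-tags with a single class attribute are rewritten,
    and only for the given class prefixes.
    """
    alternatives = "|".join(re.escape(prefix) for prefix in prefixes)
    pattern = re.compile(rf'<a class="({alternatives})(\d+)"></a>')
    # \1 is the prefix, \2 the number; the id keeps both so it stays unique.
    return pattern.sub(r'<a class="\1" id="\1\2"></a>', html_text)
```

Listing the prefixes explicitly is a cautious choice: it avoids rewriting classes that legitimately end in a number.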

Same with several pts-vp-pi numbers on a tags like in english ja527.html.

Here is the list for a-tags only: aclasses.txt.zip (11.2 KB)

I want to sort some of these problems out before I go further and upload a new updated list.

Vimala · November 13, 2017, 11:04pm

And an updated file for all the classes after removal of the above mentioned errors and making them into id instead of classes:

classes.txt.zip (2.1 KB)

Could you please go over these and see where there are more errors. I have questions about some of them. Thanks

sujato · November 14, 2017, 1:31am

This has all of the v/p ref spans in it: I thought we weren’t including these?

Vimala · November 14, 2017, 1:54am

Sorry, I’m not with you.

sujato · November 14, 2017, 1:56am

The volume page references, empty <a> tags like we discussed, hos91 and so on. They should not be in the list.

Vimala · November 14, 2017, 2:14am

These are defined as paragraph numbers as such:

<a class="hos91" id="8"></a> atha pīṭhasya parivrājakasya etad abhavad iti me śramaṇo gautamaś cetasā cittam ājānāti. … yena bhagavāṃs tenāñjaliṃ praṇamya bhagavantam idam avocat.
<a class="hos91" id="9"></a> labheyāhaṃ bho gautama svākhyāte dharmavinaye pravrajyām upasaṃpadaṃ bhikṣubhāvam. labdhavāṃ te svākhyāte dharmavinaye pravrajyām upasaṃpadaṃ bhikṣubhāvam. evaṃ pravrajitaḥ sa pīṭhaḥ parivrājakaḥ yasyārthe kulaputrāḥ keśaśmaśrūṇy avatārya kāṣāyāṇi vastrāṇy ācchādya samyag eva śraddhayāgārād anagārikāṃ pravrajanti. pūrvavad yāvad arhan babhūva suvimuktacittaḥ. pīṭha iti sūtraṃ ||

sujato · November 14, 2017, 2:16am

yes, they should not be in the list.

Vimala · November 14, 2017, 2:20am

So you mean you want to have a separate list for the a-tags (to check for errors in that) and for the rest. Sure, will do tomorrow.

sujato · November 14, 2017, 2:22am

No, I mean the job we are doing has nothing to do with the a tags. We are checking the semantic markup, not the reference numbers.

sujato · November 17, 2017, 8:31am

Here I’ll respond to various points raised on Jira. It’ll take me a little while to check them all, though, so this is a work in progress.

you only want to do certain classes and not all of them?

The text span classes, yes. Not the reference classes.

The point is that the text span classes need to be handled properly in Next, otherwise we cannot finish the text display task. The reference numbers are already working. If there are any mistakes, they can be fixed, but this is not blocking the progress on the site, so we should not put time into it. From now on, we need to focus on those things that are slowing or blocking implementation of features on Next.

I found and corrected something called class=uncertain in the list. I concluded that that should have been class=unclear

Yes, this seems to be correct.

I’ve created a new branch where I’m changing colors/styling according to new specs.

Okay, good.

supplied2 color:#9B8D00 san-lo-bi-vb-np11 primary-color (different shade) What does different shade mean?

We have a number of shades of the primary color in sc-colors.html. Try using --sc-primary-color-dark.

.cr The ones that are “a” will have the normal link style. How about the span ones? Should they be replaced?

If they are links, yes. If not, hmm, not sure.

what is “various classes”

Don’t worry, just whatever you find.

.var not in sc-text-styles

Huh.

.scribe: now italic and right-align. Should this change?

Right align sounds wrong; keep it just as italic.

.del-scribe: same as “del”, but with title clarifying. What does “with title clarifying” mean? Do you just mean with a different JS title?

That’s right, yes.

title="Deleted by scribe"

Choice/corr are a bit more complicated right now than sic.

The principle behind this is that there is a “choice” of readings; either the (apparently incorrect) reading found in the manuscript (marked sic) or the (apparently correct) reading as determined by the editor (marked corr). It’s called “choice” because there’s no right or wrong in which of these you want to display: either way can be fine so it’s up to the final publisher (us!) to choose.

The way it is presented now is derived from TEI, but it should be simplified. I haven’t reviewed the implementations we have currently, but in any case, let’s put all occurrences in the same form as described here.

Let us, as a matter of policy, display the corr reading by default. This assumes that the editor is competent and has correctly identified mistakes in the original. Since the editors of our texts are experts in the field, this is a safe assumption! So we then display the sic reading as an alternate. This should work similarly to the var readings, except the text in the popup is different.

Given this, we should simplify the number of classes used, and since the corr reading is the one displayed by default, use that. We should completely eliminate sic and choice.

For example:

Bill and Ben the flower pot men.

Currently we have code like atyunnamayya abhyunnamayya
To make it consistent with .var notation it would have to become:
atyunnamayya
Please correct me if I’m wrong.

That’s exactly right.
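For the record, the conversion discussed here might be scripted roughly as follows. This is a sketch under assumptions: the nested choice/sic/corr span structure in the pattern is a guess at the current markup, the function name `simplify_choice` is hypothetical, and the sample words in the usage note are made up. The tooltip wording follows the “Corrected by editor from ‘xyz’” suggestion earlier in the thread.

```python
import re

# Assumed current markup: a "choice" span wrapping a "sic" span and a "corr" span.
CHOICE = re.compile(
    r'<span class="choice">\s*'
    r'<span class="sic">(?P<sic>[^<]*)</span>\s*'
    r'<span class="corr">(?P<corr>[^<]*)</span>\s*'
    r'</span>'
)


def simplify_choice(html_text):
    """Collapse choice/sic/corr into one corr span with the sic reading in the title."""

    def replace(match):
        # The sic text goes into an attribute, so double quotes must be escaped.
        sic = match.group("sic").replace('"', "&quot;")
        title = f"Corrected by editor from ‘{sic}’"
        return f'<span class="corr" title="{title}">{match.group("corr")}</span>'

    return CHOICE.sub(replace, html_text)
```

With this, a hypothetical `<span class="choice"><span class="sic">Bil</span><span class="corr">Bill</span></span>` would become a single corr span showing “Bill”, with the manuscript reading kept in the tooltip. Readings that contain nested markup are not matched and would need handling by hand.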