Sanskrit texts markup

Vimala · September 8, 2016, 7:04pm

Please have a look at https://suttacentral.net/skt/lal5/14.433-

Can we put some (soft) hyphens in there somewhere?

Also, please refresh my memory on span classes: what to do with emendations? Would that be “add” or “supplied” or something entirely different?

What about the asterisk (*)? In gandari.org asterisks mean either supplied or added text, but they seem to mean something different here: http://gretil.sub.uni-goettingen.de/gretil/1_sanskr/4_rellit/buddh/dhrmsk_u.htm
In Lalitavistara you have changed them to plus + with span class gap.

sujato · September 8, 2016, 11:13pm

I thought the skt texts were soft-hyphened already, but yes, this needs to be done. @blake, can we apply the same formula as the pali?

“add” is for things that are not part of the text, but have been added by an editor for clarification, such as a note or explanation.
“supplied” is for things that in fact do belong to the text, but have disappeared from the manuscript and have been reconstructed by the editor.
“sic” is for an apparently incorrect reading as determined by the author.
“corr” is for a corrected reading supplied by the editor. (See the files for how these are used.)

So emendation is indicated by “corr”. The basic definitions of these are in common.scss. You can also check the TEI documentation (eg for corr) since we use the same terms in the same meaning.

I have no idea what the asterisks in this text mean. In fact, they don’t seem to mean anything, so far as I can tell. Just leave them for now. I’ll see if I can find an explanation, but GRETIL is pretty obscure when it comes to such things.

Vimala · September 8, 2016, 11:51pm

Done.
I tried to read through GRETIL to find out about the asterisk, but what I found does not really make sense. One explanation was that it means a dot below the previous letter, but that did not seem to cover it entirely.
I have not deleted the asterisks for now, just hidden them.

sujato · September 9, 2016, 12:05am

That’s a good idea.

Meanwhile, I noticed that GRETIL has pretty much completed uploading the PTS Pali edition. If we ever move to a proper standoff system, we might consider adding that as a variant text.

I can see the Dharmaskandha, but it’s not showing up in the main menu for some reason.

I wonder whether it might be a good idea to change the ID for this from dk to dhsk. That would keep our ID in line with GRETIL’s. What do you think?

Vimala · September 9, 2016, 12:07am

Open a new incognito window and try again to see the main menu. The rest will follow after some time.

I’m all for simple IDs … but fine with me if you want this.

sujato · September 9, 2016, 12:10am

Cool, you’re right.

If it’s not too much work, let’s go ahead and make the change. The Buddha praised non-proliferation, and I’m sure he would apply that to abbreviation conventions, too!

Vimala · September 9, 2016, 12:12am

Maybe tomorrow …

sujato · September 9, 2016, 12:14am

But what if you die overnight? With your last breath, you’ll be thinking, “If only I’d changed those IDs for the Dharmaskandha …”

Vimala · September 9, 2016, 12:15am

more likely “If only I managed to get that Pali lookup to remain sticky” or “If only I got that hyphenation-module to work for me”

Vimala · September 9, 2016, 3:52pm

I didn’t die, so I had some time this morning to change this.

Had a look at the hyphenation and did something with the Lalitavistaraḥ, but I am not happy with it yet. Maybe the segments need to be a bit bigger. I am also not entirely sure about the correct breaks. It needs to be fine-tuned a bit.

Vimala · September 10, 2016, 4:40pm

I’ve done the hyphenation for the Lalitavistara. Before I do the rest, @Sujato, can you have a look at it?

sujato · September 11, 2016, 12:15am

May I ask what the formula is for doing the hyphenations? Hyphenation is hard, and even in English there is no universally accepted way of doing it. In traditional manuscripts, they had no concept of correct break points; they just broke the word at the end of the line regardless.

I’ve checked Lal 1 and mostly it’s good. Here’s part of the text, with soft-hyphens replaced by asterisks:

ekānte sthitāśca te śuddhāvāsakāyikā devaputrā bhagavantametadavocan—asti bhagavan lalitavistaro nāma dharmaparyāyaḥ sūtrānto mahāvaipulyanicayo bodhisattvakuśalamūlasamudbhāvanaḥ tuṣitavarabhavanavikiraṇasaṃcintyāvakrama**ṇavikrīḍanagarbhasthānaviśeṣasaṃdarśano ’bhijātajanmabhūmiprabhāvasaṃdarśanaḥ sarvabālacaryāguṇaviśeṣasamatikramasarvalaukikaśilpasthānakarmasthānalipisaṃkhyāmudrā—gaṇanāsidhanukalāpayuddhasālambhasarvasattvaprativiśiṣṭasaṃdarśanāntaḥpuraviṣayopabhogasaṃdarśanaḥ sarvabodhisattvacariniṣpandaniṣpattiphalādhigamaparikīrtano bodhisattvavikrīḍitaḥ sarvamāramaṇḍalavidhvaṃsanaḥ tathāgatabalavaiśāradyāṣṭādaśāveṇikasamuccayo ’pramā**ṇabuddhadharma*nirdeśaḥ pūrvakairapi tathāgatairbhāṣitapūrvaḥ

Now, most of these are fine, the soft hyphens are inserted only in long words, and generally they follow sane syllable breaks. A couple of issues:

Occasionally, e.g. in rama**ṇavikrīḍana, we have two soft hyphens. In both cases here (and in Lal 2), these occur in almost identical words; I’m not sure if that’s significant.
There seems to be an issue with breaking “pr”, e.g. in pramā**ṇabuddha*dharma. In this case, and others I noted in the text, there should be no break here. Maybe just drop all hyphens in pr. Other problem conjuncts are ks (including kṣ and kś), nj or ñj, and rṇ. All these should stay together.

Here’s a case from Lal 2:

samyakp*rahā*ṇaṛddhi*pādend*riya*bala*bodhya*ṅ*gamār

would ideally be:

samyak*prahāṇa*ṛddhi*pād*endriya*bala*bodhyaṅ*gamār

Not sure what can be done here.
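For anyone who wants to review the break points in the same way as the samples above, here is a minimal Python sketch that makes soft hyphens visible. It assumes the texts store soft hyphens as the Unicode character U+00AD; the function name `show_soft_hyphens` is made up for this illustration and is not part of any existing module.

```python
SOFT_HYPHEN = "\u00ad"  # invisible unless the renderer breaks the line here


def show_soft_hyphens(text: str, marker: str = "*") -> str:
    """Replace each soft hyphen with a visible marker for proofreading."""
    return text.replace(SOFT_HYPHEN, marker)


# Hypothetical sample: one soft hyphen between the two members of a compound.
sample = "bodhi" + SOFT_HYPHEN + "sattva"
print(show_soft_hyphens(sample))
```

Two markers in a row in the output point to a doubled soft hyphen in the source.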

Vimala · September 11, 2016, 4:08am

It’s Blake’s python module for hyphenation, with a few changes to incorporate Sanskrit differences. But I had to make some educated guesses that might not always have been correct.
The double hyphens should have been taken out automatically. Obviously they were not, so I have fixed that now. I also changed the p-r issue and built in a function to make sure that all the other conjuncts you mention stay together.
How about hyphens in between tv, tk and sv?

The rest takes a bit more study so will do that tomorrow.
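The clean-up described in this exchange could be sketched roughly as follows. This is not the actual hyphenation module, only an illustration under stated assumptions: soft hyphens are stored as U+00AD, the conjunct list is limited to the ones named in the thread (pr, ks, kṣ, kś, nj, ñj, rṇ), and `clean_soft_hyphens` is a hypothetical name. It only removes unwanted breaks; it does not try to move a break to a better position.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"

# Conjuncts named in the thread that should stay together, as (left, right)
# pairs. Normalised to NFC so they match precomposed letters such as "ṇ".
PROTECTED_CONJUNCTS = [
    (unicodedata.normalize("NFC", left), unicodedata.normalize("NFC", right))
    for left, right in [
        ("p", "r"), ("k", "s"), ("k", "ṣ"), ("k", "ś"),
        ("n", "j"), ("ñ", "j"), ("r", "ṇ"),
    ]
]


def clean_soft_hyphens(text: str) -> str:
    """Collapse doubled soft hyphens and drop breaks inside protected conjuncts."""
    # Normalise first so letter-plus-combining-mark sequences compare equal
    # to their precomposed forms.
    text = unicodedata.normalize("NFC", text)
    # Step 1: any run of two or more soft hyphens becomes a single one.
    text = re.sub(SOFT_HYPHEN + "{2,}", SOFT_HYPHEN, text)
    # Step 2: remove a soft hyphen that falls between the two halves of a
    # protected conjunct, e.g. "p" + soft hyphen + "r" becomes "pr".
    for left, right in PROTECTED_CONJUNCTS:
        text = text.replace(left + SOFT_HYPHEN + right, left + right)
    return text
```

Further pairs (for example tv, tk or sv, if it turns out they should not be split) could be added to the list in the same way.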

Vimala · September 11, 2016, 7:02pm

I’ve updated it. Please have a look but I would like to know a little more about various word-patterns that need a break (or not).

sujato · September 11, 2016, 10:54pm

The latest version looks good. I can’t see any problems with it.

I’m not sure if I can help here; but if you have any specific questions, try me. There is a LaTeX hyphenation pattern for Sanskrit, perhaps that might be useful.

Vimala · September 25, 2016, 7:21pm

I’ve now done all Sutra and Abhidharma texts too. Not yet Vinaya.
I’ve changed things a little based on the LaTeX hyphenation pattern, as you suggested.

Vimala · September 26, 2016, 7:04pm

Vinaya done as well, but only where needed. Most files already had normal hyphens in them.

sujato · September 26, 2016, 10:17pm

Thanks so much for this. I’ll keep my eye on the Skt texts as I use them; hopefully we won’t have any more problems.