Sanskrit texts markup

@sujato:

Please have a look at https://suttacentral.net/skt/lal5/14.433-

Can we put some (soft) hyphens in there somewhere?

Also, please refresh my memory on span classes: what to do with emendations? Would that be “add” or “supplied” or something entirely different?

What about the asterisk (*)? On gandari.org it means either supplied or added text, but it seems to mean something different here: http://gretil.sub.uni-goettingen.de/gretil/1_sanskr/4_rellit/buddh/dhrmsk_u.htm
In Lalitavistara you have changed them to plus + with span class gap.

I thought the skt texts were soft-hyphened already, but yes, this needs to be done. @blake, can we apply the same formula as the pali?

  • “add” is for things that are not part of the text, but have been added by an editor for clarification, such as a note or explanation.
  • “supplied” is for things that in fact do belong to the text, but have disappeared from the manuscript and have been reconstructed by the editor.
  • “sic” is for an apparently incorrect reading as determined by the author.
  • “corr” is for a corrected reading supplied by the editor. (See the files for how these are used.)

So emendation is indicated by “corr”. The basic definitions of these are in common.scss. You can also check the TEI documentation (eg for corr) since we use the same terms in the same meaning.

I have no idea what the asterisks in this text mean. In fact, they don’t seem to mean anything, so far as I can tell. Just leave them for now. I’ll see if I can find an explanation, but GRETIL is pretty obscure when it comes to such things.

Done.
I tried to read through GRETIL to find out about the asterisk, but what I found does not really make sense. One explanation was that it means a dot below the previous letter, but that did not seem to cover every case.
I have not deleted the asterisks for now, just hidden them.

That’s a good idea.

Meanwhile, I noticed that GRETIL has pretty much completed uploading the PTS Pali edition. If we ever move to a proper standoff system, we might consider adding that as a variant text.

I can see the Dharmaskandha, but it’s not showing up in the main menu for some reason.

I wonder whether it might be a good idea to change the ID for this from dk to dhsk. That would keep our ID in line with GRETIL’s. What do you think?

Open a new incognito window and try again to see the main menu. The rest will follow after some time.

I’m all for simple IDs … but fine with me if you want this.

Cool, you’re right.

If it’s not too much work, let’s go ahead and make the change. The Buddha praised non-proliferation, and I’m sure he would apply that to abbreviation conventions, too!

Maybe tomorrow …

But what if you die overnight? With your last breath, you’ll be thinking, “If only I’d changed those IDs for the Dharmaskandha …”

more likely “If only I managed to get that Pali lookup to remain sticky” or “If only I got that hyphenation-module to work for me”

I didn’t die, so I had some time this morning to change this :)

I had a look at the hyphenation and did something with the Lalitavistaraḥ, but am not happy with it yet. Maybe the segments need to be a bit bigger. I am also not entirely sure about the correct breaks. It needs to be fine-tuned a bit.

I’ve done the hyphenation for the Lalitavistara. Before I do the rest, @Sujato, can you have a look at it?

May I ask what is the formula for doing the hyphenations? Hyphenation is hard, and even in English there is no universally accepted way of doing it. In traditional manuscripts, they had no concept of correct break points, they just broke the word at the end of the line regardless.

I’ve checked Lal 1 and mostly it’s good. Here’s part of the text, with soft-hyphens replaced by asterisks:

ekānte sthitāśca te śuddhāvāsakāyikā devaputrā bhagavantametadavocan—asti bhagavan lalitavistaro nāma dharmaparyāyaḥ sūtrānto mahāvaipulyanicayo bodhisattvakuśalamūlasamudbhāvanaḥ tuṣitavarabhavanavikiraṇasaṃcintyāvakrama**ṇavikrīḍanagarbhasthānaviśeṣasaṃdarśano ’bhijātajanmabhūmiprabhāvasaṃdarśanaḥ sarvabālacaryāguṇaviśeṣasamatikramasarvalaukikaśilpasthānakarmasthānalipisaṃkhyāmudrā—gaṇanāsidhanukalāpayuddhasālambhasarvasattvaprativiśiṣṭasaṃdarśanāntaḥpuraviṣayopabhogasaṃdarśanaḥ sarvabodhisattvacariniṣpandaniṣpattiphalādhigamaparikīrtano bodhisattvavikrīḍitaḥ sarvamāramaṇḍalavidhvaṃsanaḥ tathāgatabalavaiśāradyāṣṭādaśāveṇikasamuccayo ’pramā**ṇabuddhadharma*nirdeśaḥ pūrvakairapi tathāgatairbhāṣitapūrvaḥ
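A view like the one above can be produced with a few lines of Python. This is only a minimal sketch, assuming the soft hyphens in the files are the Unicode character U+00AD; the function name `show_soft_hyphens` is made up for illustration and is not part of the site's code.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, normally invisible in rendered text


def show_soft_hyphens(text: str, marker: str = "*") -> str:
    """Return text with every soft hyphen replaced by a visible marker."""
    return text.replace(SOFT_HYPHEN, marker)


# Illustrative input only: a short string with two soft hyphens inserted.
sample = "bala" + SOFT_HYPHEN + "bodhyaṅ" + SOFT_HYPHEN + "gamār"
print(show_soft_hyphens(sample))  # prints bala*bodhyaṅ*gamār
```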

Now, most of these are fine, the soft hyphens are inserted only in long words, and generally they follow sane syllable breaks. A couple of issues:

  1. Occasionally, eg in rama**ṇavikrīḍana, we have two soft hyphens. In both cases here (and in Lal2), these occur in almost identical words; I’m not sure if that’s significant.
  2. There seems to be an issue with breaking “pr”, eg. in pramā**ṇabuddha*dharma. In this case, and others I noted in the text, there should be no break here. Maybe just drop all hyphens in pr. Other problem conjuncts are ks, (including kṣ and kś), nj or ñj, and rṇ. All these should stay together.
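The two points above could be handled in a post-processing pass along these lines. This is a rough sketch, not the actual hyphenation module: the function name `clean_soft_hyphens` and the `PROTECTED` list are assumptions for illustration, and the list contains only the conjuncts named above.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Conjuncts named in the post as ones that should stay together.
# Stored as explicit (first, second) letter pairs; this encoding is an assumption.
PROTECTED = [
    ("p", "r"),
    ("k", "s"),
    ("k", "ṣ"),
    ("k", "ś"),
    ("n", "j"),
    ("ñ", "j"),
    ("r", "ṇ"),
]

DOUBLE_BREAK = re.compile(SOFT_HYPHEN + "{2,}")


def clean_soft_hyphens(text: str) -> str:
    """Collapse repeated soft hyphens and remove breaks inside protected conjuncts."""
    # Point 1: a run of two or more soft hyphens becomes a single one.
    text = DOUBLE_BREAK.sub(SOFT_HYPHEN, text)
    # Point 2: drop a soft hyphen that sits between the letters of a protected pair.
    for first, second in PROTECTED:
        text = text.replace(first + SOFT_HYPHEN + second, first + second)
    return text
```

Note that this only removes a bad break; it does not move it to a better position, so a break inside “pr” is dropped rather than shifted to before the “p”.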

Here’s a case from Lal 2:

samyakp*rahā*ṇaṛddhi*pādend*riya*bala*bodhya*ṅ*gamār

would ideally be:

samyak*prahāṇa*ṛddhi*pād*endriya*bala*bodhyaṅ*gamār

Not sure what can be done here.

It’s Blake’s python module for hyphenation, with a few changes to incorporate Sanskrit differences. But I had to make some educated guesses that might not always have been correct.
The double hyphens should have been taken out automatically. Obviously they had not been, so I have fixed that now. I also fixed the p-r issue and built in a function to make sure that all the other conjuncts you mention stay together.
How about hyphens in between tv, tk and sv?

The rest takes a bit more study so will do that tomorrow.

I’ve updated it. Please have a look but I would like to know a little more about various word-patterns that need a break (or not).

The latest version looks good. I can’t see any problems with it.

I’m not sure if I can help here; but if you have any specific questions, try me. There is a LaTeX hyphenation pattern for Sanskrit, perhaps that might be useful.

I’ve now done all Sutra and Abhidharma texts too. Not yet Vinaya.
I’ve changed things a little based on the LaTeX hyphenation pattern as you suggested.

Vinaya done as well, but only where needed. Most files already had normal hyphens in them.

Thanks so much for this. I’ll keep my eye on the Skt texts as I use them; hopefully we won’t have any more problems.