Major upgrade to PTS dictionary

sujato · September 6, 2016, 2:40am

As we have discussed at length previously, the state of the digitized PTS Pali dictionary is poor. We have not rectified all the problems—far from it—but we have made some major improvements. This work was by @vimala and myself.

Convert references in the PTS and F&F dictionaries. This involves:

changing the abbreviations to those used on SC,
spelling all references (as far as possible) in full,
marking them up.

Fix miscellaneous issues with punctuation and formatting.
Shift etymology sections to the end of the entry (this makes it easier to find the basic meaning of the word).
Use proper Greek characters.
Add links for terms.
Improve page design.

You can see a more detailed discussion over at GitHub (true geeks only!). The major item on the 2-do list is to linkify the references; this is a difficult task, but will hopefully happen before too long.

We’ve made anekasatasahassāni changes. Given the poor state of the original files, it is inevitable there will be some mistakes, so help us out. Remember, though, that we’re not (at this stage) aiming to correct all the issues with the dictionary, just to avoid making new ones.

sujato · September 24, 2016, 2:08am

Today we are making a further major upgrade to the PTS dictionary. Corrections were checked against the scanned edition from the Internet Archive:

The main change is that entries are now marked up as lists in HTML. This gives structure and clarity to the entries, and is especially useful in the case of very long entries, which in the original are extremely hard to use.

Discerning the correct structure of the many nested lists is no easy matter. The original was reasonably accurate in this regard, if not always consistent. Still, there were errors in the original, and many more introduced through the digitization process. But the biggest difficulty was not the mistakes as such, but simply the lack of clarity about where sections began and ended. I have done my best, but I cannot guarantee I have always reconstructed these correctly.

In addition, I have made many thousands of additional corrections to punctuation, typos, references, and etymology sections, as in the previous upgrade. Still, many errors remain.

I have also spelled out some abbreviations, especially the basic grammar info at the start of each entry.

LXNDR · September 24, 2016, 9:52am

this i found lovely

or the ass of bhikkhus, nuns, laymen and female devotees

as appears in parisa

as they say, everyone (me that is) conceives to the extent of their own depravity

sujato · September 24, 2016, 10:02am

This is obviously a misprint. How could you miss this? Clearly, what it means is:

The badass bhikkhus, nuns, laymen and female devotees

sujato · October 26, 2016, 11:11pm

Just to note a further update to the PTS dictionary. This time:

The references to Pali texts have been virtually completed.
Certain mistakes in the last update were corrected; especially, references to Mahavastu and Mahavamsa were accidentally conflated.
Grammatical terms for verbs are more clearly marked. This is incomplete.
Miscellaneous corrections.

The main work here was completing the original text references. I have now adapted and marked up 123,713 references. There are doubtless a few that I’ve overlooked, and a very small number where I could not discern the references or which are in some way obscure. But this is very close to the total number of original text references in the PTS dictionary. For a work compiled by two people in the paper age, that is a remarkable achievement.

Previously I had done what could be achieved by regex. But the latest updates were done mostly by hand, brought to you by patience and the following sneaky bit of code:

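@font-face {
  /* Custom family that only covers the ASCII digits 0-9 */
  font-family: 'dictnum';
  src: local('Skolar Sans PE Bl');
  unicode-range: U+0030-0039;
}
dd {
  /* Digits come from dictnum; all other characters fall back to Skolar PE */
  font-family: dictnum, "Skolar PE";
}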

This selects all the numerals that are not part of a marked reference and makes them superbold. The trick is to use the unicode-range feature of @font-face. I’m recording it here so I don’t forget it, it’s really useful!
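For anyone who would rather list the leftover numerals than eyeball them in the browser, here is a minimal Python sketch of the same idea. It is only an illustration: the `<a class="ref">` wrapper is an assumed stand-in for however references are actually marked up, and `unmarked_numerals` is a hypothetical helper name, not part of any existing tooling.

```python
import re
from typing import List

# Assumption: marked references are wrapped in <a class="ref">...</a>.
# The real markup may well differ; adjust the pattern to match it.
MARKED_REF = re.compile(r'<a class="ref"[^>]*>.*?</a>', re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def unmarked_numerals(entry_html: str) -> List[str]:
    """Return the digit runs that sit outside any marked reference."""
    # Drop the marked references entirely, including their text.
    without_refs = MARKED_REF.sub(" ", entry_html)
    # Drop the remaining tags so digits inside attributes are ignored.
    text = ANY_TAG.sub(" ", without_refs)
    # Same character range as the CSS unicode-range: U+0030 to U+0039.
    return re.findall(r"[0-9]+", text)


sample = '<dd>see <a class="ref">DN 1</a> and SN 12.2</dd>'
print(unmarked_numerals(sample))  # ['12', '2']
```

Any entry that returns a non-empty list still has numerals to check by hand, which is the same set the bold font makes visible on the page.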

waiyin · November 13, 2016, 11:08pm

Pertaining to the abbreviations used in SC…
…some of the ones for the dictionary are not in the list of Abbreviations
e.g. Ep.
suspect it is (from PED) ep. = Epithet

Do you need help in compiling them?

sujato · November 13, 2016, 11:52pm

Hi Waiyin,

Yes, abbreviations are a problem. There are very many of them, and they are not always listed in the Dictionary, nor are they always consistent. More to the point, they are only really relevant in a print context, and—with the exception of well-known, standard abbreviations such as “etc.”, “i.e.” and so on—I would rather spell them all out.

In fact I have already done so to some degree, and have just added “Ep.” to that, so next time the dictionary is updated it will read “epithet” instead. But this reveals the problem, as sometimes “Ep.” also stood for “Epic” as in “Epic Sanskrit”. In addition, the punctuation and syntax are complex; for example, since the abbreviation “Ep.” is capitalized, it does not change when it starts a sentence. So we can’t mechanically replace it with lower-case “epithet”. Of course, we can fix this by simply capitalizing when it follows a period. Except the punctuation is often missing or incorrect. Then again, “ep.” is not always capitalized, so we have to deal with that; and in one instance, it is incorrectly OCR-ed, so we have to correct it to “cp.”
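To make the difficulty concrete, here is a rough Python sketch of the naive rule described above. It is not the script actually used on the dictionary. It assumes that “Ep.” directly before “Sk.” means “Epic” (a guess at how “Epic Sanskrit” is abbreviated), that every other “Ep.” or “ep.” means “epithet”, and that a preceding period marks the start of a sentence. The name `expand_ep` is made up for the example.

```python
import re

# Assumption: "Ep." directly before "Sk." stands for "Epic" (as in Epic Sanskrit).
EPIC_RE = re.compile(r"\bEp\.(?=\s+Sk\.)")
# Every other "Ep." / "ep." is assumed to stand for "epithet".
EP_RE = re.compile(r"\b[Ee]p\.")


def expand_ep(text: str) -> str:
    """Spell out "Ep."/"ep.", capitalizing only at an apparent sentence start."""
    # Handle the "Epic" reading first so the second pass cannot touch it.
    text = EPIC_RE.sub("Epic", text)

    def repl(match: re.Match) -> str:
        before = text[: match.start()].rstrip()
        # Naive test: start of entry, or previous text ends a sentence.
        starts_sentence = before == "" or before.endswith((".", "!", "?"))
        return "Epithet" if starts_sentence else "epithet"

    return EP_RE.sub(repl, text)


print(expand_ep("Ep. of the Buddha"))              # Epithet of the Buddha
print(expand_ep("used as an Ep. of Nibbāna"))      # used as an epithet of Nibbāna
print(expand_ep("cf. Ep. Sk. form"))               # cf. Epic Sk. form
```

The weak spot is exactly the one described: the sentence-start test trusts the punctuation, so a preceding abbreviation such as “cf.” or a missing period gives the wrong capitalization, and an OCR slip like “ep.” for “cp.” gets expanded when it should have been corrected.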

So anyway, you get the point. At some time, I would love to go through and spell out all abbreviations (as well as marking up grammatical terms, and giving contemporary references in full and consistent form). But that day is not this day! You can see our plans for the dictionary here: