Here, following on from the discussion about Pali Dictionaries, I’ll present what we have with the current CPED, and discuss what needs to be done to make basic corrections. There are a few technical issues that I am unclear on, and have asked Blake about. I’ll leave them aside for now.
First, the basic sources.
- The data as we have it on SC. cped_data.py.zip (404.5 KB)
- A pdf of the CPED. Concise Pali-English Dictionary_Buddhadatta.pdf (2.8 MB)
## What exactly are these files?
### The SC dictionary data
This data is, I believe, ultimately derived from the metta.lk project to digitize the CPED, and has probably passed through a number of hands since then. I guess the version we use was pulled from the DPR.
It is a JSON file with the following columns:
- 0=Defn, (i.e. the headword. This is in Velthuis transliteration, not Unicode)
- 1=Grammar,
- 2=Meaning,
- 3=Source, (I don’t know what the sources are)
- 4=InflectGroup, (This gives the grammatical group which determines inflections. This is used for auto-analyzing the grammar)
- 5=InflectInfo, (Gives further grammatical information. I don’t understand everything in this column.)
- 6=BaseWord, (For derived words, this gives the base word. But it is only filled in occasionally.)
- 7=BaseDefn, (Apparently marks whether an entry is the base definition for its word. But it doesn’t seem to be applied consistently. Maybe I misunderstand it.)
- 8=FuncStem, (This gives a “stem” form, which is really just the word with the ending dropped off. This is used for lookup.)
- 9=Regular (Defines whether a word has a regular or irregular inflection.)
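Since the headwords are in Velthuis, comparing them with the PDF means converting one form into the other. Here is a minimal sketch in Python, assuming the usual Velthuis conventions for Pali (doubled vowels for long vowels, a leading dot for retroflexes and niggahīta, `"n` for ṅ, `~n` for ñ). I haven’t checked it against every form in the data, and the function name is only for illustration.

```python
# Velthuis -> Unicode for Pali headwords.
# The mapping is an assumption based on the standard Velthuis scheme;
# capitals and any non-standard forms in the data are not handled.
VELTHUIS_TO_UNICODE = [
    ("aa", "ā"), ("ii", "ī"), ("uu", "ū"),
    (".t", "ṭ"), (".d", "ḍ"), (".n", "ṇ"),
    (".m", "ṃ"), (".l", "ḷ"),
    ('"n', "ṅ"), ("~n", "ñ"),
]


def velthuis_to_unicode(word: str) -> str:
    """Replace each Velthuis sequence with its Unicode letter."""
    for source, target in VELTHUIS_TO_UNICODE:
        word = word.replace(source, target)
    return word


print(velthuis_to_unicode("akara.niiya"))
print(velthuis_to_unicode("aaka.d.dhati"))
```

One caveat: a doubled vowel is always read as a long vowel here, so a genuine hiatus of two short vowels, if the data has any, would be converted wrongly.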
The file may be easily imported as a spreadsheet. However, there is an irregularity with the number of columns. The top line, which defines the columns, has one fewer column than the entries. I am not sure why this is so, but it can easily be fixed.
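One way to work around the short header on import, sketched in Python. The sample rows are made-up placeholders, and so is the filler column name. I don’t yet know which column the header is actually missing, so padding at the end is only a guess that would need checking against the real file.

```python
def rows_to_records(header, rows, filler="Unknown"):
    """Pair each row with the column names, padding a header that is too short."""
    width = max(len(row) for row in rows)
    # Invent a name for every column the header does not cover.
    names = list(header) + [f"{filler}{i}" for i in range(len(header), width)]
    return [dict(zip(names, row)) for row in rows]


# Placeholder data: a three-column header over four-column entries.
header = ["Defn", "Grammar", "Meaning"]
rows = [["akaaca", "?", "flowless", "?"]]

records = rows_to_records(header, rows)
print(records[0])
```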
### The PDF
This is a proper digitized PDF, rather than a scanned copy. It was made in 2004, but there is no info as to who did it. The text is quite good. There are some mistakes, but without seeing the printed edition it is not possible to know whether they were already present there or were introduced in digitizing. Nevertheless, it can be used to help correct the digital files.
### Relation between them
There are many differences between the PDF and the JSON file. In many cases the entries are organized quite differently: the JSON file splits entries up, while the PDF has sub-entries under main words. This relation may be expressed in the JSON data. The alphabetical order is sometimes different, but this doesn’t matter much.
Mistakes may appear in either or both. For example:
- The PDF gives akamaka, which the JSON corrects to akāmaka.
- The PDF correctly has ākaḍḍhati, but the JSON wrongly gives akaḍḍhati.
- The PDF has “flowless” for akāca, as does the JSON. But this is a typo for “flawless”.
- Some errors seem to have arisen from flawed processing. We find, e.g., under akaraṇīya: "pt.p. of ". But it doesn’t tell us of what.
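Two of these kinds of error can be searched for mechanically. A rough sketch in Python: the first function pairs up headwords from the two sources that differ only in their diacritics (like akamaka and akāmaka), and the second flags meanings that end in a dangling “of”. The word lists are tiny samples taken from the examples above and the function names are my own. A hit only means “look at this entry”; it doesn’t say which side is wrong.

```python
import re
import unicodedata


def nfc(word):
    """Normalize to composed form so equal words compare equal."""
    return unicodedata.normalize("NFC", word)


def strip_diacritics(word):
    """Drop combining marks: 'akāmaka' becomes 'akamaka'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def diacritic_mismatches(pdf_words, json_words):
    """Return (pdf_word, json_word) pairs that match only without diacritics."""
    json_set = {nfc(w) for w in json_words}
    by_plain = {}
    for w in json_set:
        by_plain.setdefault(strip_diacritics(w), []).append(w)
    pairs = []
    for w in map(nfc, pdf_words):
        if w in json_set:
            continue  # exact match, nothing to check
        for candidate in sorted(by_plain.get(strip_diacritics(w), [])):
            pairs.append((w, candidate))
    return pairs


DANGLING_OF = re.compile(r"\bof\s*$")


def has_dangling_reference(meaning):
    """True when a meaning ends in 'of' with nothing after it."""
    return bool(DANGLING_OF.search(meaning))


pdf_words = ["akamaka", "ākaḍḍhati", "akāca"]
json_words = ["akāmaka", "akaḍḍhati", "akāca"]
print(diacritic_mismatches(pdf_words, json_words))
print(has_dangling_reference("pt.p. of "))
```

The dangling-“of” check will also catch definitions that legitimately end in “of”, so its results need a human eye.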
## Where to start?
I would suggest the first thing is a careful proofreading and correction of the basic file.
- Ensure that the English spelling is correct.
  - This job does not need a Pali expert. Probably the easiest way is to import the relevant column into a spreadsheet and use a spellchecker (US English). In addition, normal proofreading would be good. @ElissaJ, is this something you’d be interested in?
- Ensure that the Pali spelling is correct.
- Check entry definitions and grammatical information against DOP where available.
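To get the English column into a spreadsheet for spellchecking, something like the following Python sketch would do. It assumes the entries have already been loaded as a list of rows in the column order given above (index 0 is Defn, index 2 is Meaning); the sample row is a placeholder. The row number and headword go along so that a correction can be traced back to its entry.

```python
import csv
import io


def write_meanings(rows, out):
    """Write row number, headword and meaning as CSV to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["row", "Defn", "Meaning"])
    for i, row in enumerate(rows):
        writer.writerow([i, row[0], row[2]])


# Placeholder row; only the columns used here are filled in.
sample_rows = [["akaaca", "?", "flowless"]]

buffer = io.StringIO()
write_meanings(sample_rows, buffer)
print(buffer.getvalue())
```

For a real export, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer.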
Further improvements probably require more advanced knowledge:
- Fill out the BaseWord information. For example, akari is listed as a derived term from karoti. But akaronta, akaraṇīya, and many others aren’t. I can’t discern a pattern here; the data just seems to be incomplete.
- Determine which words are not found in the EBTs. This can partly be done by searching, and partly by reference to DOP, but ultimately will be determined by our word list.
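Both of these tasks start from a list of candidates, which can be pulled out mechanically. A Python sketch, assuming rows in the column order given above (Defn at index 0, BaseWord at index 6) and a word list available as a plain set of words in the same spelling as the headwords. The helper names are hypothetical and the word list below is a dummy, not real EBT data.

```python
DEFN, BASEWORD = 0, 6


def make_row(defn, baseword=""):
    """Build a ten-column placeholder row with only Defn and BaseWord set."""
    row = [""] * 10
    row[DEFN] = defn
    row[BASEWORD] = baseword
    return row


def missing_baseword(rows):
    """Headwords whose BaseWord column is empty."""
    return [row[DEFN] for row in rows if not row[BASEWORD].strip()]


def not_in_wordlist(headwords, wordlist):
    """Headwords that do not occur in the given word list."""
    return sorted(w for w in headwords if w not in wordlist)


rows = [
    make_row("akari", "karoti"),
    make_row("akaronta"),
    make_row("akara.niiya"),
]
dummy_wordlist = {"akari"}  # stand-in only, says nothing about the EBTs

print(missing_baseword(rows))
print(not_in_wordlist([row[DEFN] for row in rows], dummy_wordlist))
```

An empty BaseWord is of course correct for words that are not derived from anything, so the first list is only a set of entries to review.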