Correcting the CPED

sujato · January 15, 2016, 2:56am

Here, following on from the discussion about Pali Dictionaries, I’ll present what we have with the current CPED, and discuss what needs to be done to make basic corrections. There are a few technical issues that I am unclear on, and have asked Blake about. I’ll leave them aside for now.

First, the basic sources.

The data as we have it on SC. cped_data.py.zip (404.5 KB)
A pdf of the CPED. Concise Pali-English Dictionary_Buddhadatta.pdf (2.8 MB)

## What exactly are these files?

### The SC dictionary data

This data is, I believe, ultimately derived from the metta.lk project to digitize the CPED, and has probably passed through a number of hands since then. I guess the version we use was pulled from the DPR.

It is a JSON file which has the following columns.

0 = Defn (i.e. the word. This is in Velthuis, not Unicode.)
1 = Grammar
2 = Meaning
3 = Source (I don’t know what the sources are.)
4 = InflectGroup (This gives the grammatical group which determines inflections. This is used for auto-analyzing the grammar.)
5 = InflectInfo (Gives further grammatical information. I don’t understand everything in this column.)
6 = BaseWord (In the case of derivative words, this defines what the base is. But it is only used occasionally.)
7 = BaseDefn (Defines whether or not a word is the Base Definition for that word. I guess. But it doesn’t seem to be done properly. Maybe I misunderstand it.)
8 = FuncStem (This gives a “stem” form, which is really just the word with the end dropped off. This is used for lookup.)
9 = Regular (Defines whether a word has a regular or irregular inflection.)
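To make the layout above concrete, here is a minimal Python sketch that gives each position a name. It assumes every entry is a flat list of ten values in the order listed; the class name `CpedEntry`, the snake_case field names, the helper `to_entry`, and the sample row are all hypothetical and are not taken from the actual file.

```python
from typing import NamedTuple


class CpedEntry(NamedTuple):
    """One dictionary row; field order follows the column list above."""
    defn: str           # 0: headword (Velthuis in the source data)
    grammar: str        # 1
    meaning: str        # 2
    source: str         # 3
    inflect_group: str  # 4
    inflect_info: str   # 5
    base_word: str      # 6
    base_defn: str      # 7
    func_stem: str      # 8
    regular: str        # 9


def to_entry(row):
    """Convert a ten-item list into a CpedEntry, rejecting rows of the wrong width."""
    if len(row) != len(CpedEntry._fields):
        raise ValueError(f"expected {len(CpedEntry._fields)} columns, got {len(row)}")
    return CpedEntry(*(str(value) for value in row))


# Invented sample row: only the headword and base word reflect the discussion here.
sample = ["akari", "", "", "", "", "", "karoti", "", "", ""]
entry = to_entry(sample)
print(entry.defn, "->", entry.base_word)
```

Naming the columns like this makes later checks easier to read than indexing by number, at the cost of failing loudly on any row that does not have exactly ten values.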

The file may be easily imported as a spreadsheet. However, there is an irregularity with the number of columns. The top line, which defines the columns, has one fewer column than the entries. I am not sure why this is so, but it can easily be fixed.
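A minimal sketch of how the column mismatch could be spotted and patched before importing, assuming the data has already been loaded as a header list plus a list of row lists. The function names and the `Extra` filler label are hypothetical.

```python
from collections import Counter


def row_length_report(header, rows):
    """Return the header width and a tally of how many rows have each width."""
    return len(header), Counter(len(row) for row in rows)


def pad_header(header, rows, filler="Extra"):
    """Extend the header with placeholder names so it is as wide as the widest row."""
    width = max((len(row) for row in rows), default=len(header))
    extra = [f"{filler}{i}" for i in range(1, width - len(header) + 1)]
    return list(header) + extra


header = ["Defn", "Grammar", "Meaning"]
rows = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
print(row_length_report(header, rows))
print(pad_header(header, rows))
```

Padding the header rather than trimming the rows keeps all the data, so the unnamed column can be inspected and labelled properly later.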

### The PDF

This is a proper digitized PDF, rather than a scanned copy. It was made in 2004, but there is no info as to who did it. The text is quite good. There are some mistakes, but without seeing the original it is not possible to know whether these were copied over from the original. Nevertheless, it can be used to help correct the digital files.

### Relation between them

There are many differences between the PDF and the JSON files. The entries are in many cases organized quite differently; the JSON file splits entries up, while the PDF has sub-entries under main words. This relation may be expressed in the JSON data. The alphabetical order is sometimes different, but this doesn’t matter much.

Mistakes may appear in either or both. For example:

The PDF gives akamaka, which the JSON corrects to akāmaka.
The PDF correctly has ākaḍḍhati, but the JSON mistakes it as akaḍḍhati.
The PDF has “flowless” for akāca, as does the JSON. But this is a typo for “flawless”.
Some errors seem to have arisen from flawed processing. We find, e.g., under akaraṇīya: "pt.p. of ". But it doesn’t tell us what it is the participle of.
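Two of the error types above lend themselves to mechanical flagging. The following is a minimal sketch, assuming the PDF headwords and the JSON headwords are both available as lists of Unicode strings and that grammar notes are plain strings; all function names are hypothetical. It only produces candidates for a human to review, since it cannot tell which side is correct.

```python
import re
import unicodedata


def strip_diacritics(word):
    """Drop combining marks so that 'akāmaka' and 'akamaka' compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_mismatches(pdf_words, json_words):
    """Return (pdf_word, json_word) pairs that differ only in diacritics."""
    by_plain = {}
    for word in json_words:
        word = unicodedata.normalize("NFC", word)
        by_plain.setdefault(strip_diacritics(word), set()).add(word)
    pairs = []
    for word in pdf_words:
        word = unicodedata.normalize("NFC", word)
        for other in sorted(by_plain.get(strip_diacritics(word), ())):
            if other != word:
                pairs.append((word, other))
    return pairs


# A grammar note that ends in "of" with nothing after it has lost its referent.
DANGLING_OF = re.compile(r"\bof$")


def has_dangling_of(grammar):
    """True for notes such as 'pt.p. of ' where the base word is missing."""
    return bool(DANGLING_OF.search(grammar.strip()))


print(diacritic_mismatches(["akamaka", "ākaḍḍhati"], ["akāmaka", "akaḍḍhati"]))
print(has_dangling_of("pt.p. of "))
```

Genuinely distinct words that differ only by a long vowel or a retroflex consonant will also be reported, so the output is a review list rather than a list of errors.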

## Where to start?

I would suggest the first thing is a careful proofreading and correction of the basic file.

Ensure that the English spelling is correct.

This is a job for a non-Pali expert. Probably the easiest way is to import the relevant column into a spreadsheet and use a spellchecker (US English). In addition, normal proofreading would be good. @ElissaJ, is this something you’d be interested in?

Ensure that the Pali spelling is correct.
Check entry definitions and grammatical information against DOP where available.
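As an alternative to a spreadsheet spellchecker for the English step, here is a minimal sketch that collects every English word in the Meaning column that is missing from a reference word list, with a count of how often it occurs. It assumes the meanings are available as a list of strings and that `known_words` is a lowercase set supplied by the caller (for example, loaded from a US English word list); both names are hypothetical.

```python
import re
from collections import Counter

# Runs of ASCII letters, optionally with one internal apostrophe (e.g. "one's").
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def unknown_words(meanings, known_words):
    """Count words in the meanings that do not appear in known_words."""
    counts = Counter()
    for text in meanings:
        for word in WORD.findall(text):
            lowered = word.lower()
            if lowered not in known_words:
                counts[lowered] += 1
    return counts


meanings = ["flowless; without a flaw", "a plough"]
known = {"without", "a", "flaw", "flawless", "plow"}
print(unknown_words(meanings, known).most_common())
```

Reviewing one de-duplicated list of suspect words is usually quicker than stepping through every cell, and it also surfaces valid but regional spellings such as "plough".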

Further improvements probably require more advanced knowledge:

Fill out the BaseWord information. For example, akari is listed as a derived term from karoti. But akaronta, akaraṇīya, and many others aren’t. I can’t discern a pattern here; it seems to be just incomplete data.
Determine which words are not found in the EBTs. This can partly be done by searching, and partly by reference to DOP, but ultimately will be determined by our word list.
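For the search-based part of the last point, a minimal sketch: it treats a headword as possibly attested if any form in a word list begins with that entry's truncated stem (the FuncStem column), and reports the rest. It assumes `entries` is a list of (headword, stem) pairs and `word_list` is a collection of forms taken from the texts; both names, and the prefix-matching rule itself, are assumptions rather than anything the existing lookup is known to do.

```python
import bisect


def unattested(entries, word_list):
    """Return headwords whose stem is not a prefix of any form in word_list."""
    words = sorted(set(word_list))
    missing = []
    for headword, stem in entries:
        # In sorted order, the first word >= stem starts with stem if any word does.
        i = bisect.bisect_left(words, stem)
        found = bool(stem) and i < len(words) and words[i].startswith(stem)
        if not found:
            missing.append(headword)
    return missing


entries = [("akari", "akar"), ("akāca", "akāc")]
word_list = ["akariṃsu", "karoti"]
print(unattested(entries, word_list))
```

Prefix matching errs on the side of calling a word attested, since a short stem will match unrelated forms. Entries with an empty stem are always reported. The result is a first-pass filter to be checked against DOP, not a final answer.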

sujato · January 17, 2016, 10:29am

Here’s the first round of corrections for the CPED. For this round I have:

Run a spellchecker, using US English.
Made a few minor corrections to the entries, mainly removing obscure, archaic, or local terms.
Standardized punctuation.
Changed the Pali headwords from Velthuis to Unicode (and corrected a small number of spelling errors in Pali).
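For anyone repeating the headword conversion, a minimal sketch of a Velthuis-to-Unicode mapping for lowercase Pali follows. It assumes the data uses the standard Velthuis conventions (doubled vowels for length, a leading dot for retroflex consonants and the niggahīta, `"n` and `~n` for the velar and palatal nasals); the file may use a variant, so the table should be checked against real entries first. The function name is hypothetical.

```python
# Ordered (Velthuis, Unicode) pairs for lowercase Pali.
VELTHUIS = [
    ("aa", "ā"), ("ii", "ī"), ("uu", "ū"),
    (".t", "ṭ"), (".d", "ḍ"), (".n", "ṇ"), (".l", "ḷ"), (".m", "ṃ"),
    ('"n', "ṅ"), ("~n", "ñ"),
]


def velthuis_to_unicode(text):
    """Replace each Velthuis sequence with its Unicode equivalent."""
    for source, target in VELTHUIS:
        text = text.replace(source, target)
    return text


for word in ["aaka.d.dhati", "akara.niiya", "akata~n~nuu"]:
    print(word, "->", velthuis_to_unicode(word))
```

This should be applied to the headword column only: in the grammar and meaning columns, a full stop followed by one of these letters is ordinary punctuation or an abbreviation.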

In addition I have restructured the document following suggestions by @blake. It would be good for someone else to run this through another spellcheck. But in any case, the vast majority of outright errors should be gone. In fact, it would be possible to use this right now for our Pali lookup.

Attached is the file as an .ods spreadsheet. Use LibreOffice to open it, or convert it in Google Drive. If we get collaborators working on this, we can work out a way of collaborating that suits all.

cped_data_USspelling.ods.zip (488.3 KB)

Russell · January 17, 2016, 7:43pm

:anjal:

Dear Bhante @Sujato,

Thank you for giving me the opportunity to assist. I am working on it now. I hope that others will volunteer to make the processing go faster. But if not, more adornment for the mind to me hee hee!

Sadhu! Sadhu! Sadhu!

with respect, reverence, and gratitude,
russ

:anjal:

sujato · January 17, 2016, 10:14pm

Thanks so much, @Russell.

Please let me know exactly what it is that you’re doing, and how you’re doing it, i.e. methods and software. We have to make sure we work in sync!

Russell · January 18, 2016, 12:28am

:anjal:

Dear Bhante @Sujato,

I try to follow instructions most of the time hee hee so I’m using LibreOffice as you suggested (BTW, very cool stuff and it’s free). I’m doing all the spell checks and converting words to their US spelling. I’m also going to highlight English terms that are unfamiliar in the US and just use the ordinary descriptions already on the list (perhaps we may do away with them at a later date upon further review).

with respect, reverence, and gratitude,
russ

:anjal:

sujato · January 18, 2016, 12:31am

Are you using the updated file I posted yesterday? Because most of these things are already done …

Russell · January 18, 2016, 12:43am

:anjal:

Dear Bhante @Sujato,

Yes, I am using the one that you posted within the last 14 hrs.

with respect, reverence, and gratitude,
russ

:anjal:

sujato · January 18, 2016, 12:46am

Okay, great. I have modernized the language to some degree, but much more could be done. The author, being Sinhalese, refers from time to time to specifically Sinhala ideas which are probably best removed (e.g. pingo becomes carrying pole). Don’t be afraid to make changes!

Russell · January 18, 2016, 12:51am

:anjal:

Dear Bhante @Sujato,

Pingo definitely perplexed me so I highlighted that. Thank you for the allowance to make changes. Also, plough/ploughing showed up a lot so I changed all of them to plow/plowing.

with respect, reverence, and gratitude,
russ

:anjal:

Russell · January 18, 2016, 2:55am

:anjal:

Dear Bhante @Sujato,

Please know that I have completed all the necessary changes to US English. Next I’m gonna go through each cell, proofread, and check for grammar (e.g. “hose” instead of “whose”) and punctuation (e.g. unnecessary hyphens). I should be able to complete the list by this weekend and get it to you no later than Sunday, Taiwan time.

with respect, gratitude, and reverence,
russ

:anjal:

Russell · January 22, 2016, 3:55am

:anjal:

Dear Bhante @Sujato,

I have finished going through the list, with this evening being my final combing-through. Here’s what has been done:

- all spellings converted to US (American) spelling
- spell checked all terms
- corrected the punctuation
- expanded incomplete descriptions (e.g. types of plants and animals) to whatever extent I could
- looking out for non-native English speakers, I’ve simplified a few words and corrected the grammar to make it easier to understand

Please see the attached .zip file.

Thank you again for giving me the opportunity to assist with this task. I am confident that this will help a lot of people.

with respect, reverence, and gratitude,
russ

:anjal:

cped_data_USspelling.zip (493.9 KB)

sujato · January 22, 2016, 7:33am

Ta very much, I’ll have a look at this, then we can look at Stage 2.

sujato · January 22, 2016, 8:48am

Updated. dep0.3 is the latest.

Here’s an updated version of the file @Russell just supplied. Russell’s changes were carefully done and well considered, so thanks for that. I have kept these changes, and to them I have added a few improvements to the first 230 entries. This is something that I was playing around with these past few days to get an idea what is needed.

What I’ve done is to make a small start on the more advanced corrections as outlined above, namely:

Add words and meanings that are listed in DOP as occurring in the EBTs. This is the most important task.
Correct Pali spellings.
Correct and expand details in grammatical entries.
Fill out the info under “regular forms”. Not a very descriptive title, but basically this gives the normal form under which the elements of the word appear in the dictionary. E.g. akataññū = a+karoti+jānāti.
Bring the abbreviations in line with those used in DOP (and CPD). (This is done throughout, but only checked for the first 230 entries) Here is the file for this: abbreviations_dep.ods.zip (17.0 KB)
Add a few details, like alternative spellings of the Pali word.
In a few cases, I have split the entries, where the entry contains two forms that have both different grammar and different meaning. E.g. akkhidhutta is split into one for the adjective, meaning “addicted to gambling”, and one for the noun, “gambling addict”.
Add an extra column listing words not found in the EBTs, and mark with [brackets] meanings not found in the EBTs. This is just according to the references given in Cone, so is not 100%, but should be fairly robust.
Add extra column that simply lists in order the original entries, so we can keep track of added terms.
Add extra column for stems. The previous column marked “stems” was in fact just mechanically truncated forms, so is relabelled as such. Identifying stems based on the Saddaniti file, which I posted a few days ago, is hard! I am not at all confident in doing it, and it really needs someone who has made a proper study of Pali stems. For now, it’s just there as an idea, I haven’t taken it far enough to be useful.
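One cheap consistency check on the “regular forms” column described above is to split each value on `+` and confirm that every element is itself a headword. This is a minimal sketch, assuming values look like `a+karoti+jānāti` and that the headwords are available as a set of Unicode strings; the function names and the default `prefixes` set (elements allowed to be missing because they are prefixes rather than entries) are hypothetical.

```python
def components(regular_form):
    """Split a value such as 'a+karoti+jānāti' into its elements."""
    return [part.strip() for part in regular_form.split("+") if part.strip()]


def missing_components(regular_form, headwords, prefixes=frozenset({"a"})):
    """Return the elements that are neither known prefixes nor dictionary headwords."""
    return [
        part for part in components(regular_form)
        if part not in headwords and part not in prefixes
    ]


headwords = {"karoti"}
print(components("a+karoti+jānāti"))
print(missing_components("a+karoti+jānāti", headwords))
```

Anything this reports is either a typo in the regular form or a word that still needs its own entry, which ties in with the task of adding words from DOP.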

dep0.3.ods.zip (690.3 KB)

From here, I think the best thing is to work from DOP to add and correct the entries as best as possible. This requires some familiarity with Pali, but not at an advanced level. One key is to identify the references and know which of them refer to the EBTs. Apart from that, it’s mostly a matter of careful reading and editing. If we have more than one volunteer, we could split the job up by letters, and check each other’s work.

sujato · January 24, 2016, 2:18am

A post was split to a new topic: Best Pali dictionary for Android or online

ElissaJ · January 27, 2016, 10:42pm

I’ve ordered the DOP, so I think I could do this part, and also

and

If that would be helpful, let me know.

sujato · January 28, 2016, 2:03am

Great, that sounds good. I can share with you a scanned pdf of DOP if that would help. Let me know, we’ll probably have to email or dropbox it or something, they’re big files.

Also, let me know before you do any work on this. I am making some changes locally, and I don’t update the file here every day!

This is pretty much done, it’s just a matter of keeping an eye out for errors.

Yes it would be, you can see in the current file I simply put a 0 when the CPED entry has no EBT references in DOP.

What I’d suggest is that we work out a way of working that is congenial to both of us in terms of file types or whatever. I’d suggest if we are to collaborate, use a spreadsheet on Google Drive. I’ve just checked it, and it seems to work fine, although obviously it depends on a decent connection. Would this work for you?

When starting, it would probably be a good idea for me to mentor you, to make sure that we’re on the same page, and that you don’t waste your time. Once you’ve got a good idea what’s needed, I’ll let you get on with it, but will be here if needed. Does this sound good?

Russell · January 28, 2016, 2:44am

:anjal:

Dear @Elissaj,

Sadhu! Sadhu! Sadhu! Anumodana

with mettā,
russ

:anjal:

Russell · January 28, 2016, 2:47am

:anjal:

Dear Bhante @Sujato,

I would like to offer my assistance to you and @Elissaj. Thank you for your consideration!

with respect, reverence, and gratitude,
russ

:anjal:

sujato · January 28, 2016, 7:17am

Do you have a copy of DOP, or could I share one with you?

Russell · January 28, 2016, 3:53pm

:anjal:

Dear Bhante @Sujato,

I don’t have a copy of the DOP and I would gratefully accept you sharing a copy with me for this project.

with respect, reverence, and gratitude,
russ

:anjal: