A better Pali Dictionary

sujato · January 8, 2016, 10:49am

We’ve had some discussion about fixing the many, many problems with the current digital version of the Rhys Davids/Stede PTS dictionary. @waiyin has kindly offered to help, and @Simon also. But it was not something that I had looked at very closely.

For the past few days, Blake has been upgrading our translation app, so I have had some spare time and decided to take a look at the PTS dictionary and see what it would entail.

Here are the results. They include:

HTML files of a small sample portion that is reasonably finished, and the full, HIGHLY UNFINISHED version.
PDFs created from the samples, so you can see what I’m aiming at.

pali_ped.zip (3.5 MB)

What I’ve done:

Expanded most abbreviations
Adapted the references to the SC style
Corrected and checked
Marked up the various parts, such as grammatical terms, main definitions, etymology, references etc.
Used a modern dictionary style, especially by removing a lot of punctuation
Structured the entries as lists

As far as it goes, I’m happy with the result. It’s certainly a lot clearer and easier to read. But I’ve reached a point where I’m not really wanting to proceed further, so I thought I’d share it with you all and see if you have any feedback or suggestions.

The sample covers 96 entries. In the whole dictionary, there are 16,528 entries. On a rough estimate, it’d take about 6 months’ work to finish. And that’s not something I can contemplate right now.

In fact, many of the changes are not that difficult. Most of the references, terms, and so on have been done with regular expressions. And with much of the extra markup it is simply a matter of going through the text and adding it one by one.

The real killer is the entry structure. It’s simply a nightmare to figure out how the entries are meant to be read. There is, most of the time, a structure, but it is no easy matter to work it out. Does this reference refer to the Pali phrase before it, or to the one after it, or to neither? Sometimes!

In the original text, these vague structures are, perhaps, not such a problem, as most of the time you just want to know the meaning of the word. But in order to mark it up properly, you have to figure out exactly how each element is related. There’s no way of automating this, so each entry has to be considered on its own. Even now, I am by no means confident that I have it correct, even in the small sample.

If it could be done for the whole text, I have no doubt it would make it far more usable and adaptable. As an example of that, I have included two versions of the sample, one of which is adapted for printing. With just a few CSS changes it makes a fairly good, printable version, taking up roughly the same space as the original print version, but far more legible. It’s not perfect, but it gives an idea what can be done when you have a well marked-up text.

The problem is, is it worth it? The landscape of Pali dictionaries is littered with the bones of failure: the webpage for the now-abandoned Critical Pali Dictionary lists the numerous obituaries of the scholars who died while trying to complete it. No, really!

So we have:

The Rhys Davids/Stede version, which is digitized, but poorly, and is out of date. But it is fairly complete as far as the canonical texts go.
Buddhadatta’s Concise Pali Dictionary, which is not greatly reliable or complete, and relies a lot on later Pali.
The Critical Pali Dictionary, available online, but only covering up to kh, and not at all user-friendly.
Margaret Cone’s A Dictionary of Pali, which so far covers about half the language, based on the texts published by the PTS; a date of 2030 has been mentioned for completion. This is not available digitally, and the print edition is not user friendly.

None of these are really satisfactory, and there’s no real sign of improvement in the near future, so far as I know.

Why, I am wondering, have we failed to produce a decent dictionary for Pali?

I think the failure has to do with a lack of clarity. We treat Pali as one language. But the Pali texts span 2500 years! You wouldn’t expect a dictionary of modern English to include this:

Hwæt! We Gardena in geardagum,
þeodcyninga, þrym gefrunon,
hu ða æþelingas ellen fremedon.
Oft Scyld Scefing sceaþena þreatum,

It’s from Beowulf, which is Old English. This is dated 8th–11th centuries, so the Pali canon is more than twice as old!

True, Pali has not evolved as fast as English, but still, there is no linguistic reason to insist that every stratum of Pali belongs in the same dictionary. Later texts have a huge amount of extra vocabulary and usages; why not reflect this in the dictionaries? This is similar to what A.K. Warder successfully did with his Introduction to Pali; he based the grammar and vocabulary on just the Dīgha Nikāya.

What I am suggesting is that we need a Dictionary of Early Pali. Perhaps this would be a dictionary of the Pali Canon, or perhaps just the EBTs. Not only would this reduce the scope of the project greatly, it would define a language that is roughly contemporary, and which includes the texts that are of most interest for most people.

It would not need to be a detailed academic dictionary; we can leave that for Margaret Cone’s project. It would be something more like the Concise Dictionary, but covering all the vocabulary of the early texts, with meanings and context as used in those texts only.

With the convenience of text search, we no longer need to list so many references in a dictionary. Only when they indicate specific use cases are they needed. Etymology is unnecessary, as is scholarly discussion. Just words and meanings, essential grammar, and nice clean markup.

Anyway, as you know I have no time, nor do I have the inclination for this kind of work. But I wonder if there is a way to get it done. Perhaps a crowdsourcing venture of some sort could be implemented; but it is a specialized kind of work, so I am not sure how that would go. So I will just leave this here, and see what you think.

LXNDR · January 8, 2016, 11:47am

unfortunately the internal linking doesn’t work

i fixed it throughout in my earlier attempt to create a browsable file of the dictionary

so for further work maybe it will be more reasonable and prudent to use my version of the file

sujato · January 9, 2016, 2:39am

Thinking of the scope of a comprehensive dictionary of the EBTs, I decided to see how many unique words are in the corpus. Taking just the EBTs, I stripped the HTML, sorted the terms, and removed duplicates. I also removed some, but far from all, of the artificial duplicates created by things like “-ti” or “ñca” endings.

There’s about 83,000 unique terms. This does not organize things into “words”, but simply unique strings of characters. So it includes every different grammatical form, variant spellings, compounds, and so on, all counted as separate “terms”. Here’s the file. I don’t know if it’s useful for anything!

ebt-u.txt.zip (309.4 KB)
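
For anyone who wants to try something similar, here is a minimal Python sketch of the steps described above: strip the HTML, collect the unique terms, sort them, and drop some suffixed duplicates. It is not the script that produced the file. The function names, the sample string, and the suffix heuristic are all assumptions made for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
# Runs of letters only; Unicode-aware, so ṁ, ā, ñ and so on are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def unique_terms(html_text):
    """Return the sorted, lower-cased unique tokens in an HTML string."""
    text = html.unescape(TAG_RE.sub(" ", html_text))
    # NFC so that letter + combining dot is treated as one character.
    text = unicodedata.normalize("NFC", text).lower()
    return sorted(set(WORD_RE.findall(text)))


def drop_suffixed_duplicates(terms, suffixes=("ti", "ñca")):
    """Drop terms that are just another attested term plus a suffix.

    Crude heuristic (an assumption, not the original method): it can also
    drop genuine words that merely look like term + suffix, so the result
    still needs checking by hand.
    """
    attested = set(terms)
    kept = []
    for term in terms:
        is_duplicate = any(
            term.endswith(s) and term[: -len(s)] in attested for s in suffixes
        )
        if not is_duplicate:
            kept.append(term)
    return kept


sample = "<p>Evaṁ me sutaṁ. <i>Dhammo</i> dhammoti &amp; evaṁ.</p>"
print(drop_suffixed_duplicates(unique_terms(sample)))
# ['dhammo', 'evaṁ', 'me', 'sutaṁ']
```

As in the real list, this yields unique strings, not "words": every inflected form and compound still counts separately.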

sujato · January 9, 2016, 2:50am

Thanks, perhaps I should have started with that, but by now I doubt if it would be useful. I’ve made some hundreds of thousands of changes to the file; and fixing the internal links will not, I think, be difficult.

One of the goals with the new format would be to enable automatic linking of each of the references (of which there are about 100,000) to the correct text on SC. This would take a bit of jiggery-pokery, as the form of the references is different (the dictionary uses vol/page), but it should be doable.
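
A rough sketch of the first half of that job, in Python. It assumes the references have the shape "abbreviation, roman-numeral volume, dot, page" (for example `DN iv.22`); the real file surely has more variants than this pattern catches. The `to_sc_id` helper and its `concordance` table are hypothetical: the actual mapping from volume/page to SC identifiers would have to come from real concordance data, which is not shown here.

```python
import re

# Assumed shape: abbreviation, roman-numeral volume, dot, page number.
REF_RE = re.compile(r"\b([A-Z][A-Za-z]*)\s+([ivxlc]+)\.(\d+)\b")

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100}


def roman_to_int(numeral):
    """Convert a lower-case roman numeral such as 'iv' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one.
        if nxt in ROMAN and ROMAN[nxt] > value:
            total -= value
        else:
            total += value
    return total


def find_references(text):
    """Return (abbreviation, volume, page) tuples found in the text."""
    return [
        (m.group(1), roman_to_int(m.group(2)), int(m.group(3)))
        for m in REF_RE.finditer(text)
    ]


def to_sc_id(ref, concordance):
    """Look up a parsed reference in a (hypothetical) concordance dict."""
    return concordance.get(ref)


print(find_references("the shoulder, DN iv.22; cf. SN ii.101"))
# [('DN', 4, 22), ('SN', 2, 101)]
```

References that the concordance cannot resolve come back as `None`, which gives a list of cases to check by hand.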

Vimala · January 9, 2016, 6:51am

I think that to tackle such a large project we would need to use modern technologies.
I could envision some kind of wiki-style website—with one pali word per page—where pali scholars from all over the world can participate. Of course there is the issue of quality control and it needs a moderator or somebody who is in charge of overseeing this.
Something like: https://www.mediawiki.org/wiki/MediaWiki

sujato · January 9, 2016, 8:04am

Yes, that’s certainly worth looking into.

The classic example is the Digital Dictionary of Buddhism, the Chinese>English Buddhist Dictionary, which was created by Charles Muller starting with his own work in 1995. He has curated it since then, and they are still adding and correcting entries each month. We use it for our Chinese lookup. This is one of the earliest and most successful examples of an academic collaboration project for the web.

Whether this could be replicated for Pali is hard to say. The single biggest factor is the person: to find someone who is willing to maintain and develop it for a long time. Even then, it is not a sure thing. The world today is very different to what it was then, and there are many projects competing for people’s attention. If the project was done outside of the Universities, I think it would be unlikely to get much contribution from academics. And to do it inside a University would require the kind of time and long term planning that is rarely available today. Finally, the field of Pali studies is far smaller than that of Chinese Buddhist studies, and it is not sure there are enough academics to actively contribute.

The other way of doing it would be a more open crowdsourcing. But here again, projects in the Pali world have so far not been successful. If we were to allow anyone to edit it, then who would do the work? Well, by far the greatest number of people learning Pali are those in traditional Theravadin countries. And in those contexts, there is rarely any meaningful ability to distinguish between early and later texts, and the meanings that words have in those contexts. So if we wanted an EBT dictionary, we would have to figure out how to restrict the entries to actual early meanings.

Perhaps the answer lies in a hybrid approach. Use our current sources to compile a dictionary of the EBTs, extracting the relevant entries and matching them with the actual word list from the EBTs. They could be compared and corrected against the CPD and Cone’s work. Then clean, filter, and evolve the dictionary with some kind of user-generated input. Still not easy!

Vimala · January 9, 2016, 8:47am

I agree that such a project hinges on finding a dedicated person.
It may be worth contacting the PTS, as they are the main source and point of contact for universities.
There is also a project by Dr James Nye of the University of Chicago with regard to the dictionary, so they might be interested in helping. But the site dates from 2007, so it is pretty old, and the dictionary is one of many they host there (it is still listed under “current projects” on the PTS website).

But how many people do we really need? If you divide the work in several chunks, maybe 10 people would be sufficient.

sujato · January 10, 2016, 12:07am

Sure, even one would be enough, if they had the dedication and skills, and a small team would be better. But it does take quite a lot of skill. You have to know Pali well, and be able to work on an admittedly dry and complex task for the long term.

sujato · January 10, 2016, 10:07am

I’ve had a bit of time with this, and maybe I was coming at it from the wrong angle. Fixing the PTS PED is probably a lost cause; which was, after all, the conclusion of Margaret Cone. But looking again at the Concise Pali Dictionary, it seems to me this is much more workable.

We have a nicely marked-up HTML file, with simple, clearly structured entries. The coverage of words is reasonable—about 20,000 words. This seems like more than the PTS PED (16,000) but this is an illusion, as the Concise typically lists different forms in separate entries, whereas for the PED these are mostly listed under the main entry. Still, there are some words not found in the PED.

The accuracy of the dictionary can certainly be improved. On just a cursory examination, I found several words where the Pali is misspelled. There’s also problems with the definitions. And it includes a large amount of material from later sources, while omitting some terms from the EBTs. Okay, so a reasonable amount of work to be done, but far from hopeless.

Perhaps what we could do is something like this.

Work through the entries in the Concise, comparing them with Cone’s Dictionary as far as it goes. We can add words that are missing, especially those from the EBTs, and correct current entries. This gives us a straightforward way of correcting, expanding, and improving the text for half the language. For the remainder, perhaps do something similar, relying on the PED? Even this much would be a great improvement, and we would have a fairly reliable and comprehensive dictionary. Let’s call this the Expanded Concise Pali English Dictionary.
Taking it further, we could identify which terms and meanings appear in the EBTs. This is less straightforward, but the various extant dictionaries at least give us a start. These could be marked in the text, so that the dictionary could be used to show both early Pali and general Pali.
Compare the entries in the ECPED with the word list from the EBTs, and add all missing words, so that we have a complete dictionary for the early texts. This would then be the Dictionary of Early Pali. Yay!
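
The second and third steps above could be sketched roughly like this in Python. The function name, the `in_ebt` flag, and the sample data are assumptions for illustration. Note that this matches exact strings only, so an inflected form such as `aṁsena` is reported as missing even though its headword is present; handling that properly is a separate problem.

```python
def mark_and_find_gaps(entries, ebt_terms):
    """Flag entries attested in the EBT word list and list uncovered terms.

    entries   -- dict mapping headword to meaning
    ebt_terms -- iterable of strings taken from the EBT word list
    """
    ebt = set(ebt_terms)
    marked = {
        headword: {"meaning": meaning, "in_ebt": headword in ebt}
        for headword, meaning in entries.items()
    }
    missing = sorted(ebt - set(entries))
    return marked, missing


# Sample data: "laterword" is a made-up placeholder headword.
entries = {"aṁsa": "a part; a side; shoulder", "laterword": "..."}
marked, missing = mark_and_find_gaps(entries, ["aṁsa", "aṁsena"])
print(marked["aṁsa"]["in_ebt"], marked["laterword"]["in_ebt"], missing)
# True False ['aṁsena']
```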

Russell · January 10, 2016, 5:50pm

:anjal:

Dear Bhante @Sujato,

My Pali is basic but this project has captured my interest. What exact skills would you need from volunteers to be able to assist with the ECPED if the project ever does come to life? I’m pretty sure some members would like to assist in whatever capacity they can. Many hands make light work

with respect, reverence, and gratitude,
russ

:anjal:

ElissaJ · January 10, 2016, 6:20pm

Hi All.

This is a project I am very much interested in and could devote considerable time to.

What I had started to do in my quest for interlinear Sutta presentation, was to combine all available electronic dictionaries. I think cleaning up the PTS version is worthwhile and would be willing to do the grunt work, if we can get a plan laid out where I have a sense that we will accomplish something. What I personally would like to have available is as complete a dictionary as possible.

Rather than using a text format, I would use a database format from which I could design reports to generate whatever formats are needed. I have thoughts on database structure and such that I could go into. What I envision is that for each word, we could include the PTS PED definition, the other definitions from other initial dictionaries (Concise PED, etc.). Then also a concise definition which might include multiple words separated with commas, and finally the summary of all of the legitimate definitions. Some words that are forms of base words would just have a link back to the base word.
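
To make that a bit more concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The table and column names are assumptions, not a worked-out design: one row per word (with an optional link back to a base word and a short concise gloss), and one row per source-dictionary definition.

```python
import sqlite3

SCHEMA = """
CREATE TABLE word (
    id      INTEGER PRIMARY KEY,
    form    TEXT NOT NULL UNIQUE,
    concise TEXT,                          -- short comma-separated gloss
    base_id INTEGER REFERENCES word(id)    -- set for forms of a base word
);
CREATE TABLE definition (
    word_id INTEGER NOT NULL REFERENCES word(id),
    source  TEXT NOT NULL,                 -- e.g. 'PED', 'CPED'
    body    TEXT NOT NULL
);
"""


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_word(conn, form, concise=None, base=None):
    """Insert a word; `base` is the form of an already-inserted base word."""
    base_id = None
    if base is not None:
        row = conn.execute("SELECT id FROM word WHERE form = ?", (base,)).fetchone()
        base_id = row[0]
    cur = conn.execute(
        "INSERT INTO word (form, concise, base_id) VALUES (?, ?, ?)",
        (form, concise, base_id),
    )
    return cur.lastrowid


def report(conn, form):
    """Return (source, body) pairs for one word, ordered by source."""
    return conn.execute(
        "SELECT d.source, d.body FROM definition d "
        "JOIN word w ON w.id = d.word_id WHERE w.form = ? ORDER BY d.source",
        (form,),
    ).fetchall()


conn = build_db()
wid = add_word(conn, "aṁsa", concise="part, side, shoulder")
conn.executemany(
    "INSERT INTO definition (word_id, source, body) VALUES (?, ?, ?)",
    [(wid, "PED", "the shoulder; a part (lit. side)"),
     (wid, "CPED", "1. a part; a side; 2. shoulder")],
)
add_word(conn, "aṁsena", base="aṁsa")  # an inflected form linked to its base
for source, body in report(conn, "aṁsa"):
    print(source, "|", body)
```

Reports in other formats (HTML, plain text, and so on) would then just be different queries over the same two tables.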

Here is a thought. If someone would be interested in representing the scholarly / Pali side, I could do the grunt work and we could determine an approximate number of entries to go through per week.

Another thought is that we could reach out to the universities to recruit help. Is there available a list of universities with Buddhist studies programs?

P.S. I’ve had a little trouble following some of the acronyms in this thread. I have a few questions:

What does “early texts” refer to exactly and specifically?
EBT?

P.P.S
83,593 / 52 ≈ 1,608 terms per week
83,593 / 365 ≈ 229 terms per day

-Elissa

LXNDR · January 10, 2016, 6:34pm

aha

EBT = early Buddhist texts = four Nikayas and some books of Khuddaka Nikaya

ElissaJ · January 10, 2016, 7:58pm

Thanks. Pardon my thirst for knowledge and lack thereof. Which books of the Khuddaka Nikaya?

sujato · January 10, 2016, 11:42pm

Thanks so much! See below.

sujato · January 10, 2016, 11:44pm

Hi Elissa,

Thanks so much for your offer! I’ll respond in detail below, but first, may I ask, what’s your relevant experience here? I’m not prying, just trying to get a sense of what we can do!

sujato · January 10, 2016, 11:47pm

Six books: Dhammapada, Udana, Itivuttaka, Suttanipata, Theragatha, and Therigatha. These are pretty much contemporary with the bulk of the material in the four main nikayas, and should be considered together. The rest of the Khuddaka belongs to a distinctly later period.

sujato · January 11, 2016, 1:56am

So here I’ll respond in some more detail to some of the questions, and offer some idea how we can move ahead.

First up, a few people have suggested approaching the PTS or universities. In my opinion, this is a waste of time. I know most of the people working in this area, and they’re pretty much tied up with their own projects. It’s not like there’s a bunch of spare capacity just lying around waiting to be used. And working with such places is often very slow, very tied up with processes, applications, making proposals, getting grants, all of that stuff. The time scale that I’m thinking of, we’d have it finished before even the first round of grant applications was considered. If someone working in a Uni comes forward, great, but I won’t hold my breath.

There’s only one real reason for working through a university, and that’s prestige. Universities, at least in the humanities, have become prestige factories. Buddhist studies scholars are competing for very scarce, highly vulnerable jobs, and to maintain a career they must generate prestige in a manner recognized by the universities. And there’s little prestige to be gained from a project like this. What gets you prestige is publishing as many articles as possible in prestigious journals. To do that, you divide your work into the Minimum Publishable Unit (MPU)—this is an actual term I’ve heard from an academic. A project such as this won’t generate many MPUs, so it’s basically career suicide. Personally, I couldn’t care less about prestige, I want to make something that will actually help people to get enlightened. And that’s not a motivation that makes sense in an academic world.

To get back to the project, let me respond to a few things. But to start, we need to have some convenient way of talking about the dictionaries, so let me define my terms here:

PED: Pali Text Society’s Pali English Dictionary by Rhys Davids/Stede
CPD: Critical Pali Dictionary
CPED: Concise Pali English Dictionary
DOP: Cone’s A Dictionary of Pali

That’s more or less what we do on SC already. We have multiple dictionaries on call, and they produce their results, and that works fine. We don’t have either DOP—which is not available digitally—or CPD (but perhaps we should), but we use those that are available. (In addition to PED and CPED we have a Dictionary of Pali Proper Names and a Dictionary of Flora and Fauna.) Of course, you’re talking about integrating these at a deeper level, but this is no easy task, as I hope will become clear.

Okay, let’s have a look at what that would entail. Here are the definitions for the first word in Pali, aṁsa. Let’s consider just the basic explanations of the word.

PED
aṁsa 1: a) the shoulder … b) a part (lit. side)
aṁsa 2: point, corner, edge … In connection with a Vimāna: āyat˚; with wide or protruding capitals
CPD
aṁsa 1: m. a share, part, portion,
aṁsa 2: m. the shoulder
aṁsa 3: m. edge, corner
aṁsa 4: m. an ornament of a chariot (?)
CPED
aṁsa: m.; nt. 1. a part; a side; 2. shoulder.
DOP
aṁsa 1: m. share, portion, part
aṁsa 2: m. the shoulder
aṁsa 3: m. point, corner, edge, facet

This is a nice illustration of the issues involved in reconciling the various dictionaries. For this one word, each dictionary gives us a different number of entries. This is not because they cover different material or explain it differently, but simply because they have organized their material differently. So we have to figure out how the material should be arranged.

We have a somewhat different set of priorities than the Dictionary compilers. Their overriding principle is linguistics, and they attempt to organize their material so that it corresponds with a “correct” linguistic form. However, as you can see, there is no clear or unambiguous way to achieve that.

Our emphasis, on the other hand, will be on pragmatics, especially on how to develop material that is usefully applied in a digital medium. For example, while linguistically we may be able to distinguish between different senses of the word aṁsa, a computer cannot (or at best, it’s really difficult and unreliable). So how are we to best organize the material to show, hopefully, the correct meaning, or more likely, a range of possible meanings?

Another way of thinking about this is that, while in human language we deal with “words”, to which we ascribe meanings, computers deal with “tokens”, that is, unique strings of characters. So a big part of our job is figuring out how to make our “tokens” (that is, the actual strings of characters found in our texts) match up with the “words” (i.e. dictionary entries).

As we can see, while a token is simple, clear, and unambiguous, there is no agreement even among the dictionary makers as to what the “word” actually consists of: it is purely a linguist’s abstraction that does not exist in the language.

What would we gain from combining these entries? Not a lot. The basic definitions of the words are pretty much identical. In fact, they have probably copied from each other.

It would seem that the CPED is lacking one sense, “point or edge”, but this is not in fact the case. That sense only occurs in compounds, and the CPED lists the occurrences under the compounded forms. This is a good example of a case where what is linguistically less sophisticated is nevertheless more useful for us. Rather than expecting the computer to analyze a compound and figure out what the elements are, which meaning to ascribe to each of the elements, and how those meanings combine, we simply have an entry under the compounded form, which is close to what actually appears in the language.

Then we have the obscure reference to aṁsa with reference to a vimāna, a celestial palace or chariot. This occurs underneath the meaning “point” in the PED; while CPD assigns a separate, but dubious, entry; CPED omits it entirely; and DOP doesn’t explicitly mention it, but includes the references under “point”. From our point of view, this is all moot, since this meaning doesn’t occur in the EBTs.

There is other information in these entries, including examples, references, and so on. These are a huge part of the art of making such dictionaries, but in our context they have only limited usefulness.

In a print era, what happens is this. I come across the word aṁsa in a Pali text, and I want to look up the meaning. So I check an available dictionary, and I get the meaning. Mostly it will be pretty obvious. But if it is unclear, I can check the various references to see how the word is used in other contexts.

But in a digital era, I read the text, get the meaning, and if I want to check it in different contexts, I use search. Does search do exactly the same thing as such detailed references? No it does not. It does some things better and some things worse. But it does mean that the utility of spending vast amounts of time to sort out the references is drastically diminished. So my thinking is, these are a low priority. They will be useful in only a small number of cases. And remember, this information already exists. It is not a question of having it or not. It is a question of whether it is worth the time and effort (thousands of hours) to reorganize the material in a more user-friendly way.

In addition, all such references are in forms that are inconsistent with those used on SC. So to be really useful they have to be converted to a machine-readable, SC-friendly form. Then we can auto-link them and so on. They also repeat each other, so to create a well-structured, integrated text we would need to find a way to remove duplicates. This is not a trivial task, as sometimes a reference is part of a sentence, for example, “In DN iv.22 it means…” So in such cases we can’t simply remove a duplicated reference.

Finally, there’s the grammatical information, which in various dictionaries specifies the part of speech and, in this case, gender. Note that the PED omits this in this case, as it probably assumes masculine is default. CPED is alone in saying the word can be neuter, so we can probably assume this is a mistake. Knowing the gender is useful both for students and machines. It helps to identify the possible forms (tokens) that a word might take in the language. So it should be in a consistent and machine-readable form.
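
As a rough illustration of how machine-readable gender helps connect tokens to words, here is a small Python sketch. The ending list is a partial, illustrative subset of the masculine a-stem declension, and the function names and the `"m."` grammar code are assumptions; a real system would need full paradigms for every stem type and gender, plus sandhi and compounds.

```python
# Partial, illustrative set of masculine a-stem endings (an assumption:
# not a complete paradigm, and other stem types are not handled at all).
MASC_A_ENDINGS = ("a", "o", "aṁ", "ena", "assa", "ā", "e", "ehi", "ānaṁ", "esu")


def candidate_forms(headword, grammar):
    """Return the set of tokens a headword might appear as in the texts."""
    if grammar == "m." and headword.endswith("a"):
        stem = headword[:-1]
        return {stem + ending for ending in MASC_A_ENDINGS}
    # Unknown pattern: fall back to the headword itself.
    return {headword}


def build_token_index(entries):
    """Map each candidate token to the set of headwords it may belong to."""
    index = {}
    for headword, grammar in entries:
        for form in candidate_forms(headword, grammar):
            index.setdefault(form, set()).add(headword)
    return index


index = build_token_index([("aṁsa", "m.")])
print(sorted(index["aṁsena"]))
# ['aṁsa']
```

The index maps a token to a set of headwords because the same string can belong to more than one word, which is why a lookup will often have to show a range of possible meanings.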

At the end of all that, what have we learned? Well, basically this: the CPED entry was fine. We don’t gain anything much from including the other definitions. We can make a minor correction to grammatical information, and that’s about it. For the vast majority of users, they already have what they need, and gain nothing from the other entries. In fact, giving more information is just confusing.

This is just one case, and for each entry the issues will be slightly different. But the basic situation will be the same: combining the dictionaries in any meaningful semantic way is a difficult and time-consuming task which will, I think, have little practical benefit.

What will have benefit is to ensure that our coverage of words in the EBTs is both accurate and complete. And it was to do this that I outlined my method above. If we are able to achieve this, I think we can basically use them in a similar way to what we already do: give the basic meaning in a lookup, and more extensive discussions in the full entry.

If there is still interest in integrating the dictionary in a more comprehensive way, then perhaps this could be a phase two. Start by concentrating on getting complete and accurate entries for all words in the EBTs, then progressively enrich the information.

If we are to proceed, then obviously we need to work out some format. What that is doesn’t matter too much; it’s most important that it be usable. Personally, I don’t like databases and always work with text files. But as long as the data can be imported/exported easily it doesn’t matter too much. Ultimately I would imagine it will be stored as JSON, or perhaps simply HTML. But if it is to be a collaborative project, it is important that the format be familiar and non-breakable for non-specialists, so it may be best to use a spreadsheet. Note that for some detail work it will probably be desirable to include some form of text markup, for example, for italicizing Pali words in the text.
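
To show that moving between a spreadsheet and JSON need not be painful, here is a minimal round-trip sketch using only Python’s standard library. The three column names are an assumption, not an agreed format; the point is only that a CSV export from any spreadsheet can be converted to JSON and back without loss.

```python
import csv
import io
import json

# Assumed column layout; the real project would agree on its own.
FIELDS = ["headword", "grammar", "meaning"]


def csv_to_json(csv_text):
    """Convert spreadsheet-style CSV text to a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, ensure_ascii=False, indent=2)


def json_to_csv(json_text):
    """Convert the JSON array back to CSV text with a header row."""
    rows = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


sheet = "headword,grammar,meaning\naṁsa,m.,a part; a side; shoulder\n"
as_json = csv_to_json(sheet)
print(as_json)
print(json_to_csv(as_json) == sheet)  # the round trip preserves the data
```

`ensure_ascii=False` keeps the diacritics readable in the JSON file instead of turning them into escape sequences.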

Yes, exactly!

waiyin · January 11, 2016, 6:07am

I like the ‘Yay!’ at the end of deciding that CPED is probably the best starting point.
(Assuming CPED = http://www.ahandfulofleaves.org/documents/Concise%20Pali%20English%20Dictionary_Buddhadatta.pdf
or
http://www.viet.net/anson/ebud/dict-pe/dictpe-01-a.htm)

I would like to help but
"You have to know Pali well, and be able to work on an admittedly dry and complex task for the long term."
I don’t know Pali well (just a few words here and there);
Dry and Complex task is okay.

PS - Good news is that I have the CPED in hardcopy (but hardly used), and had ‘fun’ locating aṁsa or Aŋsa - found it when searching for shoulder.
Should have looked at the online CPED first - it seemed to have ‘corrected’ the spelling (though aṃsa?). But I wouldn’t have been able to look it up in the hardcopy then.

PPS - Just copied and pasted all the entries from the viet.net version of the CPED - 19273 entries in total, entered differently from book (e.g. 2nd last entry on book). My original purpose in looking at Pali-English dictionary was to ‘create’ a Kindle dictionary so that I can look up the ReadingFaithfully Pali nikayas (only 4) like that in SuttaCentral when I’m offline. I think this could be a good starting point. Will need to check the alphabets first and how to create a Kindle dictionary.

sujato · January 11, 2016, 7:19am

There is, I think, basically one version of the CPED available in various places and forms digitally, and this had its origins in a corrected version done by metta.lk. If we were to proceed, we would start with the version on SC, which we have in both HTML and JSON format. The JSON is already enriched with some extra information that is useful for us. Here’s the file for your interest.

cped_data.py.zip (404.5 KB)

Nibbanka · January 11, 2016, 8:29pm

First, it would be great to work through the words in the order of their frequency.

A list of 2000 most frequent words in Majjhima Nikaya:
http://dhamma.ru/paali/MN2000word_list.odt

Sutta-pitaka (without Khuddaka Nikaya) frequency list:
http://dhamma.ru/paali/word_list.odt

Sutta and Vinaya word frequency list:
http://dhamma.ru/paali/wordlist.zip

Kurt Schmidt
A Frequency Dictionary of Pali: Core Vocabulary for Learners
http://www.amazon.com/Frequency-Dictionary-Pali-Vocabulary-Learners/dp/1478369159/

Complete word list of all Pali words (about 967,000) as occurring in the CSCD (VRI) Tipitaka edition
http://wayback.archive.org/web/20150707075127/http://www.nibbanam.com/sortedFrequencyPali.zip
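
Ordering the work by frequency could be sketched like this in Python. The function name and sample data are assumptions, and the sketch counts exact token matches only, so in practice the tokens would first need to be mapped to their headwords.

```python
from collections import Counter


def by_frequency(tokens, headwords):
    """Return the headwords found in `tokens`, most frequent first."""
    counts = Counter(token for token in tokens if token in headwords)
    return [word for word, _ in counts.most_common()]


tokens = ["evaṁ", "me", "evaṁ", "sutaṁ", "evaṁ", "me", "xyz"]
print(by_frequency(tokens, {"evaṁ", "me", "sutaṁ"}))
# ['evaṁ', 'me', 'sutaṁ']
```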