A better Pali Dictionary

Hi Elissa,

Thanks so much for your offer! I’ll respond in detail below, but first, may I ask, what’s your relevant experience here? I’m not prying, just trying to get a sense of what we can do!

Six books: Dhammapada, Udana, Itivuttaka, Suttanipata, Theragatha, and Therigatha. These are pretty much contemporary with the bulk of the material in the four main nikayas, and should be considered together. The rest of the Khuddaka belongs to a distinctly later period.


So here I’ll respond in some more detail to some of the questions, and offer some idea how we can move ahead.

First up, a few people have suggested approaching the PTS or universities. In my opinion, this is a waste of time. I know most of the people working in this area, and they’re pretty much tied up with their own projects. It’s not like there’s a bunch of spare capacity just lying around waiting to be used. And working with such places is often very slow, very tied up with processes, applications, making proposals, getting grants, all of that stuff. The time scale that I’m thinking of, we’d have it finished before even the first round of grant applications was considered. If someone working in a Uni comes forward, great, but I won’t hold my breath.

There’s only one real reason for working through a university, and that’s prestige. Universities, at least in the humanities, have become prestige factories. Buddhist studies scholars are competing for very scarce, highly vulnerable jobs, and to maintain a career they must generate prestige in a manner recognized by the universities. And there’s little prestige to be gained from a project like this. What gets you prestige is publishing as many articles as possible in prestigious journals. To do that, you divide your work into the Minimum Publishable Unit (MPU)—this is an actual term I’ve heard from an academic. A project such as this won’t generate many MPUs, so it’s basically career suicide. Personally, I couldn’t care less about prestige, I want to make something that will actually help people to get enlightened. And that’s not a motivation that makes sense in an academic world.

To get back to the project, let me respond to a few things. But to start, we need to have some convenient way of talking about the dictionaries, so let me define my terms here:

  • PED: Pali Text Society’s Pali English Dictionary by Rhys Davids/Stede
  • CPD: Critical Pali Dictionary
  • CPED: Concise Pali English Dictionary
  • DOP: Cone’s A Dictionary of Pali

That’s more or less what we do on SC already. We have multiple dictionaries on call, and they produce their results, and that works fine. We don’t have either DOP—which is not available digitally—or CPD (but perhaps we should), but we use those that are available. (In addition to PED and CPED we have a Dictionary of Pali Proper Names and a Dictionary of Flora and Fauna.) Of course, you’re talking about integrating these at a deeper level, but this is no easy task, as I hope will become clear.

Okay, let’s have a look at what that would entail. Here are the definitions for the first word in Pali, aṁsa. Let’s consider just the basic explanations of the word.

  • PED
      • aṁsa 1: a) the shoulder … b) a part (lit. side)
      • aṁsa 2: point, corner, edge … In connection with a Vimāna: āyat˚; with wide or protruding capitals

  • CPD
      • aṁsa 1: m. a share, part, portion
      • aṁsa 2: m. the shoulder
      • aṁsa 3: m. edge, corner
      • aṁsa 4: m. an ornament of a chariot (?)

  • CPED
      • aṁsa: m.; nt. 1. a part; a side; 2. shoulder.

  • DOP
      • aṁsa 1: m. share, portion, part
      • aṁsa 2: m. the shoulder
      • aṁsa 3: m. point, corner, edge, facet

This is a nice illustration of the issues involved in reconciling the various dictionaries. For this one word, each dictionary gives us a different number of entries. This is not because they cover different material or explain it differently, but simply because they have organized their material differently. So we have to figure out how the material should be arranged.
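To make the comparison concrete, here is a rough Python sketch that holds the aṁsa glosses quoted above as plain data. The structure and the function names are just my assumptions for illustration, and I have trimmed the glosses by hand (e.g. "the shoulder" becomes "shoulder").

```python
# Glosses for aṁsa, taken from the four entries quoted above and
# normalized by hand. Each inner list is one numbered entry.
AMSA = {
    "PED": [["shoulder", "part", "side"], ["point", "corner", "edge"]],
    "CPD": [["share", "part", "portion"], ["shoulder"], ["edge", "corner"],
            ["ornament of a chariot (?)"]],
    "CPED": [["part", "side", "shoulder"]],
    "DOP": [["share", "portion", "part"], ["shoulder"],
            ["point", "corner", "edge", "facet"]],
}


def entry_counts(data):
    """Number of separate entries each dictionary gives for the word."""
    return {name: len(entries) for name, entries in data.items()}


def shared_glosses(data):
    """Glosses that appear in every dictionary, ignoring entry boundaries."""
    gloss_sets = [
        {gloss for entry in entries for gloss in entry}
        for entries in data.values()
    ]
    return set.intersection(*gloss_sets)


print(entry_counts(AMSA))            # {'PED': 2, 'CPD': 4, 'CPED': 1, 'DOP': 3}
print(sorted(shared_glosses(AMSA)))  # ['part', 'shoulder']
```

The entry counts differ for every dictionary, while the core glosses are common to all four, which is the organizational problem in miniature.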

We have a somewhat different set of priorities than the Dictionary compilers. Their overriding principle is linguistics, and they attempt to organize their material so that it corresponds with a “correct” linguistic form. However, as you can see, there is no clear or unambiguous way to achieve that.

Our emphasis, on the other hand, will be on pragmatics, especially on how to develop material that can be usefully applied in a digital medium. For example, while linguistically we may be able to distinguish between different senses of the word aṁsa, a computer cannot (or at best, it’s really difficult and unreliable). So how are we to best organize the material to show, hopefully, the correct meaning, or more likely, a range of possible meanings?

Another way of thinking about this is that, while in human language we deal with “words”, to which we ascribe meanings, computers deal with “tokens”, that is, unique strings of characters. So a big part of our job is figuring out how to make our “tokens” (that is, the actual strings of characters found in our texts) match up with the “words” (i.e. dictionary entries).
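As a minimal sketch of what matching tokens to words could look like, assume we keep, for each dictionary entry, a list of the tokens that belong to it. The names `ENTRIES`, `build_token_index` and `lookup` are hypothetical, and the forms listed are only a small illustrative sample, not how SC actually stores its data.

```python
from collections import defaultdict

# Hypothetical mini-dictionary: headword -> tokens that should resolve to it.
# Only a handful of forms are listed, for illustration.
ENTRIES = {
    "aṁsa": ["aṁso", "aṁsaṁ", "aṁse"],
    "karoti": ["karoti", "karosi", "karomi"],
}


def build_token_index(entries):
    """Invert headword -> tokens into token -> set of headwords."""
    index = defaultdict(set)
    for headword, tokens in entries.items():
        for token in tokens:
            index[token].add(headword)
    return index


def lookup(token, index):
    """Return the candidate headwords for a token (empty list if unknown)."""
    return sorted(index.get(token, ()))


index = build_token_index(ENTRIES)
print(lookup("karosi", index))   # ['karoti']
print(lookup("unknown", index))  # []
```

A token may map to more than one headword, which is why the lookup returns a list of candidates rather than a single answer.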

As we can see, while a token is simple, clear, and unambiguous, there is no agreement even among the dictionary makers as to what the “word” actually consists of: it is purely a linguist’s abstraction that does not exist in the language.

What would we gain from combining these entries? Not a lot. The basic definitions of the words are pretty much identical. In fact, they have probably copied from each other.

It would seem that the CPED is lacking one sense, “point or edge”, but this is not in fact the case. That sense only occurs in compounds, and the CPED lists the occurrences under the compounded forms. This is a good example of a case where what is linguistically less sophisticated is nevertheless more useful for us. Rather than expecting the computer to analyze a compound and figure out what the elements are, which meaning to ascribe to each of the elements, and how those meanings combine, we simply have an entry under the compounded form, which is close to what actually appears in the language.

Then we have the obscure reference to aṁsa with reference to a vimāna, a celestial palace or chariot. This occurs underneath the meaning “point” in the PED; while CPD assigns a separate, but dubious, entry; CPED omits it entirely; and DOP doesn’t explicitly mention it, but includes the references under “point”. From our point of view, this is all moot, since this meaning doesn’t occur in the EBTs.

There is other information in these entries, including examples, references, and so on. These are a huge part of the art of making such dictionaries, but in our context they have only limited usefulness.

In a print era, what happens is this. I come across the word aṁsa in a Pali text, and I want to look up the meaning. So I check an available dictionary, and I get the meaning. Mostly it will be pretty obvious. But if it is unclear, I can check the various references to see how the word is used in other contexts.

But in a digital era, I read the text, get the meaning, and if I want to check it in different contexts, I use search. Does search do exactly the same thing as such detailed references? No, it does not. It does some things better and some things worse. But it does mean that the utility of spending vast amounts of time to sort out the references is drastically diminished. So my thinking is, these are a low priority. They will be useful in only a small number of cases. And remember, this information already exists. It is not a question of having it or not. It is a question of whether it is worth the time and effort (thousands of hours) to reorganize the material in a more user-friendly way.

In addition, all such references are in forms that are inconsistent with those used on SC. So to be really useful they have to be converted to a machine-readable, SC-friendly form. Then we can auto-link them and so on. They also repeat each other, so to create a well-structured, integrated text we would need to find a way to remove duplicates. This is not a trivial task, as sometimes a reference is part of a sentence, for example, “In DN iv.22 it means…” So in such cases we can’t simply remove a duplicated reference.

Finally, there’s the grammatical information, which in various dictionaries specifies the part of speech and, in this case, gender. Note that the PED omits this in this case, as it probably assumes masculine is default. CPED is alone in saying the word can be neuter, so we can probably assume this is a mistake. Knowing the gender is useful both for students and machines. It helps to identify the possible forms (tokens) that a word might take in the language. So it should be in a consistent and machine-readable form.
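Here is a rough sketch of why consistent grammatical fields help a machine. It assumes a hypothetical `Entry` record with `pos` and `gender` fields, and uses only a partial set of endings for masculine nouns in -a; a real table would cover every case, number and stem type.

```python
from dataclasses import dataclass

# Partial, illustrative endings for masculine nouns with stems in -a.
MASC_A_ENDINGS = ["o", "aṁ", "ena", "assa", "e", "ā", "ānaṁ", "esu"]


@dataclass(frozen=True)
class Entry:
    headword: str
    pos: str     # part of speech, e.g. "noun"
    gender: str  # "m", "f" or "nt"
    gloss: str


def candidate_tokens(entry):
    """List some tokens a masculine a-stem noun may appear as in the texts."""
    is_masc_a_stem = (
        entry.pos == "noun"
        and entry.gender == "m"
        and entry.headword.endswith("a")
    )
    if not is_masc_a_stem:
        # No rule for this type in the sketch: fall back to the headword.
        return [entry.headword]
    stem = entry.headword[:-1]
    return sorted({stem + ending for ending in MASC_A_ENDINGS})


amsa = Entry(headword="aṁsa", pos="noun", gender="m", gloss="shoulder")
print(candidate_tokens(amsa))
```

If the gender were recorded inconsistently (or wrongly, as with the neuter in CPED), the generated tokens would be wrong too, so it pays to get this field right.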

At the end of all that, what have we learned? Well, basically this: the CPED entry was fine. We don’t gain anything much from including the other definitions. We can make a minor correction to grammatical information, and that’s about it. For the vast majority of users, they already have what they need, and gain nothing from the other entries. In fact, giving more information is just confusing.

This is just one case, and for each entry the issues will be slightly different. But the basic situation will be the same: combining the dictionaries in any meaningful semantic way is a difficult and time-consuming task which will, I think, have little practical benefit.

What will have benefit is to ensure that our coverage of words in the EBTs is both accurate and complete. And it was to do this that I outlined my method above. If we are able to achieve this, I think we can basically use them in a similar way to what we already do: give the basic meaning in a lookup, and more extensive discussions in the full entry.

If there is still interest in integrating the dictionary in a more comprehensive way, then perhaps this could be a phase two. Start by concentrating on getting complete and accurate entries for all words in the EBTs, then progressively enrich the information.

If we are to proceed, then obviously we need to work out some format. What that is doesn’t matter too much, it’s most important that it be usable. Personally, I don’t like databases and always work with text files. But as long as the data can be imported/exported easily it doesn’t matter too much. Ultimately I would imagine it will be stored as JSON, or perhaps simply HTML. But if it is to be a collaborative project, it is important that the format be familiar and non-breakable for non-specialists, so it may be best to use a spreadsheet. Note that for some detail work it will probably be desirable to include some form of text markup, for example, for italicizing pali words in the text.
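To show that the choice of format need not lock us in, here is a small sketch that moves the same entries between a spreadsheet-style CSV and JSON using only the Python standard library. The column names (`headword`, `grammar`, `gloss`) are my own assumption, not an agreed schema; the two sample rows are the CPED entries quoted in this thread.

```python
import csv
import io
import json

# Spreadsheet export as CSV text. Column names are hypothetical.
CSV_TEXT = """headword,grammar,gloss
aṁsa,m.; nt.,1. a part; a side; 2. shoulder.
nimitta,nt.,sign; omen; portent; cause.
"""


def csv_to_json(text):
    """Read CSV rows and serialize them as a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, ensure_ascii=False, indent=2)


def json_to_csv(text):
    """Turn a JSON array of flat objects back into CSV text."""
    rows = json.loads(text)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


as_json = csv_to_json(CSV_TEXT)
print(as_json)
print(json_to_csv(as_json) == CSV_TEXT)  # True: the round trip is lossless
```

Contributors could edit the spreadsheet while the site consumes the JSON, as long as both sides agree on the columns.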

Yes, exactly!


I like the ‘Yay!’ at the end of deciding that CPED is probably the best starting point.
(Assuming CPED = http://www.ahandfulofleaves.org/documents/Concise%20Pali%20English%20Dictionary_Buddhadatta.pdf
or
http://www.viet.net/anson/ebud/dict-pe/dictpe-01-a.htm)

I would like to help but
"You have to know Pali well, and be able to work on an admittedly dry and complex task for the long term."
I don’t know Pali well (just a few words here and there);
Dry and Complex task is okay.

PS - Good news is that I have the CPED in hardcopy (though hardly used), and had ‘fun’ locating aṁsa or Aŋsa; I found it by searching for shoulder. :smiley:
I should have looked at the online CPED first, as it seems to have ‘corrected’ the spelling (though to aṃsa?). But then I wouldn’t have been able to look it up in the hardcopy.

PPS - Just copied and pasted all the entries from the viet.net version of the CPED - 19273 entries in total, entered differently from the book (e.g. the 2nd last entry in the book). My original purpose in looking at a Pali-English dictionary was to ‘create’ a Kindle dictionary so that I can look up words in the ReadingFaithfully Pali nikayas (only 4), as in SuttaCentral, when I’m offline. I think this could be a good starting point. Will need to check the alphabets first and how to create a Kindle dictionary.

There is, I think, basically one version of the CPED available in various places and forms digitally, and this had its origins in a corrected version done by metta.lk. If we were to proceed, we would start with the version on SC, which we have in both HTML and JSON format. The JSON is already enriched with some extra information that is useful for us. Here’s the file for your interest.

cped_data.py.zip (404.5 KB)


First, it would be great to work through the words in the order of their frequency.

A list of 2000 most frequent words in Majjhima Nikaya:
http://dhamma.ru/paali/MN2000word_list.odt

Sutta-pitaka (without Khuddaka Nikaya) frequency list:
http://dhamma.ru/paali/word_list.odt

Sutta and Vinaya word frequency list:
http://dhamma.ru/paali/wordlist.zip

Kurt Schmidt
A Frequency Dictionary of Pali: Core Vocabulary for Learners
http://www.amazon.com/Frequency-Dictionary-Pali-Vocabulary-Learners/dp/1478369159/

Complete word list of all Pali words (about 967,000) as occurring in the CSCD (VRI) Tipitaka edition
http://wayback.archive.org/web/20150707075127/http://www.nibbanam.com/sortedFrequencyPali.zip


Secondly, it would be useful to rate dictionaries by quality, and use the highest quality source.

Since Margaret Cone’s dictionary is of higher quality, its entries would usually override the entries of previous dictionaries. So it makes sense to work first of all on the entries not covered in Margaret Cone’s dictionary.

CPED is of low quality, so it would be a last choice.

Thirdly, the electronic dictionary has an advantage: it can be extendable and interactive. Scholars may be able to add extended glosses and whole discussions “under the hood” of the articles. Last year members of the Pali Study group looked for a software platform that would help to preserve the results of work on the Pali terms. It would be great to build such a bridge between scholars and the public.

People won’t have to wait another hundred years before the update of the dictionary.

Fourthly, the format needs to take into account the modern state of metadata. I don’t know much about this subject, but IMHO, the Resource Description Framework would take us in the proper direction.

This is not difficult, we can simply sort the words. But I don’t think it would be useful. Most of the frequent words are well covered by the CPED already, and it is the infrequent ones that need attention.

More subtly, remember the distinction I made previously between a “word” and a “token”. If we sort words alphabetically, we end up with similar tokens near each other, and this will often correspond with words. For example, karosi and karoti will sort near each other. This is helpful so that we can recognize and organize sets of tokens into words. However, if we sort according to frequency, karoti will be more frequent than karosi and they will be separated. There will be no meaningful relation between tokens that represent the same “word”. This will, I think, make the organizational task much more difficult.
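A toy example of the difference, in Python. The counts are invented purely for illustration, and Python’s default sort is by code point, which is not true Pali alphabetical order, but it is enough to show the effect.

```python
# Token counts are invented for illustration only.
counts = {"dhamma": 90, "karoti": 50, "buddha": 30, "karosi": 5, "karomi": 3}

# Alphabetical order (by code point, not Pali alphabetical order).
alphabetical = sorted(counts)

# Most frequent first; ties broken alphabetically.
by_frequency = sorted(counts, key=lambda token: (-counts[token], token))


def gap(order, a, b):
    """How many positions apart two tokens sit in a given ordering."""
    return abs(order.index(a) - order.index(b))


print(alphabetical)  # ['buddha', 'dhamma', 'karomi', 'karosi', 'karoti']
print(by_frequency)  # ['dhamma', 'karoti', 'buddha', 'karosi', 'karomi']
print(gap(alphabetical, "karoti", "karosi"))  # 1
print(gap(by_frequency, "karoti", "karosi"))  # 2
```

Even in a list of five tokens the forms of karoti drift apart under frequency sorting; across tens of thousands of tokens they would be scattered throughout the list.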


Agreed. This is why I suggested we begin by correcting and expanding the CPED by comparing with the DOP.

Sorry, but I don’t understand this.

This is a very good point, and should be at the foundation of our efforts. What I am thinking is that we can build a “skeleton” dictionary, with the aim to create simple, accurate, comprehensive entries for all words in the Pali canon. This can then be progressively enriched and extended over time. @blake and I have already done some work in this area. The translation software that we use allows for entry of terminology. With careful design, such an approach can, I think, evolve a very useful dictionary over time.

Did you end up with a solution?

If I might add, the issue is not to find software, but to use a well-defined form of structured data. Using a consistent, predictable form of structured data, you can transform it into another format easily, and various kinds of software can assist in doing various kinds of things.

This is a good idea, yes. Currently SC doesn’t supply RDF metadata, but it would be a good addition.

It makes sense to work first of all on the P-H range of the Pali alphabet which is not yet covered by the published volumes of Margaret Cone’s dictionary.

Not at all. For example, we discussed the term “Buddha”:

but the results of this discussion remained in our closed group.

Even such widely used terms remain underexplored, with established translations which are used by habit.

That’s why I’m interested in exploring the key Pali terms - they are often mistranslated or misunderstood.

CPED, being too simplistic, is sometimes outright misleading.

Sometimes an article on the narrow contextual usage of the term, like:

Akira Hirakawa
The Relationship between Paṭiccasamuppāda and Dhātu

helps to understand what the sutta is about.

I would like for such contextual explorations to be added to the body of knowledge.

[quote=“sujato, post:24, topic:2445”]
If I might add, the issue is not to find a software, but to use a well-defined form of structured data. Using a consistent, predictable form of structured data, you can transform it into another format easily, and various kinds of software can assist in doing various kinds of things.[/quote]

Yes, indeed. I wonder how to find a balance between the ease of adding new entries, as in Wiktionary:
https://en.wiktionary.org/wiki/Category:Pali_nouns
and the transferability of the body of knowledge.

What would be its advantage over other dictionaries?

I’m thinking along the lines of Digital Dictionary of Buddhism, where scholarly community would gradually extend the dictionary.

Such frequency lists make it easy to see which terms are used in the early Buddhist texts, and which terms occur only in later literature.

Yes, indeed, to make the frequency lists perfect, one would need a Pali stemmer.
AFAIK, David Alfter has not yet made the stemmer.
( http://arxiv.org/pdf/1510.01570.pdf )
So we are left with the raw frequency lists, which are also useful.
Knowledge of several hundreds of most frequent words makes most of the text comprehensible.

CPED sometimes creates the illusion of understanding the term, with articles like:
nimitta : [nt.] sign; omen; portent; cause.
which tragically misses the meaning of the term in meditative practice.
So, IMO, it is most frequent words that require extended treatment.

Good question… I have extensive experience in database development and project management.

My experience with the Suttas is that I started reading them a little over a year ago and have found them to be … what can I say? This is it. This is what I have been looking for since I was 8 years old. So I have a passion and a thirst for knowledge and understanding. I am looking into M.A. and Ph.D. programs, people of like-mind in my area (Northern Arizona).

I also have attention to detail and comprehension, and intelligence according to the tests.

I’ve been reading on right speech and so this is a little off to be saying things about my skills. I have many faults and shortcomings. I suppose the first is that I am hesitant to list them all at the moment. I have some sort of ADHD and/or PTSD that limits my ability to memorize things. So part of my personal reasons for working on this sort of project is to make the texts more accessible and convenient in terms of being able to view as much as needed as possible on one screen so as not to rely on memory.

I also have time. I’ve been very fortunate to be able to make a living without having to spend time commuting, and am able to make enough $/hour that I don’t have to work 40 hours/week.

So, if we could agree on a course, I could and would commit to the project, with gladness and dedication.


Oh, thanks, okay, now I get it. I will respond further down.

Okay, so we are talking about two quite different things here. So we need to clarify that!

My goal (and this is something that is only becoming clear as the discussion proceeds) was to create a dictionary for basic Pali terminology. The primary use of this would be for word lookup, and thus it would extend, and hopefully complete, the range of words that were correctly identified by our Pali lookup tool. Let’s call it a Glossary rather than a Dictionary, if you like.

What you’re interested in, and if I’m not mistaken, Elissa too, is more of a dictionary of Buddhist terms. Perhaps something like Payutto’s Dictionary of numerical Dhammas, but not just numerical. There are a number of such:

And no doubt others. However, none of them, so far as I know, deal specifically with early Buddhism.

This is also a great project, and would fill another need that I have felt for SC. Let me first discuss a little how I envisage something like this being used—or at least, one application—and then consider the project itself.

One of the things we have done with the texts on SC is to remove the footnotes. I have discussed this at length elsewhere, so I won’t go into it here. But one gap this leaves us is that we end up with texts that liberally use technical terms and ideas that will be unfamiliar to readers. Someone reading a sutta and coming across the term “aggregate” is unlikely to know what this means, unless they have some background in Buddhism already.

Now, footnotes are one way to deal with this, but not a very good one, especially in a digital medium. Why? Because they explain the term once, and we need the information to be contextual. People aren’t going to read the suttas sequentially, and we shouldn’t structure our information as if they will.

So, what to do? Well, I think that in a web environment we can use several means to approach this. One of those is this very discourse site, where we can discuss things, post essays and so on. But this doesn’t give us the fine-grained ability to explain specific words in a text. For this, I envisage two things.

  1. A system of site-wide annotations, where people can write notes on specific passages, and
  2. A terminological dictionary, such as the one we are considering, which will define doctrinally significant terms in a meaningful and useful way, to be applied site-wide.

So what you’d do is, if you wanted help with terminology, turn on the terminological dictionary, then the explanations will appear as popups for the terms wherever they appear in the site. The annotations would be similar, except they apply to specific passages, not general terms.
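Very roughly, the term-matching part of this could work like the sketch below. Everything here is hypothetical: the `TERMS` table, the placeholder definition text, and the sample sentence are made up for illustration, and a real version would also need to handle plurals and Pali inflections.

```python
import re

# Hypothetical terminological dictionary: term -> explanation shown in a popup.
# The definition text is a placeholder, not a proposed entry.
TERMS = {
    "aggregate": "(explanation of 'aggregate' goes here)",
}


def find_terms(text, terms):
    """Return (start, end, matched text, explanation) for each known term.

    Matches whole words only, case-insensitively. Longer terms are tried
    first so that a multi-word term wins over a shorter one inside it.
    """
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), terms[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]


sample = "The first aggregate is form."  # made-up sample sentence
for start, end, matched, explanation in find_terms(sample, TERMS):
    print(start, end, matched, explanation)
```

The character offsets are what a front end would need in order to attach a popup to the right span of text; whether the dictionary is "turned on" is then just a display setting.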

Of course, the terminological dictionary could also be used just as well on its own, or in other ways, maybe even printed.

What is the relation between this doctrinal dictionary and the simple glossary that I was envisaging?

Well, there doesn’t have to be a relation. Perhaps they are two separate projects. Or perhaps, we start by making a simple glossary, then enriching it with further information. I think both approaches could work. The latter approach would be conceptually more satisfying; but then, Worse Is Better!


Well, just what I have been saying: clear, comprehensive, accurate.

Are we talking at cross purposes here? The list of terms that I made, and on which I was proposing we base the glossary, is just those that are found in the EBTs.

It would be possible to map out the kind of evolution you’re talking about, but you’d need to use the much larger Pali corpus at the VRI site. It would be a great thing to do. But it would be really, really hard. You’d need to accurately stem the Pali words from all periods, and not only that, but to break up compounds as well. This is why I was proposing we work only on the vocabulary actually used in the EBTs, as it is a reasonably concise task. Only 80,000 tokens!

Interesting, I wasn’t aware of this. I’ll read it carefully. In fact SC has a stemmer in JavaScript, but it works on quite simple principles. You can get maybe 90% accuracy, but beyond that it’s hard. My thinking is that, again, by restricting the corpus to the EBTs, we can avoid the hard computational problems. Use the computer to do what it can, then correct and fill in the blanks by hand. (BTW, Google does the same thing. Part of its secret is that it employs thousands of people around the world to google stuff and submit corrections …)
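Here is a minimal Python sketch of that "machine first, then by hand" workflow. To be clear, this is not SC’s actual stemmer: the suffix list, the `KNOWN_STEMS` table and the `MANUAL` table are all assumptions for illustration, covering only a few present-tense verb endings.

```python
# Illustrative verb endings only; a real list would be much longer.
SUFFIXES = ["anti", "oti", "osi", "omi", "ati", "asi", "āmi"]

# Hypothetical table from automatically derived stems to headwords.
KNOWN_STEMS = {"kar": "karoti", "gacch": "gacchati"}

# Hand-made corrections for forms the simple rules cannot handle.
MANUAL = {"akāsi": "karoti"}


def stem(token):
    """Strip the longest matching suffix; None if no suffix applies."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return None


def resolve(token):
    """Manual entries win; otherwise try the automatic stem."""
    if token in MANUAL:
        return MANUAL[token]
    return KNOWN_STEMS.get(stem(token))


tokens = ["karoti", "karosi", "gacchanti", "akāsi", "aṁse"]
resolved = {token: resolve(token) for token in tokens}
needs_review = [token for token, headword in resolved.items() if headword is None]

print(resolved)
print(needs_review)  # ['aṁse'] is left for a human to sort out
```

The point of the design is that the `needs_review` list shrinks as people fill in the manual table, so the hand work is confined to what the simple rules miss.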

Well, in this case I would disagree. I think this is fine, although I’d probably say “sign, mark, precursor, hint, cause, omen”.

The meaning of nimitta as “bright light seen in meditation” doesn’t occur in the suttas. In meditative contexts nimitta usually means “cause”, perhaps “precursor”, or even “aspect”. Lights in meditation are called simply “light” (obhāsa, pabhassara, pariyodāta, etc.) This is why, when assembling a dictionary of early Buddhist terms, we need to be diligent about rejecting later meanings.


Very good!

Not at all! I asked, and it’s important to know.

Well, that sounds fantastic.