Implementing the New Concise Pali English Dictionary

sujato · October 19, 2016, 2:38am

Thanks to the diligent work by @Russell over the past several months, we now have ready the first, very substantial, update to the Concise Pali English Dictionary.

Sadhu! Sadhu!

This involved going through the entries from G–N and checking the old dictionary against the much newer and more reliable Dictionary of Pali (DoP) by Margaret Cone. Changes consist of:

Correcting mistakes in CPED, both in the entries and the grammar.
Using the same conventions for grammar as in DoP.
Adding extra meanings or terms from DoP that are found in the EBTs (non-EBT meanings and words were left out).

This is a very important update. It preserves the portions of the DoP that are of most interest to students of early Buddhism, and makes the ever-useful concise dictionary even more useful.

As for implementation, I would propose the following:

CPED and NCPED are maintained as separate entities.
We preferentially use the entries from NCPED instead of CPED.
CPED becomes the fallback.

In other words, you never see both CPED and NCPED.
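
A minimal sketch of that fallback rule, assuming both dictionaries are available as simple headword-to-meaning mappings. The names `NCPED`, `CPED` and `lookup` are hypothetical, and the NCPED meaning is shortened for illustration; the real data lives in the project's own files.

```python
from typing import Optional, Tuple

# Hypothetical in-memory dictionaries keyed by headword (assumption:
# the real sources are HTML/JS files, not Python dicts).
NCPED = {"gacchati": "goes; moves, walks; goes away, leaves"}
CPED = {
    "gacchati": "goes; moves; walks.",
    "gacchanta": "going; moving; walking.",
}


def lookup(word: str) -> Optional[Tuple[str, str]]:
    """Return (source, meaning), preferring NCPED and falling back to CPED."""
    if word in NCPED:
        return ("ncped", NCPED[word])
    if word in CPED:
        return ("cped", CPED[word])
    return None  # the word is in neither dictionary
```

Because the function returns as soon as NCPED has the word, a reader never sees both entries for the same headword.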

Let me know if you think this is reasonable.

As @blake developed the dictionary tools, it would make more sense for him to implement this, I think. If not, hopefully @vimala can do it.

I have proofed the text and marked the definitions using the same HTML style as for the recent updates for the PTS dict. This was much easier!

Basically it means that complex entries are marked up as nested lists. One difference between the two is that the PTS formats entries by default as paragraphs, and uses lists only when necessary. This was just a pragmatic choice; by preference it would be all lists. The NCPED, on the other hand, has all entries as lists.

(Incidentally, the classes for the lists are necessary to target them for styling; ideally we could drop them as they just duplicate the list “type”. Somewhat annoyingly, though, this runs into a problem when you have both upper-case and lower-case list types; CSS attribute selectors match the type attribute case-insensitively, so you can’t differentiate type="A" and type="a". If we can find a way around this we can drop the classes for lists in the dicts.)

What needs to be done with this is as follows:

The spreadsheet needs to be properly HTML-ized (I’ve only done the definitions)
This includes the grammar; preferably I’d like to do these as with the PTS dict, with abbreviations spelled out, and placed with suitable markup at the start of the entry (this is standard for dictionaries).
Note that there is some extra information in various columns; I hope this is useful for correctly identifying forms.
Apply magic.
Rejoice!

Here’s Russell’s file with my updates. This includes the non-HTML definitions, for reference only. In addition to the HTML, the updated definitions have various corrected typos and the like, so they should be used going forward and the old ones discarded.

NewConcisePaliEnglishDictionary_G-N.ods.zip (982.5 KB)

Vimala · October 19, 2016, 3:35am

It would seem to me that it needs to be JS-ized instead.
@blake - I can do that unless there is more that needs to be considered.

sujato · October 19, 2016, 3:54am

The source files for the dictionaries are HTML, this should be the same.

Vimala · October 19, 2016, 4:07am

The lookup tool is a js dictionary.

sujato · October 19, 2016, 4:10am

We’re talking in circles. The current spreadsheet must first be converted to the appropriate HTML markup. Once the HTML file is ready, it will be processed by JS.

Vimala · October 19, 2016, 4:13am

JS does not pull the data out of the html file automatically. There are JS databases here:
https://github.com/suttacentral/suttacentral/tree/master/static/js/data

The data has to also be incorporated into these files, otherwise the lookup tool will continue to use the old data.

sujato · October 19, 2016, 4:18am

Oh, okay, I wasn’t thinking of the lookup, but the dictionary. So we have two files, one for the lookup tool and one for the dictionary results page here:

In that case, both need to be updated.

But the HTML markup is richer and more semantic; and it agrees with the markup for other dictionaries. So shouldn’t we have just one source in HTML and transform it to JS?
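
One way to keep a single HTML source and derive the lookup data from it is sketched below. It assumes entries marked up as `<dl>` blocks with the headword in `<dfn>`, the meaning in a plain `<dd>` and the grammar in `<dd class="grammar">`, and it emits JSON rather than the exact format of the JS data files, which is not reproduced here. `EntryParser` and `html_to_lookup_json` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class EntryParser(HTMLParser):
    """Collect headword -> [grammar, meaning] from <dl> entries."""

    def __init__(self):
        super().__init__()
        self.entries = {}
        self._field = None  # field that the current text belongs to
        self._current = self._blank()

    @staticmethod
    def _blank():
        return {"word": "", "grammar": "", "meaning": ""}

    def handle_starttag(self, tag, attrs):
        if tag == "dl":
            self._current = self._blank()
        elif tag == "dfn":
            self._field = "word"
        elif tag == "dd":
            is_grammar = dict(attrs).get("class") == "grammar"
            self._field = "grammar" if is_grammar else "meaning"

    def handle_endtag(self, tag):
        if tag in ("dfn", "dd"):
            self._field = None
        elif tag == "dl" and self._current["word"]:
            entry = self._current
            self.entries[entry["word"]] = [entry["grammar"], entry["meaning"]]

    def handle_data(self, data):
        # Text outside <dfn>/<dd> (e.g. newlines between tags) is ignored.
        if self._field:
            self._current[self._field] += data


def html_to_lookup_json(html: str) -> str:
    """Parse dictionary HTML and return the lookup data as a JSON string."""
    parser = EntryParser()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.entries, ensure_ascii=False)
```

Note that this simple version flattens any nested list inside a `<dd>` into one run of text, so a real conversion would need to decide how much of the list structure the lookup tool should keep.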

Aminah · October 19, 2016, 11:31am

Neat! Much thanks @Russell and the band!

Vimala · October 19, 2016, 2:37pm

I think JSON might be more the way to go here, but will leave that decision up to Blake.
Yes, we have 2 files right now. I can do that.

Gabriel_L · October 19, 2016, 2:56pm

Is there room to have it translated fully into Portuguese? Please count on me.

Vimala · October 19, 2016, 3:09pm

If you want to make a dictionary in Portuguese … sure, go for it!
All you need to do is make a spreadsheet with the words in one column and the meaning in another. If you want to be more precise, make it like the one Bhante attached above.

Russell · October 20, 2016, 4:34am

Dear Bhante @sujato,

Sadhu! Sadhu! Sadhu! May all Dhamma followers benefit from this endeavor!

I have received your e-mail and replied.

Hope everyone had a great vassa! Celebrate your good kamma to have been able to practice! Kathina is here!

with reverence and in mettā,

russ

LXNDR · October 20, 2016, 9:01pm

Is CPED being retained and maintained for the sake of the lookup tool because it’s slimmer than PTS PED?

sujato · October 21, 2016, 12:12am

That’s right. For 90% of cases, you just need the basic meaning of the word, and CPED is excellent for that.

We will update it to use the NCPED as that is completed, although that won’t be finalized until the final volume of Cone’s dictionary is published.

Good news: Russell’s going to move on to the first volume of Cone, so that should be ready in a few months.

tuvok · October 22, 2016, 8:33am

Cone’s third volume was supposed to be finished in 2017, last time I asked the publisher, so it should be soon.

sujato · October 22, 2016, 8:58am

Oh good, we shall keep a lookout.

Vimala · October 24, 2016, 4:52pm

OK. I need some input here because the file has many more columns than any of our dictionaries.
Is it OK if I just put entries like for instance “gacchati” like this:

<dl>
<dt><dfn>gacchati</dfn></dt>
<dd>
<ol type="1" class="decimal"><li>
<ol type="i" class="lower-roman"><li>(of people, animals, rivers, roads etc) goes; moves, walks; goes away, leaves; goes to (+ acc. or santike/santikaṃ etc); often with absol., e.g. ādāya ~ati, gahetvā ~ati, goes with, takes; pahāya ~ati, goes off without, leaves behind</li>
<li>goes to another existence, another birth etc</li>
<li>follows a course; follows a future course</li>
<li>goes to an activity; goes to do something</li>
<li>goes to in a sexual intercourse; has intercourse with</li></ol></li>
<li>goes to a state of condition; undergoes, reaches; obtains</li>
<li>relies on</li>
<li>the first person present is not rarely used to express an immediate or near future sense: I am going; I am going to go; we are about to go</li></ol>
</dd>
<dd class="grammar">pr. 3 sg.</dd>
</dl>

I’m just not sure what to do with the other columns in the spreadsheet like: additions, CdoP page, regular form, stem, truncated form, ebt, etc.
I could put them all in <dd class=" ..."> at the bottom but it will probably show up in the dictionary.

And how to deal with the various grammatical forms of the words?

sujato · October 25, 2016, 12:29am

We should keep @blake in the loop on this; as the maker of these tools, he has a better idea of what to do than I. But I can describe what is in the spreadsheet.

As far as the markup goes, we don’t use multiple <dd> for one <dt> (I think!). Instead, just mark up <dd class="grammar">pr. 3 sg.</dd> as a paragraph inside the main <dd>. And best to put this at the beginning of the entry; this is standard in dictionaries. For PTS dict I use

<p class="case">pr. 3 sg.</p>

I don’t like having the grammar info in abbreviated form. In the revisions for the PTS dict I spelled them out (eg SuttaCentral) where possible. With some find & replace we should be able to spell out most of the entries in NCPED without too much problem. There are some complex entries, however, which need to be handled carefully.
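
The find & replace step could look roughly like this. It is only a sketch: the table `EXPANSIONS` and the function `expand_grammar` are hypothetical, and apart from "absol." (spelled out as "absolutive" in this thread) the expansions are assumptions that would need checking against the dictionary's own abbreviation list.

```python
# Hypothetical expansion table. Only "absol." is confirmed in this thread;
# the other expansions are assumptions.
EXPANSIONS = {
    "absol.": "absolutive",
    "pr.": "present",
    "sg.": "singular",
}


def expand_grammar(abbrev):
    """Return (expanded text, tokens that were left unexpanded).

    Numeric tokens such as "3" are kept as they are and not reported.
    Anything else that is missing from the table is reported, so that
    complex entries can be reviewed by hand.
    """
    expanded, unknown = [], []
    for token in abbrev.split():
        if token in EXPANSIONS:
            expanded.append(EXPANSIONS[token])
        else:
            expanded.append(token)
            if not token.isdigit():
                unknown.append(token)
    return " ".join(expanded), unknown
```

For example, `expand_grammar("fpp mfn.")` leaves the text unchanged and reports both tokens, which flags the entry for manual handling rather than guessing.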

One detail I am not sure about is the use of “mfn.” This is essentially used for adjectives, as they are declined according to the gender of the relevant noun. So it doesn’t really mean “masculine/feminine/neuter”, it means “undefined gender”, or better “inherited gender”. But I am guessing there is a reason why they don’t just use “adjective”; probably the category is broader and there are things that are not strictly speaking adjectives. Suggestions welcome!

Most of the extra columns can be safely ignored. They were just there as part of the development process.

The “meaning”, “ebt” and “sujato terms” columns should be ignored. Likewise the “CDoP” page column, this is just a reference.

The main columns are the “word”, “grammar” and “HTML meaning”.

The columns “regular form”, “stem”, “truncated form” are there as potential assists for the lookup. “stem” and “truncated form” are probably useless and best ignored for now. “Stem” was an idea we never followed through. “Truncated form” does what the JS lookup tool does anyway, slice off the end of a word. In principle, having a hand-curated list of such forms is more reliable than a machine-generated process, but I don’t think it has been developed sufficiently yet.
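
To make the "truncated form" idea concrete, here is a rough sketch of how a hand-curated list could be used. It is not how the existing JS lookup works in detail (that logic is not shown in this thread); the function name, the `min_stem` cutoff and the sample truncated form "gaccha" are all assumptions.

```python
def lookup_by_truncation(word, truncated_index, min_stem=3):
    """Slice characters off the end of `word` until a curated form matches.

    `truncated_index` maps a truncated form to its headword. Returns the
    headword, or None if nothing matches before the word shrinks below
    `min_stem` characters.
    """
    for end in range(len(word), min_stem - 1, -1):
        candidate = word[:end]
        if candidate in truncated_index:
            return truncated_index[candidate]
    return None


# Hypothetical curated data: truncated form -> headword.
TRUNCATED = {"gaccha": "gacchati"}
```

With this data, `lookup_by_truncation("gacchanti", TRUNCATED)` tries "gacchanti", "gacchant", "gacchan" and then "gaccha", which maps to gacchati.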

The role of the “regular form” is the main entry or fallback for the various declined forms.

Checking the dictionary now, I see that Gabriel has not used this very much, preferring to use “(same as …)” or “(see …)” in the meaning column.

Perhaps the best approach would be to extract these from the meaning (or HTML meaning) column and populate the “regular form” column, or put them in a new column. Fortunately Gabriel has been very consistent and clean with this, so it should be simple. For now, let’s assume we have a new column called “main”. It should work something like this.

When I hover on a word, or see a result on /define/, I don’t want to see “(see xyz)”. I want to see the meaning of the word. So that should be there with the relevant grammatical info.

Let’s assume I look at garukatvā. This currently says “(see garukaroti)”. But we will have extracted this to the “main” column, which will just say “garukaroti”. So when I search for or hover over garukatvā, it shows me the entry for garukaroti instead. However, since these are distinct grammatical forms, it should show me the grammar for garukatvā. In other words, the entry for garukatvā would be:

garukatvā
absol.
honors; venerates; esteems; treats as important

However, it is probably a good idea in such cases to indicate the more basic form. And let’s spell out the grammar too. So we have:

garukatvā
absolutive from garukaroti
honors; venerates; esteems; treats as important

This also solves an awkward problem with the entries, which is that such derived meanings are not properly declined in the entries. Technically speaking, we should have:

garukatvā
absolutive from garukaroti
having honored; having venerated; having esteemed; having treated as important

However it is a massive amount of careful work to do this for each entry, with little benefit. If we use “absolutive from garukaroti” it shows that the entry form is derived from another source, and should be adapted according to the appropriate grammatical case.
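
As a sketch of that display rule, assuming each entry is a small record with "grammar", "meaning" and the proposed "main" field. The record layout, the function name `render_entry` and the grammar shown for garukaroti are assumptions for illustration.

```python
# Hypothetical entry store. "main" is the proposed new column; the grammar
# given for garukaroti is an assumption.
ENTRIES = {
    "garukaroti": {
        "grammar": "present 3rd singular",
        "meaning": "honors; venerates; esteems; treats as important",
    },
    "garukatvā": {"grammar": "absolutive", "main": "garukaroti", "meaning": ""},
}


def render_entry(word, entries=ENTRIES):
    """Return the three display lines: headword, grammar, meaning."""
    entry = entries[word]
    main = entry.get("main")
    if main and main != word and main in entries:
        # Derived form: keep its own grammar, borrow the main entry's meaning.
        grammar = f"{entry['grammar']} from {main}"
        meaning = entries[main]["meaning"]
    else:
        grammar, meaning = entry["grammar"], entry["meaning"]
    return "\n".join([word, grammar, meaning])
```

The meaning is deliberately left in the form given under the main entry, in line with the point above that re-declining every gloss is not worth the effort.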

Now, in this case the existing “(see garukaroti)” is redundant, for garukaroti already exists in the “regular form” column. So whatever, we can filter out the redundancy.
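
Pulling the references out of the meaning column might be done along these lines. This is a sketch that assumes the references always take the form "(see X)" or "(same as X)"; the pattern `REF` and the function `main_for` are hypothetical names.

```python
import re

# Assumed reference format: "(see X)" or "(same as X)".
REF = re.compile(r"\((see|same as)\s+([^)]+)\)")


def main_for(meaning, regular_form=""):
    """Return (main, cleaned meaning) for one spreadsheet row.

    An explicit reference in the meaning wins over the "regular form".
    When both name the same word the result is identical, so the
    redundancy disappears on its own.
    """
    match = REF.search(meaning)
    if not match:
        return regular_form or None, meaning
    # Drop the reference and tidy leftover separators.
    cleaned = (meaning[:match.start()] + meaning[match.end():]).strip(" ;")
    return match.group(2).strip(), cleaned
```

For a row like "the state of being a wife; marriage (see jāyā)" this returns jāyā as the main word and keeps the gloss, so mixed entries are not lost.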

In some cases it is more complex than that. In general, the principle should be similar to how CSS works, in that the more specific meaning should be preferred.

Consider the entry for gahīta (long ī). This says “(same as gahita)” (short i). It also has gaṇhāti as the “regular form”. In this case, it would be best to display the entry for gahita. They both have the same grammar.

In turn, gahita has gaṇhāti as its “regular form”. There is no need to do anything about this, as gahita already has its meaning defined. It might be considered useful to have a link back to the “regular form” from this result; but this is an enhancement.
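
The "most specific wins" rule could be sketched as follows. The field names ("meaning", "reference", "regular") and the function `display_target` are assumptions, and the meanings are placeholders, since the actual glosses for these words are not quoted in this thread.

```python
# Placeholder data: the meanings below are stand-ins, not real glosses.
ENTRIES = {
    "gahīta": {"meaning": "", "reference": "gahita", "regular": "gaṇhāti"},
    "gahita": {"meaning": "(meaning of gahita)", "regular": "gaṇhāti"},
    "gaṇhāti": {"meaning": "(meaning of gaṇhāti)", "regular": "gaṇhāti"},
}


def display_target(word, entries=ENTRIES):
    """Pick the entry whose meaning should be shown for `word`.

    Order of preference: the word's own meaning, then an explicit
    "same as"/"see" reference, then the "regular form".
    """
    entry = entries[word]
    if entry.get("meaning"):
        return word
    for key in ("reference", "regular"):
        target = entry.get(key)
        if target and target in entries:
            return target
    return word  # nothing better is known
```

So gahīta resolves to gahita rather than gaṇhāti, while gahita simply shows its own meaning.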

Oh, and another enhancement: it would be nice to have a list of definitions for these grammatical terms, displayed as a popup on the entries!

Vimala · October 29, 2016, 1:57pm

This is what is used in the cped.html now. Would you like me to change that in cped?

<dl>
<dt><dfn>gacchanta</dfn></dt>
<dd>going; moving; walking.</dd>
<dd class="grammar">pr.p. of gacchati</dd>
</dl>

I take it you mean “Russell” …

Now the lookup tool (and cped in the dictionary results) has abbreviated meanings. For instance, right now it says:
'gacchati':["gam + a","goes; moves; walks."],

But when hovering over the word, you can click on “gacchati” and that takes you to the dictionary entry SuttaCentral.

If I were to change the lookup tool to Russell’s update, it would become much too long.

A word like gacchanta is now listed in the lookup tool and dictionary pages as:
'gacchanta':["pr.p. of gacchati","going; moving; walking."],

Russell’s update, by contrast, has (see gacchati). Changing that to the meaning of gacchati would again be much too long for the lookup tool, and I feel that even for the dictionary pages, it might be good to keep the rendering of the existing cped in there as well.

So in short, I think it might be best to leave the lookup-tool as-is and just make a new html file called ncped.html with the comments/changes you have specified.
On the other hand, there are certain words, like garukatvā that do not appear in the lookup tool.
So maybe the thing to do here is to go over all the words that are not mentioned in the lookup tool and incorporate them. What do you think?
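
Finding those words is a simple set difference. A small sketch, with a hypothetical function name and sample word lists standing in for the real data:

```python
def missing_from_lookup(ncped_words, lookup_words):
    """Return NCPED headwords that the lookup tool does not have yet."""
    return sorted(set(ncped_words) - set(lookup_words))


# Sample lists for illustration only.
ncped_sample = ["gacchati", "garukatvā"]
lookup_sample = ["gacchati", "gacchanta"]
```

Here `missing_from_lookup(ncped_sample, lookup_sample)` gives just garukatvā, which is the kind of list that could then be reviewed and added.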

While working on this, I found another problem. For instance (converted to csv):
chetabba,fpp mfn.,,"<ol type=""1"" class=""decimal""><li>(see chettabba)</li></ol>"

But then you go to “chettabba” and there it says:
chettabba,fpp mfn.,chindati,"<ol type=""1"" class=""decimal""><li>(see chindati)</li></ol>"

So you end up here:
chindati,pr. 3 sg.,chindati,"<ol type=""1"" class=""decimal""><li>cuts; chops; cuts off; cuts out; inscribes</li><li>cuts down; destroys; removes</li><li>crosses (water)</li></ol>"

So, should both chettabba and chetabba be listed like this?:

chetabba,fpp mfn.,,"<ol type=""1"" class=""decimal""><li>cuts; chops; cuts off; cuts out; inscribes</li><li>cuts down; destroys; removes</li><li>crosses (water)</li></ol>"

chettabba,fpp mfn.,chindati,"<ol type=""1"" class=""decimal""><li>cuts; chops; cuts off; cuts out; inscribes</li><li>cuts down; destroys; removes</li><li>crosses (water)</li></ol>"
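
To show what following such a chain involves, here is a sketch that reads the three rows above with Python's csv module and walks the references until it reaches an entry with a real meaning. The column order (word, grammar, regular form, HTML meaning) is taken from these rows; the function names and the loop guard are my own assumptions, not part of the existing tools.

```python
import csv
import io
import re

ROWS = '''chetabba,fpp mfn.,,"<ol type=""1"" class=""decimal""><li>(see chettabba)</li></ol>"
chettabba,fpp mfn.,chindati,"<ol type=""1"" class=""decimal""><li>(see chindati)</li></ol>"
chindati,pr. 3 sg.,chindati,"<ol type=""1"" class=""decimal""><li>cuts; chops; cuts off; cuts out; inscribes</li><li>cuts down; destroys; removes</li><li>crosses (water)</li></ol>"
'''

# Matches a meaning that is nothing but a single reference.
ONLY_REF = re.compile(r"^<ol[^>]*><li>\((?:see|same as) ([^)]+)\)</li></ol>$")


def load(text):
    """Read CSV rows into a dict keyed by headword."""
    return {
        row[0]: {"grammar": row[1], "regular": row[2], "html": row[3]}
        for row in csv.reader(io.StringIO(text))
        if row  # skip blank lines
    }


def resolve(word, entries):
    """Follow pure references; return (final word, words visited).

    The final word is None if the chain points at a missing entry or
    loops back on itself.
    """
    seen = []
    while word in entries and word not in seen:
        seen.append(word)
        match = ONLY_REF.match(entries[word]["html"])
        if not match:
            return word, seen  # this entry has a real meaning
        word = match.group(1)
    return None, seen
```

Each derived form keeps its own grammar ("fpp mfn.") while the meaning comes from the end of the chain, which is the same pattern as the garukatvā example earlier.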

And what to do with things like:

(see chindati); [quasi 4th class, in form identical with pass.; the meanings overlap], (intrans.) breaks, breaks off; dries up, comes to an end

or

the state of being a wife; marriage (see jāyā)

Do you want to have the full text of the word it refers to also in there, or just a link to the definition of that word?

I’m wondering if it will not be easier to keep all the (see …) and then make the word a link to the respective definition instead.
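
A quick sketch of that linking approach. It assumes the definition pages live under a /define/ path followed by the word, and that a reference names a single word; both are assumptions, as are the names `SEE` and `linkify`.

```python
import re
from html import escape
from urllib.parse import quote

# Assumed reference format: "(see X)" or "(same as X)" with one word X.
SEE = re.compile(r"\((see|same as) ([^)\s]+)\)")


def linkify(meaning):
    """Turn each referenced word into a link to its definition page."""

    def repl(match):
        kind, target = match.group(1), match.group(2)
        # Percent-encode the word for the URL, HTML-escape it for display.
        href = "/define/" + quote(target)
        return f'({kind} <a href="{href}">{escape(target)}</a>)'

    return SEE.sub(repl, meaning)
```

Text that is not a reference, such as "(intrans.)" or "(water)", is left untouched, so mixed entries keep their own glosses alongside the link.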

sujato · October 30, 2016, 1:04am

Yes, it doesn’t seem to be invalid HTML, but it is unsemantic. Multiple <dd> tags, I believe, should be used if there are multiple definitions of the same term, not for giving extra information for the same term. Best to use paragraphs within the <dd>.

You take it correctly!

As far as the details of the implementation go, I see the problem, but I’m reluctant to say too much. I think we need @blake’s input here, as I have no real understanding of how the logic of the lookup and so on work. It may well be that the tools need to be implemented differently.