CPED: some questions

Hi @blake, I have a few technical questions re the CPED.

  1. We have two files, an HTML file in /data/dicts, and a JSON file in /dicts. They seem to have the same source, but the JSON is richer. May I ask, why do we have two versions? Are they both actually used?

  2. I presume that, if we are to embark on improving this by creating a Dictionary of Early Pali (DEP), we should use the JSON file. Mostly it is clear enough, but there are a few things I don’t understand.

  • Entries are of the form ['ariyasacca', 1, "nt.", and so on. But I can’t figure out why the “1” is there. It’s not mentioned in the first line, which defines the columns, and each entry is the same, so it seems to serve no purpose. Should we eliminate it?
  • The Defn is in Velthuis. Should we not transform this to Unicode?
  • What does the “Source” refer to?
  • I don’t understand everything that’s in the InflectInfo column. Is there somewhere that explains it?
  • BaseWord and BaseDefn seem to be wrongly implemented. It seems that the idea is that when you have many terms derived from a Base Word, you list the Base Word, and the Base Defn should, I’m guessing, tell you which of those words is the base definition. But it doesn’t: all are simply assigned “1”, so this doesn’t give you any extra information. I’m assuming we should use “0” for the word considered the basic term, and “1” for derivations. Then BaseWord and BaseDefn taken together can tell you which words belong together, and which of them is the base term.
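  If we do adopt that convention, the grouping logic is simple. A minimal Python sketch, assuming the 0/1 scheme proposed above (the entry values here are made up for illustration):

```python
# Sketch of the proposed BaseWord/BaseDefn scheme: BaseDefn 0 marks the
# base term, 1 marks a derivation. All entry values are illustrative.
from collections import defaultdict

entries = [
    # (Word, BaseWord, BaseDefn)
    ("karoti", "karoti", 0),   # the base term
    ("katvā",  "karoti", 1),   # a derivation
    ("kāraka", "karoti", 1),   # a derivation
]

groups = defaultdict(list)
for word, base_word, base_defn in entries:
    groups[base_word].append((word, base_defn))

# Taken together, BaseWord tells us which words belong together,
# and BaseDefn tells us which of them is the base term.
for base_word, members in groups.items():
    base = [w for w, flag in members if flag == 0]
    derived = [w for w, flag in members if flag == 1]
    print(base_word, "base:", base, "derived:", derived)
```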
  3. If we are to improve the file, we will want to add some data, as well as correcting what is there. Some changes:
  • We will want to indicate what terms/meanings are found in the EBTs. I guess an extra column in which we can mark terms/forms that don’t appear in the EBTs. These will be effectively excluded from the DEP, but we might as well keep the data around for now, just in case. What to do about meanings, where the word appears but one or more of the meanings ascribed to it doesn’t? Just edit the entry, I suppose. This will only be relevant in a few cases, but they will be cases of doctrinal significance. The CPED was explicitly designed to represent the meanings as given in the commentaries.
  • The BaseWord should allow for multiple entries, in the case of compounds. For example akataññū should have as base words karoti and jānāti (perhaps “a” as well, as this is a listed entry).
  4. What I am envisaging is that we first make a corrected version of the text, which fixes mistakes found in the digital file (and some in the printed edition). Next we can further expand/correct entries by comparing them with DOP especially. But at some stage we will want to match the CPED entries with the actual words as found in the EBTs, with the aim of creating a dictionary that will give us 100% coverage of the EBTs. I’ve made a word list of the EBTs; probably there are better ways of doing this, but anyway, this is not too hard. What we need is a way of automatically relating the dictionary entries to the word list. Kind of like a reverse word lookup? Anyway, how this is done, I don’t know, but let’s assume that we can associate the words in our word list with the words in the dictionary as best as possible. In the current lookup we get maybe 90% accuracy. There are around 80,000 unique words, so maybe something in the order of 10–20,000 corrections will need to be made by hand. In this scenario, meanings are assigned to “tokens”, not words, so there will be cases where one token (e.g. sati) maps onto multiple words (BaseWords sati and atthi), and of course cases where the same word/token has multiple meanings. At this stage we can’t disambiguate these. Anyway, does this sound like a reasonable procedure to you?

  5. Finally, if we are to do this, it seems to me that it will be better to use this DEP for making my terminology changes, rather than a Pootle terminology file. That is to say, as I go I can add my renderings and so on to the DEP, although terminology still has its uses.
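As a rough illustration of the matching step described above (relating the EBT word list to dictionary headwords), here is a Python sketch. The headwords, endings, and fallback rule are all made up for the example; a real tool would need proper inflection rules, and the unmatched remainder is what would need hand correction:

```python
# Toy reverse word lookup: exact match first, then a crude
# suffix-stripping fallback. Everything here is illustrative.
headwords = {"sati", "atthi", "sacca", "ariya"}
word_list = ["sati", "saccaṃ", "ariyassa", "xyz"]

# Naive inflection endings to strip; a real tool needs real rules.
endings = ["ssa", "ṃ", "assa"]

def match(token):
    if token in headwords:
        return token
    # Try longer endings first so "assa" beats "ssa".
    for e in sorted(endings, key=len, reverse=True):
        if token.endswith(e):
            stem = token[: -len(e)]
            for h in headwords:
                if h.startswith(stem):
                    return h
    return None  # left for human correction

matched = {t: match(t) for t in word_list}
```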

Finally finally, what do you think about this project? Does it sound like something doable?

The JSON version is a straight dump from DPR. The other columns are mostly automatically derived from the term and grammar (you will doubtless notice the stem is often wildly incorrect when it’s not as simple as slicing the suffix off). Elsewhere in DPR there is a table of irregular-to-regular mappings, which makes the base word column not useful (although you could include the base word if you want to make a unified dictionary in a single file).

The HTML version is basically one with all the superfluous columns stripped out. It is standardized into the same format as the other dictionaries, using DL/DT/DD HTML markup. If I had a reason for doing that, it would be that it renders nicely by default and can be fed directly to a browser without further modification. Also, whenever dealing with structured text you should very strongly consider not using JSON, as XML and HTML are formats designed for structured text. You can put HTML markup in JSON strings, but then you have a dual-format file which editors can’t help you much with (they can syntax-highlight the JSON or the HTML, but not both), and which can’t be displayed properly by anything. This is not a big deal for a file like the CPED, which has no inline markup, but really, why wouldn’t you want the option of having inline markup?

The ideal format for a simple dictionary is a spreadsheet, because you get actual columns. It’s trivial to convert a spreadsheet to JSON for digestion by computers, so there’s no point editing it in JSON. If you want inline markup you can still use a spreadsheet: use bold/italic in LibreOffice, export as HTML (or import into Python from ODT), and perform some cleanup. Spreadsheets are a good tool for this kind of task.

If for the moment we forget about the existing formatting, it sounds like what we want is:

  1. A way to deal with multiple meanings.
  2. A way to deal with irregular forms, mapping each irregular form to its regular form. This relationship is one of equivalence of meaning; there’s no need for a separate meaning for an irregular form.
  3. A way to deal with compounding, as compounds have individual meaning which is more than the sum of the parts; in this case you want the parts to be “see also”.

I think we can do this with 5 columns (with a 6th for EBT usage). Angle brackets indicate optional fields, and there would be two basic forms:

[word, grammar, <regular form>, stems, meaning]
[word, grammar, regular form, <stems>, <meaning>]

The first form is for a regular word: stems should be +joined if there is more than one (+joined because they all apply simultaneously), and grammars should be comma-joined if there is more than one (comma-joined because only one of them applies in a given instance).

ayya,m.a,,ariy,a noble person
ariyasacca,nt,,ariy+sacc,a noble truth

The second form is for irregulars:

katvā,abs,karoti,,having done

If the stem or meaning is omitted, it inherits from the regular form; for the sake of clarity it might still be desirable to include a meaning.
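A sketch of that inheritance rule in Python (column names and entries illustrative): an irregular entry with an empty stem or meaning falls back to its regular form’s entry.

```python
# Irregular entries inherit empty fields from their regular form.
entries = {
    "karoti": {"grammar": "v", "regular": "", "stems": "kar", "meaning": "does"},
    "katvā":  {"grammar": "abs", "regular": "karoti", "stems": "", "meaning": ""},
}

def resolve(word):
    e = dict(entries[word])          # copy, don't mutate the table
    base = entries.get(e["regular"])
    if base:
        for field in ("stems", "meaning"):
            if not e[field]:         # empty field inherits from base
                e[field] = base[field]
    return e
```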

In the case where a word can have completely different derivations with distinct meanings and stems (such as ‘sati’), the best way is to have multiple distinct entries - this slightly complicates lookup but makes the data format much less muddy.

The final thing to consider here is the grammar column. Ultimately the grammar column should determine what kinds of transformations are legal. For example, a lookup tool should never be allowed to transform a word into an indeclinable: if you ask a lookup tool what “athati” is, it will say “atha” (because one of the things it understands is that ati can be turned into a). Now, athati isn’t a real word, but even if it were, it shouldn’t be allowed to be transformed into “atha”, since atha is an indeclinable. Some words (particularly irregular ones) will also be illegitimate targets for transformation.
Ideally the grammar column could give complete information on everything the word can be transformed into, so you can use it abhidhamma-wheels style to generate all the possible forms.
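Here is a hedged sketch of that grammar-gated lookup in Python. The lexicon and transformation rules are toy examples, but they show how marking atha as an indeclinable blocks the bogus athati-to-atha transformation:

```python
# Grammar-gated lookup: a transformation is only accepted when the
# target's grammar permits it; indeclinables are never valid targets.
lexicon = {"atha": "ind", "karoti": "v"}

# (strip_suffix, add_suffix) transformations, e.g. "athati" -> "atha".
rules = [("ati", "a"), ("tvā", "roti")]

def lookup(token):
    results = []
    if token in lexicon:
        results.append(token)
    for strip, add in rules:
        if token.endswith(strip):
            candidate = token[: -len(strip)] + add
            grammar = lexicon.get(candidate)
            # Never transform a word into an indeclinable.
            if grammar and grammar != "ind":
                results.append(candidate)
    return results
```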

All in all I think it’s a good idea. One thing we should be clear about is that it’s a dictionary designed for machine consumption. That doesn’t mean a machine can’t consume it and spit out a version designed to be shown as a webpage or printed, but we want a high level of consistency so machines can accurately cross-reference stuff without guessing.

Thanks so much, let me address a few issues.

Re formatting, I agree, using a spreadsheet is the way to go.

This is probably not always the case, so it’s good that there is the option to include meanings here. I can’t think of any examples right now, but I think there are idioms where irregular forms have specialized meanings.

Basically this sounds fine.

Not sure what you mean here: do you mean, a meaning that’s specific to the EBTs?

Another possible column is a “terminology” column. Here we could put the regular terms as used in translations. Then it can be used as a reverse lookup for translated texts. This would only be used for the 1,000-ish defined terms.

In this case, do we mark these somehow, or simply repeat the headword?

What about the opposite case: multiple spellings of the same headword? Do we give these separate entries, or use a comma-joined list in the “word” column?

I’m not sure I’m following this: katvā is a regular absolutive.


What I’m interested in is this: how do we mash up the “spat out” terms with the actual terms found in the texts? Given that we won’t ever get 100% with algorithms, can we use the grammar-spitter recursively with human checking? Spit out a wheel, get humans to check it against the actual terms found in EBTs, then use that to improve the spitter, and hopefully end up with a “perfect” grammarizer for the EBTs. This has the advantage of working with a sane data set, so is a finishable task.
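For the measuring side of that loop, set arithmetic is enough. A sketch with made-up form lists (the words and the resulting coverage figure are illustrative only):

```python
# Check-and-improve loop: generate ("wheel up") forms, compare against
# the corpus word list, and queue the mismatches for human checking.
generated = {"katvā", "karitvā", "katvāna"}   # illustrative wheel output
corpus = {"katvā", "karitvā", "karitvāna"}    # illustrative EBT tokens

covered = corpus & generated
unexplained = corpus - generated     # humans check these
overgenerated = generated - corpus   # generated but never attested

coverage = len(covered) / len(corpus)
```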

It might be a bad example, or perhaps it’s a good example: in DPR an irregular form is basically one for which the Pali analysis tools lack the sophistication to join the dots; that is, they have trouble figuring out that katvā is the absolutive of karoti.

I think according to the rules it would (should) try “karitvā” or “karotvā”; in particular, when starting with katvā it can’t figure out that it has to slice off the tvā and replace it with “roti”. In reality what it actually tries is “kati” (slice off the tvā and replace it with ti).

In this case, starting with the stem kṛ or ka[r], katvā actually looks just as regular as karoti. But DPR does not work in terms of stems - it tries to reduce to the third person singular verb form found in dictionaries, and it tries to do so in a single step, while this is clearly a two-step process.

So there are probably two classes of irregular: forms which the matching tools aren’t sophisticated enough to join the dots on, but which do follow ultimately straightforward rules (at least starting from the stem); and forms which are so irregular no computer could ever hope to join the dots.

I think the limitation is more the dictionaries - if there was explicit stem information then it would be easier.

Like you could imagine entries like this:

ka(r), 6-o, does/makes, done/made, do/make, doing/making, doer/maker

Which goes: stem, conjugation, present, past/absolutive/infinitive, future, present participle, adjective/noun

Then it can wheel up possible absolutive forms:
katvā karitvā (kaitvā) (kartvā) katiya (kaiya) katya (karya) (kartya) katvāna katūna - the bracketed forms would have additional sandhi rules applied.
Any of which would be rendered as “having done or made”.

The related transformation rule would look something like this:
absolutive, verb-any, (i)tvā/(t)ya/tvāna/tūna, having ${absolutive}
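To illustrate, a small Python sketch that expands a rule of this shape, treating parenthesised parts as optional and “/” as separating alternative suffixes. This is my reading of the notation, not an existing tool:

```python
# Expand a wheel rule: "(i)tvā/(t)ya/tvāna/tūna" applied to stem "ka(r)".
# Parenthesised parts are optional; "/" separates alternative suffixes.
import itertools
import re

def expand(pattern):
    # re.split with a capturing group keeps the optional parts:
    # even indices are literal text, odd indices are optional groups.
    parts = re.split(r"\((.*?)\)", pattern)
    choices = [[p] if i % 2 == 0 else ["", p] for i, p in enumerate(parts)]
    return ["".join(c) for c in itertools.product(*choices)]

suffixes = [s for alt in "(i)tvā/(t)ya/tvāna/tūna".split("/")
            for s in expand(alt)]
stems = expand("ka(r)")                      # -> "ka", "kar"
forms = sorted(stem + suf for stem in stems for suf in suffixes)
```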

And by using other transformation rules it could also wheel up 6th-conjugation verbs, causatives, infinitives, all the participles, and derived adjective forms, and know how to render a human-readable description.

That would be the ultimate machine dictionary as a supplementary resource to existing word dictionaries, applying a shotgun approach to generate precise (though possibly not accurate) generic definitions for all well-formed verbs and nouns.

Now - that is pretty much just an idea I came up with - but I don’t see why it wouldn’t work; and because it relies on expanding “wheels”, the amount of base data wouldn’t be that large: the conjugation/declension rules fit on a few A4 pages, and there aren’t that many stems.

I suspected as much. In fact, this is exactly what an irregular form is anyway: any form that the grammar is not sophisticated enough to account for. Which means that the two classes of irregular are essentially the same thing, if the program accurately represents the formal grammar.

So I had a look around this morning, and found this pdf:

pali_roots.pdf (2.4 MB)

Which is a pretty comprehensive survey of Pali roots, based, of course, on the later grammars (Saddanīti).

I sucked out the text, eliminated the Spanish and Sanskrit, converted to Unicode, and made a nice HTML file. q.v.

pi_roots.html.zip (20.2 KB)

So that’s cute, right?