Hi @blake, I have a few technical questions re the CPED.
-
We have two files, an HTML file in /data/dicts, and a JSON file in /dicts. They seem to have the same source, but the JSON is richer. May I ask, why do we have two versions? Are they both actually used?
-
I presume that, if we are to embark on improving this by creating a Dictionary of Early Pali (DEP), we should use the JSON file. Mostly it is clear enough, but there are a few things I don’t understand.
- Entries are of the form
['ariyasacca', 1, "nt.",
and so on. But I can’t figure out why the “1” is there. It’s not mentioned in the first line, which defines the columns, and each entry is the same, so it seems to serve no purpose. Should we eliminate it?
- The Defn is in Velthuis. Should we not transform this to Unicode?
- What does the “Source” refer to?
- I don’t understand everything that’s in the InflectInfo column. Is there somewhere that explains it?
- BaseWord and BaseDefn seem to be wrongly implemented. It seems that the idea is that when you have many terms derived from a base word, you list it in BaseWord, and BaseDefn should, I’m guessing, tell you which of those words is the base term. But it doesn’t: all are simply assigned “1”, so this doesn’t give you any extra information. I’m assuming we should use “0” for the word considered the basic term, and “1” for derivations. Then BaseWord and BaseDefn taken together can tell you which words belong together, and which of them is the base term.
- If we are to improve the file, we will want to add some data, as well as correcting what is there. Some changes:
- We will want to indicate what terms/meanings are found in the EBTs. I guess we could add an extra column to mark terms/forms that don’t appear in the EBTs. These will be effectively excluded from the DEP, but we might as well keep the data around for now, just in case. What should we do about meanings, where the word appears but one or more of the meanings ascribed to it doesn’t? Just edit the entry, I suppose. This will only be relevant in a few cases, but they will be cases of doctrinal significance, since the CPED was explicitly designed to represent the meanings as given in the commentaries.
- The BaseWord should allow for multiple entries, in the case of compounds. For example akataññū should have as base words karoti and jānāti (perhaps “a” as well, as this is a listed entry).
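
To make a few of the points above concrete, here is a rough Python sketch of what a clean-up pass might look like. It is only an illustration under assumptions: I am treating each entry as a dict keyed by column name, the `NotInEBT` flag and the `clean_entry` helper are names I made up, and the example entry is invented rather than copied from the file. The Velthuis mapping covers only the lowercase Pali letters. A blind replacement like this could also mangle English text in a definition (an abbreviation such as "p.m." would lose its ".m"), so the output would need checking by hand.

```python
import re

# Velthuis -> Unicode pairs for Pali (lowercase only, for brevity).
VELTHUIS_TO_UNICODE = {
    "aa": "ā", "ii": "ī", "uu": "ū",
    ".t": "ṭ", ".d": "ḍ", ".n": "ṇ", ".m": "ṃ", ".l": "ḷ",
    '"n': "ṅ", "~n": "ñ",
}

# One alternation over all Velthuis sequences, each escaped literally.
_VELTHUIS_RE = re.compile("|".join(re.escape(k) for k in VELTHUIS_TO_UNICODE))


def velthuis_to_unicode(text):
    """Replace every Velthuis sequence in text with its Unicode letter."""
    return _VELTHUIS_RE.sub(lambda m: VELTHUIS_TO_UNICODE[m.group(0)], text)


def clean_entry(entry):
    """Return a cleaned copy of one entry (hypothetical dict layout).

    - Defn is converted from Velthuis to Unicode.
    - BaseWord always becomes a list, so compounds can name several bases.
    - NotInEBT (an assumed new column) defaults to False.
    """
    cleaned = dict(entry)
    cleaned["Defn"] = velthuis_to_unicode(entry.get("Defn", ""))
    base_words = entry.get("BaseWord") or []
    if isinstance(base_words, str):
        base_words = [base_words]
    cleaned["BaseWord"] = list(base_words)
    cleaned.setdefault("NotInEBT", False)
    return cleaned


if __name__ == "__main__":
    # Invented example entry; the Defn text is a placeholder, not CPED data.
    example = {
        "Word": "akataññū",
        "Defn": "see kata~n~nuu",
        "BaseWord": ["karoti", "jānāti"],
    }
    print(clean_entry(example))
```

Storing BaseWord as a list even when there is only one base word keeps the structure uniform, so later code never has to check whether it has a string or a list.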
-
What I am envisaging is that we first make a corrected version of the text, which fixes mistakes found in the digital file (and some in the printed edition). Next we can further expand/correct entries by comparing them with DOP especially. But at some stage we will want to match the CPED entries with the actual words as found in the EBTs, with the aim of creating a dictionary that will give us 100% coverage of the EBTs. I’ve made a word list of the EBTs; there are probably better ways of doing this, but it is not too hard. What we need is a way of automatically relating the dictionary entries to the word list, kind of like a reverse word lookup. How this is done, I don’t know, but let’s assume that we can associate the words in our word list with the words in the dictionary as well as possible. In the current lookup we get maybe 90% accuracy. There are around 80,000 unique words, so maybe something in the order of 10,000–20,000 corrections will need to be made by hand. In this scenario, meanings are assigned to “tokens”, not words, so there will be cases where one token (e.g. sati) maps on to multiple words (BaseWords sati and atthi), and of course cases where the same word/token has multiple meanings. At this stage we can’t disambiguate these. Anyway, does this sound like a reasonable procedure to you?
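
For what it's worth, the token matching could be sketched roughly like this in Python. This is not how the current lookup works, just one possible approach: it assumes we already have, for each base word, a list of its inflected forms (where those come from is the hard part, and is not shown). The function names and the tiny data set are made up for illustration.

```python
from collections import defaultdict


def build_form_index(forms_by_base):
    """Invert {base word: [forms]} into {form: set of base words}."""
    index = defaultdict(set)
    for base_word, forms in forms_by_base.items():
        for form in forms:
            index[form].add(base_word)
    return index


def match_tokens(tokens, index):
    """Split tokens into those with at least one base word and those with none."""
    matched = {}
    unmatched = []
    for token in tokens:
        bases = index.get(token)
        if bases:
            # A token may belong to several base words; keep them all,
            # since we can't disambiguate at this stage.
            matched[token] = sorted(bases)
        else:
            unmatched.append(token)
    return matched, unmatched


def coverage(matched, unmatched):
    """Fraction of tokens that found at least one dictionary entry."""
    total = len(matched) + len(unmatched)
    return len(matched) / total if total else 0.0


if __name__ == "__main__":
    # Illustrative data only, not taken from the CPED file.
    forms_by_base = {
        "sati": ["sati", "satiyā"],
        "atthi": ["atthi", "santi", "sati"],
    }
    tokens = ["sati", "santi", "akataññū"]

    index = build_form_index(forms_by_base)
    matched, unmatched = match_tokens(tokens, index)
    print(matched)
    print(unmatched)
    print(coverage(matched, unmatched))
```

The unmatched list is the part that would need correcting by hand, and the coverage figure tells us how far we are from 100%.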
-
Finally, if we are to do this, it seems to me that it will be better to use this DEP for making my terminology changes, rather than a Pootle terminology file. That is to say, as I go I can add my renderings and so on to the DEP, although terminology still has its uses.
Finally finally, what do you think about this project? Does it sound like something doable?