SuttaCentral

[Project] Biggest Pali dictionary and translation correction tool

dictionary
pali
development
html
database

#1

Hello Friends.

A long time ago I got frustrated with the current state of the Pali dictionaries. I got the idea to collect all existing dictionaries and merge them into one completely new one. I started to collect the dictionaries and build a data structure in MS Access. But I soon realised that the project is extremely large and time-consuming and, worst of all, beyond my technical skill.

The idea was that the software (database) should be the hub for all Pali dictionaries, and a hub for Pali scholars and monastics to correct, vote on, and rank each translation.

With such a database, a new, biggest Pali dictionary can be generated. For each word, many additional fields besides the translation can be generated (also audio files!).

I have a basic database structure and have collected a lot of dictionaries.

For details see my GitHub - dxcore35/massive_Pali-dictionary: Biggest Pali dictionary (please note that no files or tables have been uploaded yet)

Now I’m stuck with cleaning the data and migrating the database. It is beyond my current knowledge. I would like to gather feedback or help from anybody who is interested.

With Metta


#2

Hi dxcore,

This is an awesome idea and project.

We have done a lot of work with Pali dictionaries over the years, so rather than repeat myself, perhaps you’d like to search some of the old threads on this forum.

  1. The first thing I would suggest is, don’t think in terms of “applications” or “databases”. What we need is data in a raw and usable form. The universal data format of the web is JSON, and SuttaCentral these days essentially wraps all our data, including dictionaries, as JSON. It’s a very simple format, so I’d suggest getting familiar with that as a start. Basically, if you have good data, any application can use it.
  2. We have both “large” dictionaries, which have complex, poorly structured entries, and “concise” dictionaries, with short well-structured entries. See discussion here.
  3. We have a long term project to do a comprehensive markup for the old PTS dictionary. However this is currently on hold.
  4. The Pali dictionary world is waiting with bated breath for the release of the final two volumes of Margaret Cone’s “Dictionary of Pali”. Until then, none of our resources are really first-class. But this will be some years away. Unfortunately, it seems the PTS does not have plans to release this in public domain, or to supply a properly structured edition.
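
To make point 1 a bit more concrete, here is a minimal sketch of what short, well-structured entries could look like as JSON, and how any small script can read them back. The field names (`entry`, `grammar`, `definition`, `source`) and the sample data are assumptions for illustration only; they are not SuttaCentral's actual schema.

```python
import json

# Hypothetical entry layout: field names and values are illustrative only.
entries = [
    {"entry": "dhamma", "grammar": "m.", "definition": "teaching", "source": "example-concise"},
    {"entry": "mettā", "grammar": "f.", "definition": "loving-kindness", "source": "example-concise"},
]


def dump_entries(items):
    """Serialise entries as JSON text, keeping Pali diacritics readable."""
    return json.dumps(items, ensure_ascii=False, indent=2)


def lookup(items, word):
    """Return every entry whose headword matches `word` exactly."""
    return [item for item in items if item["entry"] == word]


text = dump_entries(entries)   # what would be saved to a .json file
loaded = json.loads(text)      # what any application would read back
```

Because the result is plain text, it can be opened in any text editor, diffed, and kept under version control.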

If you really want to pursue this, I would suggest the best thing is to break down a small and helpful task in data preparation. Bit by bit, if the materials are readied, they can be used in more and more useful ways.

Never forget, the Home page of the Critical Pali Dictionary prominently features the obituaries of all those who passed away working on it. After 150 years, they have made it part-way through the first consonant in Pali, the letter k.

https://cpd.uni-koeln.de/intro/


#3
  1. By application I mean some interface where I can manipulate data, and where in future the data can be easily cleaned by non-technical people. For this, JSON is not so good. I’m thinking of keeping it as an SQL database during the “cleaning & standardisation of data” phase and later exporting it as JSON for consumption.

  2. Correct, a lot of dictionaries are poorly structured, but fixing this needs manpower (one person is not able to do it, and more people need coordination…)

  3. I’m aware of this project, and I read all the related topics here before “starting” my project

  4. Yeah, we really cannot rely on companies to always do Dana for us. Really, just the monastics need to do it, and then release it freely.

  5. I really need some knowledgeable human resources
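
A minimal sketch of the workflow in point 1, assuming SQLite stands in for the cleaning database: rows are cleaned in bulk with SQL and then exported as JSON. The table name `entries`, its columns, and the sample rows are hypothetical.

```python
import json
import sqlite3


def export_entries(conn):
    """Export the (hypothetical) entries table as JSON text."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    rows = conn.execute(
        "SELECT headword, definition, source FROM entries ORDER BY headword, source"
    )
    return json.dumps([dict(row) for row in rows], ensure_ascii=False, indent=2)


# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (headword TEXT, definition TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("  mettā ", "loving-kindness", "dict_a"),
        ("dhamma", " teaching ", "dict_b"),
    ],
)

# One bulk statement strips stray spaces from every row at once.
conn.execute("UPDATE entries SET headword = TRIM(headword), definition = TRIM(definition)")

exported = export_entries(conn)
```

The database is only a workbench here; the exported JSON is what would be published and consumed.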

You really uplifted my spirit with your last comment about how many people have died working on just one dictionary :sweat_smile: :joy: But you know, Bhante, they didn’t live in the age of computers and cheap IT tools. So I think that with the collaboration of more people, we can make it in this life :thinking:


#4

I’m not sure what your experience is in this, so I’m sure you have your reasons, but my experience is exactly the opposite. As a non-programmer, I can easily use a text editor and clean or change JSON files anywhere. I don’t even know how to start with a database. They just seem like a mysterious black box to me.

We have moved completely away from keeping our data in databases for this reason. We still use DBs, but only as a convenient way of querying data for the application.


#5

My reason is exactly the querying of the data! Especially bulk data manipulation, splitting, merging, regex extraction… I cannot even imagine doing it manually in a text editor :smile:

At the beginning, this bulk job needs to be performed, cleaning the data as much as possible semi-automatically via those bulk tasks. Next, the manual edits come into play. There are hundreds of thousands of fields to be standardised; without these bulk cleaning tasks there will definitely be no Samadhi, and it will be really hard to finish in this life (for one person).
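
As a rough sketch of such a bulk task, the snippet below uses a regex to split raw entry lines into a headword, an optional grammar tag in parentheses, and a list of senses separated by semicolons. The raw line format is an assumption for illustration; real dictionary sources will need their own patterns. Lines that do not fit are set aside for the manual pass.

```python
import re

# Assumed raw format: "headword (grammar) sense one; sense two"
ENTRY_RE = re.compile(
    r"^\s*(?P<headword>\S+)\s*(?:\((?P<grammar>[^)]*)\))?\s*(?P<definition>.*?)\s*$"
)


def parse_entry(raw):
    """Split one raw line into structured fields, or return None if it does not match."""
    match = ENTRY_RE.match(raw)
    if match is None:
        return None
    senses = [s.strip() for s in match.group("definition").split(";") if s.strip()]
    return {
        "headword": match.group("headword"),
        "grammar": match.group("grammar"),  # None when no "(...)" tag is present
        "senses": senses,
    }


def parse_all(lines):
    """Parse lines in bulk; anything without at least one sense goes to manual review."""
    parsed, rejected = [], []
    for line in lines:
        entry = parse_entry(line)
        if entry and entry["senses"]:
            parsed.append(entry)
        else:
            rejected.append(line)
    return parsed, rejected


raw_lines = ["  mettā (f.)  loving-kindness;  amity ", "dhamma teaching", "orphan", ""]
parsed, rejected = parse_all(raw_lines)
```

Keeping the rejected lines separate means the automatic step never silently drops data, and the manual edits can focus on the cases the pattern could not handle.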


#6

Well, when you put it like that …