Problems with searching suttapitaka, and possible resolution

sujato · June 14, 2017, 1:23am

Welcome to the wonderful world of natural language search. These kinds of problems are systemic, and there is no easy fix.

Some suggestions:

Use the Mahasangiti text as found on SuttaCentral rather than the CSCD or—good lord no, the Dhammakaya. While not perfect, it is by far the most consistent. You can download the raw text from here:

You can put this all in one file if you like (on Linux, use cat). It will of course contain HTML tags, but you can strip these easily enough.
If you want to do complex searches within one file, I would suggest using Sublime Text or another capable text editor rather than a PDF, which is a presentation format. The editor might struggle with the file size, though. Otherwise there are a variety of command line tools for this.
Another simple option, if the search on SC is not doing what you want, is to try a Google site search: site:https://suttacentral.net
Stripping punctuation should be fairly trivial. Normally a search engine like the one used on SC strips punctuation by default.
Spaces are more difficult, and I wouldn’t recommend stripping them in general, though it might work on occasion. In most cases, the obvious errors such as c’ eva should be corrected in the MS edition.
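
For anyone who would rather script this than do it in an editor, here is a rough Python sketch of the steps above: join the downloaded files, strip the HTML tags, and normalize punctuation (and optionally spaces) before searching. It uses only the standard library. The function names, the assumption that the download is a folder of UTF-8 `.html` files, and the choice to turn punctuation into spaces are all mine, not anything SC does internally, so adjust to taste.

```python
import unicodedata
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text content of an HTML string, tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def load_corpus(directory: str) -> str:
    """Join every .html file under `directory` into one plain-text string.

    Assumes UTF-8 files with an .html extension (hypothetical layout).
    """
    paths = sorted(Path(directory).rglob("*.html"))
    return "\n".join(strip_html(p.read_text(encoding="utf-8")) for p in paths)


def normalize(text: str, drop_spaces: bool = False) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    With drop_spaces=True all whitespace is removed as well, so that
    a search for "ceva" can match "c’ eva".
    """
    text = unicodedata.normalize("NFC", text).lower()
    # Unicode categories starting with "P" cover all punctuation,
    # including curly quotes and dashes.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    words = text.split()
    return "".join(words) if drop_spaces else " ".join(words)


def contains(haystack: str, needle: str, drop_spaces: bool = False) -> bool:
    """True if `needle` occurs in `haystack` after both are normalized."""
    return normalize(needle, drop_spaces) in normalize(haystack, drop_spaces)
```

Something like `contains(load_corpus("path/to/download"), "ceva", drop_spaces=True)` would then find the phrase regardless of punctuation or spacing. Keep in mind that dropping spaces also produces false matches across word boundaries, which is why I would leave it off by default.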

I found this an interesting idea, so I have made several versions of such a file. I’ll put them in a separate thread where they will be more discoverable.