Problems with searching suttapitaka, and possible resolution

Senryu · June 13, 2017, 5:26pm

Summary:
-Single PDF of whole suttapitaka available.
-Issues with searching suttapitaka in general; possible fix.

I find it very useful to be able to search the whole suttapitaka for words, phrases or phrase variants. I have mainly been using the Pāli Reader for that, integrated into Firefox.

I have been advised to also use the Chaṭṭha Saṅgāyana edition input by the Dhammakāya organisation. I found the online search function for that dysfunctional. I am aware of the Chaṭṭha Saṅgāyana CD-ROM, but so far as I know that software does not work on a Mac, so I can’t use it.

So I downloaded the entire text from http://gretil.sub.uni-goettingen.de/#Pali and put all of the files together in one PDF for the whole suttapitaka, which means it can be searched for terms as one (rather large) document. I did the same for the vinaya. If these files would be useful to others, I am happy to have them put online somewhere. One thing that I couldn’t do is create an Outline (the table of contents in the left column) which would be very useful (if anyone wants to add that, would be great!)

Problem:
The issue I have found is to do with inconsistency of input. I would like to know if this same problem exists when using the Chaṭṭha Saṅgāyana CD software. I somehow expect it may. I will give an example:
The word ceva is sometimes written as:
c’eva
c’ eva
c’; eva

I don’t know why they have done that - if anyone knows, please say! But if I search for a phrase which has ceva as a part of the phrase, I have to use all of the variants to do a proper search. Now that’s just one word - there must be many other such examples!

Another, slightly different problem of the same type is the choice of where to put commas or full stops. There may be an identical sentence elsewhere, but it cannot be found by searching for the sentence because one occurrence has different punctuation. That issue is present in the Pāli Reader as well.

Potential solution:
I do not know if this has already been done (if so, please say!) But I would think that the most functional text for the Canon would be if it were able to be searched without any punctuation or spaces between words. (Presumably that is how the Pāli is originally?) That is inconvenient for reading, but should be far better for comprehensive searches, which are so important for research.

I could imagine that software could be designed to ignore all punctuation and search merely the letters, as if they were all one string. So that you would not have to actually edit the whole Canon. Or perhaps that is already done, or someone has a method? I would love to hear about it!
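One way to picture this, as a rough sketch rather than a description of how any existing tool works: build a search pattern from the letters of the query alone, and allow any run of punctuation or spaces between them. The text itself is left untouched. The function name `loose_pattern` is made up for illustration, and the variants are the three spellings listed above plus the plain form.

```python
import re
import unicodedata

def loose_pattern(query):
    """Build a regex that matches the letters of `query`, allowing any
    punctuation, digits or spaces between consecutive letters."""
    query = unicodedata.normalize("NFC", query)
    letters = [ch for ch in query if ch.isalpha()]
    # [\W\d_]* = any run of characters that are not letters
    gap = r"[\W\d_]*"
    return re.compile(gap.join(re.escape(ch) for ch in letters), re.IGNORECASE)

# The spellings mentioned in the post, plus the plain form
variants = ["ceva", "c’eva", "c’ eva", "c’; eva"]
pattern = loose_pattern("ceva")
matched = [v for v in variants if pattern.search(v)]
print(matched)  # all four variants are found by the single query
```

The cost of this approach is that it also matches across real word boundaries, so a short query will return more false hits than a normal search would.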

sujato · June 14, 2017, 1:23am

Welcome to the wonderful world of natural language search. These kinds of problems are systemic, and there is no easy fix.

Some suggestions:

Use the Mahasangiti text as found on SuttaCentral rather than the CSCD or—good lord no, the Dhammakaya. While not perfect, it is by far the most consistent. You can download the raw text from here:

You can put this all in one file if you like (on linux, use cat). It will of course have HTML tags, but you can strip these easily enough.
If you want to do complex searches within one file, rather than use a PDF, which is a presentation format, I would suggest using Sublime Text or another capable text editor. There might be a struggle handling the file size, though. Otherwise there are a variety of command line tools for this.
Another simple option, if the search on SC is not doing what you want, try google site search: site:https://suttacentral.net
Stripping punctuation should be fairly trivial. Normally a search engine like the one used on SC strips punctuation by default.
Spaces are more difficult, and I wouldn’t recommend stripping them in general. However, it might work on occasion. In most cases, however, the obvious errors such as c’ eva should be corrected in the MS edition.
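For anyone who would rather not combine `cat` with a separate tag stripper, the "one file, no HTML tags" step might look roughly like this in Python. It is only a sketch using the standard library: the folder name `mahasangiti_html` and the output name `suttapitaka.txt` are placeholders, not the real layout of the download.

```python
from html.parser import HTMLParser
from pathlib import Path

class TextOnly(HTMLParser):
    """Collect the text between tags and drop the tags themselves."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Note: this also keeps the contents of <script>/<style>, if any
        self.parts.append(data)

def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

def combine(folder, out_file):
    """Strip every .html file under `folder` and write them to one text file."""
    files = sorted(Path(folder).rglob("*.html"))
    with open(out_file, "w", encoding="utf-8") as out:
        for path in files:
            out.write(strip_tags(path.read_text(encoding="utf-8")))
            out.write("\n")
    return len(files)

# Placeholder paths: adjust to wherever the downloaded files actually live.
# combine("mahasangiti_html", "suttapitaka.txt")
```

The resulting plain-text file can then be opened in a text editor or searched with command line tools.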

I found this an interesting idea, so I have made several versions of such a file. I’ll put them in a separate thread where they will be more discoverable.

Senryu · June 14, 2017, 12:32pm

Thank you very much for your reply @Sujato

That sounds interesting, though beyond my current computer skills! If anyone reading this has done this or knows of any simply articulated step by step instructions for doing this on a Mac, I would greatly appreciate hearing about it! My pdf is at least searchable and functions - it is just fraught with difficulty due to the inconsistencies of the actual text, as the example I gave above illustrates. But if there is a better way I am very open to it!

I have tried this but the result has not been good. Not sure if that is just the way I have done it, but I will explain.

Let’s take the sequence of letters ‘kāyena pas’ as an example. This occurs only once, in the Dhammapada, as ‘kāyena passati’. Using the Pāli Reader, you just search for:
kāyena pas
and it gives the reference in the Dhammapada.

Using the Suttacentral search function, if you search:
“kāyena passati”
It gives the same reference.
However, if you search for:
“kāyena pas”
You get no results. And the above two search types give the same result using Google Advanced Search, the latter being this sequence typed into Google:
“kāyena pas” site:suttacentral.net

Simply searching Suttacentral using the search function, for:
kāyena pas
gives 10 pages of results, thus not the solution.

The online CS search also does not seem to work for this. I found it here: http://search.tipitaka.org/solr/web?q=“kāyena+pas”*&fq=script%3Aromn&facet.field=volume
Searching
“kāyena passati”
gives the right reference.

You can search an incomplete term using a *, so you can search
pas*
for passati. But, you can’t seem to do that for an incomplete word following another word. So searching:
kāyena pas*
gives 510 results

Searching
“kāyena pas”*
gives no results

And searching:
“kāyena pas*”
also gives no results.
So it seems they have not designed their search function to be functional enough to search for phrases with variations, which seems a great loss.

So it seems only the Pāli Reader has this functionality. But this functionality seems to me extremely important for researching the Canon! Would you consider adding such functionality to Suttacentral? That would be so valuable, I believe.

[quote=“sujato, post:2, topic:5684”]
Stripping punctuation should be fairly trivial. Normally a search engine like the one used on SC strips punctuation by default.
Spaces are more difficult, and I wouldn’t recommend stripping them in general. However, it might work on occasion. In most cases, however, the obvious errors such as c’ eva should be corrected in the MS edition.
[/quote]

It would eliminate a number of problems, which you have also raised in the thread “What is the difference between the Pali text of the VRI and that of the Mahāsaṅgīti?”
For example
-Problems caused by differences in choices of splitting compounds, some places having very long words whilst elsewhere they are split into separate words.
-Apostrophe problems for quotes regarding their position.
-All problems with inconsistencies in punctuation; it would also give the ability to search for sequences of words regardless of variations in punctuation.

The text without spaces and punctuation would not be good for reading but I expect there should be some possibility of making software that could simply disregard all spaces and punctuation for searches, but retain the full text for reading. Thus able to search for any string of letters you like, including kāyena pas (which it would treat as kāyenapas, and successfully locate kāyena passati in the text).
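A minimal sketch of how that could work, assuming nothing about how any existing search tool is built: keep the readable text as it is, build a letters-only copy for matching, and remember where each letter came from so a hit can be shown in the original text. The function names are invented for illustration and the sample string is just the phrase discussed above with some punctuation added.

```python
def letters_with_offsets(text):
    """Return a lowercase letters-only copy of `text`, plus a list giving,
    for each kept character, its index in the original `text`."""
    kept, offsets = [], []
    for i, ch in enumerate(text):
        if ch.isalpha():
            for low in ch.lower():  # lower() can yield more than one char
                kept.append(low)
                offsets.append(i)
    return "".join(kept), offsets

def find_ignoring_gaps(text, query):
    """Yield (start, end) slices of `text` whose letters match the letters
    of `query`, ignoring spaces and punctuation on both sides."""
    stripped, offsets = letters_with_offsets(text)
    needle, _ = letters_with_offsets(query)
    if not needle:
        return
    pos = stripped.find(needle)
    while pos != -1:
        yield offsets[pos], offsets[pos + len(needle) - 1] + 1
        pos = stripped.find(needle, pos + 1)

text = "kāyena, passati."
hits = [text[s:e] for s, e in find_ignoring_gaps(text, "kāyena pas")]
print(hits)  # the matched span, shown with its original punctuation
```

The query is treated as kāyenapas, but the result is reported as a span of the unmodified text, so nothing has to be edited in the Canon itself.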

I would also assume that it would be fairly straightforward (for the computer-literate) to give either the option or default setting that the search function could treat ṅ and ṃ as synonymous, thus giving hits for both with the same search. This would make the inconsistencies you pointed out in the thread I linked to above regarding those letters, non-problems so far as searching is concerned.
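As a rough illustration of the idea, not of how any particular site does it: both the text and the query can be passed through the same small folding step before they are compared. The name `search_key` is hypothetical, and folding the alternative spelling ṁ together with the other two is an extra assumption on my part.

```python
import unicodedata

# Map ṅ (U+1E45) and ṁ (U+1E41) onto ṃ (U+1E43) for matching purposes only.
NASAL_FOLD = str.maketrans({"\u1e45": "\u1e43", "\u1e41": "\u1e43"})

def search_key(s):
    """Return the form of `s` used for comparison; the displayed text
    is never changed."""
    s = unicodedata.normalize("NFC", s).lower()
    return s.translate(NASAL_FOLD)

def same_for_search(a, b):
    return search_key(a) == search_key(b)

print(same_for_search("saṅkhāra", "saṃkhāra"))
```

This is a blunt rule: it treats the two letters as the same everywhere, including places where the difference might matter, so it would be better as an option than as the only behaviour.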

These things seem really obvious to me so perhaps someone has already done it?! If I am understanding the situation correctly, this would make searches of phrases, sentences or passages much much more reliable, thus really aiding analysis and research of the texts.

sujato · June 14, 2017, 10:37pm

Yes, the SC search doesn’t work well for multiple words.

Wildcard search also works on SC; it is a common standard. But in this case it doesn’t solve the problem.

This is how search engines work anyway: you make a stripped version of the text for indexing. It’s just a matter of what you want to strip. I’m not convinced that stripping spaces is going to be useful a significant amount of the time.

We already do this: see the results for samkhara for example. But no doubt there is room for improvement.

Thanks so much for the detailed and careful feedback! We’re currently focussed on upgrading the whole site. When this is done I’d like to focus on improving search, and we’ll certainly take this into account.

Senryu · June 15, 2017, 3:55pm

If it would not require too much more code and is easily possible, it may be worth it. It would improve reliability of analysis where different choices of compounds cause issues, where some places or some texts have long terms broken up into separate words, as well as issues like the occurrences of both c’eva and c’ eva. It would thus reduce the number of missed hits when researching some terms or phrases. So that sounds like a good benefit if the aim is optimisation.

Very welcome!
Just to add one more point: Adding the Pāli Reader type of search option of being able to choose which sections of the tipitaka the search runs on, would be amazing. I would expect with that (and as many of the above mentioned functions as possible), Suttacentral could replace the Pāli Reader, since it would have their functions and more (links to translated texts being a major advantage here, as well as the very good system of referencing, with both the PTS references and the other system (which I don’t know the name of but seems more user friendly) - these are both great aspects of Suttacentral’s helpfulness).

I really look forward to whatever upgrades are coming! Many thanks to you and the whole team

sujato · June 15, 2017, 10:40pm

Absolutely, this will be the number one priority for when we review search.

Mat · June 16, 2017, 8:03am

Would an English translation be available in this way? I drive a bit and like to listen to suttas, ideally by getting my phone to read them out to me. I haven’t had much success finding Audible compatible versions. A Word/Note file with the suttas (any) would be great. I can use the ‘speak’ function with those (though it sounds a bit robotic).