Problems with searching suttapitaka, and possible resolution

Thank you very much for your reply @Sujato :anjal:

That sounds interesting, though beyond my current computer skills! If anyone reading this has done this or knows of any simply articulated step by step instructions for doing this on a Mac, I would greatly appreciate hearing about it! My pdf is at least searchable and functions - it is just fraught with difficulty due to the inconsistencies of the actual text, as the example I gave above illustrates. But if there is a better way I am very open to it!

I have tried this but the result has not been good. Not sure if that is just the way I have done it, but I will explain.

Let’s take the sequence of letters ‘kāyena pas’ as an example. This occurs only once, in the Dhammapada, as ‘kāyena passati’. Using the Pāli Reader, you just search for:
kāyena pas
and it gives the reference in the Dhammapada.

Using the Suttacentral search function, if you search:
“kāyena passati”
It gives the same reference.
However, if you search for:
“kāyena pas”
You get no results. And the above two search types give the same result using Google Advanced Search, the latter being this sequence typed into Google:
“kāyena pas” site:suttacentral.net

Simply searching Suttacentral using the search function, for:
kāyena pas
gives 10 pages of results, thus not the solution.

The online CS search also does not seem to work for this. I found it here: http://search.tipitaka.org/solr/web?q=“kāyena+pas”*&fq=script%3Aromn&facet.field=volume
Searching
“kāyena passati”
gives the right reference.

You can search for an incomplete term using a *, so you can search
pas*

for passati. But you can’t seem to do that for an incomplete word following another word. So searching:
kāyena pas*
gives 510 results

Searching
“kāyena pas”*
gives no results

And searching:
“kāyena pas*”
also gives no results.
So it seems their search function was not designed to handle phrases with variations, which seems a great loss.
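
For anyone who wants to experiment, the behaviour I am after (a phrase whose last word is left incomplete) can be illustrated with a regular expression in Python. This is only a rough sketch and says nothing about how Suttacentral, the CS search or the Pāli Reader work internally; the function name `phrase_prefix_pattern` and the sample text are made up for illustration.

```python
import re
import unicodedata


def phrase_prefix_pattern(query: str) -> re.Pattern:
    """Match the query words in order, letting the last word be incomplete."""
    # Normalise so that letters like ā are single code points that \w can match.
    words = unicodedata.normalize("NFC", query).split()
    # Escape each word and allow any run of whitespace between words.
    body = r"\s+".join(re.escape(word) for word in words)
    # \b anchors the start of the first word; \w* lets the last word continue.
    return re.compile(r"\b" + body + r"\w*", re.IGNORECASE)


# Illustrative sample text only, not a full verse.
sample = unicodedata.normalize("NFC", "dhammaṃ kāyena passati;")
match = phrase_prefix_pattern("kāyena pas").search(sample)
found = match.group(0) if match else None
print(found)
```

With this pattern the query kāyena pas finds kāyena passati, while a text in which another word sits between the two would not match.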

So it seems only the Pāli Reader has this functionality. But this functionality seems to me extremely important for researching the Canon! Would you consider adding such functionality to Suttacentral? That would be so valuable, I believe.

[quote=“sujato, post:2, topic:5684”]
Stripping punctuation should be fairly trivial. Normally a search engine like the one used on SC strips punctuation by default.
Spaces are more difficult, and I wouldn’t recommend stripping them in general. However, it might work on occasion. In most cases, however, the obvious errors such as c’ eva should be corrected in the MS edition.
[/quote]

It would eliminate a number of problems, which you have also raised in the thread “What is the difference between the Pali text of the VRI and that of the Mahāsaṅgīti?”
For example:
- Problems caused by different choices in splitting compounds, with very long words in some places that are split into separate words elsewhere.
- Problems with the position of apostrophes in quotes.
- All problems with inconsistencies in punctuation. It would give the ability to search for sequences of words regardless of variations in punctuation.

The text without spaces and punctuation would not be good for reading, but I expect it should be possible to make software that simply disregards all spaces and punctuation for searches while retaining the full text for reading. You would then be able to search for any string of letters you like, including kāyena pas (which it would treat as kāyenapas, and successfully locate kāyena passati in the text).
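
As a rough illustration of this idea, here is a minimal Python sketch. It builds a letters-only copy of the text for searching, remembers where each letter came from, and reports hits from the untouched original. The names `squash` and `find_ignoring_gaps` and the sample string are hypothetical, and a real search engine would more likely do this with an index than by scanning the text.

```python
import unicodedata


def squash(text):
    """Return (letters-only lowercase string, original index of each kept letter)."""
    kept, positions = [], []
    for index, char in enumerate(text):
        if char.isalpha():
            # Lowercasing can yield more than one character, so map each one back.
            for lowered in char.lower():
                kept.append(lowered)
                positions.append(index)
    return "".join(kept), positions


def find_ignoring_gaps(text, query):
    """Find the query in text, ignoring spaces and punctuation on both sides."""
    text = unicodedata.normalize("NFC", text)
    squashed, positions = squash(text)
    needle, _ = squash(unicodedata.normalize("NFC", query))
    hits = []
    if not needle:
        return hits
    start = squashed.find(needle)
    while start != -1:
        end = start + len(needle) - 1
        # Slice the original text, so spaces and punctuation are shown as written.
        hits.append(text[positions[start]:positions[end] + 1])
        start = squashed.find(needle, start + 1)
    return hits


# Illustrative sample text only.
sample = "dhammaṃ kāyena passati; c’ eva"
print(find_ignoring_gaps(sample, "kāyena pas"))
print(find_ignoring_gaps(sample, "ceva"))
```

Because both the text and the query are squashed in the same way, a query typed as one long compound would also find the same words written separately, and the other way round.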

I would also assume that it would be fairly straightforward (for the computer-literate) to give either the option or default setting that the search function could treat and as synonymous, thus giving hits for both with the same search. This would make the inconsistencies you pointed out in the thread I linked to above regarding those letters, non-problems so far as searching is concerned.
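
Treating two letters as the same for search purposes could be sketched like this in Python. The pair ṁ and ṃ below is only an assumed stand-in, since it depends on which letters the editions actually write differently, and the names `EQUIVALENTS` and `fold` are hypothetical.

```python
import unicodedata

# Assumed example pair: map the variant letter to one canonical letter.
EQUIVALENTS = str.maketrans({"ṁ": "ṃ"})


def fold(text):
    """Normalise text so that equivalent spellings compare equal in a search."""
    return unicodedata.normalize("NFC", text).translate(EQUIVALENTS).lower()


same = fold("dhammaṁ") == fold("dhammaṃ")
print(same)
```

If both the stored text and the query are passed through `fold` before matching, either spelling finds both, and the text shown to the reader stays as it is.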

These things seem really obvious to me, so perhaps someone has already done it?! If I am understanding the situation correctly, this would make searches for phrases, sentences or passages much more reliable, thus really aiding analysis and research of the texts.
