Discussion: How should search results be ranked 📃?

Why do you think the segmented translations are ‘better’? I’m not sure I can think of a reason.

2 Likes

Yeah, I can’t make a strong argument. And of course there is nothing inherently better about the translation itself.

But in terms of site functionality, they show better excerpts in the results. Kind of.

And they will take you to a text that can be viewed side by side with the Pali, which in general I would say is better. But of course that won’t matter to everyone. Also, those are the only texts that have translator notes. Basically the segmented texts give you the most complete SuttaCentral experience.

The segmented texts tend to be newer as well. (although I just now realized that @cdpatton’s translations are legacy texts) And potentially, although not necessarily, have more uniform translations since they were done using Bilara which promotes that.

At this point I’m just kind of throwing out any possible way we might be able to rank them that might in some way be useful.

1 Like

I think the biggest problem is that there are simply so many of them that almost any ranking may be better than no ranking at all 
 :TRIPLE_SIGH:

3 Likes

Hello, everyone.

That’s a great survey. Thank you ven @Snowbird for bringing it up. And thank you @HongDa for improvements.

It might be that my suggestions are less about filters and more about grouping and sorting the search results that have already been produced. But these are also a kind of filter, after all.

Here is an example with Kuáč­hār search results on Suttacentral.net (partial match)

https://suttacentral.net/search?query=in:ebs%20Kuáč­hār

And the same search on find.dhamma.gift

#1 IMO the grouping I did on dhamma.gift looks more user-friendly and makes the search results easier to work with. The output on SuttaCentral is fine for 13 texts, but with 50+ texts it becomes almost unmanageable to decide which texts might help more than others.

#2 The other big thing for the user is results aggregated by words. That can be a very important feature for the partial match search.

#3 is really minor but can be crucial for the user: showing variants of the word, as in the part “Variants for Kuáč­hār”.

PS: I gave test links for fdg. If you see some tech info or errors, please don’t mind. I just wanted to show output that might improve how search results are presented to the user on SC.net.

1 Like

:sweat_smile:

What about sorting by number of parallels of that specific part of the text? That signifies some importance.

There’s also sorting by number, then by nikaya.

DN1 → MN1 → 
 Dhp1 → DN2 → MN2 → 
 Dhp2 → DN3


Or with SN first.
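A minimal sketch of that "by number, then by nikaya" ordering, in case it helps to see it concretely. The collection order list and the ID pattern are assumptions made for illustration, not anything SuttaCentral is known to use.

```python
import re

# Assumed collection order; move "SN" to the front for the "SN first" variant.
NIKAYA_ORDER = ["DN", "MN", "SN", "AN", "Dhp"]

def interleave_key(uid, order=NIKAYA_ORDER):
    """Sort key: leading number first, then collection rank, then sub-number."""
    match = re.fullmatch(r"([A-Za-z]+)\s*(\d+)(?:\.(\d+))?", uid)
    if match is None:
        # IDs that do not fit the assumed pattern sort last.
        return (float("inf"), len(order), 0)
    prefix = match.group(1)
    major = int(match.group(2))
    minor = int(match.group(3) or 0)
    rank = order.index(prefix) if prefix in order else len(order)
    return (major, rank, minor)

results = ["DN2", "Dhp1", "MN1", "DN1", "MN2", "DN3", "Dhp2"]
ordered = sorted(results, key=interleave_key)
print(ordered)
```

Passing a different `order` list, for example one starting with "SN", gives the SN-first variant without touching the rest of the key.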

I’m just putting out ideas in case it inspires an actual good layout.

1 Like

I created this issue previously that kind of addresses your suggestion:

My personal feeling is that the Digital Pali Reader does such a good job at search for Pali that it’s not worth it to try and replicate it on SuttaCentral.

But if we do, I would like to see those partial match hits as well as breakdown by book.

Ideally I might like to see filter suggestions customized to the results (i.e. only show potential filters that would have some effect on the results)

That would avoid having all of the same book at the top of the results. But I’m not sure if that’s the best way to do it.

One thing that would be helpful is to have the default search order by available translations. English search terms like “Greed” will pull up texts which only have their titles translated. I’m sure this is helpful for some people who want to search in English but do their own translation in root texts, but if a major goal of the site / search is to help people read translated EBTs in a language they know, then the current functionality is a bit suboptimal.
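As a rough sketch of what "order by available translations" could mean: put hits that have a full translation in the reader's language ahead of hits that do not, and only then sort by relevance. The `Hit` class, its fields, and the placeholder IDs are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Hit:
    uid: str                     # hypothetical text identifier
    score: float                 # relevance score from the search engine
    translations: set = field(default_factory=set)  # languages with a full translation

def order_by_translation(hits, lang="en"):
    """Translated texts first, then by descending relevance score."""
    # False sorts before True, so hits translated into `lang` come first.
    return sorted(hits, key=lambda h: (lang not in h.translations, -h.score))

hits = [
    Hit("text-a", 9.1),                 # title-only match, no translation
    Hit("text-b", 4.2, {"en", "de"}),
    Hit("text-c", 6.5, {"en"}),
]
ordered_uids = [h.uid for h in order_by_translation(hits)]
print(ordered_uids)
```

Untranslated texts are still returned, just lower down, so people who work from the root texts do not lose anything.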

If this is the root issue, why not just apply a simple length adjustment, like multiplying the score by length (or some function of length, like log length)?

I think a part of a good benchmark might be that a search for right view returns MN9 at the top.

You don’t want really long texts which just briefly touch on a subject to rise to the top, but you also want longer texts which directly focus on a topic to be treated fairly.

1 Like

Thanks so much for your feedback!

I’m not sure I quite understand. Could you say this in a different way?

Currently, title matches are shown as suttaplex cards in the right column on desktop and at the top of the page on mobile.

When viewed on mobile, common terms like this tend to create the illusion that only items with title matches are returned.

That’s a very good point. Currently it seems to be third in the right column on desktop. It could automatically be brought to the top in this case if

  • we sorted by longest sutta first
  • or we sorted by “recommended suttas first”

To me, the second option seems more reliable.

Greed has a different problem. The first ten title results are all for the abbreviated/repetition series at the end of AN chapters. I don’t want to disparage any sutta ever, but I doubt those are the suttas people want most.

Ah, yes I guess I just misunderstood what I was seeing.

This may not belong in this thread, but I do find the cards’ presence / presentation in search somewhat odd. And they can appear for texts with no translations, e.g. if I search “seven buddhas” (in English) I get “suttaplex cards” for two untranslated Chinese texts.

  • or we sorted by “recommended suttas first”

To me, the second option seems more reliable.

I don’t know exactly what you mean in terms of an underlying implementation, but I would worry something like a “recommended” flag could lead to unintended consequences. For example, there’s a lot of AN entries which are much more relevant for the search term “harsh speech” (which occurs once in MN9’s Bodhi translation). You wouldn’t want “good” but tangential suttas being sorted above more directly relevant suttas.

I agree it is confusing. I’m happy to get any and all feedback.

I also find it odd.

I think the idea is that translated texts will appear first (although I’m not 100% sure this is happening everywhere and in all ways). In this case there were no translated suttas that have the words in the title.

Also a very good point! Currently there are only a limited number of suttas that have been given any kind of recommendation status. And as you point out it is for the whole sutta. Perhaps if more suttas could be given this recommendation status it would mitigate the problem.

Search is hard!!!

What exactly is a “recommendation status”?

If this is the little symbols for “beginner”, “middle”, or “advanced” status represented by a seedling etc., this has nothing at all to do with the relevance of a sutta for a particular search term—does it? :thinking:

No, but it does indicate that the sutta is somehow valuable/important. All other things being equal, I think it makes sense to prioritize valuable over not.

I’m open to anything, though. (and yes, I was talking about suttas with any of those three categories)

1 Like

I am not so sure about this judgement. There are many very valuable suttas in SN or AN, but these icons are only found on MN and DN suttas. It’s a bit one-sided in my view.

1 Like

Right! I guess my theory is based on accurate recommendation data. I wonder if those recommendations were only added to MN and DN so the feature had data to work with and then the recommending just stopped. It would be an interesting project to add to SN and AN.

1 Like

As far as I know all MN and DN suttas have one of the icons*, and all other suttas don’t have them—or am I wrong? (* I checked again, and this is indeed the case; so using them for ranking search results just means ranking MN and DN at the top.)

And I also understood that they were not meant to say that these suttas are any more valuable than others, but just to give a grouping within these Nikayas about the level of expertise recommended for reading them. Basically, so that a newcomer doesn’t start with MN 1 and get frustrated.

That’s why I didn’t first know what you mean by “recommendation”.

And the absence of an icon I didn’t understand as meaning this sutta is less valuable, but that it’s not easy to group it in a category between beginners and advanced readers. Or that simply no-one found the time to do it.


Added:
On the SC licences page we find:

We use several icons from the Noun Project, kindly released via Creative Commons Attribution (CC BY 3.0 US).

So they are not meant to indicate value, but difficulty.

They are also mentioned in the same sense here:

1 Like

:man_facepalming:

Thanks so much for doing all that research. Somehow it escaped me that all the suttas there had a ranking.

1 Like

Indeed. I’m very sympathetic to you and anyone working to try and make it better.

Again, I’ll just throw out the idea of trying to apply some sort of formulaic discounting of the tf-idf score for these texts, improving the score for longer documents.

I did a little experiment and found that for the term “meditate”, plain tf-idf gives the top 5 spots to AN texts; multiplying the score by the natural log of the text length gets sn52.1 in; and multiplying by length gets DN22 and MN10 their (IMO) proper places as #1 and #2. Similar results happen for “mindfulness”.

The tf-idf formula I used is the default in tidytext, which is:
tf = occurrences of term / length of document
idf = ln(total number of documents / number of documents with at least one occurrence of the given term)

and tf-idf = tf*idf

So, my “best” adjustment, tf*idf multiplied by length, is equivalent to idf * raw count of term occurrences, because multiplying by the document length cancels the division in tf.

Confusingly, sometimes that raw count is itself called “term frequency”, even in the context of tf-idf analysis. IDK what the current algo is using under the hood.

There might be some superior method to what I’ve devised (it IMO incorrectly ranks DN30 above any SN29 text when the search term is “dragons”), but the root issue is that we’re comparing sentence fragments to essays and narratives. Traditional tf-idf gives an artificial edge to a text like an1.416, whose whole text is “the faculty of mindfulness”, vs. “here’s the story of one time the Buddha spoke on many different aspects of mindfulness”.

Another way to think of it is that using this modified version of tfidf is like ranking suttas by how much they have to say on a topic instead of what % of the text is purely targeting the topic.

Messing around further as I wrote this, I found that tfidf*log(length)*sqrt(length) passed both my personal “meditate” and “dragon” tests, but obviously lacks any sort of interpretability or justification.
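For anyone who wants to try the variants side by side, here is a small sketch in Python (my experiment was in tidytext, so this is a re-statement, not the code I ran). The function name, the `WEIGHTS` table, and the toy corpus are made up for illustration; only the tf and idf definitions follow the formulas above.

```python
import math
from collections import Counter

def score_documents(docs, term, length_weight=lambda n: 1.0):
    """docs maps uid -> list of tokens; returns uid -> tf * idf * length_weight(length)."""
    containing = sum(1 for tokens in docs.values() if term in tokens)
    if containing == 0:
        return {uid: 0.0 for uid in docs}
    idf = math.log(len(docs) / containing)       # natural log, as in the formula above
    scores = {}
    for uid, tokens in docs.items():
        n = len(tokens)
        if n == 0:
            scores[uid] = 0.0
            continue
        tf = Counter(tokens)[term] / n           # occurrences / document length
        scores[uid] = tf * idf * length_weight(n)
    return scores

# The four weightings discussed in this post.
WEIGHTS = {
    "plain": lambda n: 1.0,
    "log": math.log,
    "length": lambda n: float(n),
    "log_sqrt": lambda n: math.log(n) * math.sqrt(n),
}

# Toy corpus: a four-word fragment, a longer text with five mentions, an unrelated text.
docs = {
    "short": "the faculty of mindfulness".split(),
    "long": ("mindfulness " * 5 + "filler " * 35).split(),
    "other": ("filler " * 10).split(),
}
plain = score_documents(docs, "mindfulness", WEIGHTS["plain"])
by_length = score_documents(docs, "mindfulness", WEIGHTS["length"])
print(sorted(plain, key=plain.get, reverse=True))
print(sorted(by_length, key=by_length.get, reverse=True))
```

With plain tf-idf the fragment outranks the longer text; with the length weighting the score is just idf times the raw count, so the text with more mentions comes first. Real rankings will of course depend on tokenization and on the actual corpus.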

2 Likes

All that is very interesting! I think we may need your help when we get into the nitty-gritty of ranking.

Thanks!

2 Likes

Just wanted to point out a variation of this idea. The number of parallels is the most common measure of importance of a topic that I’ve seen people genuinely try to use. e.g. more mentions of jhāna means they’re more important/reliable.

The number of repeat results can be used instead of parallels if that is too hard to measure. It’s not easy to say whether repeat results are referencing each other or if it’s just a common phrase, and some results are very similar with small differences (even punctuation, which could be ignored if search contains no punctuation or through some setting), but it would still give results that are easier to read this way, are less cluttered, and can be gone through by uniqueness. It’s then ordered by highest number of repeats since that could mean more important.

Personally, that would make searching a lot easier for me. One other reason is that I’m often looking for a term that is part of a longer phrase, and it may be hundreds of results down, past many repeats, so I have to memorize which ones I’ve seen so far.

So when I search punabbhavo, I get 168 results. Scrolling through them, there are many repeat results, but this way they would be organized like:

  1. [15 repeats] AN 3.103, AN 3.104, AN 7.50, MN 26, Kd 1, Ps 2.6, SN 14.31, 
 SN 56.11
    ‘akuppā me vimutti, ayamantimā jāti, natthi dāni punabbhavo’”ti.

  2. [3 repeats] AN 4.1, DN 16, Kv 1.5
    Tayidaáč, bhikkhave, ariyaáč sÄ«laáč anubuddhaáč 
 ucchinnā bhavataáč‡hā, khÄ«áč‡Ä bhavanetti, natthi dāni punabbhavo

  3. [3 repeats] AN 8.64, AN 9.41, MN 128
    akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo

  4. [2 repeats] MN 98, Snp 3.9
    santo khÄ«áč‡apunabbhavo

    Singlets:
    Snp 3.6
    Vusitavā khÄ«áč‡apunabbhavo sa bhikkhu
    Bv 2
    Dukkho punabbhavo nāma
    


I omit some content (the title, name of sutta, root language, whether aligned, language, original text, exact reference path), but that could still be included (not sure how important they really are, though, since if you wish to know about those details as they pertain to searching, then you could just search by them). Of course, it also ignores the context, but so does the current search, and one can still click on the sutta reference, linking to that same line.

The original results can just be mutated into a different data layout, so the original search process isn’t changed, which should be a relatively easy change. Also split by reference since some results contain multiple when the term appears on multiple lines. The display can get a little technically weird with title: or when multiple languages are listed, but I don’t think that presents a big challenge as it would still just list out those results separately as described.
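A minimal sketch of that reshaping step, assuming the search already returns (reference, segment text) pairs. The function names and the normalization rule (lowercase, drop punctuation and symbols) are my own assumptions, and the punctuation differences in the sample data are invented to show the near-duplicate case.

```python
import unicodedata
from collections import defaultdict

def normalize(segment):
    """Lowercase and drop punctuation/symbols so near-identical segments compare equal."""
    kept = [ch for ch in segment.lower()
            if not unicodedata.category(ch).startswith(("P", "S"))]
    return " ".join("".join(kept).split())

def group_hits(hits):
    """hits: iterable of (reference, segment_text). Returns (groups, singlets)."""
    buckets = defaultdict(list)   # normalized text -> references, in original order
    first_text = {}               # normalized text -> first original spelling seen
    for ref, text in hits:
        key = normalize(text)
        buckets[key].append(ref)
        first_text.setdefault(key, text)
    # Repeated segments, most repeats first (sorted() is stable for ties).
    groups = sorted(
        ((refs, first_text[key]) for key, refs in buckets.items() if len(refs) > 1),
        key=lambda group: -len(group[0]),
    )
    singlets = [(refs[0], first_text[key])
                for key, refs in buckets.items() if len(refs) == 1]
    return groups, singlets

hits = [
    ("MN 98", "santo khÄ«áč‡apunabbhavo"),
    ("AN 8.64", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo"),
    ("Snp 3.9", "santo khÄ«áč‡apunabbhavo"),
    ("AN 9.41", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo."),
    ("MN 128", "Akuppā me cetovimutti ayamantimā jāti natthi dāni punabbhavo"),
    ("Bv 2", "Dukkho punabbhavo nāma"),
]
groups, singlets = group_hits(hits)
for refs, text in groups:
    print(f"[{len(refs)} repeats] {', '.join(refs)}\n    {text}")
print("Singlets:", singlets)
```

This only touches the result list after the search has run, so the search itself stays unchanged. Whether exact-after-normalization matching is the right notion of "repeat" is an open question; fuzzier matching would need a different key.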

Thanks for the suggestion.

So if I am understanding correctly, you aren’t just proposing a way to sort results, but to also somehow group the results when the search term is found in identical segments?

This would complicate things, especially since you can search for more than one word, and those words can appear anywhere in the whole sutta.