Discussion: How should search results be ranked 📃?

Snowbird · February 4, 2024, 12:39am

In other words, when there is more than one result, in what order should they appear?

Currently, when a keyword is found in a text title, the result is displayed as a suttaplex card. On mobile these cards are all grouped at the top of the list; on desktop they appear in the right-hand column, separate from results where the word is found in the body of the text.

Here are some ideas gathered so far:

If keyword is in title (currently used on site)
Length of document
Frequency of keyword in document
Using beginner/intermediate/advanced tags
Number of footnotes/annotations
Number of translations of a document (because more translations might indicate importance)
The method on SC-Voice: The relevance score is simply the sum of the number of matches plus the fraction of matching segments.
DN → MN → SN → AN → Kp → Dhp → … → Vinaya → Abhidhamma?
Sutta > Vinaya > Abhidhamma
EBTs first
Segmented translations first

Khemarato.bhikkhu · February 4, 2024, 3:44am

You had asked in the other thread why AN results appear first. It could be an alphabetical thing. It could also be that the TF-IDF ranker tends to put shorter documents ahead of longer ones. In retrospect, that’s probably not the right approach for us. We probably want to encourage people to read longer suttas, right?

Snowbird · February 4, 2024, 4:22am

I limited a search to in:sn and it sorted them “alphabetically” in the sense of sn22.111 coming before sn22.99 etc.
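To illustrate the ordering described here: a plain string sort compares character by character, so "111" lands before "99". Below is a minimal sketch of a "natural" sort key that compares the numeric parts as integers instead. The helper name `natural_key` is made up for this example and is not taken from the SuttaCentral codebase.

```python
import re

def natural_key(uid: str) -> list:
    """Split an ID such as 'sn22.111' into text and integer parts.

    re.split with a capture group alternates text, digits, text, ...,
    so the same positions always hold the same types and can be compared.
    """
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", uid)]

ids = ["sn22.99", "sn22.111", "sn22.9"]

plain_order = sorted(ids)                     # character-by-character comparison
natural_order = sorted(ids, key=natural_key)  # numeric parts compared as integers

print(plain_order)    # ['sn22.111', 'sn22.9', 'sn22.99']
print(natural_order)  # ['sn22.9', 'sn22.99', 'sn22.111']
```

This only fixes the ordering within a book. It does nothing to put the most relevant sutta near the top.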

See, that’s where I start to feel shaky. I guess all things being equal that’s not necessarily bad. But it doesn’t increase the chances of the sutta I’m looking for being closer to the top.

I think I tend to use the search to find a sutta I already know exists. Which is not the only way! If someone is just looking for a text about consciousness, then yeah, maybe a longer sutta would be more appropriate? But I’m really not sure. A very long sutta could mention consciousness in passing and not offer much more than a short sutta would. And if we went by sutta length, then we might as well just rank DN and MN first.

As far as segmented suttas go, it might be valuable to prioritize suttas where the keyword was found in a longer segment. That would downgrade all of the results where the only excerpt is “…feeling” etc.

However that breaks down with the non-segmented texts. They seem to treat the whole paragraph as a segment regardless.

Perhaps the number of times a keyword appears in the sutta. A sutta that had the word “consciousness” 10 times might be more relevant than if it only mentioned it once or twice.

I’m trying to think of metrics we have about suttas that could even be used. We have the beginner, intermediate, advanced quality. But should beginner or advanced be prioritized? The vast majority of suttas don’t have that quality assigned, so maybe simply having one of those means that a human has decided it is of special importance.

We also know how many translations exist for any given sutta. In theory the more translations exist for a sutta the more important it is.

We know how many footnotes each sutta has. Should a highly commented on sutta rank higher?

sabbamitta · February 4, 2024, 8:09am

In Voice, we use a relevance score for ranking search results.

Search results are sorted by relevance. The relevance score is simply the sum of the number of matches plus the fraction of matching segments. Suttas densely packed with search terms have highest relevance.

See About Voice, under “search results”.
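A rough sketch of the score as described in that quote, for anyone who wants to play with it. This is not Voice's actual code. It assumes that "number of matches" means case-insensitive occurrences of the term across all segments, and that "fraction of matching segments" means matching segments divided by total segments. The function name `relevance` and the sample texts are invented.

```python
def relevance(segments: list[str], term: str) -> float:
    """Number of matches plus fraction of matching segments (assumed reading)."""
    term = term.lower()
    # Occurrences of the term in each segment, ignoring case.
    counts = [segment.lower().count(term) for segment in segments]
    matches = sum(counts)
    matching_segments = sum(1 for count in counts if count > 0)
    fraction = matching_segments / len(segments) if segments else 0.0
    return matches + fraction

# Invented sample texts, only to show the effect of density.
dense = ["The elephant walked.", "An elephant is large."]
sparse = ["The king rode an elephant.", "He went to the city.",
          "Then he returned.", "The end."]

print(relevance(dense, "elephant"))   # 3.0  (2 matches + 2/2 segments)
print(relevance(sparse, "elephant"))  # 1.25 (1 match + 1/4 segments)
```

Because the fraction is always at most 1, it mostly acts as a tie-breaker between texts with the same number of matches, favouring the one more densely packed with the term.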

This was done precisely to avoid AN results always being displayed first; by default, the software would use alphabetical order. And probably any software does, if not told otherwise.

BUT … Voice does by default only search in

Pali
segmented
from the Mahasangiti manuscript, or basically, texts that have a translation by Bhante Sujato
suttas, no vinaya, no abhidhamma

So many of the problems SC has to solve do not apply here.

Pasanna · February 4, 2024, 8:59am

This is what I would have expected to be the primary weighting.

Then, if there are two or more suttas with the same score, I think your suggestion about the number of translations is reasonable.
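The tie-break idea can be sketched as a two-part sort key: relevance score first, number of translations second. The field names and the numbers below are made up purely for illustration.

```python
# Invented results: two texts share the same relevance score.
results = [
    {"uid": "text-a", "score": 3.0, "translations": 2},
    {"uid": "text-b", "score": 3.0, "translations": 7},
    {"uid": "text-c", "score": 4.5, "translations": 1},
]

# Sort by score (highest first); the translation count only decides ties.
ranked = sorted(results, key=lambda r: (-r["score"], -r["translations"]))

print([r["uid"] for r in ranked])  # ['text-c', 'text-b', 'text-a']
```

Any of the other metrics mentioned in this thread (footnote count, a recommendation tag) could be added as a further element of the tuple in the same way.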

However, if I’m looking for a sutta with an elephant in it, then I might not be looking for the most densely ‘elephanted’ sutta, as opposed to when I’m looking for a sutta on consciousness. In this instance the title and the summary of the sutta would be useful in the weighting… which takes us back to the other thread on search criteria/filters. What about in:title and in:summary? Either way, these criteria seem important.

Ugh, search is complex.

Snowbird · February 4, 2024, 6:27pm

We currently have title:elephant. It’s not very flexible because it only takes a single keyword. I had this idea about how it could be implemented differently, but @Khemarato.bhikkhu has given me doubts about it.

However, the title is already given priority, just not in a way that I really like:

It puts them in the right column as SuttaPlex cards. I don’t like it for many reasons. It separates them out, so now there are two separate rankings. And being separate it’s easy to miss one or the other. It’s also not always obvious why a result is being given if one author puts that specific word in the title and another one doesn’t (see example below). And in the example above, the second title-first result has nothing at all to do with elephants! And with the suttaplex cards we get no excerpt. Just because a keyword is in a title doesn’t mean I don’t want to see the context of the word in the sutta itself.

Here is what the first result of title:elephant looks like:

I feel like the user would be better served by seeing some lines of text rather than the suttaplex card.

bran · February 4, 2024, 9:22pm

Just to put it out there: DN → MN → SN → AN → Kp → Dhp → … → Vinaya → Abhidhamma?

It’s really not obvious what order the searcher wants it to be in even with the context of their search content.

Snowbird · February 4, 2024, 9:27pm

I don’t know if I would always want to see DN first, but all other things being equal, perhaps it does make sense to show EBTs before everything else. Which in general would be what you are proposing I think? I just wouldn’t necessarily want the EBT results to be sorted that way.

Thanks!

Pasanna · February 4, 2024, 10:35pm

I don’t know if I like that ranking for the nikayas.
SN is the original categorised search!

I don’t even think we should give a weighted ranking to the different Nikayas. I agree that Sutta > Vinaya > Abhidhamma within the context of this site

I agree. I was a bit confused about having suttaplex cards on the right, but you can get used to anything!

Within the context of the search I would want to see the title, the translator (why is this called author?) and then the chunk of text where my search term is showing, with the term highlighted. I would also find it useful to know how many times that term appears in that particular text. (This is something that thebuddhaswords.net’s very basic search does; maybe I have just become accustomed to it.)

Snowbird · February 4, 2024, 10:49pm

Do you mean in the filter? See my moaning there: Change `author:` fiter to `by:` · Issue #2970 · suttacentral/suttacentral · GitHub

Interesting idea. For segmented texts I believe you are shown up to three segments that have the term. So if you know that’s how it works, then you already get some indication (1, 2, or 3/3+)

I wonder if segmented translations should be given weight? That would mean, for now, de facto showing Bhante Sujato’s and Bhante Brahmali’s translations first.

Pasanna · February 4, 2024, 11:06pm

Why do you think the segmented translations are ‘better’? I’m not sure I can think of a reason.

Snowbird · February 4, 2024, 11:24pm

Yeah, I can’t make a strong argument. And of course there is nothing inherently better about the translation itself.

But in terms of site functionality, they show better excerpts in the results. Kind of.

And they will take you to a text that can be viewed side by side with the Pali, which in general I would say is better. But of course that won’t matter to everyone. Also, those are the only texts that have translator notes. Basically the segmented texts give you the most complete SuttaCentral experience.

The segmented texts tend to be newer as well (although I just now realized that @cdpatton’s translations are legacy texts). And potentially, although not necessarily, they have more uniform translations, since they were done using Bilara, which promotes that.

At this point I’m just kind of throwing out any possible way we might be able to rank them that might in some way be useful.

sabbamitta · February 4, 2024, 11:37pm

I think the biggest problem is that there are simply so many that it seems almost any ranking may be better than no ranking at all … :TRIPLE_SIGH:

dhammagift · February 20, 2024, 2:33pm

Hello, everyone.

That’s a great survey. Thank you ven @Snowbird for bringing it up. And thank you @HongDa for improvements.

It might be that my suggestions about filters are more on the grouping and sorting side of the already produced search results. But these are also some kind of filters after all.

Here is an example with Kuṭhār search results on Suttacentral.net (partial match)

https://suttacentral.net/search?query=in:ebs%20Kuṭhār

And the same search on find.dhamma.gift

#1 IMO this grouping, like I did on dhamma.gift, looks more user-friendly and helps the user work with the search results. The output on SuttaCentral is fine for 13 texts, but with 50+ texts it will become almost unmanageable to make any decisions about which texts might help more than others.

#2 The other big thing for the user is results aggregated by words. That can be a very important feature for the partial match search.

#3 is really minor but can be crucial for the user: showing variants of the word, as in the part “Variants for Kuṭhār”.

P.S. I gave test links for fdg. If you see some tech info or errors, please don’t mind. I just wanted to show the output that might improve the search result visualization for the user on SC.net.

bran · February 20, 2024, 3:14pm

What about sorting by number of parallels of that specific part of the text? That signifies some importance.

There’s also sorting by number, then by nikaya.

DN1 → MN1 → … Dhp1 →
DN2 → MN2 → … Dhp2 →
DN3…

Or with SN first.
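A quick sketch of that interleaved ordering, in case it helps to try it out: sort by the first number in the ID, then by the position of the book in a chosen order. The book list, the ID format, and the function name are assumptions made for this example; passing a different `order` covers the "SN first" variant.

```python
import re

# Assumed book order for this sketch; reorder the list to put SN first.
NIKAYA_ORDER = ["dn", "mn", "sn", "an", "kp", "dhp"]

def number_then_nikaya(uid: str, order: list[str] = NIKAYA_ORDER) -> tuple[int, int]:
    """Sort key: first number in the ID, then the book's position in `order`."""
    match = re.match(r"([a-z]+)(\d+)", uid)
    if match is None:
        raise ValueError(f"unrecognised ID: {uid}")
    book, number = match.group(1), int(match.group(2))
    # Books not listed in `order` are placed after all listed books.
    rank = order.index(book) if book in order else len(order)
    return (number, rank)

ids = ["mn2", "dn1", "dhp1", "dn2", "mn1"]
interleaved = sorted(ids, key=number_then_nikaya)

print(interleaved)  # ['dn1', 'mn1', 'dhp1', 'dn2', 'mn2']
```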

I’m just putting out ideas in case it inspires an actual good layout.

Snowbird · February 20, 2024, 7:16pm

I created this issue previously that kind of addresses your suggestion:

github.com/suttacentral/suttacentral

On larger search results, display a breakdown of what books the results are in

opened 02:28AM - 11 Jul 23 UTC

thesunshade

#### User story
Someone searches for a rather common term, e.g. "rāga". They get 2610 results:
![image](https://github.com/suttacentral/suttacentral/assets/82448383/34bd9306-96cb-4873-81a7-b2fd44c8bf1e)
They need some way to narrow things down. At this point they may decide that the sutta they were looking for is in a particular book, or simply that they would be interested in a result that came from a particular book. For example, they think that they might want a pithy verse, so they would like to narrow it down to the Sutta Nipāta. Currently there is no way for them to know the distribution for their results or an easy way to narrow them (without investigating the "filter" icon).

#### Feature description
Add a break down of results per book. Clicking on one of those book names would add the necessary filter to the search and re-run the search. The DPR has something like this (see top right of screen):
![image](https://github.com/suttacentral/suttacentral/assets/82448383/626e800d-408e-432d-aad4-70f9b91c1bdc)

#### Issues:
* Should there be a threshold of x results before this kind of interface is presented?
* Would every single book that has a result be represented? This could lead to having dozens of books with just one result each.

#### Acceptance criteria (the list of things that need to be done for the ticket to be considered finished):

#### Pre milestone planning check:
- [ ] Small enough to completed in a milestone.
- [ ] Dependencies marked
- [ ] No external dependencies block the PBI from being completed.
- [ ] Details are understood by dev team to decide if the PBI can be completed.

#### Done check:
- [ ] Produced code for presumed functionalities
- [ ] Project builds without errors
- [ ] Peer Code Review performed/pull request approved x2
- [ ] Project deployed on the stage environment identical to production platform
- [ ] Feature is tested against acceptance criteria
- [ ] Feature ok-ed by Product Owner (moving to Closed on the Board)
- [ ] Refactoring completed
- [ ] Any configuration or build changes documented (readme, etc...)
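For what it's worth, the per-book breakdown in that issue could be computed along these lines. This is only a sketch: it assumes the book is the leading run of letters in each result ID, and the `threshold` parameter (reflecting the open question in the issue about when to show the breakdown) is hypothetical. The result list is a made-up sample.

```python
import re
from collections import Counter

def book_of(uid: str) -> str:
    """Assumed convention: the book is the leading run of letters in the ID."""
    match = re.match(r"[a-z]+", uid)
    return match.group(0) if match else uid

def breakdown_by_book(uids: list[str], threshold: int = 0) -> list[tuple[str, int]]:
    """Count results per book, largest first; empty if below the threshold."""
    if len(uids) < threshold:
        return []
    return Counter(book_of(uid) for uid in uids).most_common()

# Made-up sample of result IDs.
hits = ["sn22.99", "sn22.111", "an3.65", "snp1.8", "sn35.28"]

print(breakdown_by_book(hits))                # [('sn', 3), ('an', 1), ('snp', 1)]
print(breakdown_by_book(hits, threshold=50))  # []
```

Each book name in the output could then be turned into a filter such as in:sn and the search re-run.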

My personal feeling is that the Digital Pali Reader does such a good job at search for Pali that it’s not worth it to try and replicate it on SuttaCentral.

But if we do, I would like to see those partial match hits as well as breakdown by book.

Ideally I might like to see filter suggestions customized to the results (i.e. only show potential filters that would have some effect on the results)
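One possible reading of "would have some effect", sketched below: a filter is worth suggesting only if it keeps some results but not all of them. That definition, the result fields, and the candidate filters are all assumptions for illustration.

```python
from typing import Callable

Result = dict[str, str]

def useful_filters(results: list[Result],
                   candidates: dict[str, Callable[[Result], bool]]) -> list[str]:
    """Return the names of filters that narrow the results without emptying them."""
    useful = []
    for name, keeps in candidates.items():
        kept = sum(1 for result in results if keeps(result))
        if 0 < kept < len(results):
            useful.append(name)
    return useful

# Made-up results and illustrative filter names.
results = [
    {"uid": "sn22.99", "book": "sn"},
    {"uid": "sn22.111", "book": "sn"},
    {"uid": "an3.65", "book": "an"},
]
candidates = {
    "in:sn": lambda r: r["book"] == "sn",
    "in:an": lambda r: r["book"] == "an",
    "in:dn": lambda r: r["book"] == "dn",  # matches nothing here, so not suggested
}

suggested = useful_filters(results, candidates)
print(suggested)  # ['in:sn', 'in:an']
```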

That would avoid having all of the same book at the top of the results. But I’m not sure if that’s the best way to do it.

DonatorProponent · February 21, 2024, 4:39pm

One thing that would be helpful is to have the default search order by available translations. English search terms like “Greed” will pull up texts which only have their titles translated. I’m sure this is helpful for some people who want to search in English but do their own translation in root texts, but if a major goal of the site / search is to help people read translated EBTs in a language they know, then the current functionality is a bit suboptimal.

If this is the root issue, why not just apply a simple discount, like multiplying by length (or some function of length, like log length)?

I think a part of a good benchmark might be that a search for right view returns MN9 at the top.

You don’t want really long texts which just briefly touch on a subject to rise to the top, but you also want longer texts which directly focus on a topic to be treated fairly.
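A toy sketch of the length adjustment suggested above, assuming the ranker hands back one score per text and that length is measured in words. The function name and the numbers are invented; they only show that a log factor can lift a long text over a short one without letting length dominate.

```python
import math

def length_adjusted(score: float, length_in_words: int) -> float:
    """Multiply a relevance score by log(1 + length) to offset a short-text bias."""
    return score * math.log(1 + length_in_words)

# Invented scores: the short text wins on the raw score alone.
short_text = {"score": 0.5, "length": 50}
long_text = {"score": 0.3, "length": 5000}

raw_winner = "short" if short_text["score"] > long_text["score"] else "long"
adjusted_winner = (
    "short"
    if length_adjusted(short_text["score"], short_text["length"])
    > length_adjusted(long_text["score"], long_text["length"])
    else "long"
)

print(raw_winner, adjusted_winner)  # short long
```

Because the factor grows only logarithmically, a text that is 100 times longer gets a boost of roughly 2x here rather than 100x. Whether that is enough to keep long texts with a passing mention from rising to the top would need checking against a benchmark like the "right view returns MN9" one.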

Snowbird · February 21, 2024, 6:32pm

Thanks so much for your feedback!

I’m not sure I quite understand. Could you say this in a different way?

Currently, title matches are shown as suttaplex cards in the right column on desktop and at the top of the page on mobile.

When viewed on mobile, common terms like this can create the illusion that only items with title matches are returned.

That’s a very good point. Currently it seems to be third in the right column on desktop. It could automatically be brought to the top in this case if

we sorted by longest sutta first
or we sorted by “recommended suttas first”

To me, the second option seems more reliable.

Greed has a different problem. The first ten title results are all for the abbreviated/repetition series at the end of AN chapters. I don’t want to disparage any sutta ever, but I doubt those are the suttas people want most.

DonatorProponent · February 21, 2024, 8:24pm

Ah, yes I guess I just misunderstood what I was seeing.

This may not belong in this thread, but I do find the cards’ presence / presentation in search somewhat odd. And you can get cases where they appear for texts with no translations, e.g. if I search “seven buddhas” (in English) I get “suttaplex cards” for two untranslated Chinese texts.

or we sorted by “recommended suttas first”

To me, the second option seems more reliable.

I don’t know exactly what you mean in terms of an underlying implementation, but I would worry that something like a “recommended” flag could lead to unintended consequences. For example, there are a lot of AN entries that are much more relevant for the search term “harsh speech” (which occurs once in MN9’s Bodhi translation). You wouldn’t want “good” but tangential suttas being sorted above more directly relevant suttas.

Snowbird · February 21, 2024, 9:07pm

I agree it is confusing. I’m happy to get any and all feedback.

I also find it odd.

I think the idea is that translated texts will appear first (although I’m not 100% sure this is happening everywhere and in all ways). In this case there were no translated suttas that have the words in the title.

Also a very good point! Currently there are only a limited number of suttas that have been given any kind of recommendation status. And as you point out it is for the whole sutta. Perhaps if more suttas could be given this recommendation status it would mitigate the problem.

Search is hard!!!