Discussion: How should search results be ranked 📃?

In other words, when there is more than one result, in what order should they appear?

Currently, when a keyword is found in a text title, the result is displayed as a suttaplex card. On mobile these cards are all grouped at the top of the list; on desktop they appear in the right-hand column, separate from results where the word is found in the body of the text.


Here are some ideas gathered so far:

  • If keyword is in title (currently used on site)
  • Length of document
  • Frequency of keyword in document
  • Using beginner/intermediate/advanced tags
  • Number of footnotes/annotations
  • Number of translations of a document (because more translations might indicate importance)
  • The method on SC-Voice: The relevance score is simply the sum of the number of matches plus the fraction of matching segments.
  • DN → MN → SN → AN → Kp → Dhp → … → Vinaya → Abhidhamma?
  • Sutta > Vinaya > Abhidhamma
  • EBTs first
  • Segmented translations first

You had asked in the other thread why AN results appear first. It could be an alphabetical thing. It could also be that the TF-IDF ranker tends to put shorter documents ahead of longer ones. In retrospect, that’s probably not the right approach for us. We probably want to encourage people to read longer suttas, right?

3 Likes

I limited a search to in:sn and it sorted them “alphabetically” in the sense of sn22.111 coming before sn22.99 etc.
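That ordering is what a plain string sort produces. A minimal Python sketch of a numeric-aware sort key, assuming UIDs shaped like `sn22.99` (the function name `natural_key` is made up for illustration and is not from the SuttaCentral codebase):

```python
import re

def natural_key(uid: str):
    """Split a UID such as 'sn22.111' into text and integer parts.

    re.split with a capturing group always alternates text, digits, text, ...
    so keys built this way compare str with str and int with int.
    """
    parts = re.split(r"(\d+)", uid)
    return [int(p) if p.isdigit() else p for p in parts]

uids = ["sn22.111", "sn22.99", "sn3.1", "sn22.100"]

print(sorted(uids))                   # string sort: sn22.111 lands before sn22.99
print(sorted(uids, key=natural_key))  # numeric-aware: sn3.1, sn22.99, sn22.100, sn22.111
```

This only fixes the order within a collection. It says nothing about which sutta is more relevant.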

See, that’s where I start to feel shaky. I guess all things being equal that’s not necessarily bad. But it doesn’t increase the chances of the sutta I’m looking for being closer to the top.

I think I tend to use the search to find a sutta I already know exists. Which is not the only way! If someone is just looking for a text about consciousness, then yeah, maybe a longer sutta would be more appropriate? But I’m really not sure. A very long sutta could mention consciousness in passing and not offer much more than a short sutta would. And if we went by sutta length, then we might as well just rank DN and MN first.

As far as segmented suttas go, it might be valuable to prioritize suttas where the keyword was found in a longer segment. That would downgrade all of the results where the only excerpt is “…feeling” etc.

However that breaks down with the non-segmented texts. They seem to treat the whole paragraph as a segment regardless.

Perhaps we could use the number of times a keyword appears in the sutta. A sutta that has the word “consciousness” 10 times might be more relevant than one that only mentions it once or twice.

I’m trying to think of metrics we have about suttas that could even be used. We have the beginner, intermediate, advanced quality. But should beginner or advanced be prioritized? The vast majority of suttas don’t have that quality assigned, so maybe simply having one of those means that a human has decided it is of special importance.

We also know how many translations exist for any given sutta. In theory the more translations exist for a sutta the more important it is.

We know how many footnotes each sutta has. Should a highly commented on sutta rank higher?

2 Likes

In Voice, we use a relevance score for ranking search results.

Search results are sorted by relevance. The relevance score is simply the sum of the number of matches plus the fraction of matching segments. Suttas densely packed with search terms have highest relevance.

See About Voice, under “search results”.
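To make that description concrete, here is one possible reading of it as a Python sketch. This is not the actual Voice code: the function name, the case-insensitive substring matching, and the sample segments are all assumptions for illustration.

```python
def relevance(segments: list[str], term: str) -> float:
    """Number of matches plus the fraction of segments that contain a match."""
    term = term.lower()
    counts = [segment.lower().count(term) for segment in segments]
    matches = sum(counts)
    matching_segments = sum(1 for c in counts if c > 0)
    fraction = matching_segments / len(segments) if segments else 0.0
    return matches + fraction

# Made-up segments, only to show the arithmetic.
sample = [
    "Feeling is impermanent.",
    "Perception is impermanent.",
    "So I have heard.",
]
print(relevance(sample, "impermanent"))  # 2 matches + 2/3 of segments
```

Under this reading the match count dominates, and the fraction (always between 0 and 1) mostly separates texts with the same number of matches in favour of the denser one.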

This was done precisely to avoid AN results always being displayed first; by default, the software would use alphabetical order, as probably any software does if not told otherwise.

BUT … by default Voice only searches in

  • Pali
  • segmented
  • from the Mahasangiti manuscript, or basically, texts that have a translation by Bhante Sujato
  • suttas, no vinaya, no abhidhamma

So many of the problems SC has to solve do not apply here.

2 Likes

This is what I would have expected to be the primary weighting.

Then, if there are two or more suttas with the same score, I think your suggestion about the number of translations is reasonable.
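As a sketch of that two-level ordering in Python (the `Hit` class, its field names, and the numbers are all hypothetical, just to show the sort key):

```python
from dataclasses import dataclass

@dataclass
class Hit:
    uid: str           # placeholder identifier, not a real text
    score: float       # primary relevance score, however it is computed
    translations: int  # number of translations known for the text

def rank(hits: list[Hit]) -> list[Hit]:
    """Sort by score (highest first); break ties by number of translations."""
    return sorted(hits, key=lambda h: (-h.score, -h.translations))

hits = [
    Hit("a", score=3.5, translations=2),
    Hit("b", score=3.5, translations=7),
    Hit("c", score=4.0, translations=1),
]
print([h.uid for h in rank(hits)])  # ['c', 'b', 'a']
```

The translation count only matters when scores are exactly equal, so it can never push a less relevant text above a more relevant one.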

However, if I’m looking for a sutta with an elephant in it, then I might not be looking for the most densely ‘elephanted’ sutta, as opposed to if I’m looking for a sutta on consciousness. In this instance the title and the summary of the sutta would be useful in the weighting… which takes us back to the other thread on search criteria/filters. What about in:title and in:summary? Either way, these criteria seem important.

Uggh search is complex.

3 Likes

We currently have title:elephant. It’s not very flexible because it only takes a single keyword. I had this idea about how it could be implemented differently, but @Khemarato.bhikkhu has given me doubts about it.

However, the title is already given priority, just not in a way that I really like:

It puts them in the right column as SuttaPlex cards. I don’t like it for many reasons. It separates them out, so now there are two separate rankings. And being separate it’s easy to miss one or the other. It’s also not always obvious why a result is being given if one author puts that specific word in the title and another one doesn’t (see example below). And in the example above, the second title-first result has nothing at all to do with elephants! And with the suttaplex cards we get no excerpt. Just because a keyword is in a title doesn’t mean I don’t want to see the context of the word in the sutta itself.

Here is what the first result of title:elephant looks like:

I feel like the user would be better served by seeing some lines of text rather than the suttaplex card.

2 Likes

Just to put it out there: DN → MN → SN → AN → Kp → Dhp → … → Vinaya → Abhidhamma?

It’s really not obvious what order the searcher wants it to be in even with the context of their search content.

1 Like

I don’t know if I would always want to see DN first, but all other things being equal, perhaps it does make sense to show EBTs before everything else. Which in general would be what you are proposing I think? I just wouldn’t necessarily want the EBT results to be sorted that way.

Thanks!

1 Like

I don’t know if I like that ranking for the nikayas.
SN is the original categorised search! :stuck_out_tongue_winking_eye:

I don’t even think we should give a weighted ranking to the different Nikayas. I agree that Sutta > Vinaya > Abhidhamma within the context of this site.

I agree. I was a bit confused about having suttaplex cards on the right, but you can get used to anything!

Within the context of the search I would want to see the title, the translator (why is this called author?) and then the chunk of text where my search term is showing, with the term highlighted. I would also find it useful to know how many times that term appears in that particular text. (This is something that thebuddhaswords.net's very basic search does; maybe I have just become accustomed to it.)

2 Likes

Do you mean in the filter? Change `author:` fiter to `by:` · Issue #2970 · suttacentral/suttacentral · GitHub. See my moaning there.

Interesting idea. For segmented texts I believe you are shown up to three segments that have the term. So if you know that’s how it works, then you already get some indication (1, 2, or 3/3+).

I wonder if segmented translations should be given weight? That would mean for now de facto showing Bhante Sujato’s and Bhante Brahmali’s first.

1 Like

Why do you think the segmented translations are ‘better’? I’m not sure I can think of a reason.

2 Likes

Yeah, I can’t make a strong argument. And of course there is nothing inherently better about the translation itself.

But in terms of site functionality, they show better excerpts in the results. Kind of.

And they will take you to a text that can be viewed side by side with the Pali, which in general I would say is better. But of course that won’t matter to everyone. Also, those are the only texts that have translator notes. Basically the segmented texts give you the most complete SuttaCentral experience.

The segmented texts tend to be newer as well (although I just now realized that @cdpatton’s translations are legacy texts). And potentially, although not necessarily, they have more uniform translations, since they were done using Bilara, which promotes that.

At this point I’m just kind of throwing out any possible way we might be able to rank them that might in some way be useful.

1 Like

I think the biggest problem is that there are simply so many that almost any ranking may be better than no ranking at all … :TRIPLE_SIGH:

3 Likes

Hello, everyone.

That’s a great survey. Thank you ven @Snowbird for bringing it up. And thank you @HongDa for improvements.

It might be that my suggestions about filters are more on the grouping and sorting side of the already produced search results. But these are also a kind of filter, after all.

Here is an example with Kuṭhār search results on Suttacentral.net (partial match)

https://suttacentral.net/search?query=in:ebs%20Kuṭhār

And the same search on find.dhamma.gift

#1 IMO this kind of grouping, like I did on dhamma.gift, looks more user-friendly and makes the search results easier to work with. The output on SuttaCentral is fine for 13 texts, but with 50+ texts it’ll become almost unmanageable to decide which texts might help more than others.

#2 The other big thing for the user is results aggregated by words. That can be a very important feature for the partial match search.

#3 is really minor but can be crucial for the user: showing variants of the word, as in the part “Variants for Kuṭhār”.

PS: I gave test links for fdg. If you see some tech info or errors, please don’t mind. I just wanted to show output that might improve the visualization of search results for users on SC.net.

1 Like

:sweat_smile:

What about sorting by number of parallels of that specific part of the text? That signifies some importance.

There’s also sorting by number, then by nikaya.

DN1 → MN1 → … Dhp1 →
DN2 → MN2 → … Dhp2 →
DN3…

Or with SN first.
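A rough Python sketch of that interleaved order, assuming simple UIDs like `dn1` or `dhp2` where the first number is the one to sort on. The names `NIKAYA_ORDER` and `number_then_nikaya` are made up for illustration.

```python
import re

# Assumed collection order; swap it around to put SN first.
NIKAYA_ORDER = ["dn", "mn", "sn", "an", "kp", "dhp"]

def number_then_nikaya(uid: str, order=NIKAYA_ORDER):
    """Sort key: the first number in the UID, then the collection's position."""
    match = re.match(r"([a-z]+)(\d+)", uid)
    if match is None:
        # Unrecognised UIDs go to the very end.
        return (float("inf"), len(order))
    prefix, number = match.group(1), int(match.group(2))
    position = order.index(prefix) if prefix in order else len(order)
    return (number, position)

uids = ["mn2", "dn2", "dhp1", "mn1", "dn1"]
print(sorted(uids, key=number_then_nikaya))
# ['dn1', 'mn1', 'dhp1', 'dn2', 'mn2']

sn_first = ["sn", "dn", "mn", "an", "kp", "dhp"]
print(sorted(["dn1", "sn1", "mn1"], key=lambda u: number_then_nikaya(u, sn_first)))
# ['sn1', 'dn1', 'mn1']
```

For UIDs like `sn22.99` it is not obvious which number should count, so this sketch just takes the first one.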

I’m just putting out ideas in case it inspires an actual good layout.

1 Like

I created this issue previously that kind of addresses your suggestion:

My personal feeling is that the Digital Pali Reader does such a good job at search for Pali that it’s not worth it to try and replicate it on SuttaCentral.

But if we do, I would like to see those partial match hits as well as breakdown by book.

Ideally I might like to see filter suggestions customized to the results (i.e. only show potential filters that would have some effect on the results)
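One way this could work, sketched in Python: count how many results fall in each collection and only suggest the `in:` filters that would actually narrow the list. The data shape (pairs of UID and collection) and the function name are assumptions, not how the site stores results.

```python
from collections import Counter

def useful_filters(hits: list[tuple[str, str]]) -> dict[str, int]:
    """Map each worthwhile 'in:<collection>' filter to its result count.

    A filter that matches every result would change nothing, so it is dropped.
    Collections with no results never appear in the counts at all.
    """
    counts = Counter(collection for _uid, collection in hits)
    total = len(hits)
    return {f"in:{name}": n for name, n in counts.items() if n < total}

hits = [("sn22.99", "sn"), ("sn22.111", "sn"), ("mn9", "mn")]
print(useful_filters(hits))  # {'in:sn': 2, 'in:mn': 1}
```

The counts could also be shown next to each suggested filter, which would give the breakdown by book for free.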

That would avoid having all of the same book at the top of the results. But I’m not sure if that’s the best way to do it.

One thing that would be helpful is to have the default search order by available translations. English search terms like “Greed” will pull up texts which only have their titles translated. I’m sure this is helpful for some people who want to search in English but do their own translation in root texts, but if a major goal of the site / search is to help people read translated EBTs in a language they know, then the current functionality is a bit suboptimal.
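A small Python sketch of what that could look like, assuming each result carries the set of languages it has been translated into (the UIDs `t1` and `t2` stand in for untranslated texts and are placeholders). Python's sort is stable, so the existing relevance order is kept within each group.

```python
def translated_first(hits: list[tuple[str, set[str]]], lang: str) -> list[tuple[str, set[str]]]:
    """Move results with a translation in `lang` ahead of the rest.

    `hits` is assumed to be in relevance order already. The key is False for
    translated texts and True for the others, and False sorts before True.
    """
    return sorted(hits, key=lambda hit: lang not in hit[1])

hits = [
    ("t1", set()),            # title match only, no translation
    ("mn9", {"en", "de"}),
    ("t2", {"zh"}),
    ("an3.1", {"en"}),
]
print([uid for uid, _ in translated_first(hits, "en")])
# ['mn9', 'an3.1', 't1', 't2']
```

Untranslated texts are still returned, just lower down, so people who want the root texts can still find them.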

If this is the root issue, why not just apply a simple discount, like multiplying by length (or some function of length, like log length).

I think a part of a good benchmark might be that a search for right view returns MN9 at the top.

You don’t want really long texts which just briefly touch on a subject to rise to the top, but you also want longer texts which directly focus on a topic to be treated fairly.
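To illustrate the idea with made-up numbers, here is a Python sketch where plain keyword density stands in for the real TF-IDF score (an assumption, the actual ranker is more involved) and the adjustment multiplies by log length. The three "documents" are hypothetical (match count, word count) pairs.

```python
import math

def density(matches: int, length: int) -> float:
    """Matches per word: favours short texts, much like the current ranking."""
    return matches / length

def length_adjusted(matches: int, length: int) -> float:
    """Density multiplied by log(1 + length), as one possible adjustment."""
    return density(matches, length) * math.log(1 + length)

# Hypothetical documents: (number of matches, length in words).
docs = {
    "short": (1, 10),
    "long_focused": (40, 1000),
    "long_passing": (1, 1000),
}

by_density = sorted(docs, key=lambda d: density(*docs[d]), reverse=True)
by_adjusted = sorted(docs, key=lambda d: length_adjusted(*docs[d]), reverse=True)

print(by_density)   # ['short', 'long_focused', 'long_passing']
print(by_adjusted)  # ['long_focused', 'short', 'long_passing']
```

With these numbers the long text that really focuses on the term overtakes the short one, while the long text that only mentions it in passing stays at the bottom. Whether this would put MN9 first for right view would need testing against real data.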

1 Like

Thanks so much for your feedback!

I’m not sure I quite understand. Could you say this in a different way?

Currently, title matches are shown as suttaplex cards in the right column on desktop and at the top of the page on mobile.

When viewed on mobile, common terms like this create the illusion that only items with title matches are returned.

That’s a very good point. Currently it seems to be third in the right column on desktop. It could automatically be brought to the top in this case if

  • we sorted by longest sutta first
  • or we sorted by “recommended suttas first”

To me, the second option seems more reliable.

Greed has a different problem. The first ten title results are all for the abbreviated/repetition series at the end of AN chapters. I don’t want to disparage any sutta ever, but I doubt those are the suttas people want most.

Ah, yes I guess I just misunderstood what I was seeing.

This may not belong in this thread, but I do find the cards’ presence / presentation in search somewhat odd. And you can get it where they appear for texts with no translations, e.g. if I search “seven buddhas” (in English) I get “suttaplex cards” for two untranslated Chinese texts.

  • or we sorted by “recommended suttas first”

To me, the second option seems more reliable.

I don’t know exactly what you mean in terms of an underlying implementation, but I would worry something like a “recommended” flag could lead to unintended consequences. For example, there are a lot of AN entries which are much more relevant for the search term “harsh speech” (which occurs once in MN9’s Bodhi translation). You wouldn’t want “good” but tangential suttas being sorted above more directly relevant suttas.

I agree it is confusing. I’m happy to get any and all feedback.

I also find it odd.

I think the idea is that translated texts will appear first (although I’m not 100% sure this is happening everywhere and in all ways). In this case there were no translated suttas that have the words in the title.

Also a very good point! Currently there are only a limited number of suttas that have been given any kind of recommendation status. And as you point out it is for the whole sutta. Perhaps if more suttas could be given this recommendation status it would mitigate the problem.

Search is hard!!!