Discussion: How should search results be ranked 📃?

sabbamitta · February 21, 2024, 10:10pm

What exactly is a “recommendation status”?

If this is the little symbols for “beginner”, “middle”, or “advanced” status represented by a seedling etc., this has nothing at all to do with the relevance of a sutta for a particular search term—does it?

Snowbird · February 21, 2024, 10:53pm

No, but it does indicate that the sutta is somehow valuable/important. All other things being equal, I think it makes sense to prioritize valuable over not.

I’m open to anything, though. (and yes, I was talking about suttas with any of those three categories)

sabbamitta · February 22, 2024, 8:13am

I am not so sure about this judgement. There are many very valuable suttas in SN or AN, but these icons are only found on MN and DN suttas. It’s a bit one-sided in my view.

Snowbird · February 22, 2024, 8:15am

Right! I guess my theory is based on accurate recommendation data. I wonder if those recommendations were only added to MN and DN so the feature had data to work with and then the recommending just stopped. It would be an interesting project to add to SN and AN.

sabbamitta · February 22, 2024, 8:17am

As far as I know all MN and DN suttas have one of the icons*, and all other suttas don’t have them—or am I wrong? (* I checked again, and this is indeed the case; so using them for ranking search results just means ranking MN and DN at the top.)

And I also understood that they were not meant to say that these suttas are any more valuable than others, but just to give a grouping within these Nikayas about the level of expertise recommended for reading them. Basically, so that a newcomer doesn’t start with MN 1 and get frustrated.

That’s why I didn’t know at first what you meant by “recommendation”.

And I didn’t take the absence of an icon to mean that a sutta is less valuable, but that it’s not easy to place it in a category between beginners and advanced readers. Or simply that no-one found the time to do it.

Added:
On the SC licences page we find:

We use several icons from the Noun Project, kindly released via Creative Commons Attribution (CC BY 3.0 US).

“Difficulty” icons created by Alena Artemova.

So they are not meant to indicate value, but difficulty.

They are also mentioned in the same sense here:

Snowbird · February 22, 2024, 5:04pm

Thanks so much for doing all that research. Somehow it escaped me that all the suttas there had a ranking.

DonatorProponent · February 28, 2024, 9:34pm

Indeed. I’m very sympathetic to you and anyone working to try and make it better.

Again, I’ll just throw out the idea of trying to apply some sort of formulaic discounting of the tf-idf score for these texts, improving the score for longer documents.

I did a little experiment and found that for the term “meditate”, tf-idf gives the top 5 spots to AN texts. Multiplying the score by the natural log of the text length gets sn52.1 in, and multiplying by length gets DN22 and MN10 their (IMO) proper place as #1 and #2. Similar results happen for “mindfulness”.

The tf-idf formula I used is the default in tidytext, which is:
tf = occurrences of term / length of document
idf = ln(total number of documents / number of documents with at least one occurrence of the given term)

and tf-idf = tf*idf

So, my “best” discounting of tf*idf (multiplying by length) is equivalent to idf * raw count of term occurrences, since the length in the denominator of tf cancels out.
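For anyone who wants to check the arithmetic, here is a minimal sketch of the formula as quoted above. The function name `tf_idf_scores` and the three-document corpus are invented for illustration; this is not tidytext itself, nor whatever the current search uses.

```python
import math

def tf_idf_scores(term, docs):
    """Score every document for `term`.

    docs maps a document id to its list of tokens.
    Returns id -> (tf * idf, idf * raw count).
    """
    n_with_term = sum(1 for tokens in docs.values() if term in tokens)
    if n_with_term == 0:
        return {}  # term absent everywhere, nothing to rank
    idf = math.log(len(docs) / n_with_term)
    scores = {}
    for doc_id, tokens in docs.items():
        count = tokens.count(term)
        tf = count / len(tokens)
        scores[doc_id] = (tf * idf, idf * count)
    return scores

# Toy corpus, invented for illustration (not real sutta texts).
docs = {
    "fragment": "mindfulness is praised".split(),
    "discourse": ["mindfulness"] * 3 + ["filler"] * 27,
    "unrelated": ["filler"] * 10,
}
scores = tf_idf_scores("mindfulness", docs)
for doc_id, (plain, by_count) in scores.items():
    print(doc_id, round(plain, 4), round(by_count, 4))
```

On this toy data the three-word fragment wins under plain tf-idf, while the longer text wins once the score is multiplied by length, which is the same number as idf * raw count.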

Confusingly, sometimes that raw count is itself called “term frequency”, even in the context of tf-idf analysis. IDK what the current algo is using under the hood.

There might be some superior method to what I’ve devised (it IMO incorrectly ranks DN30 above any SN29 text when the search term is “dragons”), but the root issue is that we’re comparing sentence fragments to essays and narratives. Traditional tf-idf gives an artificial edge to a text like an1.416, whose whole text is “the faculty of mindfulness”, vs. “here’s the story of one time the Buddha spoke on many different aspects of mindfulness”.

Another way to think of it is that using this modified version of tfidf is like ranking suttas by how much they have to say on a topic instead of what % of the text is purely targeting the topic.

Messing around further as I wrote this, I found that tfidf*log(length)*sqrt(length) passed both my personal “meditate” and “dragon” tests, but obviously lacks any sort of interpretability or justification.
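A rough sketch of how these length multipliers could be compared side by side. The `BOOSTS` table, the `rank` helper, and the counts and lengths are all made up for illustration; they do not reproduce the experiment above.

```python
import math

# Each boost maps document length (in tokens) to a multiplier on plain tf-idf.
# Note that ln(length) is 0 for a one-token text, which zeroes its score.
BOOSTS = {
    "plain": lambda n: 1.0,
    "ln(length)": lambda n: math.log(n),
    "length": lambda n: float(n),
    "ln(length)*sqrt(length)": lambda n: math.log(n) * math.sqrt(n),
}

def rank(docs, idf, boost):
    """docs maps id -> (term count, length). Returns ids, best score first."""
    def score(item):
        count, length = item[1]
        return (count / length) * idf * boost(length)
    return [doc_id for doc_id, _ in sorted(docs.items(), key=score, reverse=True)]

# Hypothetical numbers: a 4-token fragment vs. a 5000-token discourse.
docs = {"fragment": (1, 4), "discourse": (20, 5000)}
idf = math.log(3 / 2)  # assumed: term appears in 2 of 3 documents

rankings = {name: rank(docs, idf, boost) for name, boost in BOOSTS.items()}
for name, order in rankings.items():
    print(f"{name:26s} {order}")
```

With these made-up numbers the fragment stays on top for plain tf-idf and for the ln(length) boost, and the discourse takes over for the other two. Keeping the boost as a swappable function makes it cheap to try other shapes against test queries.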

Snowbird · February 28, 2024, 9:38pm

All that is very interesting! I think we may need your help when we get into the nitty-gritty of ranking.

Thanks!

bran · April 24, 2024, 10:49am

Just wanted to point out a variation of this idea. The number of parallels is the most common measure of importance of a topic that I’ve seen people genuinely try to use. For example, more mentions of jhāna would mean it’s more important/reliable.

The number of repeat results can be used instead of parallels if that is too hard to measure. It’s not easy to say whether repeat results are referencing each other or if it’s just a common phrase, and some results are very similar with small differences (even punctuation, which could be ignored if the search contains no punctuation, or through some setting). Still, grouping them would give results that are easier to read, less cluttered, and can be gone through by uniqueness. The groups are then ordered by highest number of repeats, since that could mean more important.

Personally, that would make searching a lot easier for me. One other reason is that I’m often looking for a term which was part of a longer phrase. It may be hundreds of results down in a list full of repeats, and I have to memorize which ones I’ve seen so far.

So when I search punabbhavo, I get 168 results. Scrolling through them, there are many repeat results, but this way it would be organized like:

[15 repeats] AN 3.103, AN 3.104, AN 7.50, MN 26, Kd 1, Ps 2.6, SN 14.31, … SN 56.11
‘akuppā me vimutti, ayamantimā jāti, natthi dāni punabbhavo’”ti.
[3 repeats] AN 4.1, DN 16, Kv 1.5
Tayidaṁ, bhikkhave, ariyaṁ sīlaṁ anubuddhaṁ … ucchinnā bhavataṇhā, khīṇā bhavanetti, natthi dāni punabbhavo
[3 repeats] AN 8.64, AN 9.41, MN 128
akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo
[2 repeats] MN 98, Snp 3.9
santo khīṇapunabbhavo
…
Singlets:
Snp 3.6
Vusitavā khīṇapunabbhavo sa bhikkhu
Bv 2
Dukkho punabbhavo nāma
…

I omit some content (the title, name of sutta, root language, whether aligned, language, original text, exact reference path), but that could still be included (not sure how important they really are, though, since if you wish to know about those details as they pertain to searching, then you could just search by them). Of course, it also ignores the context, but so does the current search, and one can still click on the sutta reference, linking to that same line.

The original results can just be mutated into a different data layout, so the original search process isn’t changed, which should be a relatively easy change. Results would also be split by reference, since some results contain multiple references when the term appears on multiple lines. The display can get a little technically weird with title: or when multiple languages are listed, but I don’t think that presents a big challenge as it would still just list out those results separately as described.
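To make the regrouping idea concrete, here is a minimal sketch that post-processes an existing list of hits. It assumes each hit is a `(reference, segment_text)` pair, which is a guess at the data shape and not the actual search output. The names `normalize` and `group_repeats` are invented, and the sample hits are copied from the listing above with one trailing period added by hand to show the punctuation handling.

```python
import unicodedata
from collections import defaultdict

def normalize(segment):
    """Casefold and drop punctuation so near-identical segments compare equal."""
    kept = (ch for ch in segment.casefold()
            if not unicodedata.category(ch).startswith("P"))
    return " ".join("".join(kept).split())

def group_repeats(hits):
    """Group (reference, text) hits by normalized text.

    Returns a list of (references, first-seen text), largest group first,
    so singlets end up at the bottom.
    """
    groups = defaultdict(list)
    for ref, text in hits:
        groups[normalize(text)].append((ref, text))
    ordered = sorted(groups.values(), key=len, reverse=True)  # stable sort
    return [([ref for ref, _ in group], group[0][1]) for group in ordered]

hits = [
    ("MN 98", "santo khīṇapunabbhavo"),
    ("AN 8.64", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo"),
    ("Snp 3.6", "Vusitavā khīṇapunabbhavo sa bhikkhu"),
    ("AN 9.41", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo"),
    ("Snp 3.9", "santo khīṇapunabbhavo"),
    # Trailing period added by hand to mimic a punctuation-only difference.
    ("MN 128", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo."),
]

grouped = group_repeats(hits)
for refs, text in grouped:
    label = f"[{len(refs)} repeats]" if len(refs) > 1 else "Singlet:"
    print(label, ", ".join(refs))
    print(text)
```

Because this only reshapes hits that were already found, the search itself is untouched. Exact matching after normalization will not catch segments that differ by a word or two; those would need some fuzzier comparison.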

Snowbird · April 24, 2024, 9:41pm

Thanks for the suggestion.

So if I am understanding correctly, you aren’t just proposing a way to sort results, but to also somehow group the results when the search term is found in identical segments?

This would complicate things, especially since you can search for more than one word, and those words can appear anywhere in the whole sutta.