Why do you think the segmented translations are “better”? I’m not sure I can think of a reason.
Yeah, I can’t make a strong argument. And of course there is nothing inherently better about the translation itself.
But in terms of site functionality, they show better excerpts in the results. Kind of.
And they will take you to a text that can be viewed side by side with the Pali, which in general I would say is better. But of course that won’t matter to everyone. Also, those are the only texts that have translator notes. Basically the segmented texts give you the most complete SuttaCentral experience.
The segmented texts tend to be newer as well (although I just now realized that @cdpatton’s translations are legacy texts). And they potentially, although not necessarily, have more uniform translations, since they were done using Bilara, which promotes that.
At this point I’m just kind of throwing out any possible way we might be able to rank them that might in some way be useful.
I think the biggest problem is that there are simply so many that it seems almost any ranking may be better than no ranking at all … :TRIPLE_SIGH:
Hello, everyone.
That’s a great survey. Thank you ven @Snowbird for bringing it up. And thank you @HongDa for improvements.
My suggestions about filters may be more on the grouping and sorting side of the already produced search results. But these are also a kind of filter after all.
Here is an example with Kuṭhār search results on Suttacentral.net (partial match)
https://suttacentral.net/search?query=in:ebs%20KuáčhÄr
And the same search on find.dhamma.gift
#1 IMO, grouping like I did on dhamma.gift looks more user friendly and makes the search results easier to work with. The output on SuttaCentral is fine for 13 texts, but with 50+ texts it will become almost unmanageable to decide which texts might help more than others.
#2 The other big thing for the user is results aggregated by words. That can be a very important feature for the partial match search.
#3 is really minor but can be crucial for the user: showing variants of the word, like
in the part “Variants for Kuṭhār”
PS: I gave test links for fdg. If you see some tech info or errors, please don’t mind. I just wanted to show output that might improve the search result visualization for users on SC.net.
What about sorting by number of parallels of that specific part of the text? That signifies some importance.
There’s also sorting by number, then by nikaya.
DN1 → MN1 → … Dhp1 →
DN2 → MN2 → … Dhp2 →
DN3…
Or with SN first.
I’m just putting out ideas in case it inspires an actual good layout.
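The "by number, then by nikaya" order above can be sketched as a two-part sort key. This is only a rough illustration: the `NIKAYA_ORDER` list, the `sort_key` and `interleave` names, and the reference format (`DN1`, `SN 56.11`) are assumptions made for the example, not anything SuttaCentral actually uses.

```python
import re

# Hypothetical collection order; move "SN" to the front for the "SN first" variant.
NIKAYA_ORDER = ["DN", "MN", "SN", "AN", "Dhp"]


def sort_key(uid, order=NIKAYA_ORDER):
    """Build a key that sorts by leading number first, then by collection."""
    match = re.fullmatch(r"([A-Za-z]+)\s*(\d+)(?:\.(\d+))?", uid)
    if match is None:
        raise ValueError(f"unrecognised reference: {uid!r}")
    collection, first, second = match.groups()
    # Collections missing from the order list sort after the known ones.
    rank = order.index(collection) if collection in order else len(order)
    return (int(first), rank, int(second or 0))


def interleave(uids, order=NIKAYA_ORDER):
    """Return references ordered as DN1, MN1, ..., DN2, MN2, ..."""
    return sorted(uids, key=lambda uid: sort_key(uid, order))


results = ["MN2", "DN1", "Dhp1", "DN2", "MN1", "Dhp2", "DN3"]
print(interleave(results))
```

Passing a different `order` list (for example one starting with `"SN"`) gives the alternative layout without touching the rest of the code.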
I created this issue previously that kind of addresses your suggestion:
My personal feeling is that the Digital Pali Reader does such a good job at search for Pali that it’s not worth it to try and replicate it on SuttaCentral.
But if we do, I would like to see those partial match hits as well as breakdown by book.
Ideally I might like to see filter suggestions customized to the results (i.e. only show potential filters that would have some effect on the results)
That would avoid having all of the same book at the top of the results. But I’m not sure if that’s the best way to do it.
One thing that would be helpful is to have the default search order by available translations. English search terms like “Greed” will pull up texts which only have their titles translated. I’m sure this is helpful for some people who want to search in English but do their own translation in root texts, but if a major goal of the site / search is to help people read translated EBTs in a language they know, then the current functionality is a bit suboptimal.
If this is the root issue, why not just apply a simple discount, like multiplying by length (or some function of length, like log length)?
I think a part of a good benchmark might be that a search for right view returns MN9 at the top.
You don’t want really long texts which just briefly touch on a subject to rise to the top, but you also want longer texts which directly focus on a topic to be treated fairly.
Thanks so much for your feedback!
I’m not sure I quite understand. Could you say this in a different way?
Currently, title matches are shown as suttaplex cards in the right column on desktop and at the top of the page on mobile.
When viewed on mobile, common terms like this can create the illusion that only items with title matches are returned.
That’s a very good point. Currently it seems to be third in the right column on desktop. It could automatically be brought to the top in this case if
- we sorted by longest sutta first
- or we sorted by “recommended suttas first”
To me, the second option seems more reliable.
Greed has a different problem. The first ten title results are all for the abbreviated/repetition series at the end of AN chapters. I don’t want to disparage any sutta ever, but I doubt those are the suttas people want most.
Ah, yes I guess I just misunderstood what I was seeing.
This may not belong in this thread, but I do find the cards’ presence / presentation in search somewhat odd. And you can get cases where they appear for texts with no translations, e.g. if I search “seven buddhas” (in English) I get “suttaplex cards” for two untranslated Chinese texts.
- or we sorted by “recommended suttas first”
To me, the second option seems more reliable.
I don’t know exactly what you mean in terms of an underlying implementation, but I would worry something like a “recommended” flag could lead to unintended consequences. For example, there are a lot of AN entries which are much more relevant for the search term “harsh speech” (which occurs once in MN9’s Bodhi translation). You wouldn’t want “good” but tangential suttas being sorted above more directly relevant suttas.
I agree it is confusing. I’m happy to get any and all feedback.
I also find it odd.
I think the idea is that translated texts will appear first (although I’m not 100% sure this is happening everywhere and in all ways). In this case there were no translated suttas that have the words in the title.
Also a very good point! Currently there are only a limited number of suttas that have been given any kind of recommendation status. And as you point out it is for the whole sutta. Perhaps if more suttas could be given this recommendation status it would mitigate the problem.
Search is hard!!!
What exactly is a “recommendation status”?
If this is the little symbols for “beginner”, “middle”, or “advanced” status represented by a seedling etc., this has nothing at all to do with the relevance of a sutta for a particular search term, does it?
No, but it does indicate that the sutta is somehow valuable/important. All other things being equal, I think it makes sense to prioritize valuable over not.
I’m open to anything, though. (and yes, I was talking about suttas with any of those three categories)
I am not so sure about this judgement. There are many very valuable suttas in SN or AN, but these icons are only found on MN and DN suttas. It’s a bit one-sided in my view.
Right! I guess my theory is based on accurate recommendation data. I wonder if those recommendations were only added to MN and DN so the feature had data to work with and then the recommending just stopped. It would be an interesting project to add to SN and AN.
As far as I know all MN and DN suttas have one of the icons*, and all other suttas don’t have them, or am I wrong? (* I checked again, and this is indeed the case; so using them for ranking search results just means ranking MN and DN at the top.)
And I also understood that they were not meant to say that these suttas are any more valuable than others, but just to give a grouping within these Nikayas about the level of expertise recommended for reading them. Basically, so that a newcomer doesn’t start with MN 1 and get frustrated.
That’s why I didn’t know at first what you meant by “recommendation”.
And I didn’t understand the absence of an icon as meaning that a sutta is less valuable, but that it’s not easy to place it in a category between beginners and advanced readers. Or that simply no one found the time to do it.
Added:
On the SC licences page we find:
We use several icons from the Noun Project, kindly released via Creative Commons Attribution (CC BY 3.0 US).
- “Difficulty” icons created by Alena Artemova.
So they are not meant to indicate value, but difficulty.
They are also mentioned in the same sense here:
Thanks so much for doing all that research. Somehow it escaped me that all the suttas there had a ranking.
Indeed. I’m very sympathetic to you and anyone working to try and make it better.
Again, I’ll just throw out the idea of trying to apply some sort of formulaic discounting of the tf-idf score for these texts, improving the score for longer documents.
I did a little experiment and found that for the term “meditate”, plain tf-idf gives the top 5 spots to AN texts; multiplying the score by the natural log of the text length gets sn52.1 in; and multiplying by length gets DN22 and MN10 their (IMO) proper place as #1 and #2. Similar results happen for “mindfulness”.
The tf-idf formula I used is the default in tidytext, which is:
tf = occurrences of term / length of document
idf = ln(total number of documents / number of documents with at least one occurrence of the given term)
and tf-idf = tf*idf
So, my “best” discounting of tf*idf (multiplying by length) is equivalent to idf * raw count of term occurrences, since the length cancels out of tf.
Confusingly, sometimes that raw count is itself called “term frequency”, even in the context of tf-idf analysis. IDK what the current algo is using under the hood.
There might be some superior method to what I’ve devised (it IMO incorrectly ranks DN30 above any SN29 text when the search term is “dragons”), but the root issue is that we’re comparing sentence fragments to essays and narratives. Traditional tf-idf gives an artificial edge to a text like an1.416, whose whole text is “the faculty of mindfulness”, vs. “here’s the story of one time the Buddha spoke on many different aspects of mindfulness”.
Another way to think of it is that using this modified version of tf-idf is like ranking suttas by how much they have to say on a topic instead of what % of the text is purely targeting the topic.
Messing around further as I wrote this, I found that tfidf*log(length)*sqrt(length) passed both my personal “meditate” and “dragon” tests, but it obviously lacks any sort of interpretability or justification.
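For anyone who wants to play with these variants, here is a minimal Python sketch of the formulas described above (the original experiment used tidytext in R, so this is a re-implementation, not the same code). The function names, the whitespace tokenizer, and the tiny `TOY_DOCS` corpus are all assumptions made for illustration; only the "fragment" text is taken from the an1.416 example, and the other two documents are placeholder filler.

```python
import math


def tokenize(text):
    # Naive whitespace tokenizer; a real search index would do much more.
    return text.lower().split()


def score_documents(docs, term, discount="none"):
    """Score each document containing `term` with tf-idf times a length factor.

    docs: dict mapping a document id to its text.
    discount: "none", "log", "length", or "log_sqrt".
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    containing = sum(1 for toks in tokens.values() if term in toks)
    if containing == 0:
        return {}
    idf = math.log(len(docs) / containing)
    scores = {}
    for doc_id, toks in tokens.items():
        count = toks.count(term)
        if count == 0:
            continue
        length = len(toks)
        tf = count / length
        if discount == "none":
            factor = 1.0
        elif discount == "log":
            factor = math.log(length)  # note: zero for one-word documents
        elif discount == "length":
            factor = length  # tf * length is just the raw count
        elif discount == "log_sqrt":
            factor = math.log(length) * math.sqrt(length)
        else:
            raise ValueError(f"unknown discount: {discount!r}")
        scores[doc_id] = tf * idf * factor
    return scores


def rank(docs, term, discount="none"):
    scores = score_documents(docs, term, discount)
    return sorted(scores, key=scores.get, reverse=True)


# Toy corpus (invented filler, not real sutta text apart from the fragment).
TOY_DOCS = {
    "fragment": "the faculty of mindfulness",
    "long_discourse": " ".join(["mindfulness of breathing"] * 3 + ["filler"] * 21),
    "unrelated": "a text about something else entirely",
}

for discount in ("none", "log", "length", "log_sqrt"):
    print(discount, rank(TOY_DOCS, "mindfulness", discount))
```

On this toy corpus the short fragment wins under plain tf-idf and the longer document wins once the score is multiplied by length, which is the effect described above. Whether the same holds on the real corpus depends on the tokenizer and on whatever the current search uses under the hood.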
All that is very interesting! I think we may need your help when we get into the nitty-gritty of ranking.
Thanks!
Just wanted to point out a variation of this idea. The number of parallels is the most common measure of importance of a topic that I’ve seen people genuinely try to use. e.g. more mentions of jhāna means they’re more important/reliable.
The number of repeat results can be used instead of parallels if that is too hard to measure. It’s not easy to say whether repeat results are referencing each other or if it’s just a common phrase, and some results are very similar with small differences (even punctuation, which could be ignored if the search contains no punctuation, or through some setting). But it would still give results that are easier to read, are less cluttered, and can be gone through by uniqueness. The groups are then ordered by highest number of repeats, since that could mean more important.
Personally, that would make searching a lot easier for me. One other reason is that I’m often looking for a term which was part of a longer phrase, and it may be hundreds of results down in a list full of repeats, so I have to memorize which ones I’ve seen so far.
So when I search punabbhavo, I get 168 results. Scrolling through it, there are many repeat results, but this way it would be organized like:
- [15 repeats] AN 3.103, AN 3.104, AN 7.50, MN 26, Kd 1, Ps 2.6, SN 14.31, … SN 56.11
  ‘akuppā me vimutti, ayamantimā jāti, natthi dāni punabbhavo’”ti.
- [3 repeats] AN 4.1, DN 16, Kv 1.5
  Tayidaṁ, bhikkhave, ariyaṁ sīlaṁ anubuddhaṁ … ucchinnā bhavataṇhā, khīṇā bhavanetti, natthi dāni punabbhavo
- [3 repeats] AN 8.64, AN 9.41, MN 128
  akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo
- [2 repeats] MN 98, Snp 3.9
  santo khīṇapunabbhavo
…
Singlets:
Snp 3.6
Vusitavā khīṇapunabbhavo sa bhikkhu
Bv 2
Dukkho punabbhavo nāma
…
I omit some content (the title, name of sutta, root language, whether aligned, language, original text, exact reference path), but that could still be included (not sure how important they really are, though, since if you wish to know about those details as they pertain to searching, then you could just search by them). Of course, it also ignores the context, but so does the current search, and one can still click on the sutta reference, linking to that same line.
The original results can just be mutated into a different data layout, so the original search process isn’t changed, which should be a relatively easy change. Also split by reference since some results contain multiple when the term appears on multiple lines. The display can get a little technically weird with title:
or when multiple languages are listed, but I don’t think that presents a big challenge as it would still just list out those results separately as described.
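As a rough sketch of that "mutate the results into a different layout" step, the grouping could look something like the Python below. It assumes the existing search already returns (reference, segment text) pairs; the `normalize` and `group_repeats` names and the `HITS` list are made up for the example (the sample segments are copied from the listing above), and the punctuation-stripping rule is just one possible choice.

```python
from collections import defaultdict


def normalize(segment):
    """Lowercase and drop punctuation so near-identical segments compare equal."""
    kept = (ch for ch in segment.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())


def group_repeats(hits):
    """Group (reference, segment) hits by normalized segment text.

    Returns (repeated, singlets); each item is (list_of_references, display_text).
    Repeated groups are ordered by number of repeats, highest first.
    """
    groups = defaultdict(list)
    display = {}
    for ref, segment in hits:
        key = normalize(segment)
        groups[key].append(ref)
        display.setdefault(key, segment)  # show the first spelling encountered
    repeated = [(refs, display[key]) for key, refs in groups.items() if len(refs) > 1]
    singlets = [(refs, display[key]) for key, refs in groups.items() if len(refs) == 1]
    repeated.sort(key=lambda item: len(item[0]), reverse=True)
    return repeated, singlets


# Hypothetical search output, using a few of the hits listed above.
HITS = [
    ("MN 98", "santo khīṇapunabbhavo"),
    ("AN 8.64", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo"),
    ("Snp 3.6", "Vusitavā khīṇapunabbhavo sa bhikkhu"),
    ("AN 9.41", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo"),
    ("Snp 3.9", "santo khīṇapunabbhavo"),
    ("Bv 2", "Dukkho punabbhavo nāma"),
    ("MN 128", "akuppā me cetovimutti, ayamantimā jāti, natthi dāni punabbhavo"),
]

repeated, singlets = group_repeats(HITS)
for refs, text in repeated:
    print(f"[{len(refs)} repeats] {', '.join(refs)}\n{text}")
print("Singlets:")
for refs, text in singlets:
    print(f"{refs[0]}\n{text}")
```

This only groups segments that are identical after normalization; deciding whether similar-but-not-identical segments belong together would need something fuzzier.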
Thanks for the suggestion.
So if I am understanding correctly, you aren’t just proposing a way to sort results, but to also somehow group the results when the search term is found in identical segments?
This would complicate things, especially since you can search for more than one word, and those words can appear anywhere in the whole sutta.