Hmm. You used to be able to at least search for sutta names, but something seems to have happened to that. Perhaps there is a flag in the robots.txt file that is preventing the site from being indexed.
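For reference, the kind of robots.txt directive that would block all crawlers looks like this (a hypothetical example, not SC's actual file):

```text
User-agent: *
Disallow: /
```

If something like that were present, neither Google nor Bing should be indexing anything, so it's worth checking what the live file actually says.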
I didn’t try Bing!
But I think something needs to be done, as not having it indexed in Google is potentially stopping people from finding SuttaCentral!
Maybe @blake can help??
I agree, but my point was that if the Bing robot is scanning the site, it seems unlikely that the Google robot is being deliberately made unwelcome. Perhaps the new technology being used on the site is confusing the Googlebot…
There is something there, since:
site:suttacentral.net mn 5
does give a link (a little way down the page, at least for me):
Can we do anything about it? Maybe create an index behind the scenes so the bot can make the right links and associations with keywords and text strings?
You can use an “advanced” sitemap in xml to tell Google how important certain pages or sections are and how frequently the Googlebot should crawl. Of course, these are only suggestions to the Googlebot — it does what it wants!
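For illustration, a minimal entry in such an XML sitemap might look like this (the URL, frequency, and priority values here are just made-up examples):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://suttacentral.net/mn5</loc>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

As noted, `changefreq` and `priority` are hints, not commands; Googlebot treats them as suggestions at best.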
The way a web crawler like Googlebot works is that it basically follows hyperlinks, both internal and external. Indexing and ranking within SERPs (Search Engine Results Pages) still works mostly on the quantity of quality links (these days) pointing to a page. Even though it's much more sophisticated now than when Larry Page wrote PageRank, links are still probably the number one factor in ranking.
Yes, I also used to have an issue with D&D pages clogging up returned results, but you can add a negative search to filter those out (also for legacy pages):
site:suttacentral.net -site:*.suttacentral.net (to filter out everything but results from the main site)
Applying this to the above MN149 example, I do get three results (Italian, Portuguese & Russian), but it is still proper weird that the main sutta page isn’t returned. I’ll try to follow up.
Well, looks like something in the interaction between our servers, Cloudflare, and Googlebot broke down and pages stopped being indexed a couple of weeks ago. I managed to coerce it into working again so they should be getting indexed again.
At the moment we don't have a sitemap; we'd probably want to figure out which URLs are important, in the vain hope that the Googlebot will prioritize them.
Google will accept a simple sitemap of just URLs listed out in a text file, one per line (saved as sitemap.txt in the "document root"). I have confidence @blake could generate this programmatically.
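As a sketch of what that generation might look like (the sutta uids below are placeholders; a real version would pull the full list from SC's data):

```python
# Sketch: write a plain-text sitemap, one fully qualified URL per line,
# to sitemap.txt in the document root. The uids here are placeholder
# examples, not the real SuttaCentral catalogue.
BASE = "https://suttacentral.net"
uids = ["mn1", "mn5", "mn149", "dn1", "sn1.1"]

def build_sitemap(uids, base=BASE):
    # Google's plain-text sitemap format: one absolute URL per line.
    return "\n".join(f"{base}/{uid}" for uid in uids) + "\n"

with open("sitemap.txt", "w", encoding="utf-8") as f:
    f.write(build_sitemap(uids))
```

The file can then be submitted via Search Console or referenced from robots.txt with a `Sitemap:` line.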
Not sure if SC already has Google Search Console (previously known as Webmaster Tools) set up, but that should help with identifying crawl errors.
We haven’t really focused on this. What has changed, though, is that Google’s bot has finally been updated; it was previously based on an old Chrome version that didn’t play nice with advanced sites like SC. So that should be better now.