I recently tried to search for something within SuttaCentral using Google's "site:" operator and only managed to find material from this forum.
As a result, I'm asking: is SuttaCentral.net indexed in Google's databases?
Is it the case that SuttaCentral hasn't yet been indexed by Google, or is this the outcome of a specific, deliberate setting?
Hmm. You used to be able to at least search for sutta names, but something seems to have happened to that. Perhaps there is a rule in the robots.txt file that is preventing the site from being indexed.
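For reference, a robots.txt that blocked all crawlers from the whole site would look like this (purely illustrative; I haven't checked what's actually on SuttaCentral's server):

```
User-agent: *
Disallow: /
```

If something like that were present, neither Googlebot nor Bingbot would index anything, so a partial indexing failure suggests the cause is elsewhere.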
I do get suttacentral.net/mn1/vn/ from searching
site:https://suttacentral.net/ mn 1
but it is a long way down the list…
A search for Bhante @Sujato’s text, such as:
site:https://suttacentral.net/ Mendicants, I shall teach you the great discourse on the six sense fields.
does not find MN149, and searching for Bhikkhu Bodhi’s text:
site:https://suttacentral.net/ Bhikkhus, I shall teach you a discourse on the great sixfold base. Listen and attend closely to what I shall say.
goes to legacy.suttacentral.net/en/mn149
I didn’t try Bing!
But I think something needs to be done, as not having it indexed in Google is potentially stopping people from finding SuttaCentral!
Maybe @blake can help??
I agree, but my point was that if the Bing robot is scanning the site, it seems unlikely that the Google robot is being made deliberately unwelcome. Perhaps the new technology being used on the site is confusing the Google robot…
There is something there, since:
site:suttacentral.net mn 5
does give a link (a little way down the page, at least for me):
site:suttacentral.net mn 5 - Google Search
However, there are many more hits on the legacy site…
Can we do anything about it? Maybe create an index behind the scenes so the bot can make the right links and associations with keywords and text strings?
You can use an “advanced” sitemap in xml to tell Google how important certain pages or sections are and how frequently the Googlebot should crawl. Of course, these are only suggestions to the Googlebot — it does what it wants!
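As a sketch, a single entry in such an XML sitemap might look like this (the URL, change frequency, and priority values are just examples, not recommendations):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://suttacentral.net/mn1</loc>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Googlebot treats `changefreq` and `priority` as hints only, as noted above.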
The way a web crawler like Googlebot works is that it basically follows hyperlinks, both internal and external. Indexing and ranking within SERPs (Search Engine Results Pages) still work mostly on the quantity and, these days, the quality of links to a page. Even though the system is much more sophisticated now than when Larry Page wrote PageRank, links are still probably the number one factor in ranking.
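Just to make the "links drive ranking" idea concrete, here is a toy power-iteration PageRank in Python. This is a classroom-style sketch of the original algorithm, not anything resembling what Google actually runs today:

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank. links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # A page shares its rank equally among pages it links to.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# "c" receives links from both "a" and "b", so it ends up ranked highest.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

The point is simply that pages with more (good) inbound links accumulate more rank, which is why the legacy site, with its years of accumulated external links, keeps outranking the new one.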
Yes, I also used to have an issue with D&D pages clogging up returned results, but you can add a negative search to filter those out (also for legacy pages):
site:suttacentral.net -site:*.suttacentral.net (to filter out everything but results from the main site)
Applying this to the above MN149 example, I do get three results (Italian, Portuguese & Russian), but it is still proper weird that the main sutta page isn’t returned. I’ll try to follow up.
Well, looks like something in the interaction between our servers, Cloudflare, and Googlebot broke down and pages stopped being indexed a couple of weeks ago. I managed to coerce it into working again so they should be getting indexed again.
At the moment we don’t have a sitemap, we’d probably want to figure out what should be important URLs in the vain hope that the GoogleBot will prioritize them.
Thanks for your reply and update.
How could one help with putting together a sitemap?
Google will accept a simple sitemap of just URLs listed in a text file, one per line (saved as sitemap.txt in the document root). I have confidence @blake could generate this programmatically.
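Generating that text-file sitemap could be as simple as something like this (the sutta IDs and URL scheme here are just illustrative assumptions, not SuttaCentral's actual data model):

```python
BASE = "https://suttacentral.net"

def build_sitemap(uids):
    """Return sitemap.txt content: one absolute URL per line."""
    return "\n".join(f"{BASE}/{uid}" for uid in uids) + "\n"

# Hypothetical example IDs; the real list would come from the site's database.
content = build_sitemap(["mn1", "mn149", "dn16"])
```

Writing `content` out to `sitemap.txt` in the document root would be all Google needs for the plain-text format.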
Not sure if SC already has Google Search Console (previously known as Webmaster Tools) set up, but that should help with identifying crawl errors.
This might be good to do as well:
A big plus is that you get put into a list of domains included in Chrome.
Thanks for the feedback here, just so you know, Blake is working on a sitemap, and hopefully our Google situation will improve.
Hi bhante @sujato, @blake, has there been any update on this since September 2018?
We haven’t really focused on this. What has changed, though, is that Google’s bot has finally been updated; it was previously based on an old Chrome version that didn’t play nice with advanced sites like SC. So that should be better now.