I recently tried to search for something within SuttaCentral using Google's "site:" operator and only managed to find material from this forum.
As a result, I'm asking: is SuttaCentral.net indexed in Google's databases?
Is it the case that SuttaCentral hasn't yet been indexed by Google, or is this the outcome of a specific, deliberate setting?
Hmm. You used to be able to at least search for sutta names, but something seems to have happened to that. Perhaps there is a rule in the robots.txt file that is preventing the site from being indexed.
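For reference, a robots.txt that blocked all crawlers from the whole site would look like this (purely illustrative; I haven't checked what's actually on SuttaCentral's server):

```
User-agent: *
Disallow: /
```

If something like that were present, neither Googlebot nor Bingbot would index anything, so a partial indexing failure suggests the cause is elsewhere.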
I do get suttacentral.net/mn1/vn/ from searching
site:https://suttacentral.net/ mn 1
but it is a long way down the list…
A search for Bhante @Sujato’s text, such as:
site:https://suttacentral.net/ Mendicants, I shall teach you the great discourse on the six sense fields.
does not find MN149, and searching for Bhikkhu Bodhi’s text:
site:https://suttacentral.net/ Bhikkhus, I shall teach you a discourse on the great sixfold base. Listen and attend closely to what I shall say.
goes to legacy.suttacentral.net/en/mn149
I didn’t try Bing!
But I think something needs to be done, as not having it indexed in Google is potentially stopping people from finding SuttaCentral!
Maybe @blake can help??
I agree, but my point was that if the Bing robot is scanning the site, it seems unlikely that the Google robot is being made deliberately unwelcome. Perhaps the new technology being used on the site is confusing the Google robot…
There is something there, since:
site:suttacentral.net mn 5
does give a link (a little way down the page, at least for me):
site:suttacentral.net mn 5 - Google Search
However, there are many more hits on the legacy site…
Can we do anything about it? Maybe create an index behind the scenes so the bot can make the right links and associations with keywords and text strings?
You can use an “advanced” sitemap in xml to tell Google how important certain pages or sections are and how frequently the Googlebot should crawl. Of course, these are only suggestions to the Googlebot — it does what it wants!
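As a sketch, a single entry in such an XML sitemap might look like this (the URL, change frequency, and priority values are just examples, not recommendations):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://suttacentral.net/mn1</loc>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Googlebot treats `changefreq` and `priority` as hints only, as noted above.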
The way a web crawler like Googlebot works is that it basically follows hyperlinks, both internal and external. Indexing and ranking within SERPs (Search Engine Results Pages) still work mostly on the quantity and, these days, the quality of links to a page. Even though the system is much more sophisticated now than when Larry Page wrote PageRank, links are still probably the number one factor in ranking.
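Just to make the "links drive ranking" idea concrete, here is a toy power-iteration PageRank in Python. This is a classroom-style sketch of the original algorithm, not anything resembling what Google actually runs today:

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank. links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # A page shares its rank equally among pages it links to.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# "c" receives links from both "a" and "b", so it ends up ranked highest.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

The point is simply that pages with more (good) inbound links accumulate more rank, which is why the legacy site, with its years of accumulated external links, keeps outranking the new one.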
Yes, I also used to have an issue with D&D pages clogging up returned results, but you can add a negative search to filter those out (also for legacy pages):
site:suttacentral.net -site:*.suttacentral.net (to filter out everything but results from the main site)
Applying this to the above MN149 example, I do get three results (Italian, Portuguese & Russian), but it is still proper weird that the main sutta page isn’t returned. I’ll try to follow up.
Well, looks like something in the interaction between our servers, Cloudflare, and Googlebot broke down and pages stopped being indexed a couple of weeks ago. I managed to coerce it into working again so they should be getting indexed again.
At the moment we don’t have a sitemap, we’d probably want to figure out what should be important URLs in the vain hope that the GoogleBot will prioritize them.
Thanks for your reply and update.
How could one help with putting together a sitemap?
Google will accept a simple sitemap of just URLs listed in a text file, one per line (saved as sitemap.txt in the document root). I have confidence @blake could generate this programmatically.
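Generating that text-file sitemap could be as simple as something like this (the sutta IDs and URL scheme here are just illustrative assumptions, not SuttaCentral's actual data model):

```python
BASE = "https://suttacentral.net"

def build_sitemap(uids):
    """Return sitemap.txt content: one absolute URL per line."""
    return "\n".join(f"{BASE}/{uid}" for uid in uids) + "\n"

# Hypothetical example IDs; the real list would come from the site's database.
content = build_sitemap(["mn1", "mn149", "dn16"])
```

Writing `content` out to `sitemap.txt` in the document root would be all Google needs for the plain-text format.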
Not sure if SC already has Google Search Console (previously known as Webmaster Tools) set up, but that should help with identifying crawl errors.
This might be good to do as well:
A big plus is that you get put into a list of domains included in Chrome.
Thanks for the feedback here, just so you know, Blake is working on a sitemap, and hopefully our Google situation will improve.
Hi bhante @sujato, @blake, has there been any update on this since September 2018?
We haven’t really focused on this. What has changed, though, is that Google’s bot has finally been updated; it was previously based on an old Chrome version that didn’t play nice with advanced sites like SC. So that should be better now.