nikjohn nikjohn - 1 month ago 5
HTML Question

Site appearing on Google SERP in spite of proper robots.txt configuration

I have an ExpressJS web application, used for internal purposes, that I don't want Google to index. So I have implemented the following route:

// Serve a robots.txt that asks all crawlers to stay away from every path
app.get('/robots.txt', function (req, res) {
  res.set('Content-Type', 'text/plain');
  res.send('User-agent: *\nDisallow: /');
});


I verified that this works by requesting the URL and checking the response, which is:

User-agent: *
Disallow: /


In spite of this, I can see my page in Google's results when I search for the site title. The app has been online for a year or so now, so these can't be stale cached results. Is there any other possible reason why this is happening? Are there any ways to troubleshoot it?

Answer

http://webmasters.stackexchange.com/questions/54879/does-google-ignore-robots-txt

Google will still see sites blocked by robots.txt, and may even list them in search results.

This is especially the case when entire domains/subdomains are blocked. Google will list links to these along with the text "A description for this result is not available because of this site's robots.txt – learn more", with a link to https://support.google.com/webmasters/answer/156449 .

Add a <meta name="robots" content="noindex, nofollow"> tag to your pages' output.
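
One way to add the tag from an Express app without editing every template is to inject it into the HTML just before it is sent. The sketch below is only an illustration: addNoIndexMeta is a hypothetical helper, not part of Express, and it assumes the page contains an ordinary opening <head> tag (if it doesn't, the HTML is returned unchanged).

```javascript
// Hypothetical helper (not part of Express): inserts a robots meta tag
// right after the opening <head> tag of an HTML string.
const NOINDEX_META = '<meta name="robots" content="noindex, nofollow">';

function addNoIndexMeta(html) {
  // Leave the page alone if it already carries a robots meta tag.
  if (/<meta\s+name=["']robots["']/i.test(html)) {
    return html;
  }
  // Match <head> or <head ...attributes>, but not <header>, case-insensitively.
  return html.replace(/<head(\s[^>]*)?>/i, (openTag) => openTag + NOINDEX_META);
}

// Possible use in a route (assumes `app` is the Express app from the question
// and `pageHtml` is the HTML string you were going to send anyway):
// app.get('/', (req, res) => res.send(addNoIndexMeta(pageHtml)));
```

If the pages come from a template engine, putting the tag in the shared layout is simpler; the helper is mainly useful when the HTML is built as a string.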

EDIT: From the discussion in the comments:

If you allow a page with robots.txt but block it from being indexed using a meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it.

So, to prevent Google from crawling your site: use Disallow in robots.txt; no meta tags are needed.
If there are external links pointing to your site: allow crawling in robots.txt, and use noindex, nofollow on the pages that appear in Google.
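
For the second case, the two pieces could be wired up roughly as follows. This is a sketch rather than the answerer's own code: instead of editing each page it uses the X-Robots-Tag response header, which Google documents as the HTTP equivalent of the robots meta tag, and the names robotsTxtBody and noIndexHeader are made up for illustration.

```javascript
// Sketch only: robotsTxtBody and noIndexHeader are illustrative names.

// robots.txt that lets crawlers fetch every page, so they can actually see
// the noindex instruction. An empty Disallow value blocks nothing.
function robotsTxtBody() {
  return 'User-agent: *\nDisallow:\n';
}

// Express-style middleware: asks crawlers not to index the response or
// follow its links, via the X-Robots-Tag header instead of a meta tag.
function noIndexHeader(req, res, next) {
  res.setHeader('X-Robots-Tag', 'noindex, nofollow');
  next();
}

// Wiring (assumes `app` is the Express app from the question):
// app.use(noIndexHeader);
// app.get('/robots.txt', (req, res) => res.type('text/plain').send(robotsTxtBody()));
```

Note that this replaces the Disallow: / rule from the question: if crawling stays blocked, the crawler never fetches the page and so never sees the header.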

How to easily see which pages Google has indexed for your site:

Use site:stackoverflow.com as the search query, and Google will list essentially all the pages of that website it has indexed.
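
If you want to check several domains, the query can be turned into a link. A minimal sketch, where siteQueryUrl is an illustrative helper name and the URL is just the ordinary Google search URL with the query encoded:

```javascript
// Illustrative helper: builds a Google search URL for a site: query.
function siteQueryUrl(domain) {
  return 'https://www.google.com/search?q=' + encodeURIComponent('site:' + domain);
}

console.log(siteQueryUrl('stackoverflow.com'));
// https://www.google.com/search?q=site%3Astackoverflow.com
```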

To learn more about how Google crawls your pages: https://support.google.com/webmasters/topic/4617736?hl=en&ref_topic=4589290

Also, remember that Google isn't the only search engine. There are Bing, Yahoo, Baidu and a plethora of other search engines, and not all of them respect meta tags or robots.txt; some even pretend to be another search engine so that their crawler doesn't get blocked.