nikjohn nikjohn - 1 year ago 70
HTML Question

Site appearing on Google SERP in spite of proper robots.txt configuration

I have an ExpressJS web application, used for internal purposes, that I don't want Google to index. So I have implemented the following route:

app.get('/robots.txt', function(req, res) {
  // Serve a plain-text robots.txt that disallows all crawlers
  res.set('Content-Type', 'text/plain');
  res.send('User-agent: *\nDisallow: /');
});


I verified that this was working by hitting the URL and checking the response, which is:

User-agent: *
Disallow: /


In spite of this, I can see my page in Google's results when I search for the site title. The app has been online for a year or so now, so these can't be cached results. Is there any other possible reason why this is happening? Are there any methods to troubleshoot?

Answer Source

http://webmasters.stackexchange.com/questions/54879/does-google-ignore-robots-txt

Google will still see sites blocked by robots.txt, and may even list them in search results.

This is especially the case when entire domains/subdomains are blocked. Google will list links to these along with the text "A description for this result is not available because of this site's robots.txt – learn more", with a link to https://support.google.com/webmasters/answer/156449 .

Add a <meta name="robots" content="noindex, nofollow"> tag to your pages' output.
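As a rough illustration of adding that tag to a page's output, here is a minimal sketch that injects it into an HTML string. The helper name addNoindexMeta is hypothetical and not part of Express or any library. In a real app you would more likely put the tag straight into your layout template.

```javascript
// The tag suggested in the answer above.
const ROBOTS_META = '<meta name="robots" content="noindex, nofollow">';

// Hypothetical helper: inserts the robots meta tag right after the opening <head> tag.
// If the markup has no <head> tag, the HTML is returned unchanged.
function addNoindexMeta(html) {
  // Leave the markup alone if a robots meta tag is already present.
  if (/<meta\s+name=["']robots["']/i.test(html)) {
    return html;
  }
  // Match <head> with or without attributes (but not <header>).
  return html.replace(/<head(\s[^>]*)?>/i, (openTag) => openTag + ROBOTS_META);
}
```

The tag only has an effect if the crawler is allowed to fetch the page, since it has to read the HTML to see it.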

EDIT: From the discussion in the comments:

If you allow a page with robots.txt but block it from being indexed using a meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it.

So, to prevent Google from crawling your site: use Disallow in robots.txt; no meta tags are needed.
If there are external links pointing to your site: allow crawling in robots.txt, and use noindex, nofollow on those pages that appear in Google.
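The second case could look roughly like the sketch below. It is written as plain functions in the (req, res, next) shape that Express handlers take. Two things here are assumptions rather than part of the answer: the function names are hypothetical, and the noindex directive is sent as an X-Robots-Tag response header, which Google documents as an alternative to the meta tag.

```javascript
// Serves a robots.txt that allows crawling (an empty Disallow blocks nothing),
// so that Googlebot can actually fetch pages and see the noindex directive.
function robotsTxtAllowAll(req, res) {
  res.set('Content-Type', 'text/plain');
  res.send('User-agent: *\nDisallow:');
}

// Assumption: send noindex, nofollow as an HTTP header on every response
// instead of (or in addition to) the meta tag in the HTML.
function noIndexHeader(req, res, next) {
  res.set('X-Robots-Tag', 'noindex, nofollow');
  next();
}

// Hypothetical wiring, given an existing Express `app`:
// app.use(noIndexHeader);
// app.get('/robots.txt', robotsTxtAllowAll);
```

Registering the header middleware before the routes means it applies to every response, including non-HTML ones where a meta tag is not an option.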

How to easily see which of your pages Google has indexed:

Use site:stackoverflow.com as the search query, and Google will list basically all pages of that website it has indexed.

To learn more about how Google crawls your pages: https://support.google.com/webmasters/topic/4617736?hl=en&ref_topic=4589290

Also, remember that Google isn't the only search engine. There's Bing, Yahoo, Baidu and a plethora of other search engines, and not all of them play nice with meta tags or robots.txt. Some even pretend to be another search engine so their crawl doesn't get blocked.
