SEO: Tell Google Which Pages Not to Crawl


The typical goal of search engine optimization is to have your site’s pages appear on a Google results page in answer to a query. The aim is for Google and every other search engine to crawl and index your product detail pages, blog posts, articles, and anything else that drives conversions.

But there are pages that should not be included in search results. Removing them from Google’s index might actually increase search engine traffic to more important, better-converting pages.

Don’t Index These

Do you really care if your privacy policy, GDPR disclosures, or similar pages show up on Google? Pages you likely don’t want Google to index include:

  • Thank-you pages (displayed after a survey or similar)
  • Ad landing pages (meant for pay-per-click campaigns)
  • Internal site search results (jumping from Google’s results page straight into your website’s own search results can be a poor user experience)

Not every page on your company’s website should be indexed by Google. Photo: Campaign Creators.

Removing Pages

Getting these sorts of pages out of Google’s index could also improve your website’s authority, which in turn might improve how well its pages rank on Google for relevant queries.

Some SEO practitioners argue that Google has become adept at identifying content quality and is on the lookout, so to speak, for redundant, duplicate, or relatively low-quality pages.

What’s more, some SEO professionals have suggested that Google averages the relative value of all of the pages on your website to create an aggregate authority or value score. This might be domain authority, domain rank, or a similar metric.

If your company has stuffed Google’s index with relatively low-value pages — such as the privacy policy your tech guy copied and pasted from your ecommerce platform provider — it could affect how authoritative Google believes your site is as a whole.


For example, Chris Hickey of Inflow, an ecommerce agency in Denver, Colorado, wrote about removing pages outright (deleting them, in that case): after culling thousands of duplicate pages from a client’s ecommerce website, the client saw a 22 percent increase in organic search traffic and a 7 percent increase in revenue from organic search.

Similarly, in 2017 the SEO tool maker Moz removed 75 percent of the pages on its website from Google’s index. The pages were primarily low-value member profiles from the Moz community with little unique content. Removing them produced a 13.7 percent increase in year-over-year organic search traffic.

Removal Tool

Perhaps the best tool for removing an individual page from Google’s index is the robots noindex meta tag.

<meta name="robots" content="noindex" />

Inserted in the <head> section of a page’s HTML markup, this simple tag asks all search engines not to index the associated page. Google’s primary web crawler, Googlebot, follows this directive and will drop any page marked with noindex the next time it crawls that page.

With most content management systems, it should be relatively easy to add this tag to policy pages, internal search results, and other pages that don’t need to appear in Google’s index or in response to a Google query.
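As an illustration, here is a minimal sketch of a thank-you page with the tag in place (the page content is hypothetical; only the meta tag in the <head> matters):

<!DOCTYPE html>
<html>
<head>
  <title>Thanks for your order</title>
  <!-- Ask all search engines not to index this page -->
  <meta name="robots" content="noindex" />
</head>
<body>
  <p>Thank you. Your order is on its way.</p>
</body>
</html>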

HTTP Response Header

The robots noindex directive may also be passed in an HTTP response header. Think of the HTTP response header as a text message your server sends to a web browser or web crawler (such as Googlebot) when it requests a page.


Within this header, your site can tell Google not to index the page. Here is an example.

HTTP/1.1 200 OK
X-Robots-Tag: noindex

For some businesses, it may be easier to write a script that adds this X-Robots-Tag to responses than to insert the robots meta tag into each page, manually or programmatically. The header and the meta tag have the same effect; which one your business uses is a matter of preference.
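As a sketch, assuming an Apache web server with the mod_headers module enabled, a configuration rule like this would add the header to every PDF response (the PDF pattern is only an illustration; adjust it to match the pages you want excluded):

<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex"
</Files>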

Prevent Indexing?

Robots.txt does not prevent indexing. A robots.txt file, located in a website’s root directory, is a simple text file that tells a search engine web crawler which pages on the site it may access.

Often, website owners and managers mistakenly think that disallowing a page in a robots.txt file will prevent that page from showing up in Google’s index. But that is not always the case.

For example, if another site links to a page on your company’s website, Google can discover that link and index the page’s URL even when a robots.txt rule prevents Googlebot from crawling the page itself.

If you want to remove pages from Google’s index, the robots.txt file is probably not the best choice. Rather, it is useful for limiting what Google crawls and for preventing search engine bots from overwhelming your company’s web server.
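For reference, a minimal robots.txt that keeps well-behaved crawlers out of an internal search directory might look like this (the /search/ path is a hypothetical example):

User-agent: *
Disallow: /search/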

It is important to mention that you should not disallow a page in a robots.txt file and use a noindex tag on it at the same time. Googlebot must crawl a page to see the noindex directive; if robots.txt blocks the crawl, the directive goes unread.
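For instance, this combination (using a hypothetical /thank-you/ path) is self-defeating, because the robots.txt rule stops Googlebot from ever fetching the page that carries the tag:

# robots.txt: blocks crawling of the page below
User-agent: *
Disallow: /thank-you/

<!-- On the /thank-you/ page: never seen, because the crawl is blocked -->
<meta name="robots" content="noindex" />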

It may sound counterintuitive, but there are almost certainly pages on your company’s website that should not be included in Google’s index or displayed on a Google results page. The best way to remove those pages is with a robots noindex tag.


