The Hostway Blog

Hands Off, Google!

If you have a Web site that you’ve worked hard to get ranked well with the search engines, you may be wondering why on earth you’d want to prevent any part of it from getting indexed. However, there are several types of pages you might want to share only with certain visitors. These might include:

  • Duplicate content: For example, print-friendly or downloadable PDF versions of your HTML pages
  • Error message pages
  • Thank you and confirmation pages
  • Special landing pages: For example, pages you have designed specifically for PPC or email advertising campaigns

There are several ways to tell the search engines to ignore specific pages on your Web site, but the most common and easiest are meta tags and the robots.txt file.

The NoIndex Meta Tag

Probably the simplest way to exclude just a few pages is to add a robots meta tag to the <head> section of each page you want the search engines to ignore. The following tag tells all search engine robots not to index the page it appears on:

<meta name="robots" content="noindex" />

The search engines will still read these pages, but will not index them.

You can also name individual bots to exclude or include. For example:

<meta name="googlebot" content="noindex" /> tells only Google’s robot to ignore the page, while

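<meta name="robots" content="noindex" /> <meta name="googlebot" content="index" /> is meant to exclude all robots except Google’s. Search engines can differ in how they resolve conflicting tags, and Google’s documentation says the more restrictive rule applies, so verify the result before relying on this combination.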

This tag works only in HTML pages, so you cannot use it to block .pdf, .doc or other non-HTML files.
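If you want to audit which of your pages carry the tag, a short script can help. The following is a minimal sketch in Python using only the standard library. The function name noindex_targets is hypothetical, and the script only reports which meta names carry a noindex directive; it does not model how any particular search engine resolves conflicting tags.

```python
from html.parser import HTMLParser


class _RobotsMetaCollector(HTMLParser):
    """Collects the directives of every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lowercased directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if not name:
            return
        # A content attribute may hold several comma-separated directives.
        parts = attr.get("content", "").split(",")
        values = {part.strip().lower() for part in parts}
        self.directives.setdefault(name, set()).update(values - {""})


def noindex_targets(html):
    """Return the meta names (e.g. 'robots', 'googlebot') that carry noindex."""
    parser = _RobotsMetaCollector()
    parser.feed(html)
    parser.close()
    return {name for name, values in parser.directives.items()
            if "noindex" in values}
```

Feeding it the HTML of a page that contains only the googlebot tag shown above would return a set with the single entry "googlebot", while a page with no robots meta tags returns an empty set.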

The Robots.txt File

The robots.txt file is a plain text file that lives in a site’s root directory and tells the search engines which pages and directories are off limits. Most search engine robots automatically look for this file. As with the meta tag, you can specify which robots each rule applies to.

To tell all robots to stay away from directories named “promo” and “print”, your robots.txt file would look like this:

User-agent: *
Disallow: /promo/
Disallow: /print/

Remember the trailing slash. Without it, the bots will treat each rule as a prefix and ignore any path beginning with /promo or /print, such as /promotions.html or /printers.html, which may not be your intent.
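You can check this prefix behavior before publishing your file. The sketch below uses Python's standard urllib.robotparser module to compare rules with and without the trailing slash, and to show a rule aimed at a single robot. The helper is_allowed and the robot names ExampleBot and OtherBot are made up for illustration, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The rules from the article, with and without the trailing slash.
WITH_SLASH = "User-agent: *\nDisallow: /promo/\nDisallow: /print/"
WITHOUT_SLASH = "User-agent: *\nDisallow: /promo\nDisallow: /print"

# A rule that applies to one (hypothetical) robot only.
ONLY_EXAMPLEBOT = "User-agent: ExampleBot\nDisallow: /promo/"

for path in ("/promo/spring.html", "/promotions.html", "/printers.html"):
    print(path,
          "with slash:", is_allowed(WITH_SLASH, path),
          "without slash:", is_allowed(WITHOUT_SLASH, path))

print("OtherBot may fetch /promo/spring.html:",
      is_allowed(ONLY_EXAMPLEBOT, "/promo/spring.html", agent="OtherBot"))
```

With the trailing slash, only paths inside the two directories are blocked. Without it, /promotions.html and /printers.html are blocked as well.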

While you can specify individual files, rather than directories, this can quickly become cluttered and unmanageable. It’s much easier to organize your site in such a way that any files you want to exclude from indexing reside in only a few directories.

This method is not foolproof: compliance with robots.txt is voluntary, and some robots ignore it.

Other ways to block search engine indexing include password protection (search engines cannot access content protected by a password) and the X-Robots-Tag HTTP header, which can block non-HTML content. You may have also heard that not linking to a page will keep search engines from indexing it. While this makes sense in theory, many Webmasters have found that it rarely works in practice: supposedly orphaned pages have appeared in search engine results within weeks.
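The X-Robots-Tag is sent as an HTTP response header, so it is normally set in your Web server's configuration. As an illustration only, here is a minimal sketch using Python's built-in http.server module. The names NOINDEX_SUFFIXES, robots_header_for and NoIndexHandler are hypothetical, and the list of file types is an assumption you would adjust for your own site.

```python
from http.server import SimpleHTTPRequestHandler
from pathlib import PurePosixPath

# Assumed list of file types to keep out of search indexes.
NOINDEX_SUFFIXES = {".pdf", ".doc"}


def robots_header_for(path):
    """Return the X-Robots-Tag value for a request path, or None."""
    # Drop any query string or fragment before checking the extension.
    clean = path.split("?", 1)[0].split("#", 1)[0]
    suffix = PurePosixPath(clean).suffix.lower()
    return "noindex" if suffix in NOINDEX_SUFFIXES else None


class NoIndexHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory, tagging non-HTML documents."""

    def end_headers(self):
        # self.path may be missing if the request line could not be parsed.
        value = robots_header_for(getattr(self, "path", ""))
        if value:
            self.send_header("X-Robots-Tag", value)
        super().end_headers()

# To try it locally, pass NoIndexHandler to http.server.ThreadingHTTPServer
# and call serve_forever() on the resulting server object.
```

A request for a PDF served by this handler would then include the header X-Robots-Tag: noindex in its response, while HTML pages are left untouched and can rely on the meta tag instead.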

For most small Web sites, using a noindex tag, a robots.txt file or both is sufficient to block indexing of specific pages.