Robots Exclusion Confusion, Part 1

It’s always nice to find a new use for an old, trusted tool instead of learning a new tool from scratch. The “canonical” element, for example, is a welcome addition to the SEM’s tool set, but it will be several months before we start to see how it truly behaves in the wild.

Contrast that with the robots.txt file, an old tool with one of the highest ratios of benefits to learning time. Most sites use it as a machete (and it’s effective that way), but armed with a few facts, you’ll be able to wield it with scalpel-like precision.

This column refers to the way Yahoo, MSN/Live, and Google react to and process robots.txt directives. I isolate these three engines because almost a year ago, they all agreed to honor an expanded Robots Exclusion Protocol beyond what the original protocol defined. I make no guarantees about how other engines will react, so please don’t assume they’ll behave exactly the same as the big three.

Pointing to XML Sitemaps

You already know you can add the location of your XML sitemap to your robots.txt file, so I’ll skip the easy stuff. Many people, however, think you can list only one sitemap URL in the file. This is incorrect; you can list as many sitemaps as you have (up to a thousand, at least), including files that point to video and mobile content.

Valid sitemap files can contain no more than 50,000 URLs, so if you have 250,000 URLs on your site, dividing them up into five different sitemap files is a perfectly logical solution, and you can list each sitemap URL on a separate line in your robots file.
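
As a rough illustration of that arithmetic, here is a minimal Python sketch that works out how many sitemap files a site needs and prints one "Sitemap:" line for each. The domain www.example.com and the sitemap-N.xml file names are hypothetical placeholders, not anything an engine requires.

```python
import math

# Per-file limit cited in the text above.
MAX_URLS_PER_SITEMAP = 50_000


def sitemap_lines(total_urls, base_url="http://www.example.com"):
    """Return one robots.txt 'Sitemap:' line per sitemap file needed.

    base_url and the sitemap-N.xml naming scheme are hypothetical.
    """
    file_count = math.ceil(total_urls / MAX_URLS_PER_SITEMAP)
    return [
        f"Sitemap: {base_url}/sitemap-{n}.xml"
        for n in range(1, file_count + 1)
    ]


lines = sitemap_lines(250_000)
print("\n".join(lines))
```

For 250,000 URLs this prints five lines, sitemap-1.xml through sitemap-5.xml, each of which can be pasted into the robots file on its own line.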



During a relaunch, I recommend using two separate sitemap files, one each for old and new URLs. It doesn’t hurt anything to list URLs that either no longer exist or now redirect to new locations.

Eventually, they may show up in Google Webmaster Tools crawling reports as errors or under the category of “too many redirects,” but that doesn’t hurt anything. After engines process your redirects and index your new URLs, feel free to remove references to sitemap files that contain old URLs.

Testing Robots.txt Directives

One of the recent highlights of Google’s growing list of Webmaster tools is the “Analyze robots.txt” tool, which resides in the Tools section in Google Webmaster Tools’ left navigation. This enables you to experiment with robots.txt content in a safe, sealed environment.

The concept is stunningly simple. In one pane, enter robots.txt directives, as they would appear in a real robots.txt file. (When you first call up the page, the first pane already contains the content of your existing robots.txt file, if your site has one.) In the second pane, insert a test URL and click “Check.” The page then tells you whether your test URL is allowed or disallowed, and it tells you which line in your robots code is responsible for that status.

Google’s robots.txt testing tool can be used for any site, not just the one for which you’re verified. How? In your test URLs, simply substitute the verified site’s domain for the domain you want to test.

For example, if you’re verified through Webmaster Tools for one domain but you want to create a robots.txt file for a second domain, use the test page normally, but in your test URLs, use the verified domain instead of the second one. In other words, add your robots directives in the top pane as you normally would (since robots directives are domain-agnostic). If you want to test your robots directives against a URL on the second domain, simply enter the same path on the verified domain into the “Test URLs against this robots.txt” field. If it does what you want, you’ve written the correct directives for the second domain’s robots.txt file.
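
The domain swap works because directives are compared against the path of a URL, never its host name. The Python sketch below is a deliberately simplified stand-in for that kind of check, not Google’s tool: it handles only plain Disallow prefixes, ignores User-agent groups, Allow lines, and wildcards, and reports which line is responsible. The domains verified.example and other.example and the sample directives are hypothetical.

```python
from urllib.parse import urlsplit


def check_url(robots_text, test_url):
    """Return (allowed, line_number) for test_url.

    Simplified: only plain 'Disallow:' prefixes are considered.
    line_number is None when no rule matched.
    """
    # Only the path is compared, so the host name never matters.
    path = urlsplit(test_url).path or "/"
    for number, line in enumerate(robots_text.splitlines(), start=1):
        field, _, value = line.partition(":")
        value = value.strip()
        if field.strip().lower() == "disallow" and value and path.startswith(value):
            return False, number
    return True, None


# Hypothetical directives, as they would be typed into the top pane.
ROBOTS = """User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

for url in (
    "http://verified.example/private/report.html",
    "http://other.example/private/report.html",
    "http://other.example/public/",
):
    print(url, check_url(ROBOTS, url))
```

The first two URLs differ only in domain and get the same verdict from the same line, which is why testing with the verified domain tells you how the directives will behave on the other one.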

Disallowing URLs that Look Like Directories

I typically recommend a URL structure that avoids file extensions, suggesting a directory-style URL that ends in a slash instead of one that ends in a file name and extension. This gives your site a more visually friendly look in SERPs, and it future-proofs your URLs against redirection if you ever migrate to a different platform.

The drawback with such nomenclature is that when such URLs appear in robots.txt files, engines treat them as directories, not as unique URLs. Consequently, a line like

Disallow: /webmail/

tells engines to exclude not only the /webmail/ URL, but every URL in that directory, such as /webmail/recover-password.asp.

But what if you want engines to index the URL /webmail/ so that employees can search for your Webmail address, but you don’t want any other URLs in that directory to be indexed?

Use the $ sign as a “terminator” character. Placing this symbol at the end of a URL tells bots to view that URL only as a URL, not as a directory. Consequently, the following two lines will ensure that /webmail/ isn’t excluded, but that every URL within that directory is excluded:

Disallow: /webmail/
Allow: /webmail/$

Technically, the $ sign marks the end of the URL: a rule that ends in $ matches only URLs that end with the preceding characters. So in the first line, you’ve disallowed /webmail/ as both a URL and as an entire directory. In the second line, you’ve “added back” /webmail/ as a URL (but not as a directory) by telling engines to allow the indexing of the URL that ends with the characters “/webmail/”. Because rules are matched from the start of the URL path, that means /webmail/ itself and nothing beneath it.
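
To make the interplay of those two lines concrete, here is a minimal Python sketch of the matching logic. It rests on two assumptions that go beyond this column: rules are matched from the start of the path, and when several rules match, the longest pattern wins, with Allow winning a tie. That is how Google describes its precedence; other engines may differ. Wildcards are not handled, and the function names are my own.

```python
def rule_matches(pattern, path):
    """Prefix match, unless the pattern ends in '$' (then the path must end there)."""
    if pattern.endswith("$"):
        # Without wildcards, an end anchor means an exact match.
        return path == pattern[:-1]
    return path.startswith(pattern)


def is_allowed(rules, path):
    """Apply (directive, pattern) rules to a path.

    Assumption: the longest matching pattern wins; Allow wins ties.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            candidate = (len(pattern), directive == "Allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


RULES = [("Disallow", "/webmail/"), ("Allow", "/webmail/$")]

for path in ("/webmail/", "/webmail/recover-password.asp", "/contact/"):
    print(path, "allowed" if is_allowed(RULES, path) else "disallowed")
```

Under those assumptions, /webmail/ comes out allowed, /webmail/recover-password.asp comes out disallowed, and an unrelated path such as /contact/ is untouched by either rule.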

To Be Continued

There’s still a great deal to discuss with the robots.txt file. Next time, we’ll discuss the use of wildcards, commands that take precedence over other commands, and important misconceptions about giving different instructions to different bots.
