Robots Exclusion Confusion, Part 2

March 18, 2009

How to use the wildcard character, avoid sending mixed signals to robots, and more. Second in a two-part series.

In my last column, I discussed several techniques you can use in your robots.txt file, including broadcasting your site maps' locations, testing your robots file to make sure it will work the way you need it to, and allowing URLs that look like directories while disallowing the content within the directory. Today, I finish up the topic by discussing how to use the wildcard character, how to direct the actions of multiple robots, and how to avoid sending mixed signals to robots.

The Asterisk Wildcard

You've seen the asterisk in the User-agent: line and know it means "all robots." In Allow and Disallow lines it works similarly, symbolizing any character or contiguous group of characters.

Combining the $ and * symbols can be particularly powerful. Suppose you recently migrated from .php to .aspx, and you find yourself with some stray .php files clogging the indices. Here's a sample directive followed by an explanation of its components:

    Disallow: /*php$
  • / means that the string we're disallowing starts at the first slash following the top-level domain (.com, .net, and so on). Begin all your allow and disallow lines this way.

  • * stands for any number of letters, numbers, or other characters (including slashes, which indicate additional directories).

  • php means that, following whatever character or characters the asterisk stands for, the URL string will contain the characters "php."

  • $ further narrows the preceding step, dictating that only URLs ending in "php" will be affected by this particular disallow line.

A URL like /tags/php/oct-2008/, then, isn't affected by the disallow line, because it doesn't end with "php".
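The matching logic described above can be sketched in Python. This is a hypothetical helper for illustration only, not how any search engine actually implements it: a `*` in the pattern becomes "any run of characters" and a trailing `$` anchors the pattern to the end of the URL path.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt Allow/Disallow pattern into a regex.

    '*' matches any run of characters (including slashes, and including
    no characters at all); a trailing '$' anchors the match to the end
    of the URL path. Matching always starts at the leading slash.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.compile(regex)

rule = robots_pattern_to_regex("/*php$")
print(bool(rule.match("/old-catalog.php")))     # True: ends in "php"
print(bool(rule.match("/tags/php/oct-2008/")))  # False: doesn't end in "php"
```

The `re.match` call anchors the pattern at the start of the path, which mirrors the rule that every Allow and Disallow line begins at the first slash.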

But can the asterisk mean no character? Consider this line:

    Disallow: /*products/default.aspx

We know it will exclude a URL such as /2004-products/default.aspx. But what will happen to the URL /products/default.aspx? That URL will also be excluded: in addition to matching any character or characters, the asterisk can also match no characters at all.
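A quick, self-contained sketch (again a hypothetical illustration, not an engine's actual code) shows the zero-character case: translating `*` to the regex `.*`, which also matches an empty string, covers both URLs.

```python
import re

def covers(pattern, path):
    """Hypothetical check: does a robots.txt pattern cover this URL path?

    '*' becomes '.*', which also matches zero characters, so the
    pattern '/*products/default.aspx' covers '/products/default.aspx'
    as well as '/2004-products/default.aspx'.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

print(covers("/*products/default.aspx", "/2004-products/default.aspx"))  # True
print(covers("/*products/default.aspx", "/products/default.aspx"))       # True
```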

Allow Trumps Disallow

If you send mixed messages to robots within the same robots.txt file, which message do they honor? In other words, suppose your robots.txt file lists the following code:

    User-agent: *
    Disallow: /sales-secrets.php
    Disallow: /webmail/
    Allow: /sales-secrets.php
    Allow: /webmail/faq.php

The second and fourth lines contradict one another: line two disallows a file that line four allows. So which line will engines honor? Google will allow the file, based on both real-world experience and the Check robots.txt tool within Webmaster Tools. MSN/Live and Yahoo should respond similarly because they both adhere to the extended Robots Exclusion Protocol, although I recommend you verify this. It makes no difference whether the allow or disallow line comes first in the file; allow trumps disallow.
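The precedence described here can be sketched as a small, hypothetical evaluator: if any Allow line matches the URL, it wins over any matching Disallow, regardless of file order. (Engines' real implementations differ in their details; treat this purely as an illustration of the rule above, handling only simple prefix patterns.)

```python
def is_allowed(path, rules):
    """rules: list of ("Allow" | "Disallow", pattern) tuples in file order.

    Per the behavior described above, a matching Allow overrides a
    matching Disallow no matter which line comes first. Only plain
    prefix patterns (no wildcards) are handled in this sketch.
    """
    allowed = disallowed = False
    for directive, pattern in rules:
        if path.startswith(pattern):
            if directive == "Allow":
                allowed = True
            else:
                disallowed = True
    if allowed:
        return True          # allow trumps disallow
    return not disallowed    # no matching rule means crawlable

rules = [
    ("Disallow", "/sales-secrets.php"),
    ("Disallow", "/webmail/"),
    ("Allow", "/sales-secrets.php"),
    ("Allow", "/webmail/faq.php"),
]
print(is_allowed("/sales-secrets.php", rules))   # True: Allow wins
print(is_allowed("/webmail/inbox.php", rules))   # False: only Disallow matches
print(is_allowed("/webmail/faq.php", rules))     # True: Allow wins
```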

Rock, Paper, Scissors

Who wins when robots meta tags, robots.txt files, and XML files contradict each other about inclusion? Here are some guidelines:

  • If a URL is disallowed by your robots.txt file but it's allowed by a robots meta tag or included in an XML site map, the robots.txt file will take precedence.

  • If your robots.txt file allows a URL but it has a robots "noindex" meta tag, the meta tag will take precedence.

Directing Specific Robots

It's possible to give specific allow and disallow instructions to specific robots. Remember, once you've addressed a specific robot, that robot is no longer bound to global directives. For example, suppose your robots.txt file has the following code:

    User-agent: *
    Disallow: /webmail/
    Disallow: /pdf/
    User-agent: Googlebot
    Disallow: /files/printer-friendly/

I've seen the same mistake many times: people think that the preceding code lines tell Google to disallow the /webmail/, /pdf/, and /files/printer-friendly/ directories. This isn't the case. Because the code has a section dedicated to Googlebot, the bot will adhere only to the specific directions given to it within its specific section. Consequently, Google will crawl /webmail/ and /pdf/ since it hasn't been specifically instructed not to. To get Google to exclude all three directories, you would need the following code:

    User-agent: *
    Disallow: /webmail/
    Disallow: /pdf/
    User-agent: Googlebot
    Disallow: /files/printer-friendly/
    Disallow: /webmail/
    Disallow: /pdf/
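The group-selection behavior described above can be sketched as follows (a simplified, hypothetical parser, not any engine's actual logic): a robot first looks for a section addressed to it by name and only falls back to the `*` section when no named section exists. The two sections are never merged.

```python
def rules_for(user_agent, sections):
    """sections: dict mapping a User-agent value to its list of
    Disallow patterns.

    A section naming the robot completely replaces the '*' section
    for that robot; global and specific directives are not combined.
    """
    if user_agent in sections:
        return sections[user_agent]
    return sections.get("*", [])

sections = {
    "*": ["/webmail/", "/pdf/"],
    "Googlebot": ["/files/printer-friendly/"],
}
print(rules_for("Googlebot", sections))  # only its own section applies
print(rules_for("Slurp", sections))      # falls back to the '*' section
```

This is why the corrected file above repeats /webmail/ and /pdf/ inside the Googlebot section: that's the only way those directories end up in the list Googlebot actually obeys.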


The findings in this column are based on a combination of real-world observation and testing and a lot of time experimenting with Google's robots.txt tools in Webmaster Tools. I hope it's a helpful resource in your quest to deal with duplication, data privacy, and overall site maintenance.

Join ClickZ at Search Engine Strategies New York on March 25. More than one dozen online marketing professionals will discuss the latest issues in the larger universe of digital marketing.



Erik Dafforn

Erik Dafforn is the executive vice president of Intrapromote LLC, an SEO firm headquartered in Cleveland, Ohio. Erik manages SEO campaigns for clients ranging from tiny to enormous and edits Intrapromote's blog, SEO Speedwagon. Prior to joining Intrapromote in 1999, Erik worked as a freelance writer and editor. He also worked in-house as a development editor for Macmillan and IDG Books. Erik has a Bachelor's degree in English from Wabash College. Follow Erik and Intrapromote on Twitter.
