
Spiders and Robots and Crawlers, Oh My!

By Tom Hespos  |  October 25, 2001

The Web is full of creepy-crawlies: spiders, bots, and other slithery things that skew your measurements and statistics. The IAB has a plan to play exterminator -- but is it a sound one?

On Monday, the Interactive Advertising Bureau (IAB) made an interesting move intended to inch the industry closer to a standard for online campaign measurement. Recognizing the problems spiders and robots pose in measurement, the IAB announced that it will work with ABC Interactive (ABCi), a leader in the site-auditing space, to provide a master list of spiders and robots for the benefit of the industry.

Spiders and robots are applications that crawl the Web indexing and retrieving content, usually for the benefit of search engines, information resources, and news organizations. As they travel, they generate quite a bit of traffic that gets counted in site statistics and ad campaign reports. A master list of these spiders would be useful to the industry for filtering purposes. According to an IAB press release, ABCi will create and maintain the master list for the benefit of IAB members and ABCi customers. The list will be updated monthly.
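To picture how such a list would be used, here's a minimal sketch (in Python, with hypothetical IP addresses, user-agent strings, and log format) of a publisher filtering raw server-log requests against a master list before tallying page views. It illustrates the concept, not the actual ABCi implementation.

    # Hypothetical master-list entries: IP addresses and user-agent substrings
    # known to belong to spiders. In practice these would come from the
    # ABCi master list, in whatever format it ships.
    KNOWN_SPIDER_IPS = {"192.0.2.10", "198.51.100.7"}
    KNOWN_SPIDER_AGENT_FRAGMENTS = ("googlebot", "slurp", "crawler", "spider")

    def is_spider(ip, user_agent):
        """Return True if a request matches the master list."""
        ua = user_agent.lower()
        return ip in KNOWN_SPIDER_IPS or any(f in ua for f in KNOWN_SPIDER_AGENT_FRAGMENTS)

    def count_human_pageviews(log_lines):
        """Count only requests that don't match the list.
        Assumes each log line is tab-separated: ip, url, user_agent."""
        views = 0
        for line in log_lines:
            ip, _url, user_agent = line.rstrip("\n").split("\t")
            if not is_spider(ip, user_agent):
                views += 1
        return views

    sample_log = [
        "192.0.2.10\t/news/index.html\tGooglebot/2.1",
        "203.0.113.5\t/news/index.html\tMozilla/4.0 (compatible; MSIE 5.5; Windows 98)",
    ]
    print(count_human_pageviews(sample_log))  # prints 1: the spider hit is discarded

Simple enough in principle. The trouble, as we'll see, is keeping that list accurate.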

As with many of its initiatives, the IAB gets an A+ for intent here but a D at best for execution. The idea is sound; it just needs to be refined.

Let's talk a bit about spiders and robots and how they affect reporting on traffic and campaign stats. The terms "spider," "robot," and "crawler" have been used interchangeably for years to describe applications that gather information from the Internet. These applications can surf the Web, much like you and I do. In their search for information, spiders can artificially inflate traffic statistics. A Web server typically cannot distinguish between information requested by a spider and information requested by a person. Sometimes spiders request ads from a server. Sometimes they'll even follow links from ads, whether they're text links or banners, thus registering ad views and clicks. Obviously, if you're an advertiser, this isn't desirable.
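To see why the server can't tell the difference, consider a toy crawler. This is a hypothetical sketch in Python, not any particular production spider: it fetches a page and then requests every link it finds with ordinary HTTP GETs, which is exactly what a browser does when a person clicks through a site.

    import re
    import urllib.request

    def crawl(url):
        """Fetch a page, then request every absolute link found on it."""
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="ignore")
        for link in re.findall(r'href="(http[^"]+)"', html):
            urllib.request.urlopen(link).close()  # the server logs a hit either way

If one of those links happens to be an ad's click-through URL, the ad server dutifully records a click, even though no human ever saw the ad.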

Just how widespread is spider activity? Consider that a spider can be any application that searches or indexes the Web, from the crawler that indexes pages for search engines like Google to the bot written by a computer science student in a sophomore Perl class. People write and use these applications for a wide range of purposes, and their use is more widespread than most nonprogrammers might think.

You might think that a master list of spiders to assist in filtering their activity is a good idea. It is. But it's more complicated than putting together a list of IP addresses and updating it monthly. Why? The population of spiders on the Web is constantly changing, and their activity often isn't tied to specific IP addresses.

Let's use our sophomore computer science student as an example. Say he's working on a spider to retrieve information from a variety of online news sites. He tests it on one computer lab PC on Wednesday and on another on Thursday. Both machines may end up on the master list. If their activity were filtered from traffic statistics using a database updated monthly, no activity would be registered from either machine for the better part of a month, even if other students used those machines to surf the Web the rest of the time.

It may seem like an obscure case, but when you consider how widespread spiders are, we could be eliminating plenty of legitimate traffic for no good reason. Forget the geeky programmers for a second, and consider that some of the applications used by many recreational Web surfers rely on spidering technology. Ever bookmark a page in Internet Explorer and check the box that says "Make available offline"? Guess what. When you do that, your computer runs a little application that spiders that page and pulls the content onto your hard drive. Spider use might be a bit more widespread than many people think.

Any database that is expected to track known spiders and crawlers must be updated much more frequently than once a month to be useful. The best way to do this is to observe behavior and filter spider activity at the server level. It's relatively easy to write an application that notices a burst of page requests from the same IP address within a short period of time (say, 100 requests for different Web pages within a second) and recognizes it as a spider. A human can't read 100 pages of content in that amount of time. Should that spider's IP address then be added to a master list and be filtered out of every server log in the future? Probably not. Who knows whether that IP address also hosts a Web browser used by a human being? The same spider might show up again in the future with an entirely different IP address. Best to filter activity that is clearly mechanical in nature and leave it at that.
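Here's what that kind of behavioral check might look like, as a minimal Python sketch. The threshold, log format, and function names are all hypothetical; the point is that the decision comes from observed behavior in the current logs, not from a monthly list.

    from collections import defaultdict

    THRESHOLD = 100  # distinct pages per second; more than any person can read

    def find_spider_ips(requests):
        """requests: iterable of (ip, url, unix_timestamp) tuples.
        Returns the set of IPs whose request rate looks mechanical."""
        pages_per_second = defaultdict(set)
        for ip, url, timestamp in requests:
            pages_per_second[(ip, int(timestamp))].add(url)
        return {ip for (ip, _second), urls in pages_per_second.items()
                if len(urls) >= THRESHOLD}

The flagged addresses get filtered out of that day's numbers and then forgotten, which sidesteps the problem of an IP address that hosts a spider today and a human being tomorrow.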

The IAB's idea may seem noble in concept, but it doesn't make sense in practice. I'm glad the organization thought to address the issue. Spider and robot activity is not a subject the average online media planner gives much thought to, but it should be. The IAB deserves thanks for putting it on the agenda and reminding us all that it's a big reason behind inaccurate measurement statistics.

ABOUT THE AUTHOR

Tom Hespos

Tom Hespos heads up the interactive media department at Mezzina Brown & Partners. He has been involved in online media buying since the commercial explosion of the Web and has worked at such firms as Young & Rubicam, K2 Design, NOVO Interactive/Blue Marble ACG, and his own independent consulting practice, Underscore Inc. For more information, please visit the Mezzina Brown Web site. He can be reached at thespos@mezzinabrown.com.
