
Spiders and Robots and Crawlers, Oh My!

By Tom Hespos  |  October 25, 2001

The Web is full of creepy-crawlies: spiders, bots, and other slithery things that skew your measurements and statistics. The IAB has a plan to play exterminator -- but is it a sound one?

On Monday, the Interactive Advertising Bureau (IAB) made an interesting move intended to inch the industry closer to a standard for online campaign measurement. Recognizing the problems spiders and robots pose in measurement, the IAB announced that it will work with ABC Interactive (ABCi), a leader in the site-auditing space, to provide a master list of spiders and robots for the benefit of the industry.

Spiders and robots are applications that crawl the Web, indexing and retrieving content, usually for the benefit of search engines, information resources, and news organizations. As they travel, they generate quite a bit of traffic that gets counted in site statistics and ad campaign reports. A master list of these spiders would be useful to the industry for filtering purposes. According to an IAB press release, ABCi will create and maintain the master list for the benefit of IAB members and ABCi customers. The list will be updated monthly.

As with many of its initiatives, the IAB gets an A+ for intent here, but a D at best for execution. The idea is sound; it just needs to be refined.

Let's talk a bit about spiders and robots and how they affect reporting on traffic and campaign stats. The terms "spider," "robot," and "crawler" have been used interchangeably for years to describe applications that gather information from the Internet. These applications can surf the Web, much like you and I do. In their search for information, spiders can artificially inflate traffic statistics. A Web server typically cannot distinguish between information requested by a spider and information requested by a person. Sometimes spiders request ads from a server. Sometimes they'll even follow links from ads, whether they're text links or banners, thus registering ad views and clicks. Obviously, if you're an advertiser, this isn't desirable.
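To see how little machinery is involved, here's a rough sketch in Python of the kind of script we're talking about (the language, the placeholder URL, and the ten-link limit are mine, purely for illustration). It fetches a page, scrapes out the links, and requests them, and every one of those requests lands in the publisher's server log looking exactly like a human visit:

```python
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value from every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

start_url = "http://www.example.com/"  # placeholder, not a real publisher
html = urllib.request.urlopen(start_url).read().decode("utf-8", "ignore")

collector = LinkCollector()
collector.feed(html)

# Follow a handful of the links we found. If any of them happens to be an ad,
# we've just registered an ad view or a click without a person anywhere in sight.
for link in collector.links[:10]:
    if link.startswith("http"):
        urllib.request.urlopen(link)
```

A dozen lines or so, and the server on the other end has no obvious way to know it wasn't a person doing the clicking.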

Just how widespread is spider activity? Consider that a spider can be any application that searches or indexes the Web, from the crawler that indexes pages for search engines like Google to the bot written by a computer science student in a sophomore Perl class. People write and use these applications for a wide range of purposes, and their use is far more widespread than most nonprogrammers might think.

You might think that a master list of spiders to assist in filtering their activity is a good idea. It is. But it's more complicated than putting together a list of IP addresses and updating it monthly. Why? The population of spiders on the Web is constantly changing, and their activity often isn't tied to specific IP addresses.
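For clarity, here's roughly what filtering against such a master list boils down to. The IP addresses and the log format below are made up for the sake of the sketch; the IAB's announcement doesn't spell out the mechanics:

```python
# Hypothetical master list of known spider addresses, refreshed once a month.
SPIDER_IPS = {
    "192.0.2.10",   # placeholder: a search engine crawler spotted last month
    "192.0.2.44",   # placeholder: a lab machine that ran a bot one afternoon
}

def is_listed_spider(log_line):
    """Assume the client IP is the first field of a standard access-log line."""
    ip = log_line.split(" ", 1)[0]
    return ip in SPIDER_IPS

# Strip listed addresses out of the raw log before reporting on it.
with open("access.log") as raw, open("filtered.log", "w") as clean:
    for line in raw:
        if not is_listed_spider(line):
            clean.write(line)
```

Simple enough, and that simplicity is exactly the problem: everything hinges on the list being right, and a month is a long time on the Web.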

Let's use our sophomore computer science student as an example. Say he's working on a spider to retrieve information from a variety of online news sites. He tests it on one computer lab PC on Wednesday and on another on Thursday. Both machines may end up on the master list. If their activity were filtered from traffic statistics using a database updated monthly, no activity would be registered from either machine for the better part of a month (even if other students used those machines to surf the Web at other times).

It may seem like an obscure case, but when you consider how widespread spiders are, we could be eliminating plenty of legitimate traffic for no good reason. Forget the geeky programmers for a second, and consider that some of the applications used by many recreational Web surfers rely on spidering technology. Ever bookmark a page in Internet Explorer and check the box that says "Make available offline"? Guess what. When you do that, your computer runs a little application that spiders the page and pulls its content onto your hard drive. Spider use might be a bit more widespread than many people think.

Any database that is expected to track known spiders and crawlers must be updated much more frequently than once a month to be useful. The best way to do this is to observe behavior and filter spider activity at the server level. It's relatively easy to write an application that would notice several page requests from the same IP address within a short period of time (e.g., 100 requests for different Web pages within a second) and recognize it as a spider. A human can't read 100 pages of content in that amount of time. Should that spider's IP address then be added to a master list and be filtered out of every server log in the future? Probably not. Who knows whether that IP address also hosts a Web browser used by a human being? The same spider might show up again in the future with an entirely different IP address. Best to filter activity that is clearly mechanical in nature and leave it at that.
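If I were building the filter, it would look less like a lookup table and more like the sketch below, a behavioral check along the lines just described. The threshold, the time window, and the idea of feeding it one request at a time are my assumptions, not anything the IAB or ABCi has specified:

```python
from collections import defaultdict, deque

THRESHOLD = 100   # requests -- far more pages than a person could read...
WINDOW = 1.0      # ...in this many seconds

recent_hits = defaultdict(deque)  # ip -> timestamps of its recent requests

def looks_mechanical(ip, timestamp):
    """Flag an address once it exceeds THRESHOLD requests inside WINDOW seconds."""
    hits = recent_hits[ip]
    hits.append(timestamp)
    # Drop anything older than the window so the check reflects current behavior.
    while hits and timestamp - hits[0] > WINDOW:
        hits.popleft()
    return len(hits) >= THRESHOLD
```

The point is that the flag applies to that burst of activity, not to the address forever: filter those hits out of that report and move on, rather than blacklisting an IP that may belong to a flesh-and-blood reader tomorrow.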

The IAB's idea may seem noble in concept, but it doesn't make sense in practice. I'm glad that it thought to address the issue. Spider and robot activity is not a subject the average online media planner gives much thought to, but it should be. The IAB deserves thanks for putting it on the agenda and reminding us all that it's a big reason behind inaccurate measurement statistics.

ABOUT THE AUTHOR

Tom Hespos

Tom Hespos heads up the interactive media department at Mezzina Brown & Partners. He has been involved in online media buying since the commercial explosion of the Web and has worked at such firms as Young & Rubicam, K2 Design, NOVO Interactive/Blue Marble ACG, and his own independent consulting practice, Underscore Inc. For more information, please visit the Mezzina Brown Web site. He can be reached at thespos@mezzinabrown.com.
