Spiders and Robots and Crawlers, Oh My!

On Monday, the Interactive Advertising Bureau (IAB) made an interesting move intended to inch the industry closer to a standard for online campaign measurement. Recognizing the problems spiders and robots pose in measurement, the IAB announced that it will work with ABC Interactive (ABCi), a leader in the site-auditing space, to provide a master list of spiders and robots for the benefit of the industry.

Spiders and robots are applications that crawl the Web, indexing and retrieving content, usually for the benefit of search engines, information resources, and news organizations. As they travel, they generate quite a bit of traffic that gets counted in traffic statistics and ad campaign reports. A master list of these spiders would be useful to the industry for filtering purposes. According to an IAB press release, ABCi will create and maintain the master list for the benefit of IAB members and ABCi customers. The list will be updated monthly.

As with many of its initiatives, the IAB gets an A+ for intent here, but a D at best for execution. The idea is sound, but it needs to be refined.

Let’s talk a bit about spiders and robots and how they affect reporting on traffic and campaign stats. The terms “spider,” “robot,” and “crawler” have been used interchangeably for years to describe applications that gather information from the Internet. These applications can surf the Web, much like you and I do. In their search for information, spiders can artificially inflate traffic statistics. A Web server typically cannot distinguish between information requested by a spider and information requested by a person. Sometimes spiders request ads from a server. Sometimes they’ll even follow links from ads, whether they’re text links or banners, thus registering ad views and clicks. Obviously, if you’re an advertiser, this isn’t desirable.

Just how widespread is spider activity? Consider that a spider can be any application that searches or indexes the Web, from the crawler that indexes pages for search engines like Google to the bot written by a computer science student in a sophomore Perl class. People write and use these applications for a variety of purposes and range of activities. Their use is more widespread than most nonprogrammers might think.

You might think that a master list of spiders to assist in filtering their activity is a good idea. It is. But it’s more complicated than putting together a list of IP addresses and updating it monthly. Why? The population of spiders on the Web isn’t fixed; new ones appear all the time, and a given spider often isn’t tied to a specific IP address.

Let’s use our sophomore computer science student as an example. Say he’s working on a spider to retrieve information from a variety of online news sites. He tests it on one computer lab PC on Wednesday and on another on Thursday. Both machines may end up on the master list. If their activity were filtered from traffic statistics using a database updated monthly, no activity would be registered from either of those two machines for the better part of a month (even if other students used those machines to surf the Web at other times).

It may seem like an obscure case, but when you consider how widespread spiders are, we could be eliminating plenty of legitimate traffic for no good reason. Forget the geeky programmers for a second, and consider that some of the applications in use by many recreational Web surfers use spidering technology. Ever bookmark a page in Internet Explorer and check the box that says “Make available offline”? Guess what. When you do that, your computer runs a little application that spiders that page and pulls the content onto your hard drive. Spider use might be a bit more widespread than many people think.

Any database that is expected to track known spiders and crawlers must be updated much more frequently than once a month to be useful. The best way to do this is to observe behavior and filter spider activity at the server level. It’s relatively easy to write an application that would notice several page requests from the same IP address within a short period of time (e.g., 100 requests for different Web pages within a second) and recognize it as a spider. A human can’t read 100 pages of content in that amount of time. Should that spider’s IP address then be added to a master list and be filtered out of every server log in the future? Probably not. Who knows whether that IP address also hosts a Web browser used by a human being? The same spider might show up again in the future with an entirely different IP address. Best to filter activity that is clearly mechanical in nature and leave it at that.
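To make the idea concrete, here is a minimal sketch of that kind of behavioral filter. It is an illustration only, not anything the IAB or ABCi has proposed. The function name `split_mechanical`, the `(timestamp, ip, path)` record layout, and the default values (100 distinct pages within one second, borrowed from the example above) are all assumptions made for the sake of the sketch. Note that it sets aside only the requests that fall inside a burst; it never puts the IP address on a permanent list.

```python
from collections import defaultdict, deque


def split_mechanical(requests, window=1.0, threshold=100):
    """Split log records into (human_like, mechanical).

    requests: list of (timestamp_seconds, ip, path), sorted by timestamp.
    A burst is `threshold` or more distinct paths requested by one IP
    within `window` seconds. Only requests inside a burst are flagged;
    later traffic from the same IP is judged on its own behavior.
    All names and defaults here are hypothetical.
    """
    recent = defaultdict(deque)  # ip -> (timestamp, index, path) still in window
    flagged = set()              # indices of requests that belong to a burst

    for i, (ts, ip, path) in enumerate(requests):
        q = recent[ip]
        q.append((ts, i, path))
        # Drop records that have slid out of the time window.
        while ts - q[0][0] > window:
            q.popleft()
        # Count distinct pages requested by this IP inside the window.
        if len({p for _, _, p in q}) >= threshold:
            flagged.update(idx for _, idx, _ in q)

    human_like = [r for i, r in enumerate(requests) if i not in flagged]
    mechanical = [r for i, r in enumerate(requests) if i in flagged]
    return human_like, mechanical


if __name__ == "__main__":
    # A burst of 100 page requests in half a second, then one ordinary
    # request from the same address an hour later.
    log = [(i * 0.005, "10.0.0.1", f"/page/{i}") for i in range(100)]
    log.append((3600.0, "10.0.0.1", "/index.html"))
    human_like, mechanical = split_mechanical(log)
    print(len(human_like), "kept;", len(mechanical), "filtered")
```

In the small example at the bottom, the burst is filtered, but the request that arrives an hour later from the same address is kept, because nothing about it looks mechanical. That is the difference between filtering behavior and blacklisting an address.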

The IAB’s idea may seem noble in concept, but it doesn’t make sense in practice. Still, I’m glad that it thought to address the issue. Spider and robot activity is not a subject the average online media planner gives much thought to, but it should be. The IAB deserves thanks for putting it on the agenda and reminding us all that it’s a big reason behind inaccurate measurement statistics.
