Spiders and Robots and Crawlers, Oh My!

The Web is full of creepy-crawlies: spiders, bots, and other slithery things that skew your measurements and statistics. The IAB has a plan to play exterminator -- but is it a sound one?

On Monday, the Interactive Advertising Bureau (IAB) made an interesting move intended to inch the industry closer to a standard for online campaign measurement. Recognizing the problems spiders and robots pose in measurement, the IAB announced that it will work with ABC Interactive (ABCi), a leader in the site-auditing space, to provide a master list of spiders and robots for the benefit of the industry.

Spiders and robots are applications that crawl the Web indexing and retrieving content, usually for the benefit of search engines, information resources, and news organizations. As they travel, they become responsible for quite a bit of traffic that gets counted in traffic statistics and ad campaign reports. A master list of these spiders would be useful to the industry for filtering purposes. According to an IAB press release, ABCi will create and maintain the master list for the benefit of IAB members and ABCi customers. The list will be updated monthly.

As with many of its initiatives, the IAB gets an A+ for intent here but, at best, a D for execution. The idea is sound, but it needs to be refined.

Let’s talk a bit about spiders and robots and how they affect reporting on traffic and campaign stats. The terms “spider,” “robot,” and “crawler” have been used interchangeably for years to describe applications that gather information from the Internet. These applications can surf the Web, much like you and I do. In their search for information, spiders can artificially inflate traffic statistics. A Web server typically cannot distinguish between information requested by a spider and information requested by a person. Sometimes spiders request ads from a server. Sometimes they’ll even follow links from ads, whether they’re text links or banners, thus registering ad views and clicks. Obviously, if you’re an advertiser, this isn’t desirable.

Just how widespread is spider activity? Consider that a spider can be any application that searches or indexes the Web, from the crawler that indexes pages for search engines like Google to the bot written by a computer science student in a sophomore Perl class. People write and use these applications for a wide variety of purposes and activities, and their use is more widespread than most nonprogrammers might think.

You might think that a master list of spiders to assist in filtering their activity is a good idea. It is. But it’s more complicated than putting together a list of IP addresses and updating it monthly. Why? The population of spiders on the Web is constantly changing, and their activity often isn’t tied to specific IP addresses.
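To see what that approach amounts to in practice, here is a minimal sketch (not the IAB’s or ABCi’s actual tooling) of filtering Web server log lines against a spider list. The file names, the Common Log Format assumption, and the idea that the list contains bare IP addresses are all assumptions for illustration.

```python
# Minimal sketch: drop log lines whose client IP appears on a monthly
# "spider list" file. File names and log format are hypothetical.

def load_spider_ips(path):
    """Read one IP address per line from the master list file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def filter_log(log_path, spider_ips):
    """Yield only the log lines that did not come from a listed IP."""
    with open(log_path) as f:
        for line in f:
            client_ip = line.split(" ", 1)[0]  # Common Log Format: IP comes first
            if client_ip not in spider_ips:
                yield line

# Hypothetical usage:
# spiders = load_spider_ips("spider_list.txt")
# human_traffic = list(filter_log("access.log", spiders))
```

Simple enough, and that simplicity is exactly the problem: whatever isn’t on the list slips through, and whatever is on the list gets thrown away for a month.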

Let’s use our sophomore computer science student as an example. Say he’s working on a spider to retrieve information from a variety of online news sites. He tests it on one computer lab PC on Wednesday and on another on Thursday. Both machines may end up on the master list. If their activity were filtered from traffic statistics using a database updated only monthly, no activity would be registered from either of those two machines for the better part of a month, even when other students used the machines to surf the Web.

It may seem like an obscure case, but when you consider how widespread spiders are, we could be eliminating plenty of legitimate traffic for no good reason. Forget the geeky programmers for a second, and consider that some of the applications used by many recreational Web surfers rely on spidering technology. Ever bookmark a page in Internet Explorer and check the box that says “Make available offline”? Guess what. When you do that, your computer runs a little application that spiders that page and pulls the content onto your hard drive. Spider use might be a bit more widespread than many people think.

Any database that is expected to track known spiders and crawlers must be updated much more frequently than once a month to be useful. The best way to do this is to observe behavior and filter spider activity at the server level. It’s relatively easy to write an application that would notice several page requests from the same IP address within a short period of time (e.g., 100 requests for different Web pages within a second) and recognize it as a spider. A human can’t read 100 pages of content in that amount of time. Should that spider’s IP address then be added to a master list and be filtered out of every server log in the future? Probably not. Who knows whether that IP address also hosts a Web browser used by a human being? The same spider might show up again in the future with an entirely different IP address. Best to filter activity that is clearly mechanical in nature and leave it at that.
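For the curious, here is a minimal sketch of that kind of behavioral test, assuming the server log has already been parsed into (IP address, timestamp) pairs and using the illustrative 100-requests-per-second threshold mentioned above:

```python
# Minimal sketch of rate-based spider detection: flag any IP that makes an
# implausible number of page requests within a one-second window.
# The threshold and the pre-parsed (ip, timestamp) input are assumptions.

from collections import defaultdict, deque

def detect_spiders(requests, max_per_second=100):
    """Return the set of IPs whose request rate exceeds the threshold.

    `requests` is an iterable of (ip, unix_timestamp) tuples sorted by time.
    """
    windows = defaultdict(deque)  # ip -> timestamps seen in the last second
    flagged = set()
    for ip, ts in requests:
        window = windows[ip]
        window.append(ts)
        while window and ts - window[0] > 1.0:
            window.popleft()  # keep only the trailing one-second window
        if len(window) >= max_per_second:
            flagged.add(ip)  # clearly mechanical; filter it for this session
    return flagged
```

The point of a heuristic like this is that it flags clearly mechanical behavior as it happens, rather than blacklisting an address forever.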

The IAB’s idea may seem noble in concept, but it doesn’t make sense in practice. I’m glad that it thought to address the issue. Spider and robot activity is not a subject the average online media planner gives much thought to, but it should be. The IAB deserves thanks for putting it on the agenda and reminding us all that it’s a big reason behind inaccurate measurement statistics.
