Testing the Performance of Your Media Buys

Much is made of our ability to take an ongoing media campaign, learn from it, and change the campaign to reflect what we’ve just learned. And, in principle, this is true. But to get valid results, we need to follow some particular processes and avoid some rather common bugaboos.

A Valid Comparison Involves Apples Versus Apples

The most common failing in this testing and iteration process is the assumption that the data we get from one source can be compared to and used with data from another source. Many media buyers take the click-through rate of one site and compare it to the click-through rate of another site to determine the quality of audience relative to the creative. Unless those two sites happen to be reporting via the same technology and methodology, you’re going to have significant differences.

But media buyers are an optimistic lot. They tend to know that the numbers they throw together into their Excel spreadsheets aren’t exactly apples and apples. Maybe there are a few pears and peaches thrown in, perhaps even some carrots and arugula. But it’ll all work out in the wash, won’t it?

Well, not really. You have to remember that the differences we seek in the numbers are quite small. We’re looking for a 5 percent difference in media performance, for instance. This requires a great deal of data harmony to not set off our data detectors at completely inappropriate times.

Here’s a good way to really bring this point home. Put together your own Microsoft Excel spreadsheet with three columns of numbers. Make them obvious patterns, like 1, 2, 3, etc. Then, make a fourth column that multiplies them all together in some sort of an equation – like column A times column B divided by column C to the seventh power. Now, chart that last column, and you’ll see some sort of pattern. Perhaps a sine wave or a trending curve.

Now, use Excel’s random-number function (RAND) to vary each number in the first three columns by up to 15 percent. Essentially, this takes the original list of patterned numbers and shifts every one of them by a random amount within a specified range.

After doing that, look at your pretty graph. It gets rather ugly. In fact, the more you have the spreadsheet recalculate the numbers, the more you wind up seeing different “patterns” in the graph each time. It goes up one way one time and down that way the next. This is the moral equivalent of employing different site numbers from different sources and attempting to draw conclusions about your campaign. You wind up chasing all those phantom patterns.
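For readers who would rather script the spreadsheet exercise than build it, here is a minimal sketch in Python. It assumes the example formula mentioned earlier (column A times column B divided by column C to the seventh power) and treats the 15 percent change as uniform random noise. The helper names (derived, jitter, noisy_series) and the ten rows of patterned numbers are made up for illustration.

```python
import random

def derived(a, b, c):
    # Column D: A times B divided by C to the seventh power.
    return a * b / c ** 7

def jitter(value, pct, rng):
    # Shift a value by a random amount within +/- pct of itself.
    return value * (1 + rng.uniform(-pct, pct))

def noisy_series(rows, pct, rng):
    # Recompute column D after jittering every input cell independently.
    return [
        derived(jitter(a, pct, rng), jitter(b, pct, rng), jitter(c, pct, rng))
        for a, b, c in rows
    ]

# Three columns of obvious patterns: 1, 2, 3... / 2, 3, 4... / 3, 4, 5...
rows = [(n, n + 1, n + 2) for n in range(1, 11)]
clean = [derived(a, b, c) for a, b, c in rows]

rng = random.Random(42)  # fixed seed so a run can be repeated
for trial in range(3):
    noisy = noisy_series(rows, 0.15, rng)
    # Ratio of noisy to clean output, row by row. 1.00 means "no change".
    ratios = [n / c for n, c in zip(noisy, clean)]
    print(f"recalculation {trial + 1}:", " ".join(f"{r:.2f}" for r in ratios))
```

Each pass through the loop plays the role of one spreadsheet recalculation. Because column C is raised to the seventh power, a 15 percent wobble in that one input can by itself more than double the result or cut it by more than half, which is why the "pattern" changes shape from one recalculation to the next.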

Anecdotal Versus Planned Tests

This brings us to the “any jerk” factor. Any jerk can look at a set of numbers and see some sort of pattern. Many of us have experienced this while playing pool with friends. There’s usually some person who whacks the heck out of the white ball, which then wings around the table hitting all the other balls several times, creating havoc. When one of his balls goes into the pocket, he smiles as though this were the precise intended result. This, any jerk can do. We know the validity of the trend or of the player’s ability only when the player “calls” his shots.

Online testing is just like this. Any jerk can look at a postbuy report and draw all sorts of conclusions about which creative and which sites are doing better and why. Just try proving to the client that this will happen similarly next month, as it’s improbable that many of these conclusions are valid. We have to set up the tests first, deliberately testing a particular hypothesis against known controls.


When we test a given hypothesis (for instance, that creative A will perform better than creative B on sites that have users of type X), we have to eliminate all those random factors that would otherwise mislead us. I assume that we’ve already figured out a way to get apples-to-apples data. But then we have to set up the test to make sure that other errant variables won’t spoil the mix.

We do this by purchasing media in identical, smaller batches. We then test only one variable at a time, to make sure that the particular time of day or particular site we purchased or some other factor doesn’t come in and give us a false read. For instance, we take two identical little media packages and put the two different creative pieces on at the very same time, with all other factors presumably “controlled.” The results we get from that test are much more likely to truly reflect reality.
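Once two identical packages have run, we still need to decide whether the gap between them is real or just noise. One common way to check is a two-proportion z-test on the click-through rates. This is a standard statistical technique rather than anything specific to this column, and the sketch below is only that. The function name and every impression and click count are hypothetical.

```python
from math import erfc, sqrt

def two_proportion_z(clicks_a, imps_a, clicks_b, imps_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / imps_a
    rate_b = clicks_b / imps_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical test: creative A beats creative B by 5 percent (1.05% vs 1.00% CTR).
small = two_proportion_z(1_050, 100_000, 1_000, 100_000)
large = two_proportion_z(10_500, 1_000_000, 10_000, 1_000_000)

for label, (z, p) in (("100,000 impressions each", small),
                      ("1,000,000 impressions each", large)):
    verdict = "likely real" if p < 0.05 else "could easily be chance"
    print(f"{label}: z = {z:.2f}, p = {p:.3f} ({verdict})")
```

With these made-up numbers, the same 5 percent lift that looks like chance at 100,000 impressions per creative clears the conventional 0.05 threshold at a million. The test only means something if the two packages really were identical apart from the creative.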

Accumulated Knowledge

If we do our media buying with these deliberate tests in mind, we can accumulate an enormous amount of data from our clients’ purchases. Over time, this becomes an incredible resource for the agency. After buying millions of dollars of media, we can even begin to run predictive “phantom” media runs, where we query the database to tell us what a particular type of creative would likely do for a particular product on particular sites. And, if enough good data is in there, we can get some fairly good predictions.

This assumes that we collect all our data via an accurate mechanism, that we use the same technology for all these media buys, and that we keep pretty good records as to what creative runs where and just what type of creative that is. Some agencies (some of the best ones) actually take the time to categorize the creative in their media database. This way, they can later query the database, “What are the best sites in terms of transactional performance with direct-marketing creative in this product category?” Useful stuff.
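To make that kind of query concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, site names, and figures are all invented for illustration. A real agency database would carry far more detail, and nothing here describes any particular agency's system.

```python
import sqlite3

# Hypothetical schema: one row per media buy, with the creative categorized.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE buys (
           site TEXT, product_category TEXT, creative_type TEXT,
           impressions INTEGER, transactions INTEGER)"""
)
conn.executemany(
    "INSERT INTO buys VALUES (?, ?, ?, ?, ?)",
    [  # made-up sample data
        ("site-a", "travel", "direct-marketing", 50_000, 40),
        ("site-a", "travel", "direct-marketing", 50_000, 60),
        ("site-b", "travel", "direct-marketing", 80_000, 120),
        ("site-b", "travel", "branding", 80_000, 10),
        ("site-c", "finance", "direct-marketing", 60_000, 300),
    ],
)

def best_sites(conn, creative_type, product_category):
    # Rank sites by transactions per impression for one creative type
    # within one product category, pooling all matching buys per site.
    return conn.execute(
        """SELECT site,
                  SUM(transactions) * 1.0 / SUM(impressions) AS rate
           FROM buys
           WHERE creative_type = ? AND product_category = ?
           GROUP BY site
           ORDER BY rate DESC""",
        (creative_type, product_category),
    ).fetchall()

ranking = best_sites(conn, "direct-marketing", "travel")
for site, rate in ranking:
    print(f"{site}: {rate:.4%} transactions per impression")
```

The query filters on product category as well as creative type, so results from one category are never mixed with another's. It is also only as good as the categorization, which is why the record keeping has to happen at the time of the buy.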

I’ve found that one product category frequently differs greatly from another, and much of what we learn does not apply across clients. It’s not good enough to tell your clients that they need to zig this way one month and zag that way the next month based on pure stimulus-response reactions to postbuy analysis. You don’t get anywhere. You need to get good data and purchase media in a controlled fashion to really build one piece of knowledge upon another. This way, we can advance a client fairly quickly down the road to sophistication in the product category.
