Big Data: Simple as a Can of Soup

Let’s start with the over-hyped view of big data: Just dump a lot (volume) of different types (variety) of streaming (velocity) data into a magic big data machine and out pops fascinating facts we had no way of deriving before. Like heating up a can of soup for lunch.

If only.

Next, we move to the partially educated view of big data: Extract, transform, and load a lot of different types of data into a big data service in the cloud and suddenly, we’re able to ask much more sophisticated questions than ever. Nothing just “pops out,” but the breadth and depth of our queries inevitably lead to new insights.

This is almost true, and yes, it takes some serious deep thinking. Moreover, it takes serious deep thinking by people who know whereof they think. Stephen Hawking, brilliant though he may be about the theory of everything, knows next to nothing about some things. Given a fabulous dataset and enough cycles, he could ask excruciatingly insightful questions about cosmology and amyotrophic lateral sclerosis. But he wouldn’t be very helpful if asked to formulate deeply valuable questions on matters of advertising and marketing attribution. He is simply not a subject matter expert in this arena.

Finally, the experienced view of big data agrees that you need a subject matter expert, but knows that a data matter expert is also required. This is somebody who speaks with authority about the organization’s goals and KPIs but intimately understands one particular data set or another. The more streams you introduce into the river, the more people you need who understand the origins of their respective data sets.

Let’s try an analogy: that can of soup.

Big Data as Supply Chain

Let’s assume you make soup for a living, millions of gallons of chicken noodle soup. That makes you responsible for 22 individual ingredients as well as five different kinds of chicken (chicken stock, chicken powder, chicken fat, dehydrated cooked chicken, and cooked chicken meat).

If somebody comes along and says your latest batch is yucky, you have to figure out what ingredient or process was the culprit. You cannot be an expert on all ingredients all the time. You have to rely on others who know where that calcium carbonate came from and what happened to it along the way. Somebody else can tell you where the vegetable oil was sourced and how trustworthy it is. You may know your chicken, but only an expert can tell you if that yeast extract went bad in the vat.

Data is just the same way. Each stream has an expert behind it who captured it, sampled it, cleaned it, filtered it, aggregated it, segmented it and stuffed it into a Hot Pocket dashboard so you could microwave it for two minutes whenever you like.

Did you know that one wrong character in one variable on one page could ruin more data than you could ever fix? No data is bad, but false data is worse. – ObservePoint

If the numbers look odd, you have to know which data stream might be to blame and who is responsible for the care and feeding of that data set. Much worse is not knowing that something might be wrong.

Big Data Means Big Potential for Error

Craig Scribner of Tracking First recently posted an eye-opening article that kicks off with this gem: “Adobe Analytics clients are genuinely surprised to learn how far off the mark their own reports have strayed.”

Tracking First is a tracking code generator that provides automation for classifications, link formation verification, and, generally, data collection assurance. All of this becomes manifestly necessary when you find out what Scribner comes across on a daily basis. At one client:

The code generator(s) started to stray from the standards they themselves had set. …

At this point of my audit, I had only focused on the tracking codes themselves; I hadn’t even reached the classifications yet.

When I did shift my focus to classifications, I saw that the client intended to classify each of these new codes in five different ways. One classification in particular: Campaign Name, outshone all the rest in terms of accuracy and longevity. It was almost always classified the right way — that is, until September, when classification work for this report (along with all others) ceased entirely. Complete blackout.

(Figure: The classification report that outlasted all of the others.)

When I showed this to the client, he remembered that September was when Judy left the team and handed this job off to somebody else. That’s so typical!

As a result, reports got worse and worse over time without so much as a yellow flag or some underling scratching their heads and saying, “Excuse me, but this doesn’t look right.” A bit like slowly getting more and more off course in the ocean with no instruments to guide you.
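The missing yellow flag does not have to be elaborate. Here is a minimal sketch in Python, assuming each tracking code can be exported as a (month, code, campaign name) row. The row layout, the function names, and the 90 percent threshold are all hypothetical, and the sample rows are made up for illustration. It computes the share of codes classified each month and lists the months that fall below the threshold.

```python
def classification_coverage(rows):
    """Return {month: share of tracking codes with a non-empty campaign name}."""
    totals, classified = {}, {}
    for month, _code, campaign_name in rows:
        totals[month] = totals.get(month, 0) + 1
        if campaign_name:  # None or "" counts as unclassified
            classified[month] = classified.get(month, 0) + 1
    return {m: classified.get(m, 0) / totals[m] for m in totals}


def flag_months(coverage, threshold=0.9):
    """List the months whose coverage falls below the (assumed) threshold."""
    return sorted(m for m, share in coverage.items() if share < threshold)


# Made-up sample rows: month label, tracking code, campaign name.
rows = [
    ("08", "em_001", "Summer Sale"),
    ("08", "em_002", "Summer Sale"),
    ("09", "em_003", ""),
    ("09", "em_004", None),
]

coverage = classification_coverage(rows)
flagged = flag_months(coverage)
print(flagged)
```

Run on a schedule, a check like this would have surfaced the September blackout in the first report after classification work stopped, rather than during an audit months later.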

While the Internet might be the most measurable medium ever, the instruments are fragile and cranky. Changing from row-and-column data management to time series is a challenge, and bringing lots of discrete types of data together creates manifold complications.

Data Wranglers Needed

As Gary Angel put it in a recent post called “The Big Data Analytics Warehouse and Digital Data Models”:

With digital data, we get a very detailed stream of events. At the lowest level, these events are a combination of page views and intra-page events like link clicks or client-side UI events (faceting, exposures, rotations, plays, etc.). A typical Web session may have anywhere from a half-dozen to hundreds of these events. Of course for some types of analysis, including most intra-session site studies, you’ll want the data to be in exactly this form. You can’t analyze site usability unless you can drill down into the low-level components of site navigation. However, if you want to analyze questions about visitor behavior and journey, leaving the data like this is problematic.

So, argues Angel, one must start aggregating, and he offers some suggestions about how to go about it. However, he cautions:

There’s nothing very hard about this, … there’s also nothing very good about this. In aggregating the session this way, we’ve stripped almost everything meaningful out of the data. It’s simply impossible, using this aggregation, to understand what the visit was about and whether it was successful. The greatest analyst in the world couldn’t squeeze much meaning out of this aggregation.

He then provides the next logical step (flag sessions with important success events).
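To make the idea concrete, here is a minimal sketch in Python of rolling an event stream up to one row per session and flagging the sessions that contain a success event. It is an illustration under stated assumptions, not Angel's actual data model: the field names (session_id, type), the event names, and the contents of SUCCESS_EVENTS are all hypothetical.

```python
from collections import defaultdict

# Hypothetical set of events that count as "success"; in practice the
# owner of the data feed would have to supply and maintain this list.
SUCCESS_EVENTS = {"purchase", "signup"}


def summarize_sessions(events):
    """Roll a list of event dicts up to one summary row per session."""
    sessions = defaultdict(
        lambda: {"events": 0, "page_views": 0, "success": False}
    )
    for event in events:
        row = sessions[event["session_id"]]
        row["events"] += 1
        if event["type"] == "page_view":
            row["page_views"] += 1
        if event["type"] in SUCCESS_EVENTS:
            row["success"] = True
    return dict(sessions)


# Made-up sample events for two sessions.
events = [
    {"session_id": "a", "type": "page_view"},
    {"session_id": "a", "type": "link_click"},
    {"session_id": "a", "type": "purchase"},
    {"session_id": "b", "type": "page_view"},
]

summary = summarize_sessions(events)
print(summary)
```

Note how much the summary row depends on what goes into SUCCESS_EVENTS. That list is exactly the kind of pre-processing decision that is invisible to whoever later opens the dashboard.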

But when data is pre-processed like this, it takes a team of forensic data scientists to track down the cause of any problems – supposing somebody is aware that there is a problem.

Angel himself advocated in a prior blog post, “You get a data feed. You get a data geek.”

If you can’t embed a team member, I’d suggest both an upfront, detailed walk-through of the data feed with a discussion of every field, how it’s used and how it’s often misinterpreted. Following this, I’d suggest regular check-ins on the usage of the data.

It turns out that making a good chicken soup and putting it into a can is hard work. The same is true for data. Those who would be data-driven just want to open a dashboard and heat it up for lunch. But without trustworthy, audited data supply chain management, we’re in danger of slipping something sour into it.
