Customer Data Munging and Reconciliation for Correlation

You want custom-made clothing? Step right up and we’ll measure you. We’ll find out how tall, wide, and thick you are in any number of places. Well, 33 places, to put a specific number on it.

At least, that’s how many are used by MGL Industries and burlesque costumer Glitz by Linda Joyce, and that’s not even counting ear height, glove length, or pastie size.

Now imagine that your custom clothier were to measure your neck with calipers, your arms with a yard stick, your wrist with a micrometer, your hat size with a protractor, your inseam with a tape measure, your arms with a laser range finder, and your waist with a Smart Finger.

Not only would the process be time-consuming, the results would be a mishmash of not-quite-relatable numbers.

I’ve been pondering the dilemma of differing customer data for a while. I remain hopeful but not immediately confident. Much the same as I feel about medicine, law, and government. Will we ever be able to put all our digital data eggs into one customer warehouse basket and come out with a reliable omelet?

Data management mechanics have long been mapped out: capture, cleanse, store, extract, etc. But it’s the transformation of all that customer behavioral data, which comes just before loading it into the master warehouse, that has me concerned. Customer data come in all shapes, sizes, weights, densities, and values.

It is a given that any two advertising servers will record their performance in slightly different ways, that any two web analytics tools on the same site will generate different numbers, and that any two customer satisfaction indexes will differ. This is merely the problem of the man with two watches who does not know what time it really is.

This issue is put to bed by giving up hope for standard, industrial strength metrics, acknowledging that every yardstick is slightly dissimilar. Organizations succeed when they settle for internal consistency over galactic exactitude.

Data cleansing is not as problematic. It draws on the services of a data dictionary. In system A, men and women are identified as either M or W; in system B, as either M or F; and in system C, as either 1 or 2. A quick cross-reference puts all things to rights as long as “Decline to state,” “A little of each,” and “Not sure yet” are accounted for.
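
That cross-reference might look something like the sketch below. The system names A, B, and C come from the example above; the canonical labels, the assumption that 1 means male and 2 means female in system C, and the function name are all made up for illustration. Anything the dictionary does not recognize falls into a catch-all bucket rather than being forced into one of the two main values.

```python
# Hypothetical data dictionary: one lookup table per source system.
# Assumption: in system C, 1 = male and 2 = female (the text does not say).
GENDER_CODES = {
    "A": {"M": "male", "W": "female"},
    "B": {"M": "male", "F": "female"},
    "C": {"1": "male", "2": "female"},
}

def normalize_gender(system, raw_value):
    """Translate a system-specific code into a shared label.

    Unknown systems, missing values, and answers such as
    "Decline to state" all map to "unspecified".
    """
    if raw_value is None:
        return "unspecified"
    code = str(raw_value).strip().upper()
    return GENDER_CODES.get(system, {}).get(code, "unspecified")

print(normalize_gender("A", "W"))                  # female
print(normalize_gender("C", 2))                    # female
print(normalize_gender("B", "Decline to state"))   # unspecified
```

The design choice worth noting is the explicit default: a record that does not fit the dictionary is kept and labeled, not dropped or guessed at.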

Merging or joining all of these data so they make sense requires a thorough understanding of how each is calibrated. In one case, a week’s worth of data represents data collected between Monday morning and Sunday night. In another, it’s Sunday morning to Saturday night. In a third, it’s simply the monthly total divided by 4, 4.25, or 4.33333. Messy, but manageable.
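
As a rough illustration of the week-boundary problem, here is a minimal sketch that re-buckets daily counts into weeks under either convention. It assumes daily figures are available from the source; the function names and sample numbers are invented for the example.

```python
from datetime import date, timedelta

def week_start(day, first_weekday):
    """Return the first day of the reporting week that contains `day`.

    `first_weekday` follows date.weekday() numbering:
    0 = Monday (a Monday-to-Sunday week), 6 = Sunday (Sunday-to-Saturday).
    """
    offset = (day.weekday() - first_weekday) % 7
    return day - timedelta(days=offset)

def rebucket(daily_counts, first_weekday=0):
    """Sum daily counts into weekly totals keyed by week start date."""
    totals = {}
    for day, count in daily_counts.items():
        key = week_start(day, first_weekday)
        totals[key] = totals.get(key, 0) + count
    return totals

# Made-up daily visit counts around a week boundary.
visits = {
    date(2024, 1, 6): 5,  # Saturday
    date(2024, 1, 7): 3,  # Sunday
    date(2024, 1, 8): 2,  # Monday
}

print(rebucket(visits, first_weekday=0))  # Monday-start weeks
print(rebucket(visits, first_weekday=6))  # Sunday-start weeks
```

The same three days yield weekly totals of 8 and 2 under one convention and 5 and 5 under the other. Note that this only helps when the daily detail survives; a "week" that is really a monthly total divided by 4.33333 cannot be realigned this way.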

The really tricky bit comes when trying to attribute said data to individual individuals. For that, a common key is needed. If we each had one and only one telephone number, email address, customer ID number, or ship-to address, then all the information about one person could be correlated with all of the other information about that one person. We don't. Multiply that multi-headed hydra by the number of cookies we have on the number of devices we use and the problem becomes nail-biting.
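
One common way to approximate a shared key when nobody has just one identifier is to link any records that share at least one identifier value. The sketch below does this with a simple union-find structure. It is an illustration under generous assumptions (identifiers are already cleaned and exact matches are trustworthy), not a description of any particular product; the function name and sample identifiers are hypothetical.

```python
def link_identities(records):
    """Group identifiers that appear together, directly or transitively.

    `records` is a list of sets of identifier strings, one set per
    observed record (a purchase, a login, a cookie sync, and so on).
    Returns a list of sets, each representing one presumed person.
    """
    parent = {}

    def find(x):
        # Walk to the root of x's group, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for identifiers in records:
        ids = sorted(identifiers)
        if ids:
            find(ids[0])  # register records that carry a single identifier
        for other in ids[1:]:
            union(ids[0], other)

    groups = {}
    for identifier in list(parent):
        groups.setdefault(find(identifier), set()).add(identifier)
    return list(groups.values())

# Made-up records: the first two share a cookie, so they merge.
records = [
    {"email:a@example.com", "cookie:123"},
    {"cookie:123", "phone:555-0100"},
    {"email:b@example.com"},
]
for person in link_identities(records):
    print(sorted(person))
```

The weakness is the same one described above: a shared device or a recycled phone number will happily merge two people into one, so real matching usually needs rules about which identifiers are trusted.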

The additional challenge is something I have heard referred to as “data munging.” This is the art of associating apples and orangutans. The two have very little in common and have dramatically different attributes. Nevertheless, we are compelled to assume that their coexistence in the same database will reveal hitherto unrealized returns on investigatory investment.

Social media influences, advertising exposures, click-through activities, email opens, blog post sentiments, shopping cart inclusions, likes, shares, and +1s are not measurable in the same way, by the same scale, or in any standard form. And yet…

During a recent interview, Brandt Dainow from ThinkMetrics asked about the complexity of data reconciliation. I could not give him a clear answer. I was, instead, frustrated that the term “data reconciliation” would be perfect for this problem if it were not already in vogue to describe rectifying errors introduced by measurement noise.

Now that you’ve read this far, I have two pleas to make to you:

  1. What is the proper term for the conjoining of disparate data types to build a truly useful model – in this case of customer behavior – for the purpose of optimizing marketing?
  2. And, does anybody have any ideas they’d like to share on how this can be done in a way that is useful across more than one instance (industry/product line/campaign)?

I’m all ears. (3 ¼” x 1 ¾” each.)
