Getting Your Data Into Shape

What does your company have in common with both the U.S. government and an e-commerce/catalog company? Very likely, you share a problem: Your data is not in the format most conducive to extracting useful information.

Two projects I worked on illustrate the problem and its implications, as well as suggest a beginning to the solution. One project involves data supplied by the U.S. government; the other involves data from an e-commerce/catalog company.

Case Study: U.S. Census Bureau

The U.S. government project involved determining the best target markets for a particular product. The marketers didn’t have a house file with information about their best prospects. However, they did have a reasonable idea of their target market’s general demographics.

Armed with this knowledge, the marketers investigated the U.S. Census data as a source of demographic knowledge about different U.S. regions. The marketers then compared their target demographics against regional demographics and determined which areas were most appropriate for their product rollout. The U.S. Census is an excellent resource; it’s comprehensive and available free of charge.

The marketers went to the U.S. Census Web site, where they found good news and bad. The good news was data are available for free via an FTP download. The bad news was the necessary data were scattered across an innumerable number of different tables and data files, not to mention different file formats.

Converting the available data from ASCII to another database format (Access and SAS instructions are provided) then to a statistical analysis format can be a formidable task. It could involve combining data from over 76 different files!

Yet since this is the land of opportunity, some enterprising entrepreneurs realized marketers probably don’t want to go through this process. They completed much of the work for us. GeoLytics is one such company.

This type of data provider offers to put data into a format that’s much more useful to marketers. It will provide either one of its CD products that already contain what you need, or a custom data set pulled to your specifications and requirements.

Though I appreciate these companies’ efforts, I fault the folks at the U.S. Census Bureau for not creating some summary-level files that would be useful to marketers. Yes, data are collected and provided to the public. But it’s not provided in a format that even approaches useable. I do, however, applaud them for their intent of making the data available via FTP.

Case Study: E-Commerce Cataloger

I worked on a retail catalog data analysis project, which involved examining the purchase history of customers who bought via mail, telephone, or the Web. The good news is the data funneled into one system rather than being stored in separate systems.

Unfortunately, the data was provided in a manner unfavorable to analysis without extensive data manipulation. Instead of being in one file, the data was in multiple files. Separately, the files yielded no valuable information.

One file contained basic customer information, such as name, address, sex, and age. Another file had a lengthy list of order numbers and the customer identification numbers associated with them. Yet another file held information about all purchased items and order numbers associated with them.

Storing data in separate tables like this allows it to be captured and methodologically stored in the data warehouse. However, in this structure the data can only be used to retrieve all information about one customer in a simple fashion. For example, a telemarketer can quickly extract all the records pertaining to one specific customer from each of these tables. When a customer calls the customer service line, the telemarketer can easily retrieve all the customer’s information.

Unfortunately very little else can be done with the data in this structure. If you want to know what the most popular product is in Florida, for example, you’ll have to go through four steps:

  1. Extract all records with “FL” as the state field from the customer file.
  2. Extract all orders for Florida customers from the order file.
  3. Match the orders to the item file to find the ordered items.
  4. Sum and rank the counts of each item to determine the most popular item in Florida.

What a cumbersome and time-consuming process! Imagine if you had 10 other states to examine, not to mention any other potential subsets. The analysis will take more time than it is reasonable.

The Solution

In both cases, the data is not designed to end-user requirements. It’s designed to store, rather than use, data. Were the Census Bureau to identify some of the data’s uses, it could easily provide the data in a format that’s more easy to use. The information needed in the retail analysis was fairly basic and very useful — as well as reusable. The retailer should examine it every month. However, in its current format, the data must be manipulated every time to put it in a useable format.

Data being provided in an nonuseable format occurs at far too many companies. It causes companies to be inefficient and costs them money.

If the provided data is going to be used for a one-time analysis, it isn’t worth the effort to make the process more efficient. However, if the analysis is to be conducted regularly, process improvements should be made.

Clearly define how you want to use the data. The data warehouse group can then produce summary tables containing the necessary data, at the required level. Once in place, the summary table can be updated with current information. It can be used by end users, including marketers, without much further assistance from the data warehouse.

Going through this process helps the immediate parties involved in producing the new information and any subsequent groups that want to use similar information. The result is (once again) better information at a reduced cost, as well as happier marketing and data warehouse employees.

As a marketer, what’s your relationship with your data source (likely a data warehouse group)? If you can help the data source complete the extra formatting necessary to accomplish your goals, it will prevent multiple instances of reformatting down the line.

The result? Better information at a reduced cost, as well as happier marketing and data warehouse employees.

Related reading

Big Data & Travel
Flat design modern vector illustration concept of website analytics search information.