My last column took a quick look at some tools for the analyst’s toolkit other than the Web analytics system. These include business intelligence (BI) or online analytical processing (OLAP) tools, visualisation tools, statistical analysis, and data mining tools. This week, I want to look deeper at the use (and possible abuse) of statistical analysis and data mining techniques.
Statistical analysis and data mining cover a wide variety of approaches, methodologies, and techniques that might be useful for the Web analyst. They can be broadly classified as follows:
- Statistical analysis
- Classification techniques
- Clustering and segmentation methodologies
- Text analysis
There’s a saying: “If you torture the data long enough, it will tell you anything you want it to.” These kinds of data analysis techniques can be very powerful and can be used to uncover nuggets of gold in your data. They must also be used carefully. The analyst must ensure results are robust and reliable and, above all, make sense. Data mining is as much an art as a science.
Simple statistical analysis techniques, such as frequencies and histograms, can reveal interesting data patterns. I’ve written before about the dangers of using average metrics, such as average pages per visit, as they hide interesting differences in behavior. Worse than that, they can be misleading.
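To make the point about averages concrete, here is a minimal sketch in Python. The pages-per-visit figures are invented for illustration and do not come from any real site; the function names are my own.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical pages-per-visit values for ten visits (made-up data).
pages_per_visit = [1, 1, 1, 1, 1, 1, 2, 2, 15, 25]


def frequency_table(values):
    """Count how often each value occurs, ordered by value."""
    return dict(sorted(Counter(values).items()))


def text_histogram(values):
    """Return one line per distinct value, with one '#' per occurrence."""
    table = frequency_table(values)
    return [f"{value:>3} | {'#' * count}" for value, count in table.items()]


average = mean(pages_per_visit)
middle = median(pages_per_visit)
share_single_page = pages_per_visit.count(1) / len(pages_per_visit)

print("average pages per visit:", average)
print("median pages per visit:", middle)
print("share of single-page visits:", share_single_page)
for line in text_histogram(pages_per_visit):
    print(line)
```

In this invented sample the average is pulled up by two long visits, while the frequency table shows that most visits viewed a single page. That is the kind of difference a top-line average hides.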
In the work we do, we often spend a lot of time initially carrying out exploratory analysis, looking at data patterns and distributions. It’s time well spent. It gives us a feel for what’s going on below the top-line metrics and helps later when we begin to look at the results of other analytical techniques. As a marketing analyst, you must have a sense of how data are made up, how top-line metrics are constructed, and where they come from. For example, you may find some extreme values, or outliers, that might affect your results and so need to be dealt with in some way or another.
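One common way to flag extreme values during exploratory analysis is the interquartile range (IQR) rule. The sketch below assumes that rule with the conventional 1.5 multiplier; it is one option among several, not necessarily the right treatment for your data, and the order values are made up.

```python
from statistics import mean, quantiles

# Hypothetical order values (made-up data); one order is far larger than the rest.
order_values = [20, 22, 25, 27, 30, 31, 33, 35, 38, 400]


def iqr_fences(values, multiplier=1.5):
    """Return (lower, upper) fences using the interquartile range rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    spread = q3 - q1
    return q1 - multiplier * spread, q3 + multiplier * spread


lower, upper = iqr_fences(order_values)
outliers = [v for v in order_values if v < lower or v > upper]
typical = [v for v in order_values if lower <= v <= upper]

print("flagged as outliers:", outliers)
print("mean with outliers:", mean(order_values))
print("mean without outliers:", mean(typical))
```

Whether a flagged value should be removed, capped, or kept is a judgment call that depends on what the value represents. The code only identifies candidates for a closer look.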
With statistical analysis, you may want to compare different groups of visitors or customers. For example, you can look to see whether the repeat order rate is higher among some groups of customers than others. You can apply statistical tests to see whether any differences are statistically significant or simply the result of variability in the data. Significance testing can be important in A/B tests, to ensure A is really better or worse than B before making any changes to the site.
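As an illustration of significance testing in an A/B test, here is a minimal sketch of a two-proportion z-test, one standard test for comparing two conversion rates. The visitor and conversion counts are hypothetical, and the 5% threshold is a common convention rather than a rule.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical test: 4% vs. 5% conversion, at two different sample sizes.
z_large, p_large = two_proportion_z_test(200, 5000, 250, 5000)
z_small, p_small = two_proportion_z_test(20, 500, 25, 500)

print(f"5,000 visitors per version: z={z_large:.2f}, p={p_large:.3f}")
print(f"500 visitors per version:   z={z_small:.2f}, p={p_small:.3f}")
```

The same one-point difference in conversion rate is significant at the 5% level with 5,000 visitors per version but not with 500. A gap that looks real in a small sample may just be data variability.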
There are many different types of classification techniques, including regression analysis, which is often used in credit scoring, and artificial intelligence (AI) approaches, including neural networks. Today, we’ll look at decision tree classification. There are a number of different algorithms in this type of technique, including CHAID (chi-squared automatic interaction detection), CART (classification and regression trees), and QUEST (quick, unbiased, efficient statistical tree). These algorithms essentially do the same thing in different ways: assign data records (such as visitors or customers) to groups of interest based on the other variables on the record.
Let’s say you have customer records split into two groups: single-order customers and repeat customers. You also have a string of other data on those customers and want to understand what the key characteristics are that distinguish someone who orders once from someone who orders repeatedly. Decision-tree methods look at all the other variables and determine which is the most important factor in the difference between a single-order shopper and a repeat shopper. The process repeats again and again until it has determined what all the significant factors are in order of priority.
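The first step of that process, finding the single variable that best separates the two groups, can be sketched as follows. This is a simplified, CART-style illustration that scores each variable by its reduction in Gini impurity; CHAID and QUEST use different statistical criteria. The customer records and field names are hypothetical and are not the data from our own work.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of labels (0.0 means all labels are the same)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_reduction(records, variable, target):
    """How much splitting on `variable` reduces impurity in `target`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[variable]].append(record[target])
    weighted = sum(len(g) / len(records) * gini(g) for g in groups.values())
    return gini([record[target] for record in records]) - weighted


def best_split(records, variables, target):
    """Return the variable with the largest impurity reduction."""
    return max(variables, key=lambda v: gini_reduction(records, v, target))


# Hypothetical customers: (opted in to newsletter, first order size, repeat buyer).
rows = [
    ("yes", "large", True), ("yes", "large", True),
    ("yes", "small", True), ("yes", "small", False),
    ("no", "large", False), ("no", "large", False),
    ("no", "small", True), ("no", "small", False),
]
customers = [
    {"opted_in": opted, "first_order_size": size, "repeat": repeat}
    for opted, size, repeat in rows
]

variables = ["opted_in", "first_order_size"]
top_variable = best_split(customers, variables, "repeat")
print("most important factor in this sample:", top_variable)
```

A full decision tree would then repeat the same search within each resulting group, which is how it arrives at factors in order of priority.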
The great thing about decision trees is that the output is very visual and relatively easy to understand. They can get big and cumbersome, though, especially if you are dealing with a lot of variables. Decision-tree techniques have been used for years in direct marketing work to determine which types of people are most likely to respond to mailings, so companies can cut mailing costs.
In online marketing, mailing costs aren’t as big an issue as in the offline world, but we’ve used techniques like decision trees in other areas to understand which factors influence visitors to do something or not. In the earlier example, we did a piece of work in which we looked at many potential factors, including:
- Size of the first order
- Number of site visits after the first order
- Product category of the first order
- Product categories browsed after the first order
- Whether customers were opted in to the email newsletter
- Number of newsletters they received
- Timing of the newsletters after the first order
Of all the factors we examined, the most important one was whether the customer had opted in to the email newsletter and had received a newsletter within five days of the first order. That finding was vital input into a retention marketing program.
Decision-tree techniques are also useful for profiling and understanding different segments of visitors or customers. Next time, we’ll look at segmentation techniques.