Stats Attack II: Correlation

This is the first of a two-part series covering multivariate methods in market research. Today’s column focuses on correlation, one of the primary building blocks of many market-research tools, such as key driver analysis, factor analysis, and structural equation models. I will discuss some of these methods and their use in Web market research in my next column. This one explains some fundamentals of correlation, covering both uses and misuses of the concept. Understanding these ideas will enable you to look at market-research methodology with a critical eye.

What Is Correlation?

The simplest kind of analysis takes variables one at a time. For example, the average age of visitors to your Web site is 33.7; the average purchase per customer is $15.37. These numbers are good to know, but perhaps not too exciting. A more interesting, and more actionable, finding would be that younger visitors tend to make larger purchases. Relationships between two variables are often more interesting than one-variable summaries.

The Correlation Coefficient

One of the most common ways to summarize the relationship between two variables is the correlation coefficient. The correlation coefficient quantifies the degree to which two variables are associated. There is a formula for it, but you don’t need to know it. However, here are some facts you should be familiar with:

  • Correlation coefficients are between -1 and 1.
  • A large positive correlation means that two variables “move together” — when one goes up, the other tends to go up. For example, if you recorded both “hits” and “unique visitors” to your Web site on a daily basis, these two variables would likely be highly correlated.
  • Negative correlation means the variables move in opposite directions: when one goes up, the other tends to go down. For example, the total size of images on a Web page and the speed of page downloads would be negatively correlated.
  • A correlation of 1 means there is an exact linear relationship between two variables. In other words, if you plotted one against the other, the points would lie on a straight (upward-sloping) line. For example, if exactly 50 percent of your site visitors were male every day, then the daily total of visitors would be perfectly correlated with the daily number of male visitors (because the total is always twice the male count). Similarly, a correlation of -1 means there is a perfect (negative) linear relationship.
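  • A correlation of 0 means there is no linear association between the variables, which usually means they are not associated at all. Variables can occasionally be related in a nonlinear way and still have a correlation of 0, but those exceptions are beyond the scope of this column.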
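
For readers who like to see the arithmetic, here is a minimal Python sketch of the standard (Pearson) correlation coefficient applied to the examples in the list above. The visitor counts, image sizes, and download speeds are made-up numbers for illustration, not real site data, and pearson is a helper written for this sketch rather than a library function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily counts: exactly half of each day's visitors are male.
male_visitors = [120, 95, 143, 110, 160]
total_visitors = [2 * m for m in male_visitors]
r_perfect = pearson(male_visitors, total_visitors)

# Hypothetical pages: total image size (KB) vs. download speed (made-up units).
image_kb = [50, 120, 300, 450, 800]
download_speed = [9.1, 8.0, 6.2, 4.9, 2.5]
r_negative = pearson(image_kb, download_speed)

# A symmetric curve: y is completely determined by x, yet the correlation is 0.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
r_curve = pearson(x, y)

print(r_perfect, r_negative, r_curve)
```

The first result is 1 (up to rounding) because the total is always exactly twice the male count. The second is strongly negative. The third is 0 even though y depends entirely on x, which is one of the exceptions mentioned in the last bullet.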

The correlation coefficient is a quick, simple way to measure the association between two variables. If you have many variables, the table (or matrix) of all the correlations between each pair of variables provides a convenient summary of the interrelationships in complex data. For this reason, computing correlations is often the first step in multivariate research.
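
As a rough illustration of such a table, the sketch below builds a small correlation matrix in plain Python. The visitor data and the variable names (age, purchase_usd, pages_viewed) are hypothetical, and in practice a statistics package would normally do this for you.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(data):
    """Return {name: {name: r}} for every pair of variables in data."""
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Hypothetical visitor-level data, one entry per visitor (not real measurements).
visitors = {
    "age": [22, 35, 41, 28, 53, 19, 46, 31],
    "purchase_usd": [24.0, 14.5, 11.0, 19.5, 8.0, 27.5, 10.5, 16.0],
    "pages_viewed": [12, 7, 9, 10, 4, 15, 6, 8],
}

matrix = correlation_matrix(visitors)

# Print the matrix as a simple text table.
names = list(visitors)
print(" " * 14 + "".join(f"{name:>14}" for name in names))
for a in names:
    print(f"{a:>14}" + "".join(f"{matrix[a][b]:>14.2f}" for b in names))
```

Each variable has a correlation of 1 with itself, and the table is symmetric, so only the entries on one side of the diagonal need to be read. In this made-up data, the negative entry for age and purchase_usd corresponds to the earlier example of younger visitors making larger purchases.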

The Limits of Correlation

Suppose a consumer-electronics Web site does a survey and finds that the speed of Internet connectivity is highly correlated with the number of purchases. Can we conclude that faster connections cause people to buy more? Not at all — there are several other plausible explanations. For example, those with faster connections may have more disposable income than other users and are therefore more likely to purchase. Or, alternatively, we might suspect that those users with faster connections are more likely to be techies and therefore more likely to purchase at an electronics site.

The moral of the story, which is taught in every introductory statistics class across the land, is that “correlation is not causation.” This truism is often repeated in research circles, but it is easy to forget, especially when the correlations are hidden inside much more complicated models. There are several reasons two variables X and Y may be correlated:

  1. X may influence Y.
  2. Y may influence X.
  3. A third variable may influence both X and Y.

This third option is important to keep in mind when analyzing data. If you have measured the third variable, it is possible (using more advanced methods) to control for it and determine the relationship between X and Y in isolation.
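
One of the simpler ways to control for a third variable is the partial correlation, sketched below in Python. This is only one of several possible methods, and the survey data is entirely hypothetical: it is constructed so that income drives both connection speed and purchases, while speed and purchases are unrelated among people with the same income.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_correlation(xs, ys, zs):
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical survey (assumed numbers): at each income level, every combination
# of "slightly slower/faster connection" and "slightly fewer/more purchases"
# appears once, so within an income level the two are unrelated.
income_k, speed_mbps, purchases = [], [], []
for inc in (20, 40, 60, 80):
    for speed_offset in (-1, 1):
        for purchase_offset in (-1, 1):
            income_k.append(inc)
            speed_mbps.append(inc // 10 + speed_offset)
            purchases.append(inc // 20 + purchase_offset)

r_raw = pearson(speed_mbps, purchases)
r_controlled = partial_correlation(speed_mbps, purchases, income_k)

print(f"speed vs. purchases, ignoring income:        {r_raw:.2f}")
print(f"speed vs. purchases, controlling for income: {abs(r_controlled):.2f}")
```

In this constructed example the raw correlation is about 0.68, but it drops to essentially zero once income is held constant. Real data is rarely this tidy, and the partial correlation only removes the linear effect of variables you have actually measured.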

Correlation is the engine under the hood of many market-research methods. Knowing the proper uses and also the limitations of correlation will improve your ability to carry out market research and use it to make intelligent business decisions. In my next column, I’ll take the discussion of multivariate methods a step further and discuss some Web applications.
