Analyzing Unstructured Data

In this series, I’ve been looking at various techniques and methods to get value from visitor and customer data beyond what you might get from a typical Web analytics tool. Techniques we’ve looked at include:

What’s common about all these techniques, and indeed most data analysis, is that they work on structured data. That’s data that sits in rows and columns, files, or databases. We spend the most time on structured data in most analysis, yet most business data is unstructured. It doesn’t sit in files and databases; it sits in documents, presentations, e-mail, and the like, and it’s mainly composed of text rather than numbers. A commonly quoted estimate is that about 80 percent of all business data is unstructured.

So what insight can we get about customers and site visitors from this other type of data?

Almost by definition, analyzing unstructured data is difficult. Data analysis relies on the ability to detect patterns and relationships; that’s what we analysts try to do. It’s easier to detect these patterns and relationships when data is nicely structured and we can easily see that one cell value is bigger than another, for instance. It’s a bit harder when you’re faced with 200 inbound e-mail messages per day and you’re trying to get some sense of what they’re telling you. The key to analyzing this unstructured data is to provide some structure to it.

Market research companies have been doing this for years, to some extent. Market research surveys often contain open-ended questions. Rather than ask respondents to choose an answer from a list, you request an opinion or comment. “Is there anything you’d like to tell us about your experience on our Web site?” is one example of an open-ended question. You can get all sorts of answers, and you have to find a way of reducing the data into structured, meaningful chunks.

Traditionally, answers from open-ended questions (“verbatims” in trade jargon) are coded up into a codeframe by data coders. These people assign each verbatim to one or more predetermined categories; a single verbatim can be assigned to any number of categories. This is obviously a lengthy, time-consuming process, and it is somewhat prone to subjective assessment. It does, however, allow structured and unstructured data to be looked at side by side in reporting and analysis.
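To make the idea of a codeframe concrete, here is a minimal sketch of keyword-based coding in Python. The categories, keywords, and sample verbatims are all invented for illustration; a real codeframe would be built by people reading the responses, and simple keyword matching is only a crude stand-in for a human coder’s judgment.

```python
import re
from collections import Counter

# Hypothetical codeframe: category -> keywords that signal it (illustrative only).
CODEFRAME = {
    "navigation": {"menu", "find", "search", "navigate"},
    "checkout": {"checkout", "payment", "basket", "cart"},
    "speed": {"slow", "fast", "loading"},
}

def code_verbatim(text, codeframe=CODEFRAME):
    """Return the sorted list of categories whose keywords appear in the text.

    A verbatim can match any number of categories, including none.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(cat for cat, keywords in codeframe.items() if words & keywords)

# Made-up survey answers.
verbatims = [
    "The search never helps me find anything and the pages are slow.",
    "Checkout was easy.",
    "Nice colours!",
]

coded = [code_verbatim(v) for v in verbatims]

# Once coded, the verbatims can be counted and reported like any structured field.
category_counts = Counter(cat for cats in coded for cat in cats)
```

The first verbatim lands in two categories and the third in none, which is the kind of output that can then sit alongside structured survey answers in a report.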

These days there are more automated means of analyzing unstructured data through text-mining algorithms. Most big players in the statistical analysis and data-mining software market, such as SAS and SPSS, have text-mining components or modules. There are numerous other text analysis software tools out there. A good directory can be found at KDnuggets.

These tools scan documents to discover and extract underlying concepts and themes. The software mimics and automates what we do naturally when we read a document; that is, it extracts meaning. That meaning can then be related back to other data.
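Commercial text-mining tools use far more sophisticated linguistics than this, but a very rough sketch of the underlying idea is to count which terms recur across many documents. The example below is an assumption-laden toy, not how any particular product works: the stopword list, the sample messages, and the function name top_terms are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real tools use much larger ones.
STOPWORDS = {
    "the", "a", "an", "and", "is", "was", "to", "my", "i", "it",
    "of", "for", "on", "in", "not", "has", "again", "where",
}

def top_terms(documents, n=3):
    """Rank terms by the number of documents they appear in.

    Ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for doc in documents:
        terms = set(re.findall(r"[a-z]+", doc.lower())) - STOPWORDS
        doc_freq.update(terms)
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

# Made-up inbound e-mail messages.
emails = [
    "My delivery was late again.",
    "Delivery has not arrived and the refund is not processed.",
    "I was charged twice for my delivery.",
    "Where is my refund?",
]

themes = top_terms(emails, n=2)
```

Even this crude count surfaces "delivery" and "refund" as recurring themes, which gives the analyst a starting structure to impose on the messages.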

Data mining and text mining can be brought together, combining the analysis of both structured and unstructured data. Models that predict propensity for customers to churn are being enhanced through the analysis of call center notes or inbound e-mail. Customer segments can be better profiled and understood through the addition of insight from other data sources, such as surveys, e-mail, and contact forms.
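As a hedged illustration of bringing the two kinds of data together, the sketch below derives a simple text-based field from call center notes and attaches it to structured customer records, where it could then feed a churn model like any other variable. The customer fields, the notes, and the list of churn-related terms are all hypothetical.

```python
import re

# Hypothetical terms that might signal churn risk in free-text notes.
CHURN_TERMS = {"cancel", "cancelling", "switch", "competitor", "unhappy"}

# Structured data: one record per customer (invented values).
customers = [
    {"customer_id": 1, "monthly_spend": 40.0, "tenure_months": 26},
    {"customer_id": 2, "monthly_spend": 15.0, "tenure_months": 3},
]

# Unstructured data: free-text call center notes keyed by customer ID.
call_notes = {
    1: ["Asked about upgrade options."],
    2: ["Customer unhappy with price.", "Said they may switch to a competitor."],
}

def churn_mentions(notes):
    """Count the notes that contain at least one churn-related term."""
    count = 0
    for note in notes:
        words = set(re.findall(r"[a-z]+", note.lower()))
        if words & CHURN_TERMS:
            count += 1
    return count

# Add the text-derived field to each structured record.
enriched = [
    {**c, "churn_mentions": churn_mentions(call_notes.get(c["customer_id"], []))}
    for c in customers
]
```

The enriched records keep their original columns and gain one new numeric field, so existing modeling or segmentation work can use it without any change of approach.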

Let’s not forget all the unstructured data lying around in our businesses in e-mail systems, contact forms, call centers, and so on. Let’s see if we can extract additional customer insight from them. I’m very interested in understanding who’s doing what in this space, so if you have any good examples of how you’ve gotten great insights from unstructured data, or how you’ve used it alongside structured data sources, please let me know.
