SEM’s Hidden Science

When I was studying marketing at university, there was always a lively, ongoing debate about whether it was an art or a science. Eventually, the marketing industry adopted the idea of it being both.

And there’s a healthy serving of both under the marketing umbrella these days. With SEO, however, an extraordinarily rich and frequently complex mixture of scientific disciplines is hidden below the surface of the major search engines. It’s this science that I find so frequently misunderstood, misrepresented, or just plain ignored by many in the SEO community.

The science of information retrieval (IR) predates search engines by a very long time. It’s at the heart of search engine algorithms. It’s emerged as the third subject, along with logic and philosophy, that deals with relevance — a very elusive human notion.

In 1976, library scientist Tefko Saracevic traced the notion of relevance to problems of scientific communication. Relevance, he said, is considered a measure of the effectiveness of a contact between a communication’s source and destination. This perfectly sums up a search engine’s job for end users.

Classic IR models take no account of HTML code, dynamic information delivery, or barriers to crawling and indexing. These are, in the main, minor issues when a search engine builds its index (or, more accurately, its tiered index).

As many readers are aware, I’m noted for separating the reasonably straightforward SEO task of eliminating crawling barriers from the far more important issue of understanding ranking mechanisms. Without a decent rank, nobody’s ever going to find you, so there’s really not much point in being in a search engine index.

IR, including ranking algorithms, is a fascinating field. I’ve become ultra-absorbed. My interest and research in it are purely from an online marketer’s point of view, not as a researcher or scientist in the field.

I find it incredible how many people I meet at industry events simply don’t get the importance of understanding the real challenge of applying marketing communications to IR on the Web.

I have to prevent my jaw from dropping when people ask such extraneous questions as, “Can a search engine understand CSS code, Mike?” I’m dumbfounded by the number of times I hear people (often conference speakers) mention IR elements with more than mildly erroneous explanations. “Latent semantic indexing” is one term bandied around by all and sundry. Rarely do I hear it explained in its true context.

Latent semantic indexing (LSI) has been around for some time. Loosely described, it tackles the old IR problem of vocabulary diversity in human-computer interaction. Specifically, people use different words to describe the same object or concept (synonymy). At the same time, some words have more than one meaning (polysemy), and those meanings can be semantically very different.

At times, LSI can improve on the conventional vector space model. However, LSI’s run-time performance is a major concern to search engines wishing to provide results to end users in less than a second.
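To make the vocabulary problem concrete, here is a minimal sketch of the conventional vector space model: documents and queries become term-count vectors, and relevance is scored by the cosine of the angle between them. The three toy documents, the query, and the function names are my own assumptions for illustration. This is not how any particular search engine scores pages.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy collection (invented for illustration only).
docs = {
    "d1": "used car prices and car dealers",
    "d2": "automobile dealers and automobile prices",
    "d3": "jaguar habitat in the rain forest",
}

query = term_vector("car")
scores = {name: cosine(query, term_vector(text)) for name, text in docs.items()}

for name, score in sorted(scores.items()):
    print(name, round(score, 3))
```

Document d2 is about the same concept as the query, yet it scores exactly as low as the unrelated d3, because "automobile" and "car" share no characters as far as pure term matching is concerned. That gap is the one LSI sets out to narrow by working in a reduced "concept" space rather than on raw terms.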

With LSI, an inverted index isn’t possible, as the end user query is represented as just another document. It must, therefore, be compared with all other documents. And that would take a long time for every user query. It’s difficult to discuss the various methods used by search engines to index and rank documents without going into the science behind it, at whatever level.
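The cost argument above can be sketched in a few lines. With an inverted index, a query only touches the documents listed against its own terms. Once documents are represented as dense vectors in a reduced concept space, there are no such lists to jump to, so every document has to be scored. The documents, the two-dimensional "concept" coordinates, and the helper names below are all made up for illustration; real LSI coordinates would come from a matrix decomposition that this sketch deliberately leaves out.

```python
from collections import defaultdict

# Toy collection (invented for illustration only).
docs = {
    1: "cheap flights to paris",
    2: "paris hotel deals",
    3: "latent semantic indexing explained",
    4: "history of information retrieval",
}

# Inverted index: term -> set of ids of the documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def candidates_via_index(query):
    """Look up only the postings for the query's own terms."""
    hits = set()
    for term in query.split():
        hits |= inverted.get(term, set())
    return hits

# Hypothetical dense concept-space coordinates (made-up numbers standing in
# for what an LSI projection might produce). Every entry is non-zero.
concept_vectors = {1: (0.9, 0.1), 2: (0.8, 0.2), 3: (0.1, 0.9), 4: (0.2, 0.8)}

def score_dense(query_vector):
    """Score the query against every document; nothing can be skipped."""
    return {
        doc_id: sum(q * d for q, d in zip(query_vector, vec))
        for doc_id, vec in concept_vectors.items()
    }

index_hits = candidates_via_index("paris flights")
dense_scores = score_dense((0.85, 0.15))  # hypothetical query coordinates

print("documents touched via inverted index:", len(index_hits))
print("documents touched in concept space:", len(dense_scores))
```

On four documents the difference is trivial. On a Web-scale index, touching every document for every query is exactly the run-time worry described above.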

I’ve often overheard SEO experts talking to potential or existing clients at a conference using snippets of IR terminology, such as LSI, in some of the most out-of-context ways: “Yes, it’s a symptom called the ‘sandbox.’ It’s because Google uses latent semantic indexing. Now…”

What the notion of a sandbox could have to do with LSI is beyond me. Not to mention the fact that LSI isn’t a Google thing. It’s an IR thing. It belongs to the entire IR research community.

In my experience, having a general understanding of IR techniques and how they can be applied to commercial search engines (an entirely different proposition from the homogeneous collections they were originally conceived for) can save an awful lot of wasted effort and mind clutter in SEO.

Such understanding also lets you see through a lot of the BS that’s pitched at poor clients who are still scratching their heads trying to come to terms with the perceived technologically advanced concept of a meta tag.

There’s tons of information about IR models and techniques in the literature. Much of the classic information still stands up today. In the realm of document space, however, much more research continues.

I’m extremely fortunate that my friend Dr. Edel Garcia, who attended a recent workshop held by the applied mathematics community, was able to give me personal insight into the proceedings. He’s allowed me to publish his report and share it with those who would like a high-level overview of what researchers in the field are currently engaged in.
