Lies, Lies, and LSI

It’s been five years since I first referenced latent semantic indexing (LSI) and the work of Microsoft super scientist Susan Dumais in the first edition of my best practice guide to search marketing (or search engine positioning, as it was known then).

At the time, there was a whole lot of confusion and some very bad information floating around about the vector space model (developed by Dr. Gerard Salton) and exactly what term vectors are. A research paper entitled “The Term Vector Database: fast access to indexing terms for Web pages” only seemed to add more fuel to the fire. People in the rapidly developing SEO industry openly speculated as to how this new technology would challenge and affect optimization efforts.

As I often pointed out in forums and newsletters back then, term vector theory wasn’t new at all (it predates the Web by some considerable time). I also repeatedly referenced an interview I did with Brian Pinkerton, developer of WebCrawler, arguably the Web’s first full text retrieval search engine. Pinkerton explained to me that he had applied the vector space model to WebCrawler from the very beginning. And that was back in 1994.

Latent semantic indexing has also been around for a very long time. One of the first papers I read on the subject dates back to 1990.

Recently, I received a spam message which declared:


Google is coming up with Semantic web. Are you ranking well with this latest algorithm of search engines and will you continue to rank well?

Is you website LSI compliant?

Search Engines like Google (who are pallbearers for technology) are already reaching out for it by adopting LSI in their ranking algorithms.

We will check your website for its LSI algorithm readiness.

What a complete crock of you-know-what.

I read a newsletter promoting LSI tools and technology for your Web site. It even referred to the term vector database (which I doubt ever worked anyway!). Most of these so-called LSI tools and technology are nothing more than parlor tricks. Anyone can knock together a tool that takes a query and runs a thesaurus look-up on it.

Should you lose any sleep over LSI?

I asked my buddy and SEO expert Rand Fishkin of the popular SEOmoz resource for his thoughts. I referenced Dr. Edel Garcia’s recent tutorials on LSI and SVD (which he was already aware of) and basically I asked:

Should SEOs care about LSI anyway? Should we lose sleep over it?

If we should care about it, how would we go about optimizing for it?

In the first case he said:

“Care about it, absolutely. Lose sleep over it, almost certainly no. LSI is a method for determining semantic relationships and in all honesty, while I do believe it’s critical for an SEO to be informed enough to explain the concept to a client, I don’t see a lot of practical use. With the advancement in search engine algorithms over the last 2-3 years (particularly at Google & Yahoo!), SEO has shifted away from manipulating language use and placement to building a savvy marketing campaign.”

And to the second question, he said:

“I believe that one of Dr. Garcia’s primary points when examining the math behind LSI is that without access to accurate data about the search engines’ indices and the use of language therein, we’re shooting in the dark to a certain degree. He’s laid out a process in his articles on the subject that will allow for rough calculations to uncover potentially more valuable combinations of words and phrases for optimizing text for search engines. However, as Dr. Garcia notes:

‘These days we know that most current LSI models are not based on mere local weights, but on models that incorporate local, global and document normalization weights. Others incorporate entropy weights and link weights.’

I’m inclined to believe the value we get out of “local” weight calculations for terms in a document provide only the most minimal value to SEOs.

However, this could be very useful to spammers writing programs to auto-generate text designed to pull in long tail searches and serve contextual ads – even a slight improvement in 50 million documents could turn to big $$ for that crowd.”
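To make the quoted distinction between local, global and normalization weights a little more concrete, here is a minimal sketch. It assumes one common textbook scheme (raw term count as the local weight, log inverse document frequency as the global weight, and division by vector length as the normalization). The function name `weighted_vectors` and the three toy documents are hypothetical, and nothing here is meant to represent any search engine's actual formula.

```python
import math
from collections import Counter

def weighted_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed scheme (illustrative only):
      local weight  = raw count of the term in the document
      global weight = log(N / df), df = number of documents containing the term
      normalization = divide each weight by the vector's Euclidean length
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / length if length else 0.0) for t, w in raw.items()})
    return vectors

# Hypothetical toy collection
docs = [
    "cheap flights cheap hotels",
    "cheap flights to paris",
    "paris hotels and museums",
]
vectors = weighted_vectors(docs)
```

Repeating "cheap" in the first document only changes its local weight. The global weight comes from the whole collection, which is exactly the data we don't have for a real search engine index.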

I asked Dr. Garcia for his own thoughts.

“Many SEOs are misquoting old papers and the focus of that old research. Many of these SEO “experts” don’t even know how to do basic SVD decomposition, nor do they understand the how-to steps involved in computing LSI scores. In the process they have stretched such research findings and added a few of their own myths in order to market better whatever they sell. For instance, today one can see some suggesting that to have documents “LSI friendly” one needs to stuff content with synonyms or related terms. This perception is incorrect.”
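For anyone curious what the "how-to steps" actually look like, here is a rough sketch of the textbook LSI procedure: build a term-document matrix, take its SVD, keep the top k dimensions, fold the query into that reduced space and compare it with each document by cosine similarity. It assumes NumPy is available. The tiny matrix, the choice of k = 2 and the helper names `fold_in` and `cosine` are my own illustrative assumptions, not taken from Dr. Garcia's tutorials or from any search engine.

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are documents.
terms = ["gold", "silver", "truck", "shipment"]
A = np.array([
    [1.0, 0.0, 1.0],   # gold
    [0.0, 2.0, 0.0],   # silver
    [0.0, 1.0, 1.0],   # truck
    [1.0, 0.0, 1.0],   # shipment
])

# Step 1: singular value decomposition, A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 2: keep only the k largest singular values (k = 2 is an arbitrary choice)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Step 3: each document's coordinates in the reduced space (rows of V_k)
doc_coords = Vt_k.T

def fold_in(term_vector):
    """Project a term-space vector into the reduced space: q^T U_k S_k^-1."""
    return term_vector @ U_k / s_k

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 4: score a query ("gold truck") against every document
query = np.array([1.0, 0.0, 1.0, 0.0])
q_k = fold_in(query)
scores = [cosine(q_k, d) for d in doc_coords]
```

Every score depends on the whole matrix, not on any single document in isolation, which is one reason a checklist for making one page "LSI friendly" doesn't follow from the math.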

So if your SEO vendor is throwing terms such as LSI at you, you should really get them to qualify what they actually know about the subject.

Take a look at Dr. Garcia’s fast-track paper (download PDF) yourself. Even if you don’t grasp any of the math and have only half a clue what it’s all about, don’t worry. After reading it, you may still not understand exactly what LSI is or what it does, but Garcia certainly emphasizes what it isn’t. And that little bit of knowledge will help you dispel any BS thrown at you by snake oil SEOs.
