What Keeps Search Engines Awake at Night?

In the SEO industry, we tend to talk about search engine algorithms in much the same way as the Coca-Cola formula: no one really knows what goes into it. The recipe is locked away in Fort Knox, along with England’s crown jewels and tapes of unreleased Beatles material.

OK, that may be a little far-fetched. But it’s not too far from the comments made online about search engine algorithms. Although there’s a huge amount of nonsense and speculation published about them, we actually know a lot about their underlying principles.

At one of my sessions at Search Engine Strategies San Jose, an audience member explained to the panel that Google had indexed all his pages except the home page. What could the problem be? We sat there scratching our heads, chewing on the ends of pencils, and looking up with furrowed brows.

The mystery stuck with me as I headed home, drifting off to sleep somewhere over the Atlantic. It got me thinking about the billions of documents search engines index and the amount of data they must process. And there we were on that panel, trying to figure out what had happened to a single page.

Let’s briefly look at some of search engines’ main algorithmic challenges. It may not explain what happened to the missing page, but it might help highlight how the odd home page, even an entire Web site, may slip through a crawler’s net from time to time.

More Efficient Crawling

One point I try to make clear in my sessions is the difference between search engine marketers’ and search engines’ view of the Web. For search engine marketers, the Web is a bunch of interlinked pages they try to get surfers to notice so they can sell stuff.

For search engines, the Web is a graph. To a crawler, each Web page is a node, and each hyperlink a directed edge. The crawler must decide which pages are the most suitable to crawl. If the engines had a better understanding of the Web graph, they could find more efficient ways of crawling the Web. Then, fewer pages would be missed or underrepresented.
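As a minimal sketch of that graph view, here is a toy breadth-first crawler over a handful of made-up pages (the page names and link structure are invented for illustration; real crawlers layer prioritization, politeness, and scheduling on top of something like this):

```python
from collections import deque

# A toy web graph: each page (node) maps to the pages it links to (directed edges).
# Page names are hypothetical.
web_graph = {
    "a.com/": ["a.com/about", "b.com/"],
    "a.com/about": ["a.com/"],
    "b.com/": ["a.com/", "b.com/news"],
    "b.com/news": [],
}

def crawl(seed):
    """Breadth-first crawl: visit each page reachable from the seed exactly once."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in web_graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("a.com/"))
```

Note what the sketch already hints at: a page no crawled page links to simply never enters the frontier, which is one way a lone page can slip through the net.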

Understanding Web properties and developing a technique to uniformly sample pages would certainly provide answers to many complex issues related to more efficient crawling. Endless research has gone into looking for a solution, yet no such technique is known, as far as I’m aware.

Link building is an important aspect of SEO. Looked at mathematically, as search engines see them, links appear random. Modeling the Web as a random graph, then, seems obvious. But the Web graph has a dominant property that’s hard to model: it’s largely a two-level structure. Each Web page belongs to a host, and most hyperlinks connect pages on the same host.
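That two-level property is easy to measure on a link sample. A small sketch, using an invented list of (source, target) links, that computes the fraction of hyperlinks staying within one host:

```python
from urllib.parse import urlsplit

# Hypothetical (source, target) link pairs; hosts are made up for illustration.
links = [
    ("http://a.com/1", "http://a.com/2"),
    ("http://a.com/2", "http://a.com/1"),
    ("http://a.com/1", "http://b.com/"),
    ("http://b.com/", "http://b.com/news"),
]

def intra_host_fraction(edges):
    """Share of links whose source and target live on the same host."""
    same = sum(1 for src, dst in edges
               if urlsplit(src).netloc == urlsplit(dst).netloc)
    return same / len(edges)

print(intra_host_fraction(links))  # 0.75 on this toy sample
```

A random-graph model that ignores hosts would put far more of its edges between hosts than a measurement like this finds on the real Web.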

The problem, then, is creating a random graph model that simulates the Web graph’s behavior on both the page and host level. Again, this is an issue search engines have yet to solve.

Duplicate Pages

Then there are the duplicate and near-duplicate pages a search engine must weed out. Duplicate content contributes no new information and simply annoys end users if they keep getting the same old thing time after time. This problem has been tackled a number of different ways, yet there’s no complete solution. Either some dupes or spammy pages remain, or some innocent pages suffer collateral damage.
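One well-known family of techniques compares pages by their overlapping word sequences ("shingles") rather than exact text. A minimal sketch, with invented example strings:

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word sequences) from a page's text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Set overlap: 1.0 for identical shingle sets, 0.0 for disjoint ones."""
    return len(a & b) / len(a | b)

page1 = "the quick brown fox jumps over the lazy dog"
page2 = "the quick brown fox jumps over a lazy dog"   # near-duplicate
page3 = "an entirely different document about search engines"

print(jaccard(shingles(page1), shingles(page2)))  # 0.4 — shares the opening shingles
print(jaccard(shingles(page1), shingles(page3)))  # 0.0 — nothing in common
```

The hard part isn't the comparison itself but doing it at scale, and picking a similarity threshold: set it too loose and dupes survive, too tight and innocent pages get caught.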

Temporal Effects

Researchers in the field use “top gainers” and “top losers” to describe queries with the largest increase or decrease from one time period to the next. This data stream provides a great deal of information about trends and user interests. However, there are temporal effects the engine must detect to identify which queries are asked most often over time and which pages should be presented to end users on a SERP. The answer to the query “most popular movie of all time” changes as one film overtakes the current most-popular movie.
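The gainers/losers idea reduces to comparing query counts across two periods. A sketch with hypothetical weekly query counts:

```python
# Hypothetical query counts for two consecutive weeks.
last_week = {"movie tickets": 900, "flu symptoms": 50, "tax forms": 400}
this_week = {"movie tickets": 950, "flu symptoms": 800, "tax forms": 100}

def top_movers(before, after, n=1):
    """Return the n queries with the largest rise and the n with the largest drop."""
    deltas = {q: after.get(q, 0) - before.get(q, 0)
              for q in set(before) | set(after)}
    ranked = sorted(deltas, key=deltas.get, reverse=True)
    return ranked[:n], ranked[-n:]

gainers, losers = top_movers(last_week, this_week)
print(gainers, losers)  # ['flu symptoms'] ['tax forms']
```

A real system would work over far noisier streams, but the shape of the computation is the same: difference two time windows and rank.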


And let’s not forget our old friend, PageRank. The stationary distribution of a random walk on the Web graph assigns relative ranks to pages. However, emerging Web features and the calculation’s global nature present a huge computational challenge as the Web continues to grow.

Throw in the growing number of “dangling pages” at the crawl frontier (pages the search engine knows about via link information but has yet to crawl and assign a PageRank), add a bit of link rot (old pages falling out of maintenance), and you have a whole chapter to cover.
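The random-walk idea can be sketched as power iteration on a made-up three-page graph. This is a toy, not how any engine actually computes it at scale; dangling pages (no outlinks) are handled here by spreading their rank evenly, one common convention:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration toward the random walk's stationary distribution."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank mass held by dangling pages (no outlinks), redistributed evenly.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new = {}
        for p in pages:
            incoming = sum(rank[q] / len(graph[q])
                           for q in pages if p in graph[q])
            new[p] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new
    return rank

# Hypothetical mini-site: "news" is a dangling page.
toy = {"home": ["about", "news"], "about": ["home"], "news": []}
print(pagerank(toy))
```

Even on three pages the calculation is global: every page's score depends on every other's. Now scale that to billions of nodes, with new pages and dead links arriving constantly, and the computational challenge is plain.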

I still don’t have an answer to the missing index page. It’s either as plain as the nose on my face, or it’s mixed in somewhere with the hugely complex issues mentioned above.

One thing I do know: while search engines lie awake at night trying to come up with solutions to the problems I touched on here (and the many others that go with them), there’ll always be occasions where the answer is quite simply, “I don’t know.”

Want more search information? ClickZ SEM Archives contain all our search columns, organized by topic.
