Is a Bigger Search Index Better for Relevancy?

Last month, I was part of a group Tim Mayer invited to the Yahoo campus on the day he announced Yahoo’s index had reached over 20 billion items. I wondered how long it would be before Google disputed the claim, and the size wars erupted again. It didn’t take long.

My colleague Danny Sullivan was immediately on the case with a column covering the subject. That should’ve been the end, this time around.

Except that many industry people then saw a report linked to the National Center for Supercomputing Applications (NCSA), which added more fuel to the fire. (Interestingly, a preface distances NCSA from the work.) The survey itself was much disputed because of the way it was carried out. Many people said it was hardly scientific.

Then, last week I was contacted by “The Wall Street Journal,” regarding a follow-up to an article about the survey. I was happy to pass on my opinions.

Since then, it’s been playing on my mind. So I updated my own research on the subject.

Search engines use index size mainly as PR fodder. The average surfer has little idea how to evaluate a search engine’s pure worth. He’s happy as long as he gets something satisfactory in the results page. So the “mine’s bigger than yours” PR is probably one of the few things that may cause the surfer to switch engines. He might think the search engine with the biggest index has the best results.

I talked with the new head of Yahoo Research Labs, Dr. Prabhakar Raghavan. Raghavan is a hugely respected scientist and a pioneer in search and text mining. He worked directly with Professor Jon Kleinberg (inventor of the “hubs and authorities” algorithm for analyzing linkage data) on IBM’s CLEVER project. First, he commented on the “questionable” survey.

“The NCSA study is known to be extremely unscientific,” said Raghavan. “For years, I’ve been teaching far more sophisticated estimation methods, and even those have flaws. And besides statistical errors, there are lots of architectural reasons why, sitting outside an engine, you cannot tell what’s inside.”

We discussed at length the multitiered index architecture the major search engines have adopted. Although searchers simply talk about “the index,” the real thing is a lot more complicated: it’s layered in tiers of priority.

The type of query, common or more obscure, dictates how deep into the index tiers a search engine will look for relevant documents. For fairly broad queries, the engines can search just the top tier or two. That explains a familiar puzzle. Let’s say I search for a document I know is in the index, and the search engine can’t locate it with a direct search. When I use a site scope search (a search for all of the pages within a specific domain), I find it. The document was there all along, presumably in a tier the direct query never reached.
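To make the tiering idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any real engine works: the `TIERS` data, the `search` function, and the `min_hits` stopping rule are all hypothetical. The assumption is that the engine walks the tiers from highest priority down and stops as soon as it has enough hits, so a broad query never reaches the lower tiers.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical index tiers, highest priority first.
# Each tier maps a term to the IDs of documents containing it.
TIERS: List[Dict[str, Set[str]]] = [
    {"search": {"doc1", "doc2"}, "engine": {"doc1"}},
    {"search": {"doc3"}, "index": {"doc3", "doc4"}},
    {"esoteric": {"doc5"}, "index": {"doc5"}},
]


def search(query: str, min_hits: int = 2) -> Tuple[Set[str], int]:
    """Return (matching doc IDs, number of tiers searched).

    Assumed heuristic: stop descending once `min_hits` documents are found.
    """
    terms = query.lower().split()
    hits: Set[str] = set()
    tiers_searched = 0
    if not terms:
        return hits, tiers_searched
    for tier in TIERS:
        tiers_searched += 1
        # A document matches only if it contains every query term.
        postings = [tier.get(term, set()) for term in terms]
        hits |= set.intersection(*postings)
        if len(hits) >= min_hits:
            break  # enough results; deeper tiers are never consulted
    return hits, tiers_searched
```

With this toy data, `search("search")` stops after the first tier and never sees `doc3`, although `doc3` is in the index. Forcing a deeper look with `search("search", min_hits=10)` finds it, much as the site scope search does in the example above. An obscure query such as `search("esoteric")` has to go all the way down to the last tier.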

There’s also the data center’s location and where the query issues from. The same query issued from different geographical territories (all to the same search engine domain) often returns vastly differing results.

Raghavan said, “The query-parsing algorithm uses a ton of heuristics to govern how deep into an index a search engine goes.”

We also need to take into account many other aspects of a search engine index and the methods used to gauge what’s in there and how big the index is. Search engines using synonym expansion or semantic expansion of queries will offer different results, both in content and quantity of results.

In 1998, Dr. Andrei Broder, former head scientist at AltaVista (now at IBM), along with Dr. Krishna Bharat (now at Google), developed a technique for measuring public Web search engines’ relative size and overlap. As far as the current survey dispute goes, Broder believes the study authors did not fully “get it” (although they admit to reading his paper), and neither have similar studies.
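The core idea behind estimating relative size from overlap can be sketched in a few lines of Python. This is a simplified illustration under strong assumptions, not the published procedure: it samples pages uniformly from two toy sets, which nobody sitting outside a real engine can do. The function names and the example numbers are invented for the sketch. If a fraction of A’s sample turns up in B, and a fraction of B’s sample turns up in A, both fractions describe the same overlap, so size(A)/size(B) is roughly the second fraction divided by the first.

```python
import random
from typing import Set


def fraction_contained(source: Set[int], other: Set[int],
                       n: int, rng: random.Random) -> float:
    """Sample n pages from `source`; return the share also found in `other`."""
    sample = rng.sample(sorted(source), n)  # sample() needs a sequence, not a set
    return sum(1 for page in sample if page in other) / n


def relative_size(index_a: Set[int], index_b: Set[int],
                  n: int, seed: int = 0) -> float:
    """Estimate size(A) / size(B) from sampled overlap.

    Reasoning: |A and B| = f_ab * |A| = f_ba * |B|, so |A| / |B| = f_ba / f_ab.
    """
    rng = random.Random(seed)
    f_ab = fraction_contained(index_a, index_b, n, rng)  # share of A found in B
    f_ba = fraction_contained(index_b, index_a, n, rng)  # share of B found in A
    if f_ab == 0:
        raise ValueError("no overlap in the sample; the ratio cannot be estimated")
    return f_ba / f_ab


# Toy data (hypothetical): A holds 6,000 pages, B holds 3,000, and 2,000 are shared.
index_a = set(range(0, 6000))
index_b = set(range(4000, 7000))
estimate = relative_size(index_a, index_b, n=1000)
print(f"Estimated size(A) / size(B): {estimate:.2f}")  # true ratio is 2.0
```

The estimate lands near the true ratio here only because the samples are uniform. Skew the sampling toward one kind of page and the answer drifts, which is exactly the kind of flaw the critics of the disputed survey point to.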

Broder has a very strong background in statistics and was keen to point out that “there’s a big difference between collecting numbers and doing statistical analysis.

“Let’s assume you want to know what the most popular color for cars is for the average American driver. One way to do that would be to go stand on a corner, count cars, and make a note of the colors,” he continued. “If you stand on a corner in Manhattan doing that, you’ll discover that most Americans seem to prefer yellow!”

Regardless of the study’s methodology, he believes bigger can be better. Popular queries, which are scoped to the top two tiers, are easier to answer. But very obscure or esoteric queries are much harder to answer unless you’ve crawled the Web extensively and have a very large corpus. Of course, bringing in more content means bringing in more garbage, too.

Ask Jeeves told me the commercial queries search marketers are interested in make up the smaller share of the queries a search engine handles: only 20 percent.

As search marketers, should we care about those esoteric, less popular queries? Yes. I presented a workshop in Washington, DC, for a religious organization. The organization has a Web site that reaches out to all denominations. These are religious communicators with a message. Bringing that message to search engines means being aware of some very obscure queries.

But just because it’s a nonprofit doesn’t mean it doesn’t have to market itself online the same way a for-profit organization does. For that group, a wide-reaching, deep index is a good thing for a search engine to have.

Is bigger better? Yes and no. It depends on who’s counting and why.

