Is a Bigger Search Index Better for Relevancy?

The answer depends on who's counting and why.

Author

Mike Grehan

Date published September 19, 2005

Last month, I was part of a group Tim Mayer invited to the Yahoo campus on the day he announced Yahoo’s index had reached over 20 billion items. I wondered how long it would be before Google disputed the claim and the size wars erupted again. It didn’t take long.

My colleague Danny Sullivan was immediately on the case with a column covering the subject. That should’ve been the end, this time around.

Except that many industry people then viewed a report linked to the National Center for Supercomputing Applications (NCSA), which added more fuel to the fire. (Interestingly, a preface distances NCSA from the work.) The survey itself was much disputed because of the way it was carried out. Many people said it was hardly scientific.

Then, last week I was contacted by “The Wall Street Journal,” regarding a follow-up to an article about the survey. I was happy to pass on my opinions.

Since then, it’s been playing on my mind. So I updated my own research on the subject.

Search engines use index size mainly as PR fodder. The average surfer has little idea how to evaluate a search engine’s pure worth. He’s happy as long as he gets something satisfactory in the results page. So the “mine’s bigger than yours” PR is probably one of the few things that may cause the surfer to switch engines. He might think the search engine with the biggest index has the best results.

I talked with the new head of Yahoo Research Labs, Dr. Prabhakar Raghavan. Raghavan is a hugely respected scientist and a pioneer in search and text mining. He worked directly with Professor Jon Kleinberg (inventor of the “hubs and authorities” algorithm for analyzing linkage data) on IBM’s CLEVER project. First, he commented on the “questionable” survey.

“The NCSA study is known to be extremely unscientific,” said Raghavan. “For years, I’ve been teaching far more sophisticated estimation methods, and even those have flaws. And besides statistical errors, there are lots of architectural reasons why, sitting outside an engine, you cannot tell what’s inside.”

We discussed at length the nature of multitiered index architecture, which has been adopted by the major search engines. Although searchers simply use the word index, the real thing is a lot more complicated: it’s layered in tiers of priority.

The type of query, common or obscure, dictates how deep into the index tiers a search engine will look for relevant documents. For fairly broad queries, the engines can search just the top tier or two. More specific queries reach deeper. That helps explain a familiar puzzle: I search for a document I know is in the index, and the search engine can’t locate it with a direct search. Yet when I use a site scope search (a search for all of the pages within a specific domain), I find it.
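The tier-depth behavior described above can be sketched as a toy model. This is only an illustration, not any engine's actual design: the three-tier layout, the `choose_depth` heuristic (longer queries are treated as more specific), and all document names are assumptions invented for the example.

```python
from typing import Dict, List, Set

# Hypothetical tiered index: tier 0 holds the highest-priority documents.
# Each tier maps a term to the set of documents containing it.
TIERS: List[Dict[str, Set[str]]] = [
    {"yellow": {"a"}, "cars": {"a", "b"}},
    {"yellow": {"c"}, "cars": {"c"}},
    {"vintage": {"d"}, "yellow": {"d"}, "cars": {"d"}},
]


def choose_depth(terms: List[str], num_tiers: int) -> int:
    """Assumed heuristic: the more terms, the more specific the query."""
    if len(terms) <= 1:
        return 1                      # broad query: top tier only
    if len(terms) == 2:
        return min(2, num_tiers)      # somewhat specific: top two tiers
    return num_tiers                  # very specific: search every tier


def search(query: str, tiers: List[Dict[str, Set[str]]]) -> List[str]:
    """Return documents matching all query terms in the tiers searched."""
    terms = query.lower().split()
    depth = choose_depth(terms, len(tiers))
    hits: Set[str] = set()
    for tier in tiers[:depth]:
        postings = [tier.get(term, set()) for term in terms]
        if postings:
            hits |= set.intersection(*postings)
    return sorted(hits)


print(search("cars", TIERS))                  # broad: top tier only
print(search("vintage", TIERS))               # broad: misses document "d"
print(search("vintage yellow cars", TIERS))   # specific: reaches tier 2
```

In this sketch, document "d" sits in the deepest tier, so the one-word query "vintage" never reaches it even though it is indexed, while the three-word query does. An outside observer counting results would undercount the index for the same reason.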

There’s also the data center’s location and where the query issues from. The same query issued from different geographical territories (all to the same search engine domain) often returns vastly differing results.

Raghavan said, “The query-parsing algorithm uses a ton of heuristics to govern how deep into an index a search engine goes.”

We also need to take into account many other aspects of a search engine index and the methods used to gauge what’s in there and how big the index is. Search engines that use synonym expansion or semantic expansion of queries will return results that differ in both content and quantity.

In 1998, Dr. Andrei Broder, former head scientist at AltaVista (now at IBM), along with Dr. Krishna Bharat (now at Google), developed a technique for measuring public Web search engines’ relative size and overlap. As far as the current survey dispute goes, Broder believes the study authors did not fully “get it” (although they admit to reading his paper), and neither did the authors of similar studies.
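The general idea behind estimating relative size from overlap can be sketched as follows. This is a simplified illustration under stated assumptions, not the published procedure: it assumes we can draw a uniform random sample of pages from each engine and check membership in the other, which is exactly the part that is hard to do from outside. The function names and toy indexes are hypothetical.

```python
import random


def overlap_fraction(sample, other_index):
    """Fraction of pages sampled from one engine that the other also holds."""
    if not sample:
        raise ValueError("sample must not be empty")
    return sum(1 for url in sample if url in other_index) / len(sample)


def relative_size(sample_a, index_a, sample_b, index_b):
    """Estimate |A| / |B| from two overlap fractions.

    The shared pages satisfy
        |A and B| ~= frac_a_in_b * |A| ~= frac_b_in_a * |B|,
    so |A| / |B| ~= frac_b_in_a / frac_a_in_b.
    """
    frac_a_in_b = overlap_fraction(sample_a, index_b)
    frac_b_in_a = overlap_fraction(sample_b, index_a)
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; the ratio is undefined")
    return frac_b_in_a / frac_a_in_b


# Toy data (assumed): engine A holds 1,000 pages, engine B holds 500 of them.
rng = random.Random(42)
index_a = {f"page{i}" for i in range(1000)}
index_b = {f"page{i}" for i in range(500, 1000)}

sample_a = rng.sample(sorted(index_a), 200)
sample_b = rng.sample(sorted(index_b), 200)

ratio = relative_size(sample_a, index_a, sample_b, index_b)
print(f"Engine A looks roughly {ratio:.1f} times the size of engine B")
```

The estimate is only as good as the samples. If the samples are skewed, for example toward popular pages that every engine holds, both fractions drift upward and the ratio tells you little.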

Broder has a very strong background in statistics and was keen to point out that “there’s a big difference between collecting numbers and doing statistical analysis.

“Let’s assume you want to know what the most popular color for cars is for the average American driver. One way to do that would be to go stand on a corner, count cars, and make a note of the colors,” he continued. “If you stand on a corner in Manhattan doing that, you’ll discover that most Americans seem to prefer yellow!”

Regardless of the study’s methodology, he believes bigger can be better. More popular queries, scoped at the top two tiers, are easier to answer. But answers to very obscure or esoteric queries are much harder to return unless you’ve crawled the Web extensively and have a very large corpus. Of course, bringing in more content means bringing in more garbage, too.

Ask Jeeves told me the commercial queries search marketers are interested in make up the smaller share of queries a search engine handles: only 20 percent.

As search marketers, should we care about those esoteric, less popular queries? Yes. I presented a workshop in Washington, DC, for a religious organization. The organization has a Web site that reaches out to all denominations. These are religious communicators with a message. Bringing that message to search engines means being aware of some very obscure queries.

Being a nonprofit doesn’t exempt it from marketing itself online the same way a for-profit organization does. For that group, a wide-reaching, deep index is a good thing for a search engine to have.

Is bigger better? Yes and no. It depends on who’s counting and why.
