The answer depends on who's counting and why.
Last month, I was part of a group Tim Mayer invited to the Yahoo campus on the day he announced Yahoo's index had reached over 20 billion items. I wondered how long it would be before Google disputed the claim, and the size wars erupted again. It didn't take long.
My colleague Danny Sullivan was immediately on the case with a column covering the subject. That should've been the end, this time around.
Except that many industry people then read a report linked to the National Center for Supercomputing Applications (NCSA), which added more fuel to the fire. (Interestingly, a preface distances NCSA from the work.) The survey itself was much disputed because of the way it was carried out. Many people said it was hardly scientific.
Then, last week, I was contacted by "The Wall Street Journal" regarding a follow-up to an article about the survey. I was happy to pass on my opinions.
Since then, it's been playing on my mind. So I updated my own research on the subject.
Search engines use index size mainly as PR fodder. The average surfer has little idea how to evaluate a search engine's pure worth. He's happy as long as he gets something satisfactory in the results page. So the "mine's bigger than yours" PR is probably one of the few things that may cause the surfer to switch engines. He might think the search engine with the biggest index has the best results.
I talked with the new head of Yahoo Research Labs, Dr. Prabhakar Raghavan. Raghavan is a hugely respected scientist and a pioneer in search and text mining. He worked directly with Professor Jon Kleinberg (inventor of the "hubs and authorities" algorithm for analyzing linkage data) on IBM's CLEVER project. First, he commented on the "questionable" survey.
"The NCSA study is known to be extremely unscientific," said Raghavan. "For years, I've been teaching far more sophisticated estimation methods, and even those have flaws. And besides statistical errors, there are lots of architectural reasons why, sitting outside an engine, you cannot tell what's inside."
We discussed at length the nature of multitiered index architecture, which has been adopted by the major search engines. Although searchers simply use the word index, the tiered index is a lot more complicated. It's actually layered in tiers of priority.
The type of query, whether common or more obscure, dictates how deep into the index tiers a search engine will look for relevant documents. For fairly broad queries, the engines can search just the top tier or two. Let's say I search for a document I know is in the index, and the search engine can't locate it with a direct search. When I use a site scope search (a search for all of the pages within a specific domain), I find it. The document was there all along; how deep the engine reaches into the tiers depends on a query's specificity.
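To make the tiered lookup concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular engine works: the tier contents, the `search` function, and the `enough` threshold are all hypothetical, and a real engine would use many more heuristics to decide when to stop.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical tiers of a toy index. Tier 1 (first in the list) holds the
# highest-priority documents; each tier maps a term to matching document IDs.
TIERS: List[Dict[str, Set[str]]] = [
    {"yahoo": {"doc1", "doc2"}, "search": {"doc1"}},
    {"search": {"doc3"}, "index": {"doc3", "doc4"}},
    {"esoteric": {"doc5"}, "index": {"doc6"}},
]


def search(terms: List[str],
           tiers: List[Dict[str, Set[str]]] = TIERS,
           enough: int = 2) -> Tuple[List[str], int]:
    """Walk the tiers top-down and stop once `enough` documents are found.

    Returns the matching document IDs and the number of tiers consulted.
    The stopping rule is an assumed stand-in for an engine's heuristics.
    """
    results: List[str] = []
    depth = 0
    for depth, tier in enumerate(tiers, start=1):
        postings = [tier.get(term, set()) for term in terms]
        # A document matches only if it contains every query term.
        matches = set.intersection(*postings) if postings else set()
        for doc in sorted(matches):
            if doc not in results:
                results.append(doc)
        if len(results) >= enough:
            break  # a broad query is satisfied early and never goes deeper
    return results, depth


if __name__ == "__main__":
    print(search(["search"], enough=1))  # satisfied by the top tier
    print(search(["esoteric"]))          # has to reach the deepest tier
```

In this toy setup a broad query is answered from the top tier, while an obscure one forces the lookup all the way down. An outside observer sees only the results, not how many tiers were consulted.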
There's also the data center's location and where the query issues from. The same query issued from different geographical territories (all to the same search engine domain) often returns vastly differing results.
Raghavan said, "The query-parsing algorithm uses a ton of heuristics to govern how deep into an index a search engine goes."
We also need to take into account many other aspects of a search engine index and the methods used to gauge what's in there and how big the index is. Search engines using synonym expansion or semantic expansion of queries will offer different results, both in content and quantity of results.
In 1998, Dr. Andrei Broder, former head scientist at AltaVista (now at IBM), along with Dr. Krishna Bharat (now at Google), developed a technique for measuring public Web search engines' relative size and overlap. As for the current survey dispute, Broder believes the study's authors didn't fully "get it" (although they admit to reading his paper), and neither have the authors of similar studies.
Broder has a very strong background in statistics and was keen to point out that "there's a big difference between collecting numbers and doing statistical analysis.
"Let's assume you want to know what the most popular color for cars is for the average American driver. One way to do that would be to go stand on a corner, count cars, and make a note of the colors," he continued. "If you stand on a corner in Manhattan doing that, you'll discover that most Americans seem to prefer yellow!"
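The basic idea behind estimating relative size from overlap can be sketched in a few lines of Python. This is a simplified illustration, not the procedure from the 1998 paper: it assumes we can draw uniform random samples straight from each index, which is exactly what someone sitting outside an engine cannot do, and which is where the street-corner problem comes in. The function name and the toy indexes are hypothetical.

```python
import random
from typing import Set


def estimate_size_ratio(index_a: Set[int], index_b: Set[int],
                        sample_size: int, rng: random.Random) -> float:
    """Estimate |A| / |B| from the overlap seen in two random samples.

    The fraction of A's pages found in B approximates |A and B| / |A|;
    the fraction of B's pages found in A approximates |A and B| / |B|.
    Dividing the second by the first leaves |A| / |B|.
    """
    # Assumption: uniform sampling directly from each index.
    sample_a = rng.sample(sorted(index_a), sample_size)
    sample_b = rng.sample(sorted(index_b), sample_size)

    frac_a_in_b = sum(page in index_b for page in sample_a) / sample_size
    frac_b_in_a = sum(page in index_a for page in sample_b) / sample_size

    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; the ratio cannot be estimated")
    return frac_b_in_a / frac_a_in_b


if __name__ == "__main__":
    # Toy indexes: A holds 20,000 pages, B holds 10,000, all of them in A.
    index_a = set(range(20_000))
    index_b = set(range(10_000, 20_000))
    ratio = estimate_size_ratio(index_a, index_b, 2_000, random.Random(42))
    print(f"Estimated |A| / |B|: {ratio:.2f}")  # true ratio is 2
```

Note that the method yields only a ratio, never an absolute page count, and the estimate is only as good as the samples. If the samples favor popular pages, the result is skewed in the same way the Manhattan car count is.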
Regardless of the study's methodology, he believes bigger can be better. More popular queries scoped at the two top tiers will be easier to return. But answers to very obscure or esoteric queries are much harder to return unless you've crawled the Web extensively and have a very large corpus. Of course, bringing in more content means bringing in more garbage, too.
Ask Jeeves told me the commercial queries search marketers are interested in make up only a small share of the queries a search engine handles: just 20 percent.
As search marketers, should we care about those esoteric, less popular queries? Yes. I presented a workshop in Washington, DC, for a religious organization. The organization has a Web site that reaches out to all denominations. These are religious communicators with a message. Bringing that message to search engines means being aware of some very obscure queries.
But just because it's a nonprofit doesn't mean it can skip marketing itself online the way a for-profit organization does. For that group, a wide-reaching, deep index is a good thing for a search engine to have.
Is bigger better? Yes and no. It depends on who's counting and why.
Mike Grehan is currently chief marketing officer and managing director at Acronym, where he is responsible for directing thought leadership programs and cross-platform marketing initiatives, as well as developing new, innovative content marketing campaigns.
Prior to joining Acronym, Grehan was group publishing director at Incisive Media, publisher of Search Engine Watch and ClickZ, and producer of the SES international conference series. Previously, he worked as a search marketing consultant with a number of international agencies handling global clients such as SAP and Motorola. Recognized as a leading search marketing expert, Grehan came online in 1995 and is the author of numerous books and white papers on the subject and is currently in the process of writing his new book From Search to Social: Marketing to the Connected Consumer to be published by Wiley later in 2014.
In March 2010 he was elected to SEMPO's board of directors and after a year as vice president he then served two years as president and is now the current chairman.