Search Engine Size Wars

September 10, 2003

Which search engine index is bigger -- and who really cares?

Ah, summer. Time to play at the beach, head out on vacation, and, if you're a search engine, announce to the world you've got the largest index.

Around this time last year, AlltheWeb.com kicked off a "who's biggest" round by claiming the largest index size. It's happened again. AlltheWeb said last month its index increased to 3.2 billion documents, toppling Google.

Google took only days to respond, quietly but deliberately notching up the number of pages it claims to index, listed on its home page. Like the McDonald's signs of old that gradually increased to show how many customers had been served, Google went from 3.1 to 3.3 billion Web pages indexed.

Yawn.

Actually, no yawn. I'm filled with Andrew Goodman-style rage (that's a compliment to Andrew) that search engine size wars may once again erupt. In terms of documents indexed, Google and AlltheWeb are essentially tied for biggest. Hey, so is Inktomi. So what? Knowing this gives you no idea which is better in terms of relevancy.

Size figures have long been used as a surrogate for the relevancy figures the search engine industry as a whole has failed to provide. Size figures are a poor surrogate. More pages in no way guarantee better results.

How Big Is Your Haystack?

There's a haystack analogy I use to explain the idea that size doesn't equal relevancy. If you want to find a needle in a haystack, you must search the entire haystack, right? If the Web is a haystack, a search engine that looks through only part of it may miss the portion with the needle.

Though that sounds convincing, the reality is more like this: The Web is a haystack and even if a search engine has combed every straw, you won't find the needle if the haystack's dumped on your head. That's what happens when the focus is on size and relevancy is a secondary concern. A search engine with good relevancy is like a person equipped with a powerful magnet. He'll find the needle without digging through the entire haystack. It will be pulled to the surface.

Deconstructing the Size Hot Dog

Much as I hate to, let's talk about what's in the quoted numbers. The figures are self-reported, unaudited, and not accompanied by a list of ingredients. Consider the hot dog. It appears to be all meat. Analyze it, and you may find water and filler make it appear plump.

Google's figure is the biggest self-reported number. Its home page now reports "Searching 3,307,998,701 web pages." What's inside that hot dog?

"Web pages" actually includes things that aren't Web pages, such as Word documents, PDF files, even text documents. It would be more accurate to say, "3.3 billion documents indexed" or "3.3 billion text documents indexed." That's what Google's really talking about.

Not all those 3.3 billion documents are actually indexed, either. Some may be listed in search results based only on links pointing to them. Links give Google a very rough idea what a page may be about.

Try a search for Pontneddfechan, a village in South Wales (where my mother-in-law lives). You should see in the top results a listing for www.estateangels.co.uk/place/40900/Pontneddfechan. That's a partially indexed page, as Google calls it. It's fairer to call it an unindexed page. In reality, it hasn't been indexed.

What chunk of the 3.3 billion figure has been indexed? Google's checking that for me. It doesn't always provide an answer to this particular question. Last time I got a figure was in June 2002. Then, 75 percent of the 2 billion pages listed on Google's home page had actually been indexed. If that percentage still holds true, the number of documents Google has indexed might be closer to 2.5 billion, not the 3.3 billion claimed.
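For anyone who wants to check that back-of-the-envelope math, here is a minimal sketch. It assumes the June 2002 share of 75 percent still applies to today's home page count, which Google has not confirmed; the function and variable names are made up for illustration.

```python
# Rough estimate only. The 75 percent share is the June 2002 figure;
# applying it to the current count is an assumption, not a Google number.
CLAIMED_PAGES = 3_307_998_701  # count shown on Google's home page
INDEXED_SHARE = 0.75           # share actually indexed in June 2002 (assumed to still hold)


def estimate_indexed(claimed: int, share: float) -> int:
    """Return the estimated number of documents actually indexed."""
    return round(claimed * share)


estimated = estimate_indexed(CLAIMED_PAGES, INDEXED_SHARE)
print(f"Estimated indexed documents: {estimated / 1e9:.2f} billion")
```

Under that assumption the estimate comes out just under 2.5 billion. Change the share and the answer moves with it, which is exactly why an unaudited headline number tells you so little.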

But wait! The supplemental index has yet to be counted. Sorry, we can't count it. Google isn't saying how big it is. Certainly, it adds to Google's overall figure. How much is a mystery.

Let's mix in more complications. For HTML documents, Google only indexes the first 101K it reads. Some long documents may not be wholly indexed. Do they count as "whole" documents in the overall figure? (Google says a small minority of documents are over this size.)

Auditing Sizes

We've raised a lot of questions about what's in Google's size figure. There are more we could ask. The same questions should be directed to the other search engines, too. AlltheWeb's 3.2 billion figure may include some pages known only by seeing links. It might include duplicates. Instead of asking questions, why not test or audit the figures ourselves?

That's exactly what Greg Notess of Search Engine Showdown is known for. Expect Notess to take a swing at these figures soon. The last test was conducted in December. The test involves searching for single-word queries, then examining each result -- a time-consuming task. But a necessary one, as counts from search engines are often not trustworthy.

Grow, But Be Relevant

I'm certainly not against indexes growing. I find self-reported figures to be useful. Maybe Google is slightly larger than AlltheWeb; perhaps AlltheWeb just squeaks past Google. The important point is both are doubtless well above a small service such as Gigablast, which has indexed only 200 million pages.

That's not to say a little service such as Gigablast isn't relevant. It may well be for certain queries. Google gained many converts when it launched with a much smaller index than the established players. Google's greater relevancy -- its ability to find the needle in the haystack, not bury you in straw -- was the important factor. If the size wars continue, look beyond the numbers listed on search engine home pages. Consider instead: Does the search engine find what you want?

By the way, the baby of the major search engines, Teoma, did some growing last month. The service moved from 500 million to 1.5 billion documents indexed.

Paul Gardi, vice president of search for Ask Jeeves, which owns Teoma, wants to grow more. He adds that Teoma is presently focused on English-language content. Teoma's smaller size may not be an issue for English speakers. Subtract non-English-language pages from Teoma's competitors, and the size differences may be much smaller.

"Comparatively speaking, I would argue that we are very close to Google's size in English," Gardi claims.

This column was adapted from ClickZ's sister site, SearchEngineWatch.com. A longer, more detailed version is available to paid Search Engine Watch members.

ABOUT THE AUTHOR

Danny Sullivan

Danny Sullivan left Search Engine Watch as of Dec. 1, 2006.
