Ah, summer. Time to play at the beach, head out on vacation, and, if you’re a search engine, announce to the world you’ve got the largest index.
Around this time last year, AlltheWeb.com kicked off a “who’s biggest” round by claiming the largest index size. It’s happened again. AlltheWeb said last month its index increased to 3.2 billion documents, toppling Google.
Google took only days to respond, quietly but deliberately notching up the number of pages it claims to index, listed on its home page. Like the McDonald’s signs of old that gradually increased to show how many customers had been served, Google went from 3.1 to 3.3 billion Web pages indexed.
Yawn. Actually, no yawn. I’m filled with Andrew Goodman-style rage (that’s a compliment to Andrew) that search engine size wars may once again erupt. In terms of documents indexed, Google and AlltheWeb are essentially tied for biggest. Hey, so is Inktomi. So what? Knowing this gives you no idea which is better in terms of relevancy.
Size figures have long been used as a surrogate for the relevancy figures the search engine industry as a whole has failed to provide. Size figures are a poor surrogate. More pages in no way guarantee better results.
How Big Is Your Haystack?
There’s a haystack analogy I use to explain the idea that size doesn’t equal relevancy. If you want to find a needle in a haystack, you must search the entire haystack, right? If the Web is a haystack, a search engine that looks through only part of it may miss the portion with the needle.
Though that sounds convincing, the reality is more like this: The Web is a haystack and even if a search engine has combed every straw, you won’t find the needle if the haystack’s dumped on your head. That’s what happens when the focus is on size and relevancy is a secondary concern. A search engine with good relevancy is like a person equipped with a powerful magnet. He’ll find the needle without digging through the entire haystack. It will be pulled to the surface.
Deconstructing the Size Hot Dog
Much as I hate to, let’s talk about what’s in the quoted numbers. The figures are self-reported, unaudited, and not accompanied by a list of ingredients. Consider the hot dog. It appears to be all meat. Analyze it, and you may find water and filler make it appear plump.
Google’s figure is the biggest self-reported number. Its home page now reports “Searching 3,307,998,701 web pages.” What’s inside that hot dog?
“Web pages” actually includes things that aren’t Web pages, such as Word documents, PDF files, even text documents. It would be more accurate to say, “3.3 billion documents indexed” or “3.3 billion text documents indexed.” That’s what Google’s really talking about.
Not all those 3.3 billion documents are actually indexed, either. Some may be listed in search results based only on links pointing to them, without Google having read the documents themselves. Links give Google a very rough idea what a page may be about.
Try a search for Pontneddfechan, a village in South Wales (where my mother-in-law lives). You should see in the top results a listing for www.estateangels.co.uk/place/40900/Pontneddfechan. That’s a partially indexed page, as Google calls it. It’s fairer to call it an unindexed page, because in reality it hasn’t been indexed.
What chunk of the 3.3 billion figure has been indexed? Google’s checking that for me. It doesn’t always provide an answer to this particular question. Last time I got a figure was in June 2002. Then, 75 percent of the 2 billion pages listed on Google’s home page had actually been indexed. If that percentage still holds true, the number of documents Google has indexed might be closer to 2.5 billion, not the 3.3 billion claimed.
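The back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for the example, and the 75 percent share is the June 2002 figure, assumed (not confirmed) to still apply.

```python
def estimate_indexed(claimed_total, indexed_share):
    """Estimate how many claimed documents are fully indexed.

    claimed_total: the self-reported document count.
    indexed_share: fraction assumed to be fully indexed (0 to 1).
    """
    if not 0.0 <= indexed_share <= 1.0:
        raise ValueError("indexed_share must be between 0 and 1")
    return claimed_total * indexed_share


claimed = 3_307_998_701  # count quoted on Google's home page
share_2002 = 0.75        # June 2002 share; an assumption for today's index
estimate = estimate_indexed(claimed, share_2002)
print(f"{estimate / 1e9:.2f} billion")  # prints "2.48 billion"
```

That is where the "closer to 2.5 billion" figure comes from. Change the share and the estimate moves with it, which is why the unanswered question matters.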
But wait! The supplemental index has yet to be counted. Sorry, we can’t count it. Google isn’t saying how big it is. Certainly, it adds to Google’s overall figure. How much is a mystery.
Let’s mix in more complications. For HTML documents, Google only indexes the first 101K it reads. Some long documents may not be wholly indexed. Do they count as “whole” documents in the overall figure? (Google says a small minority of documents are over this size.)
We’ve raised a lot of questions about what’s in Google’s size figure. There are more we could ask. The same questions should be directed to the other search engines, too. AlltheWeb’s 3.2 billion figure may include some pages known only by seeing links. It might include duplicates. Instead of asking questions, why not test or audit the figures ourselves?
That’s exactly what Greg Notess of Search Engine Showdown is known for. Expect Notess to take a swing at these figures soon; his last test was conducted in December. The method involves running single-word queries, then examining each result. It is a time-consuming task but a necessary one, as counts from search engines are often not trustworthy.
Grow, But Be Relevant
I’m certainly not against indexes growing. I even find self-reported figures useful as a rough gauge of who is in the same league. Maybe Google is slightly larger than AlltheWeb, perhaps AlltheWeb just squeaks past Google. The important point is both are doubtless well above a small service such as Gigablast, which has indexed only 200 million pages.
That’s not to say a little service such as Gigablast isn’t relevant. It may well be for certain queries. Google gained many converts when it launched with a much smaller index than the established players. Google’s greater relevancy — its ability to find the needle in the haystack, not bury you in straw — was the important factor. If the size wars continue, look beyond the numbers listed on search engine home pages. Consider instead, does the search engine find what you want?
By the way, the baby of the major search engines, Teoma, did some growing last month. The service moved from 500 million to 1.5 billion documents indexed.
Paul Gardi, vice president of search for Ask Jeeves, which owns Teoma, wants to grow more. He adds that Teoma is presently focused on English-language content. Teoma’s smaller size may not be an issue for English speakers. Subtract non-English-language pages from Teoma’s competitors, and the size differences may be much smaller.
“Comparatively speaking, I would argue that we are very close to Google’s size in English,” Gardi claims.