Google recently dropped the famous count of pages in its index from its home page, while simultaneously claiming it has the most comprehensive collection of Web documents available. This is the long-expected counterblow to Yahoo’s recent claim to have outdistanced Google.
Dropping the home page count is a positive move that I believe helps defuse the size wars. It divorces page count from any attempt to “prove” comprehensiveness, a long overdue step for the search industry.
I’ve written before about the Yahoo-Google size fight that broke out in August, and I’ll recount some of that history as part of this analysis. Suffice to say, I’ve had many conversations with both companies over the past few weeks. In pondering their arguments and statements, I was ultimately struck by how little counting pages, whether self-reported index counts or those seen in response to actual queries, tells us about whether a search engine is comprehensive.
I’ll explain further, with some examples. But it’s best to start at the beginning. I hope you’ll indulge a little history.
The Bigger-Is-Better Attitude
In the last century (December 1995, to be exact), AltaVista burst on the search engine scene with what, at the time, was a giant 21 million page index, well above rivals in the 1 to 2 million range. The Web was growing fast. The more pages you had, the greater the odds you really would find that needle in a haystack. Bigger did, to some extent, mean better.
That fact wasn’t wasted on PR folks. The rush to seem bigger began in earnest. Lycos would talk about the number of pages it “knew” about, even if they weren’t actually indexed or accessible to searchers through its engine. That irritated Excite so much it posted a page on how to count URLs, archived here.
Size As Relevancy Surrogate
While a bigger index initially did mean a better one, that soon ended when the scale of indexes grew from millions of pages to tens of millions. For many queries, you could be overwhelmed with matches.
I’ve long used the needle in the haystack metaphor to explain this. You want to find the needle? You need the whole haystack, size proponents say. But if I dump the entire haystack on your head, can you find the needle then? Biggest isn’t good enough.
That’s why I and others have advised against fixating on size since as long ago as 1997 and 1998. Bigger was no longer better, regardless of the size wars that kept erupting. Google, when it came to popular attention in 1998-9, was one of the tiniest search engines with around 20 to 85 million pages. Despite a supposed lack of comprehensiveness, it grew because of the quality of its results.
Why have size wars persisted? Search engines saw an index size announcement as a quick, effective way to convey the impression they were more relevant. In lieu of a relevancy figure, size figures were trotted out. The biggest search engine wins! See my Screw Size! I Dare Google & Yahoo To Report On Relevancy and In Search Of The Relevancy Figure columns for more on this.
August’s Yahoo/Google Dispute
The latest size wars erupted when Yahoo said on its blog it now provided access to over 19 billion Web documents. Yahoo had been silent on its index size since rolling out its own technology in early 2004. Now, we had a figure — one over twice that claimed by Google on its home page.
Yahoo’s post did not claim Yahoo was more comprehensive or bigger than any other search engine. But Yahoo did claim this in an A.P. article about its self-reported size increase.
More than anything, that was a red flag to Google. Google wants to lead in all things search. The idea it might be second-best in comprehensiveness wasn’t a statement to be left unchallenged.
Counting Pages Doesn’t Measure Comprehensiveness
But how do you prove comprehensiveness? I’ll skip some earlier attempts (you can read more here) and dive quickly into a study two NCSA students recently conducted. In short, they looked at rare words. The idea is that the more matches you get for a rare term, the more comprehensive a search engine must be. If Yahoo was really over twice as big as Google, it should return more than twice as many matches.
The study has many flaws, as the two students admit in a follow-up. It skewed toward returning dictionary lists rather than “normal” documents useful to searchers. Were duplicate pages checked? Was Yahoo somehow filtering spam pages? To what depth were pages indexed: full-text, or only up to a certain length?
There was a bigger flaw. The students not unreasonably assumed the 8 billion pages indexed claim on Google’s home page was accurate. It wasn’t. In talking with Google recently, it turns out the home page count was its most conservative estimate. Beyond what was officially claimed were other documents, including “partially indexed” ones. These documents, while not actually indexed, might still appear in results and add toward a count.
In total, Google’s actual index size was above 8 billion, maybe as high as 9-11 billion pages, possibly more. That higher count, unbeknownst to the students, meant the “gap” between Google and Yahoo counts would be less. It wasn’t necessarily that Yahoo was smaller than it claimed. It could also be Google was bigger than it claimed.
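To see how much that uncertainty shrinks the gap, here is a rough sketch of the arithmetic. It uses only the self-reported figures quoted above (19 billion as the floor of Yahoo’s “over 19 billion” claim, 8 billion from Google’s home page, and 9 to 11 billion as the possible higher range). The function name is purely illustrative, and none of these numbers are independently verified.

```python
def size_ratio(yahoo_pages: float, google_pages: float) -> float:
    """Return how many times larger the Yahoo figure is than the Google figure."""
    return yahoo_pages / google_pages


# Self-reported figures from the article, not measured values.
YAHOO_CLAIM = 19e9

for google_estimate in (8e9, 9e9, 11e9):
    ratio = size_ratio(YAHOO_CLAIM, google_estimate)
    print(f"Google at {google_estimate / 1e9:.0f} billion: "
          f"Yahoo's claim is {ratio:.2f} times larger")
```

Against the 8 billion home page figure, Yahoo’s claim is well over twice as large. Against 11 billion, it is less than twice as large, so a study expecting “more than twice as many matches” would be testing against the wrong baseline.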
The biggest flaw in the study was the idea that counting pages equals measuring comprehensiveness. Years ago, perhaps. Today, in an era of syndicated and near-duplicated content, measuring comprehensiveness is much more subjective.
Assume you search for a “rare” term and one search engine comes up with three matches, while a rival returns only one. Is the “big” search engine three times better? Look at the actual pages. What if those three pages are duplicates or near-duplicates? If that’s the case, the counts don’t accurately reflect comprehensiveness.
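The three-versus-one scenario can be sketched in a few lines. This is a minimal illustration with made-up page text, not how any search engine actually detects duplicates. The helper names are hypothetical, and the normalization here only catches copies that differ in case, spacing or punctuation. True near-duplicate detection would need something more forgiving.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric words so trivial differences vanish."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def raw_and_unique_counts(pages: list[str]) -> tuple[int, int]:
    """Return (raw match count, count after collapsing normalized duplicates)."""
    unique = {normalize(page) for page in pages}
    return len(pages), len(unique)


# Hypothetical results for one "rare" term at two search engines.
engine_a = [
    "Rare widget found in old barn.",
    "rare widget found in old barn",
    "Rare  widget found in old barn!",
]
engine_b = ["Rare widget found in old barn."]

print(raw_and_unique_counts(engine_a))
print(raw_and_unique_counts(engine_b))
```

On raw counts, the first engine looks three times as comprehensive. Once the copies are collapsed, both engines offer the searcher exactly one distinct document.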
The Duplicate Content Issue
Let’s look at some real-life examples to fully understand the problems in depending on counts alone. I’ll start with data from Google’s new blog search service. It handily keeps me updated with posts in feeds that link to our Search Engine Watch Blog. As it turns out, this is a great way to find those who carry my content.
Consider this page, from the Amazezing site. It’s simply a summary of my article here. The Amazezing site’s summary potentially could have comments making it unique from my article. But there aren’t any. Carrying that page in a search engine’s index, at the moment, doesn’t make the search engine any more comprehensive than if it just carried my original article. Yet if a search engine indexes both pages, then it seems twice as good on a count basis as a search engine that carries only my article.
Head to the Amazezing site’s homepage. Now look at the Open Directory’s homepage. Seem familiar? Amazezing carries a copy of the Open Directory’s listings. There’s nothing wrong with that. The Open Directory encourages people to use its listings. But where’s the unique content?
Look at the Open Directory’s Anime Genres category. Compare it with the corresponding page on Amazezing. The key difference? Amazezing has Google AdSense links, and the Open Directory doesn’t. Carrying the Amazezing category page makes a search engine negligibly more comprehensive to a human eye. On a pure count basis? Negligible becomes twice as good.
Counting Pages Indexed Per Site
What about looking at comprehensiveness in terms of how many pages from particular sites are indexed? Surely, if a site has fewer pages listed at one search engine than at another, that may mean the first search engine isn’t very comprehensive.
Reader Sam Davyson certainly felt that way. He recently discovered Google indexed nearly all of his 110 Web pages, while Yahoo had only five.
Then again, comprehensiveness may be in the eye of the beholder. Another reader, Patrick Mondout, is frustrated that Yahoo has refused to list any pages other than the home page of his Super70s.com site. In contrast, Google has up to 62,000 pages listed (many, however, seem to be “link” only pages that haven’t been indexed), and MSN has 3,000.
When I followed up, Yahoo’s response was that Mondout’s site seemed to be mostly content scraped from Amazon or eBay, plus excessive cross-linking. In turn, Mondout claims to have plenty of unique content.
Readers can look at the site and judge for themselves. However, there’s no doubt that in some cases, a search engine may drop a site or content from a site for good reasons. In doing so, its count numbers go down. Ironically, that may make the search engine look less comprehensive when it isn’t. In other situations, of course, the opposite holds true.
Next in Part 2: How do you measure comprehensiveness?