Will the Crawler Survive?

July 14, 2008

Google's universal search is proving that methods beyond the crawl are required to retrieve relevant information from the Web's emerging new structure. Will the crawler survive?

As I'm English, Independence Day celebrations don't rank too high on my social calendar. However, when I was invited to join a friend and some marketing big-brains for the holiday weekend at the beach, how could I refuse?

The group was a mix of people from conventional and interactive agencies, such as Grey, Leo Burnett, and Beyond Interaction, specifically from the search side. Naturally, most of the conversation (over many cold beers) was marketing related.

It was so refreshing to note how my new friends were fascinated by search and its complexities. But one thing stuck out more than anything else during the conversations: the recurring assumption that Google has access to all the content on the entire Web.

It's been a long time since I last wrote about the discoverability of content on the Web, and it's a subject that's always worth revisiting. To many end users, Google is the Web. Yet, mighty as Google is, it can only return results from the fraction of the Web it has managed to crawl. Of course, there are other routes through which Google can discover content, such as user-submitted material via YouTube, Google Base, Google Maps, Google Picasa, and so forth.

But when it comes to the SEO favorite, the search engine crawler, there are strong freshness requirements and multiple timescales to juggle. Determining the continuing relevance of pages already in the index while dealing with the high arrival rate of new Web content isn't an easy task.

The overhead (the average number of fetches required to discover one new page) needs to be kept to a minimum. Bandwidth is still an issue, too: it wouldn't be practical to attempt to download the entire Web every day. Politeness rules still apply when crawling the Web, and some sites are so large that they simply can't be crawled from beginning to end in the space of a week.
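
To make those two constraints concrete, here's a rough Python sketch of a polite fetcher that also tracks discovery overhead. Everything in it is an assumption for illustration: fetch_page is a hypothetical downloader-and-parser, and the 10-second per-host delay is a made-up politeness setting, not anything a real engine publishes.

    import time
    from collections import defaultdict
    from urllib.parse import urlparse

    MIN_DELAY_PER_HOST = 10.0  # assumed politeness delay between hits to one host

    last_fetch = defaultdict(float)   # host -> time of the last request to it
    seen = set()                      # URLs already known to the index
    fetches = 0
    new_pages = 0

    def polite_fetch(url, fetch_page):
        """Fetch a URL no faster than the per-host politeness delay."""
        global fetches, new_pages
        host = urlparse(url).netloc
        wait = MIN_DELAY_PER_HOST - (time.time() - last_fetch[host])
        if wait > 0:
            time.sleep(wait)
        last_fetch[host] = time.time()
        seen.add(url)
        fetches += 1
        links = fetch_page(url)                     # hypothetical downloader/parser
        fresh = [u for u in links if u not in seen] # links never encountered before
        new_pages += len(fresh)
        seen.update(fresh)
        return fresh

    def overhead():
        """Average fetches needed to discover one new page (lower is better)."""
        return fetches / new_pages if new_pages else float("inf")

The point of the overhead() metric is simply that every fetch spent rediscovering pages you already know about is bandwidth not spent finding something new.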

No crawler is ever likely to be able to crawl the entire reachable Web. A practically infinite supply of pages, spider traps, spam, and many other issues prevent it.

There will always be a tradeoff between recrawling existing pages and crawling new ones. In a connected world where breaking news is of global concern, search engines must be able to provide that information almost in real time to avoid end-user dissonance.
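
A toy scheduler makes the tradeoff visible. The split below (roughly 30 percent of the fetch budget spent on never-seen URLs) and the queue layout are purely illustrative assumptions, not disclosed engine settings.

    import heapq
    import time
    from collections import deque

    refresh_queue = []          # heap of (next_due_timestamp, url) for indexed pages
    discover_queue = deque()    # URLs extracted from indexed pages, never yet fetched

    NEW_PAGE_SHARE = 0.3        # assumed fraction of the budget spent on discovery

    def schedule_refresh(url, due_at):
        """Queue an already-indexed page for recrawl at a future time."""
        heapq.heappush(refresh_queue, (due_at, url))

    def next_url(step):
        """Pick the next URL to fetch, interleaving refresh and discovery."""
        want_new = (step % 10) < NEW_PAGE_SHARE * 10
        if want_new and discover_queue:
            return discover_queue.popleft()
        if refresh_queue and refresh_queue[0][0] <= time.time():
            return heapq.heappop(refresh_queue)[1]
        # nothing is due for a refresh, so fall back to discovery if possible
        return discover_queue.popleft() if discover_queue else None

Shift the share toward refresh and breaking news gets picked up faster while new sites wait longer to be discovered; shift it the other way and the opposite happens.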

At the same time, consider the user looking for seemingly less urgent information, such as an operating manual. The user knows it must exist on the Web, yet he can't find it through a search engine. This is also a disappointing experience.

New pages are primarily discovered when a Web site uploads them and links to them from pages that are already indexed, or when an entirely new Web site is created and linked to from an existing indexed site.

Of course, this is also where the "filthy linking rich" dilemma that I've written about comes into play. Web sites with more links attract more links than those with fewer. As a result, they have more content indexed, more links, and perhaps preference when it comes to ranking.

And then there's the temporal issue of stale pages. For instance, the most relevant documents for a query about who has hit the most home runs in baseball history would, until August 2007, have been about Hank Aaron. After Barry Bonds broke that record, the most relevant pages for exactly the same query would be about Bonds.

Then there's the case of Google knowing that Web pages exist but not yet having crawled them. Google extracts billions of links from billions of pages, and there must be some order and priority determining which get crawled first.
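
One simplistic way to impose that order, sketched below, is to fetch first the URLs that the most already-indexed pages point to. Counting referring pages is an assumption standing in for the much richer signals a real engine would use.

    import heapq
    from collections import Counter

    inlink_counts = Counter()   # url -> number of indexed pages linking to it

    def record_links(extracted_urls):
        """Update counts with the links found on a freshly crawled page."""
        inlink_counts.update(extracted_urls)

    def crawl_order(uncrawled_urls):
        """Return the uncrawled URLs, most-referenced first."""
        return heapq.nlargest(len(uncrawled_urls), uncrawled_urls,
                              key=lambda u: inlink_counts[u])

Of course, a scheme like this feeds straight back into the rich-get-richer effect described above: heavily linked sites get crawled sooner, indexed deeper, and linked to even more.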

Even though Google is far better now at dealing with dynamically delivered content and different file types, the invisible Web still exists. Millions of pages are locked in databases or behind password-protected areas that crawlers are blocked from.
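
Part of the reason is mundane, as the small sketch below suggests: a crawler holding no credentials simply gets an error status back from protected content and moves on. The URL and user agent here are placeholders, not real endpoints.

    import urllib.error
    import urllib.request

    def try_fetch(url):
        """Attempt an anonymous fetch, as a crawler without credentials would."""
        req = urllib.request.Request(url, headers={"User-Agent": "ExampleCrawler/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                print(f"{url}: requires authentication - skipped, never indexed")
            return None

Content that only exists as the result of a database query behind a search form is even harder: there is no URL to fetch at all until a human fills in the form.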

Search engine crawlers are certainly much smarter now than in the early days of the Web. Yet link-graph-driven crawling, or even the crawling model as a whole, may never be able to provide timely discovery of Web content in the future.

Google's universal search is proving that methods beyond the crawl are required to retrieve relevant information from the Web's emerging new structure.

User-generated content analysis. Cross-content analysis. Community analysis. Aggregate analysis. All of these must be taken into account to provide the most relevant results and richest end user experience.

So, will the crawler survive?

Join me over at Search Engine Watch's forum to discuss the crawler's possible fate.

Meet Mike at SES San Jose, August 18-22 at San Jose Convention Center.


ABOUT THE AUTHOR

Mike Grehan

Mike Grehan is Publisher of Search Engine Watch and ClickZ and Producer of the SES international conference series. He is the current president of global trade association SEMPO, having been elected to the board of directors in 2010.

Formerly, Mike worked as a search marketing consultant with a number of international agencies, handling such global clients as SAP and Motorola. Recognized as a leading search marketing expert, Mike came online in 1995 and is author of numerous books and white papers on the subject. He is currently in the process of writing his new book "From Search To Social: Marketing To The Connected Consumer" to be published by Wiley in 2013.
