License to Crawl

November 27, 2006

Crawling the Web for new content is a complex task. A new organization helps site owners simplify that task.

Search engine innovator Brian Pinkerton's WebCrawler was the Web's first full-text retrieval search engine. Prior to this development, search engine crawlers simply scanned the first couple of hundred words (often less) from the top of your page and moved on. (WebCrawler is now a metasearch engine.)

At the time, WebCrawler was starved of fresh content. Pinkerton was programming his crawler to pillage Usenet (now Google Groups) at a whopping 50 gigs a day. The purpose, of course, was to suck out all the links in the Usenet feed for future crawling. But little or nothing was dynamically delivered then, so crawling for new content while maintaining the freshness of existing index content was pretty straightforward.

Today, that task is much trickier. The Web continues to grow exponentially every day, and, although there's tons of new content in good old flat HTML, there's ten times more tucked away in millions and millions of online databases that dynamically generate pages. And accessing, indexing, and maintaining all that information is most certainly not a trivial task.

As Web site development technology continues to advance at a rapid pace, so must search engine crawlers. The question mark in a URL used to be a dreaded token for crawlers. With early crawlers, a major part of the task was politeness: not putting too much load on the Web sites they visited. Just as important was avoiding serious technical traps, such as the recursive loops a crawler can fall into when it follows a question mark into dynamically generated URL space. Once inside a huge database, a crawler could potentially bring a Web site to a standstill, or crash itself. And that could have legal ramifications attached to it.
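To make the hazard concrete, here's a minimal, hypothetical sketch (no real engine's code, and the budget constant is an assumption) of the kind of guards early crawlers needed: a visited set to break recursive loops, plus a per-host page budget so one database-driven site can't swallow the whole crawl.

```python
# Sketch of spider-trap guards for a crawl frontier. MAX_PAGES_PER_HOST
# is an assumed politeness budget, not a value from any real crawler.
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PAGES_PER_HOST = 100

def plan_crawl(candidate_links):
    """Filter a link list down to URLs actually worth fetching."""
    visited = set()
    per_host = defaultdict(int)
    to_fetch = []
    for url in candidate_links:
        host = urlsplit(url).netloc
        if url in visited:
            continue  # already seen this exact URL: a loop, skip it
        if per_host[host] >= MAX_PAGES_PER_HOST:
            continue  # host budget spent: likely a dynamic-URL trap
        visited.add(url)
        per_host[host] += 1
        to_fetch.append(url)
    return to_fetch
```

Real crawlers layer on much more (URL normalization, crawl-delay, robots.txt), but the visited set and the per-host cap are the two guards that stop a question-mark URL space from becoming a bottomless pit.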

Certainly search engine crawlers do a much better job now of crawling dynamically generated content as well as Flash and other non-HTML file types. But the matter of index freshness is still a challenge.

And that still puts pressure on site owners' bandwidth, because in practice the only way a crawler can confirm a page has changed and refresh it in the index is to download it again. And again, and again, and again.

How much easier would it be if a search engine crawler only visited and downloaded your pages when it absolutely knew the page had changed? The entire crawling process would be streamlined and sped up.

Back then, Pinkerton and I kicked around the idea of an XML file that would sit on Web servers, maintained by their owners (in the main, ISPs). The idea was simple: the ISP would update the XML file with site changes based on server analytics. The crawler would fetch the XML file first, then retrieve only the pages that had changed since the last crawl. Unfortunately, that approach was wide open to spamming and other forms of manipulation, so the idea was canned pretty quickly.

Then last year, Google announced its sitemap initiative. It's an excellent step forward in crawler development. Not only that, it provides a much-needed way to submit your site to the engine, as opposed to waiting until you've built up enough linkage data to get crawled and revisited on a regular basis.
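For those who haven't seen one, a minimal sitemap file under the sitemaps.org protocol looks like this. The URL, date, and the optional `changefreq` and `priority` values are placeholders; only `loc` is required for each entry.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2006-11-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

The `lastmod` element is the piece that matters for freshness: it's what lets a crawler know a page has changed without downloading the page itself.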

And all that's why I welcomed the new sitemap protocol. At least now there's a way to create a single feed you can submit to the three major search engines (Ask isn't included at this time) to make them aware of your Web pages. In fact, it provides them with a license to crawl.
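A crawler consuming such a feed could skip unchanged pages with logic along these lines. This is a hypothetical sketch using only the Python standard library; the sample sitemap and its URLs are made up for illustration.

```python
# Sketch: decide which URLs to refetch by comparing each <lastmod> in a
# sitemap against the date of our last crawl. Not any engine's real code.
from datetime import date
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_to_refetch(sitemap_xml, last_crawl):
    """Return URLs whose lastmod is newer than last_crawl (a date)."""
    root = ET.fromstring(sitemap_xml)
    stale = []
    for url in root.iter(NS + "url"):
        loc = url.findtext(NS + "loc")
        lastmod = url.findtext(NS + "lastmod")
        # No lastmod? Play safe and refetch the page.
        if lastmod is None or date.fromisoformat(lastmod) > last_crawl:
            stale.append(loc)
    return stale

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><lastmod>2006-11-20</lastmod></url>
  <url><loc>http://example.com/old</loc><lastmod>2006-01-01</lastmod></url>
</urlset>"""

print(urls_to_refetch(sample, date(2006, 11, 1)))
# → ['http://example.com/']
```

The point of the protocol, in three lines of logic: fetch one small XML file, compare dates, and leave everything that hasn't changed alone.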

Of course, as with the launch of Google Sitemaps last year, there's a strong emphasis on the fact that knowing your Web pages exist in no way guarantees they'll be crawled by the big three. Still, avoiding those early technical barriers that kept many Web sites from being crawled, even if they had linkage data through the roof, is a big step.

One day all Web sites will be submitted this way. Though at some point, we'll probably have to pay for it!

Have a burning desire to know everything about crawling the Web? I strongly recommend this thesis, written by Junghoo Cho, a researcher whose work I've studied a great deal. Although it was written in 2001, it still stands up today.

Meet Mike at Search Engine Strategies in Chicago, December 4-7, at the Hilton Chicago.



ABOUT THE AUTHOR

Mike Grehan

Mike Grehan is currently CMO & managing director at Acronym where he is responsible for directing thought leadership programs and cross platform marketing initiatives, as well as developing new, innovative content marketing campaigns.
 
Prior to joining Acronym, Grehan was global VP, Content, at Incisive Media, publisher of Search Engine Watch and ClickZ, and producer of the SES international conference series. Previously, he worked as a search marketing consultant with a number of international agencies handling global clients such as SAP and Motorola. Recognized as a leading search marketing expert, Grehan came online in 1995 and is the author of numerous books and white papers on the subject and is currently in the process of writing his new book “From Search To Social: Marketing To The Connected Consumer” to be published by Wiley later in 2014.
 
In March 2010 he was elected to SEMPO's board of directors; after a year as VP, he served two years as president, and he is now chairman.
