License to Crawl

Crawling the Web for new content is a complex task. A new organization helps site owners simplify that task.

Author

Mike Grehan

Date published: November 27, 2006

Search engine innovator Brian Pinkerton’s WebCrawler was the Web’s first full-text retrieval search engine. Prior to this development, search engine crawlers simply scanned the first couple of hundred words (often fewer) from the top of your page and moved on. (WebCrawler is now a metasearch engine.)

At the time, WebCrawler was starved of fresh content. Pinkerton was programming his crawler to pillage Usenet (now accessible through Google Groups) at a whopping 50 gigs a day. The purpose, of course, was to suck out all the links in the Usenet feed for future crawling. But little or nothing was dynamically delivered then, so crawling for new content while maintaining the freshness of existing index content was pretty straightforward.

Today, that task is much trickier. The Web continues to grow exponentially every day, and, although there’s tons of new content in good old flat HTML, there’s ten times more tucked away in millions and millions of online databases that dynamically generate pages. And accessing, indexing, and maintaining all that information is most certainly not a trivial task.

As Web site development technology advances at a rapid pace, so must search engine crawlers. The question mark in a URL used to be a dreaded token for search engine crawlers. With early crawlers, a major part of the task was to be polite and not put too much pressure on visited Web sites. Another was avoiding serious technical issues, such as getting caught in a recursive loop, which can happen once a crawler follows URLs past the question mark. Once inside a huge database, the crawler could potentially bring a Web site to a standstill or crash itself. And that could carry legal ramifications.
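To make the recursive-loop problem concrete, here is a minimal sketch of one way a crawler might protect itself and the sites it visits. It is an illustration under stated assumptions, not how WebCrawler or any current engine works: the names `canonicalize` and `Frontier` and the `max_per_path` budget are hypothetical. The idea is to treat URLs that differ only in parameter order or fragment as the same page, and to cap how many query-string variants of one path get fetched.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 count as one page.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


class Frontier:
    """Decides whether a discovered URL should be fetched (hypothetical helper)."""

    def __init__(self, max_per_path: int = 3):
        self.seen = set()        # canonical URLs already accepted
        self.per_path = {}       # (host, path) -> query variants accepted so far
        self.max_per_path = max_per_path

    def should_fetch(self, url: str) -> bool:
        canon = canonicalize(url)
        if canon in self.seen:
            return False         # already accepted: skip to avoid looping
        parts = urlsplit(canon)
        key = (parts.netloc, parts.path)
        count = self.per_path.get(key, 0)
        # Cap query-string variants of one script to limit database-driven traps.
        if parts.query and count >= self.max_per_path:
            return False
        self.seen.add(canon)
        if parts.query:
            self.per_path[key] = count + 1
        return True
```

A real crawler would also pause between requests to the same host and honor robots.txt. Both are left out here to keep the sketch short.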

Certainly search engine crawlers do a much better job now of crawling dynamically generated content as well as Flash and other non-HTML file types. But the matter of index freshness is still a challenge.

And that still puts pressure on site owners’ bandwidth, as the only way a crawler can know if a page has changed and update it in the index is by downloading it again. And again, and again, and again.
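The download-and-compare cycle described above can be sketched in a few lines. This is an assumption-laden illustration rather than any engine's actual method: `fingerprint`, `has_changed`, and the in-memory `index` dictionary are hypothetical names. Note that the comparison only runs after the full page body has been fetched, which is exactly where the bandwidth cost comes from.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Return a content hash used to compare two downloads of a page."""
    return hashlib.sha256(body).hexdigest()


def has_changed(index: dict, url: str, body: bytes) -> bool:
    """Compare a freshly downloaded body with the stored fingerprint.

    The stored fingerprint is updated on every call. A URL seen for the
    first time is reported as changed, since it still needs indexing.
    """
    new = fingerprint(body)
    old = index.get(url)
    index[url] = new
    return old != new
```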

How much easier would it be if a search engine crawler only visited and downloaded your pages when it absolutely knew the page had changed? The entire crawling process would be streamlined and sped up.

Back then, Pinkerton and I kicked around the idea of using an XML schema to sit on Web servers, maintained by the owners (in the main, ISPs). The idea was simple. The ISP would update the XML file with site changes based on server analytics. The crawler would bring down the XML file first, then only retrieve pages that had been changed since the last crawl. Unfortunately, that practice was wide open to spamming and other forms of manipulation, so the idea was canned pretty quickly.
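As a rough sketch of how a crawler might have consumed such a file: the XML layout below (a `changes` root with `page` elements carrying `url` and `modified` attributes) is invented purely for illustration, since the schema was never specified, and `pages_to_refetch` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical change file a server owner might publish.
CHANGES_XML = """<changes>
  <page url="http://example.com/" modified="2006-11-20T08:00:00"/>
  <page url="http://example.com/news" modified="2006-11-26T17:30:00"/>
</changes>"""


def pages_to_refetch(xml_text: str, last_crawl: datetime) -> list:
    """Return the URLs whose recorded modification time is after last_crawl."""
    root = ET.fromstring(xml_text)
    return [
        page.get("url")
        for page in root.iter("page")
        if datetime.fromisoformat(page.get("modified")) > last_crawl
    ]
```

With a last crawl on November 25, 2006, only the news page would be retrieved again. Nothing in the sketch verifies that the timestamps are honest, which is the manipulation problem that sank the idea.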

Then last year, Google announced its sitemap initiative. It’s an excellent step forward in crawler development. Not only that, it provides a much-needed method of submitting your site to the engine, as opposed to having to wait until you build up linkage data to get crawled and revisited on a regular basis.

And all that’s why I welcomed the new sitemap protocol. At least now there’s a way to create a single feed that you can submit to the three major search engines (Ask isn’t included at this time) to make them aware of your Web pages. In fact, it provides them with a license to crawl.
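For a sense of what such a feed looks like, here is a minimal sketch that builds one with Python's standard library. The element names (`urlset`, `url`, `loc`, `lastmod`) and the namespace follow the published Sitemap protocol, version 0.9, as best understood here. The function name `build_sitemap` and the sample page are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemap protocol, version 0.9.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages) -> str:
    """Build a sitemap document from (location, last-modified date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc          # page address
        ET.SubElement(url, "lastmod").text = lastmod  # date in YYYY-MM-DD form
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([("http://example.com/", "2006-11-27")])
```

The resulting text would be saved on the site, typically as sitemap.xml, and its address submitted to each engine.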

Of course, as with the launch of Google Sitemaps last year, there’s a strong emphasis on the fact that the engines knowing your Web pages exist in no way guarantees those pages will get crawled by all of the big three. However, avoiding those early technical barriers that prevented many Web sites from being crawled, even if they had linkage data through the roof, is big.

One day all Web sites will be submitted this way. Though at some point, we’ll probably have to pay for it!

Have a burning desire to know everything about crawling the Web? I strongly recommend this thesis. It’s written by Junghoo Cho, a researcher whose work I’ve studied a great deal. And although it was written in 2001, it still stands up today.

Meet Mike at Search Engine Strategies in Chicago, December 4-7, at the Hilton Chicago.

Want more search information? ClickZ SEM Archives contain all our search columns, organized by topic.
