Linking Mistakes to Avoid (Part 2): Removing Orphaned URLs

Even now as you read this, you probably have orphaned URLs you don’t know about, collecting dust in a forgotten pile at the bottom of the search engine indices. It happens to the best of us. Even I, self-proclaimed Link Mensch, was humbled recently to discover several old URLs in AltaVista’s database that no longer physically exist on my Web server. Some expert I am.

During the life span of any Web site, you create, update, and delete URLs on a regular or semiregular basis. New files go up and old ones come down; some get renamed or archived. Sometimes, entire Web sites with thousands of pages get rehosted on new servers using new content management tools (as ClickZ did recently). I’ve even seen cases in which every URL on a site changed at once.

While you have been diligently running your Web site, adding, deleting, moving, and archiving files and URLs, search engine crawlers have been carousing through the Web. They have been visiting your server, on a hit-and-run basis, since the moment your site went live. Maybe a crawler came across one of your URLs as it scanned a newsgroup post at Deja News a couple of years ago. Maybe a newsletter wrote about your site and archived that edition, just as a crawler wandered by and stumbled onto your URL. There are countless ways a crawler could have found your URLs without ever going near your server. In fact, most URLs in any search engine’s database were found and followed from sources other than your own site.

What Matters Most

Of all the URLs your site has ever had, how many of them are still in the database of any given search engine?

Search engines have no idea if the URLs they have recorded and indexed are still in existence at any given moment. You may have updated your site and removed links and URLs, but search engines still think they exist. Search results are nothing but placeholders for the actual page on its server. Search results are a list of links.

Every URL from your site that no longer exists but a search engine thinks does exist is like a lump of coal waiting to be turned into a diamond. With search engines charging for indexing of URLs, it’s even more important to revive dead links before the engines find out they are dead and purge them. A purged URL is lost… forever.

Nearly every marketer tries to get its site fully indexed by the search engines. Most site owners wish they could get more of their sites’ pages indexed. If you have old links showing up in search results, count yourself lucky. And get busy making those dead links live again.

Finding and Fixing Them

Here’s one way to find out how many URLs from your site a search engine has indexed. Go to AltaVista and in the search box type “host:your domain” (replacing your domain with whatever your domain is, such as “”).

Look at the results. What you see is every file that AltaVista has in its index and thinks is active. Peruse the list. Put your cursor over the clickable link — but don’t click. Look at the bottom of your browser to see the filename of the URL. Are the file names you see still in existence? Probably not. If those names no longer exist on your site, create a new page with exactly the same filename as the one AltaVista thinks is still around, and get it on your server ASAP.

Let’s say you once had a site-map page named site-map.html, and you see that file among the search results. Six months ago, you renamed that file to map.html, which removed site-map.html from your server. The search engine doesn’t know the old URL is gone; it still has a record of the old page and what was on it.
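If the list of results is long, the comparison can be done by script rather than by hovering over each link. What follows is only a minimal sketch in Python, not a finished tool. It assumes you have copied the result URLs into a list by hand, that your site is made of static files under a single document root, and that a URL ending in a slash is served from index.html. The function name, the example.com URLs, and the /var/www/html path are all hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def find_orphaned_paths(indexed_urls, document_root):
    """Return the URL paths that have no matching file under document_root."""
    root = Path(document_root)
    missing = []
    for url in indexed_urls:
        # Keep only the path portion, e.g. "site-map.html".
        path = urlparse(url).path.lstrip("/")
        # Assumption: a bare directory URL is served from index.html.
        if path == "" or path.endswith("/"):
            path += "index.html"
        if not (root / path).is_file():
            missing.append("/" + path)
    return missing


if __name__ == "__main__":
    # Hypothetical URLs copied from a search engine's results page.
    urls = [
        "http://www.example.com/site-map.html",
        "http://www.example.com/map.html",
    ]
    for path in find_orphaned_paths(urls, "/var/www/html"):
        print("Indexed but missing on the server:", path)
```

Every path the script reports is a candidate for a new page with exactly that filename.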

You can also examine your own server logs to find all page requests that result in a 404 (file not found) response. This works even if you use custom 404 pages. This is how I discovered that one missing file on my site had been returning 404 errors about 30 times a day, or almost 1,000 times a month. I created a file that had the same name and content as the one that no longer existed. Bingo. I recaptured every bit of that lost traffic. You can do the same. Start with your server logs and then try some test searches.
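For the log check, a short script can do the tallying. The sketch below is one possible approach, not the method I used on my own site. It assumes your server writes its access log in Common Log Format (the Apache default) and that the log lives in a file called access.log; the function name and the file name are hypothetical, so adjust both the pattern and the path to match your own setup.

```python
import re
from collections import Counter

# Pulls the requested path and the status code out of a Common Log Format
# line such as:
# 1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /site-map.html HTTP/1.0" 404 209
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" (?P<status>\d{3})\b'
)


def count_404s(log_lines):
    """Count how many times each requested path produced a 404 response."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the access log is a plain-text file named access.log.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for path, hits in count_404s(log).most_common(20):
            print(hits, path)
```

The paths at the top of that list are the ones costing you the most visitors, so revive those first.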

If you want to find out what URLs the engines have indexed from your site, Danny Sullivan’s Search Engine Watch site has a section just for this.

Until next time, I remain

Eric Ward,
Link Mensch
