Getting Out of Search

Most Web site marketers are very interested in getting into the major search engines, namely Google, Yahoo, and MSN, currently in that order. But what about when bizarre circumstances collide and a page or two, or 2,000, accidentally slip into search engine indices? These are pages you didn’t want found, either by search engine spiders or, most certainly, by the general viewing public.

How do you eradicate something from search engines fast and forever? Do you fax Google, call Yahoo, or e-mail Steve Ballmer? You could try all the above, and I sincerely wish you the best of luck in each fruitless endeavor.

Most Web marketers run screaming to the Web dev or IT team offices and have the pages deleted from the site. The logic is simple: out of the site means you’re no longer going out of your mind. Then you triumphantly high-five the team, e-mail your boss, and say, “Whew! Mischief managed!”

Only it’s not.

More and more Web marketers need to know how to get the heck out of search. Deleting the source code simply doesn’t work anymore; Google’s and MSN’s spiders are much too fast and too greedy to miss out on a Web page faux pas. And there’s the cache. Add the job of removing links to the offending pages, and getting pages out of search engine results can be a Herculean challenge.

But there are steps you can take today to avoid a meta misadventure tomorrow.

Steps to Take

First, make certain your Web site is designed to return a 404 error, along with an appropriate error message for users, when a page no longer exists. If your site is designed to default to the home page when the user enters a URL that’s long gone, the search engines actually think the woebegone page still exists. Ergo, there’s no reason for the search engines to naturally allow the page to fall out of their indices. That dead page looks like it’s alive.
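One way to spot a zombie page is to request a URL you know has been deleted and look at what comes back. The Python sketch below is a minimal illustration, not a finished tool: the function names and the labels it returns are made up for this example, and it assumes that a 200 response for a deleted URL (whether served in place or after a redirect to the home page) is the problem case described above.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import urlopen


def classify_response(status: int, requested_url: str, final_url: str) -> str:
    """Label how a server answered a request for a page that was deleted."""
    if status in (404, 410):
        # 404 Not Found (or 410 Gone) tells spiders the page no longer exists.
        return "gone"
    if status == 200:
        requested_path = urlsplit(requested_url).path.rstrip("/")
        final_path = urlsplit(final_url).path.rstrip("/")
        if requested_path != final_path:
            # A redirect landed somewhere else, such as the home page.
            return "zombie: redirected to another page"
        return "zombie: still served"
    return "other"


def check_deleted_url(url: str) -> str:
    """Fetch a URL that should be gone and classify the server's answer."""
    try:
        with urlopen(url, timeout=10) as response:  # urlopen follows redirects
            return classify_response(response.status, url, response.geturl())
    except HTTPError as err:
        # urllib raises HTTPError for 4xx and 5xx responses.
        return classify_response(err.code, url, url)
```

Calling check_deleted_url() on a few URLs you removed last month (the addresses are yours to supply) should come back as "gone" for every one. Anything labeled "zombie" is a page the spiders still have reason to believe in.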

Don’t let zombie pages ruin a perfectly good Web site. Get your 404 errors in order, then take the next logical step: prove to the search engine spiders that you own and manage the site. Validate your existence by authenticating your Web site with Google and Yahoo (and soon MSN Search) Webmaster tools. Doing so can help facilitate the ready removal of URLs gone wild.

If you haven’t authenticated your site yet, do so for speedier removal of rogue pages. If you’ve already authenticated your Web site at Google Webmaster Central or Yahoo Site Explorer, you’re one step closer to being able to readily remove undesirable content from indices forever.

In Yahoo, for example, sign in to Site Explorer, enter the URL/path in the “Explore URL” box, and hit the “Delete” button next to each URL you want removed. Be warned: when a URL is removed in such a manner, Yahoo deletes the specific URL as well as all the subpaths listed under that URL. Delete with caution.

Yahoo does help you, however, because it shows all the subpath URLs to be deleted during the confirmation process. After that, you’ll see a “Pending Delete” status in the “Actions” information page so you know when the URL removal goes into effect. Usually, Yahoo takes care of a request within 48 hours; you can set your Site Explorer preferences to receive an e-mail notification when the deed is done (just in case you need to prove it to the boss).

Remove Web pages from Google in a similar manner after authenticating your site at Webmaster Central.

Preventive Maintenance

Of course, you can keep your content out of the search engines in the first place by using the robots.txt protocol. This method will keep new or undesirable content out of the indices and help remove old, stale content already there. It takes some time, though, for the spiders to refresh their content and reflect your content’s removal. How long it takes for content to fall out of the search indices is a direct reflection of your site’s crawl frequencies.

Remember, using the robots.txt protocol to disallow the spiders from accessing certain content within your site ensures legitimate spiders don’t crawl your excluded URLs, but it doesn’t keep the URLs themselves out of the indices. That’s because the search engine spiders tend to discover references to excluded URLs from other Web sources, such as inbound links.
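Before you rely on a robots.txt file, it’s worth confirming that its rules block what you think they block. Here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt rules and the example.com URLs are invented for illustration, and the helper name is_crawlable is an assumption of this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block one directory and one single page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/launch-plan.html
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "http://www.example.com/private/report.html",
    "http://www.example.com/drafts/launch-plan.html",
    "http://www.example.com/index.html",
):
    verdict = "allowed" if is_crawlable(ROBOTS_TXT, url) else "blocked"
    print(f"{verdict}: {url}")
```

A check like this only tells you what a well-behaved spider is permitted to crawl. As noted above, a blocked URL can still turn up in the indices if other sites link to it.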

Even though it’s not a particularly speedy way to get content out of search indices, right now using the robots.txt file is just about the only way to get unwanted URLs out of MSN Search. Unfortunately, it may take several weeks for the engine to complete an indexing update that reflects your changes.

MSN also recommends adding a noindex meta tag to pages you don't want indexed, moving the URL to secure status at HTTPS, and removing the pages from your site. But these tactics don't always work in a timely manner. They do work well for preventive maintenance, though. If you have concerns about speedy removal of a Web page, try contacting MSN Search Site Owner Support directly. However, it could take several weeks to get a response.
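If you go the meta tag route, it helps to verify that the tag actually made it onto the page. The sketch below uses Python’s standard html.parser module to look for a robots meta tag containing a noindex directive. The class and function names are hypothetical, the sample markup is invented, and the check is deliberately simple: it reads only the generic robots tag and ignores any engine-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated, e.g. "noindex, nofollow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Invented sample page for illustration.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```

One caution: a spider has to fetch a page to read its meta tags, so a page that robots.txt blocks from crawling may never have its noindex tag seen. Pick the tool that fits the page rather than stacking both on the same URL.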

Wrap Up

If you take a little time now, you can avoid a potential data disaster in the future. Check out your 404 error process, get to know your robots.txt file, become familiar with the robots directives you can place in your meta tags, and get your site authenticated in Google and Yahoo (and maybe someday soon MSN Search). That way when the inexplicable happens and you need to make private information truly private again, you can act quickly and efficiently without panicking. In this case, an ounce of prevention is truly worth a pound of cure, especially if it’s your site’s or brand’s online reputation that gets pounded by the blogosphere for a little lapse in contextual judgment.
