Large Sites, Small Headaches

An agency-side SEO professional’s dream is to remove the thorn in a site’s paw: that single, hidden problem that, once fixed, opens the floodgates of search traffic. Sadly, this doesn’t happen often. Instead, it’s usually a slow climb up the graph, built on constant content production and spot-checking of architecture and results pages.

I recently examined the damage caused when site owners fail to carefully watch stray domains. This column investigates large-site maintenance issues that may not earn a passing thought amid the day-to-day demands most search engine marketers face but that, left unattended, can do slow, often indiscernible damage.

Balancing the Load Correctly

Depending on the type of site you run, traffic demands may be wildly unpredictable. A big story picked up by a popular social media site or spread virally can spike site traffic, leaving you unprepared. To address this possibility, many sites purchase load-balancing solutions that charge for overflow traffic only when used, instead of paying 100 percent of a bandwidth cost when they need it only 5 percent of the time.

Load balancers often take excess traffic demand and spill it over to servers called www2, www3, and so on. While this is a great solution for human visitors, search engines often end up on these servers, crawling indefinitely if the site uses relative URLs for navigation. A quick search for “www2” shows the unintended consequences of load balancing: thousands of URLs indexed on load-balancing subdomains that would be better left invisible to search engines.

To determine if this is your site’s problem, find out whether the site uses load-balancing measures for excessive traffic periods. If it does, investigate what the servers are called. Perform “site:” queries for those subdomains to see whether pages are indexed across your load-balancing servers and, if so, how many.

Just as engines recommend stripping session IDs from URLs served to robots, rampant indexing of load-balanced content often calls for similar measures. If necessary, use user-agent detection to ensure robots always get content on the main (“www”) server, and that only humans get sent to www2 and beyond.
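As a rough illustration of that user-agent check, here is a minimal Python sketch. The host names, the crawler pattern, and both function names are assumptions made up for this example; a real load balancer would apply the same idea in its own configuration language, and a production crawler list would be longer and kept current.

```python
import re

# Hypothetical, deliberately incomplete list of crawler user-agent substrings.
BOT_PATTERN = re.compile(r"googlebot|bingbot|slurp|baiduspider", re.IGNORECASE)

MAIN_HOST = "www.example.com"        # assumed main server
OVERFLOW_HOST = "www2.example.com"   # assumed overflow server


def is_robot(user_agent: str) -> bool:
    """Return True when the user-agent string looks like a known crawler."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def choose_host(user_agent: str, main_overloaded: bool) -> str:
    """Keep robots on the main host; send humans to overflow only under load."""
    if is_robot(user_agent) or not main_overloaded:
        return MAIN_HOST
    return OVERFLOW_HOST
```

The point of the design is that the robot check comes first, so a crawler never sees an overflow host name no matter how busy the main server is.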

While 301 redirects might sound enticing at first, you should probably resist the temptation unless you really know what you’re doing. If you redirect a user (or a bot) from www2 to www, but server demand is still high enough to trigger load-balancing, you might cause an endless loop that convinces visitors, both human and electronic, to quickly abandon the site.

GET to Your POST

Of all discussions about the importance of links, probably 98 percent focus on links coming to your Web site, as opposed to the links that point from your site to someone else’s. Did you know that under a certain set of conditions spammers can exploit your on-site search function to build links from your site to theirs?

About a year ago, we looked at a client whose Yahoo index counts began to skyrocket, far surpassing the number of pages that legitimately existed on the site. As we pored over Yahoo Site Explorer to see what was going on, we saw thousands of junk pages masquerading as search results from the site’s internal search feature. We soon learned the site’s search function used the GET method of form handling but failed to strip HTML out of queries.

The problem is that GET forms typically include the query string in the resulting URL, such as “/results.asp?query=free+ringtones”. POST forms, on the other hand, typically result in a single URL, such as “/results.asp”, no matter what was searched for.
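The difference is easy to see in a few lines of Python. This is only a sketch: the two function names are invented for illustration, and the “query” parameter name is taken from the example URL above.

```python
from urllib.parse import urlencode


def get_result_url(path: str, query: str) -> str:
    """A GET form puts the search text into the URL, so every distinct
    query yields a distinct, linkable, indexable address."""
    return f"{path}?{urlencode({'query': query})}"


def post_result_url(path: str, query: str) -> str:
    """A POST form sends the search text in the request body, so the
    results URL is the same for every query."""
    return path


print(get_result_url("/results.asp", "free ringtones"))
print(post_result_url("/results.asp", "free ringtones"))
```

Anything typed into the search box, including markup, ends up encoded in the GET URL, which is what makes each spammy query a separate page that someone can link to.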

Spammers had attacked our client’s site and input search queries that contained the HTML required to link back to their sites. For example, one indexed search results page began, “Following are the results of your search for ‘free ringtones,’” with “free ringtones” linking out to the spammer’s site. Then, the spammers had other sites in their network link to the search results page on the client’s site, which resulted in the page getting crawled and indexed by Yahoo. So Yahoo saw thousands of pages on the client site linking to spammy sites. Needless to say, these sites didn’t fit the client’s criteria for link-worthy sites.

Several safeguards exist for this problem. First, exclude search results pages from spider crawls through your robots.txt file. If you don’t want to do that, strip HTML from the text accepted in search fields. And finally, explore converting your form method from GET to POST to avoid distinct URLs that engines can index.
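The first two safeguards can be sketched in Python as follows. Everything named here is an assumption for illustration: the robots.txt rule reuses the “/results.asp” path from the example above, and clean_query and is_crawlable are hypothetical helpers, not part of any particular search product.

```python
import html
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of search results pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /results.asp
"""

TAG_PATTERN = re.compile(r"<[^>]*>")


def is_crawlable(url_path: str, user_agent: str = "*") -> bool:
    """Check a path against the robots.txt rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)


def clean_query(raw: str) -> str:
    """Drop anything that looks like an HTML tag, then escape what is left
    so stray angle brackets or quotes cannot turn into markup."""
    without_tags = TAG_PATTERN.sub("", raw)
    return html.escape(without_tags.strip())
```

Escaping after stripping is a belt-and-suspenders choice: a malformed tag that slips past the pattern is still displayed as harmless text on the results page rather than rendered as a link.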


Persistence is the hardest SEO technique to master. But it enables you to finely tune a site over time and avoid some of the problems that plague less experienced Webmasters.

