The first half of this series discussed some helpful Google Webmaster Tools (GWT) reports, including an external link report and ways to find rankings your site may be narrowly missing. This installment covers additional reports that benefit GWT users.
Robots.txt Verification and Error Checking
A robots.txt file isn’t necessary for a site that performs well organically, but engines are finding more ways a robots.txt file can benefit site owners. To access this report from the main GWT area, click the “Tools” link in the left navigation, then click “Analyze robots.txt” from the submenu.
If you have a robots.txt file, this report helps you determine whether specific URLs are excluded as they should be. In the box labeled “Test URLs against this robots.txt file,” enter an actual URL from your site and click the “Check” button. In the subsequent “URL Results” field, Google will tell you whether the URL is “blocked” or “allowed.” Running tests helps determine how and when to use such characters as wildcards and trailing slashes to most effectively block those URLs you don’t want indexed.
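To get a feel for how trailing slashes change what a rule blocks before you test in GWT, here is a minimal sketch using Python's standard-library robots.txt parser. The rules and URLs are hypothetical. Note that this parser does simple prefix matching and does not implement Google's wildcard extensions, so treat it as a rough local check, not a stand-in for the GWT report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp
"""

def check_urls(robots_txt, urls, user_agent="*"):
    """Return {url: 'allowed' or 'blocked'} under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        url: "allowed" if parser.can_fetch(user_agent, url) else "blocked"
        for url in urls
    }

results = check_urls(ROBOTS_TXT, [
    "http://www.example.com/private/report.html",  # under the /private/ folder
    "http://www.example.com/private",              # no trailing slash
    "http://www.example.com/tmp-files/a.html",     # shares the /tmp prefix
    "http://www.example.com/products/",            # matches no rule
])

for url, verdict in results.items():
    print(verdict, url)
```

In this sketch the rule with a trailing slash ("/private/") leaves the bare "/private" URL allowed, while the rule without one ("/tmp") also blocks "/tmp-files/", which may be more than you intended.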
This report is also helpful if you use your robots.txt file to provide engines with the location of your XML sitemap. Remember, however, this page doesn’t validate the XML sitemap itself. It validates only the way you refer to the file. In other words, you can point to the sitemap in a valid way, but the sitemap itself may not validate. Compare this with asking someone for directions to a specific restaurant. The directions may be accurate, but the restaurant could be out of business. Similarly, the sitemap reference in the robots.txt file can be valid, but the sitemap itself might not be.
Fortunately, GWT can also tell you whether your XML sitemap is valid. Find this report in the main “Sitemaps” section. If your sitemap feed is valid, you’ll see “OK” in the “Sitemap Status” column of that report.
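The difference between a valid sitemap reference and a valid sitemap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names and sample data are hypothetical, the check covers well-formedness and the root element only, and it is far less thorough than the validation GWT performs.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_references(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    refs = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            refs.append(value.strip())
    return refs

def sitemap_urls(xml_text):
    """Return the <loc> values, or None if the XML is malformed
    or the root element is not a sitemap <urlset>."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    if root.tag != "{%s}urlset" % SITEMAP_NS:
        return None
    return [loc.text.strip()
            for loc in root.iter("{%s}loc" % SITEMAP_NS) if loc.text]

# Hypothetical sample data.
robots = "User-agent: *\nDisallow: /private/\nSitemap: http://www.example.com/sitemap.xml\n"
good = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>http://www.example.com/</loc></url></urlset>')
broken = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url>'

refs = sitemap_references(robots)   # the "directions"
good_urls = sitemap_urls(good)      # the "restaurant" is open
broken_urls = sitemap_urls(broken)  # the "restaurant" is closed
```

The first function only confirms that the directions are written correctly. The second is the separate step of checking that something usable is actually at that address.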
One important thing to remember: excluding files via robots.txt doesn’t guarantee they stay out of results. Though it’s rare, excluded pages can still show up in results pages if they have significant external link popularity. On our company blog, we have a login link for staff members. A Google search for that page shows a link to it but not a valid title or description. Google partially crawled that link but couldn’t fully access it, as it’s password protected.
Overcoming Canonical Issues
Canonical issues on a site are architectural glitches that inadvertently create multiple versions of identical URLs. One example is a site that resolves with or without the “www” prefix. Another example is a page that resolves at both the folder level, such as “/products/,” and the page level, like “/products/index.aspx.”
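If you have a list of URLs from a crawl or your server logs, a short script can flag likely canonical duplicates. The following Python sketch is one possible approach, not a standard tool: the function names, the list of default documents, and the example.com URLs are all assumptions for illustration, and it ignores query strings entirely.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed default documents; adjust for the server in question.
DEFAULT_DOCUMENTS = ("index.aspx", "index.html", "default.aspx")

def canonical_key(url):
    """Reduce a URL to a key that ignores the 'www' prefix and default documents."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"  # treat /products/index.aspx as /products/
    return host + path

def duplicate_groups(urls):
    """Group URLs that share a canonical key; keep groups with more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}

crawled = [
    "http://example.com/products/",
    "http://www.example.com/products/index.aspx",
    "http://www.example.com/products/",
    "http://www.example.com/about/",
]
duplicates = duplicate_groups(crawled)
```

Here the three "/products/" variations land in one group, which is exactly the kind of cluster you would want to consolidate.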
It’s true engines are getting better at detecting and accounting for canonical issues. It’s also true that you can never provide engines with too much information about the proper way to crawl, index, and interpret a site. So if you’re unsure about whether such a setting is necessary, my advice is to use it.
GWT has an area that lets you account for the “www” prefix issue. From the “Tools” menu, select “Set preferred domain.” On this page you’ll see three options:
- Display URLs as www.site.com (for both www.site.com and site.com)
- Display URLs as site.com (for both www.site.com and site.com)
- Don’t set an association
Select the appropriate choice, and click the OK button. The setting takes a while to take effect, and undoing it can take even longer if you change your mind down the road. So be sure about your needs before you make a choice.
Important points to remember about this feature:
- This setting is only for the “www” prefix issue. Other subdomains, such as shop.site.com, require their own versions of robots.txt, Google verification files, and so on.
- Experienced coders may already have canonical redirects set up for their sites, via either their .htaccess file (for Apache servers) or their IIS control panel. If so, this GWT feature is redundant and likely unnecessary. Just make sure not to send conflicting instructions to Google about this. In other words, don’t tell engines to use the “www” version via your .htaccess file and tell Google to use the “non-www” version via GWT. That’s asking for trouble.
- While this report enables you to determine which sorts of URLs appear in Google results pages, there’s no evidence that the report is a “true” fix for canonical problems. In other words, there’s no reason to believe that link popularity to your “non-www” pages will somehow magically transfer to their “www” counterparts simply by using the tool.
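For readers who want to see what a server-side canonical redirect decides, here is a minimal Python sketch of the logic. It is an illustration only, not an .htaccess or IIS configuration: the function name and example.com hosts are hypothetical, and it handles only the “www” prefix, leaving other subdomains alone.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_redirect(url, preferred_host):
    """Return the URL a 301 redirect should point to, or None if no redirect applies.

    Only the 'www' prefix is considered. Other subdomains are treated as
    separate sites. Any port in the original URL is dropped in this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    preferred_host = preferred_host.lower()

    def strip_www(name):
        return name[4:] if name.startswith("www.") else name

    if strip_www(host) != strip_www(preferred_host):
        return None  # a different site or subdomain
    if host == preferred_host:
        return None  # already on the preferred version
    return urlunsplit((parts.scheme, preferred_host, parts.path,
                       parts.query, parts.fragment))

target = canonical_redirect("http://example.com/products/?id=3", "www.example.com")
print(target)
```

Whatever value plays the role of preferred_host on your server should match the option you select in GWT, so the two never disagree.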
Every second you spend poking around GWT is time well spent. I’ve watched its evolution closely, and I find the GWT team to be very responsive to user requests and concerns and really focused on providing data that’s truly helpful.
Next: I’ll spend some time in Yahoo’s Site Explorer and discuss ways Yahoo is informing site owners about their sites.