Diagnosing Search Issues from the Query Box, Part 2

August 6, 2008

Additional ways to use query operators to gauge your site's presence.

In my last column, I discussed ways to use the site: and inurl: operators to detect indexing issues with your site. In this column, I discuss additional operators (such as cache:) and the ways in which they help you diagnose search engine issues and view your site the way engines do.

Additional Uses for the Site Operator

While sites such as CopyScape do a nice job of detecting duplicate content around the Web, I sometimes like the flexibility of finding duplicate content myself. My last column showed ways of detecting unwitting duplicate content on your own site (due to canonicalization issues). But what about your content being used on other sites?

To detect this, I recommend using the site: operator to filter out your site. Scan your site to find a string of text that should appear only on your site, then plug it into a query like this:

"this is the unique string of text I found on my site" -site:yourdomain.com

The quotation marks are required to search for the exact text string. The minus sign before the site: operator tells Google to exclude your domain from results. Consequently, the only results on the SERP (search engine results page) should be third-party sites using your copy.
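If you check many passages, building the query by hand gets tedious. The following is a minimal Python sketch of the query format described above. The function names and the search base URL are assumptions made for illustration, not anything the engines publish.

```python
from urllib.parse import quote_plus


def duplicate_content_query(phrase: str, own_domain: str) -> str:
    """Build an exact-phrase query that excludes one domain from results."""
    # Drop stray double quotes and collapse whitespace so the phrase
    # stays inside a single pair of quotation marks.
    cleaned = " ".join(phrase.replace('"', " ").split())
    return f'"{cleaned}" -site:{own_domain}'


def search_url(query: str, base: str = "https://www.google.com/search") -> str:
    """URL-encode the query; the base URL is an assumed search endpoint."""
    return f"{base}?q={quote_plus(query)}"


query = duplicate_content_query(
    "this is the unique string of text I found on my site",
    "yourdomain.com",
)
print(query)
print(search_url(query))
```

The first printed line is the query to paste into the search box; the second is the same query as a link you could bookmark.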

Keep in mind that you need to be cautious before shouting "plagiarism" or "copyright infringement." These sites may be quoting yours in a fair-use context, or they might be directory sites that have pulled a description of your site prior to linking back to you. The best text strings to search for are longer, more obscure passages that really should be on your site only.

Searching for Specific URLs

Several years ago, you could simply enter a URL in a Google search box, and the resulting page would give you a short but helpful list of information about that particular URL, including links to related sites, the cached version of the page, links pointing to that page (although this feature is notoriously shallow in its coverage), pages that mention the specific URL text, and so on.

This sort of query was helpful less for the links to additional information than for quickly determining whether a specific engine had indexed a page. In short, a results page that said "Sorry, no information is available for the URL [URL]" was a quick way to spot an indexing problem, because that response was reserved for pages that the engine had not yet indexed or that purposely avoided indexing (such as via a robots.txt exclusion or a robots "noindex" meta tag).

Today, searching for a simple URL still works at MSN/Live and Yahoo. A couple of years ago, however, Google changed how it handles URL queries. At Google, you must now precede a URL with the text info: to get indexing and link information. Make sure that you leave no space between the colon and the URL when performing this query.
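The no-space rule is easy to get wrong when pasting URLs. Here is a small Python sketch that builds the info: query as described above; the helper name is hypothetical.

```python
def info_query(url: str) -> str:
    """Return an info: query with no space between the colon and the URL."""
    cleaned = url.strip()
    # A URL containing whitespace would split the query into separate terms.
    if not cleaned or any(ch.isspace() for ch in cleaned):
        raise ValueError("URL must be non-empty and contain no whitespace")
    return "info:" + cleaned


print(info_query("  www.usanetwork.com "))
```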

In my opinion, the last of these features (the list of pages that mention the URL as text) is of limited value, although it can represent a link-building opportunity, sometimes turning up less savvy sites that mention your URL as text but not as a link.

The Difference Between Cache and Text Cache

The cache: operator is a terrific tool that helps you determine whether engines see your page. Ironically, it's not an entirely accurate way to show you exactly what engines see. I can't emphasize this enough, so I'll rephrase: The cached version of your page is not necessarily the exact same version of the page that engines see, monitor, and consider in their algorithms.

To see the version of your page that engines see, you must take a technological step backwards and view the text cache of the page. The text cache strips away deceptive script code, rich media, and graphics, leaving only the skeletal remains of your page: the text and links.

Consider, for example, the cache version of www.usanetwork.com. You can see some rich media and graphics and a few links, but the main body section is empty.

Contrast that view with the text cache of the same page.

While the regular cached version of a page "includes" content such as rich media and JavaScript-spawned Flash files, don't assume Google notices or considers such content. In most cases, it's included only because Google has pulled the script and Flash code into its index -- not because it understands or weighs it.

To find the text cache version of a page at Google, you can add &strip=1 to the end of a cached URL, such as in the following:

http://64.233.167.104/search?q=cache:www.usanetwork.com&pws=0
http://64.233.167.104/search?q=cache:www.usanetwork.com&pws=0&strip=1
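To derive the second URL from the first without editing it by hand, something like the following Python sketch would work. It assumes only the strip=1 parameter shown above; the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def text_cache_url(cached_url: str) -> str:
    """Return the cached URL with strip=1 set exactly once."""
    parts = urlsplit(cached_url)
    # Keep every existing parameter except a previous strip value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "strip"
    ]
    params.append(("strip", "1"))
    # safe=":/" leaves the colon in cache: and any URL slashes readable.
    new_query = urlencode(params, safe=":/")
    return urlunsplit(parts._replace(query=new_query))


print(text_cache_url("http://64.233.167.104/search?q=cache:www.usanetwork.com&pws=0"))
```

Parsing the query string rather than concatenating text means the helper also behaves sensibly when the URL already carries a strip parameter.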

You can also find a link to the text cache at the top of any cached page in Google. Look for the copy "Text-only version" at the top-right of a cached page, such as this cached version of the ClickZ home page.

Conclusion

Cached pages are available at all major engines, although only Google supports the cache: operator itself. For Yahoo and MSN/Live, you can search for a URL, then find a link to the cached version on the results page. Also, Google is the only one of the big three that offers a separate text cache.

ABOUT THE AUTHOR

Erik Dafforn

Erik Dafforn is the executive vice president of Intrapromote LLC, an SEO firm headquartered in Cleveland, Ohio. Erik manages SEO campaigns for clients ranging from tiny to enormous and edits Intrapromote's blog, SEO Speedwagon. Prior to joining Intrapromote in 1999, Erik worked as a freelance writer and editor. He also worked in-house as a development editor for Macmillan and IDG Books. Erik has a Bachelor's degree in English from Wabash College. Follow Erik and Intrapromote on Twitter.
