Deduping Duplicate Content

September 23, 2009

The Web is one tangled mess of equally irrelevant content. Four quick tests to tell if duplicate content is an issue for your site.

Some interesting things always come out at conferences, which is probably why so many folks still go to them, as opposed to some sort of virtual offering. It might be a big announcement from the search engines, an entertaining keynote speech, or a snippet of conversation over lunch, but some tidbit of information always makes you glad you went to the trouble and expense of participating.

One interesting thing that came out of SES San Jose's Duplicate Content and Multiple Site Issues session in August was the sheer volume of duplicate content on the Web. Ivan Davtchev, Yahoo's lead product manager for search relevance, said "more than 30 percent of the Web is made up of duplicate content."

At first I thought, "Wow! Three out of every 10 pages consist of duplicate content on the Web." My second thought was, "Sheesh, the Web is one tangled mess of equally irrelevant content." Small wonder trust and linkage play such significant roles in determining a domain's overall authority and consequent relevancy in the search engines.

Three Flavors of Bleh

Davtchev went on to explain three basic types of duplicate content:

  1. Accidental content duplication: This occurs when Webmasters unintentionally allow content to be replicated by non-canonicalization, session IDs, soft 404s, and the like.

  2. Dodgy content duplication: This primarily consists of replicating content across multiple domains.

  3. Abusive content duplication: This includes scraper spammers, weaving or stitching (mixed and matched content to create "new" content), and bulk content replication.

Fortunately, Greg Grothaus from Google's search quality team had already addressed the duplicate content penalty myth, noting that Google "tries hard to index and show pages with distinct information."

It's common knowledge that Google uses a checksum-like method for initially filtering out replicated content. For example, most Web sites have a regular and a print version of each article. Google only wants to serve up one copy of the content in its search results, and which copy it chooses is predominantly determined by linking prowess. Because most print-ready pages are dead-end URLs sans site navigation, it's relatively simple to predict which page Google prefers to serve up in its search results.
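The checksum idea can be sketched in a few lines of Python. This is a hypothetical illustration only; real search engines use far more sophisticated fingerprinting (shingling and near-duplicate detection rather than exact hashes), and the function and page names here are invented for the example:

```python
import hashlib

def content_fingerprint(html: str) -> str:
    """Reduce a page to a fingerprint of its content.

    Real crawlers normalize much more aggressively (stripping markup,
    navigation, and boilerplate); here we just collapse whitespace
    and case before hashing.
    """
    normalized = " ".join(html.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def filter_duplicates(pages: dict) -> list:
    """Keep only the first URL seen for each distinct fingerprint."""
    seen = {}
    for url, html in pages.items():
        seen.setdefault(content_fingerprint(html), url)  # first URL wins
    return list(seen.values())

# Hypothetical crawl: a regular article, its print twin, and a distinct page
pages = {
    "http://example.com/article":       "<p>Deduping   duplicate content</p>",
    "http://example.com/article/print": "<p>Deduping duplicate content</p>",
    "http://example.com/other":         "<p>Something else entirely</p>",
}
unique = filter_duplicates(pages)
print(len(unique))  # 2 -- the print page collapses into the regular article
```

In this toy model, whichever URL the crawler encounters first wins; in practice, as noted above, the surviving copy is the one with the stronger link profile.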

In exceptional cases of content duplication that Google perceives as an abusive attempt to manipulate rankings or deceive users, Google will "make appropriate adjustments" to the indexation and rankings of the sites involved, according to Grothaus. Even though Google doesn't consider dilution of link popularity a penalty, anyone who has been on the receiving end of this particular duplicate content issue might say otherwise.

Test and Tune

How do you know if duplicate content is an issue for your site? Simply run a couple of quick tests:

  • If you have multiple URLs for your home page, you have duplicate content.

  • If you go to any page on your site, remove the "www" from the URL, and the same content is still served up, you may have duplicate content.

  • If you create an error by appending gibberish to a URL string or removing a directory path and the same content is still served up without triggering a 404 error page, you probably have duplicate content.

  • If you can isolate a URL construct from your print pages and run an advanced indexation check, such as "inurl:/print/" in Google, then you definitely have duplicate content indexed.
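The first two tests boil down to the same underlying problem: several URL spellings resolving to one page. A small Python sketch makes the point, using hypothetical normalization rules (force the www host, drop session-ID parameters, and treat a bare domain, "/", and "/index.html" as the same home page) that you would adapt to your own site:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Collapse common accidental variants of the same page.

    Illustrative rules only: prefer the www host, strip a
    hypothetical "sessionid" query parameter, and fold index
    pages into the site root.
    """
    scheme, host, path, query, _ = urlsplit(url)
    if not host.startswith("www."):
        host = "www." + host
    if path in ("", "/index.html", "/index.php"):
        path = "/"
    params = [p for p in query.split("&")
              if p and not p.startswith("sessionid=")]
    return urlunsplit((scheme, host, path, "&".join(params), ""))

# Four spellings of one home page -- classic accidental duplication
variants = [
    "http://example.com",
    "http://example.com/index.html",
    "http://www.example.com/?sessionid=abc123",
    "http://www.example.com/",
]
print({canonical_url(u) for u in variants})  # one canonical URL survives
```

If a quick script like this collapses many of your indexed URLs into one canonical form, the search engines are almost certainly seeing the same duplication.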

Because these are usually accidental duplicate content issues, remedies are relatively easy to apply to your site. Simply read and employ all the best practices delineated on the search engines' Webmaster blogs and forums. Here are the big three:

  • Properly canonicalize your site, and 301 (permanently) redirect any duplicate home page URLs to your canonical domain.

  • Use robots.txt to eliminate site-level content duplication, and use meta robots tags to purge page-level duplicate content.

  • Use canonical tags to indicate the preferred version of your content.

Do these three things and you can readily eliminate much of the duplicate content that was accidentally created.
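On an Apache server, for example, the canonicalization and 301 redirect steps might look something like this in an .htaccess file. This is a sketch under stated assumptions: mod_rewrite is enabled, and www.example.com stands in for your preferred canonical host.

```apache
RewriteEngine On

# 301-redirect the non-www host to the canonical www domain
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# Collapse duplicate home page URLs (index.html, index.php) onto the root
RewriteRule ^index\.(html|php)$ http://www.example.com/ [R=301,L]
```

Page-level preferences can then be declared with a canonical tag in the head of each duplicate page, pointing at the version you want indexed.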

Trust Me

It's really that simple to employ remedies for accidental content duplication. Build a site out of user-friendly URLs to optimize your branding efforts and usability while eliminating inefficient crawling, and your Web site will be well on its way to earning trust from the search engines.

If any of this is too complex for your circumstances, you might want to call in some professional assistance. If dodgy or abusive levels of duplicate content are issues for your Web site or network of sites, stop back here in a couple of weeks when we continue the conversation about duplicate content and the issues it creates in the search engines. Until then, keep testing and tuning your results.



P.J. Fusco

P.J. Fusco has been working in the Internet industry since 1996 when she developed her first SEM service while acting as general manager for a regional ISP. She was the SEO manager for Jupitermedia and has performed as the SEM manager for an international health and beauty dot-com corporation generating more than $1 billion a year in e-commerce sales. Today, she is director for natural search for Netconcepts, a cutting-edge SEO firm with offices in Madison, WI, and Auckland, New Zealand.

