Canonicalization Made Simple
The road to short and sweet search URLs.
Technically speaking, canonicalization is “the process of converting data that has more than one possible representation into a ‘standardized’ canonical representation.”
Search engine algorithms include a mathematical equation that compares different representations for similarity, counting the number of distinct data structures, to impose a meaningful, canonical sorting order.
That makes sense… right? Maybe for software engineers, computer programmers, math majors, and the like. But let’s make this a bit simpler.
Plainly speaking, search engines like Google use a canonicalization process to present users with short and sweet URLs. Think about this for a moment and consider which URL the average user would most likely click on when presented with these choices:
If you believe Google’s canonical preference would be www.yourdomain.com, even when all three URLs arrive at the same destination, you can proudly say you understand the fundamentals of canonicalization.
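As a rough illustration of that idea, here is a minimal Python sketch that collapses several spellings of a homepage URL into one canonical form. It is not how any search engine actually does it. The variant URLs, the `DEFAULT_DOCUMENTS` set, and the `canonicalize` function are all hypothetical, chosen only to show the principle of many representations mapping to one.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: file names a server might serve by default for a directory URL.
DEFAULT_DOCUMENTS = {"index.html", "index.htm", "index.php", "default.asp"}


def canonicalize(url, preferred_host="www.yourdomain.com"):
    """Collapse common variants of a URL into one canonical form (a sketch)."""
    # Add a scheme when it is missing so urlsplit can find the host.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Treat the bare domain and the www host as the same site.
    if preferred_host.startswith("www."):
        bare = preferred_host[4:]
    else:
        bare = preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host

    # Keep an explicit port if one was given.
    if parts.port:
        host = "{}:{}".format(host, parts.port)

    # Drop a trailing default document such as /index.html.
    head, _, tail = parts.path.rpartition("/")
    path = head + "/" if tail.lower() in DEFAULT_DOCUMENTS else parts.path
    if not path:
        path = "/"

    # The fragment is discarded; the query string is kept as-is.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

Under these assumptions, `yourdomain.com`, `http://www.yourdomain.com/index.html`, and `HTTP://YourDomain.com/Index.HTML` all come out as `http://www.yourdomain.com/`. Real search engines weigh many more signals than string cleanup, as the comparison that follows suggests.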
Let’s look more closely at the major search engines’ canonical preferences to see what other factors determine which URL is presented in search query results.
For the sake of discussion, let’s complete a search for “milwaukee brewers” in Google, Yahoo, and MSN to compare the results.
Google offers the following top results:
The Official Site of The Milwaukee Brewers: Homepage
Features scores, game schedules, roster, news, history and forums.
brewers.mlb.com/ – 78k – Cached – Similar pages
Schedule : 2007 Brewers Schedule – milwaukee.brewers.mlb.com/NASApp/mlb/s…
Active Roster – milwaukee.brewers.mlb.com/…/roster_active.jsp?c_id=mil
Ticket Center – milwaukee.brewers.mlb.com/…/ticketing/index.jsp?c_id=mil
Help : Job Opportunities – mlb.mlb.com/NASApp/mlb/mlb/help/jobs.jsp?c_id=mil
More results from brewers.mlb.com »
Yahoo offers the following top result:
Official site of the Milwaukee Brewers. Features up-to-date stats and results, player bios, minor league information, ticket and merchandise ordering info, player …
Category: Major League Baseball > Milwaukee Brewers
www.milwaukeebrewers.com – 79k – Cached – More from this site
And MSN Live Search offers the following top results:
Milwaukee Brewers : The Official Site
MLB Sites MLB.com Angels Astros Athletics Blue Jays Braves Brewers Cardinals Cubs Devil Rays Diamondbacks Dodgers Giants Indians Mariners Marlins Mets Nationals Orioles Padres Phillies Pirates Rangers …
Note that no one top result is more relevant than the other. All indexed listings resolve to http://milwaukee.brewers.mlb.com/index.jsp?c_id=mil by way of a temporary redirect (302).
Why, then, is one domain displayed in Google and MSN and another in Yahoo for the same result? Are the Milwaukee Brewers spoofing the search engines using temporary redirects and multiple domains?
Not exactly. Canonicalization processes simply level the playing field. These algorithmic elements vary from search engine to search engine.
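To make the redirect behavior concrete, the following Python sketch models the situation with a simulated redirect table instead of live HTTP requests. The table entries reflect only what is described above (both hostnames answering with a 302 to the same page). The exact starting URLs, including the `http://` scheme and trailing slash, and the helper names `resolve` and `same_destination` are assumptions made for illustration.

```python
# Simulated redirect table: no network requests are made in this sketch.
# Assumption: each hostname's homepage is written as http://<host>/.
TARGET = "http://milwaukee.brewers.mlb.com/index.jsp?c_id=mil"
REDIRECTS = {
    "http://brewers.mlb.com/": (302, TARGET),
    "http://www.milwaukeebrewers.com/": (302, TARGET),
}


def resolve(url, redirects=REDIRECTS, max_hops=5):
    """Follow simulated redirects; return (final_url, status_codes_seen)."""
    statuses = []
    while url in redirects:
        if len(statuses) >= max_hops:
            raise RuntimeError("too many redirects")
        status, url = redirects[url]
        statuses.append(status)
    return url, statuses


def same_destination(a, b):
    """True when two starting URLs end up at the same final URL."""
    return resolve(a)[0] == resolve(b)[0]
```

In this model, `same_destination("http://brewers.mlb.com/", "http://www.milwaukeebrewers.com/")` is true, and each path records a single 302 hop. Because a 302 signals a temporary move, each engine is left to decide for itself which of the addresses to display, which is where the differences below come from.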
Google knows the two domains are exactly the same and treats them as such when it comes to inbound links. Using the link: query operator, Google reports the same 2,200 inbound links for both link:brewers.mlb.com and link:www.milwaukeebrewers.com.
A lot of SEO folks have talked about Google’s preference for subdomains. This is proof of that preference because that’s how the site’s actually crawled and indexed. Do a query for “site:brewers.mlb.com” and you’ll get some 7,880 pages. Do the same for “site:www.milwaukeebrewers.com” and you’ll get “did not match any documents.”
To provide users with its preferred results, Google relegates www.milwaukeebrewers.com to its no man’s land of non-indexation. Google canonically prefers to display the pretty little subdomain, brewers.mlb.com, as its most relevant result for a “milwaukee brewers” search query.
MSN Live Search just isn’t as bright when it comes to algorithmic adjustments. It indexes nearly 1,300 pages of “site:brewers.mlb.com” and six pages of “site:www.milwaukeebrewers.com.” Its algorithms credit “link:www.milwaukeebrewers.com” with nearly 14,000 inbound links and “link:brewers.mlb.com” with over 14,000. MSN Live Search duplicates its own results by including the non-canonical URL in the results.
Getting any bright ideas about MSN Live Search, subdomains, and temporary redirects? Small wonder MSN Live Search has its filters set to “high” to stop spamming itself and present any semblance of canonicalization.
The question that remains is Yahoo’s preference for www.milwaukeebrewers.com over its subdomain counterpart, brewers.mlb.com. Based on information from Yahoo Site Explorer, brewers.mlb.com has 735 pages indexed and 228 inbound links. Meanwhile, www.milwaukeebrewers.com has 45 pages indexed and 6,331 inbound links.
Should Webmasters redesign their sites to include subdomains if they want to make headway in Google and MSN Live Search? Absolutely not. Subdomains are not a secret weapon for improved indexation.
Subdomains do make sense, however, when each subsection of a domain contains completely unique content addressing different topics, such as the collection of baseball teams at mlb.com.
It would be interesting to test the best way to shift canonicalization processes in the major search engines. Would submitting the top-level domain as the preferred result influence Google and MSN Live Search indexation? Could XML sitemap feeds encourage Yahoo to present the subdomain in natural search results? These are questions for another day while we see if mlb.com will play ball.
Want more search information? ClickZ SEM Archives contain all our search columns, organized by topic.