How Social Media Sites Should Use Internal SEO, Part 2

“How Social Media Sites Should Use Internal SEO, Part 1” discussed how some social media sites could enhance their own organic performance by implementing some simple architecture changes.

Continuing that theme, let’s look at Facebook and Twitter (again), then change course slightly to talk about the Wolfram Alpha engine.

Facebook and Vanity URLs

Historically, public Facebook URLs have been long, cluttered, and not particularly consumer-friendly, usually ending with a long, random number string. With last week’s announcement that non-numeric vanity pages will soon be available, that era should end, with shorter URLs quickly overtaking their aging ancestors.

If and when this happens, hopefully Facebook will do what Google Profiles has already done (and what LinkedIn should do): work to unify and consolidate those profile pages so that while any URL can be requested, only one will be considered the “authority.” This will allow the older, longer URLs to live forever while letting people begin promoting their newer, shorter, more concise URLs.

This can be done in several different ways, including using a 301 redirect (like Google Profiles), excluding old URLs via the robots.txt file, or adding the canonical link element to the pages’ HTML. And the addition of authoritative URLs to the site’s XML site map will supplement any or all of these techniques.
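As a sketch of the redirect option, a 301 for this scenario might look like the following Apache mod_rewrite rule. The profile ID and vanity name are invented for the example; this is not Facebook’s actual configuration.

```apache
# Illustrative only: permanently redirect an old numeric profile URL
# (e.g., /profile.php?id=12345) to its new vanity URL.
RewriteEngine On
RewriteCond %{QUERY_STRING} ^id=12345$
RewriteRule ^profile\.php$ /examplebrand? [R=301,L]
# The trailing "?" strips the old query string from the redirect target.
```

If redirects aren’t an option, the same consolidation signal could instead come from a canonical link element on each duplicate page.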

In addition to its potential vanity URL situation, Facebook has a few fairly insignificant issues in its current URL construction, such as duplicates created by the addition of ref= and viewas= parameters. Parameters like these make it difficult for engines to differentiate between otherwise identical pages, and they cause Google to show an abbreviated list of pages unless you click the “show all results” link, as a search for Starbucks Coffee’s pages demonstrates.

While not currently a significant issue, if you add a vanity URL to the mix, the number of pages for each person or company is likely to double if consolidation steps aren’t taken.

Another way that Facebook could help itself is to ensure that within a person’s or company’s profile, the pages are titled and described accurately. Currently, to reuse the Starbucks example, every one of Starbucks’ pages carries an identical (branded but non-specific) title and meta description, regardless of whether the page contains an overview, RSS story, photos, and so on.
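As a sketch of what that might look like (the titles and descriptions here are invented for illustration), each page within a profile would carry its own title and meta description:

```html
<!-- Overview page -->
<title>Starbucks Coffee on Facebook | Overview</title>
<meta name="description" content="Company overview and contact details for Starbucks Coffee on Facebook.">

<!-- Photos page -->
<title>Starbucks Coffee on Facebook | Photos</title>
<meta name="description" content="Photo albums posted to the Starbucks Coffee Facebook page.">
```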

(Note: Just before publishing time, Facebook clarified its rollout of vanity URLs. I’ll follow up on the discussion in a future column.)

Back to Twitter

Twitter is duplicating much of its content in a couple of ways, including multiple capitalization styles of profile URLs and the significant portions of its mobile site that have been indexed.

Add to that the split between http: and https: traffic. Pages resolve at both versions of the URL. Only a small fraction of URLs are indexed as https:, but 1.4 million is a big number regardless of its relative percentage.
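One way to consolidate both the capitalization variants and the http:/https: split would be a canonical link element on every variant, all pointing at a single preferred URL. The username below is hypothetical:

```html
<!-- Served identically at http://twitter.com/ExampleName,
     https://twitter.com/examplename, and so on; every copy
     names the same preferred URL. -->
<link rel="canonical" href="http://twitter.com/examplename">
```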

Wolfram Alpha and Load Balancing

Wolfram Alpha isn’t exactly a social media site. By its own admission it’s a “computational knowledge engine.” But until that niche gets a bit broader, most people will continue to call it a search engine.

With the apparent desire to offer users a fast experience, the Wolfram Alpha site uses a fairly common method of distributing the computational burden across multiple servers. Instead of seeing results on a www server subdomain, for example, you’ll often see a subdomain such as “www37”, which appends one or two digits to the traditional “www”.

This practice results in engines crawling vast numbers of Wolfram Alpha pages that aren’t necessarily on the www subdomain. The site’s “Food and Nutrition” examples page, for example, exists in at least four different locations, and picking a non-www server at random (I used www93) shows that Google has indexed more than 50 of its pages.

Traditional redirects don’t necessarily work in a situation like this because they go against the purpose of the architecture, which is to spread the traffic across multiple servers instead of consolidating it all onto www. If you find your site in this situation, however, you still have several options.

First, the site could detect engines beforehand and serve them pages only on the www subdomain. (This is really no different from the time-honored wisdom of detecting engines and stripping away session IDs.)
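A rough sketch of that first approach, using Apache mod_rewrite; the crawler list is abbreviated and the rule is illustrative, not Wolfram Alpha’s actual configuration:

```apache
RewriteEngine On
# When a known crawler requests a numbered subdomain (www37, www93, ...),
# send it a 301 to the plain www host; human visitors are untouched.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteCond %{HTTP_HOST} ^www[0-9]+\. [NC]
RewriteRule ^(.*)$ http://www.wolframalpha.com/$1 [R=301,L]
```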

Second, and possibly easiest, would be to apply the canonical link element so that engines know that the www subdomain is the one that should appear in search results.
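In practice, that would mean every copy of a page, whichever numbered server delivers it, emits the same canonical reference. The path below is illustrative only:

```html
<!-- Identical tag whether served from www, www37, www93, etc. -->
<link rel="canonical" href="http://www.wolframalpha.com/examples/food-and-nutrition">
```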

At least as of this writing, most engines, including Bing and Yahoo, are still indexing Wolfram Alpha’s /input/ pages. (These pages are Wolfram Alpha’s equivalent of search results.) This isn’t Wolfram Alpha’s fault: although at launch its robots.txt file was nearly empty, it recently added /input/ to its disallow list. Eventually, engines will stop showing those results.
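The relevant robots.txt entry (the disallow rule described above) looks like this:

```text
User-agent: *
Disallow: /input/
```

Engines honor the rule going forward, but already-indexed /input/ URLs can take time to drop out of results.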


Due to deadline pressure and the constant need to deliver interface and feature improvements, developers put basic SEO items like these on the back burner. After all, the site “works” without them, right? However, they should consider them quick investments in long-tail traffic acquisition.
