Canonical issues

Before I start collecting feedback on the Bigdaddy data center, I want to talk a little bit about canonicalization, www vs. non-www, redirects, duplicate urls, 302 “hijacking,” etc. so that we’re all on the same page.
Q: What is a canonical url? Do you have to use such a weird word, anyway?
A: Sorry that it’s a strange word; that’s what we call it around Google. Canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages. For example, most people would consider these the same urls:
  • www.example.com
  • example.com/
  • www.example.com/index.html
  • example.com/home.asp
But technically all of these urls are different. A web server could return completely different content for all the urls above. When Google “canonicalizes” a url, we try to pick the url that seems like the best representative from that set.
Q: So how do I make sure that Google picks the url that I want?
A: One thing that helps is to pick the url that you want and use that url consistently across your entire site. For example, don’t make half of your links go to http://example.com/ and the other half go to http://www.example.com/ . Instead, pick the url you prefer and always use that format for your internal links.
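One way to check for mixed internal links is sketched below. It is only an illustration: the function name `inconsistent_links`, the preferred host, and the list of site hostnames are assumptions, not something Google provides.

```python
from urllib.parse import urlsplit


def inconsistent_links(links,
                       preferred_host="www.example.com",
                       site_hosts=("example.com", "www.example.com")):
    """Return internal links that use a hostname other than the preferred one.

    All names and defaults here are hypothetical, for illustration only.
    """
    bad = []
    for link in links:
        # Relative links have no hostname and are skipped.
        host = (urlsplit(link).hostname or "").lower()
        if host in site_hosts and host != preferred_host:
            bad.append(link)
    return bad


links = [
    "http://www.example.com/a",
    "http://example.com/b",
    "http://other.org/",
    "/relative",
]
print(inconsistent_links(links))
```

Only the link that points at the non-preferred hostname of the same site is reported; external and relative links are ignored.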
Q: Is there anything else I can do?
A: Yes. Suppose you want your default url to be http://www.example.com/ . You can configure your webserver so that if someone requests http://example.com/, it does a 301 (permanent) redirect to http://www.example.com/ . That helps Google know which url you prefer to be canonical. Adding a 301 redirect can be an especially good idea if your site changes often (e.g. dynamic content, a blog, etc.).
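The redirect decision itself is small. Here is a minimal sketch in Python, assuming www.example.com is the preferred hostname. The names `PREFERRED_HOST`, `ALTERNATE_HOSTS`, and `redirect_for` are made up for illustration; in practice this rule usually lives in the webserver configuration rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"   # assumption: the hostname you want indexed
ALTERNATE_HOSTS = {"example.com"}    # assumption: hostnames that should redirect


def redirect_for(url):
    """Return (301, target_url) if the request should be redirected, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALTERNATE_HOSTS:
        # Keep the path and query so deep links land on the matching page.
        target = urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path or "/", parts.query, "")
        )
        return 301, target
    return None


print(redirect_for("http://example.com/"))
print(redirect_for("http://www.example.com/"))
```

The first call yields a 301 to the www hostname; the second returns None because the request already uses the preferred hostname.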
Q: If I want to get rid of domain.com but keep www.domain.com, should I use the url removal tool to remove domain.com?
A: No, definitely don’t do this. If you remove one of the www vs. non-www hostnames, it can end up removing your whole domain for six months. Definitely don’t do this. If you did use the url removal tool to remove your entire domain when you actually only wanted to remove the www or non-www version of your domain, do a reinclusion request and mention that you removed your entire domain by accident using the url removal tool and that you’d like it reincluded.
Q: I noticed that you don’t do a 301 redirect on your site from the non-www to the www version, Matt. Why not? Are you stupid in the head?
A: Actually, it’s on purpose. I noticed that several months ago but decided not to change it on my end or ask anyone at Google to fix it. I may add a 301 eventually, but for now it’s a helpful test case.
Q: So when you say www vs. non-www, you’re talking about a type of canonicalization. Are there other ways that urls get canonicalized?
A: Yes, there can be a lot, but most people never notice (or need to notice) them. Search engines can do things like keeping or removing trailing slashes, trying to convert urls with upper case to lower case, or removing session IDs from bulletin board or other software (many bulletin board software packages will work fine if you omit the session ID).
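To make those normalizations concrete, here is a rough sketch. It is not how Google does it; the function name `normalize` and the list of session parameter names are assumptions. It lowercases only the scheme and hostname, because paths can be case-sensitive on some servers.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: common session parameter names; real software varies.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def normalize(url):
    """Apply a few simple canonicalization steps to a url."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # An empty path and "/" refer to the same resource on the home page.
    path = parts.path or "/"
    # Drop session IDs but keep every other query parameter in order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


print(normalize("HTTP://Example.COM"))
print(normalize("http://forum.example.com/viewtopic.php?t=42&sid=abc123"))
```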
Q: Let’s talk about the inurl: operator. Why does everyone think that if inurl:mydomain.com shows results that aren’t from mydomain.com, it must be hijacked?
A: Many months ago, if you saw someresult.com/search2.php?url=mydomain.com, that would sometimes have content from mydomain. That could happen when the someresult.com url was a 302 redirect to mydomain.com and we decided to show a result from someresult.com. Since then, we’ve changed our heuristics to make showing the source url for 302 redirects much more rare. We are moving to a framework for handling redirects in which we will almost always show the destination url. Yahoo handles 302 redirects by usually showing the destination url, and we are in the middle of transitioning to a similar set of heuristics. Note that Yahoo reserves the right to have exceptions on redirect handling, and Google does too. Based on our analysis, we will show the source url for a 302 redirect less than half a percent of the time (basically, when we have strong reason to think the source url is correct).
Q: Okay, how about supplemental results. Do supplemental results cause a penalty in Google?
A: Nope.
Q: I have some pages in the supplemental results that are old now. What should I do?
A: I wouldn’t spend much effort on them. If the pages have moved, I would make sure that there’s a 301 redirect to the new location of pages. If the pages are truly gone, I’d make sure that you serve a 404 on those pages. After that, I wouldn’t put any more effort in. When Google eventually recrawls those pages, it will pick up the changes, but because it can take longer for us to crawl supplemental results, you might not see that update for a while.
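The two cases above (moved versus truly gone) boil down to a simple lookup. This is a sketch only; the tables `MOVED` and `LIVE`, their paths, and the function `respond` are hypothetical.

```python
# Hypothetical data: old paths mapped to their new locations, plus live paths.
MOVED = {"/old-page.html": "/new-page.html"}
LIVE = {"/", "/new-page.html"}


def respond(path):
    """Return (status, location) for a requested path."""
    if path in MOVED:
        return 301, MOVED[path]   # page moved: permanent redirect
    if path in LIVE:
        return 200, None          # page exists: serve it
    return 404, None              # page is truly gone


print(respond("/old-page.html"))
print(respond("/deleted-page.html"))
```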
That’s about all I can think of for now. I’ll try to talk about some examples of 302s and inurl: soon, to help make some of this more concrete.
What Moz says about this
For SEOs, canonicalization refers to individual web pages that can be loaded from multiple URLs. This is a problem because when multiple pages have the same content but different URLs, links that are intended to go to the same page get split up among multiple URLs. This means that the popularity of the pages gets split up. Unfortunately for web developers, this happens far too often because the default settings for web servers create this problem. The following lists show the most common canonicalization errors that can be produced when using the default settings on the two most common web servers:

Apache Web Server:

  • http://www.example.com/
  • http://www.example.com/index.html
  • http://example.com/
  • http://example.com/index.html

Microsoft Internet Information Services (IIS):

  • http://www.example.com/
  • http://www.example.com/default.asp (or .aspx depending on the version)
  • http://example.com/
  • http://example.com/default.asp (or .aspx)
  • or any combination with different capitalization
Each of these URLs spreads out the value of inbound links to the homepage. This means that if links to the homepage are spread across these various URLs, the major search engines credit each URL separately rather than combining the credit.
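The effect can be illustrated with a small sketch that folds homepage variants onto one key and totals their inbound links. The link counts are invented for the example, and the helper names `homepage_key` and `combine` are hypothetical; this is not how any search engine actually computes credit.

```python
from collections import Counter
from urllib.parse import urlsplit

# Default index file names from the Apache and IIS lists above.
INDEX_FILES = {"index.html", "default.asp", "default.aspx"}


def homepage_key(url):
    """Map www/non-www, index-file, and capitalization variants to one key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.lower()
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment in INDEX_FILES:
        path = path[: -len(last_segment)]
    return host + (path or "/")


def combine(link_counts):
    """Sum inbound link counts across variants of the same page."""
    combined = Counter()
    for url, count in link_counts.items():
        combined[homepage_key(url)] += count
    return dict(combined)


# Hypothetical inbound link counts, split across four variants.
links = {
    "http://www.example.com/": 40,
    "http://www.example.com/index.html": 25,
    "http://example.com/": 20,
    "http://Example.com/Default.asp": 15,
}
print(combine(links))
```

Counted separately, the strongest variant has 40 links; combined, the homepage has 100.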
Luckily for SEOs, web developers developed methods for redirection so that URLs can be changed and combined. Two primary types of server redirects exist:
  • A 301 indicates an HTTP status code of "Moved Permanently"
  • A 302 indicates a redirect that is temporary
Though the difference appears to be merely semantics, the actual results are dramatic. Google does not pass link juice (ranking power) equally between normal links and server redirects. The engineers and SEOs at Moz have done a considerable amount of testing around this subject and concluded that 301 redirects pass between 90 percent and 99 percent of their value, whereas 302 redirects pass almost no value at all.
Canonicalization is not limited to the inclusion of alphanumeric characters. It also dictates forward slashes in URLs. If a web surfer goes to http://www.google.com they will automatically get redirected to http://www.google.com/ (notice the trailing forward slash). This is happening because technically the latter is the correct format for the URL. Although this is a problem that is largely solved by the search engines already (they know that www.google.com is intended to mean the same as www.google.com/), it is still worth noting because many servers will automatically 301 redirect from the version without the trailing slash to the correct version. By doing this, a link pointing to the wrong version of the URL loses between 1 percent and 10 percent of its worth due to the 301 redirect. The takeaway here is that whenever possible, it is better to link internally to the version with the trailing slash.
One common canonicalization mistake is accidentally creating an infinite loop between http://www.example.com and http://www.example.com/index.html. The solution to this common glitch is discussed in this post about redirecting an index file to your domain without looping.
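The linked post is not reproduced here, so the following is only a sketch of one common way to avoid the loop: decide on the redirect from the path the client actually requested, and serve the index file for "/" internally without issuing another redirect. The function names `handle` and `follow` are hypothetical.

```python
INDEX = "index.html"


def handle(requested_path):
    """Return (status, value): a redirect target for 301, a served file for 200."""
    if requested_path.endswith("/" + INDEX):
        # The client explicitly asked for the index file: redirect to the directory.
        return 301, requested_path[: -len(INDEX)]
    if requested_path.endswith("/"):
        # Serve the index file internally; no redirect, so no loop.
        return 200, requested_path + INDEX
    return 200, requested_path


def follow(path, max_hops=5):
    """Follow redirects and return (final_path, served_file, hops)."""
    hops = 0
    status, value = handle(path)
    while status == 301:
        hops += 1
        if hops > max_hops:
            raise RuntimeError("redirect loop")
        path = value
        status, value = handle(path)
    return path, value, hops


print(follow("/index.html"))
```

A request for /index.html is redirected once to / and then served, so the chain ends after a single hop.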
Another option for dealing with duplicate content is to utilize the rel=canonical tag. The rel=canonical tag passes the same amount of link juice (ranking power) as a 301 redirect, and often takes much less development time to implement.
The tag is part of the HTML head of a web page. The link tag itself isn't new; like nofollow, it simply uses a new rel value. For example:
<link href="http://www.example.com/canonical-version-of-page/" rel="canonical" />
This tag tells Bing and Google that the given page should be treated as though it were a copy of the URL www.example.com/canonical-version-of-page/ and that all of the links and content metrics the engines apply should actually be credited toward the provided URL.
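To check which canonical URL a page declares, the tag can be read with Python's standard html.parser module. This is a minimal sketch; the class name `CanonicalFinder` and the helper `find_canonical` are made up for illustration, and it returns only the first canonical link it sees.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = (
    "<html><head>"
    '<link href="http://www.example.com/canonical-version-of-page/" rel="canonical" />'
    "</head><body></body></html>"
)
print(find_canonical(page))
```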