At the recent SMX Advanced Conference in Seattle, one of the big sessions was on duplicate content. There is great blow-by-blow coverage in posts by Vanessa Fox and by Matt McGhee. You can also see an older post about dupe content here by Chris Boggs.
At the start of this session, the search engines all talked about various types of duplicate content. But let’s take a deeper look at the way that duplicate content happens. Here are 12 ways people unintentionally create dupe content:
- Build a site for the sole purpose of promoting affiliate offers, and use the canned text supplied by the agency managing the affiliate program.
- Generate lots of pages with little unique text. Weak directory sites could be an example of this.
- Use a CMS that allows multiple URLs to refer to the same content. For example, do you have a dynamic site where http://www.yoursite.com/level1id/level2id pulls up the exact same content as http://www.yoursite.com/level2id? If so, you have duplicate content. This is made worse if your site actually refers to these pages using multiple methods. A surprising number of large sites do this.
- Use a CMS that resolves subdomains to your main domain, so the same pages are served under more than one host name. As with the prior point, a surprising number of large sites have this problem as well.
- Generate pages that differ only by simple word substitutions. The classic example of this is to generate pages for blue widgets for each state where the only difference between the pages is a simple word substitution (e.g. Alabama Blue Widgets, Arizona Blue Widgets, …).
- Forget to implement a canonical redirect. For example, not 301 redirecting http://yoursite.com to http://www.yoursite.com (or vice versa) for all the pages on your site. Regardless of which form you pick to be the preferred form of URL for your site, someone out there will link to the other form, so implementing the 301 redirect will eliminate that duplicate content problem for you, as well as consolidate all the page rank from your inbound links.
- Have your on-site links back to your home page point to http://www.yoursite.com/index.html (or index.htm, or index.shtml, or …). Since most of the rest of the world will link to http://www.yoursite.com, you have now created duplicate content and divided your page rank.
- Implement printer pages without using robots.txt to keep them from being crawled.
- Implement archive pages without using robots.txt to keep them from being crawled.
- Use session ID parameters on your URLs. This means that every time the crawler comes to your site, it thinks it is seeing different pages.
- Implement parameters on your URLs for other tracking-related purposes. One of the most popular reasons is to run an affiliate program. The search engine will see http://www.yoursite.com?affid=1234 as a duplicate of http://www.yoursite.com. This is made worse if you leave the “affid” on the URL throughout the user’s visit to your site. A better solution is to store the affiliate information in a cookie when the visitor arrives, and then remove the ID from the URL. Note that I have seen a case where an affiliate had a strong enough site that http://www.yoursite.com?affid=1234 started showing up in the search engines rather than http://www.yoursite.com (NOT good).
- Implement a site where parameters on URLs are ignored, so the same page comes back with or without them. If you, or someone else, link to your site with a parameter on the URL, it will look like dupe content.
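Several of the items above (www versus non-www, index.html links, session IDs, affiliate parameters) come down to the same fix: decide on one preferred URL for each page and 301 redirect everything else to it. Here is a minimal Python sketch of that idea. The host names come from the examples above, "affid" is the affiliate parameter used above, and the other parameter names ("sessionid", "sid") and the function names are made up for illustration. Treat it as a starting point, not a drop-in solution for any particular CMS.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative settings (assumptions): swap in what your own site uses.
PREFERRED_HOST = "www.yoursite.com"
HOST_ALIASES = {"yoursite.com", "www.yoursite.com"}
IGNORED_PARAMS = {"affid", "sessionid", "sid"}
INDEX_FILES = {"index.html", "index.htm", "index.shtml"}


def canonical_url(url: str) -> str:
    """Return the single preferred form of a URL on this site."""
    parts = urlsplit(url)

    # Map every known alias of the site to the preferred host name.
    host = parts.netloc.lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST

    # Treat /index.html and friends as the directory they stand for.
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in INDEX_FILES:
        path = head + "/"

    # Drop tracking and session parameters; keep the rest in order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))


def redirect_target(url: str):
    """Return the URL to 301 redirect to, or None if already canonical."""
    target = canonical_url(url)
    return None if target == url else target
```

In practice you would call something like redirect_target() early in request handling, store the affiliate ID in a cookie first if one is present, and issue the 301 only when a target comes back.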
There are also many ways that people intentionally create duplicate content, such as various scraping techniques, but there is no need to cover those here.
There are a number of gray area techniques, such as computer generated content. There was a very interesting presentation about this by Mikkel deMib Svendsen at SMX Advanced that talked about Markov chains as a technique for generating content. One key to doing this well is to make the content different enough that it is not seen as duplicate. The second key is to generate content that is meaningful for an end user.
When search engines look for duplicate content, they start by filtering out all the content on the page which is template based, such as the navigation on the sides, top, and bottom. They recognize this as being in common, and do not hold this against you. They base their evaluation on the content that is intended to be unique to that page.
Search engines will compare each of the pages on your site to other pages on your site, as well as to pages on other sites. One of the known techniques for doing this is the Sliding Window technique. It looks at the unique content on your page a fixed number of characters at a time. For example, it might look at the first 50 characters in the unique content section of your page, starting with the 1st character.
It then compares that snippet with other snippets as a part of its duplicate content check. It then looks at 50 characters starting with the 2nd character in the unique content section of your page, then it starts with the 3rd character, the 4th character, and so forth. One way you can try to see how you are doing is to use a Page Similarity Checker.
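To make the sliding window idea concrete, here is a small Python sketch. It collects every 50-character window from the unique text of two pages and measures how many windows they share. The scoring (shared windows divided by all distinct windows, often called Jaccard similarity) and the function names are my own assumptions for illustration. The search engines have not said exactly how they compare the snippets.

```python
def shingles(text: str, width: int = 50) -> set:
    """Collect every window of `width` characters, starting at the
    1st character, then the 2nd, then the 3rd, and so forth."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= width:
        return {text} if text else set()
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def similarity(a: str, b: str, width: int = 50) -> float:
    """Assumed scoring: shared windows divided by all distinct windows.
    Returns 1.0 for identical text and 0.0 when no window is shared."""
    windows_a = shingles(a, width)
    windows_b = shingles(b, width)
    if not windows_a and not windows_b:
        return 0.0
    return len(windows_a & windows_b) / len(windows_a | windows_b)
```

Two pages that differ only by a state name at the top will share almost all of their windows, which is why the simple word substitution trick is so easy to spot.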
In general, search engines do not penalize you for duplicate content. When they detect duplicate content, they simply try to choose only one of the duplicate pages to return in the search results, and they may not choose yours. They may make this choice based on something like page rank, or by favoring whichever copy of the content they detected first.
In extreme cases, I have actually seen algorithmic penalties applied. This is rare, and should only happen to you if your site is crawling with duplicate content, and has basically nothing else.
The last thing I want to note is that the main focus of a webmaster should be on delivering pages of unique value. Uniqueness is important for many reasons, not least because it makes it far more likely that your site can obtain links. The primary value in knowing how to avoid unintentional duplicate content is to avoid the division of your page rank. Links to duplicate pages are wasted, and marketing your site is hard enough without shooting yourself in the foot.
Other suggestions for ways people unintentionally create duplicate content? Let me know, and I will add them to the above list.