The Cost of Duplicate Content

Let’s talk about the cost of duplicate content. At first blush, it seems like a relatively minor issue. In principle, a search engine wants to include only one copy of a page in its index. So if you have multiple pages with the same content, the search engine picks only one, and the other copies are ignored.

So far it does not sound too bad, does it? However, there are other, less obvious consequences of duplicate content. For example, it can’t be good that crawlers come to your site and crawl pages that they will never index. In fact, it’s our understanding that crawlers come to your site with the goal of crawling a certain number of pages. So if they crawl pages that will not be indexed, they are not crawling pages that will. This could result in fewer pages of your site getting indexed.

In addition, there are tons of ways to end up with unintentional duplicate content. Here are just a few:

  1. Running an affiliate program
  2. Syndicating content
  3. Failure to 301 redirect from the non-www version of your site to the www version of your site, or vice versa (see the sketch after this list)
  4. Code implementations that cause sub-domain pages to automatically mirror to your site
  5. Code implementations that allow different URL paths to render the same content
  6. Pages with “different content” that are not different enough – this can happen with database-driven sites
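
To make item 3 concrete, here is a minimal sketch of the 301 redirect, assuming an Apache server with mod_rewrite enabled and rules placed in an .htaccess file (yourdomain.com is just a placeholder; reverse the condition if you prefer the non-www version as your canonical host):

    RewriteEngine On
    # Permanently redirect any request for yourdomain.com to www.yourdomain.com
    RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
    RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]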

I am sure that there are many more ways to create duplicate content. Each of these scenarios has its own issues and problems. But one problem with nearly all of them is link dilution. Your site has a certain amount of PageRank to spread around. Links that go to pages that will never be indexed are wasted. This means that less PageRank flows to the pages that are indexed, which will likely result in lower rankings for those pages.

So the bottom line is potentially fewer pages indexed and lower rankings for the pages that are indexed. This sounds like an extremely high cost to me. You can read more about problems with, and solutions for, duplicate content here.

Comments

  1. greyhound says:

    Great information. I have a few follow-up comments and questions:

    1. I wasn’t quite sure I understood item 4 “Code implementations that cause sub-domain pages to automatically mirror to your site”. Could you elaborate?

    2. I’m also interested in more details on dealing with problem #5, “Code implementations that allow different URL paths to render the same content”. In particular, if I’m understanding correctly, I think this very technique is often recommended to solve the problem that SEs have crawling/indexing dynamic links. The advice I’ve seen is to provide static links that are more “easily digestible” by spiders in addition to the dynamic links.

    For example, I’ve seen recommendations to try to do things like map the dynamic URL “www.mydomain.com?cat=10&item=7” to something like “www.mydomain.com/cat10/item7”. The latter should be more reliably crawled/indexed; however, you may also need to keep the dynamic version available to support your dynamic shopping interface, thus stumbling into the duplicate content problem. How do you solve the one problem without creating the other?

    Thanks for the info.

  2. stonecold says:

    Good questions. With regard to your first question (about subdomains), what I mean is that an arbitrary subdomain such as garbage.yourdomain.com serves the same content as http://www.yourdomain.com. Search engines see this scenario as duplicate content.

    Regarding your second question, we are talking about two different things. I am referring to a situation where http://www.yourdomain.com/content12 serves the exact same content as http://www.yourdomain.com/content14, and both URLs can be found as live links (visible to users and search engines) on the web site.

    The scenario you raise for mapping dynamic URLs to static ones is different. Only the static URLs are visible to users and search engines on the site. Then another mechanism, such as mod_rewrite on an Apache server, remaps the static URL found on the web pages to the dynamic one expected by the web application.

    However, this dynamic URL is not visible to users or search engines (the remapping happens behind the scenes).

    Does that help?

  3. greyhound says:

    Yes that helps.

    On the second question, this is somewhat hypothetical to me since I haven’t implemented a large-scale site with dynamic URLs, but let me try to clarify.

    I assume there may be situations where you want BOTH the static AND the dynamic URL. The dynamic URL is there because of the way your site is built for your users (for example, perhaps the users have the ability to search a product database). I assume that this architecture needs to remain in place, even if you want to implement static links for SEO purposes.

    On top of this architecture, you desire to make all of those pages more visible/crawlable for search engines by creating matching static links. So you create static versions of the pages and corresponding static links for this purpose.

    Do I have it right so far — I mean, is that the right technique for exposing dynamic content via static URLs for SEO purposes?

    If so, this is where I see the problem, because it seems that you’ve now got two different URLs – the static one and the dynamic one – both pointing at the same content. This was where I got confused, and was wondering if, in the course of solving the dynamic page indexing problem, you don’t just trade it for a different (duplicate content) problem.

    Thanks for taking the time to elaborate.

  4. stonecold says:

    Actually, no. In general, you would not have the dynamic URL visible at all. For the most part we are talking about showing the world a URL such as “www.yourdomain.com/product_category/product_type/product_id”, instead of a URL such as “www.yourdomain.com?pc=12&pt=45&pi=7”. The parameter-based URL does not show up on the site, and there is really no reason to show it to users (the other URL is more user friendly).

    So we implement the HTML on such sites to show the friendlier URL. Then we use a mod_rewrite rule (on an Apache server) to remap the friendlier URL back into the parameter-based one for the application.
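
    Just as a sketch (assuming an Apache .htaccess file and a hypothetical index.php front controller – the pc/pt/pi parameter names simply follow the example URLs above), such a rule might look something like this:

        RewriteEngine On
        # Internally map a friendly three-segment path like
        # /product_category/product_type/product_id to the query-string URL
        # the application expects. There is no [R] flag, so no redirect is
        # issued and the parameter-based URL never appears in the browser.
        RewriteRule ^([^/]+)/([^/]+)/([^/]+)/?$ /index.php?pc=$1&pt=$2&pi=$3 [L,QSA]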

    The key incorrect assumption in your question above is that the dynamic URLs are left visible to users and search engines. They are not. Yes, they are still there, because that is what the web application expects, but users never see them.

  5. greyhound says:

    Aha… the lightbulb finally goes on.

    Thanks again.
