15 Things About How Google Handles Duplicate Content

Duplicate content is one of the most perplexing problems in SEO. In this post I am going to outline 15 things about how Google handles duplicate content, leaning heavily on interviews with Vanessa Fox and Adam Lasnik. If I leave something out, just let me know and I will add it to this post.

  1. Google’s standard response is to filter out duplicate pages, and only show one page with a given set of content in its search results.
  2. I have seen evidence in the SERPs that large media companies are able to show copies of press releases without being filtered out.
  3. Google rarely penalizes sites for duplicate content. Their view is that it is usually inadvertent.
  4. There are cases where Google does penalize. This takes some egregious act, or the implementation of a site that is seen as having little end user value. I have seen instances of algorithmically applied penalties for sites with large amounts of duplicate content.
  5. An example of a site that adds little value is a thin affiliate site, which is a site that uses copies of third party content for the great majority of its content, and exists to get search traffic and promote affiliate programs. If this is your site, Google may well seek to penalize you.
  6. Google does a good job of handling foreign-language versions of sites. They will most likely not see a Spanish-language version and an English-language version of a site as duplicates of one another.
  7. A tougher problem is US and UK variants of a site (“color” vs. “colour”). The best way to handle this is in-country hosting, which makes it easier for Google to detect which country each version is intended for.
  8. Google recommends that you use NoIndex metatags or robots.txt to flag duplicate pages you don’t want indexed. For example, you might use this with “Print” versions of pages on your site (see the example snippets just after this list).
  9. Vanessa Fox indicated in her Duplicate Content Summit at SMX that Google will not punish a site for implementing NoFollow links to a large number of internal site links. However, the recommendation is still that you should use robots.txt or NoIndex metatags.
  10. When Google comes to your site, they have in mind a number of pages that they are going to crawl. One of the costs of duplicate content is that each time the crawler loads a duplicate page, one that they are not going to index, they have loaded that page instead of a page that they might index. This is a big downside to duplicate content if your site ends up less fully indexed as a result.
  11. I also believe that duplicate content pages cause internal bleeding of PageRank. In other words, link juice passed to pages that are duplicates is wasted, and would be better passed on to other pages.
  12. Google finds it easy to detect certain types of duplicate content, such as print pages, archive pages in blogs, and thin affiliates. These are usually recognized as being inadvertent.
  13. They are still working on RSS feeds and the best way to keep them from showing up as duplicate content. The acquisition of FeedBurner will likely speed the resolution of that issue.
  14. One key signal they use to select which page to show from a group of duplicates is which page is linked to the most.
  15. Lastly, if you are doing a search and you DO want to see duplicate content results, just do your search, get the results, append the “&filter=0” parameter to the end of your search results URL, and refresh the page (there is an example after this list).
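
To make items 8 and 15 concrete, here are minimal example snippets. The /print/ path and the query below are made-up illustrations, not recommendations for any particular site. A robots.txt rule that keeps crawlers out of print versions, and the equivalent NoIndex metatag placed in the head of a print page, look like this:

    # robots.txt: keep crawlers out of the print versions (hypothetical /print/ path)
    User-agent: *
    Disallow: /print/

    <!-- or, in the <head> of the print page itself -->
    <meta name="robots" content="noindex, follow">

And for item 15, if your results URL is http://www.google.com/search?q=duplicate+content, you would change it to http://www.google.com/search?q=duplicate+content&filter=0 and reload the page to see the results that were filtered out.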

Here is a summary of Ways to Create Duplicate Content, and Adam Lasnik’s post on Deftly Dealing with Duplicate Content, which explains how to handle this problem on your site.

Comments

  1. SEO Ranter says

    6 – surely this is a no brainer, just don’t code anything
    7 – or whois info, or even just a language definition in the header / relevant tag
    8 – maybe use noindex the other way round – print pages are often great for seo (although terrible for users)
    11 – what do you mean by “internal bleeding of page rank”?
    14 – is this very different from simply looking at the page in question’s pagerank?

  2. Vic Gee says

    Hi Eric,

    Very useful post, thank you – that’s gone in my reference folder. But it didn’t cover this one. Can you help with it?

    My site at http://www.mind-mapping.org is arranged so that a spider arriving at the front page and then following links will only see one copy of all the material (I hope I’ve got it right, anyway).

    There are some Javascript controls that visitors to the site can use to filter the material shown, or search for specific items by name, but as the robots won’t activate the Javascript, that’s not a problem directly.

    But users sometimes copy the URL generated by the Javascript if they want a subset of the items, and then link to it on a blog. Here’s what it will produce if I use the Javascript control to select mind mapping software, all OSs and currently available software only:
    http://www.mind-mapping.org/?selectedCategories%5B%5D=mind+maps&selectedOSes%5B%5D=all+operating+systems&pastOrPresent%5B%5D=current&datePicker2=&datePicker1=&datePicker1_month=7&datePicker1_year=2007&filterData=Show+selected+items

    All very nice (well not very elegant, but it works), but if someone copies that URL to their site, when a spider crawls their site and follows the link I guess it’s going to see duplicate content: The URL is different, but some of the content will be the same as the spider found on a routine crawl. Of course I can’t get them to add “nofollow”, and anyway, by making a link they are saying my site has useful stuff, and I want that vote.

    If I make all the pages generated by Javascript include a noindex metatag at the beginning, will I get the link recognition but avoid the duplicate content penalty?

    Much thanks for any advice you can give.

    Vic Gee

  3. Hamlet says

    One key signal they use to select which page to show from a group of duplicates is which page is linked to the most.

    One of my readers asked about this and I said they select the page with the higher PageRank (the most inbound links). I knew I had read it somewhere, but I could not find where. Can you point me directly to the source?

    This is a really useful article

  4. Pandy says

    Great info thanks. It appears to cover only duplicate content within the same site though.

    I have a series of sites for the same company which offers insurance to different professions (e.g. architects, engineers, surveyors).

    Each profession has its own dedicated site, but much of the content across all the sites is duplicated (e.g. company info, contact us, etc.). These sites do not do particularly well on Google, but do very well on MSN, Yahoo, etc.

    Do you think the dupe content could be the cause of this?

    Pandy

  5. Eric says

    SEO Ranter – regarding item 11, my point is that each page votes its PageRank by linking to other pages. So when you vote for a page that will never rank (because it’s a duplicate), that vote is wasted. You’d be better off spending that vote on a page that might rank.
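
    To put rough numbers on it (a deliberately simplified model that ignores the damping factor and every other signal): if a page has 10 links on it and 3 of them point to duplicate print versions, roughly 3 out of every 10 units of PageRank that page passes along go to pages that will never rank.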

    Note: On a slightly different topic, and not one that you raised, this does not mean that we advise our clients to not link to 3rd party sites. In fact, we think they should, for a wide variety of reasons.

    Regarding the question about item 14 – I was just using the # of links as a proxy for PageRank. I believe it probably is PageRank-based, not just a raw link count.

    Vic Gee – You are right that the scenario you outlined could create a duplicate content problem. However, in your case, I am not sure that you need to worry about it. After all, you are not wasting any of your link juice on the Javascript generated pages, and who cares if it’s filtered out? You are still getting a link to the page, and the juice from that link will still be passed to the rest of the site through the links on that page.

    The only downside I see is that it may happen at times that your Javascript generated page will show up in the search results instead of the native page.

    Please keep in mind that this is just my opinion based on your comment, without having done a thorough analysis of all the facts, so please take the input with an appropriate amount of grains of salt.

    Hi Hamlet – I believe Vanessa Fox said something along those lines in the interview I did with her (see the article which is linked to in the main post above).

    Hi Pandy – Google works just as hard to find duplicate pages across sites as they do to find duplicate pages within a site. Bearing in mind that I don’t know the details of your situation, I think there is a real chance that you have a duplicate content problem.

    But, it’s hard for me to venture a guess as to whether or not it’s so extensive that you would actually have a penalty being applied. It’s one thing for Google to not show your company info pages because they are duplicates, and it’s entirely something else to penalize the rest of your pages because of it.

    I would think that this would only happen if after Google eliminated the dupe pages it felt that there was not much content left on the site.

  6. Breanna says

    I have a question: I’ve been writing articles for article-distribution sites as a way to get links via useful content. I will publish the article on two or three submission sites and then it will get reposted on three or four thematically related sites (apparently they get much of their content from such free articles) – and my link will show up in all of those places. Are those links all valuable to my page, or are only a few of them valuable? [Speaking strictly about SEO – they are pretty useful traffic-wise on their own.] I haven’t published any of the articles on my own website, if that makes a difference.

  7. Eric says

    Breanna – It is my belief (emphasis on the word belief) that those links are valuable to your page. Duplicate content filtering keeps the articles out of the results, but the link juice still has value.

  8. DD says

    Hi Eric,
    I am so happy to read this article, but I am not sure about one point. You said, “Google rarely penalizes sites for duplicate content. Their view is that it is usually inadvertent.”
    I have seen Google first send this type of duplicate content page to the supplemental results and then de-index it from its datacenter, so the page will not appear in the SERPs.

    DD

  9. Dave says

    I really can’t figure out how duplicate content works. I have a 2 or 3 year old site at http://www.beat-debt.com. I have just noticed that my ranking for “Tiscali Business Opportunity” has slipped from #5 in UK results to #155.

    At the same time, I noticed a website with exactly the same title tag, so I looked a bit deeper. Not only is the title tag the same, but the content is identical, the design is identical and the cheeky *@#* is even using the images on my server.

    Not only that, but he still has my analytics code in his site, so that’s bu99ered up my analytics since he has had his website up.

    I’ve asked him nicely to remove the site immediately, but it pains me that this cheeky bu99er comes along, my original site is penalised and his site that was lifted from mine sits proudly on page 1. How does Google explain that one?

  10. says

    Dave,
    I had someone do the exact same thing to me. Fortunately, one of the things he left in the page was a JavaScript file. I set up a new JavaScript file for my site and then changed the one he was calling so that it redirects to my page. It’s been that way for a couple of years, and he hasn’t noticed that I’m stealing the traffic he gets from the content he stole from me. :)

  11. says

    #13, I have always wondered how showing RSS could help. True, it is fresh content, but isn’t it duplicate content from the original source? How does that help the site displaying the RSS in the SERPs?

  12. Eric says

    Hi New Orleans – Normally it doesn’t help in any direct sense. As long as the link path to the original content on your site is crawlable and easily found by search engine robots, that page should be the one that you would prefer to have ranking in the search engine.

    The RSS feed is simply duplicate content. Generally speaking, the search engines including it in the SERPs is a mistake, and they are trying to fix it. Of course, in the world of search there are few hard and fast rules, so I am sure I could think of scenarios in which having your RSS feed indexed would be helpful (such as when the content is not readily found by the crawlers on the site), but these would not apply to most webmasters.

  13. says

    I’ve been wondering how Google deals with the tag pages that are generated by WordPress. They seem to create pages and pages of duplicate content.

  14. Dianne says

    My husband’s law firm has a 15-page website built through one of the law publishing/directory companies that I wrote all content and tags for. It was doing great in the search engines for the last year & a half (page 1 Google for each of his practice areas and geographies), but we found the monthly nut they charge to be, quite frankly, ridiculous.

    So now I’m near completion of what’s basically a duplicate site (some new practice area pages and some content updates, but really the same overall site content, including the same tags and keywords that worked so well for us when I wrote the first site) at a fraction of the cost. Our plan was to try to get the new site up in the rankings and then cancel the contract on the old site and alias the old domain to the new site.

    However, in the last week, while I’ve been adding nearly duplicate pages to the new site, we’re seeing a rather sudden drop in the SE rankings on the old site, which led to me doing a bit more research and finding your article. We can try to get the first site shut down fast, or (since that’s out of our control) temporarily take down the new site, but will this “remove” our penalty, or are we tagged for life?

  15. Eric says

    Hi Dianne,

    First, I need to caution that the situation faced by each individual site owner is unique. As a result, I don’t really have enough information to provide specific advice.

    The comments below represent my experience of what we have seen on sites we have worked on, but you should get someone to give you specific advice about how to handle your situation.

    That said, search engines want to represent only one version of a piece of content to users. When they see two web pages with identical or nearly identical articles on them, the most common response is for them to ignore one of the web pages, and only include one of them in the index.

    If the search engine chooses this course of action, there is no way to know if the search engine will choose to index content from the new site or the old one.

    Removing one version of the content does of course simplify the picture dramatically, but you then need to worry about how long it will be before the new site begins ranking in Google. This can be many months.

    You can probably shorten this time period if you are able to implement 301 redirects from your old site to the new one.
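
    For example (assuming the old site sits on an Apache server you control, and using a made-up domain name), a site-wide 301 can be as simple as one line in the old site’s .htaccess file:

        Redirect 301 / http://www.your-new-site.com/

    That tells the search engines the content has moved permanently, so the credit the old pages have earned should follow to the new site.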

    Hope this helps.

    Eric

  16. Dave says

    QUOTE [At the same time, I noticed a website with exactly the same title tag, so I looked a bit deeper. Not only is the title tag the same, but the content is identical, the design is identical and the cheeky *@#* is even using the images on my server.]

    A little update. I had some fun with this guy and started changing the images on my server. His business opportunity website proudly displayed a graphic with the wording “I am a thieving b@st*rd” and some mildly amusing, but rather rude, images.

  17. compucast says

    The duplicate content issue has long been talked about – but the solution is simple: just write quality articles, and do it with the user in mind. Just ask yourself as you are writing, “Will people benefit from this?” and if your content is indeed beneficial to the user, you’re good. Getting around duplicate content is usually more trouble than just WRITING original content, anyway!!!

  18. says

    @compucast Duplicate content problems can be easily avoided by writing unique content, and keeping nofollow tags in the right place in your site navigation to ensure potentially duplicate pages aren’t indexed.

    @compucast Duplicate content problems can be easily avoided by writing unique content, and keeping nofollow tags in the right place in your site navigation to ensure potentially duplicate pages aren’t indexed.
    :)

  19. says

    Thanks for this list. A lot of people have the wrong impression of duplicate content, and live in perpetual fear of Google punishing them just because there’s another version of their article somewhere else on the Internet.

    Obviously this is nonsense, and your article does a great job of explaining WHY. In fact, much of the Internet is based on duplicate content!

  20. says

    I had a problem like this once because of an accidental posting of duplicate material. The duplicate page was crawled and indexed instead of the “real” one. I have had trouble ranking that page ever since. Great post, greater blog; I’ll be looking forward to more.

  21. says

    This is a great post! I’m glad that Google is tough on duplicate content; otherwise you could just go and copy a successful website and take advantage of it. People who work hard and write their own content should be rewarded.

  22. says

    This site is excellent, and so is how the subject matter was explained. I like some of the comments too, although I would suggest everyone stay on the subject so as to add value to the message.

  23. says

    This article is really the most informative on this deserving topic. I agree with your conclusions and am eagerly looking forward to your future updates. Just saying thanks will not be enough for the extraordinary clarity in your views and writing.

  24. says

    This article on duplicate content is now more relevant than ever. Since the latest Google Panda update, sites with duplicate content taken from other websites have seen a significant drop in search engine rankings. This is a fact.

    Search engine spiders are smarter than you think; remember to keep all your content unique and fresh.

  25. says

    With the latest Google algorithm update, people should stop overusing the same content. Google now has its eyes on all the content spamming tactics! They have already worked out a method for dealing with blog networks. It will not be long before they start working on useless article directories.
