Googlebot Detection and Combatting Copyright Violations

We live in a world where other sites increasingly copy your good content and republish it. This raises the concern that you will be flagged for publishing duplicate content, or that the search engines will not correctly recognize your site as the original source of the content. So what can you do about this problem?

Matt Cutts just posted one part of the answer on the Google Webmaster Blog: Google has now specified an official way to recognize Googlebot. There are a few ways you can use this information. For example, you can allow only the search engine bots you care about to crawl your site (you would need to identify each of them) and deny access to all other bots.
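To make this concrete, here is a minimal Python sketch of the check as I understand it: do a reverse DNS lookup on the visitor's IP address, confirm the host name belongs to Google, then do a forward lookup on that name and confirm it maps back to the same IP. The function name `is_googlebot`, the injectable lookup parameters, and the exact list of accepted domain suffixes are my own assumptions for illustration, so check the suffixes against Google's current documentation before relying on them.

```python
import socket

# Assumed list of host name suffixes that identify Google's crawlers.
# Verify this against Google's own documentation before using it.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    The two lookup arguments are hypothetical hooks so the logic can be
    exercised without touching the network. By default they use the
    standard library resolver (IPv4 only in this sketch).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    # Step 1: reverse DNS, IP address -> host name.
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False

    # Step 2: the host name must end in one of the expected domains.
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    # Step 3: forward DNS, host name -> IP addresses. The original IP
    # must be among them, otherwise the reverse record could be spoofed.
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

The forward lookup matters because anyone who controls the reverse DNS for their own IP range can make it claim to be a Google host name; only the forward record is under Google's control. DNS lookups are slow, so in practice you would want to cache the result per IP rather than run this on every request.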

There is some risk to this strategy, as Google does periodically use other bots to check for cloaking, and you would end up blocking those bots as well. I am going to see what I can find out from Google about this problem, and what they recommend webmasters do about it.

In any event, if you see something crawling your site that is not a search engine crawler, you can block it pretty simply. If you are running Apache on your servers, you can place a directive such as “deny from 00.00.158.37” in your .htaccess file, where the numbers represent the IP address of the bot crawling your site.
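If you end up blocking more than a couple of addresses, it may be easier to generate the lines than to type them. The sketch below is only an illustration: the function name `deny_lines` is hypothetical, and it emits the same “deny from” form quoted above. Newer Apache releases use a different access control syntax, so check which form your server version expects.

```python
import ipaddress


def deny_lines(addresses):
    """Turn a list of IP address strings into .htaccess deny lines.

    Each address is validated with the standard library first, so a typo
    raises ValueError instead of silently ending up in the config file.
    Duplicates are dropped while the original order is kept.
    """
    lines = []
    seen = set()
    for raw in addresses:
        ip = ipaddress.ip_address(raw.strip())  # ValueError on bad input
        if str(ip) not in seen:
            seen.add(str(ip))
            lines.append(f"deny from {ip}")
    return lines


if __name__ == "__main__":
    # Example addresses come from the ranges reserved for documentation.
    for line in deny_lines(["192.0.2.10", "198.51.100.7"]):
        print(line)
```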

You would only know the IP address if you are regularly checking your log files. But this is something you should do. Protecting your valuable intellectual property is important.
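Checking the logs does not have to be a manual chore. As a rough sketch, assuming your server writes the common access log layout in which the client address is the first field on each line, the following Python counts requests per address so the heaviest visitors stand out. The function name `top_clients` and the log file path in the usage note are hypothetical.

```python
from collections import Counter


def top_clients(log_lines, limit=10):
    """Count requests per client address and return the busiest ones.

    Assumes the client address is the first whitespace-separated field
    on each line, as in Apache's common and combined log formats.
    Blank lines are skipped.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if parts:
            counts[parts[0]] += 1
    # most_common sorts by request count, highest first.
    return counts.most_common(limit)
```

To run it against a real file you could write something like `with open("access.log") as f: print(top_clients(f))`. An address near the top of the list that is not a search engine crawler you recognize is a good candidate for a closer look.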

In addition, you should regularly check for the presence of copies of your site, or parts of your site. You can do this by searching on long unique strings from the pages of your site. When you find someone who is copying your content, there are a few steps you should take:

  1. Send them a cease and desist letter, warning them that you will sue.
  2. Send their hosting company a cease and desist letter, telling them that you will hold them liable for the actions of their customer. Include clear proof that you are the copyright holder of the content. This is often the most effective step, as it frequently results in the offending party’s hosting account being shut down. The hosting company wants nothing to do with it.
  3. If the offending party is involved in some major affiliate partnership, send a similar letter to their partner.

If these steps all fail, then the next step is to file a DMCA complaint with the search engines. The search engines do act on each of these requests. Google provides an outline of the process here. Among other things you need to provide clear proof that you own the copyright. You will also need to identify each search result that brings up the offending site.

Do not take this step lightly, as it’s a lot of work, and be VERY SURE that you are in the right. You don’t want to start this process trivially, as you will make the search engines very upset if you file an invalid request. But if you are in the right, and the cost of the copyright violation is significant, then this approach is worthwhile.
