Drupal and Search Engine Optimization

Drupal is known for being a very SEO friendly content management system (CMS). The way it assembles its pages is crawler friendly. This makes it a popular choice for people looking to build dynamic web sites. However, there are a number of potential SEO problems with Drupal as well. These need to be dealt with to ensure that you get optimal results.

The very fact that Drupal is such a dynamic system is a factor that leads to some of its SEO problems. The content is stored in a database and retrieved at runtime. Almost all information is stored as a “node”, a basic, unstructured unit of content. Often, each “node” is associated with groups of keywords, known as “taxonomies”, and Drupal makes it easy to retrieve and sort information by these taxonomies. Since all content can be retrieved dynamically, Drupal generates generic URLs for the content, such as www.example.com/?q=node/3 or www.example.com/node/3.

These “internal” URLs are always present in Drupal, even though Drupal provides features that allow you to hide them, and instead present much friendlier URLs, known as aliases, to web site users. There are multiple optional modules that may affect the generation of pages and the naming of URLs, and there are many modules that remain aware of the internal naming conventions, even when user-friendly URLs are being used. As a result Drupal may expose both the internal URLs and the user-friendly URLs to users and web crawlers.

As a result of these kinds of architectural issues, many Drupal sites end up exposing content to the web via multiple URLs. When this happens, the multiple URLs can be crawled by the search engines, creating duplicate content problems. Here are some examples of duplicate content issues, and some other problems that can arise in Drupal.

1. Problem: Duplicate content from aliases.

Example: www.example.com/node/5 and www.example.com/content/how-to-surf, both pointing at the same physical document.

Solution: Use robots.txt to disallow URLs that include “/node/”. For example, you can include the following lines in robots.txt:

Disallow: /node

Disallow: /*/node/

Considerations: Note that this assumes that all URLs are available via friendly aliases. This should be the case if you’re using the Pathauto module.

2. Problem: Drupal’s default robots.txt has errors.

Example: the default robots.txt uses “Disallow: /search”. This disallows only a page ending with /search, but not all of the Drupal internal search results pages, which is desired.

Solution: Update the robots.txt to read:

Disallow: /search/

3. Problem: Pathauto can create many extra pages on the site if configured incorrectly.

Example: If you turn on “Create index aliases” and you have a hierarchical alias (i.e., a page with a path containing a slash, such as music/concert/beethoven), Drupal automatically generates index pages that list all pages in each category, for example all music and all concerts.

Solution: Do not check the “Create index aliases” check box in the Pathauto module.

4. Problem: Incorrect setting of the Pathauto “Update action”, in a production environment, can cause URLs of published pages, which may already be indexed by the search engines, to change.

Solution: In development mode (before exposing the site to the search engines), use “Create a new alias, replacing the old one” to regenerate URLs whenever necessary (for example, if your Pathauto rules change). In production, once the site is exposed, set this to “Do nothing, leaving the old alias intact”.

5. Problem: Some modules, such as Forums and Views, create sortable lists that can generate multiple URLs with duplicate content.

Solution: If you use such a module, be sure to exclude the sorted variations using the following robots.txt rule:

Disallow: /*sort=

6. Problem: The Forward module creates a link to a URL, on each page, that allows the page to be forwarded to a friend. You can easily end up with hundreds or thousands of such low quality pages that are essentially boilerplates.

Solution: If you use this module, be sure to exclude the forward pages using the following robots.txt rule:

Disallow: /forward/
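Before deploying rules like the ones above, it can help to check which URLs they would actually exclude. The following is a minimal Python sketch, not a full robots.txt parser. It assumes the wildcard semantics documented by the major search engines (“*” matches any run of characters, a trailing “$” anchors the end of the URL, and everything else is a prefix match). The function names and the sample paths are illustrative only.

```python
import re

# Disallow rules taken from the solutions above.
RULES = ["/node", "/*/node/", "/*sort=", "/forward/"]

def rule_to_regex(rule):
    """Translate a Disallow path with '*' and '$' wildcards into a regex.

    Assumption: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and the rule is otherwise a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_disallowed(path, rules=RULES):
    """Return True if the path (including any query string) matches a rule."""
    # re.match anchors at the start of the path, which gives prefix matching.
    return any(rule_to_regex(rule).match(path) for rule in rules)

if __name__ == "__main__":
    for path in ["/node/5", "/content/how-to-surf", "/forum/12?sort=asc"]:
        print(path, "blocked" if is_disallowed(path) else "allowed")
```

Running a list of your own site's URLs through a check like this is a quick way to confirm that the friendly aliases stay crawlable while the internal and sorted variants do not.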

These problems can crop up on many Drupal systems, and all Drupal users should review their sites for these issues. Drupal may also have other issues, depending on the site and the degree of customization. For example, on several sites, we’ve seen Drupal generate complex CSS hierarchies that end up building hidden text into the pages. While search engines try to detect hidden text scenarios that are not a result of bad intent, this is a risk you don’t need. As long as you recognize what the issues are, they can be dealt with, and Drupal can be a great choice as a content management system. Most content management systems present even greater challenges to SEO.

15 Things About How Google Handles Duplicate Content

Duplicate Content is one of the most perplexing problems in SEO. In this post I am going to outline 15 things about how Google handles duplicate content. This will include my leaning heavily on interviews with Vanessa Fox and Adam Lasnik. If I leave something out, just let me know, and I will add it to this post.

  1. Google’s standard response is to filter out duplicate pages, and only show one page with a given set of content in its search results.
  2. I have seen in the SERPs evidence that large media companies seem to be able to show copies of press releases and do not get filtered out.
  3. Google rarely penalizes sites for duplicate content. Their view is that it is usually inadvertent.
  4. There are cases where Google does penalize. This takes some egregious act, or the implementation of a site that is seen as having little end user value. I have seen instances of algorithmically applied penalties for sites with large amounts of duplicate content.
  5. An example of a site that adds little value is a thin affiliate site, which is a site that uses copies of third party content for the great majority of its content, and exists to get search traffic and promote affiliate programs. If this is your site, Google may well seek to penalize you.
  6. Google does a good job of handling foreign language versions of sites. They will most likely not see a Spanish language version and an English language version of a site as duplicates of one another.
  7. A tougher problem is US and UK variants of sites (“color” vs. “colour”). The best way to handle this is with in-country hosting, which makes it easier for Google to detect which country each variant is meant for.
  8. Google recommends that you use Noindex metatags or robots.txt to help identify duplicate pages you don’t want indexed. For example, you might use this with “Print” versions of pages you have on your site.
  9. Vanessa Fox indicated in her Duplicate Content Summit at SMX that Google will not punish a site for applying NoFollow to a large number of internal links. However, the recommendation is still that you should use robots.txt or NoIndex metatags.
  10. When Google comes to your site, they have in mind a number of pages that they are going to crawl. One of the costs of duplicate content is that when the crawler loads a duplicate page, one that they are not going to index, they have loaded that page instead of a page that they might index. This is a big downside to duplicate content if your site is not (more) fully indexed as a result.
  11. I also believe that duplicate content pages cause internal bleeding of page rank. In other words, link juice passed to pages that are duplicates is wasted, and this is better passed on to other pages.
  12. Google finds it easy to detect certain types of duplicate content, such as print pages, archive pages in blogs, and thin affiliates. These are usually recognized as being inadvertent.
  13. They are still working on RSS feeds and the best way to keep them from showing up as duplicate content. The acquisition of FeedBurner will likely speed the resolution of that issue.
  14. One key thing they use as a signal when selecting a page from a group of duplicates is which page is linked to the most.
  15. Lastly, if you are doing a search and you DO want to see duplicate content results, just do your search, get the results, append the “&filter=0” parameter to the end of the search results URL, and reload the page.

Here is a summary of Ways to Create Duplicate Content, and Adam Lasnik’s post on Deftly Dealing with Duplicate Content that explains how you handle this problem on your site.

Vanessa Fox’s Last Google Interview?

Shortly before she left Google, I spoke to Vanessa Fox about what’s going on with Google Webmaster Tools, and we also spoke for a while about duplicate content problems. While I was working on polishing up the interview transcript, Vanessa left Google. So I may have had the last real interview she did while at Google. After another week of vacation, she will officially make the leap over to work at Zillow. Best of luck Vanessa!

We talk at length about key parts of Webmaster Tools, and we also talk about the newest features, and planned features for the product. Vanessa talks quite a bit about the types of features that users have been requesting, and while nothing was committed, it should provide some insight into the types of things you can expect to see in the product in the future.

We also talk quite a bit about duplicate content, what types are easy for Google to detect, and some more advanced duplicate content situations, such as those where actual penalties are put in place.

12 Ways Webmasters Create Duplicate Content

At the recent SMX Advanced Conference in Seattle one of the big sessions was on duplicate content. There is great blow by blow coverage in posts by Vanessa Fox and by Matt McGhee. You can also see an older post about dupe content here by Chris Boggs.

At the start of this session, the search engines all talked about various types of duplicate content. But let’s take a deeper look at the way that duplicate content happens. Here are 12 ways people unintentionally create dupe content:

  1. Build a site for the sole purpose of promoting affiliate offers, and use the canned text supplied by the agency managing the affiliate program.
  2. Generate lots of pages with little unique text. Weak directory sites could be an example of this.
  3. Use a CMS that allows multiple URLs to refer to the same content. For example, do you have a dynamic site where http://www.yoursite.com/level1id/level2id pulls up the exact same content as http://www.yoursite.com/level2id? If so, you have duplicate content. This is made worse if your site actually refers to these pages using multiple methods. A surprising number of large sites do this.
  4. Use a CMS that resolves sub domains to your main domain. As with the prior point, a surprising number of large sites have this problem as well.
  5. Generate pages that differ only by simple word substitutions. The classic example of this is to generate pages for blue widgets for each state where the only difference between the pages is a simple word substitution (e.g. Alabama Blue Widgets, Arizona Blue Widgets, …).
  6. Forget to implement a canonical redirect. For example, not 301 redirecting http://yoursite.com to http://www.yoursite.com (or vice versa) for all the pages on your site. Regardless of which form you pick to be the preferred form of URL for your site, someone out there will link to the other form, so implementing the 301 redirect will eliminate that duplicate content problem for you, as well as consolidate all the page rank from your inbound links.
  7. Have your on-site links back to your home page point to http://www.yoursite.com/index.html (or index.htm, or index.shtml, or …). Since most of the rest of the world will link to http://www.yoursite.com, doing this creates duplicate content and divides your page rank.
  8. Implement printer pages without using robots.txt to keep them from being crawled.
  9. Implement archive pages without using robots.txt to keep them from being crawled.
  10. Use session ID parameters on your URLs. This means every time the crawler comes to your site it thinks it is seeing different pages.
  11. Implement parameters on your URLs for other tracking related purposes. One of the most popular is to implement an affiliate program. The search engine will see http://www.yoursite.com?affid=1234 as a duplicate of http://www.yoursite.com. This is made worse if you leave the “affid” on the URL throughout the user’s visit to your site. A better solution is to remove the ID when they arrive at the site, after storing the affiliate information in a cookie. Note that I have seen a case where an affiliate had a strong enough site that http://www.yoursite.com?affid=1234 started showing up in the search engines rather than http://www.yoursite.com (NOT good).
  12. Implement a site where parameters on URLs are ignored. If you, or someone else, links to your site with a parameter on the URL, it will look like dupe content.
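
Several of the items above (the missing canonical redirect, index.html links, session IDs, and tracking parameters) come down to the same thing: more than one URL for one page. As a rough illustration, here is a Python sketch that maps such variants to a single canonical URL, which you could then use as the target of a 301 redirect or to audit a list of crawled URLs. The host names, the parameter names in TRACKING_PARAMS, and the function names are assumptions made for the example, not something any particular CMS provides.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed values for illustration; adjust to your own site.
PREFERRED_HOST = "www.yoursite.com"
BARE_HOST = "yoursite.com"
TRACKING_PARAMS = {"affid", "sessionid"}   # hypothetical parameter names
INDEX_FILES = {"index.html", "index.htm", "index.shtml"}

def canonical_url(url):
    """Return the single preferred form of a URL."""
    parts = urlsplit(url)

    # Pick one host form (www vs. non-www).
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = PREFERRED_HOST

    # Treat /index.html and friends as the directory itself.
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail in INDEX_FILES:
        path = head + "/"

    # Drop tracking and session parameters, keep everything else.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]

    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))

def needs_redirect(url):
    """True if the requested URL differs from its canonical form."""
    return canonical_url(url) != url
```

For example, http://yoursite.com/index.html?affid=1234 and http://www.yoursite.com?affid=1234 both map to http://www.yoursite.com/ under these assumptions.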

There are many ways that people intentionally create duplicate content, by various scraping techniques, but there is no need to cover that here.

There are a number of gray area techniques, such as computer generated content. There was a very interesting presentation about this by Mikkel deMib Svendsen at SMX Advanced that talked about Markov Chains as a technique for generating content. One key to doing this well is to make sure the content is not seen as duplicate. The second key is to generate content that is meaningful for an end user.

When search engines look for duplicate content, they start by filtering out all the content on the page which is template based, such as the navigation on the sides, top, and bottom. They recognize this as being in common, and do not hold this against you. They base their evaluation on the content that is intended to be unique to that page.

Search engines will compare each of the pages on your site to other pages on your site, as well as to pages on other sites. One of the known techniques for doing this is the Sliding Window technique. Basically, it looks at the unique content on your page a fixed number of characters at a time. For example, it may look at the first 50 characters in the unique content section of your page, starting with the 1st character.

It then compares that snippet with other snippets as a part of its duplicate content check. It then looks at 50 characters starting with the 2nd character in the unique content section of your page, then it starts with the 3rd character, the 4th character, and so forth. One way you can try to see how you are doing is to use a Page Similarity Checker.
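
To make the sliding window idea concrete, here is a minimal Python sketch. It is an illustration only, not how any search engine actually implements its check. The 50-character window comes from the example above; collapsing whitespace first and scoring the overlap as the share of windows the two texts have in common (a Jaccard ratio) are assumptions added for the sketch.

```python
def windows(text, width=50):
    """Return the set of all width-character windows in the text.

    The window starts at the 1st character, then the 2nd, then the 3rd,
    and so on, as described above.
    """
    text = " ".join(text.split())          # assumption: collapse whitespace
    if len(text) <= width:
        return {text}                      # short text is a single window
    return {text[i:i + width] for i in range(len(text) - width + 1)}

def similarity(a, b, width=50):
    """Share of windows the two texts have in common, from 0.0 to 1.0."""
    wa, wb = windows(a, width), windows(b, width)
    return len(wa & wb) / len(wa | wb)

if __name__ == "__main__":
    page_a = "Alabama Blue Widgets are the best blue widgets you can buy anywhere in the state."
    page_b = "Arizona Blue Widgets are the best blue widgets you can buy anywhere in the state."
    print(round(similarity(page_a, page_b), 2))
```

Pages that differ only by a simple word substitution still share a large share of their windows, which is why that kind of content is easy to flag.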

In general, search engines do not penalize you for duplicate content. When they detect duplicate content, they simply try to choose only one of the duplicate pages to return in the search results, and they may not choose yours. They may base this choice on something like page rank, or on which copy of the content they detected first.

In extreme cases, I have actually seen algorithmic penalties applied. This is rare, and should only happen to you if your site is crawling with duplicate content, and has basically nothing else.

The last thing I want to note is that the main focus of webmasters should be on delivering pages of unique value. Uniqueness is important for many reasons, not least because it makes it far more likely that your site can obtain links. The primary value in knowing how to avoid unintentional duplicate content is to avoid the division of your page rank. Links to duplicate pages are wasted, and marketing your site is hard enough without shooting yourself in the foot.

Other suggestions for ways people unintentionally create duplicate content? Let me know, and I will add them to the above list.

Interview with Google’s Adam Lasnik

Adam Lasnik and I spoke about paid links, duplicate content and more late last week. The paid links conversation was very interesting. One of the things that Adam made clear is that Google is not looking to detect 100% of paid links. Their focus is much more on the links that are being sold for the purpose of passing PageRank.

We also talked about how the authenticated spam report form is going to be used. It turns out that this will not be used to decide on immediate penalties for sites that get reported. The information is simply going to be used for input into the search quality team at Google. Great news for webmasters who were worried about getting reported, as no immediate action will be taken.

Not so good news for those who want to level the playing field with competitors that are getting away with murder. Resolving those issues will still require patience.

There is a ton of good discussion in here about duplicate content, and other aspects of SEO too. Check this great interview with Adam out.

Adam Lasnik Clarifies Google Stance on Duplicate Content

Google’s Adam Lasnik has offered up a post today on Dealing Deftly with Duplicate Content. In it, he offers up the official Google stance on the issue.

In this post, he addresses the following key issues:

  1. What is duplicate content?
  2. What isn’t duplicate content?
  3. Why does Google care about duplicate content?
  4. What does Google do about it?
  5. How can Webmasters proactively address duplicate content issues?

However, there are issues that are not addressed in Adam’s post (fyi – these are issues which it’s not really Adam’s job to point out …). Here are a couple of key ones:

  1. When you have pages on your site that are duplicates (or near duplicates) of one another, the crawlers spend time crawling them, instead of other pages on your site. This can result in fewer indexed pages on your site.
  2. Since Google chooses which one of your pages to list, this means that some pages are not listed. However, the page rank that was passed to those pages, still gets passed to them. This means it’s wasted on those pages that do not get indexed, instead of being allocated to pages that are indexed.

There are many other issues with duplicate content, but these are among the biggest. It can be hard to resolve duplicate content problems, particularly if they result from poor site architecture, or the implementation of your content management system. But if your web traffic is a key part of your business, it’s well worth the effort.

Googlebot Detection and Combatting Copyright Violations

We live in a world where it’s increasingly common that other sites will copy your good content and re-publish it. This causes concerns that you will be flagged for publishing duplicate content, or that the search engines will not correctly recognize your site as the original author of the content. So what can you do about this problem?

Matt Cutts just posted one part of the answer on the Google Webmaster Blog. Google has now specified an official way to recognize Googlebot. There are a few ways you can use this information. For example, you can choose to allow only the search engine bots to crawl your site (you would want to identify all the ones you care about) and deny access to all other bots.

There is some risk to this strategy, as Google does periodically implement other bots to check for cloaking. You would end up blocking those bots with this strategy. I am going to see what I can find out from Google about this problem, and what they recommend webmasters do about it.
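
The verification method Google published is, as far as I understand it, a reverse DNS lookup on the visiting IP address followed by a forward lookup on the resulting host name. Below is a small Python sketch of that check. Treat it as an assumption-laden outline rather than a reference implementation: the accepted domain suffixes are the ones commonly cited for Googlebot, the function name is made up for this example, and the forward lookup used here only handles IPv4.

```python
import socket

# Assumption: host names of genuine Googlebot crawlers end in these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_googlebot(ip, reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
    """Check a client IP with a reverse lookup and a confirming forward lookup.

    The lookup functions are parameters so the check can be exercised
    without touching the network.
    """
    try:
        hostname = reverse(ip)[0]            # reverse DNS: IP -> host name
    except OSError:
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]     # forward DNS: host name -> IPs
    except OSError:
        return False
    # The forward lookup must lead back to the same IP address, otherwise
    # anyone controlling their own reverse DNS could claim a Google name.
    return ip in addresses
```

A user agent string alone proves nothing, since any scraper can claim to be Googlebot. The two-step lookup is what makes the check hard to spoof.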

In any event, if you see something crawling your site that is not a search engine crawler, you can block it pretty simply. If you are running Apache on your servers, you can place a command such as “deny from 00.00.158.37” in your .htaccess file, where the numbers represent the IP address of the bot crawling your site.

You would only know the IP address if you are regularly checking your log files. But this is something you should do. Protecting your valuable intellectual property is important.
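
As a starting point for that kind of log review, here is a short Python sketch that counts requests per IP address and user agent. It assumes your server writes the common or combined Apache log format; the regular expression and the function name are illustrative and will need adjusting if your log format differs.

```python
import re
from collections import Counter

# Assumption: Apache common/combined log format, for example
# 192.0.2.10 - - [10/Oct/2007:13:55:36 -0700] "GET / HTTP/1.0" 200 2326 "-" "SomeBot/1.0"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+'   # IP, identity, user, time, request, status, size
    r'(?: "[^"]*" "([^"]*)")?'                       # optional referer and user agent
)

def top_clients(lines, limit=10):
    """Return the busiest (ip, user_agent) pairs with their request counts."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            ip, agent = match.group(1), match.group(2) or "-"
            hits[(ip, agent)] += 1
    return hits.most_common(limit)
```

Passing an open log file to top_clients gives you a ranked list to review. An address that requests hundreds of pages and does not verify as a search engine crawler is a candidate for the “deny from” line described above.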

In addition, you should regularly check for the presence of copies of your site, or parts of your site. You can do this by searching on long unique strings from the pages of your site. When you find someone who is copying your content, there are a few steps you should take:

  1. Send them a cease and desist letter, warning them that you will sue.
  2. Send their hosting company a cease and desist letter, telling them that you will hold them liable for the actions of their customer. Include clear proof that you are the copyright holder of the content. This is often the most effective step, as it frequently results in the hosting account of the offending party being shut down. The hosting company wants nothing to do with it.
  3. If the offending party is involved in some major affiliate partnership, send a similar letter to their partner.

If these steps all fail, then the next step is to file a DMCA complaint with the search engines. The search engines do act on each of these requests. Google provides an outline of the process here. Among other things you need to provide clear proof that you own the copyright. You will also need to identify each search result that brings up the offending site.

Do not take this step lightly, as it’s a lot of work, and be VERY SURE that you are in the right. You don’t want to start this process trivially, as you will make the search engines very upset if you file an invalid request. But if you are in the right, and the cost of the copyright violation is significant, then this approach is worthwhile.

Keyword Duplication

Mapping site design to target keywords is a fine art. The goal is to create a site architecture built around keyword-rich content that provides quality for users and ranks well in search engines. One of the key mistakes people make is that they create multiple pages that end up competing for the same keywords.

For most sites, this is a bad mistake. When pages compete with each other, there is a strong chance that the search engine will see them as duplicate content. In addition, you are dividing your link power between two pages competing for one set of keywords. Best case, you will end up with a double listing (one of those listings where two of your pages are shown together in the search results). Worst case, the search engines will see the pages as being too similar (duplicate), and list only one of your pages.

The problem is that the best case is really unlikely to occur. You might be able to make it occur with a lot of forethought and planning. Some sites do this well, such as Amazon and CD Universe (for example: try searching on Nirvana CDs in Google). They clearly have built an entire site architecture that provides related information on the same topic in a well structured manner.

These companies have access to content generating machines. With an enormous wealth of content, you can architect it in a fashion similar to that done by these sites.

But for most webmasters, a substantial amount of effort is required to get their content together. There will be a finite set of keywords that you want to write content for. Meeting that initial demand for content, and expanding the number of keywords you are competing for later on, are much higher priorities than trying to double up on individual keywords.

Broader search term coverage will usually bring much better results for your site. And, of course, if you are awash in content, you can then indulge in the luxury of doubling up.

The Cost of Duplicate Content

Let’s talk about the cost of duplicate content. At first blush, it seems like a relatively minor issue. In principle, a search engine wants to include only one copy of a page in its index. So if you have multiple pages with the same content, the search engine picks only one. This means one copy of your content is ignored.

So far it does not sound too bad, does it? However, there are other less obvious consequences to duplicate content. For example, it can’t be good that crawlers come to your site and crawl pages that they will never index. In fact, it’s our understanding that crawlers come to your site with the goal of crawling a certain number of pages. So if they crawl pages that will not be indexed, they are not crawling pages that will. This could result in fewer pages of your site getting indexed.

In addition, there are tons of ways to end up with unintentional duplicate content. Here are just a few:

  1. Running an affiliate program
  2. Syndicating content
  3. Failure to 301 redirect from the non-www version of your site to the www version of your site (or vice versa)
  4. Code implementations that cause sub-domain pages to automatically mirror to your site
  5. Code implementations that lead to different URL paths to render the same content
  6. Pages with “different content”, but that are not different enough – this can happen with database driven sites

I am sure that there are many more ways to create duplicate content. Each of these scenarios has its own issues and problems. But one problem with nearly all of them is link dilution. Your site has a certain amount of page rank to spread around. Links that go to pages that will never be indexed are wasted. This means that less page rank is poured into those pages that are indexed. This will likely result in lower rankings for those pages.

So the bottom line is potentially fewer pages indexed and lower rankings for the pages that are indexed. This sounds like an extremely high cost to me. You can read more about problems with, and solutions for duplicate content here.

Affiliate Programs and Duplicate Content

Not too long ago I was working on a site that had a pretty active affiliate program. A very strange thing happened – One of the affiliates unintentionally hijacked the search results of the source site. Let me illustrate what I mean with an example.

The site used to come up very highly in Google for a particular term, let’s call it “discount blue widgets”. The page that Google was listing was “http://www.yourdomain.com”. One day I went back to Google and looked at the current rankings for “discount blue widgets”, and lo and behold, the page listed by Google had changed to “http://www.yourdomain.com?affid=12” (for example).

Now all of a sudden my client was going to have to pay an affiliate commission for all business that resulted from organic Google search engine traffic. Certainly not what anyone intended.

How could this happen? Evidently, the affiliate who was benefiting from this occurrence had a site that was so authoritative that the one link from this affiliate’s home page caused this affiliate’s version of the home page to be seen as more important than the actual home page of the site.

Note that Google sees http://www.yourdomain.com?affid=12 as a different page than http://www.yourdomain.com, even though the content is clearly identical. In fact, this creates a duplicate content situation, where Google must choose between one version of the page and the other.

So the one link from this affiliate carried more weight in Google than all the other links this client’s site had (the client had not yet gathered many links). Ouch!

So what to do? The fix actually turned out to be quite easy. We added a rel=“nofollow” attribute to the link from this affiliate’s site to our client’s site. It took 90 days for it to ripple through to the Google index, but it did, and the problem was solved. Once Google was told not to follow the link, it no longer placed any value on the http://www.yourdomain.com?affid=12 page, so the original home page became the most important version of the content, and got listed again.

When you implement an affiliate program, make sure you retain the right to require your affiliate to add a rel=“nofollow” attribute to their links to your site. You should also retain the right to 301 all their links to your site to the page of your choice in the event that they don’t comply with the request to add the nofollow attribute. This is an interesting example of duplicate content and creative uses of the rel=“nofollow” attribute.
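
If you do end up 301 redirecting affiliate links yourself, one way to keep crediting the affiliate is to store the ID in a cookie before sending the visitor on to the clean URL. The Python sketch below shows the idea in a framework-neutral way: it only computes the status code and headers you would return. The “affid” parameter name matches the example above, while the function name and the 30-day cookie lifetime are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote

AFFILIATE_PARAM = "affid"
COOKIE_MAX_AGE = 30 * 24 * 3600   # assumed 30-day attribution window, in seconds

def affiliate_redirect(url):
    """Return (301, headers) that strip the affiliate ID, or None if absent.

    The affiliate ID is preserved in a cookie so the sale can still be
    credited, while crawlers and users only ever land on the clean URL.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    ids = [value for key, value in pairs if key == AFFILIATE_PARAM]
    if not ids:
        return None

    # Rebuild the URL without the affiliate parameter.
    remaining = urlencode([(k, v) for k, v in pairs if k != AFFILIATE_PARAM])
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path or "/",
                        remaining, ""))

    # Percent-encode the ID so it is safe to place in a cookie value.
    cookie = "{}={}; Max-Age={}; Path=/".format(
        AFFILIATE_PARAM, quote(ids[0], safe=""), COOKIE_MAX_AGE)
    return 301, [("Location", clean), ("Set-Cookie", cookie)]
```

With this in place, a request for http://www.yourdomain.com?affid=12 would be answered with a 301 to http://www.yourdomain.com/, so the parameterized URL never competes with the real home page.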