Drupal and Search Engine Optimization

Drupal is known for being a very SEO-friendly content management system (CMS). The way it assembles its pages is crawler-friendly, which makes it a popular choice for people building dynamic web sites. However, Drupal also has a number of potential SEO problems that need to be dealt with to get optimal results.

The very fact that Drupal is such a dynamic system is a factor that leads to some of its SEO problems. The content is stored in a database and retrieved at runtime. Almost all information is stored as a “node”, a basic, unstructured unit of content. Often, each “node” is associated with groups of keywords, known as “taxonomies”, and Drupal makes it easy to retrieve and sort information by these taxonomies. Since all content can be retrieved dynamically, Drupal generates generic URLs for the content, such as www.example.com/?q=node/3 or www.example.com/node/3.

These “internal” URLs are always present in Drupal, even though Drupal provides features that allow you to hide them and instead present much friendlier URLs, known as aliases, to web site users. Multiple optional modules may affect the generation of pages and the naming of URLs, and many modules remain aware of the internal naming conventions even when user-friendly URLs are being used. As a result, Drupal may expose both the internal URLs and the user-friendly URLs to users and web crawlers.
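To make the overlap between internal URLs and aliases concrete, here is a minimal sketch of how you might group a list of crawled URLs by the internal path they resolve to. The alias table, the sample URLs, and the function names are all hypothetical; on a real site the alias mapping would come from Drupal itself.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs

# Hypothetical alias table: friendly path -> internal Drupal path.
ALIASES = {
    "content/how-to-surf": "node/5",
}

def internal_path(url: str) -> str:
    """Reduce a URL to the internal Drupal path it ultimately serves."""
    # Add a scheme-less prefix so urlsplit can find the host in bare URLs.
    parts = urlsplit(url if "//" in url else "//" + url)
    query = parse_qs(parts.query)
    # Non-clean URLs carry the path in the q parameter, e.g. /?q=node/3
    path = query["q"][0] if "q" in query else parts.path
    path = path.strip("/")
    return ALIASES.get(path, path)

def find_duplicates(urls):
    """Group URLs by internal path and keep groups with more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[internal_path(url)].append(url)
    return {path: found for path, found in groups.items() if len(found) > 1}

crawled = [
    "www.example.com/?q=node/3",
    "www.example.com/node/3",
    "www.example.com/node/5",
    "www.example.com/content/how-to-surf",
    "www.example.com/node/7",
]
duplicates = find_duplicates(crawled)
for path, found in duplicates.items():
    print(path, "->", found)
```

Any internal path that shows up with more than one URL is a candidate duplicate content problem.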

As a result of these kinds of architectural issues, many Drupal sites end up exposing content to the web via multiple URLs. When this happens, search engines can crawl each of those URLs, creating duplicate content problems. Here are some examples of duplicate content issues, and some other problems that can arise in Drupal.

1. Problem: Duplicate content from aliases.

Example: www.example.com/node/5 and www.example.com/content/how-to-surf, both pointing at the same physical document.

Solution: Use robots.txt to disallow URLs that include “/node/”. For example, you can include the following lines in robots.txt:

Disallow: /node

Disallow: /*/node/

Considerations: Note that this assumes that all URLs are available via friendly aliases. This should be the case if you’re using the Pathauto module.

2. Problem: Drupal’s default robots.txt has errors.

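Example: The default robots.txt uses “Disallow: /search”. Because robots.txt rules are matched as path prefixes, this blocks every URL whose path begins with /search, which can include unrelated pages (for instance, a hypothetical /search-tips page), when the goal is to block only the Drupal internal search results pages, which live under /search/.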

Solution: Update the robots.txt to read:

Disallow: /search/

3. Problem: Pathauto can create many extra pages on the site if configured incorrectly.

Example: If you turn on “Create index aliases”, and you have a hierarchical alias (e.g., a page with a path containing a slash, such as music/concert/beethoven), Drupal automatically generates index pages that contain all pages in each category (for example, all music and all concerts).

Solution: Do not check the “Create index aliases” check box in the Pathauto module.

4. Problem: Incorrect setting of the Pathauto “Update action”, in a production environment, can cause URLs of published pages, which may already be indexed by the search engines, to change.

Solution: In development mode (before exposing the site to the search engines), use “Create a new alias, replacing the old one” to regenerate URLs whenever necessary (for example, if your Pathauto rules change). In production, once the site is exposed, set this to “Do nothing, leaving the old alias intact”.

5. Problem: Some modules, such as Forums and Views, create sortable lists that can generate multiple URLs with duplicate content.

Solution: If you use such a module, be sure to exclude the sorted variations using the following robots.txt rule:

Disallow: /*sort=

6. Problem: The Forward module adds a link on each page to a URL that allows the page to be forwarded to a friend. You can easily end up with hundreds or thousands of such low-quality pages that are essentially boilerplate.

Solution: If you use this module, be sure to exclude the forward pages using the following robots.txt rule:

Disallow: /forward/
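Before deploying rules like the ones above, it can help to check them against a handful of sample paths. The following is a minimal sketch of a matcher, assuming the wildcard extension supported by the major search engines, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. It handles Disallow rules only and ignores user-agent groups and Allow precedence, so treat it as a sanity check rather than a faithful model of any particular crawler. The sample paths are hypothetical.

```python
import re

def rule_to_regex(rule: str):
    """Translate a robots.txt path rule into an anchored regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_disallowed(path: str, rules) -> bool:
    """Return True if any Disallow rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

# The Disallow rules suggested in the list above.
RULES = ["/node", "/*/node/", "/search/", "/*sort=", "/forward/"]

# Hypothetical paths (including query strings) to check against the rules.
samples = [
    "/node/5",
    "/content/how-to-surf",
    "/search/node/surfing",
    "/forum/12?sort=asc&order=Topic",
    "/forward/5",
]
for path in samples:
    print(path, "blocked" if is_disallowed(path, RULES) else "allowed")
```

Running a check like this against a list of your own aliases is a quick way to confirm that a new rule does not accidentally block pages you want indexed.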

These problems can crop up on many Drupal systems, and all Drupal users should review their sites for these issues. Drupal may also have other issues, depending on the site and the degree of customization. For example, on several sites, we’ve seen Drupal generate complex CSS hierarchies that end up building hidden text into the pages. While search engines try to detect hidden text scenarios that are not a result of bad intent, this is a risk you don’t need. As long as you recognize what the issues are, they can be dealt with, and Drupal can be a great choice as a content management system. Most content management systems present even greater challenges to SEO.

Google Shared Stuff

Tony Ruscoe over at Google Blogoscoped has found out that Google has released a new social link sharing service called Google Shared Stuff. If you have a Google account, this service allows you to get started simply by dragging Google’s Email / Share button onto the bookmark toolbar in your browser.

Once there, all you need to do is click on the button to share any web page that you visit. You can enter a description and tags for each article. One intriguing thing that the service does is try to find a representative image on the web page to include with the data you enter manually.

You can email the profile to others directly from the service, or you can post it to other bookmarking / sharing oriented services, including: Facebook, Furl, del.icio.us, Social Poster, Reddit, and Digg.

Users can also subscribe to a feed for your Shared Stuff. This can be a direct RSS feed, or they can add it to their Google home page, or to Google Reader.

Keeping in mind all the recent discussion about search engines using more and more signals about page and site quality, it seems likely that this service will be another way that Google can determine whether or not a page is a quality page. If lots of users share your page(s), these shares may be considered votes for your page and site.

What remains a mystery is whether and when there will be a real public unveiling of the service.

Using NoFollow to Manage PageRank Flow

Recently, in a conversation that Matt Cutts had with Rand Fishkin, Matt confirmed that Google does not see the use of NoFollow on your web sites as a spam tactic. Here are Matt’s exact words:

The nofollow attribute is just a mechanism that gives webmasters the ability to modify PageRank flow at link-level granularity. Plenty of other mechanisms would also work (e.g. a link through a page that is robot.txt’ed out), but nofollow on individual links is simpler for some folks to use. There’s no stigma to using nofollow, even on your own internal links

NoFollow in the Footer Nav

This raises some interesting possibilities for using this as a tool to concentrate PageRank in the places where you want to concentrate it. To see what we can do with this, let’s look at the SEOmoz blog’s footer navigation for an example:

[Image: SEOmoz footer nav]

This is a fairly common-looking footer. Note how the “About”, “Our Services”, “Our Clients”, and “Contact” links are in the footer nav, a design element that shows up on every page of the site. When you link to a page from every page of your site, the search engine is likely to think that you are saying it’s one of your most important pages.

Clearly, from a business perspective, the “Contact” page is one of the most important pages on the site. However, there is no reason to expect that it will rank highly for important search terms, no matter how much link juice you give it. You may, or may not, want the page to be in the index, but you don’t need to spend tons of PageRank on pages that will never rank.

A good solution for this is to use the NoFollow attribute on these four links. Note that you do not want to use the NoFollow meta tag, because it applies to the whole page and will prevent the page from passing any link juice to any other page. This is not your goal.

In theory, this should signal to Google that these pages should not be getting any link juice from the other pages of the site. If you want the pages to still be in the index, take one page, such as the home page, and do not apply the NoFollow attribute to the links to these pages on that page. As a result, the search engines will still see the pages.
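As an illustration of the approach described above, here is a minimal sketch of a footer template helper that adds rel="nofollow" to the four footer links on every page except the home page. The link paths and the function name are hypothetical, and the output format is only one way a footer might be rendered.

```python
from html import escape

# Hypothetical footer entries, using the link labels discussed above.
FOOTER_LINKS = [
    ("About", "/about"),
    ("Our Services", "/services"),
    ("Our Clients", "/clients"),
    ("Contact", "/contact"),
]

def render_footer(is_home_page: bool) -> str:
    """Build footer links, adding rel="nofollow" everywhere except the home page."""
    # The home page keeps plain links so the pages can still be discovered.
    rel = "" if is_home_page else ' rel="nofollow"'
    links = [
        f'<a href="{escape(href)}"{rel}>{escape(label)}</a>'
        for label, href in FOOTER_LINKS
    ]
    return " | ".join(links)

print(render_footer(is_home_page=True))
print(render_footer(is_home_page=False))
```

Because the attribute is applied per link, the rest of the links on each page are left untouched.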

NoFollow in the Main Nav

Another application of NoFollow comes in when you are dealing with sites that cross-link between product categories. Let’s look at an example of this scenario:

[Image: Digital Camera HQ nav]

In this example using the Digital Camera HQ main navigation menu, you could imagine that the Price Range pages change a lot, and are not likely to rank highly in the engines no matter what you do. In addition, the cameras listed under Most Popular are key pages that you want to pass the most PageRank to.

Assuming that this is true, NoFollowing the links to the Price Range pages would be a smart idea. As a result, you would stop spending PageRank on those pages, and have more to allocate to the other pages in the main nav, such as the Most Popular, and the Camera Brand pages.

As before, if you still want the Price Range pages in the index, just not with so much link juice, then pick one page and link from it to the Price Range pages without the NoFollow attribute. The home page is once again a great place to do this from.
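The reallocation described above can be illustrated with some simple arithmetic. This sketch assumes a simplified model in which a page splits the PageRank it can pass evenly among its followed links, and in which NoFollowed links are left out of that split. That model is the premise of this post, not a confirmed description of how Google behaves, and the link counts are hypothetical.

```python
def share_per_followed_link(total_links: int, nofollowed: int) -> float:
    """Fraction of a page's passable score that each followed link receives."""
    followed = total_links - nofollowed
    if followed <= 0:
        raise ValueError("at least one link must remain followed")
    return 1.0 / followed

# Hypothetical main nav: 20 links, 5 of which point at Price Range pages.
before = share_per_followed_link(total_links=20, nofollowed=0)
after = share_per_followed_link(total_links=20, nofollowed=5)
print(f"before: {before:.4f} per link, after: {after:.4f} per link")
```

Under these assumptions, each remaining link's share rises from 1/20 to 1/15 of what the page can pass.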

Summary

Based on Matt’s statements to Rand, it seems like these strategies should work for your site. As with all things of this type in the SEO world, there is no real guarantee that this will help you, but, intuitively, it makes sense. In addition, given the care that Matt and other Googlers must take in their public statements, it seems likely that there is little risk in trying it out.

Relevance, Importance, and Usefulness

There are so many different aspects to SEO. It’s a complicated topic, and it can often be difficult to explain to clients. We try to use an approach that helps our clients quickly get an understanding of what is driving the various recommendations we make. We do this by breaking down the major aspects of SEO into these 3 components: Relevance, Importance, and Usefulness. Here is how we talk about these things with our clients:

  1. Relevance: What is your site about? What types of information can search engines access to find out what your site is about? This ends up being things like title tags, on page text analysis, inbound link analysis (including anchor text and the topical matter of the sites linking to you), outbound link analysis, etc.
  2. Importance: If you operate a book store site and compete with Amazon, why does Amazon always rank ahead of your site? One of the major inputs to this process is link analysis, but the link analysis is done in a very topical manner. In other words, a page can only be important for subjects that are relevant to the page. The other important point is that not all links are created equal. A link from a recognized authority site in your topic area is worth more than a link from a site without that standing. The specifics of how the links are implemented (i.e., anchor text, surrounding context) also help in weighting a link’s importance to the topic of your page.
  3. Usefulness: Search engines have much more data available to them about a site than its content and links. This includes things like bounce rates and tagging data from social media sites. If your site’s bounce rate is measurably higher than that of your competitor, that may mean your site is less useful than the competitor’s site.

These are all great over-simplifications, but many clients don’t want to understand the great inner depths of SEO, yet they still need to understand what is basically at work here. We find that this approach helps the client gain a basic appreciation for what SEO is about, and why it is important to their business.