Complex Site Hierarchies

Large sites have a variety of problems. They are inherently more complex than smaller sites to manage from many perspectives. Some of the major ones include:

  1. Content management is a major challenge
  2. Providing users a quality user experience (e.g. an easy to understand navigation system) may be difficult
  3. And, yes, SEO can be a challenge too

One of the most common SEO problems relates to managing complex site hierarchies. An example of this is when you try to implement a site with a local aspect. For example, let’s say you sell widgets, and like pizzas, it matters where you sell them. Let’s also say that you sell widgets in 200 different cities and towns. So far it’s not too bad.

Now let’s complicate things a bit further and imagine that there are 45 different types of widgets (e.g. you sell auto parts, and there are hundreds of different types of parts), but there is still a reason to be concerned about where you sell these widgets – perhaps there is a service aspect to it, or some reason why the customer wants to be able to look at the actual product first.

Now you decide you want to offer your users the navigation of their choice. Want to search on the product type first? Go ahead! Pick widgets, then blue widgets, then pick the city where you want to shop for it. Your breadcrumb bar might look like this: Widgets > Blue Widgets > CityName. Pretty straightforward.

Want to search on the city name first? You probably want your user to be able to do that too. After that you let them pick their product category and specific product. The resulting breadcrumb bar might be something like: CityName > Widgets > Blue Widgets.

The problem is pretty easy to detect: Widgets > Blue Widgets > CityName and CityName > Widgets > Blue Widgets have the exact same content. This makes them duplicate content. This is not a good thing. Let’s review some options for dealing with it:

  1. One possible solution is simply to not offer both navigation paths. For many sites, this is pretty viable. Unless you offer a lot of very specialized content completely unrelated to your products that you are selling, you aren’t likely to rank for a city name anyway. Unless there is a compelling reason to offer both methods of navigation, just don’t do it.
  2. You could also allow someone to pick their city first, and then when they pick a product, send them over to the other copy of the page. Basically, what you are doing here is sending someone from City > Widgets to Widgets > City. This is a pretty good solution for eliminating the duplicate content, but it can have a high usability cost. The nature of the cost is that your breadcrumb will be confusing to the user. The problem is that they selected a City first, and then a product, but the breadcrumb indicates the opposite. You can build the breadcrumb dynamically on the page, but from an SEO perspective, the breadcrumb bar is something that you want to use to reinforce the link hierarchy of your site.
  3. Next up, you can offer the complete path, but NoFollow one path entirely. This provides a valid path for a user to follow, without any of the search engine issues outlined above. This is a pretty good option for managing the link juice flow too. Now the search engine only sees one “City Blue Widgets” page, and more of your link juice flows to that one version of the page. I don’t see any major downside to this option.
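To make the "one page, two navigation paths" idea concrete, here is a minimal sketch of how a site might map whatever order the user picks things in onto a single product-first URL. The function names, the (kind, label) selection format, the URL layout, and the city "Springfield" are all assumptions made up for illustration, not something prescribed above.

```python
def slugify(name: str) -> str:
    """Lowercase a label and join its words with hyphens."""
    return "-".join(name.lower().split())


def canonical_path(category: str, product: str, city: str) -> str:
    """Return the single product-first URL path for a product/city pair."""
    return "/" + "/".join(slugify(part) for part in (category, product, city))


def resolve(selection: list[tuple[str, str]]) -> str:
    """Map selections made in any order to the one canonical path.

    Each selection is a (kind, label) pair; the kinds "category",
    "product", and "city" are hypothetical names for this sketch.
    """
    chosen = dict(selection)
    return canonical_path(chosen["category"], chosen["product"], chosen["city"])


# The same page reached through the two navigation orders.
product_first = [("category", "Widgets"), ("product", "Blue Widgets"), ("city", "Springfield")]
city_first = [("city", "Springfield"), ("category", "Widgets"), ("product", "Blue Widgets")]

print(resolve(product_first))
print(resolve(city_first))
```

Both calls print the same path, so only one version of the page ever needs a crawlable URL. Whether you then redirect the city-first route or NoFollow it is the design choice discussed in options 2 and 3.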

Related to all this is the issue of developing content. In what we have outlined above (hundreds of products and hundreds of cities) you are likely to have tens of thousands of pages. How are you going to get content for all those pages?

While that’s not today’s topic, knowing the answer to that is important in tackling these types of hierarchy questions. You are not doing anyone, including yourself, any good by publishing thousands of low content pages. You just end up with loads of pages in the supplemental index of Google and provide lots of low quality signals to all the search engines.

You would be better off having a smaller site where the pages are all of high quality, and then grow the site over time as you develop more content. Even in this scenario though, it’s a good idea to know what direction you are headed in before you design your initial architecture.

Drupal and Search Engine Optimization

Drupal is known for being a very SEO friendly content management system (CMS). The way it assembles its pages is crawler friendly. This makes it a popular choice for people looking to build dynamic web sites. However, there are a number of potential SEO problems with Drupal as well. These need to be dealt with to ensure that you get optimal results.

The very fact that Drupal is such a dynamic system is a factor that leads to some of its SEO problems. The content is stored in a database and retrieved at runtime. Almost all information is stored as a “node”, a basic, unstructured unit of content. Often, each “node” is associated with groups of keywords, known as “taxonomies”, and Drupal makes it easy to retrieve and sort information by these taxonomies. Since all content can be retrieved dynamically, Drupal generates generic URLs for the content, such as www.example.com/?q=node/3 or www.example.com/node/3.

These “internal” URLs are always present in Drupal, even though Drupal provides features that allow you to hide them, and instead present much friendlier URLs, known as aliases, to web site users. There are multiple optional modules that may affect the generation of pages and the naming of URLs, and there are many modules that remain aware of the internal naming conventions, even when user-friendly URLs are being used. As a result Drupal may expose both the internal URLs and the user-friendly URLs to users and web crawlers.

As a result of these kinds of architectural issues, many Drupal sites end up exposing content to the web via multiple URLs. When this happens, the multiple URLs can be crawled by the search engines, creating duplicate content problems. Here are some examples of duplicate content issues, and some other problems that can arise in Drupal.

1. Problem: Duplicate content from aliases.

Example: www.example.com/node/5 and www.example.com/content/how-to-surf, both pointing at the same physical document.

Solution: Use robots.txt to disallow URLs that include “/node/”. For example, you can include the following lines in robots.txt:

Disallow: /node

Disallow: /*/node/

Considerations: Note that this assumes that all URLs are available via friendly aliases. This should be the case if you’re using the Pathauto module.

2. Problem: Drupal’s default robots.txt has errors.

Example: the default robots.txt uses “Disallow: /search”. This disallows only a page ending with /search, but not all of the Drupal internal search results pages, which is desired.

Solution: Update the robots.txt to read:

Disallow: /search/

3. Problem: Pathauto can create many extra pages on the site if configured incorrectly.

Example: If you turn on “Create index aliases”, and you have a hierarchical alias (e.g., a page with a path containing a slash, such as music/concert/beethoven), Drupal automatically generates index pages that contain all pages in each category (for example, all music and all concerts).

Solution: Do not check the “Create index aliases” check box in the Pathauto module.

4. Problem: Incorrect setting of the Pathauto “Update action”, in a production environment, can cause URLs of published pages, which may already be indexed by the search engines, to change.

Solution: In development mode (before exposing the site to the search engines), use “Create a new alias, replacing the old one” to regenerate URLs whenever necessary (for example, if your Pathauto rules change). In production, once the site is exposed, set this to “Do nothing, leaving the old alias intact”.

5. Problem: Some modules, such as Forums and Views, create sortable lists that can generate multiple URLs with duplicate content.

Solution: If you use such a module, be sure to exclude the sorted variations using the following robots.txt rule:

Disallow: /*sort=

6. Problem: The Forward module creates a link to a URL, on each page, that allows the page to be forwarded to a friend. You can easily end up with hundreds or thousands of such low quality pages that are essentially boilerplates.

Solution: If you use this module, be sure to exclude the forward pages using the following robots.txt rule:

Disallow: /forward/

These problems can crop up on many Drupal systems, and all Drupal users should review their sites for these issues. Drupal may also have other issues, depending on the site and the degree of customization. For example, on several sites, we’ve seen Drupal generate complex CSS hierarchies that end up building hidden text into the pages. While search engines try to detect hidden text scenarios that are not a result of bad intent, this is a risk you don’t need. As long as you recognize what the issues are, they can be dealt with, and Drupal can be a great choice as a content management system. Most content management systems present even greater challenges to SEO.
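Before deploying rules like the ones listed above, it can help to check them against a handful of real URLs from your site. The following is a rough sketch of such a check, assuming the wildcard semantics that the major search engines document ('*' matches any run of characters, a trailing '$' anchors the end, everything else is a prefix match). It ignores Allow lines and rule precedence, and the function names and sample paths are made up for illustration.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow path into a regex.

    '*' matches any run of characters and a trailing '$' anchors the
    end of the URL; without '$' the rule is a simple prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, rules: list[str]) -> bool:
    """Return True if any rule matches the path from its first character."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# Rules taken from the solutions above.
RULES = ["/node", "/*/node/", "/*sort=", "/forward/"]

for sample in ["/node/5", "/content/how-to-surf", "/forum?sort=asc", "/forward/12"]:
    print(sample, is_disallowed(sample, RULES))
```

The point of a check like this is simply to confirm that the friendly alias stays crawlable while the internal, sorted, and forward URLs do not. The search engines' own testing tools remain the final word on how a given rule is interpreted.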

12 Quick Site Architecture Tips

Site Architecture is a critical component of SEO. When you are starting on a new site design, you have to begin by thinking through the SEO plan as part of the design. Here are 12 quick tips on how to get your site architecture right:

  1. Complete the steps below before writing 1 line of code for your site. It will save you time and money.
  2. Have an experienced person do keyword analysis for your business. Keyword tools, such as Wordtracker and Keyword Discovery, provide wonderful insight into the mindset of your potential customers. What terms do they search on when looking for products like yours? Keyword tools can tell you that. You need to map your web site copy to these terms, because these are the terms that will engage them the most. These would be great tools to use even if SEO did not exist.
  3. Use your keyword analysis to define what content you need for the site. Each major keyword is a potential topic for you to write about. If your potential customers use these phrases when looking for products like yours, then you want to grab their attention by writing content related to those phrases.
  4. Search engines (and users) like sites that have a simple hierarchical structure. These types of sites are also easier to maintain, so everybody wins when you build a site with a simple tree like structure.
  5. Search engines (and users) like to see a simple, clean global navigation scheme, that uses the same approach across all the pages of your site.
  6. Users also benefit from the use of breadcrumb bars that help them understand where they have been, and how they got to the current page. Not so much for search engines this one, but still a really good idea to implement.
  7. Keep your site relatively flat. Search engines look to you for clues as to what content you find important on your site. If the content is 4 clicks from the home page, how important can it be?
  8. Keep the link density low. A good rule of thumb is to have fewer than 200 links on your most link dense page. Unless your site is considered quite important by the search engines, they probably don’t look at much more than that on the page. It’s also not very user friendly.
  9. Avoid parameters on your URLs. If you are generating your site from a database, use mod_rewrite (or equivalent) to remap the URLs. Map www.yourdomain.com?catid=1345&prodid=164 to something more like www.yourdomain.com?catid=cars&prodid=ford+taurus. Once again, it’s also more user friendly to look at.
  10. Oldies, but goodies #1: Every single page on your site should have one unique URL that brings you to that page, and no more than one. This saves you the enormous headache of duplicate content problems, and you don’t want to go there. This means you can’t use session IDs. You also need to make sure that your web site application does not allow a given page to be described by 2 or more different URLs. It also means that you need to:
    1. 301 redirect all your http://yourdomain.com/* pages to http://www.yourdomain.com/* pages (OR vice versa)
    2. 301 redirect http://www.yourdomain.com/index.html to http://www.yourdomain.com
    3. 301 redirect http://www.yourdomain.com/index.htm to http://www.yourdomain.com
    4. 301 redirect http://www.yourdomain.com/index.shtml to http://www.yourdomain.com
  11. Oldies, but goodies #2: Don’t bury your pages in Flash. If you are going to use Flash, then read this article for tips on implementing Flash in a search engine friendly manner.
  12. Oldies, but goodies #3: Don’t bury your pages in Javascript either. It’s fine to use it, but don’t have 5000 lines of Javascript on a page with 20 lines of search engine crawlable text.
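The redirect rules in tip 10 boil down to one function: given any URL a visitor or crawler asks for, work out the single canonical URL for that page. Here is a minimal sketch of that logic. It assumes the www form is the one you keep, and the function names are made up for illustration; in practice you would usually express the same rules in your web server's rewrite configuration.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.yourdomain.com"
INDEX_FILES = ("index.html", "index.htm", "index.shtml")


def canonical_url(url: str) -> str:
    """Return the one canonical form of a URL on this site."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == "yourdomain.com":
        # Assumption: the www form is canonical (the tip allows either).
        host = CANONICAL_HOST
    path = parts.path
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            # Drop the index file name, keeping the directory's trailing slash.
            path = path[: -len(name)]
            break
    if path == "":
        path = "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def redirect_target(url: str):
    """Return the URL to 301-redirect to, or None if already canonical."""
    target = canonical_url(url)
    return None if target == url else target


print(redirect_target("http://yourdomain.com/about"))
print(redirect_target("http://www.yourdomain.com/index.html"))
print(redirect_target("http://www.yourdomain.com/widgets"))
```

Note that this sketch treats the home page as http://www.yourdomain.com/ with a trailing slash, which is what browsers request for the bare domain anyway.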

I am sure there are more. I am trying to keep the above list focused on site architecture, so I stayed away from things like link building. If anyone wants to make suggestions, I will be happy to add them.

Using a Site Map to Increase Traffic

Building a good site map can be challenging (for clarity’s sake, I need to point out that we are not talking about the Sitemaps protocol here, but an on site, site map). I also find that there is a lot of confusion out there about what makes a good one. People are locked down on the notion that they need to have one page on their site that has links to every other page on their site. There is also confusion on when one is needed – not all sites need one.

Let’s start with the objective: Minimize the number of clicks from your home page to your content. That’s it. Many sites have simple enough architectures, and really clean navigation systems so that adding a site map does not reduce the number of clicks to your content. If that’s the case with your site, then you don’t need to waste your time building a site map.

However, other sites are more complex in their very nature. You may have a site that has content that takes 4 or more clicks to reach through your standard navigation. If this is the case, you may be a candidate for a site map file.
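If you are not sure how many clicks your content is from the home page, you can measure it. Below is a minimal sketch that does a breadth-first walk over a map of internal links and reports each page's click depth. The link map, the page paths, and the function name are hypothetical; on a real site you would build the map from a crawl.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Return the minimum number of clicks from home to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# A hypothetical site where one page sits 4 clicks deep.
SITE = {
    "/": ["/widgets"],
    "/widgets": ["/widgets/blue"],
    "/widgets/blue": ["/widgets/blue/large"],
    "/widgets/blue/large": ["/widgets/blue/large/springfield"],
}

depths = click_depths(SITE, "/")
deep_pages = sorted(page for page, depth in depths.items() if depth >= 4)
print(deep_pages)

# The same site with a site map page linked from the home page.
SITE_WITH_MAP = {
    **SITE,
    "/": ["/widgets", "/sitemap"],
    "/sitemap": ["/widgets/blue/large/springfield"],
}
print(click_depths(SITE_WITH_MAP, "/")["/widgets/blue/large/springfield"])
```

In this made-up example the deepest page drops from 4 clicks to 2 once the site map exists. If the numbers for your site do not change, that is the signal that a site map is not worth building.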

With all the “design for users, not search engines” discussion going on, I do have to note that a site map file is an example of where you actually do want to design something for the search engine. You can think of it as spider food. Done properly, it has never proved to be a problem in our experience.

There are pitfalls – I have seen tons of site maps with many hundreds, or even thousands of links on them. Not going to fly! You still want to limit the number of links on the page to 200 or fewer (other people say it’s 100 – your mileage may vary).

The way to deal with this is to divide your site map into multiple files, on a topical basis. So if you have a site selling thousands of different kinds of “widgets”, you might end up with multiple site map files:

  1. Widgets by color
  2. Widgets by size
  3. Widgets by location
  4. Widgets by manufacturer

This gives you site map files that are topically relevant, and potentially valuable search tools for use by your users.
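As a rough illustration of the splitting step, here is a small sketch that takes pages already grouped by topic and cuts each group into site map pages of no more than 200 links. The topic names, the URL patterns, and the "(part N)" naming are assumptions for the example only.

```python
MAX_LINKS = 200


def build_site_maps(
    pages_by_topic: dict[str, list[str]], max_links: int = MAX_LINKS
) -> dict[str, list[str]]:
    """Split each topic's URLs into site map pages of at most max_links links."""
    site_maps = {}
    for topic, urls in pages_by_topic.items():
        for start in range(0, len(urls), max_links):
            part = start // max_links + 1
            site_maps[f"{topic} (part {part})"] = urls[start:start + max_links]
    return site_maps


# Hypothetical inventory: 450 pages organized by color, 120 by size.
pages = {
    "Widgets by color": [f"/widgets/color/{n}" for n in range(450)],
    "Widgets by size": [f"/widgets/size/{n}" for n in range(120)],
}

for name, urls in build_site_maps(pages).items():
    print(name, len(urls))
```

Here the color group ends up on three site map pages and the size group on one, and no page goes over the limit.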

These types of site map files work. They were a major element of our Videomaker Site Redesign that resulted in a 60+% traffic gain in the first 3 months, and traffic appears to have more than doubled since then. Cool stuff, and it’s cheap and easy to do for most sites.

Building Multi-Million $ Web Sites from Scratch (Part 5 of …)

Code Architecture

When building a large web site from scratch, it’s critical that you have the right code architecture. A weak code architecture can torpedo your entire plan, or can leave you in a position where you can’t scale your site. Given the type of content we have been talking about in this series, you may need tens of thousands of web pages, or even hundreds of thousands.

You want to be able to implement this in a way where each page is unique and different to avoid duplicate content problems, yet the cost to maintain these pages is low. You also want to make sure you implement a code structure that is very clean from a search engine perspective, so let’s deal with this second issue first.

Clean Code

You want the unique content of your page to show up immediately below the BODY tag. If your BODY tag is at line 30, you want a DIV tag for the main section of your content at line 31, and then your H1 tag at line 32. By unique content, we mean the stuff that shows up only on that page, beginning with an H1 tag that labels the page. Unique content does not include standard navigation or menus, any Javascript or Flash, or standard footers, etc.

Then you must use absolute positioning in your CSS file. The CSS for this will look something like this:

#main {
  /* Place the block by coordinates, independent of its order in the HTML source */
  position: absolute;
  left: 100px;
  top: 100px;
  padding-left: 10px;
  padding-top: 10px;
}

where “main” is the name of your div tag where the content related to this statement will appear.

The key then is to organize all of your sections of your page using DIV tags, and use absolute positioning definitions of each of these DIV tags in the layout of your web page. Once you have done this, you can easily move the main body of your unique content right under your BODY tag. You can read more about how to use CSS for SEO in this article.

A last note on this. Some technical people may raise concerns that absolute positioning statements are not interpreted the same way by all browsers, and that there is a risk that it will not look right in some browsers. There is some truth to this concern, but our testing shows that IE, Firefox, Mozilla, and Netscape all have no problem with it.

And we have been told by senior Google people that this type of code architecture will probably lead to higher rankings for your site. So it seems to me that if you double your traffic volume, but 1% of your visitors have trouble with the layout in their browser, you are still way ahead of the game. Get over it and take the doubled traffic!

Scalable Code

So the next big issue is having highly scalable code. Assuming that you have assembled a ton of content, you need to get this content into a database of some sort. We typically use MySQL, but any database you are comfortable with is fine, provided that your hosting company can set it up for you, or you can get it set up yourself on your web server. Of course, it’s critical that the database be filled with unique and interesting content.

Next you need to think about your code as something that dynamically generates and renders static pages. We are talking about a web application here. You can, of course, render the pages offline, and then push them live. Alternatively, you can do this on the fly as the requests from users come into the web server.

This is the most important part of the architecture! You want the maintenance of your site to be focused on developing content and updating the database. The better you get at this, the easier it is to grow massive, content rich, web sites.
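To give a feel for the idea, here is a deliberately tiny sketch of "content in a database, pages rendered from a template". It is not our code base: it uses Python with an in-memory SQLite database as a stand-in for the MySQL and Perl setup mentioned in this article, and the table layout, template, and sample rows are invented for the example.

```python
import sqlite3
from html import escape
from string import Template

# One template for every page; the unique content sits right under BODY.
PAGE = Template("""<html>
<head><title>$title</title></head>
<body>
<div id="main">
<h1>$title</h1>
<p>$body</p>
</div>
</body>
</html>
""")


def render_pages(conn: sqlite3.Connection) -> dict[str, str]:
    """Render every row of the pages table to HTML, keyed by URL path."""
    rows = conn.execute("SELECT path, title, body FROM pages ORDER BY path")
    return {
        path: PAGE.substitute(title=escape(title), body=escape(body))
        for path, title, body in rows
    }


# Hypothetical content; a real site would load this from its own data sets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (path TEXT PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("/widgets/blue", "Blue Widgets", "Everything about blue widgets."),
        ("/widgets/red", "Red Widgets", "Everything about red widgets."),
    ],
)

pages = render_pages(conn)
print(sorted(pages))
```

Adding a page then means adding a row, not writing HTML. The rendered strings could be written to disk and pushed live, or produced on the fly per request, which are the two options described above.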

Another key element is to pick a programming language that you are good at (or can get good at) that makes these sorts of manipulations easy (well, easier). Regardless of what you pick, you are going to be using the language for things that it may not have been intended for. We use Perl here, but there are certainly other valid choices.

It’s hard to provide more details, because those additional details are dependent on the unique content you have pulled together and the desired presentation. The concept we have outlined above is not easy. We invested many man-years in developing a flexible, yet industrial strength, code base that could meet our needs.

But once you have it, it becomes an engine that can be used for launching large sites as quickly as you can manage to assemble unique data sets. Unfortunately, building these large databases of content is not easy either. But we never said it would be easy. But here is what it is – it’s reliable and it works. That’s a good thing to know if you are interested in making millions of dollars.

Next up

  1. How to get links
  2. How to monitor results, and what to do about it

Already Published Articles in the Series

  1. Picking a Market and Content Strategy
  2. Using PPC to Enhance your Organic Traffic Strategy
  3. Site Hierarchy and Keyword Selection
  4. Content Development
