29 Tidbits from my Interview of Matt Cutts

It is always a pleasure when I get a chance to sit down with Matt Cutts. Google’s Webspam chief is always willing to share what he can for the benefit of webmasters and publishers. In this interview we focused on discussing crawling and indexation in detail.

Starting with this interview, I have also decided to provide the interview series with a bit of a new look. I am going to continue to publish the full transcript of interviews in the STC Articles Feed and on the articles page on our site, but I am going to use the related blog posts as a way of highlighting the most interesting points from the interview (for those of you who want the abridged version).

One of the more interesting points was their focus on seeing all the web’s content, regardless of whether or not it is duplicate, an unreadable file format, or whatever. The crawling and indexing team wants to see it all. You can control some of how they deal with it, but they still want to see it. Another interesting point was that listing a page in robots.txt does not necessarily save you anything in terms of “crawl budget”. (But wait there’s more!)

What follows are some of the more interesting statements that Matt made in the interview. I add my own comments to the end of each point.

  1. Matt Cutts: “there isn’t really any such thing as an indexation cap”
    My Comment: Never thought there was one, but it’s always good to confirm.
  2. Matt Cutts: “the number of pages that we crawl is roughly proportional to your PageRank”
    My Comment: Most experienced SEO professionals know this, but it is a good reminder of how the original PageRank defined in the Brin-Page thesis still has a big influence on the world of SEO.
  3. Matt Cutts: “you can run into limits on how hard we will crawl your site. If we can only take two pages from a site at any given time, and we are only crawling over a certain period of time, that can then set some sort of upper bound on how many pages we are able to fetch from that host”
    My Comment: This will likely be a factor for people on shared (or under-powered) servers.
  4. Matt Cutts: “Imagine we crawl three pages from a site, and then we discover that the two other pages were duplicates of the third page. We’ll drop two out of the three pages and keep only one, and that’s why it looks like it has less good content”
    My Comment: Confirmation of one of the costs of duplicate content.
  5. Matt Cutts: “One idea is that if you have a certain amount of PageRank, we are only willing to crawl so much from that site. But some of those pages might get discarded, which would sort of be a waste”
    My Comment: More confirmation
  6. Eric Enge: “When you link from one page to a duplicate page, you are squandering some of your PageRank, correct?”
    Matt Cutts: “It can work out that way”
    My Comment: Yes, duplicate content can mess up your PageRank!
  7. Matt Cutts: “If you link to three pages that are duplicates, a search engine might be able to realize that those three pages are duplicates and transfer the incoming link juice to those merged pages”
    My Comment: So Google does try to pass all the PageRank (and other link signals) to the page it believes to be canonical.
  8. Matt Cutts: re: affiliate programs: “Duplicate content can happen. If you are operating something like a co-brand, where the only difference in the pages is a logo, then that’s the sort of thing that users look at as essentially the same page. Search engines are typically pretty good about trying to merge those sorts of things together, but other scenarios certainly can cause duplicate content issues”

    and

    Matt Cutts: re: 301 redirect of affiliate links: “People can do that”, but then “we usually would not count those as an endorsement”
    My Comment: Google will take links it recognizes as affiliate links and not allow them to pass juice.

  9. Matt Cutts: re: link juice loss in the case of a domain change: “I can certainly see how there could be some loss of PageRank. I am not 100 percent sure whether the crawling and indexing team has implemented that sort of natural PageRank decay”
    My Comment: In a follow on email, Matt confirmed that this is in fact the case. There is some loss of PR through a 301.
  10. Matt Cutts: No HTTP status code during redirect: “We would index it under the original URL’s location”
    My Comment: No surprise!
  11. Matt Cutts: re use of rel=canonical: “The pages you combine don’t have to be complete duplicates, but they really should be conceptual duplicates of the same product, or things that are closely related”
    My Comment: Consistent with prior Google communication
  12. Matt Cutts: “It’s totally fine for a page to link to itself with rel=canonical, and it’s also totally fine, at least with Google, to have rel=canonical on every page on your site”
    My Comment: Interesting way to protect your site from unintentionally creating dupe pages. Just be careful with how you implement something like this.
  13. Matt Cutts: “the crawling and indexing team wants to reserve the ultimate right to determine if the site owner is accidentally shooting themselves in the foot and not listen to the rel=canonical tag”
    My Comment: The canonical tag is a “hint” not a “directive”
  14. Matt Cutts: re using robots.txt to block crawling of KML files: “Typically, I wouldn’t recommend that. The best advice coming from the crawler and indexing team right now is to let Google crawl the pages on a site that you care about, and we will try to de-duplicate them. You can try to fix that in advance with good site architecture or 301s, but if you are trying to block something out from robots.txt, often times we’ll still see that URL and keep a reference to it in our index. So it doesn’t necessarily save your crawl budget”
    My Comment: One of the more important points of the interview: listing a page in robots.txt does NOT necessarily save you crawl budget.
  15. Matt Cutts: “most web servers end up doing almost as much work to figure out whether a page has changed or not when you do a HEAD request. In our tests, we found it’s actually more efficient to go ahead and do a GET almost all the time, rather than running a HEAD against a particular page. There are some things that we will run a HEAD for. For example, our image crawl may use HEAD requests because images might be much, much larger in content than web pages”
    My Comment: Interesting point regarding the image crawler.
  16. Matt Cutts: “We still use things like If-Modified-Since, where the web server can tell us if the page has changed or not”
  17. Matt Cutts: re faceted navigation: “You could imagine trying rel=canonical on those faceted navigation pages to pull you back to the standard way of going down through faceted navigation”
    My Comment: Should conserve PageRank (and other link related signals), but does not help with crawl budget. Net-net: sites with low PageRank cannot afford to implement faceted navigation because the crawler won’t crawl all of your pages.
  18. Matt Cutts: “If there are a large number of pages that we consider low value, then we might not crawl quite as many pages from that site, but that is independent of rel=canonical”
    My Comment: Lots of thin content pages CAN kill you.
  19. Eric Enge: “It does sound like there is a remaining downside here, that the crawler is going to spend a lot of its time on these pages that aren’t intended for indexing”.
    Matt Cutts: “Yes, that’s true. … You really want to have most of your pages have actual products with lots of text on them.”
    My Comment: Key point is the emphasis on lots of text. I would tweak that a bit to “lots of unique text”.
  20. Matt Cutts: “we said that PageRank Sculpting was not the best use of your time because that time could be better spent on getting more links to and creating better content on your site”
  21. Matt Cutts: more on PR sculpting: “Site architecture, how you make links and structure appear on a page in a way to get the most people to the products that you want them to see, is really a better way to approach it than trying to do individual sculpting of PageRank on links”
    My Comment: Google really does not want you to sculpt your site.
  22. Matt Cutts: “You can distribute that PageRank very carefully between related products, and use related links straight to your product pages rather than into your navigation. I think there are ways to do that without necessarily going towards trying to sculpt PageRank”
    My Comment: Still the best way to sculpt your site – with your navigation / information architecture.
  23. Matt Cutts: on iFrame or JS sculpting: “I am not sure that it would be viewed as a spammy activity, but the original changes to NoFollow to make PageRank Sculpting less effective are at least partly motivated because the search quality people involved wanted to see the same or similar linkage for users as for search engines”
    My Comment: An important insight into the crawling and indexing team’s mindset. Their view is that they want to see every page on the web, and they will sort it out.
  24. Matt Cutts: “I could imagine down the road if iFrames or weird JavaScript got to be so pervasive that it would affect the search quality experience, we might make changes on how PageRank would flow through those types of links”
    My Comment: Even though a particular sculpting technique may work now, there is no guarantee that it will work in the future.
  25. Matt Cutts: “We absolutely do process PDF files” … “users don’t always like being sent to a PDF. If you can make your content in a Web-Native format, such as pure HTML, that’s often a little more useful to users than just a pure PDF file” … “There are, however, some situations in which we can actually run OCR on a PDF”
    My Comment: Matt declined to indicate if links in a PDF page will pass PageRank. My guess is that they do, but they may not be as effective as HTML links.
  26. Matt Cutts: “For a while, we were scanning within JavaScript, and we were looking for links. Google has gotten smarter about JavaScript and can execute some JavaScript. I wouldn’t say that we execute all JavaScript, so there are some conditions in which we don’t execute JavaScript. Certainly there are some common, well-known JavaScript things like Google Analytics, which you wouldn’t even want to execute because you wouldn’t want to try to generate phantom visits from Googlebot into your Google Analytics”.

    and

    Matt Cutts: “We do have the ability to execute a large fraction of JavaScript when we need or want to. One thing to bear in mind if you are advertising via JavaScript is that you can use NoFollow on JavaScript links”
    My Comment: You can expect that their capacity to execute JavaScript will increase over time.

  27. Matt Cutts: “we don’t want advertisements to affect search engine rankings”
    My Comment: Nothing new here. This is a policy that will never change.
  28. Matt Cutts: “might put out a call for people to report more about link spam in the coming months”
  29. Matt Cutts: “We do a lot of stuff to try to detect ads and make sure that they don’t unduly affect search engines as we are processing them”
    My Comment: Also not new. Google is going to keep investing in this area.
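
For readers who have not worked with the If-Modified-Since header mentioned in point 16, here is a minimal Python sketch of the decision a web server makes when it receives one. It is only an illustration: the function name `conditional_get_status` and the sample date are hypothetical, and real web servers (and Googlebot) handle many more cases than this.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get_status(last_modified, if_modified_since=None):
    """Return 304 if the page is unchanged since the client's copy, else 200.

    last_modified must be a timezone-aware datetime.
    """
    if if_modified_since is None:
        return 200  # plain GET: always send the full page
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: fall back to a full response
    if client_time.tzinfo is None:
        # Some date strings parse without a zone; treat them as UTC.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304  # Not Modified: no body needs to be sent
    return 200


# Hypothetical example: the page last changed at noon UTC on March 1, 2010.
page_changed = datetime(2010, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(page_changed, usegmt=True)
print(conditional_get_status(page_changed, header))
```

A 304 response carries no body, which is where the bandwidth saving comes from when the page has not changed.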

So if you got this far, you must be really interested in Matt’s thoughts on search and webspam. Check out the rest of the interview for more!

Latest Interview: Matt Cutts

During my recent trip to SMX Advanced, I sat down with Google’s Matt Cutts to discuss link building. In the discussion, we covered a wide variety of white hat link building techniques. What makes this interesting is the insight it provides on what Matt considers to be White Hat.

Check it out, and then comment below if you want to discuss it.

Discussion with Matt Cutts

During my recent trip out to California I had the opportunity to sit down and speak with Google’s head of the Webspam team, Matt Cutts. It was enjoyable to chat with Matt, and you can see a transcript of the Matt Cutts interview here.

The major topics we covered were:

  1. Google’s tracking of Javascript encoded or redirected links
  2. NoFollow, NoIndex, and Robots.txt
  3. Hidden text
  4. Signals that Google can use to rate site quality, other than links

Check out the interview, and if you would like to comment on the post, you can do so here.

Using NoFollow to Manage PageRank flow.

Recently, in a conversation that Matt Cutts had with Rand Fishkin, Matt confirmed that Google does not see the use of NoFollow on your web sites as a spam tactic. Here are Matt’s exact words:

The nofollow attribute is just a mechanism that gives webmasters the ability to modify PageRank flow at link-level granularity. Plenty of other mechanisms would also work (e.g. a link through a page that is robot.txt’ed out), but nofollow on individual links is simpler for some folks to use. There’s no stigma to using nofollow, even on your own internal links

NoFollow in the Footer Nav

This raises some interesting possibilities for using this as a tool to concentrate PageRank in the places where you want to concentrate it. To see what we can do with this, let’s look at the SEOmoz blog’s footer navigation for an example:

SEOmoz Footer Nav

This is a fairly common looking footer. Note how the “About”, “Our Services”, “Our Clients”, and “Contact” links are in the footer nav, a design element that shows up on every page of the site. When you link to a page from every page of your site, the search engine is likely to think that you are saying it’s one of your most important pages.

Clearly, from a business perspective, the “Contact” page is one of the most important pages on the site. However, there is no reason to expect that it will rank highly for important search terms, no matter how much link juice you give it. You may, or may not, want the page to be in the index, but you don’t need to spend tons of PageRank on pages that will never rank.

A good solution for this is to use the NoFollow attribute on these four links. Note that you do not want to use the NoFollow metatag, because this will prevent the entire page from passing any link juice to any other page. This is not your goal.

In theory, this should signal Google that these pages should not be getting any link juice from the other pages of the site. If you want the pages to still be in the index, take one page, such as the home page, and do not apply the NoFollow attribute in the links to these pages from the home page. As a result, the search engines will still see the pages.
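
To make the idea concrete, here is a rough Python sketch of a footer template helper that adds the NoFollow attribute to the four footer links on every page except the home page. The link paths and the `render_footer` name are hypothetical, and this is only one of many ways a site might implement it.

```python
from html import escape

# Hypothetical footer links, modeled on the four named in the text above.
FOOTER_LINKS = [
    ("About", "/about"),
    ("Our Services", "/services"),
    ("Our Clients", "/clients"),
    ("Contact", "/contact"),
]


def render_footer(current_path, home_path="/"):
    """Render the footer links, adding rel="nofollow" on every page but home."""
    nofollow = current_path != home_path
    parts = []
    for label, href in FOOTER_LINKS:
        # The attribute goes on the individual link, not in a page-level metatag.
        rel = ' rel="nofollow"' if nofollow else ""
        parts.append(f'<a href="{escape(href, quote=True)}"{rel}>{escape(label)}</a>')
    return " | ".join(parts)


print(render_footer("/"))      # home page: plain links, so the pages can still be found
print(render_footer("/blog"))  # any other page: all four links carry rel="nofollow"
```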

NoFollow in the Main Nav

Another application of NoFollow comes in when you are dealing with sites that cross link between product categories. Let’s look at an example of this scenario:

Digital Camera HQ Nav

In this example using the Digital Camera HQ main navigation menu, you could imagine that the Price Range pages change a lot, and are not likely to rank highly in the engines no matter what you do. In addition, the cameras listed under Most Popular are key pages that you want to pass the most PageRank to.

Assuming that this is true, NoFollowing the links to the Price Range pages would be a smart idea. As a result, you would stop spending PageRank on those pages, and have more to allocate to the other pages in the main nav, such as the Most Popular, and the Camera Brand pages.

As before, if you still want the Price Range pages in the index, just not with so much link juice, then pick one page and link from it to the Price Range pages without the NoFollow attribute. The home page is once again a great place to do this from.

Summary

Based on Matt’s statements to Rand, it seems like these strategies should work for your site. As with all things of this type in the SEO world, there is no real guarantee that this will help you, but, intuitively, it makes sense. In addition, given the care that Matt and other Googlers must take in their public statements, it seems likely that there is little risk in trying it out.

Matt Cutts and Tim Mayer – tidbits from SMX

While I was out at SMX Advanced, I sat through the penalty box summit. As always, Tim Mayer and Matt Cutts had a few interesting things to say.

Tim Mayer started by noting that Yahoo! remains committed to search. Evidently, there was a rumor floating around in the industry that Yahoo! was going to reduce their investment in search. I had not heard the rumor myself, but it’s not true.

Also of interest was that Yahoo! has now added to Site Explorer a way to disclaim an inbound link. I think this is really cool. It may not be something that most webmasters need, but there are definitely times when it would be useful. For example, if you ever bought a link from the Washington Times, who used to sell text links that people bought for SEO purposes, you probably discontinued it once you learned that they don’t count any more.

However, you would also find that the Washington Times never takes pages down, so the links are persistent. Given that this would be a possible black mark for your site from an SEO perspective, you would want them removed. But the Washington Times is not going to remove them for you, they are a newspaper with other things to do. So disclaiming that old purchased link could be a great idea. The good news is that Yahoo! now makes that easy.

Tim also noted that there are legitimate uses for almost every technique. For example, cloaking is considered OK, if you are doing it for delivery of content on a geographic basis (e.g. the Spanish language site to Spanish speaking countries). Ultimately, it’s all about Intent and Extent. The intent with which you do something, and the extent to which you do it. So beware the consequences if a simple examination of your site or its linking pattern will reveal that you obviously tried to disguise something from the engines.

When Matt got up, one of the more interesting things he had to say was that Google was not averse to manual action. His comment was intended to address the perception that Google does everything by algorithm. In fact, Google does have a process by which sites get flagged for manual review. Tie that in to Google’s recent efforts with spam reporting forms, and Google’s statement that they review all spam reports filed using the Google Webmaster Tools authenticated version of the form, and it’s clear that they are willing to act on these reports, if it’s appropriate to do so.

Also of interest was Matt’s comment that Google does send proactive emails to webmasters who violate the Webmaster Guidelines, if they believe the violation is unintentional. In some cases, these emails actually detail the specific problem.

Matt also said that Google does make use of 30 day penalties as a warning. This means that if your site disappears for 30 days, and then magically pops back in, you have been warned. Don’t relax. You do have a problem, and you need to find it and fix it, because the penalty will most likely come back.

The most interesting suggestion made by the audience during this session was that Google Webmaster Tools, and Yahoo! Site Explorer should add functionality to mark sites that are currently being penalized or banned with some sort of readily visible flag. This would help webmasters understand when that drop in rankings is due to a penalty, as opposed to changes in the algorithms.

30 SEO Tips from the Matt Cutts Videos

In this post, we are going to provide an SEO’s summary of all the Matt Cutts videos. This is done in rough chronological order. Note that we are going to itemize only those items that might affect your site implementation in a direct way.

This means that we are going to include little information on Data Center updates, Supplemental Results, and updates from SES, even though these are discussed by Matt in various videos. We are also leaving out advice such as “develop good content”, even though we agree with it heartily. This is just the stuff that affects implementation.

Release Date: 7/30/6
Qualities of a Good Site

1. Make your site fully crawlable. One idea is to use a text only browser (such as the Lynx browser) and make sure you can crawl the entire site.
2. Google decides what description text to display for a search query at query time, and picks the description that best matches the query.
3. If you want to disable the DMOZ description for your site, use the “NoODP” metatag.
4. Stated that they favor the bold tag slightly more than the strong tag. This was recanted in a later video.
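
If you don’t have Lynx handy, a rough stand-in for the check in tip 1 is to list only the links that exist as plain HTML anchors. The Python sketch below does that with the standard library. The `LinkCollector` class, the `plain_html_links` function, and the sample markup are hypothetical; it is not a model of how Lynx or Googlebot actually work, only a quick way to spot navigation that depends on scripts or form controls.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from plain <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def plain_html_links(html):
    """Return the links a text-only visitor could follow on this page."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical navigation: one plain link, one script-only link, one option box.
sample = """
<a href="/products">Products</a>
<span onclick="location.href='/specials'">Specials</span>
<select onchange="location=this.value"><option value="/brands">Brands</option></select>
"""
print(plain_html_links(sample))  # only /products is found
```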

7/30/6
Optimize for Search Engines or Users?

5. Google does not penalize sites for coding errors, as they are too common.

7/30/6
Some SEO Myths

6. OK to operate more than one site, provided that the content is substantively different. This is true, even if there are common code elements, such as Javascript elements, and CSS.
7. Sneaky Javascript redirects are a problem.
8. Launching sites with millions of pages will raise a flag and likely be a problem. Launch more softly, perhaps with thousands of pages at a time. That this is an issue was later confirmed in the Robert Scoble post on Matt’s blog (make sure you read the comments too).

7/31/6
How to Structure a Site

9. It’s OK to acquire a related domain, and simply 301 its existing links to a new site. Emphasis is on related. Don’t try this with unrelated sites. Note an example of this is the Amish GoKarts site.
10. Google takes a hard stance on cloaking. Don’t do it for any reason. This includes remapping pages with too many parameters to simpler URLs. Solve this in your code.
11. In addition, if you want to do A/B Testing, do this on pages that can’t be seen by Google.

7/31/6
Static vs. Dynamic URLs

12. Your URLs should have no more than 2 or 3 parameters on them.
13. Keep your parameters short. Long numbers may be interpreted to be Session IDs.
14. Geo targeting using IP delivery is OK, because you are treating the crawler the same as the user.
15. Official cloaking definition: Delivering different content to the Googlebot than the end user.
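
The two URL parameter tips above are easy to check for across a list of your own URLs. Here is a minimal Python sketch of such a check. The `url_warnings` function is hypothetical, and while the limit of 3 parameters comes from the tip, the digit-length cutoff is purely my assumption, since Matt gave no number for what counts as a “long” value.

```python
from urllib.parse import parse_qsl, urlsplit

MAX_PARAMS = 3     # from the "no more than 2 or 3 parameters" tip
MAX_DIGIT_RUN = 8  # assumed cutoff for a "long number"; the source gives none


def url_warnings(url):
    """Return a list of warnings about the query string of a URL."""
    warnings = []
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    if len(params) > MAX_PARAMS:
        warnings.append(f"{len(params)} parameters (more than {MAX_PARAMS})")
    for name, value in params:
        if value.isdigit() and len(value) > MAX_DIGIT_RUN:
            warnings.append(
                f"'{name}' has a long numeric value that may look like a session ID"
            )
    return warnings


for url in (
    "http://www.example.com/item?id=42&color=red",
    "http://www.example.com/item?a=1&b=2&c=3&d=4",
    "http://www.example.com/item?sid=1234567890123",
):
    print(url, url_warnings(url))
```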

7/31/06
Does WebSpam Use Google Analytics?

16. The Google WebSpam team does not look at any data from Google Analytics.
17. If you run a porn site, and you want to be filtered out so you don’t show up in Safe Searches, the best way to flag this is in the keyword metatags.
18. Putting links in an option box is a bad idea.

8/1/6
Lightning Round

19. Google treats the strong and the bold tags the same.
20. Google treats the em and italics tags the same.
21. Google considers content to be duplicate if it’s an exact copy, or “too similar”.
22. Translated versions of content are not considered duplicate.
23. In the case of Canadian and US sites with minimal differences, they will pick one to show in the results.
24. Google does not weight blogs differently than web sites.
25. .gov and .edu links do not provide an inherent boost. They are weighted the same.

8/5/6
Reinclusion Requests

26. The best way to do a reinclusion request is through Google Webmaster Tools.
27. You can also use Google Webmaster Tools to identify some of your problems, such as sneaky Javascript redirects, Doorway pages, and hidden text.
28. Get clean (fix all your sins) before submitting a Reinclusion request.
29. Include in your Reinclusion request something that reassures Google that you will sin no more.

8/7/6
Google Webmaster Tools

30. You can use the “Preferred Domain” feature of Google Webmaster Tools to indicate whether you prefer that Google represent your site as http://www.yourdomain.com or http://yourdomain.com. It will also pass link credit for links to the non-preferred version of your domain to the preferred version automatically. However, Matt says it takes weeks to take effect and you should leave your canonical 301 redirects in place.
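
For anyone unsure what a canonical redirect between the two versions of a domain looks like in practice, here is a minimal Python sketch of the logic, using a permanent (301) redirect. The `canonical_redirect` function is hypothetical, and I am assuming the www version is the preferred one. Most sites would do this in their web server configuration rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.yourdomain.com"  # assumed preference for this sketch
ALTERNATE_HOSTS = {"yourdomain.com"}


def canonical_redirect(url):
    """Return (301, location) if url is on a non-preferred host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALTERNATE_HOSTS:
        # Keep the scheme, path and query; only the host changes.
        location = urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path or "/", parts.query, "")
        )
        return 301, location
    return None


print(canonical_redirect("http://yourdomain.com/page?x=1"))
print(canonical_redirect("http://www.yourdomain.com/page?x=1"))
```

Note that this sketch ignores any port number in the original URL, which is fine for the common case of a site on the default port.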

Matt also recorded several other videos that cover other aspects of the search space for which we extracted no hard core implementation tips. These are:

  1. 7/31/6: Supplemental Results
  2. 7/31/6: Google Data Centers
  3. 8/7/6: My Tips for SES
  4. 8/23/6: Data Center Comments
  5. 8/28/6: Recap of SES San Jose 2006
  6. 9/6/6: Crawl Dates in the Google Cache
  7. 7/31/6: Google Terminology