It is always a pleasure when I get a chance to sit down with Matt Cutts. Google’s Webspam chief is always willing to share what he can for the benefit of webmasters and publishers. In this interview we focused on discussing crawling and indexation in detail.
Starting with this interview, I have also decided to provide the interview series with a bit of a new look. I am going to continue to publish the full transcript of interviews in the STC Articles Feed and on the articles page on our site, but I am going to use the related blog posts as a way of highlighting the most interesting points from the interview (for those of you who want the abridged version).
One of the more interesting points was their focus on seeing all the web’s content, regardless of whether or not it is duplicate, an unreadable file format, or whatever. The crawling and indexing team wants to see it all. You can control some of how they deal with it, but they still want to see it. Another interesting point was that listing a page in robots.txt does not necessarily save you anything in terms of “crawl budget”. (But wait there’s more!)
What follows are some of the more interesting statements that Matt made in the interview. I add my own comments to the end of each point.
- Matt Cutts: “there isn’t really any such thing as an indexation cap”
My Comment: Never thought there was one, but it’s always good to confirm.
- Matt Cutts: “the number of pages that we crawl is roughly proportional to your PageRank”
My Comment: Most experienced SEO professionals know this, but it is a good reminder how the original PageRank defined in the Brin-Page thesis still has a big influence on the world of SEO.
- Matt Cutts: “you can run into limits on how hard we will crawl your site. If we can only take two pages from a site at any given time, and we are only crawling over a certain period of time, that can then set some sort of upper bound on how many pages we are able to fetch from that host”
My Comment: This will likely be a factor for people on shared (or under-powered) servers.
- Matt Cutts: “Imagine we crawl three pages from a site, and then we discover that the two other pages were duplicates of the third page. We’ll drop two out of the three pages and keep only one, and that’s why it looks like it has less good content”
My Comment: Confirmation of one of the costs of duplicate content.
- Matt Cutts: “One idea is that if you have a certain amount of PageRank, we are only willing to crawl so much from that site. But some of those pages might get discarded, which would sort of be a waste”
My Comment: More confirmation
- Eric Enge: “When you link from one page to a duplicate page, you are squandering some of your PageRank, correct?”
Matt Cutts: “It can work out that way”
My Comment: Yes, duplicate content can mess up your PageRank!
- Matt Cutts: “If you link to three pages that are duplicates, a search engine might be able to realize that those three pages are duplicates and transfer the incoming link juice to those merged pages”
My Comment: So Google does try to pass all the PageRank (and other link signals) to the page it believes to be canonical.
- Matt Cutts: re: affiliate programs: “Duplicate content can happen. If you are operating something like a co-brand, where the only difference in the pages is a logo, then that’s the sort of thing that users look at as essentially the same page. Search engines are typically pretty good about trying to merge those sorts of things together, but other scenarios certainly can cause duplicate content issues”
Matt Cutts: re: 301 redirect of affiliate links: “People can do that”, but then “we usually would not count those as an endorsement”
My Comment: Google will take links it recognizes as affiliate links and not allow them to pass juice.
- Matt Cutts: re: link juice loss in the case of a domain change: “I can certainly see how there could be some loss of PageRank. I am not 100 percent sure whether the crawling and indexing team has implemented that sort of natural PageRank decay”
My Comment: In a follow-up email, Matt confirmed that this is in fact the case. There is some loss of PR through a 301.
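For reference, a domain move of the kind Matt describes is typically implemented with a server-level 301 redirect. Here is a minimal sketch for Apache with mod_rewrite (the domain names are placeholders, not from the interview):

```apache
# .htaccess on the OLD domain (assumes Apache with mod_rewrite enabled)
RewriteEngine On
# Match requests for the old host (with or without www)
RewriteCond %{HTTP_HOST} ^(www\.)?old-domain\.example$ [NC]
# Permanently redirect every path to the same path on the new domain
RewriteRule ^(.*)$ https://www.new-domain.example/$1 [R=301,L]
```

A page-to-page mapping like this (rather than redirecting everything to the new home page) is generally what lets the redirect pass link signals at all.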
- Matt Cutts: No HTTP status code during redirect: “We would index it under the original URL’s location”
My Comment: No surprise!
- Matt Cutts: re use of rel=canonical: “The pages you combine don’t have to be complete duplicates, but they really should be conceptual duplicates of the same product, or things that are closely related”
My Comment: Consistent with prior Google communication
- Matt Cutts: “It’s totally fine for a page to link to itself with rel=canonical, and it’s also totally fine, at least with Google, to have rel=canonical on every page on your site”
My Comment: Interesting way to protect your site from unintentionally creating dupe pages. Just be careful with how you implement something like this.
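As an illustration, a self-referencing canonical tag is just a link element in the page head. The URL below is a placeholder:

```html
<!-- In the <head> of https://www.example.com/product/widget -->
<!-- A self-referencing canonical is harmless, and it protects against
     URL variants (tracking parameters, session IDs) being treated as
     separate pages if they get linked or crawled -->
<link rel="canonical" href="https://www.example.com/product/widget" />
```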
- Matt Cutts: “the crawling and indexing team wants to reserve the ultimate right to determine if the site owner is accidentally shooting themselves in the foot and not listen to the rel=canonical tag”
My Comment: The canonical tag is a “hint” not a “directive”
- Matt Cutts: re using robots.txt to block crawling of KML files: “Typically, I wouldn’t recommend that. The best advice coming from the crawler and indexing team right now is to let Google crawl the pages on a site that you care about, and we will try to de-duplicate them. You can try to fix that in advance with good site architecture or 301s, but if you are trying to block something out from robots.txt, often times we’ll still see that URL and keep a reference to it in our index. So it doesn’t necessarily save your crawl budget”
My Comment: One of the more important points of the interview: listing a page in robots.txt does NOT necessarily save you crawl budget.
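To make the distinction concrete, here is what the robots.txt approach looks like (the path is hypothetical). Note Matt’s point: this blocks crawling of the files, but the URLs themselves can still end up referenced in the index if other pages link to them:

```
# robots.txt — prevents fetching, but NOT a bare reference in the index
User-agent: *
Disallow: /kml/
```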
- Matt Cutts: “most web servers end up doing almost as much work to figure out whether a page has changed or not when you do a HEAD request. In our tests, we found it’s actually more efficient to go ahead and do a GET almost all the time, rather than running a HEAD against a particular page. There are some things that we will run a HEAD for. For example, our image crawl may use HEAD requests because images might be much, much larger in content than web pages”
My Comment: Interesting point regarding the image crawler.
- Matt Cutts: “We still use things like If-Modified-Since, where the web server can tell us if the page has changed or not”
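The conditional-fetch mechanism Matt mentions works like this: the client sends an If-Modified-Since header with its last-fetch date, and the server can answer “304 Not Modified” with an empty body if the page hasn’t changed, saving most of the transfer. A minimal sketch in Python (the URL and timestamp are placeholders):

```python
import urllib.request

# Hypothetical URL and last-crawl timestamp, for illustration only
url = "https://www.example.com/page.html"
last_crawl = "Sat, 01 Aug 2009 12:00:00 GMT"

# Build a conditional GET: the server may reply "304 Not Modified"
# with an empty body instead of re-sending the whole page.
req = urllib.request.Request(url, method="GET")
req.add_header("If-Modified-Since", last_crawl)

# A crawler would then fetch roughly like this:
# try:
#     body = urllib.request.urlopen(req).read()   # page has changed
# except urllib.error.HTTPError as e:
#     if e.code == 304:
#         pass  # unchanged since last_crawl; skip re-processing
```

This also illustrates Matt’s GET-versus-HEAD point: a conditional GET gets you the bandwidth savings of a HEAD when nothing changed, while still returning the full body in one round trip when it did.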
- Matt Cutts: re faceted navigation: “You could imagine trying rel=canonical on those faceted navigation pages to pull you back to the standard way of going down through faceted navigation”
My Comment: Should conserve PageRank (and other link-related signals), but does not help with crawl budget. Net-net: sites with low PageRank may not be able to afford faceted navigation, because the crawler won’t get to all of their pages.
- Matt Cutts: “If there are a large number of pages that we consider low value, then we might not crawl quite as many pages from that site, but that is independent of rel=canonical”
My Comment: Lots of thin content pages CAN kill you.
- Eric Enge: “It does sound like there is a remaining downside here, that the crawler is going to spend a lot of its time on these pages that aren’t intended for indexing.”
Matt Cutts: “Yes, that’s true. … You really want to have most of your pages have actual products with lots of text on them.”
My Comment: Key point is the emphasis on lots of text. I would tweak that a bit to “lots of unique text”.
- Matt Cutts: “we said that PageRank Sculpting was not the best use of your time because that time could be better spent on getting more links to and creating better content on your site”
- Matt Cutts: more on PR sculpting: “Site architecture, how you make links and structure appear on a page in a way to get the most people to the products that you want them to see, is really a better way to approach it than trying to do individual sculpting of PageRank on links”
My Comment: Google really does not want you to sculpt your site.
- Matt Cutts: “You can distribute that PageRank very carefully between related products, and use related links straight to your product pages rather than into your navigation. I think there are ways to do that without necessarily going towards trying to sculpt PageRank”
My Comment: Still the best way to sculpt your site – with your navigation / information architecture.
- Matt Cutts: on iFrame or JS sculpting: “I am not sure that it would be viewed as a spammy activity, but the original changes to NoFollow to make PageRank Sculpting less effective are at least partly motivated because the search quality people involved wanted to see the same or similar linkage for users as for search engines”
My Comment: An important insight into the crawling and indexing team’s mindset. Their view is that they want to see every page on the web, and they will sort it out. And even though a particular sculpting technique may work now, there is no guarantee that it will work in the future.
- Matt Cutts: “We absolutely do process PDF files” … “users don’t always like being sent to a PDF. If you can make your content in a Web-Native format, such as pure HTML, that’s often a little more useful to users than just a pure PDF file” … “There are, however, some situations in which we can actually run OCR on a PDF”
My Comment: Matt declined to indicate if links in a PDF page will pass PageRank. My guess is that they do, but they may not be as effective as HTML links.
- Matt Cutts: “we don’t want advertisements to affect search engine rankings”
My Comment: Nothing new here. This is a policy that will never change.
- Matt Cutts: “might put out a call for people to report more about link spam in the coming months”
- Matt Cutts: “We do a lot of stuff to try to detect ads and make sure that they don’t unduly affect search engines as we are processing them”
My Comment: Also not new. Google is going to keep investing in this area.
So if you got this far, you must be really interested in Matt’s thoughts on search and webspam. Check out the rest of the interview for more!