What follows is an interview with Adam Lasnik, who is an SEO Strategist at Google. Adam has become extremely well known in the community as a new voice for communications from Google to the webmaster community. Here is his bio:
Before there was a public Internet, Adam was e-mailing. Before there was Netscape or Internet Explorer, he was surfing the Web. He’s written comprehensive search engine optimization reports, managed sponsored ad campaigns for Fortune 500 companies, and provided broad communications consulting to successful startup companies. Adam earned MBA and law degrees — focused on Global Electronic Communications and Commerce issues — and then moved to Germany to serve as an entrepreneurial consultant to a multinational IT company.
Grateful for the international experience but fascinated by the burgeoning American dot.com scene, he hopped over to San Francisco and joined the high tech PR firm Niehaus Ryan Wong as an Interactive Strategist, helping clients understand and leverage the power of online communities. When the dot.com boom turned to bust, Adam spent the next years broadening his online communications and advertising chops with a mix of small and large companies. In 2006, Adam became Google’s first Search Evangelist, dedicated to building stronger relationships between Google and Webmasters.
Eric Enge: One thing that was interesting recently is that Stefanie from Google Dublin posted about spam reporting. And then Matt sent out a follow-up post more recently about how to use that form to report paid links.
It looks like this is a practice that Google is actively encouraging now. How will this information be used to affect search quality?
Adam Lasnik: It’s really being used as an adjunct. As you can imagine, throughout the years we’ve developed and refined our algorithms that help us to improve search quality and maintain good search quality. But sometimes we still hear from people: “I am really annoyed with some of the things that I see in your index. Overall it’s great, but I see this piece of spam in the results, and it’s just one piece, but it drives me nuts.” We can’t catch everything with our algorithms, although we continue to improve them. And so we wanted to extend the ways we are able to get additional input from people.
We’ve had our spam report for a while, and we also brought a new version internally into our webmaster tools to allow it to be authenticated. But Matt’s recent post, along with Stefanie’s, was definitely driven in part by the comments we’ve gotten from the recent conferences we have attended. As for your question, we use the data we get from these reports to both validate and extend our algorithms and help us better understand how to tune these algorithms. So, we look at this as a way to find things that we might have missed. We look at the reports, and say “this is some substantive spam, how did we miss that and how can we do a better job of catching that during the next go-around?”
What typically happens is that we go through those reports on a periodic basis, and engineers from different areas of search quality will sit down and try to figure out what they mean. One of the most important points to make about this is that it is used to augment our data, and it’s not used to specifically take actions against sites. This can be both reassuring and frustrating for people. It’s frustrating in that sometimes people will let us know about actual spam sites or pages that are violating our guidelines, and they will then wonder why the next day or the next week they still see it there. Again, that’s because this is used for periodic testing and review. But, on the reassuring side, we’ve had some people voice concerns: “What happens if my competitor decides to report my pages, even though they are completely following the webmaster guidelines? Is some action going to be taken?” The answer to that is definitively not, because this does not lead directly into an action pipeline, but rather it is used only for review. It is not going to directly or adversely affect any pages or sites that are following the webmaster guidelines.
Eric Enge: It seems to me that one of the more challenging aspects of all of this is that people have gotten really good at buying links that show no indication that they are purchased.
Adam Lasnik: Yes and no, actually. One of the things I think Matt has commented about on his blog is what we jokingly refer to as famous last words, which is “well, I have come up with a way to buy links that is completely undetectable”.
As people have pointed out, Google buys advertising, and a lot of other great sites engage in both the buying and selling of advertising. There is no problem with that whatsoever. The problem is that we’ve seen quite a bit of buying and selling for the very clear purpose of transferring PageRank. Sometimes we see people out there saying “hey, I’ve got a PR8 site, and this will give you some great Google boost, and I am selling it for just three hundred a month”. Well, that’s blunt, and that’s clearly in violation of the guidance “do not engage in linking schemes that are not permitted within the webmaster guidelines”.
Also, taking a step back, our goal is not to catch one hundred percent of paid links. It’s to try to address the egregious behavior of buying and selling links that focus on the passing of PageRank. That type of behavior is a lot more readily identifiable than I think people give us credit for.
Eric Enge: What about the whole Nofollow business that people are upset about? Is Nofollow something that you still want people to use, and in what situations?
Eric Enge: If someone buys a link on a highly relevant site and it’s arguably for purposes of getting traffic; is that necessarily a bad thing?
Adam Lasnik: That’s one of those things where typically you know it when you see it. As I mentioned, our interest isn’t in finding and taking care of a hundred percent of links that may or may not pass PageRank. But, as you point out, relevance is definitely important and useful, and if you previously bought or sold a link without Nofollow, this is not the end of the world. We are looking for larger and more significant patterns.
I know that Matt is going to be posting a follow-up note on his blog. And I wouldn’t be surprised if you also see further clarifications at conferences, or perhaps on the Official Google Webmaster Central blog. As long as there are questions and concerns about these types of issues, we will continue to talk about them and clarify the issues.
Eric Enge: Great. I am going to shift into the second major topic for today, that of duplicate content. Can you just start us with an overview of how Google addresses duplicate content?
Adam Lasnik: The most important thing to realize about duplicate content is that duplicate content in and of itself is not a horrible thing. We realize that there are many cases in which content is duplicated inadvertently, or for other valid reasons. In the vast majority of cases duplicate content is something that is done innocently. It is actually a rare occurrence that we see people engaging in practices that create duplicate content for the express purpose of manipulating their rankings. The real problem with duplicate content is that it creates the conundrum “which page do we show in our index?” And, unfortunately, sometimes that ends up not being the page that the webmaster wants us to show. One simple example is that many different sites on the web have a regular article page, and also a printer-friendly page. If they don’t robots.txt out that printer page, there is a significant chance that people will link to it, and we may end up deciding that page is the more relevant page to return for a particular query.
With duplicate content webmasters have great opportunities to help us know which page is the version they want us to show. There are other situations in which duplicate content is occasionally problematic. For instance, some CMS software, or blog software, will create many different versions of the same content. Maybe there is a weekly version, a weekly archive, and a monthly archive. And, for many bloggers, the weekly could be equivalent to the monthly, and so we will look at that, and we won’t know which one to show. Along these lines though, because this tends to be inadvertent, we are not looking to penalize in these situations. Sometimes we also see duplicate content across domains, and in some cases this is egregiously done, i.e. someone will have the same content on thirty-seven different domains, and that obviously can look a little fishy.
But, more commonly, we will simply see the UK version having the same text as the dotcom version. Now, we will endeavor to show the UK page to people who are browsing from the UK, and the US page to people who are browsing in and near locations in the United States. That is another example of duplicate content and of the way in which we try to deal with it. So, my core suggestion to webmasters would be to use Noindex and robots.txt to help us know what pages you’d prefer to not have indexed. And, beyond that, frankly I wouldn’t worry too much, although I am betting you may have some scenarios in mind that you want to ask me about.
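The robots.txt suggestion above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a site keeps its printer-friendly pages under a hypothetical /print/ directory on www.example.com; neither the directory name nor the URLs come from the interview. It uses the standard library's robots.txt parser to confirm that the article page stays crawlable while the printer copy does not. For the Noindex half of the suggestion, the usual mechanism is a robots meta tag in the page itself, which this sketch does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: /print/ is an assumed
# location for printer-friendly duplicates, not a path from the interview.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URLs: the regular article and its printer-friendly copy.
article = "https://www.example.com/articles/duplicate-content"
printer = "https://www.example.com/print/duplicate-content"

if __name__ == "__main__":
    print(article, is_crawlable(ROBOTS_TXT, article))
    print(printer, is_crawlable(ROBOTS_TXT, printer))
```

Checking the rules this way before deploying them is one way to make sure that only the duplicate version is blocked, and not the page the webmaster actually wants shown.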
Eric Enge: It seems to me that one of the costs of duplicate content, and you can tell me if I am wrong, is that of a site’s crawl budget. For example, if the Googlebot comes to a site, it may have decided to crawl a thousand pages on the site. The bot has a thousand pages in mind, and if a hundred of those are duplicates, then essentially that part of the crawl budget is being wasted on pages that will never rank because they are duplicates.
Adam Lasnik: That’s actually a very good point. I haven’t seen that very frequently, but I have seen that type of situation at times. Often times that’s a specific issue faced by dynamic websites, where the Googlebot gets caught in a loop, or we will get caught as you’ve suggested, looking at a whole lot of pages that are either near-identical or identical. And, you are right that we do have a particular limit per site on how much we will crawl them each time.
We do that largely because we know that webmasters have limited bandwidth resources. So, this can be a particular concern for those webmasters who may have used Google Webmaster Central to tell us to crawl them more slowly, in which case after we see a certain number of pages we will have to move on. What I think is a larger problem is the area in which the content on one webmaster’s site may be identical to that on many other sites.
These are sometimes what we call thin affiliates. We also see this in the context of syndicated articles, where we hear from the original author of the article. The author then says “hey, I wrote this and every site that lists this article is showing up except for mine”. That’s another area of duplicate content that can be of concern, but oftentimes it can also be effectively addressed.
Eric Enge: Yes, and I am going to get into that more in a minute. I have one more question on the hidden costs of duplicate content. It also seems to me that if you have a thousand-page site, and there are a hundred pages which are duplicates, you are also wasting some of your PageRank on those duplicate pages that, once again, will never rank.
Adam Lasnik: Well, I think there might be some cases in which there could be some consequences in the context of PageRank. But I would say that would be, comparatively, a very minimal concern. I do know that webmasters have oftentimes spent quite a bit of time thinking about where PageRank goes: how it flows within their site and from other sites to theirs. In general, that is not something that I would suggest the webmaster should be particularly concerned about. The first issue you mentioned, that of a limited crawl bandwidth per domain, is I think a more important and more significant issue. So is the issue I mentioned before of not showing the optimal page. But PageRank flow should be a relatively minimal concern in this context.
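To put a rough number on the crawl budget concern discussed above, a webmaster could group a site's own pages by a hash of their text and count the redundant copies. The sketch below is only an illustration under stated assumptions: the function names and sample pages are hypothetical, it catches only exact duplicates after whitespace and case normalization, and it says nothing about how the Googlebot itself detects identical or near-identical pages.

```python
import hashlib
from collections import defaultdict


def duplicate_groups(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose text is identical after simple normalization.

    `pages` maps a URL to its extracted text. Only groups with more
    than one URL are returned.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        # Collapse runs of whitespace and ignore case before hashing.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


def wasted_fetches(pages: dict[str, str]) -> int:
    """Count pages that duplicate another page (all but one per group)."""
    return sum(len(group) - 1 for group in duplicate_groups(pages))


# Hypothetical sample data: an article, its printer copy, and another page.
sample_pages = {
    "/articles/a": "Hello   world",
    "/print/a": "hello world",
    "/articles/b": "Something else entirely",
}

if __name__ == "__main__":
    print(duplicate_groups(sample_pages))
    print(wasted_fetches(sample_pages))
```

In Eric's example of a thousand crawled pages with a hundred duplicates, a count like this would suggest that roughly a tenth of the fetches went to pages that add nothing new.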
Eric Enge: Is there a duplicate content scenario which could lead to larger penalties, such as the one you referred to before, where someone has duplicated things from sites all over the web? If it’s at an egregious level, can you actually get a much larger penalty?
Adam Lasnik: That’s a good question. I will reiterate my initial comment, that in the context of duplicate content, penalties tend to be relatively rare. In the majority of cases it is innocent and unintentional. But, in cases where it’s very extreme, there can be penalties applied. It comes back to what I mentioned earlier which is whether or not the duplicate content passes the smell test. It’s very much related to issues of quality as well. If the degree of content duplication is such that it impairs the user’s experience, it can indicate a site that is generally of low quality. In that case that site probably isn’t going to do very well in our index. Looking at that particular case, the duplicate content is more of a symptom than a cause of any low ranking. It is a symptom of low quality, or it’s a symptom of overall poor user experience.
Eric Enge: Another issue relating to duplicate content is that RSS feeds get indexed by default. I understand that there is a mechanism now in RSS feeds to Noindex your feed.
Adam Lasnik: That’s interesting. I don’t think that I am familiar with that; that’s not to say that it’s not there or it’s not a standard, but it’s not something that I have explored.
Eric Enge: I have heard actually that Google supports it, but perhaps that is not true. Regardless, the goal is to reduce unintentional duplicate content much like with the printer page example you gave before.
Adam Lasnik: I will be happy to check offline with some of my colleagues to see if they are familiar with the Noindex directive within RSS. But an equally efficient way of going about that would be to put your RSS feed within a directory that is itself not crawled because of robots.txt.
Eric Enge: So, if you make the feed accessible through a directory which robots.txt indicates should not be crawled, as you suggested, you might be set.
Adam Lasnik: Yes. It’s my understanding that this would also work for Google Blog Search. But let me do some checking on that. Also, in my time at Google, I have not seen this to be an issue, where RSS content has negatively affected sites in the area of duplication, largely because it is rare that we actually list RSS feeds within our core index. I am fairly sure this wouldn’t really rank as a significant concern with regard to duplicate content.
Eric Enge: Okay, good. But, if you have a good site, and you discover that in your implementation that nearly every page is unintentionally duplicated, and, maybe you have printer pages too, so unintentionally you have created massive duplicate content. That could create poor site quality signals.
Adam Lasnik: The first question you would ask is whether it significantly impacts the user experience. In most cases I would say it probably does not. And so, from there, it wouldn’t be in our interest to devalue that site or to trust it less.
Now, it would probably make sense for webmasters in these situations, upon discovering this type of mass duplication, to do two things: one, use robots.txt to prevent crawling of the pages they really don’t want indexed, and two, if they are seeing a lot of identical or practically identical pages, but there really is only one canonical page, this is a great time for 301 redirects.
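The 301 redirect advice above can be sketched in Python as a small helper that maps duplicate URL variants to one canonical URL. Everything specific here is an assumption made for illustration: the query parameters treated as duplicate-producing (print, sessionid), the index.html convention, and the example.com URLs are hypothetical, and a real site would apply the same idea in its web server or application framework.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, hypothetical query parameters that only create duplicate views.
DUPLICATE_PARAMS = {"print", "sessionid"}


def canonical_url(url: str) -> str:
    """Return the single canonical form of `url` under the assumed rules."""
    parts = urlsplit(url)
    # Drop parameters that do not change the content of the page.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in DUPLICATE_PARAMS
    ]
    path = parts.path
    # Treat /dir/index.html and /dir/ as the same page.
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # The fragment is dropped; the host name is lowercased.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, urlencode(kept), "")
    )


def redirect_for(url: str) -> Optional[tuple[int, str]]:
    """Return (301, target) if `url` is a duplicate variant, else None."""
    target = canonical_url(url)
    if target == url:
        return None
    return 301, target


if __name__ == "__main__":
    print(redirect_for("https://www.example.com/articles/index.html?print=1"))
    print(redirect_for("https://www.example.com/articles/"))
```

Keeping the canonicalization rules in one function means every duplicate variant is sent, with a permanent redirect, to the same page the webmaster wants shown.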
Eric Enge: Yes indeed. Okay. So, I am going to shift into the third phase unless you have a last thing to say about duplicate content.
Adam Lasnik: Just one quick comment on that: make sure that you check out the blog post I wrote, called “Deftly dealing with duplicate content”.
Eric Enge: Let’s talk about some other things that I think people are confused about. Let’s talk about a scenario with a webpage that is built in some dynamic fashion and maybe there are two thousand lines of code, for what ultimately is a relatively simple page. And, the unique content to that page, the text and links and things that are really specific to that page are buried seventy percent or eighty percent of the way down in the file. Can this kind of thing hurt your rankings?
Adam Lasnik: I have both bad news and good news in this area. The bad news is that every time you create a page that is this crufty, someone up there kills a kitten. There are so many great reasons to decruft your web pages. They will load faster, and in some cases a lot faster. It will likely improve your users’ experience. And, as you can imagine, when the user’s experience improves, it is not inconceivable that you will get more links to your site. With that said, we do not take that into account in our own indexing and ranking, unless the page is so crufty that it is incredibly challenging for the Googlebot to follow. And by that I actually mean really bad HTML with lots of accidentally open tags. But I don’t think that’s what you are getting at.
But here is the core reason why we cannot use this in our scoring algorithms currently: there are a ton of very high quality sites, pages and sites from universities, from research institutions, from very well respected ecommerce stores, of which I won’t name any, that have really crufty sites, and sites that won’t validate. On some of these you can view the source and cry. And, because this is quality content, we really can’t use that as an effective signal in search quality. So, you can quote me as saying I would be thrilled, it would make my day, if people would decruft their sites, but it’s not going to directly affect their Google ranking.
Adam Lasnik: Again, I want to make sure I use the phrase “no direct effect”, because if you make your users happier, you are going to have more return visits and you are probably going to get more links, even if your site loads only ten percent faster and only three percent of your users are happier. If that results in three percent more quality links, or three percent more visits, that’s great. It’s worthwhile for the user experience, and because more sites will link to you, it can also improve your Google ranking and indexing.
Eric Enge: What about sites that have wild shifts in traffic, where they pop in and out of the index?
Adam Lasnik: As algorithms get adjusted, most sites are not affected. There is not a huge shift for most sites, but then some sites happen to be right on that border. I sympathize and I understand that it can be painful for those sites that are on a particular threshold on one or more algorithms.
Eric Enge: Fair enough. So, this popping in and out doesn’t relate specifically to one type of problem; it’s anything that would put your site on the edge of some Google quality signal factor, and as the algorithm gets tweaked on a regular basis, sometimes you are in, and sometimes you are out.
Adam Lasnik: Yes
Eric Enge: Well, great. I very much appreciate it, it was a good conversation.
Adam Lasnik: Great, thanks!