John Mueller is currently a Webmaster Trends Analyst at Google Zürich. Prior to working at Google he became well known for his active participation in Google Groups and a variety of SEO forums.
Eric Enge: Can you provide me with your definition of cloaking?
John Mueller: The standard definition of cloaking is to show Googlebot something different than you would show your users. So, in a worst case situation, you would show Googlebot a nice family-friendly homepage, and when a user comes to visit that page, they would see something completely different.
Eric Enge: Like porn or casino ads or something of that nature?
John Mueller: Exactly. So if the user was searching for something and finds what he thinks is a good result, he clicks on it, and then there is nothing even related to what he was searching for on that page.
Eric Enge: Right. So that’s clearly an extreme form of cloaking. There are many different levels of cloaking, and I’d like to explore some of those.
Some people, for example, may have a content management system that just insists on appending session IDs or superfluous parameters on the URLs. They may not be superfluous from the CMS’ point of view because they are using the parameters to pull information from a database or something like that. And given the content management systems that they have, it’s actually very difficult and very expensive to fix this problem at its core. So one solution would be to serve the same content to users and to Googlebot, but to modify the URL seen by Googlebot to remove the superfluous parameters and the session IDs.
John Mueller: That’s something that we’ve seen a lot of in the past. We currently have a great new tool that can really help fix that problem without doing any redirects or without really changing much at all, and that’s the rel="canonical" link element. You can place it in the head section of your pages and specify the canonical URL that you would like to have indexed. So you could take away all the session ID parameters or anything else that you don’t need, and just specify the one URL that you want to have indexed.
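As a rough illustration of the link element described here, the following Python sketch strips session parameters from a URL and builds the matching tag. The parameter names (`sessionid`, `sid`, `phpsessid`) and the `example.com` URLs are assumptions for illustration only; a real CMS will use its own names.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; substitute whatever your CMS appends.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def canonical_url(url: str) -> str:
    """Return the URL with session parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_element(url: str) -> str:
    """Build the link element to place in the page's head section."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'
```

Every variant of a page (with or without a session ID) would emit the same element, which is what lets a search engine fold the duplicates into one URL.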
Eric Enge: Right. And that’s something that you announced with the other search engines just a few weeks ago, correct?
John Mueller: Yes, it’s fairly new. Not everyone has implemented it yet, but there are a lot of people who are already using it to clean up this problem. Crawling a website and finding many duplicate versions of the same content with different URL parameters such as session IDs can confuse search engines. Using this link element helps to make it a bit clearer and can help to resolve this problem.
Eric Enge: So you basically implement the canonical tag on various pages and you tell people what the canonical URL is. If, for example, somebody has different sort orders for their products in the e-commerce catalogue (e.g. by price, brand, size, color, …), you can basically point Googlebot back to the canonical version of the URL. It’s supposed to behave much the same way a 301 redirect would, except that it does not actually take the user to the different URL specified. Is that a fair summary?
John Mueller: Yes. It’s not a command that you would give Googlebot, it’s more like a hint that you would give us. One thing we’ve also seen is that people try to use it, but they use it incorrectly. For instance, they specify their homepage as a canonical for the whole site. And if we were to follow that as a 301 redirect, we might completely remove their website. So we have to take that information and determine if it is really a canonical for the other URL, or if the user may be doing something incorrect.
Eric Enge: And of course one way you could do that is by making sure the content on the two pages is identical.
John Mueller: Yes.
Eric Enge: So if you make a mistake and use the canonical tag to send everyone to the home page of your site, presumably the content will differ from the other pages. And, as I understand it, the gold standard solution is to fix the problem at its core and not have to rely on the canonical tag.
John Mueller: If you can move to cookie-based session tracking, then that would really help. But we know it’s not always easy to change to a system like that. There might be a lot of money involved. So at least with this system there is a fairly simple way to fix that problem.
Eric Enge: Right. So it’s the backup plan that should be used if you can’t fix it at its core or if it’s just too expensive to fix it at its core?
John Mueller: Exactly.
Eric Enge: Yes, that makes sense. Now I imagine there are also people out there who served a different URL to Googlebot than to users before the canonical tag existed. Is that problematic?
John Mueller: I would suggest doing that for all new users who come to the site without cookies, instead of just for Googlebot. This way, if a user accesses an old URL that has a session ID, you can just redirect him to the proper canonical. That would treat users and search engines in the same way, and it would still help solve this problem.
Sites that are currently showing prettier URLs to Googlebot should not panic, as long as their intent is genuine and it is properly implemented. But I’d advise against this for sites that are in the process of a redesign or sites that are being newly created. Using rel="canonical" is the current best practice for tackling this problem.
Eric Enge: But if the system is relying on the session IDs, then it’s there for a reason, right?
John Mueller: Yes, but usually most CMSs resort to session IDs if they can’t access a cookie. So if you see that a user doesn’t have a cookie, you can redirect them away from the session ID. And I think the important thing here is that you find a way that you can treat users and search engines the same.
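One way to picture the cookie-less redirect described here is the Python sketch below. It decides whether to redirect based only on the cookie and the URL, never on the User-Agent, so users and search engines get the same treatment. The parameter names and the `redirect_target` helper are hypothetical, not part of any particular CMS.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names used by the CMS.
SESSION_PARAMS = {"sessionid", "sid"}


def redirect_target(url: str, has_session_cookie: bool) -> Optional[str]:
    """Return the clean URL to redirect to, or None to serve the page as-is.

    The User-Agent is deliberately not an input: Googlebot and ordinary
    visitors go through exactly the same logic.
    """
    if has_session_cookie:
        return None  # session is tracked by cookie; nothing to clean up
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in params if k.lower() not in SESSION_PARAMS]
    if len(kept) == len(params):
        return None  # URL carries no session ID
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

A request handler would call this first and issue the redirect whenever the result is not `None`.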
John Mueller: Exactly. So, the key here is that the intent matters, as is generally the case with Google. If the intent is really that the webmaster wants to test the various versions of the same content, then that’s no problem. And if the intent is to show the user something completely different, then that would be on the border. You would have to look at that.
Eric Enge: I mean, you can always take any technique that was created with good intentions and find ways to abuse it. So let’s say somebody is testing out four different versions of a key landing page on their site to see which performs the best for them. Maybe they are changing the logos and moving elements around, they might be changing the messaging a bit to see if one tagline is more effective than another, or they may be changing the call to action.
John Mueller: If you are doing that with good intent to find the best solution for your users, and you are showing more or less the same content, then I wouldn’t really worry about that.
Eric Enge: Say you have a graphic of some sort, an image file on your site that might be a menu link or a logo. And there are various techniques for showing a search engine’s robot, or any specific user agent, text instead of the graphic. What are your general thoughts in that area?
Eric Enge: So, there are various grades of this, correct? One level is where the text matches up a hundred percent with what is in the image. And there is a notion of substantially similar, and then you could actually add several more grades and have somewhat similar, and then completely different. And, I think you just highlighted an example that’s completely different. So, identical is an easy case, I think you already addressed that. What if something is substantially similar, but is not word-for-word identical?
John Mueller: I would say it depends on the case, but if you are not trying to deceive the search engine crawler or the user, then it’s generally okay, but in general I would be cautious as soon as the content is not identical. So if you have a link that goes to your homepage and it has a graphic of a house, then you wouldn’t have to use “house” as the alt text. You could just say “go to homepage,” or something like that, and it’s fine.
Eric Enge: So again it gets back to the notion of intent that you’ve already raised?
John Mueller: Exactly.
Eric Enge: And, of course, one flavor of this is sIFR, which stands for Scalable Inman Flash Replacement. sIFR uses text input to render what is shown in Flash so it is guaranteed to be identical.
John Mueller: Exactly. Where we start to see problems is when a website has a completely Flash-based interface and a lot of different pages all on the same URL hidden behind it. Then it would be hard to include ten pages of HTML on a single page that match exactly what is written in the Flash file. So you have to find a solution for yourself there; how much really makes sense and how much you might have to cut back and just leave the basics in HTML and keep the bulk of your content in Flash.
Eric Enge: Right. And of course when you get to that scale, you are past what you do with sIFR, which is really intended for putting anti-aliased fonts on your page, which is a more limited technology. But I think once you get into the more complex situations, you can use SWFObject, correct?
John Mueller: Yes, it would be something like that.
Eric Enge: That technology doesn’t guarantee that the alternate version shown in text is identical to what is in Flash.
John Mueller: Exactly.
Eric Enge: So it is open for potential abuse, but I would imagine that the policy again gets back to what you actually do and what your intent is in doing it.
John Mueller: Yes. And there are two other things that also play a role in that. The first factor is that we have started crawling and indexing Flash files. If you have a lot of content in your Flash file, we will try to at least get to that and include it in our search results.
The second is that there are still a lot of devices out there that can’t use Flash. So if you have a website that relies on Flash and you suddenly notice that there are a bunch of mobile users who are trying to use their iPod, iPhone or Android phone to access your website, then you would start seeing problems because they wouldn’t see the Flash content at all. And if the HTML content doesn’t match up with what you are trying to bring across to the user, they will simply leave the site.
Eric Enge: One grade of this problem occurs when you try to implement something in Flash, but you are not going to be doing it with the intent of rendering the same thing that you can easily render in HTML. You are probably using it because you want to create a highly graphical type experience. It is not always the case of course, but certainly one of the things that’s appealing about Flash is that you can create a really attractive visual experience. Say you have a man driving a fast car on the German autobahn, the Flash isn’t going to narrate the course of the drive.
But in your text rendering of what is in the Flash, you would want to describe what is happening. For example, “it’s a nice day and a man gets into his expensive car and heads out onto the Autobahn”. So you are actually implementing text that isn’t in Flash, but the content essentially is.
John Mueller: Yes, that’s generally fine. If the intent is okay and it matches up so you can see that there is a car and a man driving on the autobahn, then that would be fine.
John Mueller: Yes. Think about it from a user-experience point of view: if the user sees the HTML content in the search results and clicks on that page, does that match up with what he would be expecting?
Eric Enge: So what about serving different content based on an IP address to address things like language and national or even regional issues? Just to think of a regional issue, the products that your customer base in Florida buys could be quite different than the products your customer base in Minnesota buys. So you want to serve up the Florida user one set of offerings and the Minnesota user a different set of offerings.
John Mueller: That is something that I see a lot as a European user, because in Switzerland we have four different official languages, and as soon as you start using a website, it automatically tries to pick the language that it thinks is right. These sites are wrong most of the time, and it is something that really bothers me a lot. So I guess I might be a little bit emotional about that.
One thing that I have noticed is that you have to differentiate between whether or not your content is really limited to a specific language or geographic location. For example, you have a casino website that you can show to users in Germany and in France, but you can’t show it to users in the US. That’s kind of an extreme situation, but in a situation like that you would still have to treat Googlebot like any other user that would come from that same location.
So if we crawl your website from the US, and your website recognizes us as an American visitor, then you should show us exactly the content that an American visitor would see. And it would be a little bit problematic if the website started blocking all American users because of legal reasons. So what you would do then is make a public website that everyone can access and then just link to your private website that has been limited to users in a specific region.
So, for example, you would have a general homepage that tells what your website does, gives some information and provides something that search engines can crawl and index. Then when users get to the right location they can click through to your actual content.
Eric Enge: So are you suggesting that if a user accesses that website from Germany, they come to some initial page and then they have to click further to get through to the page they are actually looking for?
John Mueller: Exactly.
Eric Enge: So it is not acceptable to just simply serve them?
John Mueller: Yes, that might cause problems when Googlebot visits. The other problem there is that IP location and language detection is often incorrect. Even at Google, we run into situations where we think an IP address is from Germany, so we show German content. But in reality, the user may be based in France, and it is really hard to get that right. So if you try to do that automatically for the user, you are almost guaranteed to do something wrong at some point.
That leads to the other version of this problem, where users in the wrong location can still access your website. And in a case like that, we would be able to crawl and index the website normally, but I recommend that you include elements on your website that help the user find the version of the website that they really want to use.
The important thing there is that you use different URLs for the different locations or different languages so that we would be able to crawl all of the specific content. So when I go to Amazon.com from Germany, for example, I have a little banner on top that says “Hey, don’t you want to go to Amazon Germany? We are much closer; we have free shipping.” And that way, the search engine would still be able to see all the content, but users would still find their way to the right website.
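The banner approach can be sketched in a few lines of Python. The idea is that the detected country only drives a suggestion; the page at the requested URL is served unchanged, so a crawler coming from any location still sees every version. The `LOCAL_SITES` mapping, the `example` domains, and the `banner_suggestion` helper are all made up for illustration.

```python
from typing import Optional

# Hypothetical mapping of country codes to separate per-country sites.
LOCAL_SITES = {
    "US": "https://www.example.com/",
    "DE": "https://www.example.de/",
    "FR": "https://www.example.fr/",
}


def banner_suggestion(current_site: str, detected_country: Optional[str]) -> Optional[str]:
    """Return the local site to suggest in a banner, or None for no banner.

    This never redirects. Because IP-based detection is often wrong, the
    visitor (or crawler) keeps the page they asked for and merely gets a link.
    """
    if detected_country is None:
        return None
    local_site = LOCAL_SITES.get(detected_country.upper())
    if local_site is None or local_site == current_site:
        return None
    return local_site
```

A visitor detected as German on the `.com` site would get a banner linking to the `.de` site, while a visitor already on the matching site gets nothing extra.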
Eric Enge: So this of course is a little bit different than the scenario where you implement a website at casino.co.de, or .co.uk, or .com, or .co.us, where you really are creating versions that are meant to be indexed in the local version of the search engines?
John Mueller: Exactly, yes.
Eric Enge: So that’s a different scenario that someone could use if they wanted to.
John Mueller: I think the key point is whether or not users are allowed to access the wrong version of the website, or if there is a legal reason why it is blocked completely.
Eric Enge: So if the legal reason isn’t there and it is just that you want the default language that a German user sees, and you are willing to accept the fact that you are right about 90% of the time and you are wrong about 10% of the time, they can click the French link if they are really from France?
John Mueller: Yes. I think that the important part, especially with languages, is that you really provide separate URLs so that Google can crawl all language versions. And this way you also don’t have to do language detection on your site. The user will search for something using a German or French-speaking Google, and we will show the French or German-speaking pages appropriately.
Eric Enge: So they end up in the right place through that mechanism?
John Mueller: Yes. And you don’t even have to do anything on your side. Maybe if you have a homepage you could show a little drop-down and let the user choose. Or you could have it pre-populated with the determined location by default, but you are still giving the user a choice between the different language versions. You give the search engine a choice and we will try to send the users directly to the right version.
Eric Enge: What are your thoughts on serving up different content based on cookies, such as explicit or inferred user preferences?
John Mueller: I think the general idea is also to make sure that you are not trying to do anything deceptive with that. Say, for example, you have a website where you just have general information. If a normal unregistered user comes there and you show that same general information to Googlebot, that is fine, even if a logged-in user finds more information when he accesses the same URL. So if you make sure that it matches up with what a user would see, then that’s generally not a problem.
Eric Enge: And since we are talking about cookies, presumably we are talking about a user who has been at the site before. So if they come back, their expectations may be for somewhat of an enhanced experience based on their interactions.
John Mueller: Exactly. So if you have it set up in a way that logged-in users or users who have preferences get to see more detailed content, then that’s fine in general. But if you have it in a way that users who are logged in see less content or see completely different content, then that would be problematic.
Eric Enge: Right. Can you give us an overview of First Click Free and what its purpose is?
John Mueller: We started First Click Free for Google News so that publishers could provide a way to bring premium content to their users. For example, if you have a subscription based model for your website, you could still include those articles in the Google News search results and a user who goes to those articles would still be able to see them and read that article normally. But as soon as they are trying to access more on your website, they would see a registration banner, for example.
Now, we have extended that to all websites, because we know not everyone can be accepted into Google News; it is kind of a special community. So if you have some kind of subscription or premium content, you can show that to Googlebot and to users who come in through search results. But as soon as something else is accessed on that site, you are free to show a registration banner so that users who are really interested in this content have a way to sign up and actually see it.
Eric Enge: So the idea here is you have subscription-based content and Google wants to make its users aware that that content is there and it exists.
John Mueller: Exactly.
Eric Enge: So the user goes to Google, they see the article, they decide to go read it, the site implementing First Click Free checks the referrer and makes sure it is from Google, in which case they show the full article including all pages of a multi-page article, not just the first page?
John Mueller: Yes.
Eric Enge: And then the user potentially gets the registration banner or a subscription box when they go on to a different article?
John Mueller: Exactly.
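The flow just described (Googlebot and visitors arriving from a Google result see the full article; anyone else gets the registration banner) might look roughly like this in Python. It is only a sketch: the domain list is illustrative and far from exhaustive, the User-Agent check is a naive substring match, and `show_full_article` is a hypothetical helper rather than anything Google provides.

```python
from urllib.parse import urlsplit

# Illustrative, non-exhaustive list of Google search domains (an assumption).
GOOGLE_DOMAINS = ("google.com", "google.de", "google.co.uk")


def came_from_google(referrer: str) -> bool:
    """True if the Referer header points at one of the listed Google domains."""
    host = (urlsplit(referrer).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in GOOGLE_DOMAINS)


def show_full_article(user_agent: str, referrer: str, is_subscriber: bool) -> bool:
    """Decide between the full article (True) and the registration banner (False)."""
    if is_subscriber:
        return True
    if "googlebot" in user_agent.lower():  # naive check, for illustration only
        return True
    return came_from_google(referrer)
```

A click from one article to the next carries the site's own URL as the referrer, so the second page view falls through to the banner.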
Eric Enge: Now, can a user just go back to Google and search on something and try to find that same article somewhere else in the search results?
John Mueller: Theoretically, yes. That would be possible, but we found that most users don’t do that. It is more work that way, and if it is content they are really interested in, they will figure out a way to access it normally. When you like the content, you might sign up for a subscription and say “Okay, this is a good website. I want to come back and read more of this content. It is fine if I just pay a small amount for it.”
Eric Enge: I would imagine that for most subscription-based sites it is an effective program to expose their content and increase their subscriptions.
John Mueller: Yes, absolutely.
Eric Enge: Exposure is really good. To do this, when Googlebot comes to the site, you basically bypass the login screen and give it access to all the content that you want indexed.
John Mueller: Exactly, yes. I would expect that you could probably do the same for other search engines. You might want to check with them, but I think that is generally acceptable if the user sees the same content as the search engine crawler would see.
One thing that I have noticed when I talk to people about this is that they are kind of unsure how they would actually implement it and if it would really make a difference in their subscription numbers. It is generally fine to run a test and take a thousand articles and make them available for First Click Free, make them available for Googlebot to crawl and make them available for users to click on.
You can leave the rest of your articles blocked completely from Googlebot and from users. Feel free to just run a test and see if it is going to make a difference or not. If you notice it is helping your subscriptions after a month or so, then you can consider adding more and more content to your First Click Free content.
Eric Enge: Right. You can take it in stages. Are there other questions on these topics that you hear from people at conferences or out on the boards?
John Mueller: Another thing about cloaking is that we sometimes run into situations where a website is accidentally cloaking to Googlebot. That happens, for example, with some websites that throw an error when they see the Googlebot user agent. It is something that can happen to Microsoft IIS websites, for example, and that would technically also be cloaking. But in a case like that, you are really shooting yourself in the foot because Googlebot keeps seeing all these errors and it can’t index your content.
The problem here is that Googlebot will only find one language, and we will just crawl the whole website in that one language. So, for example, we have seen cases where Googlebot was accidentally recognized as a German-based user, and we re-crawled the whole website in German, and suddenly all the search results were only showing up for German users.
Eric Enge: So people in the UK couldn’t see the UK-English version of the site, because Googlebot wasn’t aware the content was there?
John Mueller: Users in the UK would be able to see that content, but since the Googlebot was recognized as a German user, it was seeing the content in German only. In this case, the old pages would be re-indexed in German, so if someone was searching for an English term, they wouldn’t even find that site anymore.
The lesson here is to really make sure you have separate URLs for your content in different languages and locations.
Eric Enge: Right, for purposes of this example, we are assuming that the content is identical but translated.
John Mueller: Exactly.
Eric Enge: And, you want to have separate pages for the different language versions of the content.
John Mueller: Exactly.
Eric Enge: Excellent, thanks, John!
John Mueller: Excellent, thank you!