Vanessa Fox talks about Webmaster Tools and Duplicate Content
Published: July 2, 2007
In what may have been her last interview before leaving Google, I sat down with Vanessa Fox, and we talked about Google Webmaster Tools, and we also spent some time digging into duplicate content.
While at Google, Vanessa Fox created and developed Google Webmaster Central, which provides tools and information to help site owners understand their sites in the context of the Google index and with search engine optimization. She also managed the Google Webmaster Central blog and the Google webmaster discussion forums. She will soon be heading up ad and email-related efforts at Zillow. She regularly participates in the conversation about search in the blogosphere and forums on the web. She speaks at search-related conferences throughout the year and is an avid Buffy The Vampire Slayer fan. You can find her at her blog, vanessafoxnude.com.
Eric Enge: Google Webmaster Tools is a great way to get information about your site. We recommend it to people all the time, because it's got great diagnostic information in it. Just knowing what the crawler thinks about your site is a smart thing to do. So, what is Google's goal with Webmaster Tools?
Vanessa Fox: We have a few goals with it. The primary thing is to have a good partnership with the webmaster community, and to provide as much information as we can. And, the thing about having a partnership like this is that we can really help give site owners information that will help make their sites better. And, that will in turn ultimately improve our search results, and improve the web overall. We think it's a win-win situation for everyone. So, I mean there are a lot of good reasons for doing it. For one, getting more input makes us smarter and our index smarter as well.
Eric Enge: Do you have any examples of how people are using the Webmaster Tools to improve their website?
Vanessa Fox: We hear stuff all the time, but there are two things that we hear a lot. One is that the crawler diagnostic information we offer has helped a lot of people pinpoint issues with their sites. There are a lot of times when pages aren't indexed and people jump to the wrong conclusion as to why. They think that there must be something wrong with their site, or that there may be a penalty or something.
Often it's that there were some crawl related problems, where the Googlebot was unable to access the page. Being able to see that type of data in Webmaster Tools really helps people. One instance where it's really helpful is when people move servers, move sites, or change the structure of their site. They can see when things get broken, and work to quickly get those fixed. At SMX Advanced this past week, Danny Sullivan mentioned that it had helped him resolve some issues with the Search Engine Watch site (when he was still there) when that site changed servers. It's cool to hear of things like that.
Another thing that I hear people use a lot is the search queries, especially the report we offer of the queries that didn't lead searchers to click through to the site. Those are the things that you can't really see in your server logs. It's interesting to see the searches that your site may come up for, that you didn't really know about. I've heard that a lot of people optimize on that. They take a look and they go "oh, I had no idea that my site was being returned for these queries. Maybe I should beef up the content a little bit to make it more attractive for these queries, and I may get more visitors that way".
Eric Enge: That is cool. One of the things I really like is I always look at the page not found reports, because I can find cases where people have linked to something on my site, but messed up the URL somehow. Once I see that data, I take that broken inbound link and I 301 it to the correct page.
Vanessa Fox: Yes. This is an excellent thing for people to do, because it not only helps the Googlebot, but also people who are coming across these broken pages by following the links from the other sites, when without this they would just get to a 404 page. That's really a great thing for people to implement on their sites.
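[Editor's Note: As a rough illustration of the fix Eric describes, here is a minimal Python sketch of a lookup that turns mistyped inbound URLs into 301 redirects. The paths and the redirect_for function are hypothetical; a real site would more likely express the same mapping in its web server or framework configuration.]

```python
# Hypothetical mapping from mistyped inbound URLs (as they might appear in a
# "page not found" report) to the pages the linking sites meant to reach.
BROKEN_TO_CORRECT = {
    "/articles/seo-tips.htm": "/articles/seo-tips.html",
    "/blog/webmaster-tools.": "/blog/webmaster-tools",
}


def redirect_for(path):
    """Return (301, target) for a known broken path, or None otherwise.

    Returning None means "no special handling": the site serves the request
    as usual, which may still end in a 404 for truly missing pages.
    """
    target = BROKEN_TO_CORRECT.get(path)
    if target is None:
        return None
    return 301, target


if __name__ == "__main__":
    for requested in ("/articles/seo-tips.htm", "/about"):
        print(requested, "->", redirect_for(requested))
```

A permanent (301) redirect is used here rather than a temporary one because the corrected URL is meant to replace the broken one for good, for visitors and crawlers alike.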
Eric Enge: I really love the link report. I think it is just fabulous that you can not only see your links, but also download them as a spreadsheet.
It allows you to do a lot of very interesting things, because now you can work in Excel or whatever you use, manipulate the data, and see some interesting patterns. But, I think you mentioned to me that you had some plans to expand the functionality. Can you let us know a little bit about that?
Vanessa Fox: We've gotten a lot of good feedback from people on the things that they would like related to that. We are not sure exactly what we are going to be implementing yet, but one of the things that people have asked for is the ability to just see a list of domains linking to your site, rather than all the web pages.
Eric Enge: That makes a lot of sense. When I work with clients I focus on the number of domains linking to you rather than the number of web pages.
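[Editor's Note: Until a per-domain view exists, the downloaded link list can be rolled up by hand. The Python sketch below is one assumed way to do it; the sample URLs are made up, and it groups by hostname with a leading "www." removed, which is only an approximation of "domain" (it does not merge other subdomains).]

```python
from collections import Counter
from urllib.parse import urlsplit


def linking_hosts(link_urls):
    """Count how many linking pages come from each host."""
    hosts = Counter()
    for url in link_urls:
        host = urlsplit(url).hostname  # already lower-cased; None if missing
        if not host:
            continue  # skip rows that are not absolute URLs
        if host.startswith("www."):
            host = host[len("www."):]
        hosts[host] += 1
    return hosts


if __name__ == "__main__":
    # Hypothetical rows from a downloaded links spreadsheet.
    sample = [
        "http://www.example.com/page-one.html",
        "http://example.com/page-two.html",
        "http://blog.example.org/2007/06/post.html",
    ]
    for host, pages in linking_hosts(sample).most_common():
        print(host, pages)
```

The number of keys in the result is the count of distinct linking hosts, which is the figure Eric says he focuses on with clients.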
Vanessa Fox: Right, so being able to show some other report like that, and providing you the ability to drill down more if you wanted to do that. People have also been asking for more anchor text information. Now we are showing you the top two hundred anchors to your site which people find really helpful. But, they would like to see for each link the text that was used in the anchor. We are looking at that to see if there is anything that we could do there.
People would like to see the day we first found the link, and I believe right now we are showing you the last time we saw the link. Some people also have asked for a little indicator to show which of the links were no followed, because right now we just show you all the links in a list, and we don't really specify which ones were no followed on the site.
Also, we could potentially show you other links going out from your site, and which ones are broken.
Eric Enge: An outbound report would definitely be of interest.
Vanessa Fox: Yes, because I think a lot of times sites grow over time, and you forget where everything is. One other thing that people have asked for is the ability to disclaim inbound links. They just want to be able to click a button that says I don't really want this link. I don't know that there is a need for that necessarily, because I wouldn't worry too much about sites that link into your site, but people do worry about that anyway.
Eric Enge: Let me give you an example of where this could be useful. I've seen sites where they have purchased links from news sites, such as the Washington Times. Then over time they learn it's a bad thing to do, and they stop buying the links. But, the problem is that some of these magazines archive these pages forever. So, you can have a situation where you have been flagged for doing something you shouldn't do, and you are trying to come out from under the rock. That is a situation where disclaiming a link would be very useful.
Vanessa Fox: That might be a situation where that would be helpful for sure. I suppose right now you could just put in a reconsideration request, but it would be a lot easier just to be able to have a checkbox to disclaim the link.
So there is a long list of things that people have asked for, and whenever we hear this stuff we always take the feedback back, and see what is it that we could do and prioritize it against the other things that we are planning. We are always listening to all this stuff. Those are just a few things that we have heard about.
Eric Enge: Another thing that you added recently was the preferred domain feature. Do you think that everybody should use this, even if they do accurate canonical redirects?
Vanessa Fox: I would say that if you are doing 301 redirects with an .htaccess file or something like that, that's probably the best bet. We launched this feature more for people who didn't have access or maybe didn't quite know how to implement something like that. If you do it as a 301 redirect, it's going to be helpful for all search engines and not just for Google. So, you don't actually need to use the preferred domain feature if you do the redirects properly, but it's not going to hurt you.
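[Editor's Note: The redirect Vanessa mentions is normally a couple of lines of server configuration. To show the logic only, here is a minimal Python sketch; the host names are placeholders, canonical_redirect is a hypothetical helper, and it ignores ports for simplicity.]

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"   # placeholder for the preferred domain
ALIAS_HOSTS = {"example.com"}        # placeholder non-preferred host names


def canonical_redirect(url):
    """Return (301, new_url) if url uses a non-preferred host, else None."""
    parts = urlsplit(url)
    if parts.hostname not in ALIAS_HOSTS:
        return None
    # Keep scheme, path, query and fragment; only the host changes.
    new_url = urlunsplit(
        (parts.scheme, PREFERRED_HOST, parts.path, parts.query, parts.fragment)
    )
    return 301, new_url


if __name__ == "__main__":
    print(canonical_redirect("http://example.com/page?x=1"))
    print(canonical_redirect("http://www.example.com/page"))
```

Because the rule lives on the site itself, every crawler and visitor sees the same canonical host, which is the advantage over a setting that applies to a single search engine.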
Eric Enge: Are you planning any updates to that feature potentially?
Vanessa Fox: Right now it's available just for the non www to www redirect (or vice versa), but it would be nice if we could expand it out in a few ways. Sometimes people have aliases, so they have multiple versions of a domain that point into one version. Another thing that would be potentially helpful is when you move from one domain to another. And so, there may be something that we could do if you were able to verify ownership of both of the domains, so that you might be able to point one to another to help with migrations and that type of thing.
We are looking into the details, but obviously, anytime you do anything that has to do with crawling you have to be careful, and make sure that you do it in such a way that it's going to make things better and not somehow cause any indexing issues.
Eric Enge: Right. Another thing I really like about Webmaster Tools is the query stats feature. Can you talk a little bit about that?
Vanessa Fox: That's a great feature. There is the top searches report that shows the search terms entered by people that have clicked through to your site, and then it also shows the searches that your site shows up for most often, but that didn't cause a click to your site. That is done with straight volume, so it's searches that have been done the most, compared to all searches. A lot of times you might see a search that brings up your site and you realize that's not the thing my site is most about -- my site is most about these other things.
[Editor's Note: Here's a blog post as to why that might be.]
Sometimes you can see search terms where you are ranking highly and you are not getting any click throughs, and you can tweak your site, for example how your site's title and description show up in the search results, to improve your results. But, there may be other things that we can do. One thing that would be neat is to show the position over time. Let's say your site was number 8 one week, and number 5 another week. It might be nice to see how the trends go over time, and so that's something that I think would be nifty.
Eric Enge: That would give you an idea as to how your overall optimization strategy is working in a visual timeline fashion. What about other changes, are there other things that you see happening in Webmaster Tools in the near future?
Vanessa Fox: Well, we just did a launch this afternoon (June 11th). It is just a minor thing, but you may have seen Matt Cutts blog a couple of weeks ago, where he was talking about how people have been asking us for a paid links form. He said that for now, you can go ahead and use the spam reporting form, and this is the information that we would want to get. But today we launched a paid links form specifically for reporting paid links. It just makes it a little bit easier, and so I did a blog post that explains how to use it.
Note that there is nothing wrong with buying links for advertising and for traffic. It's really only an issue if you are buying links to deceive search engines. That's what this form is for specifically. The other thing that we did, which isn't really a new feature, is change the name of the re-inclusion request form to reconsideration request, to make it a little broader. You are not always requesting re-inclusion, since the site may or may not actually be completely out of the index. So, it's just a minor thing that makes sure that we have the broadest view of that possible.
Eric Enge: Right. So, the notion here is that there are penalties other than full banishment from the index that could potentially benefit from a manual review by Google, if you rectified the thing that caused the penalty to take place.
Vanessa Fox: That form is really for situations where you have violated the guidelines in some way, and you've fixed it. Then you can use that form to have someone take a look at it, as opposed to just waiting over time for things to naturally pick up again. You can say hey, I know that there is this issue and I have made the fix, so please do a review. That's exactly what that form is for. The other thing you can use that form for is if you picked up a domain that was expired and you don't know the history of the domain. You think that maybe there was an issue before, but you have looked it over now that you own it and it seems okay. You can use the form for that too, to say "hey, I acquired this, it's been around for a while, I don't know what happened with it before, but it's not been indexed".
Eric Enge: Please don't hold its past against me.
Vanessa Fox: Yes. Those are the two things that the form is used for. A lot of times people will use it just because they are having ranking issues or they want more pages indexed, but it's really specifically for guideline violations.
Eric Enge: Yes, and guidelines violations can range from suppressed rankings to outright punishments. Is that correct?
Vanessa Fox: Yes.
Eric Enge: One of the things that came through loud and clear at the SMX duplicate content session was that people would really like to get a report of what pages on their site that the crawler sees as duplicate content. What's your thinking on that?
Vanessa Fox: I think it's a great idea. I was really hoping to get input from the audience on the types of things that would help, so it was great to get that feedback. It would be nice to show the pages within your own site that we think are identical, for example where many URLs seem to point to the same page, perhaps with some way for you to say "this is the canonical version that I want you to index". It would also be nice to say "here are some pages within your site that we think are mostly the same, and that we are probably not going to return both of them in a search result because we think that they are mostly the same." And, it would also be nice to say "hey, here are some pages on your site that look to be exactly the same as pages on this other site."
Eric Enge: Right. And then, also show the page it appears to be a duplicate of.
Vanessa Fox: Exactly. I am also working on a blog post about the duplicate content summit. I am trying to get some more feedback on the suggestions that we got from the audience there, from a larger audience. What are the kinds of things that you would like to have in a duplicate content report? What would you want to be able to do with it? Obviously, we would want to make it actionable as much as we can.
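[Editor's Note: To make the report idea concrete, here is a small Python sketch of how a site owner might approximate the first two categories on their own pages: exact duplicates by hashing normalized text, and "mostly the same" pages by comparing overlapping word sequences (shingles). This is a generic technique chosen for illustration, not a description of how Google detects duplicates; the function names and sample pages are hypothetical.]

```python
import hashlib
from collections import defaultdict


def exact_duplicates(pages):
    """pages maps URL -> page text. Return groups of URLs with identical text."""
    groups = defaultdict(list)
    for url, text in pages.items():
        normalized = " ".join(text.lower().split())  # ignore case and spacing
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    count = max(len(words) - size + 1, 1)
    return {tuple(words[i:i + size]) for i in range(count)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    pages = {
        "/widgets": "Blue widgets for sale",
        "/widgets?ref=home": "Blue  widgets for SALE",
        "/contact": "How to reach us",
    }
    print(exact_duplicates(pages))
    print(similarity("the quick brown fox jumps", "the quick brown fox leaps"))
```

A report built this way would still need a cutoff for "mostly the same"; any threshold on the similarity score is a judgment call rather than a known value.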
Eric Enge: You are going to use this blog post you are putting out tomorrow (June 13) to recruit feedback from the Webmaster community. Is that correct?
Vanessa Fox: Yes. The more we get feedback, the better. You don't have to guess what webmasters want, you can just ask them, and they are pretty vocal about it.
Eric Enge: We will see if we can help get more feedback for you.
Vanessa Fox: That would be great, the more the better.
Eric Enge: Some kinds of duplicate content must be really easy for you to detect. For example, there are unintentional kinds of dupe content, like blog pages and their various categories, RSS Feeds, Archive Pages, and Print Pages. How does Google do in understanding those cases for the most part?
Vanessa Fox: We do a pretty good job in general, but with RSS Feeds we are still working on the best way to do that. I think we will continue to improve there.
But, we do a pretty good job overall with those types of things. One thing we can do is track what pages get linked to the most. We assume that that's the primary version of the page that you are interested in having indexed, and of course you can help us out with that if you want by roboting out a Print Page, just to make it really obvious that that's not the version that you want indexed. It may also evolve over time. For example, if you have a blog and you do a post, a lot of people may link to that post on your homepage, because that's where it is, and so we index that page. Then maybe a couple of months down the road, when that post is no longer on the homepage and it's somewhere else, we may end up indexing the Permalink version, since indexing the Permalink version over time makes the most sense. So we can evolve as the sites evolve, and pick what the best version is over time. Most people have said that they didn't seem to have a problem with their blogs being indexed. I think we mostly do a pretty good job with that.
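[Editor's Note: "Roboting out" a print page means disallowing it in robots.txt. The Python sketch below uses the standard library's robots.txt parser to check that a rule has the intended effect. The /print/ directory layout and the URLs are assumptions made for the example.]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that keeps printer-friendly pages
# under /print/ and wants crawlers to skip them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/articles/duplicate-content",
        "http://www.example.com/print/duplicate-content",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

Checking the rule this way before deploying it helps avoid accidentally blocking the version of the page that should be indexed.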
Eric Enge: Right. I've talked with both Adam Lasnik and with Rick Klau over at FeedBurner, about whether or not webmasters should no-index their RSS Feeds. There is a tag which you can put inside your RSS Feed to specify that it not be indexed.
Vanessa Fox: I talked to Rick a little bit about this too a while back. And I know Matt was also looking into that a little bit more to see. I mean I think we can probably improve a little bit on how we handle RSS Feeds.
Eric Enge: I would think that it would be really easy to just identify something as an RSS Feed, and not index it, because Feeds are syndications of content not original content itself. It should be relatively simple to do.
Vanessa Fox: For the most part it seems that even without doing a No index, if you look at our web search results, we do a pretty good job finding the canonical version of a page.
Eric Enge: Right. Another situation is the case of thin affiliate sites: sites with tens of thousands of pages whose content is largely the same except for simple word substitutions to try to rank for different search terms.
That's one type of thin affiliate site, and another type is where they are just replicating content from the master marketing agency that they are an affiliate of. They are just copying all that content and not writing anything new. I imagine that those are really easy to detect as well.
Vanessa Fox: You are right. Those are really easy to detect.
Eric Enge: Right. Then there are people who are a little bit more sophisticated about it. These people do much more sophisticated stuff, maybe a remapping of the text they got from the marketing agency that they are working with or computer generating content using Markov chains. That would be a little bit tougher to deal with wouldn't it?
Vanessa Fox: Well, it is, but we have pretty sophisticated methods of detecting things. So, it's certainly more of a challenge, but we are up for the challenge. People are always going to be trying new things and we are always going to be trying new things and that will keep happening.
Eric Enge: It's an arms race.
Vanessa Fox: We do a pretty good job of it, and we are always evolving and we've got a lot of smart people here who are working on these problems, so I think we do pretty well.
Eric Enge: But of course, the issue at the end of the day is not really whether the content is computer generated, it's whether or not there is end user value, unique end user value in the page.
Vanessa Fox: Well that's ultimately what the goal is with all these things that we do, is to have the best search results possible. When someone does a search, we want to give them the most relevant and useful results across the whole web. Everything that we do is with that aim in mind.
Eric Enge: The normal response to duplicate content is that you pick the best one whether it is across multiple sites, third-party sites or just on one of your own sites, you pick the best one, but are there egregious situations where duplicate content will result in a penalty?
Vanessa Fox: Well, you need to look at the intent of the site. If the intent of the site is to be deceptive and manipulative, then certainly we always reserve the right to take action on those types of things. A large percentage of the time these kinds of duplicate issues aren't the result of deceptive behavior, and so there is not going to be any kind of penalty involved. It's just a very small percentage of the time, when someone is really out to manipulate the search engines, that you are going to see any action.
Eric Enge: There is this notion of intent and of extent. What's the intention of the person doing it, and how extensive is it?
Vanessa Fox: We deal with such a high volume all the time, and we do a pretty good job of understanding the intent. What we are trying to identify is manipulative behavior. I really think most people shouldn't worry about this. I think the people who should worry already know that they should worry, because they know what they are doing.
Eric Enge: Right, indeed it's a case of, when you think it, you know it.
I've also encountered some situations with people who have a dozen sites or so, and they have encountered a penalty on all of their sites. They are working really hard on rectifying the problems. Is there a situation where, let's say they have done everything needed to fix one of the sites in their group of sites, but there is a cumulative network effect that says that the issue has to be successfully addressed across all or most of the sites before a single site could recover?
Vanessa Fox: Sites are reviewed on their own merits, and so I would think in that situation if there is like five sites or ten sites and one site now has cleaned up everything that the site would be evaluated on its own. Of course, we have got to see if there is still anything else going on, or if it's a duplicate issue. I guess one question would be how are the other sites reflecting on that site. But if they are not at all, then it would be an independent evaluation, but if you had an article still on the other sites that's also on this cleaned up site, we're still probably only going to return one version of that article but it will be from that site or one of the other sites. But yes, each site is evaluated on its own merits.
Eric Enge: So when you have this situation with a group of sites, when does it become appropriate to put in a reconsideration request?
Vanessa Fox: I would look at each site individually, and if this one site can distinguish itself from the others and it doesn't violate the guidelines, then you should file the request for that one site, and you should be alright if everything is cleaned up on it. I would, of course, say try to clean up the other sites, but if that one site is cleaned up you should be able to file a request for that one and then have it evaluated on its own merits. I mean if you have some link exchanges going on between that site and the other sites, then you may have some connection, but if that site is standing on its own then yes, it should be fine to file a request for that.
Eric Enge: Well thanks Vanessa!
Vanessa Fox: Thank you too!
About the Author
Eric Enge is the Founder and President of Stone Temple Consulting (STC). STC offers Internet marketing optimization services, including SEO, Social Media and PPC optimization, and its web site can be found at: http://www.stonetemple.com.