A Holistic Look at Panda with Vanessa Fox

Vanessa Fox, called a cyberspace visionary by Seattle Business Monthly, is an expert in understanding customer acquisition from organic search. She shares her perspective on how this impacts marketing and user experience and how all business silos (including developers and marketers) can work together towards greater search visibility at ninebyblue.com. She’s also an entrepreneur-in-residence with Ignition Partners, Contributing Editor at Search Engine Land, and host of the weekly podcast Office Hours. She previously created Google’s Webmaster Central, which provides both tools and community to help website owners improve their sites to gain more customers from search and was instrumental in the sitemaps.org alliance of Google, Yahoo!, and Microsoft Live Search. She was named one of Seattle’s 2008 top 25 innovators and entrepreneurs. Her book, Marketing in the Age of Google, provides a blueprint for incorporating search into organizations of all levels.

Key Interview Points

I really enjoy speaking with Vanessa about search because of her perspective about how to do things. As readers of mine know, I am a fan of the trite old way of doing it – producing a great web site, making it search friendly, and then promoting it well. Vanessa is truly an industry leader in promoting this type of thinking.

This is a great interview for you to read if you want to get a strong feeling for the philosophy that drove the Panda algorithm, and the implications of that philosophy going forward. Here are some of the major elements that I extracted (and paraphrased except in those situations which are quoted) from the discussion we had:

  1. Like any business, Google seeks to maximize its profitability. However, Google believes that this is best done by providing maximum value to end users, as this helps them maintain and grow market share. They make more money this way than trying to squeeze extra CPM out of their web pages at the cost of user experience.
  2. The AdWords team does not have access to the organic search team, and as a result the engineers working on organic search are free to focus on delivering the best quality results possible.
  3. (Vanessa) “Panda isn’t simply an algorithm update. It’s a platform for new ways to understand the web and understand user experience”.
  4. Panda is updated on a periodic basis, as opposed to in real time. This is similar to updates to the PageRank displayed on the Google Toolbar, except it is a whole lot more important!
  5. It is easier to reliably detect social spam than link spam.
  6. (Eric) “If you’ve got twelve different signals and someone games two of them and the other ten don’t agree, that’s a flag.”
  7. Don’t focus on artificial aspects of SEO. If it seems like a hokey reason for a web page to rank higher, it probably isn’t true. If by some chance it is true, first it is most likely a coincidence, and second and more importantly, you can’t count on it staying that way.
  8. (Vanessa) “I suggest you get an objective observer to provide you feedback and determine if there are any blind spots you’re not seeing.”
  9. (Vanessa) “The question then becomes if someone lands on your site and they like that page, but they want to engage with your site further and click around your site, does the experience become degraded or does it continue to be a good experience?”
  10. Added value is key. Search engines are looking more and more for the best possible answer to users’ questions. Even if your article is original, if it covers the exact same points as hundreds of other articles (or even 5 other articles) there is no added value to it.
  11. Reviews can be a great way to improve web page content provided that they are contextually relevant and useful.
  12. Crowd sourced content is also potentially useful, but must also be relevant and valuable.
  13. One of the challenges facing both UGC and Crowd Sourcing is the editorial challenge of making sure it is useful and relevant.
  14. Branding can be very helpful too, as it helps people trust the content more. Search engines recognize this as a differentiator as well.
  15. (Vanessa) “I think social media levels that playing field a bit. In the past, you had to hire a publicist, do press releases, have relationships with reporters, and get on Good Morning America, or something on that order, to get your name recognized.”
  16. SEO is still important! Making sites that are easily understood by search engines is still something you need to do. Effective promotion of your web site remains critical too.
  17. Unfortunately, for many sites that have been hit by Panda, there is no quick fix. There are exceptions, of course, but they will be relatively rare.

Motivations of Google

Eric Enge: Let’s talk about what Panda was from a Google perspective and what they were trying to accomplish rather than the mechanics of what they did.

Vanessa Fox: I like that you addressed it that way because many people simply want to know mechanically what they did.

This update took many people by surprise and, certainly, there are things to be worked out. However, Google has never been secretive about what it’s trying to accomplish and, specifically, what it’s trying to accomplish with Panda.

Ever since Google launched, its primary goal has been to figure out what searchers want and give them that. This encompasses a lot of things. It encompasses answering their question as quickly and as comprehensively as possible. It involves all the things you think about in terms of making the searcher happy and providing a good user experience.

In the early days of the web, the only way Google knew if people found something valuable was if there was a link to it. Today, the web is more sophisticated and Google has much more information available to it. The bottom line is that Google is trying to provide the best results for searchers and, for them, Panda was a major step forward in accomplishing this.

Eric Enge: Yes, some people believe that Google made these changes because it favors their advertisers and their objective is to make more money in the short term. I don’t believe this. To me, the value of market share far outweighs the impact you could get by jacking up your effective CPM by a few percent on your pages.

Vanessa Fox: That’s absolutely right. It is short term and shortsighted to think Google is focused on improving CPMs or is trying to drive people, who lost ranking in the organic results, to advertise via AdWords. Google is looking for long term market share which is the best way for them to maximize profitability.

The root of their market share is the fact that they get so many people searching all the time. The best monetary decision for the company is to ensure that searchers experience excellent search results. That’s the core that’s going to help Google maintain their market share which, in turn, is what will help them grow.

Eric Enge: I’ll paraphrase it simply and say they are totally selfish and they are being selfish by working on their market share.

Vanessa Fox: That is exactly right. Many people don’t believe that there is a wall between the organic search people and everything else at Google. If they didn’t have such a wall you would have a situation where someone on the AdWords team would be approached by a large advertiser saying “I am having problems with the organic results, can you help me?”

Of course, that person would want to help the advertiser. By having that wall, the AdWords person doesn’t have access to the organic search people. There is this protectiveness around organic search, which enables those engineers to focus on the search experience. They don’t have to think about AdWords, they don’t have to think about how Google is making money, or what the CPMs are. They don’t have to think about any of those things and are able to concentrate on making the best search experience.

The whole environment was built that way which is unlike many other companies. In other companies, no matter what part of the organization you work in, you have to always think about how does this impact our revenue. At Google this is not part of the search engineers’ focus, which is great. Another reason is that many of the search engineers have been at Google since the beginning. They don’t have to work there anymore.

Eric Enge: At this point they could easily retire and buy an island.

Vanessa Fox: They continue to work there because they love data and love working with large amounts of data and improving things. I think if someone said to them, “I know you work on organic search, but we’ve decided it’s really important to either give advertisers preference or hold advertisers down. Could you tweak the algorithms?” they would probably say, “I am going to buy my island now, see you later.”

That’s not why they are at Google. They are there because they get to do cool things with large pieces of data. I think these two big factors make it basically impossible for anything other than a search experience to infiltrate what’s going on there.

Think of Panda as a Platform

Eric Enge: What is Panda?

Vanessa Fox: Panda isn’t simply an algorithm update. It’s a platform for new ways to understand the web and understand user experience. There are about four to five hundred algorithm updates a year based on all the signals they have. Panda updates will occur less frequently.

Eric Enge: Right. In the long run it will probably be seen as significant as the advent of a PageRank update.

Vanessa Fox: Yes, absolutely.

Eric Enge: At SMX Munich, Rand Fishkin heard from Stefan Weitz and Maile Ohye that it’s a lot easier to recognize gaming of social signals than it is to recognize link spam.

Vanessa Fox: The social signals have more patterns and footprints around them. Also, the code that search engines use has gotten more sophisticated, and they have access to more data.

Eric Enge: Another thing I hear people talking about is that over time Google is looking to supplant links with other signals. My take on this is that links are still going to be a good signal, but they are not going to be the only signal.

Vanessa Fox: Google has been saying that for years. I don’t think the value of links will ever go away. They’ll continue to be augmented with more data, which will make the value of links less important because there are many other signals now in the mix.

Google never intended to be built solely on links. We didn’t have social media and Facebook like buttons, and all these things in the past. We only had links. Google was based on how can we build an infrastructure that algorithmically tells us what content people are finding most valuable on the web.

Google and Bing as black boxes

Eric Enge: I think another key component of this story is that Google and Bing are increasing the obscurity of the details of the algorithm. That’s not perfect phrasing, but I think you know what I mean.

Vanessa Fox: I think it becomes harder to reverse engineer for a number of reasons. There are so many moving parts that it’s hard to isolate. People who have systems that attempt to reverse engineer different parts of the algorithm for different signals may come to conclusions that are, or are not, accurate. This is because it’s impossible to isolate things down to a single signal.

You find cases where people think they have but, in reality, it’s the tip of an iceberg because you can’t see everything that’s under the surface. By having more signals and knowing so much more about the web the artificial stuff becomes more obvious.

Eric Enge: Absolutely. If you’ve got twelve different signals and someone games two of them and the other ten don’t agree, that’s a flag.

Vanessa Fox: Right. Which is why it’s so disheartening to me to see that some SEOs continue to react to this by saying, “okay, how can we figure out the algorithmic signals for Panda so we can cause our pages to have a footprint that matches a good quality site.” This is very short term thinking because the current signals are in use only during this snapshot in time.

At this point it’s going to be as difficult to create a footprint of a site with a good user experience as it would be to just create a site with a good user experience. This, of course, is not only a better long term perspective and more valuable, but it will result in a better rate of conversion for most businesses.

I’ve heard some people say things like, they’ve done some analysis and found that you have to vary the length of your articles on pages, so make sure that all of your articles are variable in length. And this is craziness. Even if it works this minute, next week it won’t work and then they will say the sky is falling again.

I read an article where a person said Seth Godin writes really short blogposts so he is going to be impacted by Panda, and how does Google know that if an article is short, it’s not valuable. But Google’s algorithms are not as simplistic as that. Seth Godin has not said he’s lost ranking because of Panda.

I commented on the post, and said this is not true. Google isn’t saying that a short article is not a valuable article. Publishers should make blog posts or articles as short or long as they need to be.

Eric Enge: There will be plenty of cases where the best article is a short article.

Vanessa Fox: Absolutely and those will continue to rank.

How Publishers should think about Panda

Eric Enge: What would you say to a publisher if they believe they were unfairly affected by Panda? This is a tough question because 98% of the people affected by Panda will say they are in this category. They believe they were a drive by victim rather than something that fell out of the algorithm.

Vanessa Fox: That is a complicated question. I will not dispute, and I don’t think Google would dispute, that any algorithmic change from any search engine has the potential of causing some collateral damage. If what you are doing as a search engine is asking, “are the search results better?” then if the search results are better that doesn’t mean that a site with good content doesn’t accidentally end up lower.

That’s going to be the case with any change a search engine makes. From a content-owner perspective that is not good, which we’ll talk about in a second. However, I talked to many people affected by this and 75% to 80% of the time they said I’ve been hit and I shouldn’t have been hit. There have been only a few occasions where people say, “yeah, I’ve gotten away with it for a long time and they cut me off.”

Eric Enge: You appreciate their honesty, don’t you?

Vanessa Fox: Oh, absolutely. But most of the time people say I shouldn’t have been hit. If you’ve been working on a site for a long time, you may not see the areas it can be improved. I suggest you get an objective observer to provide you feedback and determine if there are any blind spots you’re not seeing. I think that would be a good first step.

Essentially, this has become a holistic thing. It’s not one signal that’s been used. You need to determine does this page answer the question, does this help someone accomplish something?

As a business you have to make money. You also have to understand that if a site is optimized for making as much money per visitor from ads as possible, as opposed to being optimized at being useful to the searcher, this site is probably not what a search engine wants to show as the best search results.

You have to balance that. Does it answer a searcher’s question, but also does it answer that question better than any other site and is the answer easy to find? Look at the quality of what’s being said versus the quality of the other pages that are ranking. Is it better or worse? Then you have to determine if the content is awesome and is that obvious to the searcher.

From a user experience perspective, when they land on that page is the content they need buried? The user experience becomes important because Google wants the searcher to be happy and easily find their answer.

Let’s say the content and the user experience are good for that page. Then you run into the issue of quality ratio of the whole site. The question then becomes if someone lands on your site and they like that page, but they want to engage with your site further and click around your site, does the experience become degraded or does it continue to be a good experience?

For example, last year Google had this emphasis on speed, because their studies found that people are happier when pages load faster and abandon sites that load slowly. I’ve worked with companies whose pages take fifteen seconds before they load. No one will wait around anymore for fifteen seconds to load a page.

I don’t think this is a big part of Panda, it is just for illustration purposes.

If you isolate that as a signal you can have the best content in the world and the best user experience in the world. However, if someone does a search and lands on your page but it takes fifteen seconds for anything to appear, they’ve had a bad experience and they are going to bounce off.

You have to look holistically at everything that’s going on in your site. This is what you should be doing, as if search engines didn’t exist.

Eric Enge: Right. There is another element I want to get your reaction to which I refer to as the “sameness” factor. You may have a great user experience. You may have a solid set of articles that cover hundreds of different topics, and they may all be in fact original. However, it’s the same hundred topics that are covered by a hundred other sites and the basic points are the same, even though it’s original, there is nothing new.

Vanessa Fox: Right. I think that’s where added value comes into play. It’s important to look and see what other sites are ranking for. What are you offering that is better than other sites? If you don’t have anything new or valuable to say then take a look at your current content game plan.

Eric Enge: So, saying the same thing in different words is not the goal. I like to illustrate this by having people imagine the searcher who goes to the search results, clicks on the first result and reads through it. They don’t get what they want so they go back to the search engine, they click on a second result and it’s a different article, but it makes the same points with different words.

They still didn’t find what they want so they go back to the search engine, they click on the third result and that doesn’t say anything new either. For the search engine it is as bad as overt duplicate content.

Vanessa Fox: That’s absolutely right.

Eric Enge: It may not be a duplicate content filter per se, which is a different conversation than this one, but the impact is the same. It’s almost like an expansion of query deserves diversity, right?

Vanessa Fox: Right. These concepts have all been around for a long time, but we are seeing them perhaps played out with different sets of signals, but they are not anything new. The search engines have always said they want to show unique results, diverse results, valuable results, all these things.

Adding Diversity to your site with User Generated Content

Eric Enge: One thing I hear people talk a lot about regarding diversity is doing things with user-generated content. In my mind that can be a useful component provided it is contextually relevant and has something useful to say. Do you have some thoughts on that?

Vanessa Fox: Yes. I agree with you, it could go either way. Since Google’s goal is to provide useful, valuable results then you can certainly find pages where user-generated content provides that. If you look at TripAdvisor, which may have its faults, one benefit is that there are numerous first person accounts of hotels and other experiences.

Any hotel or vacation destination you are thinking of going to, you will find authentic, real information from people who’ve actually gone there.

Forums are another example where user-generated content is great. For instance on Stack Overflow people are interested in answering questions and having discussions and that’s valuable content. You might have other forums where people aren’t saying anything or are there to spam and put their links.

I think it depends on both the topic and how much you are moderating things, how much time you are spending in curation, how much time you are spending organizing things in a useful way so it’s easy to find.

For instance, let’s say you have a recipe site and people tag their recipes with different variations. If you have a curation process that cultivates that and puts it into topics that people could land on a landing page and see all of the recipes about a particular topic, that will be more useful than things scattered everywhere with random tag pages.

I think there can still be work involved in UGC, although it can be useful and valuable. When you begin looking at health information, for instance, it might become harder. If it’s a site about sharing your experience about an illness, that’s one thing.

If it’s a site about diagnosing people and telling them what they should do to fix their illness, that’s another thing. If it is a group of people as opposed to doctors, you get into this authoritative issue and how do you know it is credible.

Crowd Sourced Content

Eric Enge: There is a related topic that has a different place in the picture, which is the notion of crowd sourced content. Essentially, using crowd sourced data to draw a conclusion, for example, with surveys and polls.

Vanessa Fox: This boils down to the same thing. Is it useful, valuable, credible, authoritative, and comprehensive? Is it all the things people are looking for and does it answer their question better than anything else out there on the web? We can look to TripAdvisor as an example of a site that’s been able to create valuable content on a large scale.

At a larger scale you have to move towards automated processes and, at that point, the curation process becomes harder. Wikipedia has editors that are aggressive towards making sure the content is accurate. However, not all sites have that.

When you do surveys it can be fine, but if you are not manually reviewing the results, because of the large volume of data, that’s when something can potentially go awry, so you have to be careful with it.

The same thing can happen with aggregating data from different sources. If you look at something like Walk Score, they’ve been able to aggregate the data of how close schools, bars, and other facilities are to your house. Of course, you see other examples where it goes poorly, and you look at the page and it doesn’t make any sense.

Eric Enge: Right. It’s a matter of the context, the effort, and the level at which you are trying to do it.

Vanessa Fox: Yes. I think ultimately there will be a fair amount of work involved with running a business that adds value for people. With this age of technology, you see many cases where people say, “look at all the cool things I can do with technology and it’s very little work on my part.” This is sort of the four-hour work week syndrome.

Often, that does not produce the most valuable results. For instance, if we examine travel and look at a site like Oyster, which was started by Eytan Seidman who used to work on the search team at Microsoft, they pay full-time staff writers with a travel background to travel to hotels, write reviews, and take pictures. They aren’t in every city in the world, and they don’t have every hotel in the world.

That’s a corporate example, but there are travel bloggers, and food bloggers, and other people who only write ten blog posts. However, those ten posts are very comprehensive on the topic.

At a large scale, if you attempt to cover every topic in the world, you are not necessarily going to be able to compete with someone who has written something manually, gone there, and spent time editing their article. It wouldn’t make sense that your automated content would outrank them.

Eric Enge: Absolutely. It reminds me of another thread which I am not sure fits in the interview, but I am going to say it anyway. When I grew up I watched the news with Walter Cronkite. He was completely trusted and authoritative. Today we have Fox News, which is entertainment.

That’s the design of Fox News and more power to them; however, you have to imagine that as a culture we are going to have a drive towards getting news from a source that you can trust.

Vanessa Fox: Right. Google did a blog post recently where they talked about the trust element. They said it is certainly one of the questions you should ask yourself when you are evaluating a site. Can you trust it?

Eric Enge: Right. Will you give it your credit card or will you trust it for medical advice?

Vanessa Fox: Would you follow the instructions to save your life? This is where brand comes in. I don’t think it has to be a huge brand, but brand does help the trust factor. Building a brand that people see over and over makes a difference.

This is a major reason why I do not recommend microsites. I know many people who want to do a bunch of micro sites but lack of a brand is one reason I tell them it’s probably not a good idea.

It’s hard to build a brand with a bunch of micro sites that aren’t branded in a unified way. If you build one site under one brand you can build brand engagement; however, you can’t do that with a bunch of micro sites that are branded separately.

Social Media and Branding

Eric Enge: Do you think an effective tactic for beginning to build the brand would involve social media?

Vanessa Fox: It depends on the topic and audience. Where is your audience, are they on social media? If you can engage that audience and build up authority with them that is great. I think social media levels that playing field a bit. In the past, you had to hire a publicist, do press releases, have relationships with reporters, and get on Good Morning America, or something on that order, to get your name recognized.

It still takes work but you can go out on social media, see where people are talking about your topic area, answer their questions, and be that authoritative source. I think it can be great but it doesn’t fit every situation.

SEO still matters

Eric Enge: One last question since we’ve been talking about holistic marketing. The search engines still have mechanical limitations because of how they crawl web pages. So being search engine savvy is still important.

Vanessa Fox: Absolutely. Search engines crawl the web and they index the web. Technical aspects, such as how the server responds, how the page URLs are built, and what the redirects are, make a huge impact. You can have the best content in the world but if search engines can’t access that content it’s never going to be indexed to rank. So, absolutely, all that stuff is vitally important.

Eric Enge: The other component is the promotional component which is to go out and implement programs to make people aware of your site and draw links to it, and social media campaigns.

Vanessa Fox: Yes. That’s absolutely the case. I think it goes with the idea you’ve heard from the search engines for a long time which is what would you do if search engines didn’t exist? You need to build your business and part of that is building awareness about your business.

I think the web makes it easier but you need to raise awareness so people know that it’s there. Whether it is through social media or other types of PR, there are many things you can do. You can’t think of your audience engagement strategy as simply SEO. All these other components help SEO, but there are things you need to do in business even if you weren’t doing it for SEO.

The Scope of Panda

Eric Enge: Any last thoughts on Panda?

Vanessa Fox: I talk to many people who have sites that have been hit and I certainly sympathize with their plight. However, there is no quick fix in these cases.

I talked to a site owner two weeks ago who said, “maybe if we change our URLs so that they are closer to the root of the site instead of having folders in them that will get us back in.” This is the wrong way of looking at it.

Eric Enge: Yes. That’s a clear “no”. For sites that have been hit by Panda, I don’t think, for the most part, there is a quick fix.

Most sites will not be lucky enough to have one section of their site that is a total boat anchor that they can just not index and be done with it. Most sites probably have a real process to go through.

Vanessa Fox: Yes. It’s hard to hear because this is affecting people’s businesses. I think it is going to be a lot of work to figure out who your audience is, what they are looking for, are you engaging them well, and are you providing value beyond all the stuff that we talked about. It is a process.

Eric Enge: Thanks Vanessa!

The Mechanics of Panda

The Panda algorithm hit the SEO world in a big way back on February 23rd/24th. Here is the general update history of Panda:

  1. Panda 1.0: February 23/24, 2011 – The initial launch.
  2. Panda 2.0: April 11, 2011 – added Chrome Blocklist Extension data to impact eHow, plus global English coverage.
  3. Panda 2.1: May 10, 2011 – general algorithm tweaks.
  4. Panda 2.2: June 16, 2011 – improved scraper site detection, probably to reduce the incidence of scraper sites outranking source sites that got hit by Panda.
  5. Panda 2.3: July 23, 2011 – some sites recover due to algo changes in Panda.
  6. Panda 2.4: August 12, 2011 – Panda rolled out internationally.
  7. Panda 2.5: September 28, 2011 – Appears to have affected many sites, including sites with lower levels of traffic.
  8. Panda 2.5.1: October 9, 2011 – minor update.
  9. Panda 2.5.2: October 13, 2011 – minor update.
  10. Panda 3.0: October 19/20, 2011 – a major update that let many sites recover. Evidently, this was intended to help those who had been unfairly hit by Panda get back in the game.
  11. Panda 3.1: minor update.

Today I will present a visualization of the basic structure of how this works. I am basing this on the many hours of reading I have done on the topic, Google’s statements that Panda is a document classifier, and the indications by Matt Cutts that it is a process that is run periodically.

First though, a disclaimer. I am not a machine learning expert, and this should be used as a basic conceptualization of the workflow. Major elements are likely to differ from what you see here. However, I believe that this visualization is accurate enough to help you develop a solid mental model for how the algorithm is being applied.

Possible Panda Workflow

As a first step, Google is likely to have defined an initial test set of sites. These sites would then have been classified manually by human raters. The process would look something like this:

Manual Site Classification

This would allow Google to have a strong test database of manually rated sites, which therefore is accurate with a very high degree of probability, perhaps with a 99% degree of accuracy. As you can see sites would have been separated into buckets, such as “Good Sites” and “Bad Sites”.

As a next step, Google may have then spent time analyzing these sites to profile the characteristics of the Good Sites, and also of the Bad Sites, as follows:

Extracting Ranking Parameters

The idea is to develop a model for both types of sites. Of course, you can also have a continuous scale of Goodness, from Bad, Not so Bad, OK, Pretty Good, Very Good, and so forth. Once you have a model for Goodness vs. Badness, you can then step back and analyze what types of parameters you can evaluate algorithmically to get the same results as your human raters did during their evaluation.

One key factor in this is the noisiness of the signal. In other words, is there enough data available on all the sites you want to test for the data to be statistically significant? In addition, is it possible that the signal can be ambiguous? For example, does a high bounce rate always mean it is a bad site? Or are there scenarios where a high bounce rate is an indicator of quality? Consider a reference site where a faster bounce might mean that the person got their answer faster.

There are lots of signals you could consider. Here are just a few examples:

User Behavior:

  • Brand Searches
  • Site Preview
  • Ad CTR
  • Bounce Rate
  • Time on Site
  • Page Views Per Visitor
  • Return Visitor Rate
  • Scroll Bar Usage
  • Pages Printed

Content Attributes:

  • Reading Level
  • Editing Level
  • Misspellings/Grammar
  • Something new to say
  • Large Globs of Text
  • High Ad Density
  • Keyword Stuffing
  • Lack of Synonyms

Searcher Ratings:

  • Chrome Blocklist Extension
  • +1
  • Blocked Search results

Of course, the correlations between good sites and bad sites may use even more obscure signals. A machine learning algorithm may determine that articles that use the word “oxymoron” more than 5 times are inherently poor quality (note to algorithm, this article uses oxymoron only once … oops twice). I personally think that Google would try to constrain the breadth of signals used, but it is certainly possible that the algorithm came up with some unusual correlations.

Once the signals have been decided upon, the algorithm can then set out to test the performance of those parameters with a variety of weights, and can also vary the signals used:

Running the Panda Algorithm

That is the first step. But how did the algorithm do? The next step is to score the results:

Scoring Panda

Once you have your score, the algo can try to figure out what tweaks to make to the parameters used and the weighting of each one to create a better match between the manual classification they did of sites and the algorithmic output. You can also test the results on a larger data set using your validating signals. This allows you to look beyond the limited test set you worked on manually. Together, these comparisons lead to a feedback loop:

Tuning Panda

To finish the process, the machine learning engine would simply repeat the tuning loop until the results were of acceptable quality.
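
For readers who think better in code than in diagrams, here is a minimal Python sketch of the run, score, and tune loop described above. Like the diagrams, it is only a conceptualization: the signal names, the four hand-rated sites, the candidate weights, and the 0.5 threshold are all invented for the example, and the brute-force search stands in for whatever machine learning method Google actually uses.

```python
from itertools import product

# Hypothetical hand-rated test set: each site has signal values between
# 0 and 1 plus the label assigned by the human raters.
RATED_SITES = [
    {"signals": {"return_visitor_rate": 0.8, "ad_density": 0.1, "reading_level": 0.7}, "good": True},
    {"signals": {"return_visitor_rate": 0.6, "ad_density": 0.2, "reading_level": 0.6}, "good": True},
    {"signals": {"return_visitor_rate": 0.2, "ad_density": 0.9, "reading_level": 0.3}, "good": False},
    {"signals": {"return_visitor_rate": 0.3, "ad_density": 0.7, "reading_level": 0.2}, "good": False},
]
SIGNALS = ["return_visitor_rate", "ad_density", "reading_level"]
CANDIDATE_WEIGHTS = [-1.0, 0.0, 1.0]  # assumed search space for each weight
THRESHOLD = 0.5                       # assumed cutoff between good and bad


def classify(signals, weights):
    """Run step: a weighted sum of signals decides good (True) or bad (False)."""
    score = sum(weights[name] * signals[name] for name in SIGNALS)
    return score >= THRESHOLD


def agreement(weights, rated_sites):
    """Score step: fraction of sites where the algorithm matches the raters."""
    matches = sum(classify(s["signals"], weights) == s["good"] for s in rated_sites)
    return matches / len(rated_sites)


def tune(rated_sites, target=0.99):
    """Tune step: try weightings until agreement with the raters is acceptable."""
    best_weights, best_score = None, -1.0
    for combo in product(CANDIDATE_WEIGHTS, repeat=len(SIGNALS)):
        weights = dict(zip(SIGNALS, combo))
        score = agreement(weights, rated_sites)
        if score > best_score:
            best_weights, best_score = weights, score
        if best_score >= target:
            break  # results are of acceptable quality, stop looping
    return best_weights, best_score


best_weights, best_score = tune(RATED_SITES)
print(best_weights, best_score)
```

A weight of 0.0 effectively drops a signal, which is the toy equivalent of varying which signals are used at all.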

Summary

As mentioned above, this is just my mental model for what took place, and it is likely that the exact course of events was somewhat different.

Ultimately, the key lesson is that publishers need to focus the great majority of their efforts on building sites which offer deep, unique, rich user experiences. The search engines want to offer these types of experiences to their users, and Google and Bing are battling for market share. Focus on giving them what they want in the long run because this battle for market share will surely make roadkill of those that don’t.

The algorithm will certainly be tuned more and more over time, so don’t get too wrapped up in trying to find out the specific factors in use by Google. Even if you succeed in finding them and artificially manipulate your site to score well on those factors, the next set of factors that will get applied may be entirely different. It is simpler to just focus on producing high quality content that is not only non-duplicate, but also differentiated, and then promoting that effectively through a variety of channels.

How Panda Reshapes the Ranking Factors Picture

Much attention has been paid to the recent Google algorithm change that Danny Sullivan called the Farmer Update and that Google in a Wired article referred to as Panda. A lot of focus has been paid to the types of sites impacted, and the nature of the signals that Google has available to it to use. There are plenty of articles on both of these topics (one of the best ones is this one by Vanessa Fox). In fact, in the months leading up to this I predicted the downfall of Content Farms (which is still a work in progress), and much about this change. To me, this is only part of the story.

Farmer / Panda is a Fundamental Shift in Search Ranking

That’s a strong statement. But, think about it for a moment. We now have confirmation that Google is doing what it can to evaluate content quality. The major tools it has to do this are:

  1. Uniqueness of the content
  2. User Engagement with the content

In contrast, consider the way we used to think of ranking. What follows is the summary chart from the SEOmoz SEO ranking factors survey:

SEO Ranking Factors

Note how 66% of the factors relate to linking. Now, however, we know that we have shifted more weight to social engagement and content quality. What does the new reality look like? Here is my guess at it:

SEO Ranking Factors 2011

It is important to emphasize – this is just my guess. But one thing we do know is that we have seen a significant ranking algorithm change. I have represented that by showing SEOmoz’s social graph metrics growing from 6% to 20%, with my renaming it social engagement metrics.

Google’s Panda change purportedly impacts 12% of search queries, but keep in mind that this is their first foray in this direction. They will collect data, and they will get better at this. As they do, the impact will broaden. This is not something that they will back off on, but rather it is something that they will evolve and grow.

Interview with ex-Googler Adam Lewis

Today I am releasing an interview that I did with ex-Googler Adam Lewis. Adam worked in a variety of roles within Google, including as an optimization specialist on the AdWords team. Suffice it to say, he knows his stuff!

The interview covered a wide range of opportunities for advertisers, including:

  • The “see search terms” feature in AdWords.
  • Negative Matching
  • AdWords filters
  • Conversion Optimizer
  • Mobile ads
  • Local Business ads
  • Doubleclick Ad Planner
  • Google Insights for search
  • Google’s recently released broad match modifier

Lots of good information, so check it out!

Google’s Carter Maslan on Local Search

Today I am publishing the transcript of my recent interview with Carter Maslan. This post summarizes some of the main points of the discussion. We spoke quite a bit about the new service area business tool from Google. Some of the major points made by Carter about this were:

  1. Many businesses don’t want to have their address listed in local search results. For example, a plumber who works out of his home, but always goes to the customer to provide his services. Carter also noted that service area businesses are “primarily defined by whether or not the business brings its services to the customers”. However, service area businesses can include consultants that work from home that the customer can call to obtain their services.
  2. The ability to set oneself up as a service area business is “pretty broadly accessible”.
  3. “Even though we don’t have specific numbers to share … there are a ton of home-based and service businesses in this country”, and “I think it is at least a third of the overall total”.
  4. “giving the end user a PO Box as a pin on the map is not really helpful”, and “If a business really cares about its customers knowing where its PO Box is, I think it’ll be clear that it should be a service area business”.
  5. I also asked Carter about spam. He indicated that they do more or less the same thing they do with other types of spam. For example, when I asked him about a plumber who declared they would serve anyone within 1,000 miles, he indicated that “There are a lot of signals regarding whether or not this is suitable”.

We also discussed Place Pages. What emerged from the discussion was that the purpose of this was to provide access to all the available information about a given place. This would include, but not be limited to, businesses that have no web site. The information on a Place Page could include information provided by the business, but will also include information found by Google in crawling the web.

I asked Carter whether or not having individual landing pages for each location of a business with many locations was preferred. The answer was yes, provided that there was meaningful information that differentiated one location page from another (inventory info, driving directions, etc.).

Google has also made it quite a bit easier for people to report errors. This is basic crowdsourcing in action. They are happy to take reports even if all they specify is that something is wrong with a listing. Note that Google can also look at user interaction data (with a particular listing) to get signals to possible problems as well.

We talked about training the local search algorithm. One concept we bandied about was that of having humans build a mini-data set (e.g. some number of tens of thousands of hand researched listings), and then running the algorithm to see how its results compared to the hand crafted test set of data.

As always, in the interest of providing a short synopsis, I have passed over many details and other points from the interview. Read the full interview for more.

Interview with Google’s Frederick Vallaeys

A couple of weeks back I interviewed Frederick Vallaeys, who is a Senior Product Specialist for Google AdWords. We covered a wide range of topics, with a review of some recent product announcements, and also some tips and tricks. In this post, I will summarize the main points of the interview, but do read the full interview if you want the details.

Universal Search for AdWords: The AdWords team is actively looking at the types of things that have worked in the organic search results. Clearly the introduction of images, videos, maps, and other elements has been a great success in web search, so the AdWords team is beginning to incorporate similar elements. So if you search for a movie, you may get an associated video clip as part of an ad.

Another feature that has been ported over is Sitelinks. As an example of this, check out the search results for Orbitz. This feature will come up in particular when you do branded searches (and the advertiser has turned it on).

Additional pricing models have been added as well. Try a search on mortgage to see an example of one of them. Google calls these “Comparison Ads”. They offer the user a simple form to fill out. Once the user fills the form out, the information is shared with a few lenders. Note though, one cool additional feature – Google anonymizes the user’s phone number and provides an alternative phone number to the lenders, and when the lenders call that number, Google redirects it to the user’s actual phone number.

Another new pricing model is called “Product Listing Ads”. This is a model for retailers to list products on Google and pay on a cost per acquisition basis. Google pulls matching vendors from the Google Affiliate Network, and given the CPA model this presents little risk to the advertiser. Pretty cool.

As we switched into “tips and tricks”, Frederick led off with the Content Network. As I noted during the interview, the Content Network got off to a bad start because it was bundled so tightly with regular web advertising. This is a problem because the usage of keywords is completely different. Keywords in web search relate to actual user queries. In the world of the Content Network, Google uses keywords to find web pages of participating publishers that have those words on their web pages. If the match is strong enough, the AdSense box with your ad in it will be displayed. A completely different algorithm with many implications.

The other truth about the content network is that the user is in a different mindset. The search user is already looking for something. Someone visiting a web site is probably looking for something else, and you are now trying to get them to look at your product or service. A pretty different mentality, and the best results are obtained if you create pretty different looking ads.

But, if you do these two things well, you can be well off to the races. Frederick reports: “we found that those using the network would typically get 20% of all their leads and conversions from the Content Network”. That is pretty significant. Also, Google has added the ability to show you View Through Conversions. This is essentially analytics data showing you how many of your buyers saw a Content Network ad prior to making a purchase. With this data you can see what sales the Content Network “assisted” in getting for you.

Frederick’s next tip was about making use of Conversion Optimizer. This is a free tool that allows you to manage your keywords on a cost per acquisition basis. This is the type of thing that bid management tools do for you, but it is free. In addition, Google can leverage data it has more easily than the pay for bid management tools can, such as geographic data (where the searcher is located) and adapt the bids for your keywords on a per query basis. The tool does not allow management on an ROI basis yet, and the pay for tools offer other features, but for many advertisers, Conversion Optimizer will be enough.

Last up was the search based keyword tool. The tool identifies missed opportunities, such as cases where a company has a page getting organic search traffic related to a product, but there are no keywords being bid on for the same product. The tool presents you with both the keyword and proposed landing page, which makes acting on the suggestions really easy.

There were several other things in the interview, so check it out if you want more.

29 Tidbits from my Interview of Matt Cutts

It is always a pleasure when I get a chance to sit down with Matt Cutts. Google’s Webspam chief is always willing to share what he can for the benefit of webmasters and publishers. In this interview we focused on discussing crawling and indexation in detail.

Starting with this interview, I have also decided to provide the interview series with a bit of a new look. I am going to continue to publish the full transcript of interviews in the STC Articles Feed and on the articles page on our site, but I am going to use the related blog posts as a way of highlighting the most interesting points from the interview (for those of you who want the abridged version).

One of the more interesting points was their focus on seeing all the web’s content, regardless of whether or not it is duplicate, an unreadable file format, or whatever. The crawling and indexing team wants to see it all. You can control some of how they deal with it, but they still want to see it. Another interesting point was that listing a page in robots.txt does not necessarily save you anything in terms of “crawl budget”. (But wait there’s more!)

What follows are some of the more interesting statements that Matt made in the interview. I add my own comments to the end of each point.

  1. Matt Cutts: “there isn’t really any such thing as an indexation cap”
    My Comment: Never thought there was one, but it’s always good to confirm.
  2. Matt Cutts: “the number of pages that we crawl is roughly proportional to your PageRank”
    My Comment: Most experienced SEO professionals know this, but it is a good reminder how the original PageRank defined in the Brin-Page thesis still has a big influence on the world of SEO.
  3. Matt Cutts: “you can run into limits on how hard we will crawl your site. If we can only take two pages from a site at any given time, and we are only crawling over a certain period of time, that can then set some sort of upper bound on how many pages we are able to fetch from that host”
    My Comment: This will likely be a factor for people on shared (or under-powered) servers.
  4. Matt Cutts: “Imagine we crawl three pages from a site, and then we discover that the two other pages were duplicates of the third page. We’ll drop two out of the three pages and keep only one, and that’s why it looks like it has less good content”
    My Comment: Confirmation of one of the costs of duplicate content.
  5. Matt Cutts: “One idea is that if you have a certain amount of PageRank, we are only willing to crawl so much from that site. But some of those pages might get discarded, which would sort of be a waste”
    My Comment: More confirmation
  6. Eric Enge: “When you link from one page to a duplicate page, you are squandering some of your PageRank, correct?”
    Matt Cutts: “It can work out that way”
    My Comment: Yes, duplicate content can mess up your PageRank!
  7. Matt Cutts: “If you link to three pages that are duplicates, a search engine might be able to realize that those three pages are duplicates and transfer the incoming link juice to those merged pages”
    My Comment: So Google does try to pass all the PageRank (and other link signals) to the page it believes to be canonical.
  8. Matt Cutts: re: affiliate programs: “Duplicate content can happen. If you are operating something like a co-brand, where the only difference in the pages is a logo, then that’s the sort of thing that users look at as essentially the same page. Search engines are typically pretty good about trying to merge those sorts of things together, but other scenarios certainly can cause duplicate content issues”

    and

    Matt Cutts: re: 301 redirect of affiliate links: “People can do that”, but then “we usually would not count those as an endorsement”
    My Comment: Google will take links it recognizes as affiliate links and not allow them to pass juice.

  9. Matt Cutts: re: link juice loss in the case of a domain change: “I can certainly see how there could be some loss of PageRank. I am not 100 percent sure whether the crawling and indexing team has implemented that sort of natural PageRank decay”
    My Comment: In a follow on email, Matt confirmed that this is in fact the case. There is some loss of PR through a 301.
  10. Matt Cutts: No HTTP status code during redirect: “We would index it under the original URL’s location”
    My Comment: No surprise!
  11. Matt Cutts: re use of rel=canonical: “The pages you combine don’t have to be complete duplicates, but they really should be conceptual duplicates of the same product, or things that are closely related”
    My Comment: Consistent with prior Google communication
  12. Matt Cutts: “It’s totally fine for a page to link to itself with rel=canonical, and it’s also totally fine, at least with Google, to have rel=canonical on every page on your site”
    My Comment: Interesting way to protect your site from unintentionally creating dupe pages. Just be careful with how you implement something like this.
  13. Matt Cutts: “the crawling and indexing team wants to reserve the ultimate right to determine if the site owner is accidentally shooting themselves in the foot and not listen to the rel=canonical tag”
    My Comment: The canonical tag is a “hint” not a “directive”
  14. Matt Cutts: re using robots.txt to block crawling of KML files: “Typically, I wouldn’t recommend that. The best advice coming from the crawler and indexing team right now is to let Google crawl the pages on a site that you care about, and we will try to de-duplicate them. You can try to fix that in advance with good site architecture or 301s, but if you are trying to block something out from robots.txt, often times we’ll still see that URL and keep a reference to it in our index. So it doesn’t necessarily save your crawl budget”
    My Comment: One of the more important points of the interview: listing a page in robots.txt does NOT necessarily save you crawl budget.
  15. Matt Cutts: “most web servers end up doing almost as much work to figure out whether a page has changed or not when you do a HEAD request. In our tests, we found it’s actually more efficient to go ahead and do a GET almost all the time, rather than running a HEAD against a particular page. There are some things that we will run a HEAD for. For example, our image crawl may use HEAD requests because images might be much, much larger in content than web pages”
    My Comment: Interesting point regarding the image crawler.
  16. Matt Cutts: “We still use things like If-Modified-Since, where the web server can tell us if the page has changed or not”
  17. Matt Cutts: re faceted navigation: “You could imagine trying rel=canonical on those faceted navigation pages to pull you back to the standard way of going down through faceted navigation”
    My Comment: Should conserve PageRank (and other link related signals), but does not help with crawl budget. Net-net: sites with low PageRank cannot afford to implement faceted navigation because the crawler won’t crawl all of your pages.
  18. Matt Cutts: “If there are a large number of pages that we consider low value, then we might not crawl quite as many pages from that site, but that is independent of rel=canonical”
    My Comment: Lots of thin content pages CAN kill you.
  19. Eric Enge: “It does sound like there is a remaining downside here, that the crawler is going to spend a lot of its time on these pages that aren’t intended for indexing”.
    Matt Cutts: ” Yes, that’s true. … You really want to have most of your pages have actual products with lots of text on them.”
    My Comment: Key point is the emphasis on lots of text. I would tweak that a bit to “lots of unique text”.
  20. Matt Cutts: “we said that PageRank Sculpting was not the best use of your time because that time could be better spent on getting more links to and creating better content on your site”
  21. Matt Cutts: more on PR sculpting: “Site architecture, how you make links and structure appear on a page in a way to get the most people to the products that you want them to see, is really a better way to approach it than trying to do individual sculpting of PageRank on links”
    My Comment: Google really does not want you to sculpt your site.
  22. Matt Cutts: “You can distribute that PageRank very carefully between related products, and use related links straight to your product pages rather than into your navigation. I think there are ways to do that without necessarily going towards trying to sculpt PageRank”
    My Comment: Still the best way to sculpt your site – with your navigation / information architecture.
  23. Matt Cutts: on iFrame or JS sculpting: “I am not sure that it would be viewed as a spammy activity, but the original changes to NoFollow to make PageRank Sculpting less effective are at least partly motivated because the search quality people involved wanted to see the same or similar linkage for users as for search engines”
    My Comment: An important insight into the crawling and indexing team’s mindset. Their view is that they want to see every page on the web, and they will sort it out.
  24. Matt Cutts: “I could imagine down the road if iFrames or weird JavaScript got to be so pervasive that it would affect the search quality experience, we might make changes on how PageRank would flow through those types of links”
    My Comment: Even though a particular sculpting technique may work now, there is no guarantee that it will work in the future.
  25. Matt Cutts: “We absolutely do process PDF files” … “users don’t always like being sent to a PDF. If you can make your content in a Web-Native format, such as pure HTML, that’s often a little more useful to users than just a pure PDF file” … “There are, however, some situations in which we can actually run OCR on a PDF”
    My Comment: Matt declined to indicate if links in a PDF page will pass PageRank. My guess is that they do, but they may not be as effective as HTML links.
  26. Matt Cutts: “For a while, we were scanning within JavaScript, and we were looking for links. Google has gotten smarter about JavaScript and can execute some JavaScript. I wouldn’t say that we execute all JavaScript, so there are some conditions in which we don’t execute JavaScript. Certainly there are some common, well-known JavaScript things like Google Analytics, which you wouldn’t even want to execute because you wouldn’t want to try to generate phantom visits from Googlebot into your Google Analytics”.

    and

    Matt Cutts: “We do have the ability to execute a large fraction of JavaScript when we need or want to. One thing to bear in mind if you are advertising via JavaScript is that you can use NoFollow on JavaScript links”
    My Comment: You can expect that their capacity to execute JavaScript will increase over time.

  27. Matt Cutts: “we don’t want advertisements to affect search engine rankings”
    My Comment: Nothing new here. This is a policy that will never change.
  28. Matt Cutts: “might put out a call for people to report more about link spam in the coming months”
  29. Matt Cutts: “We do a lot of stuff to try to detect ads and make sure that they don’t unduly affect search engines as we are processing them”
    My Comment: Also not new. Google is going to keep investing in this area.
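
To make the If-Modified-Since point (item 16) a bit more concrete, here is a minimal Python sketch of how a web server might answer a conditional GET. This is my own illustration, not anything Matt described: the `respond` function and its return shape are hypothetical, and a real server would handle more headers than this. The idea is simply that the crawler sends the date of the copy it already has, and the server replies with 304 Not Modified (no body) if the page has not changed since then.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET, honoring If-Modified-Since.

    last_modified must be a timezone-aware UTC datetime for the page.
    if_modified_since is the raw request header value, if the client sent one.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None:
            if since.tzinfo is None:
                # Treat a date without a usable zone as UTC so it can be compared.
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates only have one-second resolution.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers  # not modified: no body needs to be sent
    return 200, headers  # changed, or no usable condition: send the full page


page_changed = datetime(2010, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
status, _ = respond(page_changed, "Mon, 01 Mar 2010 12:00:00 GMT")
print(status)
```

Either way the crawler issues a normal GET, which fits with Matt's point in item 15 that a separate HEAD request rarely saves the server much work.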

So if you got this far, you must be really interested in Matt’s thoughts on search and webspam. Check out the rest of the interview for more!

Will Google make page speed a ranking factor?

Google is obsessed with speed. A tremendous amount of corporate energy is being put into initiatives to speed up the web. Google’s Let’s make the web faster web page asks the question “What would be possible if browsing the web was as fast as turning the pages of a magazine?”. This clues us in to their goal – instant response.

To see Google engineers talk about this, check out the 3 1/2 minute video on this page. You can also check out the video on this page which includes the statement that 100 milliseconds is a recognized threshold for users to notice some sluggishness. You can also see more on Google’s thoughts on performance in this Jake Brutlag post titled: Speed Matters. The testing discussed in this post showed that small increases in load time of search results pages, less than 1/2 second, resulted in a decline in searches performed of 0.2% to 0.6%.

Google seems fully prepared to take on the task of rebuilding the Internet if need be, and they are challenging some of the most basic protocols on which the web was built. They have an initiative in place to redesign the HTTP protocol. Their proposed protocol, known as SPDY, is designed for today’s web environment, which HTTP was not. Testing they have done on SPDY shows a 50% uplift in performance – not bad.

Google has also launched its own Public DNS. The DNS infrastructure plays a critical role in the web, that of converting human friendly web addresses, such as www.stonetemple.com, to machine friendly IP addresses, such as 206.130.117.215. Today’s web pages often involve multiple DNS lookups to load. Speeding up these transactions can only improve overall performance.

Then we have Chrome. CNet published a study comparing the JavaScript performance of Chrome against two versions of IE, Firefox and Safari. Chrome offered 5x to 10x the performance in running the JavaScript tests. Similar data was shown in tests performed by codemeit.

Google is providing some interesting tools for publishers as well. In December 2009 they announced Speed Tracer, a tool for monitoring page load time performance. One key component of the tool is that it allows you to graphically locate trouble spots and then drill down to see what the source of the problem is. In addition, Google Webmaster Tools allows you to get a close up look at your site’s performance:

WMT Speed Measurement Tool

The tool will also let you drill down and get specific suggestions from Google on how to improve site performance:

WMT Speed Up Suggestions

Last, but not least, at Pubcon 2009 in Las Vegas, Matt Cutts stated quite clearly that Site Speed would become a ranking factor. Of course, that does not necessarily mean it actually will be done, but when you look at the overall commitment that Google has to web performance, you can count on it. So when should you begin working on your site performance? I’d say now. Turn site speed into an advantage for your business!

Josh Cohen Interview: Comment Here

In this interview with Josh Cohen of Google News, we go into a lot of detail on the inner workings of Google News. Perhaps the most interesting bit of information was confirmation that Google uses click data as a ranking factor for Google News. If the Google News team is using it, it seems likely that the web team is using it as well. This is something that has been long suspected, but I am not sure I have seen it confirmed before.

We also cover other aspects of what Google News requires, and how it can be an important way that sites who publish news oriented content can obtain visibility for themselves. Check it out!

Less is More

One of the fascinating trends of the 21st century (more or less) is the fact that “less is more”. This saying has been around for a long time, but this decade has brought it to new heights. In our industry there are two stunning examples. One of the best examples is Google:

Google Home Page

A quick spot check shows that Google currently has a market cap of $163 Billion. As we all know, a lot of technology goes into allowing Google to provide such a simple interface, and also to put them in the market leading position they occupy. But, for the user, one of the major advantages of the service is simplicity. Another great example is Twitter:

Twitter

Where else can you find a company without a revenue model that is valued at one billion dollars? Here the nature of how less is offered is a bit different. The limitation is that you can only enter 140 characters. This limit seems to drive people to participate, because they can dash off a quick note really easily. Of course, the real time nature of the platform is important as well, but the “limitation” to 140 characters is actually a feature. I would assert that if the box allowed you to enter 400 characters, usage would drop quickly.

People want simple. There is too much complexity in the modern world. Information and advertising is coming at us from everywhere, and there is no reason to believe it will slow down. This also causes us to want to lean on personal recommendations from others more. Talking to someone who already has the product and seeing how they liked it is another defense mechanism, but that is not the subject of this post.

In the case of Google and Twitter, the utter simplicity of the products is a big key to their success. Just let me do what I want, do it quickly, and don’t flood me with lots of other stuff I don’t care about. Key to this is that the functions served are in high enough demand. For Google, people just want to be able to search the web. They don’t want Google to provide content, just lists of web sites. With Twitter, I don’t want to write a book, or even a blog post, I just want to have real time connectivity with my friends / associates where all I need to do is send off quick notes.

A third example worth mentioning is texting. Many teenagers simply don’t bother with email, or even using phones to make calls. Too many features. Texting is sufficient, even on those phones where I have only a numeric keypad. Besides, this way I can use the same small device to communicate at home, the office, or while on the road.

Of course, less is not always better. There are times when the additional features are desirable. So when is less more? If a large number of people would say this about an activity: “I simply want to do __________ without any hassle”, you have an opportunity for less to be big. Will this trend continue? Our world’s complexity is not going down (it is increasing). The conclusion?

You can expect to see a lot more, of less, in the future.