Vanessa Fox’s Last Google Interview?

Shortly before she left Google, I spoke to Vanessa Fox about what’s going on with Google Webmaster Tools, and we also spoke for a while about duplicate content problems. While I was polishing up the interview transcript, Vanessa left Google, so I may have had the last real interview she did while there. After another week of vacation, she will officially make the leap over to work at Zillow. Best of luck, Vanessa!

We talk at length about key parts of Webmaster Tools, and we also talk about the newest features and the planned features for the product. Vanessa talks quite a bit about the types of features that users have been requesting, and while nothing was committed, it should provide some insight into the types of things you can expect to see in the product in the future.

We also talk quite a bit about duplicate content: what types are easy for Google to detect, and some more advanced duplicate content situations, such as those where actual penalties are put in place.

Dynamically Built Google Custom Search Engines

Google continues to roll out new features in its Custom Search Engines (CSEs). One of the most interesting ones is the implementation of “Linked CSEs”. This feature allows you to build applications that dynamically build a CSE. To whet your appetite for this feature, let me provide you with an example of a dynamic CSE you can build this way. You can read Google’s blog posts about this update here and here.

Using the Linked CSE architecture you can write a program that extracts the data from the API of a major social media site. You could then render that into the standard XML format used by CSEs, and the Linked CSE feature would incorporate the information from that XML file into a CSE for you. Sounds relatively neat, right? But the juice for this application is that when you update the XML file (using the program you created to detect new information in the social media site’s feed), Google will also automatically update your CSE.

Another idea would be that a teacher could build a simple CSE with the web sites that are approved sources of information for their class. While this can be done manually through the interface, it could be set up for the teacher so that all they needed to do was update the list of approved resources on their web site, and a simple utility would convert that into the required XML file for the CSE. Google provides a simple demonstration version of such a utility that it calls MakeCSE. You can try out the MakeCSE tool here, and here you can find URL-based Tools for Linked CSEs.
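To make the teacher example concrete, here is a minimal sketch of such a utility in Python. The element names (Annotations, Annotation, Label), the label value, and the site patterns are assumptions made for illustration; check Google's CSE documentation for the exact format it expects before relying on this.

```python
import xml.etree.ElementTree as ET


def build_annotations_xml(sites, label):
    """Render a list of approved site patterns as a CSE-style XML string.

    The element and attribute names are assumed, not taken from Google's spec.
    """
    root = ET.Element("Annotations")
    for pattern in sites:
        # One entry per approved site, tagged with the engine's label.
        annotation = ET.SubElement(root, "Annotation", {"about": pattern})
        ET.SubElement(annotation, "Label", {"name": label})
    return ET.tostring(root, encoding="unicode")


# Hypothetical list the teacher maintains on their own web site.
approved = ["www.example.edu/history/*", "www.example.org/primary-sources/*"]
xml_text = build_annotations_xml(approved, "_cse_history_class")
print(xml_text)
```

Rerunning the utility whenever the approved list changes, and publishing the result at the URL the Linked CSE points to, is what would keep the search engine current.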

When the CSE program first launched, it required you to log in to your Google Coop account, create a definition, and maintain it in the account over time. In essence, the XML file for your CSE was hosted by Google, based on the information you gave them. Now you can host your own XML file, thus creating the ability to build these dynamic applications.

I suspect that there are a lot of interesting Custom Search Engine applications that can be built using this new functionality.

Google also announced that they are providing tools to help people find Custom Search Engines by a variety of means. For example, these can be found by topic of interest, or through a keyword search that will determine relevance by looking at the search engine’s name, description, keywords, and popular queries. In addition, searches can be constrained by attribute. For example, you may want to consider only those CSEs with at least 10 sites in them, or CSEs searching for volunteers, or only non-profit CSEs.

This new feature should make it easier to find the CSEs of interest for many users. For example, if you were really interested in volunteering to work as an editor for a certain type of non-profit organization, you would now be able to find the available opportunities quickly.

Thanks to Danny Sullivan for prompting me to put up this post.

Interview with Seth Godin

A little over a week ago, I caught up with Seth Godin to talk about the great success of Squidoo. As you might expect, the conversation included a lot of discussion about what makes the web great, and strategies and tactics for making something go viral.

Much of Seth’s focus with Squidoo was just that – to make it go viral. At the heart of making something viral is providing a participant with a reason for telling others. We are not talking about compensation here. We are talking about things like pride of authorship, getting their name widely recognized, etc.

Squidoo is successful because each Lens is designed to be an expert’s view for a given topic area of the web. People create lenses about topics they are passionate about, and then they want other people to see them. Some of these people are passionate about a topic too, so they create another lens, and so forth.

Layer on top of that the fact that they can automatically route their revenue share (yes, there is a 50-50 revenue share) to the charity of their choice, and now you get some interesting dynamics. The charity organizations get in the game, and start promoting Squidoo to their members.

It’s a great story. Check it out.

12 Ways Webmasters Create Duplicate Content

At the recent SMX Advanced Conference in Seattle one of the big sessions was on duplicate content. There is great blow by blow coverage in posts by Vanessa Fox and by Matt McGhee. You can also see an older post about dupe content here by Chris Boggs.

At the start of this session, the search engines all talked about various types of duplicate content. But let’s take a deeper look at the way that duplicate content happens. Here are 12 ways people unintentionally create dupe content:

  1. Build a site for the sole purpose of promoting affiliate offers, and use the canned text supplied by the agency managing the affiliate program.
  2. Generate lots of pages with little unique text. Weak directory sites could be an example of this.
  3. Use a CMS that allows multiple URLs to refer to the same content. For example, do you have a dynamic site where http://www.yoursite.com/level1id/level2id pulls up the exact same content as http://www.yoursite.com/level2id? If so, you have duplicate content. This is made worse if your site actually refers to these pages using multiple methods. A surprising number of large sites do this.
  4. Use a CMS that resolves sub domains to your main domain. As with the prior point, a surprising number of large sites have this problem as well.
  5. Generate pages that differ only by simple word substitutions. The classic example of this is to generate pages for blue widgets for each state where the only difference between the pages is a simple word substitution (e.g. Alabama Blue Widgets, Arizona Blue Widgets, …).
  6. Forget to implement a canonical redirect. For example, not 301 redirecting http://yoursite.com to http://www.yoursite.com (or vice versa) for all the pages on your site. Regardless of which form you pick to be the preferred form of URL for your site, someone out there will link to the other form, so implementing the 301 redirect will eliminate that duplicate content problem for you, as well as consolidate all the page rank from your inbound links.
  7. Point your on-site links back to your home page at http://www.yoursite.com/index.html (or index.htm, or index.shtml, or …). Since most of the rest of the world will link to http://www.yoursite.com, you have now created duplicate content and divided your page rank.
  8. Implement printer pages without using robots.txt to keep them from being crawled.
  9. Implement archive pages without using robots.txt to keep them from being crawled.
  10. Use Session ID parameters on your URLs. This means that every time the crawler comes to your site, it thinks it is seeing different pages.
  11. Implement parameters on your URLs for other tracking related purposes. One of the most popular is to implement an affiliate program. The search engine will see http://www.yoursite.com?affid=1234 as a duplicate of http://www.yoursite.com. This is made worse if you leave the “affid” on the URL throughout the user’s visit to your site. A better solution is to remove the ID when they arrive at the site, after storing the affiliate information in a cookie. Note that I have seen a case where an affiliate had a strong enough site that http://www.yoursite.com?affid=1234 started showing up in the search engines rather than http://www.yoursite.com (NOT good).
  12. Implement a site where parameters on URLs are ignored. If you, or someone else, links to your site with a parameter on the URL, it will look like dupe content.
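Several of the items above (6, 7, 10, 11 and 12) come down to one page answering at more than one URL. Here is a minimal sketch in Python of collapsing those variants to a single preferred form and deciding when a 301 redirect is needed. The host names, default document names, and parameter names (including "sessionid") are assumptions for illustration, not settings from any particular CMS.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical settings; adjust them to your own site.
PREFERRED_HOST = "www.yoursite.com"
HOST_ALIASES = {"yoursite.com", "www.yoursite.com"}
DEFAULT_DOCS = {"index.html", "index.htm", "index.shtml"}
IGNORED_PARAMS = {"affid", "sessionid"}


def canonical_url(url):
    """Return the single preferred form of a URL on this site."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    # Collapse /index.html and friends onto the directory URL.
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCS:
        path = head + "/"
    # Drop tracking and session parameters; keep the others in order.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in IGNORED_PARAMS]
    # The fragment is dropped because browsers never send it to the server.
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))


def redirect_target(url):
    """Return the URL to 301 redirect to, or None if url is already canonical."""
    target = canonical_url(url)
    return None if target == url else target


for example in ("http://yoursite.com",
                "http://www.yoursite.com/index.html",
                "http://www.yoursite.com?affid=1234"):
    print(example, "->", redirect_target(example))
```

In a real request handler you would store the affiliate ID in a cookie before issuing the redirect, as described in item 11; that step is left out of the sketch.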

There are many ways that people intentionally create duplicate content, by various scraping techniques, but there is no need to cover that here.

There are a number of gray area techniques, such as computer generated content. There was a very interesting presentation about this by Mikkel deMib Svendsen at SMX Advanced that talked about Markov Chains as a technique for generating content. One key to doing this well is to do it well enough that the content is not seen as duplicate. The second key is to generate content that is meaningful for an end user.

When search engines look for duplicate content, they start by filtering out all the content on the page which is template based, such as the navigation on the sides, top, and bottom. They recognize this as being in common, and do not hold this against you. They base their evaluation on the content that is intended to be unique to that page.
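How the engines actually separate template from content is not public. As a rough illustration of the idea only, here is a Python sketch that treats any text block appearing on more than half of a site's pages as template and drops it. The `unique_blocks` function, its `max_share` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter


def unique_blocks(pages, max_share=0.5):
    """Drop text blocks that appear on more than max_share of the pages.

    pages maps a URL to the list of text blocks found on that page.
    """
    # Count how many pages each block appears on (at most once per page).
    seen_on = Counter(block
                      for blocks in pages.values()
                      for block in set(blocks))
    limit = max_share * len(pages)
    return {url: [block for block in blocks if seen_on[block] <= limit]
            for url, blocks in pages.items()}


site = {
    "/a": ["Home | Products | Contact", "Blue widgets ship in two days.",
           "Copyright Example Inc."],
    "/b": ["Home | Products | Contact", "Red widgets are on back order.",
           "Copyright Example Inc."],
    "/c": ["Home | Products | Contact", "Our returns policy lasts 30 days.",
           "Copyright Example Inc."],
}
content = unique_blocks(site)
print(content["/a"])
```

Only what survives this kind of filtering would then go on to the duplicate comparison.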

Search engines will look at and compare each of the pages on your site to other pages on your site, as well as pages on other sites. One of the known techniques for doing this is the Sliding Window technique. It looks at the unique content on your page a fixed number of characters at a time. For example, it may look at the first 50 characters in the unique content section of your page, starting with the 1st character.

It then compares that snippet with other snippets as a part of its duplicate content check. It then looks at 50 characters starting with the 2nd character in the unique content section of your page, then it starts with the 3rd character, the 4th character, and so forth. One way you can try to see how you are doing is to use a Page Similarity Checker.
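The sliding window described above can be sketched in a few lines of Python. This is an illustration under assumptions, not how any search engine scores pages: it collects every window as a set and compares two pages by the share of windows they have in common (Jaccard overlap), which is a choice made here for simplicity. The function names and sample text are hypothetical.

```python
def shingles(text, width=50):
    """Return every width-character window of text, one per start position."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= width:
        return {text}
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def similarity(a, b, width=50):
    """Share of windows the two texts have in common, from 0.0 to 1.0."""
    windows_a, windows_b = shingles(a, width), shingles(b, width)
    return len(windows_a & windows_b) / len(windows_a | windows_b)


alabama = "Alabama blue widgets are the best widgets you can buy anywhere"
arizona = "Arizona blue widgets are the best widgets you can buy anywhere"
print(similarity(alabama, arizona, width=10))
```

A short window is used in the example because the sample text is short. Pages that differ only by a simple word substitution, like the state pages in item 5 of the list, share most of their windows and so score close to 1.0.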

In general, search engines do not penalize you for duplicate content. When they detect duplicate content, they simply try to choose only one of the duplicate pages to return in the search results, and they may not choose yours. They can make this choice on a page-rank-like basis, or pick whichever copy of the content they detected first.

In extreme cases, I have actually seen algorithmic penalties applied. This is rare, and should only happen to you if your site is crawling with duplicate content, and has basically nothing else.

The last thing I want to note is that the main focus of a webmaster should be on delivering pages of unique value. Uniqueness is important for many reasons, not least because it makes it far more likely that your site can obtain links. The primary value in knowing how to avoid unintentional duplicate content is to avoid the division of your page rank. Links to duplicate pages are wasted, and marketing your site is hard enough without shooting yourself in the foot.

Other suggestions for ways people unintentionally create duplicate content? Let me know, and I will add them to the above list.

Matt Cutts and Tim Mayer – tidbits from SMX

While I was out at SMX Advanced, I sat through the penalty box summit. As always, Tim Mayer and Matt Cutts had a few interesting things to say.

Tim Mayer started by noting that Yahoo! remains committed to search. Evidently, there was a rumor floating around in the industry that Yahoo! was going to reduce their investment in search. I had not heard the rumor myself, but it’s not true.

Also of interest was that Yahoo! has now added to Site Explorer a way to disclaim an inbound link. I think this is really cool. It may not be something that most webmasters need, but there are definitely times when it would be useful. For example, if you ever bought a link from the Washington Times, who used to sell text links that people bought for SEO purposes, you probably discontinued it once you learned that they don’t count any more.

However, you would also find that the Washington Times never takes pages down, so the links are persistent. Given that this would be a possible black mark for your site from an SEO perspective, you would want them removed. But the Washington Times is not going to remove them for you, they are a newspaper with other things to do. So disclaiming that old purchased link could be a great idea. The good news is that Yahoo! now makes that easy.

Tim also noted that there are legitimate uses for almost every technique. For example, cloaking is considered OK if you are doing it to deliver content on a geographic basis (e.g. the Spanish language site to Spanish speaking countries). Ultimately, it’s all about Intent and Extent: the intent with which you do something, and the extent to which you do it. So beware the consequences if a simple examination of your site or its linking pattern reveals that you obviously tried to disguise something from the engines.

When Matt got up, one of the more interesting things he had to say was that Google was not averse to manual action. His comment was intended to address the perception that Google does everything by algorithm. In fact, Google does have a process by which sites get flagged for manual review. Tie that in to Google’s recent efforts with spam reporting forms, and Google’s statement that they review all spam reports filed using the Google Webmaster Tools authenticated version of the form, and it’s clear that they are willing to act on these reports, if it’s appropriate to do so.

Also of interest was Matt’s comment that Google does send proactive emails to webmasters who violate the Webmaster Guidelines, if they believe the violation is unintentional. In some cases, these emails actually detail the specific problem.

Matt also said that Google does make use of 30 day penalties as a warning. This means that if your site disappears for 30 days and then magically pops back in, you have been warned. Don’t relax: you do have a problem that you need to find and fix, because the penalty will most likely come back.

The most interesting suggestion made by the audience during this session was that Google Webmaster Tools, and Yahoo! Site Explorer should add functionality to mark sites that are currently being penalized or banned with some sort of readily visible flag. This would help webmasters understand when that drop in rankings is due to a penalty, as opposed to changes in the algorithms.

15 Methods for Paid Link Detection

Many major SEO firms make it a standard practice to recommend the purchasing of links to their clients. The search engines actively discourage this practice, and do their level best to detect those paid links. Here are 15 things they can use as signals that a link is possibly a paid link:

  1. Links Labelled as Advertisements: The search engines can scan for nearby text, such as “Advertisement”, “Sponsors”, “Our Partners”, etc.
  2. Site Wides: Site wide linking is unnatural, and should be a rare part of your link mix (purchased or not). The only exception to this is the interlinking of all the sites owned by your company, but this presumes that the search engine will understand that all of your sites are from your same company. In general, site wides are a serious flag.
  3. Links are Sold By a Link Agency: Of course, link agencies are knowledgeable about the link detection methods listed here, and do their best to avoid detection with the links they sell.
  4. Selling Site has Information on How to Buy a Text Link Ad: Search engines can detect sites that provide information on how to advertise with them. This combined with other clues about links being sold on the site could lead to a review of the site selling the ads, and a discounting of the links.
  5. Relevance of Your Link: It’s a powerful clue if your link is not really that relevant to the page it’s on, or the site it’s on.
  6. Relevance of Nearby Links: Another clue would be the presence of your link among a group of links that are not tightly themed.
  7. Advertising Location Type: The search engine can detect when your link is not part of the main content of the page. For example, it appears in the left or right column of a 3 column site, and the main content is in the middle.
  8. Someone Reports Your Site for Buying Links: Who would do this? Your competitor! If your competitor submits an authenticated spam report to Google, it will get looked at, and acted upon.
  9. Someone Reports Your Site for Some Other Reason: Perhaps your competitor does not recognize you are buying links, and turns you in for something else. Once this happens, the search engine will take a look at all aspects of your site, not just the reported issue.
  10. Someone Reports the Site you Bought Links from for Selling Links: A competitor of yours can do this, or a competitor of the site selling links can do this. Once a search engine figures out that a site is selling links, it’s possible that this could trigger a deeper review of the sites that were buying those links.
  11. Someone Reports the Site you Bought Links from for Some Other Reason: As before, this can lead to the search engine discovering that the site is selling links, even though it was not the core subject of the Spam report filed against it.
  12. Disgruntled Employee Leaves Your Company, and Reports Your Site: For decades, many companies have had a practice of escorting fired (or laid off) employees out of the building. The reason for this approach is that people get upset when they lose their job. However, even this practice would not prevent such a person from reporting your site in a spam report to a search engine. Even though that may be a violation of the confidentiality agreement you probably have with your employees, you would never know, because there is no transparency in spam reporting.
  13. Disgruntled Employee Leaves the Agency You Used, and Reports Your Site: This same scenario can play out with an employee leaving the link agency you used. This form of disgruntled employee can report either your site directly, or the agency itself.
  14. Disgruntled Employee Leaves the Company of the Site You Bought Links from, and Reports Your Site: Finally, it can also happen with someone leaving the company you bought the links from. This type of disgruntled employee can report your site, or the site they used to work for.
  15. Internal Human Review: Last, but not least, the search engine can do a human review. In general, search engines don’t do spontaneous reviews of sites, and wait for things detected algorithmically, or a spam report, to trigger a deeper review. But, you could certainly imagine that search engines could make an overt effort to clean up the search results in portions of their index they perceive to be spammy.
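Two of the signals above lend themselves to a simple illustration: labels near a link (item 1) and site wide links (item 2). The Python sketch below is a guess at the general shape of such checks, not a description of what any search engine runs. The function names, the `min_share` threshold, and the sample data are hypothetical; the label list comes from item 1.

```python
AD_LABELS = ("advertisement", "sponsors", "our partners")


def labelled_as_ad(nearby_text):
    """Signal 1: text near the link contains a typical advertising label."""
    lowered = nearby_text.lower()
    return any(label in lowered for label in AD_LABELS)


def site_wide_targets(outlinks_by_page, min_share=0.9):
    """Signal 2: external targets linked from nearly every page of a site.

    outlinks_by_page maps each page URL to the external URLs it links to.
    """
    counts = {}
    for targets in outlinks_by_page.values():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    needed = min_share * len(outlinks_by_page)
    return {target for target, n in counts.items() if n >= needed}


selling_site = {
    "/": {"http://buyer.example.com/", "http://news.example.org/story"},
    "/about": {"http://buyer.example.com/"},
    "/contact": {"http://buyer.example.com/", "http://maps.example.net/"},
}
print(site_wide_targets(selling_site))
print(labelled_as_ad("Our Partners: Acme Blue Widgets"))
```

Neither check proves a link was paid for. As the list suggests, signals like these would more plausibly be combined, or used to trigger a closer review.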

Search Engine Courses of Action

In the case of Google, it is known that one of the basic policies is to punish sites that sell text links by terminating their ability to pass link juice. This is essentially a first course of action. Once this is done, Google could look more closely at the selling site and the purchasing sites for other signs of spammy behavior.

The search engines also take stronger actions at times, such as an algorithmic penalty, or banning a site from their index. I don’t know exactly how those determinations are made, but I believe that there are 3 major triggers for such action:

  1. It can be the cumulative effect of several signals of poor site quality.
  2. The search engine determines that a site has bought links on a large scale.
  3. Upon human review, the search engine detects a clear pattern of an intent to deceive them.

Summary

Plenty of businesses are successful with a link buying strategy. However, the search engines are investing more and more effort into their detection. At STC, our preference is to focus on obtaining links through great content, and making people aware of what we (our clients) have. But we place a very high priority on very high value links.

These are the types of sites that are very difficult to buy links from. For one thing, when these higher profile sites sell links, it does not take that long for it to become public knowledge. Just ask United Press International, who recently promoted the sale of links for improving page rank. UPI has discontinued the practice because of the furor it created.

This also has great synergies with the notion of investing time in developing great content for users. In a world with increasing personalization by the search engines, this is becoming very important, and over time may well have a larger impact on your rankings than the links you get. You can see the search engines shifting from having web sites vote on your site, to having users vote on your site. One way or another, this is coming to a search engine near you.