Rand Fishkin is among the search marketing field’s thought leaders, serving 30,000+ daily readers on the SEOmoz blog. Rand has appeared in Newsweek magazine and been quoted by the Washington Post, USA Today and dozens of other publications. He is a frequent traveler and speaker on SEO, blogging & web marketing at worldwide conferences, but loves his hometown of Seattle, WA.
Eric Enge: Let’s start with an overview of what Linkscape is and how people might use it.
Rand Fishkin: Linkscape is really two big things right now. One is the index, the crawl we built of the World Wide Web, which is about thirty billion pages right now and probably going up to something close to double that in November. Talking to Ben yesterday, it sounds like it’s going to be just a week or two more before we are doing that update.
That index is used to construct a link graph, and then that link graph feeds into the public tool that you see at www.seomoz.org/linkscape, which is a link research and link intelligence tool. So, by learning how important certain sites and pages are, how many links point to them, and from how many different domains, we are learning all sorts of details about the links. What’s the anchor text, what’s the target URL, what’s the relative importance, all that kind of stuff. It’s a very robust tool for link intelligence. If you could dream of all the things you wish you had in Site Explorer, we are trying to do that.
Eric Enge: Right. Well, that sounds like a good objective. How did you go about assembling the index?
Rand Fishkin: We looked at a lot of papers from research staff. We did some test crawling back in December and January. One of the things that we did early on was download Wikipedia and DMOZ and build little mini link graphs internally, just to see how we liked our metrics, and whether we thought they were good or not. And then we went out crawling the web using a web crawler to discover links.
We started with a list of trusted seed sites. And we basically went through a list of all the trusted seed sets that other academic and IR conferences and papers had proposed previously. We dug through those and found ones we liked and ones we didn’t so much like. It turns out that a lot of EDU links you would think would be great seed sites link to some pretty scuzzy stuff.
So, what you really want to do is find between a few hundred and a thousand really solid websites and pages that don’t link to anyone bad. Then you get that Kevin Bacon effect where 6 degrees of separation shows you some pretty bad people, and at 5 degrees they are not so bad. At 4 degrees they are pretty decent, 3 degrees they are really good, and 2 degrees they are great, that kind of thing. So, the further away you go, the further in link hops you go from trusted sites, the worse the Internet gets.
Starting with those seed sites and crawling out from there actually gives us our mozTrust score. So we basically used those trusted sites and crawled from there, and just kept on crawling until we got to a place where we were relatively happy. And that’s where we were happy enough for a beta launch, which was this thirty billion page index that you see now. We like it a lot, because it is very domain diverse.
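The "link hops from trusted sites" idea described above can be sketched as a breadth-first search over a link graph. This is only an illustration under stated assumptions: the toy graph, the domain names, and the function name `hops_from_seeds` are hypothetical, and the interview does not say how mozTrust is actually calculated from these distances.

```python
from collections import deque


def hops_from_seeds(links, seeds):
    """Return the minimum number of link hops from any seed to each reachable page."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in distance:
                # First time we reach this page, so this is its shortest hop count.
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


# Hypothetical toy link graph: each page maps to the pages it links to.
toy_links = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example"],
    "b.example": ["c.example", "d.example"],
    "d.example": ["spam.example"],
}

hop_counts = hops_from_seeds(toy_links, ["seed.example"])
for page, hops in sorted(hop_counts.items(), key=lambda item: item[1]):
    print(f"{hops} hop(s): {page}")
```

Pages that are never reached from a seed simply do not appear in the result, which is one simple way to treat them as untrusted.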
Basically, the index bias, rather than going very deep into big sites, tries to go very, very wide and cover as many sites as possible. So, for example, I think Nathan Buggia was here a couple of months back for a Whiteboard Friday. He said that at Microsoft they have about 75 million domains that they think are worthy of including in their index that they show in the main search results. In our crawl, which is probably considerably smaller than Live’s, we found 150 million domains. So, we really try to be very, very broad as far as domains go.
Eric Enge: So, you are focused more on breadth than depth, so if you had a hundred thousand page site, you only crawl part of it?
Rand Fishkin: Exactly. And, you can totally see that when you plug stuff into Linkscape. You can see that we have a lot more external links than internal links, right? So, you’ll look at something like Hulu.com and you will see that they know about 20,000 unique external links but only have 1,500 internal links. Clearly there are more than 1,500 pages on Hulu, but we haven’t crawled with the same depth. So obviously, there is some disparity between us and the major search engines right now, and we are looking to close that gap in the next three to six months.
Eric Enge: Right. So, is any of your index a result of licensing data from a third party?
Rand Fishkin: Right now, very little of it is. Technically none of it is crawled under the SEOmoz brand. There is no crawler out there named SEOmoz, there is no crawler that comes from our servers. So I guess it technically all comes from a third party, but a third party that’s controlled by us, a third party where we dictate the crawl and how the process works. We only backfill that when we find sites and pages where we can’t reach them for some reason.
So, a good example would be something like economist.com. Right now we don’t have information on The Economist because it turns out they actually block all robots except Yahoo, Google, Microsoft, Web Archive, Archive.org and a couple of others. So, we are going to end up buying the Economist crawl from archive.org and including it that way. I think that might be a good way to think about it.
Basically if we are not getting the data one way to build our index, we’ll go get it in another way. Now, there are some folks who block everyone except Google, and it turns out that for those guys, especially if they put noarchive on their pages, there is pretty much no way to get that data. And, we are okay with that. We are talking about maybe one hundred-thousandth of the web’s pages, probably less than that even.
Eric Enge: Sure. So, there is a lot of speculation out there that your bot is called dotbot.
Rand Fishkin: I think there is a lot of speculation about that.
Eric Enge: Can you tell me about that?
Rand Fishkin: So the website that runs dotbot is on our list of sources, along with others. So yes, that is absolutely one of the places that we may be pulling from, now or in the future, but we are not saying specifically that we only run one of these spiders. Right now it’s just, here’s a list of all the sources, so that if you want to block, you could potentially block all of the sources that we could pull from.
Basically the thinking behind that is just if you block us, and then you see your information on Linkscape, you are probably going to be very upset. So, what we say is if you want to keep us out, no matter where we get your information from, use the meta noindex. Instead of meta robots noindex, use meta SEOmoz noindex, and we will keep you out, we won’t show you in our list of results.
If you want to block specific bots, you can see all the potential sources and all the potential bots that we’ve got on our list and you can block any or all of them, or none of them, whatever you choose.
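As a rough illustration of blocking one specific bot while leaving the rest alone, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser` and checks what it permits. The user-agent token `dotbot` is taken from the conversation above; the sample rules, the `SomeOtherBot` name, and the example.com URL are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: shut out one named bot, allow everyone else.
robots_txt = """\
User-agent: dotbot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() answers: may this user agent request this URL under these rules?
dotbot_allowed = parser.can_fetch("dotbot", "http://www.example.com/page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "http://www.example.com/page.html")

print("dotbot allowed:", dotbot_allowed)
print("other bots allowed:", other_allowed)
```

A check like this only tells you what well-behaved crawlers are asked to do; it does not enforce anything against a bot that ignores robots.txt.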
Eric Enge: Right. So, the list of bots on your site is a comprehensive list of sources?
Rand Fishkin: That’s right. If we get blocked or find new sources, we may use those. Anytime we do that, we will update that list so that you know where we are pulling from, so if you want to block other sources you can. We are very aggressive about getting data, there is no doubt about it. But we don’t want to go over the line and be just totally evil and cloak our bot, or disguise ourselves and use a headless Firefox browser or something like that to crawl. So, we do disclose all the stuff there. Before we pull from any bot, we make sure that it obeys robots.txt, and all the sources you see listed there obey robots.txt.
Eric Enge: Right. So, the bots that you are making use of, you are leveraging their existing crawl data, right? As opposed to having them do a custom crawl for you?
Rand Fishkin: In some cases but not all. So, for example, I believe it’s Exalead that will crawl for you for specific stuff. So, you can contact them and say can you crawl this, and I want this crawl data, right? And, I think the guys with Majestic SEO are the same way, right? So, if you say I want you to crawl this, can you crawl this for me and tell me that, they can do that too. Of course there are open source crawlers, so anyone who wants to download an open source bot and just crawl a few hundred or few thousand pages can do that too, and build their own stuff, that kind of thing. But yes, we may buy data to backfill anything that we can’t get on our own, or that we have trouble getting on our own.
Eric Enge: Right. So, what about the people out there who have done the math and say there is no way you could have crawled more than 7,000,000 pages in the time that you are talking about?
Rand Fishkin: I think that if you go and look at Linkscape and you start to add up the number of links and all kind of stuff, you will see that our index is much larger than that. There are 30,000,000,000 URLs, and I don’t know that we’d have any particularly good reason not to disclose that number, right? If the number was actually 10 billion I would definitely say 10 billion because that would be accurate.
I think Danny Sullivan was very wise when he said that number of URLs crawled is such a terrible metric. It really is meaningless. So, I would say if 30 billion sounds like it doesn’t mean anything to you, it really doesn’t mean anything. Go look at the tool and see how many domains and pages we actually know about that are pointing to your site or the site that you are interested in looking at.
Eric Enge: Right. I agree that 10 billion or 30 billion doesn’t matter, but there is speculation that it’s really 7 million.
Rand Fishkin: 7,000,000? I mean, I think we have like 7,000,000 or more links to Google. Certainly not every page on the web links to Google. Google.com, number of links: let’s see, we’ve got 178 million unique links to pages on Google.com. So, that’s probably more than 7 million pages that we had to crawl to get that.
If you want to dispute it we are not going to be releasing our index in a downloadable format. It is tons of terabytes of data. Certainly if you spend any time looking through the tool and looking at the number of links pointing to even very small domains, you’ll get a sense that they clearly have a ton of pages. And, the 30 billion number just comes from the index, obviously I haven’t actually seen it. I just asked Nick and Ben how many pages we have. And they said we’ve got right around 30,000,000,000.
Eric Enge: Right. Just to summarize, at the end of the day you’ve engaged multiple sources to sample the data and used a variety of different seeds?
Rand Fishkin: For the crawl that we designed, the seed set is something that we set right at the beginning. So that doesn’t really change. And then, we know from that seed and from the crawler we build out what we want to crawl next, what we haven’t seen yet, and where we have gaps in our data. And, that’s where we do the backfilling. If we haven’t crawled something, then we’ll say alright, that’s on the schedule for next month’s update, or this looks like it’s very fresh and different than last time we crawled it, so let’s make sure we crawl it again next month.
Eric Enge: So, you talked a little bit about this before, but let’s expand upon it. What should a site webmaster do if they don’t want to be in the SEOmoz index?
Rand Fishkin: There are a bunch of things that they can do obviously. I think there is the SEOmoz meta tag, which says SEOmoz, don’t include me in your results. I think there are two problems that a lot of people have with that. One is that you can’t do it with one command for a whole site, which is frustrating. The other is that when we pull from third party sources, they wouldn’t know not to crawl you.
So, basically we have a big list of gaps in our data, right? So, the meta tag tells us not to include this page in our index, and we know not to do that. Note that using robots.txt is not the way to make sure that Google never includes your URL, because then they will still show it, just without any crawled data.
You have to use the meta noindex, because with the meta noindex they will know not to show you in their results. So, we are the same way. If you use the meta noindex we will know not to show you in our results. Obviously, you can use robots.txt to block any or all of the sources.
And, if you are concerned about other bots, there is a company called Syntryx that builds its own World Wide Web index, and then sells something very similar to Linkscape for around $25,000 a year, and there are plenty of other private label tools that do similar things. I think there are numerous other ones who do stuff like this as well, and that Linkscape is not as unique as we hoped it was. It’s just one of the only public ones, one of the only ones that anyone can get access to.
Eric Enge: And there is Attributor and Visible Technologies?
Rand Fishkin: Sure, yeah. Attributor is a little different, they don’t crawl for links, they crawl for content. So, they try and find people who’ve stolen content across the web. They do the web crawling and say oh, it looks like you are stealing content from the AP; you owe them a link, or a license, or you have to pay them, or whatever. I think Visible is for reputation management and that kind of stuff. But, in any case if you want to block any bot out there, you should use the robots.txt. I think that some other smart people have said that if you want to be really careful, you can restrict by IP address. So, you can basically say aha, you are a bot, you are requesting my robots.txt, let’s check if you are the IP address that really matches up to your bot. And, I think that’s a great way to keep prying bots out.
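The IP check described above can be sketched as a lookup against address ranges that a bot operator has published. This is a minimal sketch, assuming such ranges are available: the table `KNOWN_BOT_NETWORKS`, the bot name `examplebot`, and the addresses (reserved documentation ranges) are all hypothetical.

```python
import ipaddress

# Hypothetical table of address ranges published by each bot's operator.
# 192.0.2.0/24 is a reserved documentation range, used here as a placeholder.
KNOWN_BOT_NETWORKS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
}


def is_genuine_bot(user_agent_token, remote_ip):
    """Return True only if the request IP falls inside a range listed for the claimed bot."""
    networks = KNOWN_BOT_NETWORKS.get(user_agent_token.lower(), [])
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in networks)


genuine = is_genuine_bot("ExampleBot", "192.0.2.15")
impostor = is_genuine_bot("ExampleBot", "203.0.113.9")

print("request from listed range accepted:", genuine)
print("request from unlisted range accepted:", impostor)
```

A server could run a check like this when a client claiming to be a known bot requests robots.txt, and refuse further requests from clients that fail it.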
Eric Enge: Right. So, if you use the noindex tag you do not include the page; is that correct?
Rand Fishkin: That’s right. For the first 9 days after we launched Linkscape, we’d actually show you all the links from meta noindex pages. Then, someone said hey, wait a minute, that’s not cool, you shouldn’t be showing those. So we removed it. Now we basically hide all of those by default. You can no longer see meta noindex links inside SEOmoz, and that’s for robots. Now we have this new tag, SEOmoz, to basically say I am fine with this showing up in other search engines, I just don’t want to show up in you guys.
Eric Enge: Now, by keeping out, does that mean that you won’t show the links originating on that page or you won’t show the links going to that page?
Rand Fishkin: The former. Basically we are just like the search engines, just like Google or Yahoo. You could run a link search at Yahoo, Google, Exalead, or Live when they had theirs active. You could get links to a page because that data comes from those pages that link to it. What we won’t show is that page in any of our results. So, for example, if you are linking to spammyviagra.com or whatever, we won’t show you in our list of links to spammyviagra.com when someone queries. It’s just like the major search engines.
Eric Enge: Sure. So what happens at the end of the day if somebody who is really determined didn’t want their competitor to be able to get back link structure on their site?
Rand Fishkin: They should make sure that everyone who links to them uses the meta noindex, meta SEOmoz noindex tag.
Eric Enge: Right. Practically speaking.
Rand Fishkin: Not very practical; no.
Eric Enge: But, you can go to linkdiagnosis.com and get a less robust set of data perhaps, but you’d still get a back link profile on someone’s site that you can download to a spreadsheet.
Rand Fishkin: Yes. And, the same is true of Yahoo Site Explorer. You can go in there and export a CSV and see all those links that point to it right now. There are some people who have commented online that keeping yourself out of SEOmoz is not particularly valuable. You really want to keep out everyone who links to you, and that’s a much more challenging task. Linkscape is one of at least half a dozen products that I know of that will show you this data.
Eric Enge: Right. So are you being held to a different standard?
Rand Fishkin: You know, though, I think that is SEOmoz’s own fault. The messaging that we’ve always had, and the brand that we’ve always created, is one of extreme transparency and extreme respect for Webmasters, and this, I think, is a break from that. We are basically not fulfilling the expectations of that high level of transparency that Webmasters have had of us since we launched SEOmoz.
I am conflicted about it, I really am. But at the same time we feel really strongly about this product. If we didn’t build it, we’d want someone else to build it, and obviously other people have built it. I think to a certain degree there is a conflict between the brand that we created and the expectations that people have for us and how we are going to treat Webmasters.
We’ve always had competitive link intelligence tools in SEOmoz, but they pulled from other sources, from Yahoo, or Google, or Microsoft, or other people. This one is kind of like oh, well now you are the source of it; you are the one crawling me and grabbing my data and that kind of thing. And I can’t really criticize folks who are upset about that. I think that they have a legitimate point, and I do feel bad about it. I think there is more we want to do.
A couple of people have suggested that rather than having to pay for your data for your own site, SEOmoz should offer, just like Google Webmaster Tools does, a way to register your domain as your own. And then, you can go and see the link intelligence report for your domain in advance for free. I think that is an excellent suggestion and it’s also very valuable from a marketing perspective.
Eric Enge: Something that also strikes me as quite interesting is that you have tools like HitWise and ComScore that give you detailed information on other people’s pay per click campaigns.
Rand Fishkin: Yes, you can see where people go on the web and what they search for, and all that kind of stuff. If you put it in a relative sense SEOmoz doesn’t seem like a bad guy. But, if you put it in the sense of hey, this is a group that’s always had this ethical and moral stance towards transparency and made it a core of their identity, and now they are being a little dodgy about how they grab, or the fact that they are so aggressive with grabbing, and that you actually have to jump through several hoops if you are not willing to use that meta SEOmoz tag to keep your information out, that kind of stuff.
And so, I understand that frustration and that anger, and I don’t want to downplay it. I don’t want to say oh well, they are all wrong, right? They are not all wrong; I think there is some legitimacy to those concerns.
Eric Enge: Right. At the end of the day, the value you see in the tool is something that you feel strongly enough about that you will deal with the criticisms.
Rand Fishkin: That’s the bottom line. And this project has been a dream of mine since 2005. I think I mentioned in my presentation at SMX East that Todd Malicoat, Nick from Threadwatch, and others all said hey, this is a really good idea; someone should build a crawler and an index of the World Wide Web that’s for SEOs. It should use SEO types of metrics, and build our own version of PageRank so we don’t have to rely on Google’s. Then we can compare it to Google’s, and all that kind of stuff. And that, I feel, is a really important goal and something that we really want to do. It’s something that we weighed against potential concerns and decided to build.
Eric Enge: Let’s go ahead and talk about ways that people can use this and things that they can do to make their lives easier as an SEO.
Rand Fishkin: Well, there are a bunch of things that always frustrated me about Yahoo Site Explorer. One of the things is that you get a thousand links and there is no information about why that thousand is there instead of another thousand, or what order they are in, whether it’s most important to least important, or whether they are all scrambled up. They will show you nofollowed and followed links, and they won’t mark which is which.
They don’t show you the target URL where that link is pointing to if you’ve done a domain wide link search. And, they won’t tell you the anchor text. There is no indication of how much link juice it is passing. So, we always wanted those kinds of things in link analysis tools. Scraping Yahoo and then trying to reverse all that data is just not an efficient way to do it. Building your own index I think is the only good way to do it.
Then, we have wanted to have all this information about people for a couple of years now. We have been saying do Nofollow PageRank sculpting, right? Other people said that Nofollow PageRank sculpting is a sure sign, it’s a big flag to Google, because almost nobody does it except for SEOs. And, there was really no way to refute that until we built this index of the World Wide Web. So out of two hundred or so billion links, it turns out that almost 2% of all links on the web use nofollow. And, almost half of them use it internally to point to a page on their own site; so that turns out to be a massive group of websites and webpages.
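To make the nofollow statistics above concrete, here is a small sketch of how the links on a single page could be tallied: total links, nofollowed links, and nofollowed links that point back to the same host. It uses Python's standard `html.parser`; the class name `NofollowCounter`, the sample HTML, and the example hosts are hypothetical, and this is not a description of how Linkscape computes its aggregate numbers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class NofollowCounter(HTMLParser):
    """Count links, nofollowed links, and internal nofollowed links on one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc.lower()
        self.total = 0
        self.nofollow = 0
        self.internal_nofollow = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        self.total += 1
        # rel can hold several space-separated tokens, e.g. rel="nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow += 1
            # Resolve relative links against the page URL before comparing hosts.
            target_host = urlparse(urljoin(self.page_url, href)).netloc.lower()
            if target_host == self.page_host:
                self.internal_nofollow += 1


sample_html = """
<a href="/about">About</a>
<a href="/login" rel="nofollow">Log in</a>
<a href="http://other.example/" rel="nofollow">Other site</a>
<a href="http://other.example/page">Other page</a>
"""

counter = NofollowCounter("http://www.example.com/index.html")
counter.feed(sample_html)
print(counter.total, counter.nofollow, counter.internal_nofollow)
```

Summing these three counters across every crawled page would give the kind of web-wide percentages quoted in the answer above.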
Eric Enge: Right. And, it includes sites like Citysearch?
Rand Fishkin: Oh, yes. I mean it includes all kinds of sites. Facebook, and Yelp, and MySpace. Many people are using PageRank sculpting. I think that’s great information to have about the web as a whole. One of the other fascinating things is that twice as many pages use the 302 redirect as use the 301. And that’s probably not entirely smart, right?
Eric Enge: Well, you do realize that every web developer at birth gets subjected to radiation, which makes 302 the default redirect they use.
Rand Fishkin: That probably causes some of that. All of that aggregate data is incredibly valuable for this analysis sort of stuff. And we have all the individual data and features that you wish you had in Yahoo Site Explorer. So, if you want to know what is making a site rank above another, you can go look and see a bunch of metrics about that particular URL, including who is linking to them and what the anchor text is saying.
A lot of times a really, really crappy page on a site like Wikipedia will out-rank a fantastic page or the homepage of a really important niche website, because that’s how Google works. They are kind of biased towards big, important domains right now. I feel like the ability to see that, and the ability to know a lot about Off Page SEO is important, because it’s just so frustrating to try to figure out what the engines are doing or which direction they are going without having this.
When we did the search engine ranking factors on SEOmoz we asked all sorts of experts (including you, Eric!) what they think the most important elements are. And we saw people from very different backgrounds, people who tend to disagree with each other a lot on the blogosphere, all come together with some pretty similar answers. They all said that links are really important and domain authority is really important.
Eric Enge: Right. And, if you are trying to initiate a new link building campaign, and you are trying to figure out where to go, having that ability to get some sense of what the most important targets are is very valuable.
Rand Fishkin: Yes. I think that’s one of those really fun things, and I think it’s a process that will refine over time. Right now when we stack up mozRank, which is our main metric, it’s very similar to Google PageRank in its intuition in that all links are not created equal. Pages with lots of links pointing to them are more important than pages with few links pointing to them.
One of the fun things that we really like to do is go and try to plug in alternate versions of your domain. We plugged in seomoz.com, www.seomoz.com, seomoz.org, www.seomoz.org, and you can see how many people are linking to each version of the site, and how many different domains are pointing to that. So, it’s really fun to do that kind of comparison.
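The intuition that links are not all equal, and that pages with many incoming links matter more, is the idea behind the textbook PageRank calculation. The sketch below is that textbook power iteration on a hypothetical four-page graph; it is not the mozRank algorithm, whose details are not given here, and the damping value of 0.85 and the function name `simple_pagerank` are assumptions for the example.

```python
def simple_pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of page -> outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    count = len(pages)
    rank = {page: 1.0 / count for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / count for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / count
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three pages link to "hub", and "hub" links back to "a".
toy_graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = simple_pagerank(toy_graph)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph "hub" ends up with the highest score because it has the most incoming links, and "a" scores well above "b" and "c" because its single incoming link comes from the most important page.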
I think those comparisons can be really valuable to people. You can see if you need to redirect some things, or what you should be concentrating on in terms of building more juice to these domains, or that kind of thing. All that kind of stuff is really quite awesome to do. And then, there is spam analysis too. I think people might be somewhat critical of this like they have been in the past.
My general stance on it has been pretty unwavering, which is that SEOmoz is not an organization that says, we are going to try and protect spam from being discovered by the engines. We say hey, if you are a spammer, you know the risks of manipulating the engines. I am certainly friendly with a lot of people who do black hat SEO and I have no problem with them. Except in the worst cases, I don’t even think it is illegal or even that bad. But, Linkscape is fantastic for going in and realizing which sites are manipulative so you can know not to invest in those links.
We had this funny experience inside SEOmoz where we were kind of like oh man, are the search engines going to be really upset when we launch this? But we learned by hanging out with engineers from Google, Microsoft and Yahoo, and at SMX East that they all had the same thing going.
They were on a panel and an audience member asked, I think my competitor is using spam, blah, blah, blah, and Aaron from Google, who was up on our panel, said you should use Linkscape, figure out what you think of those links, and then maybe report them to us. They all sort of had the same mentality that maybe this isn’t so bad, and that it might be good to outsource some of the spam control back to Webmasters to a certain degree.
Eric Enge: Right. So what is the cost to use Linkscape?
Rand Fishkin: Well, there are some basic features, things like how many domains are linking to you, how many pages are linking to you and a couple of other metrics, that are all free. So, no matter who you are, you can grab them on your site or any other site. The advanced tools, the ones that show the list of links and all that kind of stuff where you can search through URLs or anchor text, are part of a pro-membership, and pro-membership is $79 a month to start, and then I think it goes up to $249 for the top package. That top package gives you lots and lots more Linkscape reports per month. I think the basic package has 20 advanced reports per month and unlimited free basic reports, obviously.
Eric Enge: Right. What if someone has an annual membership?
Rand Fishkin: Then they get 20 reports every month, and it refreshes your credits every month. It’s the same style as some of compete.com’s services, where you get credits every month, and then you use them up. I think we are coming up with a system to where people can buy individual credits and block standard features. So, we should have something released on that front soon, hopefully.
Eric Enge: Thanks Rand!
Rand Fishkin: Yes, thank you Eric!