Fabrice Canel is a Principal Program Manager at Bing, Microsoft, where he is responsible for web crawling and indexing. Today’s post is the transcript of an interview in which I spoke with Fabrice. Over the 60 minutes we spent together we covered a lot of topics.
During our conversation, Fabrice shared how he and his team think about the value of APIs, crawling, selection, quality, relevancy, visual search, and the important role that SEO continues to play.
Eric: What’s behind this idea of letting people submit 10,000 URLs a day to Bing?
Fabrice: The thought process is that as our customers expect to find the latest content published online, we try to get this content indexed seconds after it is published. Getting content indexed fast is particularly important for content like news. To achieve freshness, relying only on discovering new content via crawling existing web pages, sitemaps and RSS feeds does not always work. Many sitemaps are updated only once a day, and RSS feeds may not provide full visibility on all changes made on websites.
So instead of crawling and crawling again to see if content changed, the Bing Webmaster API allows site owners to programmatically notify us of the latest URLs published on their site. We see this need not only for large websites but also for small and medium websites, which then don’t have to wait for us to crawl them and don’t like too many visits from our crawler, Bingbot, on their sites.
Eric: It’s a bit like they’re pushing Sitemaps your way. And the code to do this is really very simple. Here is what that looks like:
You can use any of the below protocols to easily integrate the Submit URL API into your system.
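As a minimal sketch of the JSON variant in Python: the endpoint URL, the `apikey` query parameter, and the `siteUrl` / `urlList` field names below are assumptions and should be checked against the current Bing Webmaster Tools API documentation. The helper names `build_submit_request` and `submit_urls` are hypothetical, chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for batch URL submission; verify it against the
# current Bing Webmaster Tools API documentation before relying on it.
SUBMIT_ENDPOINT = "https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlBatch"

# Per-day limit mentioned in the interview.
DAILY_QUOTA = 10_000


def build_submit_request(api_key, site_url, urls):
    """Build (but do not send) a JSON POST request announcing new URLs."""
    urls = list(urls)
    if len(urls) > DAILY_QUOTA:
        raise ValueError("more URLs than the daily quota allows")
    # Assumed payload shape: the site plus the list of new or updated URLs.
    body = json.dumps({"siteUrl": site_url, "urlList": urls}).encode("utf-8")
    query = urllib.parse.urlencode({"apikey": api_key})
    return urllib.request.Request(
        url=f"{SUBMIT_ENDPOINT}?{query}",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def submit_urls(api_key, site_url, urls):
    """Send the request and return the HTTP status code."""
    request = build_submit_request(api_key, site_url, urls)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A publishing system could call `submit_urls` right after a page goes live, passing only the URLs that actually changed.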
Fabrice: Yes, we encourage both: pushing the latest URLs to Bing and having sitemaps to ensure we are aware of all relevant URLs on the site. Pushing is great, but the internet is not 100% reliable: sites go down, and your publishing system or our system may have temporary issues. Sitemaps are the guarantee that we are aware of all relevant URLs on your site. In general, we aim to fetch sitemaps at least once a day, and while we can fetch more often, most sites don’t want us to fetch them as often as every second. Complementary to freshness, RSS feeds are still a good solution for small and medium sites, but some sites are really big, and one RSS feed can’t handle more than 2500 URLs if it is to keep its size within 1 MB. All of these things are complementary ways to tell us about site changes.
Eric: This means you will get lots of pages pushed your way that you might not have gotten to during crawling, so it should not only enable you to get more real time content, but you’ll be able to see some sites more deeply.
Fabrice: Absolutely. Every day we discover more than 100 billion URLs that we have never seen before. What is even scarier, these are the URLs that we normalized: no session IDs, parameters, etc. This is only for content that really matters, and it’s still 100 billion new ones a day. A large percentage of these URLs are not worth indexing. Some simple examples of this include date archives within blogs or pages that are largely lacking in unique content of value. The Bing mechanism for submitting URLs is in many cases more useful and trustworthy than what Bingbot can discover through links.
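To make the idea of normalization concrete, here is a minimal Python sketch of stripping session IDs and tracking parameters so that variants of one page collapse to a single URL. It is an illustration only, not Bing’s actual normalization; the parameter names in `NOISE_PARAMS` and the function name `normalize_url` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change the page content.
NOISE_PARAMS = {"sessionid", "sid", "phpsessid",
                "utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url):
    """Return a canonical form of url with noise parameters removed."""
    parts = urlsplit(url)
    # Keep only query parameters that are not in the noise list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]
    # Sort so that parameter order does not create distinct URLs.
    kept.sort()
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(kept),
        "",  # drop the fragment, which is never sent to the server
    ))
```

With this sketch, `normalize_url("HTTP://Example.com/shoes?sessionid=abc&color=red#top")` and `normalize_url("http://example.com/shoes?color=red")` yield the same string, so only one of them would count as a newly discovered URL.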
Eric: For sites that are very large, I heard you make reference that you would allow them to form more direct relationships to submit more than 10,000 URLs per day.
Fabrice: You can contact us, and we’ll review and discuss it and how it bears on the business criteria of the site. Please don’t send us useless URLs, such as duplicate content or duplicate URLs, so that we don’t send fetchers to fetch them.
Eric: How will this change SEO? Will crawling still be important?
Fabrice: It’s still important to ensure that search engines can discover your content and links to that content. With URL submission you may have solved the problem of discovery, but understanding interlinking still matters for context.
Related to that is selection. SEOs should pay attention both to links to their content and to selection. The true size of the Internet is infinite, so no search engine can index all of it.
Some websites are really big. Instead of adding URLs to your site only to get a few of them indexed, it’s preferable to focus on ensuring the head and body of your URLs are indexed. Develop an audience and develop authority for your site to increase your chances of having your URLs selected. URL submission helps with discovery, but SEOs still need to pay attention to factors that impact selection, fetching, and content. Ultimately, your pages need to matter on the Internet.
Eric: So, even for the discovery part, there is still a role for the SEO to play, even though the API makes it easier to manage on your end.
Fabrice: Yes, for the discovery part there’s a role for the SEO to remove the noise and guide us to the latest content. LESS IS MORE. The basics of content structure still matter too. For example, you:
- still need titles/headers/content
- still need depth and breadth of content
- still need readable pages
- still need to be concerned about site architecture and internal linking
Eric: On the AI side of things, one of the things I think we’re seeing is an increasing push towards proactively delivering what people want before they specifically request it: less about search, and more about knowing the preferences and needs of users and serving things up to them in real time, even before they think to do a search. Can you discuss that a little bit?
Fabrice: You might think of this as position “-1”. This is not only to provide results, but to provide content that may satisfy people’s needs, information that is related to you and your interests, within the Bing app or the Bing home page. You can set your own interests via the Bing settings, and then you will see the latest content on your interests across various canvases. I am deeply interested in knowing the latest news on quantum computing… what are your interests?
Instead of searching for the latest every five minutes, it is preferable to be notified about what’s happening in more proactive ways.
Eric: So Bing, or Cortana, becomes a destination in and of itself, and rather than searching you’re getting proactive delivery of content, which changes the use case.
Fabrice: Yes. We prefer surfacing the content people are searching for based on their personal interests. To be the provider of that content, to have a chance to be picked up by search engines, you have to create the right content and establish the skill and authority of that content. You must do the right things SEO-wise and amplify the authority of your site above other sites.
Eric: There’s always the issue of authority. You can make great content, but if people aren’t sharing or linking to your content, it probably has little value.
Fabrice: Yes, these things still matter. How your content is perceived on the web is a signal that helps us establish the value of that content.
Eric: Let’s switch the topic to visual search and discuss use cases for visual search.
Fabrice: I use it a lot, and shopping is a beautiful example of visual search in action. For example, take a picture of your chair with your mobile device, upload the image to the Bing app, and bingo, you have chairs that match this model. The image is of a chair, it’s black, and the app will find similar things that match.
Visual search involves everything related to shopping, day to day object recognition, people recognition, and extracting information that is matching what your camera was capturing.
Eric: For example, I want to know what kind of tree that is …
Fabrice: Trees, flowers, everything.
Eric: How much of this kind of visual search do you anticipate happening? I’d guess it’s currently small.
Fabrice: Well, yes and no. We already use this technology in Bing for search and image search, to understand the images we are viewing on the Internet. Even for images with no caption or alt text, if we are able to recognize the shapes in the image, we can relate it to the text keywords people put in. The image may also carry additional meaning, and extracting that information can advance the relevance of a web page.
Going beyond Bing and search, this capability is offered in Azure and integrated into all kinds of systems across the industry. It offers enterprises the ability to recognize images, camera inputs, and more. This can also extend into movies.
Eric: You mentioned the role images can play in further establishing the relevance of a web page. Can visual elements play a role in assessing a page’s quality as well?
Fabrice: Yes, for example you can have a page on the Internet with text content, and within it you may have an image that is offensive in different ways. The content of the text is totally okay, but the image is offensive for whatever reason. We must detect that and treat it appropriately.
Eric: I’d imagine there are scenarios where the presence of an image is a positive quality identifier. People like content with images, after all.
Fabrice: Yes, images can make consuming the content of a page more enjoyable. I think in the end it’s all about SEO: you need to have good text, good schema, and good images. Users would love to go back to your site if it’s not full of ads and doesn’t have too much text with nothing to illustrate it. If you have a bad website with junky HTML, people may not come back. They may prefer another site with better content.
Eric: Integration of searching across office networks is one of the more intriguing things we’ve heard from Bing, including the integration with Microsoft Office documents. As a result, you can search Office files and other types of content on corporate networks.
Fabrice: When you search with Bing and you are signed up for a Microsoft/Office 365 offering that enables Bing for business, Bing will also search your company data, people, documents, sites and locations, as well as public web results, and surface these search results in a unified search results experience with internet links. People don’t have to search in two or three places to find stuff. Bing offers a one-click experience, where you can search your Intranet, SharePoint sites for the enterprise, and the Internet all at once. You can have an internal memo that comes up in a search as well as other information that we find online. We offer you a global view. As an employee, this is tremendously helpful: you can do more because information is easier to find.
Need to find the latest vacation policy for your company? We can help you find it. Need to know where someone is sitting in your office? We can help you find that too. Or, informational searches that we do can seamlessly find documents both online and offline.
Eric: Back to the machine learning topic for a moment – are we at the point today where the algorithm is obscure enough that it is not possible for a single human to describe the specifics of ranking factors?
Fabrice: In 15 minutes it can’t be effectively done. We are guided in the decisions we take by quality expectations and by determining good results vs. not-so-good results. Machine learning is far more complicated. When we have issues, we can break it down and find out what is happening per search. But it’s not made up of simple “if-then” coding structures; it’s far more complicated.
Eric: People get confused when they hear about AI and machine learning and they think that it will fundamentally change everything in search. But the reality is that search engines will still want quality content, and need to determine its relevance and quality.
Machine learning may be better at this, but as publishers, our goal is still to create content that is very high quality, relevant, and to promote that content to give it high visibility. That really doesn’t change, it doesn’t matter whether you’re using AI / machine learning or a human generated algorithm.
Fabrice: That will never change. SEO is like accessibility where you need common rules to make things accessible for people with disabilities. In the process of implementing SEO you’re helping search engines understand the thing, you need to follow the basic rules, you can’t expect search engines to do magic and adapt to each and every complex case.
Eric: There’s an idea that people have that machine learning might bring in whole new ranking factors that have never been seen before. But it’s not really going to change things that much, is it?
Fabrice: Yes, a good article is still a good article.
Eric: A couple of quick questions to finish. John Mueller of Google tweeted recently that they don’t use prev/next anymore. Does Bing use it?
Fabrice: We are looking at it for links and discovery, and we use it for clustering, but it is a loose signal. One thing related to AI: at Bing we look at everything. This isn’t a simple “if-then” thing; everything on the page is a hint of some sort. Our code is looking at each and every character on each and every page. The only things that aren’t hints are robots.txt and meta noindex (which are directives); everything else is a hint.
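The difference between a directive and a hint can be sketched with Python’s standard `urllib.robotparser`: a robots.txt rule gives a yes/no answer that a well-behaved crawler obeys, rather than a signal to be weighed. This is an illustration only, not Bing’s implementation; the sample rules and the helper name `is_fetch_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_fetch_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here a URL under `/private/` is refused for every user agent, while any other path on the site is allowed.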
About Fabrice Canel
Fabrice is a 20-year search veteran at Bing, Microsoft. He is a Principal Program Manager leading the crawling, processing and indexing team at Bing, dealing with hundreds of billions of new or updated web pages every day! In 2006, Fabrice joined the MSN Search Beta project, and since then he has been driving the evolution of the Bing platform to ensure the Bing index is fresh and comprehensive. He is also responsible for the protocols and standards for Sitemaps.org and AMP on Bing. Prior to MSN Search, Fabrice was the Lead Program Manager for search across Microsoft websites, in a role covering all aspects of search, from search engine technology to search user experience… to content in the very early days of SEO.