Ramez Naam is Group Program Manager of Live Search at Microsoft, responsible for overseeing Live Search queries, including relevance and the index.
Naam joined Microsoft Corp. in 1995, and has worked on teams for Microsoft Exchange Server, Microsoft Office Outlook and Microsoft Internet Explorer.
He is the author of a nonfiction book on biotechnology titled “More Than Human: Embracing the Promise of Biological Enhancement,” published by Broadway Books/Random House in March 2005.
Eric Enge: At the recent media event you called “Searchification”, you announced a dramatic increase in index size. I think you went from about five billion documents to about twenty billion documents. How will this affect relevance?
Ramez Naam: The first step in providing relevant results to customers is being able to find them, having them in the index at all. If we don’t have them in the index, there is no way we can serve them up for our customers, so that’s just a requirement. The web is constantly growing, so you’ll see us continually grow our index in the future as well. It really helps, not on the most common queries, but on the more difficult queries and queries that are less frequent. Having additional content makes the difference between being able to serve up good results for customers or not good results.
Eric Enge: Do you have some examples that you can talk about of things that illustrate that point?
Ramez Naam: One example I gave at Searchification was Dona Nobis Pacem, sung in Latin, where before we just didn’t have the right result in the index at all, so we couldn’t help the customer. Now they get the right result from us.
Eric Enge: Yes, it means “give us peace”.
Ramez Naam: Yeah. Another sample query is “Janet Buxman Kurihara”, where previously we didn’t have her executive profile page in our index at all, and now we do. As a result, we are able to get the exact right result at the top of the results, and in fact even with our new index now there are only ten results for this query on the web, and a couple of those are our announcement in our press release.
Eric Enge: That makes sense. I suppose, though, that once you get to a certain index size there is less to be gained by adding a lot more pages. If you went from twenty billion to thirty billion, for example, the gains would start to diminish.
Ramez Naam: There certainly are diminishing returns, but at the same time the frontier of research is helping people with the hardest queries they have. And so, on those kinds of super-difficult and super-tail queries, it continues to be very important.
Eric Enge: You also talked about the ability to determine query intent better, and I think you gave a couple of examples while you were doing your demos: stop word handling, query stream stemming, query term equivalence and punctuation analysis. Can you talk a little bit about these?
Ramez Naam: Absolutely. Historically, the burden has been on the searcher to type their search in exactly the way that a page would express the same concepts. They have to use the exact same spelling, e-mail versus email for instance. A search engine would not view those as equivalent, but they really are the same concept. Another direction for search in general, and certainly one for Live Search, is to let the customer simply express the query in a way that makes sense. Then, put the burden of matching different spellings, or a plural versus singular, or even close synonyms on the search engine.
Eric Enge: Right, absolutely. So, let’s just talk about a couple of related examples. What’s an example of a stop word?
Ramez Naam: A stop word is a word like “the” or “a”, for instance. And so, one of the examples I used in the demo was a query like “the office”. Historically any search engine would just drop the word “the”, which turns the query from one about a television show into one about a software product, and those are totally different things. So, we have to be intelligent.
It’s not always the case, but sometimes keeping the stop words helps you, and in many queries keeping the stop words in the query hurts you. So, you have to selectively understand what the customer intent is. When is the stop word part of the name of something? When is it part of the name of a book, or television show, or a person, or a place? Those are especially important cases in which to preserve it in the query.
Eric Enge: Do you have an algorithmic way of making that determination, or is there a manual aspect to this?
Ramez Naam: It’s algorithmic. In web search the diversity of queries is so large that there is only a little bit you can do manually. Everything has to be algorithmic if you want to scale.
Eric Enge: Right, so you have a way of determining that for a query like “the office”, that “the” is a part of it, because of the context in which it’s used across a number of sites on the web.
Ramez Naam: Exactly, things like that.
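The selective stop-word handling discussed above can be sketched roughly as follows. This is only an illustration: the interview says the real determination is algorithmic and does not describe it, so the hand-written `KNOWN_TITLES` set, the `STOP_WORDS` list and the `normalize_query` function are all hypothetical stand-ins.

```python
# Hypothetical data: names in which a stop word is part of the title.
# A real engine would derive this from web-scale data, not a fixed set.
KNOWN_TITLES = {"the office", "the who", "a beautiful mind"}
STOP_WORDS = {"the", "a", "an", "of"}


def normalize_query(query):
    """Drop stop words unless the whole query is a known title."""
    lowered = " ".join(query.lower().split())
    if lowered in KNOWN_TITLES:
        return lowered  # the stop word carries meaning here, so keep it
    kept = [word for word in lowered.split() if word not in STOP_WORDS]
    # Never return an empty query: fall back to the original words.
    return " ".join(kept) if kept else lowered


print(normalize_query("The Office"))           # the office
print(normalize_query("the history of jazz"))  # history jazz
```

The point of the sketch is only that the decision is made per query: the same word is kept in one query and dropped in another.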
Eric Enge: Right. So, what about query stream stemming?
Ramez Naam: A great example of stemming is a singular word versus a plural one. Fish recipe versus fish recipes, for instance. You get the best results for the query if you stem it, because the best pages are on sites like Allrecipes.com, for instance. The title of their page is fish recipes. It has the singular form of the term on there as well, but it’s a far better match if you allow for a match that’s plural. And again, you shouldn’t force the customer to have to understand what the page will contain exactly. When customers issue queries, what they are really issuing is kind of a description of what they want rather than a sample text of what might be in that page.
Eric Enge: Right. For this example when a customer types in “fish recipe”, they probably actually want “fish recipes”.
Ramez Naam: Yes. They are trying to get to at least one, and sites that contain such things probably have used the word recipes.
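As a rough illustration of the “fish recipe” versus “fish recipes” example, the sketch below matches a query against a page title after stripping plural endings. The `stem` rule is deliberately naive and is an assumption made for this example; it is not how Live Search stems queries.

```python
def stem(word):
    """Very naive plural stripping; real stemmers handle far more cases."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # berries -> berry
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # recipes -> recipe
    return w


def matches(query, title):
    """True if every stemmed query term appears among the stemmed title terms."""
    title_stems = {stem(term) for term in title.split()}
    return all(stem(term) in title_stems for term in query.split())


print(matches("fish recipe", "Fish Recipes"))  # True
print(matches("fish recipe", "Fish Tanks"))    # False
```

Stemming both sides means the searcher does not have to guess whether the page uses the singular or the plural form.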
Eric Enge: Understood. What about query term equivalence?
Ramez Naam: Equivalents are things like email versus e-mail for instance.
Eric Enge: That we already talked about, right?
Ramez Naam: Right.
Eric Enge: How about punctuation analysis, which you also talked about at Searchification? Is that just understanding the way people use different punctuation characters, and whether or not they are an important part of the query?
Ramez Naam: That’s right. And, whether you should concatenate the terms or the letters around them into one word.
Eric Enge: So, for example St. is the same as saint?
Ramez Naam: Yes. Or, the example we used in the demos was C.N.N., because what you will see most on the web is CNN. Or, U.S. usually means US.
Eric Enge: Right, which doesn’t mean us.
Ramez Naam: That’s actually a very tricky one for just that reason.
Eric Enge: You could translate it into something where you get more confused than ever.
Ramez Naam: Yes. So, we really have to look at the context.
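The concatenation step mentioned above (C.N.N. to CNN) can be sketched with a regular expression. This is a minimal sketch under stated assumptions: the `ACRONYM` pattern and `collapse_acronyms` function are invented for illustration, and the sketch ignores exactly the context problem raised in the conversation, so it would turn U.S. into US even where that is the wrong reading.

```python
import re

# Matches runs of two or more single letters each followed by a period,
# such as "C.N.N." or "U.S."; "St." does not match because "St" is two letters.
ACRONYM = re.compile(r"\b(?:[A-Za-z]\.){2,}")


def collapse_acronyms(query):
    """Join single letters separated by periods into one token."""
    return ACRONYM.sub(lambda m: m.group(0).replace(".", ""), query)


print(collapse_acronyms("C.N.N. news"))   # CNN news
print(collapse_acronyms("St. Louis"))     # St. Louis
```

An abbreviation like St. needs a different kind of rule (an equivalence with “saint”), which is why the two cases are treated separately here.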
Eric Enge: Right. There is a lot of contextual analysis going on. Let’s talk about improved query refinement. Can you talk about what that is, and provide some specific examples along the way?
Ramez Naam: Yes. There are actually a couple of different things we’ve put into this. One of them is really a little bit more like the query intent stuff we just talked about, which is automatic spell correction. We all mistype things, and there are times that we simply don’t know how to spell something, and that’s an even more important scenario. So, we try to again remove that burden from the customer and shift it to the search engine to do the right thing with their query.
Eric Enge: Right. So, someone could misspell Britney Spears as, say, S-P-E-E-R-S, and you would correct it to S-P-E-A-R-S. If I remember correctly, don’t you actually just make the correction automatically rather than offer them a link that they can click on?
Ramez Naam: When we are confident, we will automatically make the correction. We’ll still give them a link to tell them what we did, and let them undo the correction in case they really wanted the old spelling, but yes, we will do that for them automatically.
Eric Enge: Right, which is a little bit of a departure from what the other engines do. It’s a nice thing. Clearly, to make that work, you’ve got some sort of confidence score that you are keeping, and you have to exceed some threshold before you would do that.
Ramez Naam: Yes. We have to use quite a bit of data and algorithms to be able to figure out when that confidence is high enough.
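The “correct automatically only above a confidence threshold, and let the user undo it” behavior can be sketched as below. Everything specific here is an assumption for illustration: plain string similarity from Python’s `difflib` stands in for the data-driven confidence described in the interview, and `KNOWN_QUERIES` and `AUTO_CORRECT_THRESHOLD` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary and threshold, for illustration only.
KNOWN_QUERIES = ["britney spears", "fish recipes", "seattle crush"]
AUTO_CORRECT_THRESHOLD = 0.9


def similarity(a, b):
    """String similarity in [0, 1], used here as a stand-in confidence score."""
    return SequenceMatcher(None, a, b).ratio()


def correct(query):
    """Return (query_to_run, original_query_or_None).

    The second element is set only when a correction was applied, so the
    results page can offer a link back to the original spelling.
    """
    q = query.lower()
    best = max(KNOWN_QUERIES, key=lambda known: similarity(q, known))
    if best != q and similarity(q, best) >= AUTO_CORRECT_THRESHOLD:
        return best, query
    return query, None


print(correct("britney speers"))   # ('britney spears', 'britney speers')
print(correct("seattle weather"))  # ('seattle weather', None)
```

Returning the original query alongside the correction is what makes the undo link possible; below the threshold the query runs unchanged.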
Eric Enge: Right. So, let’s talk a little bit about query suggestions.
Ramez Naam: Query suggestions are an area we moved into last year with the launch of Live Search 1.0, and that we are keen to improve on. What we see is that customers don’t want dead-ends, they want to be able to refine their queries, and picking the right queries is important. And then secondly, sometimes if they are searching on one topic, a related topic might be interesting to them.
Eric Enge: Right. For example, if you type the name of some town, you might like to see suggestions for things that relate to that town. Perhaps you are looking for restaurants, or sports teams, or whatever.
Ramez Naam: Yes, absolutely.
Eric Enge: Right. You are trying to help them. The other thing that I remember from being at the event is the search experience for famous people.
Ramez Naam: Yes, especially in the image search protocol. If you do a search in images for someone like George Bush, off on the right we have a set of suggestions for related people. We find that customers really like this; it actually has really high engagement. Especially in searching images, people like to browse around, they like to look at images, and they are curious about the relationship between people.
Eric Enge: It’s kind of interesting, because historically the search engines have all had an image search protocol as a separate tool. But, my guess is that you are driving the image search volume much higher by integrating it into the core of web search in that fashion.
Ramez Naam: It is true that customers really like to get relevant results from any type of content in that one results page. So, at the top of the page, when it’s appropriate, we’ll put news articles, we’ll put images, we’ll put videos, we’ll put stock quotes, or whatever we think the most relevant kind of content is for a customer.
Eric Enge: Right. Machine translation is another note I have here. Are we talking about language translation?
Ramez Naam: This is language translation. This is an area where I’m in collaboration with Microsoft Research. We have really innovated in the user experience around translating pages. Specifically, most translation services let you put in the original text and get back the translated version. But we have this very rich way to view both the original content and the translated content, which is useful in all sorts of ways.
Eric Enge: Right. You have a way that you can see both versions side by side, right?
Ramez Naam: Exactly.
Eric Enge: You can actually see the translations, which if you have a crude knowledge of the original language, you might be able to detect problems that way.
Ramez Naam: That’s right, and you can learn a lot, because we do kind of sentence-by-sentence translation. If you hover over one translated sentence, we’ll show you what the text is in the original document as well. So, I just have a page in front of me that is a German page that I translated into English. I can go hover over a part of the English page and it will show me which part of the German page that text was translated from.
Eric Enge: Do you have a sample search I can try?
Ramez Naam: Yeah, if you go to a search engine, type in Volkswagen Kaefer. Then you will see the sixth result is the VW Kaefer Wikipedia page. This is a German page, so just click on translate this page and you will get it in this side-by-side view. Now, you can hover over any part of the text in the German site for instance, and you’ll see the translation into English.
We think this is a really innovative user experience that takes reading content that is not in your native language to a new level.
Eric Enge: Right. Now, presumably that also means if I enter an English query, and you find a really good answer on a German site, that I might see that result. Is that something that you’ll present in the results?
Ramez Naam: Yeah, we will do that, but not that often. Currently, we try to find content in your language if at all possible. This helps English speakers certainly because there is good content up there. But, where it really has an impact is worldwide, because there is so much English language content on the web already. But, if you are a German speaker, there is less content available.
If you are a Korean speaker, there is less content available. So, it can really help people worldwide.
Eric Enge: Yes, indeed. That’s very cool. Let’s talk a little bit about RankNet.
Ramez Naam: RankNet is innovative technology that came out of Microsoft Research. It’s a neural network for ranking on the web. A neural network is a simple simulation of the way the human brain works. It has some number of neurons that all can interconnect to each other, and you can show it examples of how it should respond to various kinds of inputs and train it that way. For instance, we showed it examples of pages that are relevant and not relevant for queries, and it learns which types of pages it should rank the most highly.
It’s an AI approach, or a machine learning approach, that’s been used in many fields, like computer vision and also prediction sorts of scenarios. We believe we are the first to apply such a technique to web search, and it’s one of the core foundations of our ranking technology.
Eric Enge: What does it allow you to do that you can’t do with other algorithms?
Ramez Naam: With other algorithms you need to think about the signals on a page that are important in terms of whether or not it’s a good match for a query. Other algorithms have to be very simple, because people have to typically hand-tune the weights of various signals for instance. You might say it’s how many times all the words appear in the title plus how many times all the words appear in the body, things like that.
With RankNet we can let the machine figure this out for itself, so we can put together a much more complex view of pages overall, rather than a simple score via very simple rules. This can start to behave like a human, like someone who could read a page, understand it to some extent and decide whether or not it was a good match for a query.
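The contrast between hand-tuned weights and weights learned from examples can be illustrated with a small sketch. The interview does not describe RankNet’s internals, so this should not be read as its implementation: a linear scorer stands in for the neural network, the training signal is a simple pairwise logistic loss over “this page should rank above that one” examples, and the two features and all the data are invented for this example.

```python
import math


def score(weights, features):
    """Linear scorer standing in for the neural network."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise(pairs, n_features, lr=0.1, epochs=200):
    """Learn weights from (better_page, worse_page) feature pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = score(weights, better) - score(weights, worse)
            # Modelled probability that `better` ranks above `worse`.
            p = 1.0 / (1.0 + math.exp(-diff))
            # Gradient step on the pairwise loss -log(p).
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (better[i] - worse[i])
    return weights


# Hypothetical features per page: [query terms in title, query terms in body]
pairs = [([2, 5], [0, 6]), ([1, 3], [0, 3]), ([2, 1], [1, 4])]
learned = train_pairwise(pairs, n_features=2)

# Every training pair should now be ordered correctly by the learned weights.
print(all(score(learned, b) > score(learned, w) for b, w in pairs))  # True
```

Nobody chose the relative weight of title matches versus body matches here; it falls out of the examples, which is the advantage described above. A neural network replaces the linear `score` with something that can capture more complex interactions between signals.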
Eric Enge: Okay. Perhaps you might also get some advantages from understanding how to treat one category of websites differently than another?
Ramez Naam: Absolutely. A neural net can learn complex things like that.
Eric Enge: Right. That’s interesting; any other advantages that we should talk about?
Ramez Naam: I think you will see the biggest advantages are in the most complex queries, such as those that have real meaning in the relationship of the words to each other. For instance, the query “hottest temperature in the State of Arizona”. That is a complex query; it’s not just a navigational query, it’s actually expressing a relationship, hottest temperature as it relates to the State of Arizona. Those sorts of complex scenarios are the ones in which the neural net could do well.
Eric Enge: Right, excellent. Alright, so let’s talk about structured information extraction.
Ramez Naam: Right. What we see today on the web is a lot of text, but hidden in that text is structured data that is super relevant to customers, and that can be presented not just as free text, but structured in some way. An example of this is our opinion index, where we take the text of user reviews, for instance, and combine those into more structured information about a product.
So, for instance, if you go to a query like Canon Powershot and then click on the first image you see, the Canon Powershot SD800, you see that out of hundreds of reviews across the web we’ve pulled out structure. On the left-hand side, we’ve pulled out how customers rate various factors of this camera, like its price and its speed of lens. This is the core technology that we use for crawling the web, indexing it, understanding it, and ranking it. We can start to pull out this kind of structured information, aggregate it for users, and give them a richer, more powerful experience.
Eric Enge: Right. You could argue that that’s in the dimension of adding structure to information which is currently unstructured, right?
Ramez Naam: You could say that’s adding structure, or you could say that’s extracting structure that is already there.
Eric Enge: That’s great. What are the other kinds of things other than reviews that you can do similar things with?
Ramez Naam: Right. If you look at Yellow Pages for instance, you’ll see areas there, some of which have reviews, but there are other sorts of structured information. So, if we type Seattle Crush, the top result is the Crush restaurant on Madison Street. It’s a favorite restaurant of mine, not a mile from my house. When we click on that link we see not just reviews that we pulled out, but we have photos of it. We have their email to contact the head chef there, we have their hours of operation, we have what the parking is, what the average price is, and the address. So, this is all structured information in this particular domain of local businesses.
Eric Enge: Right. So, you are extracting information to aid in the accuracy or the completeness of local search results.
Ramez Naam: Right. You see that we’ve done this in local, we’ve done this in shopping, and health is another area very important to us. We will keep doing this more and more as we see opportunities to give customers a richer experience.
Eric Enge: Right. Another area that you featured and talked quite a bit about was the expansion of the answers platform.
Ramez Naam: This was really a comment about backend capabilities that we’ve created to make it easier for us to integrate other types of information into the core web results page. So, when we talk about the answers platform, what we are talking about is this: if you type Britney Spears, how do we know that that’s a celebrity, and give you the right celebrity experience at the top? If you type Seattle Crush, how do we get local results at the top? That framework behind the inclusion of content other than just the ten algorithmic web-crawled links in the core page is what we call answers.
Eric Enge: As per your comments before then, that’s something that you are going to be pushing on, expanding, going into the future.
Ramez Naam: Yes.
Eric Enge: Right. Is Microsoft going to continue to place a heavy emphasis on vertical platforms, and are there other vertical platforms that you are focusing on already that you might be able to talk about?
Ramez Naam: The high value and high interest domains we are currently focused on are commerce, local, health, and entertainment. You see specifically in entertainment that there is a big emphasis on multimedia for us. We think we have the best in breed image search of anyone out there in the industry. We think with our smart motion thumbnails, we’ve taken a big lead in thought leadership in video, and you’ll see those wrapped around that core experience.
Eric Enge: Right. One of the things that I hear from various places in the industry is that you have a greater emphasis on “on page” factors than some of the other search engines.
Ramez Naam: That’s very difficult for me to say, because I don’t know what the other engines do really. Our ranking algorithms will continually evolve, and we are always looking to make them better, so the emphasis could change day-to-day. We move fast, and we are always looking to increase customer satisfaction.
Eric Enge: So, what does the future hold for Live Search? Is there anything that we should expect to see in the near to medium term?
Ramez Naam: We are just going to keep on focusing on delighting customers. We think we’ve made fantastic progress in a short period of time, as the underdog in this field really. We think we’ve come a long way fast, and we are going to keep on delivering advances in relevance, advances in those domains, advances across the product as best as we can.
Eric Enge: Excellent! Well, I really appreciate your time today. It was great to catch up.
Ramez Naam: My pleasure, good talking to you.