Guest Post by Ashutosh Garg
In this article, page quality refers to the quality of a page with respect to a search query and the user who issued that query.
Page quality is a broad concept, and the right algorithm depends on the context in which the quality score will be used. Rather than describing one specific algorithm, this article presents a framework for thinking about page quality and shows how it can be adapted to fit a particular situation.
Some of the situations where page quality is used are:
- Search engines: a search engine scores a page against a query and uses this signal to judge whether the page is relevant to the user’s query. Assigning a numeric score also makes it possible to tell whether one page is relatively better than another.
- Ad targeting: before showing an ad to a user, an ad network may score the ad and its landing page against the user’s query to determine whether the ad is relevant to what the user is looking for.
- Discovery: a page can be evaluated even in the absence of a query to assess its quality and decide whether it should be recommended to the end user.
In this article we will consider the different algorithms used to assess page quality.
The first set of algorithms computes the score of a document as a function of the actual query issued by the user:
IR score: The Information Retrieval (IR) community has long researched how to compute the best possible score for a page given a query. This is probably the most important score one can use in evaluating a page. Open source search engines such as Lucene implement scoring of this kind. Given a query Q = {q1, q2, q3} containing three words and a page P, the steps used to compute the score of the page are:
- Come up with a relative weight for each section of the page. A typical webpage can be divided into components such as the title, headings (H1, H2, H3, ...), body, bold text, large text, small text (based on font size), text above the fold (assuming a certain display), anchor text, boilerplate text, text on pages being pointed to, text on pages the user visited before this page, text present in images on the page, URL text, and so on. Depending on the application, one can assign different weights to different elements of the page. One rule of thumb is to consider how people will discover the page and form their first impression. If discovery happens through search, people will find the page by reading the title and snippet displayed by the search engine, and they will form their first opinion by reading the text above the fold.
- Generate features based on the query. Take the query and break it down into n-grams (the bigrams, for example, are all phrases of length two), then assign a weight to each n-gram. For example, consider the query “canon digital camera”. Here “canon” is an important unigram because it refers to the brand; “canon digital” is a bad phrase, while “digital camera” is a good phrase. Traditionally people have used TF-IDF (http://en.wikipedia.org/wiki/Tf*idf) to come up with a weighting. Be cautious about which dataset is used to compute TF-IDF: it should closely resemble the dataset to which the weighting is applied.
- Document quality for computing the TF-IDF score. A document consisting of the content of every page on the web would match any query, but coming across a very large document is not a great experience. At the other extreme, a document that is identical to the query is bad because the user won’t learn anything new on landing there. Review which platform most visitors to your website use. If they are on smartphones, the ideal document length should be less than 500 words; on tablets, about 1K words; on laptops, even longer given the presentation. The score should be normalized by document length in some way, and the IR community has published various papers on how to do so.
- A simple scoring of a document can be
Page P consists of fields d_i with weights w_i, and query Q consists of words q_k. The length of the page is L, and the number of phrases in the query is N_q,
where f is a normalization function based on document length.
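As a rough illustration of how the steps above could fit together, here is a minimal Python sketch. It is not the exact scoring formula referred to above and not Lucene's implementation. It assumes the score is the sum, over fields d_i and query phrases, of the field weight w_i times a phrase weight times the phrase count in the field, divided by N_q and f(L). The smoothed IDF used as the phrase weight, the square-root form of f with a 500-word pivot, the field weights, and the two sample pages are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """All contiguous n-grams of length 1..max_n, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def idf_weights(phrases, corpus):
    """Smoothed IDF of each phrase over a corpus given as a list of token lists."""
    doc_phrase_sets = [set(ngrams(doc)) for doc in corpus]
    n_docs = len(corpus)
    weights = {}
    for phrase in phrases:
        doc_freq = sum(phrase in s for s in doc_phrase_sets)
        weights[phrase] = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
    return weights


def length_norm(length, pivot=500):
    """Assumed f(L): 1.0 up to `pivot` words, then grows as sqrt(L / pivot)."""
    return max(1.0, math.sqrt(length / pivot))


def score_page(fields, field_weights, query, corpus):
    """Field-weighted, IDF-weighted phrase counts, divided by N_q * f(L)."""
    phrases = ngrams(query.lower().split())
    if not phrases:
        return 0.0
    weights = idf_weights(phrases, corpus)
    total, length = 0.0, 0
    for name, text in fields.items():
        tokens = text.lower().split()
        length += len(tokens)
        counts = Counter(ngrams(tokens))
        w_field = field_weights.get(name, 0.0)
        total += sum(w_field * weights[p] * counts[p] for p in phrases)
    return total / (len(phrases) * length_norm(length))


# Hypothetical field weights and pages, for illustration only.
field_weights = {"title": 3.0, "body": 1.0}
page_a = {"title": "Canon digital camera reviews",
          "body": "Compare digital camera models from Canon"}
page_b = {"title": "Photography tips", "body": "A Canon lens guide"}
corpus = [" ".join(p.values()).lower().split() for p in (page_a, page_b)]

score_a = score_page(page_a, field_weights, "canon digital camera", corpus)
score_b = score_page(page_b, field_weights, "canon digital camera", corpus)
print(score_a > score_b)  # page_a matches more query phrases, in a heavier field
```

In practice the IDF statistics would come from a corpus that resembles the pages being scored, and the field weights and length pivot would be tuned to the application and the devices visitors use.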
Which page has a higher IR Score for “Canon digital camera”?








