A Mathematical Model for Assessing Page Quality

Guest Post by Ashutosh Garg

In this article, by page quality we mean the quality of a page with respect to a search query and the user who issued that query.

Page quality is a broad concept, and the actual algorithm will vary depending on the context in which the quality score is used. Rather than describing a specific algorithm, this article presents a framework for thinking about page quality and shows how it can be adapted to fit one’s unique situation.

Some of the situations where a page quality score is used are:

  1. Search Engines – search engines score a page with respect to a query and use this signal to estimate whether the page is relevant to the user’s query. By assigning a numeric score, they can also determine whether one page is “relatively” better than another.
  2. Ad Targeting – before showing a particular ad to a user, an ad network may score the ad and its landing page against the user’s query to decide whether the ad is indeed relevant to what the user is looking for.
  3. Discovery – a page can be evaluated, even in the absence of a query, to understand its quality and thus decide whether it should be recommended to the end user.

In this article we will consider the different algorithms used to assess page quality.

The first set of algorithms computes the score of a document as a function of the actual query issued by the user.

IR score – The Information Retrieval (IR) community has long researched how to compute the best possible score for a page given a query. This is probably the most important score one can use in evaluating a page. Open source search engines such as Lucene implement algorithms of this kind. Given a query Q={q1, q2, q3} containing three words and a page P, the steps used to compute the score of the page are:

  1. Come up with a relative weight for each section of the page – A typical web page can be divided into various components: title, headings (H1, H2, H3, …), body, bold text, large text, small text (based on font size), text above the fold (assuming a certain display), anchor text, boilerplate text, text on the pages being linked to, text on the pages the user visited before arriving at this page, text present in images on the page, URL text, etc. Depending on the application, one can assign a different weight to each element of the page. One rule of thumb is to consider how people will discover the page and form their first impression. If it is search, people will discover the page by reading the title and snippet displayed by the search engine, and they will form their first opinion by reading the text above the fold.
  2. Generate features based on the query – Break the query down into n-grams (a bigram is a phrase of length two), then assign a weight to each n-gram. For example, consider the query “canon digital camera”. Here “canon” is an important unigram because it refers to the brand, “canon digital” is a bad phrase, and “digital camera” is a good phrase. Traditionally, TF-IDF (http://en.wikipedia.org/wiki/Tf*idf) has been used to come up with the weighting. One thing to be cautious about is which dataset is used to compute TF-IDF: it should closely resemble the dataset where the weighting is applied.
  3. Account for document length – A document that consists of the content of all the pages on the web will match any query, but landing on a very large document is not a great experience. At the same time, a document that is identical to the query is also bad, because the user won’t learn anything new when (s)he lands on the page. Review which platform most visitors of your website use: if they are on smartphones, the ideal document length is under 500 words; on tablets, around 1,000 words; on laptops, it can be even longer, depending on the presentation. Some way of normalizing the score by document length should be used; various papers in the IR community describe how to normalize the IR score by document length.
  4. A simple scoring of a document can be

Say page P consists of fields di, each with weight wi, and query Q consists of phrases qk. The length of the page is L, and the number of phrases in the query is Nq.
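One plausible form of such a score, assuming tf(qk, di) denotes how often phrase qk appears in field di, is:

\[
\mathrm{Score}(P, Q) \;=\; f(L) \sum_{k=1}^{N_q} \sum_{i} w_i \cdot tf(q_k, d_i)
\]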

where f is a normalization function based on document length.
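A minimal sketch of this kind of field-weighted scoring, in Python. The term frequencies here are simple substring counts, the phrase weight is just the phrase length, and f(L) is taken to be a logarithmic normalization; the field names and weights in the usage example are illustrative assumptions, not values from the article.

```python
# Field-weighted IR scoring: sum of (field weight x phrase weight x phrase
# frequency) over all query phrases, normalized by a function of page length.

import math

def ngrams(words, max_n=3):
    """All phrases of length 1..max_n (unigrams, bigrams, trigrams, ...)."""
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def ir_score(query, fields, field_weights):
    query_phrases = ngrams(query.lower().split())
    page_length = sum(len(text.split()) for text in fields.values()) or 1

    score = 0.0
    for phrase in query_phrases:
        phrase_weight = len(phrase.split())       # longer phrases weigh more
        for field, text in fields.items():
            tf = text.lower().count(phrase)       # crude term frequency
            score += field_weights.get(field, 0.0) * phrase_weight * tf
    return score / math.log(2 + page_length)      # f(L): one possible normalization

# Usage with the example query from the article (illustrative page and weights)
page = {
    "title": "Canon PowerShot digital camera",
    "h1": "Canon digital camera",
    "body": "A compact digital camera from Canon with image stabilization ...",
}
weights = {"title": 3.0, "h1": 2.0, "body": 1.0}
print(ir_score("canon digital camera", page, weights))
```

In practice, the crude phrase weights would be replaced by TF-IDF weights computed on a corpus that closely resembles the pages being scored, as discussed in step 2.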

Which page has a higher IR Score for “Canon digital camera”?

Both product pages above are for Canon digital cameras, but one of them has a much higher IR score. Can you tell which page’s score is depicted in the table below?

Query word/phrase       Title   H1   Body   Bold   Weight
Canon                     1      1     4      0      1
Digital                   1      1     2      0      1
Camera                    1      1     7      0      1
Canon digital             1      1     0      0      2
Digital camera            1      1     2      0      2
Canon digital camera      1      1     0      0      3

Query Behavior Score – This score is based on how people interact with the page: how often visitors find the page interesting for a given query. Most websites have a way of defining success (also known as conversion). For e-commerce websites, conversion is defined as the purchase of a product or service; for lead-gen websites, as filling out a form; for media websites, it could be interaction with a media element, such as playing a video, or the number of page views. For a query, one can compute the conversion rate and use it directly as the behavior score. The challenge is that this data is typically very sparse. On an e-commerce website, the conversion rate can be as low as 0.5%, which means that for every 200 visits for a given query, on average one conversion will be observed. Long-tail queries, by definition, have low volume, making this computation impossible. There are two ways to address this concern:

  1. Query-level generalization – instead of computing the score for the actual query, compute the score for an abstraction of the query. E.g. the query “canon digital camera” can be abstracted as:
    • A three-word query
    • A query containing a brand name
    • A query with all of its words present in the title of the page
     One can then ask: what is the conversion rate of all queries that are three words long, contain a brand name, and have all of their words in the title of the page? As you can see, this abstraction can be very general or very specific; based on the amount of data available, one can choose an appropriate level of abstraction (a sketch follows this list).
  2. Alternatives to conversion, such as bounce rate – While a conversion rate can be 0.5% or lower, bounce rates are typically in the range of 20–80%. This means that you need significantly fewer visits to evaluate the quality of the page. One needs to be careful, though, as bounce rate may not always be highly correlated with conversion rate.
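A minimal sketch of query-level generalization, assuming three illustrative abstraction features (query length, presence of a brand name, full coverage of the page title) and a hypothetical brand lexicon:

```python
# Aggregate conversions over query "abstractions" instead of individual queries.

from collections import defaultdict

BRANDS = {"canon", "nikon", "sony"}  # hypothetical brand lexicon

def abstract_query(query, page_title):
    words = query.lower().split()
    title_words = set(page_title.lower().split())
    return (
        len(words),                              # query length
        any(w in BRANDS for w in words),         # contains a brand name
        all(w in title_words for w in words),    # all words present in the title
    )

def abstraction_conversion_rates(visits):
    """visits: iterable of (query, page_title, converted) tuples."""
    totals = defaultdict(lambda: [0, 0])         # abstraction -> [conversions, visits]
    for query, title, converted in visits:
        key = abstract_query(query, title)
        totals[key][0] += int(converted)
        totals[key][1] += 1
    return {key: conv / n for key, (conv, n) in totals.items()}

# Usage with a tiny illustrative visit log
log = [
    ("canon digital camera", "Canon digital camera", True),
    ("cheap camera", "Canon digital camera", False),
    ("nikon dslr", "Nikon D750 DSLR", False),
]
print(abstraction_conversion_rates(log))
```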

There is a second set of scores for a page that are computed independently of the query. Some examples are:

Behavioral score of the page – How people perceive a page is a big indicator of its quality. This can be measured by analyzing user behavior. Some of the factors traditionally used are:

  1. Conversion score – Compute the conversion rate of this page independent of the queries leading to this page
  2. Bounce rate – Compute the bounce rate of this page independent of the queries leading to this page
  3. Number of page views – How many pages are viewed in a session after this page is viewed
  4. Number of repeat visitors to this page – How many users keep coming back to this page.
  5. How many people add products to cart after visiting this page?
  6. Average amount of time spent on this page.

Behavioral signals cannot be analyzed in isolation; they have to be analyzed relative to other, similar pages. For example, on an e-retailer’s website, one can compare the behavior of a product page with that of other product pages.

A simple way to compute the score would be
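A plausible form, assuming each feature value is compared against its average over pages of the same type, is:

\[
S \;=\; \sum_{i} w_i \cdot \frac{f_i}{m_{f_i}}
\]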

where fi is the value of a feature (bounce rate, etc.), mfi is the average value of feature fi across all pages of the same type, and wi is the weight given to that feature – one may give a weight of 0.8 to conversion and only 0.1 to bounce rate, which is a very noisy feature. A more sophisticated approach is to look at the number of people who bounce off the website and then click on a different search result for the same search query.
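A minimal sketch of this relative scoring, assuming illustrative feature names and weights, and treating bounce rate as a “lower is better” feature by inverting its ratio:

```python
# Relative behavioral score: each feature is compared to its average over
# pages of the same type, then combined with per-feature weights.

def behavioral_score(page_features, type_averages, weights,
                     lower_is_better=("bounce_rate",)):
    score = 0.0
    for name, value in page_features.items():
        baseline = type_averages.get(name)
        if not baseline:
            continue                     # no baseline available for this feature
        ratio = value / baseline
        if name in lower_is_better and ratio:
            ratio = 1.0 / ratio          # a low bounce rate should raise the score
        score += weights.get(name, 0.0) * ratio
    return score

# Usage: a product page compared against the average of all product pages
page = {"conversion_rate": 0.012, "bounce_rate": 0.35, "pages_per_session": 4.2}
averages = {"conversion_rate": 0.008, "bounce_rate": 0.45, "pages_per_session": 3.0}
weights = {"conversion_rate": 0.8, "bounce_rate": 0.1, "pages_per_session": 0.1}
print(behavioral_score(page, averages, weights))
```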

Reputation of the page – PageRank is a good proxy for how reputable a page is relative to other pages on the site. Another reputation factor is how far the page is from the home page: the number of hops required to reach it when navigating from the homepage.
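A minimal sketch of the hop-count signal, assuming the site’s internal link graph is available as an adjacency list; the distance is computed with a breadth-first search from the homepage:

```python
# Hops from the homepage via BFS over the internal link graph.

from collections import deque

def hops_from_homepage(link_graph, homepage="/"):
    """link_graph: dict mapping each URL to the list of URLs it links to."""
    distance = {homepage: 0}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        for target in link_graph.get(url, []):
            if target not in distance:
                distance[target] = distance[url] + 1
                queue.append(target)
    return distance  # pages missing from the result are unreachable from the homepage

# Usage with an illustrative link graph
site = {
    "/": ["/cameras", "/about"],
    "/cameras": ["/cameras/canon-eos"],
    "/cameras/canon-eos": [],
    "/about": [],
}
print(hops_from_homepage(site))
# {'/': 0, '/cameras': 1, '/about': 1, '/cameras/canon-eos': 2}
```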

Language quality of the page – One can build a language model over the content that people have liked on the site and score the page against this language model. HMMs are typically used to model the page. Some of the papers describing language models are:

http://dl.acm.org/citation.cfm?id=383970, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.1126, http://dl.acm.org/citation.cfm?id=243206
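A minimal sketch of such a language-quality score. Instead of an HMM, it uses a smoothed bigram model trained on content users have liked and scores a page by its average log-probability under that model; the training sentences are illustrative:

```python
# Language-quality scoring with an add-alpha smoothed bigram model.

import math
from collections import Counter

def train_bigram_model(documents):
    unigrams, bigrams = Counter(), Counter()
    for doc in documents:
        words = doc.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def language_score(page_text, model, alpha=1.0):
    """Average log-probability per bigram; higher (closer to 0) reads as more fluent."""
    unigrams, bigrams = model
    vocab = max(len(unigrams), 1)
    words = page_text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return float("-inf")
    log_prob = 0.0
    for w1, w2 in pairs:
        p = (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)
        log_prob += math.log(p)
    return log_prob / len(pairs)

# Usage: train on "liked" content, compare a fluent page to a spammy one
liked = ["compact canon digital camera with image stabilization",
         "canon digital camera reviews and sample photos"]
model = train_bigram_model(liked)
print(language_score("canon digital camera sample photos", model))
print(language_score("camera camera buy buy cheap cheap now", model))
```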

Once all these scores have been computed, the next step is to combine them.

Let’s say the scores are:

IR (IR Score)
B (Behavioral score)
R (Reputation score – or PageRank)
LM (Language Model Score)

A simple way to combine these scores would be

S = wir*IR + wb*B + wr*R + wlm*LM

The weights can be adjusted to reflect how much importance you want to give to each signal. If the page is new, behavioral data will be minimal, so the behavioral score should get a small weight; if the page is older and has accumulated plenty of traffic, it deserves a much higher weight.
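A minimal sketch of this combination, assuming a visit-count threshold as the proxy for how much behavioral data a page has accumulated; the weight values are illustrative:

```python
# Combine the scores: S = wir*IR + wb*B + wr*R + wlm*LM, shifting weight away
# from the behavioral score when there is little traffic to support it.

def combined_score(ir, behavior, reputation, lm, visits, min_visits=200):
    if visits < min_visits:
        weights = {"ir": 0.5, "b": 0.05, "r": 0.25, "lm": 0.2}   # new / low-traffic page
    else:
        weights = {"ir": 0.35, "b": 0.35, "r": 0.2, "lm": 0.1}   # enough behavioral data
    return (weights["ir"] * ir + weights["b"] * behavior
            + weights["r"] * reputation + weights["lm"] * lm)

# Usage
print(combined_score(ir=0.8, behavior=0.6, reputation=0.4, lm=0.7, visits=50))
```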

While the above gives a good perspective on how one can go about computing the score of a page with respect to a query, it requires a reasonable amount of IT investment, which may not be possible for the average marketer. In the next blog post, I will cover some methods that can be applied on top of Google Analytics to approximate these scores.

About Ashutosh

Ashutosh is the Chief Technology Officer (CTO) of BloomReach and a true guru of all things search, with 10 years of information retrieval, machine learning and search experience. Previously, he was a Staff Scientist at Google for 4+ years, during which he contributed to 8 product launches. Prior to that, he was at IBM Research. He is also a prolific publisher and inventor, with a book on machine learning, 30+ papers, and 50+ patents. Ashutosh holds a BTech from IIT Delhi and a PhD from the University of Illinois at Urbana-Champaign. He has received numerous awards, including the best thesis award at IIT Delhi, an IBM Fellowship and an outstanding researcher award at UIUC.

Comments

  1. The model laid out makes perfect sense; however, how does a search engine determine a behavioral score?

    I’ve always guessed last click on search results as a viable ranking factor but that doesn’t seem to quite fit the description you’ve laid out. I just don’t see how any search engine, Google included, can make a reliable behavior model without owning the property in question.

    • “I just don’t see how any search engine, Google included, can make a reliable behavior model without owning the property in question.”

      Google could track users’ behavior on sites with Google Analytics installed. I wonder if the author can shed any light on that.

    • Plus, they actually track IPs that make queries, and if an IP has made a query, went to site A, then went back to the SERP and clicked site B – there you go, a real bounce rate measurement.

  2. Great article.

    It’s shocking to me that most people in the industry are perfectly willing to accept that Google is probably using “user behavior” factors now, but on the other hand they don’t believe that Google is measuring relative Click-Through-Rate (i.e. for this keyword, in this position, people usually click through 1.5% of the time, but this piece of content is only being clicked through .5% of the time, so let’s demote it).

    CTR is by definition the highest-volume data they will be able to obtain for any type of user behavior, since users must *first* click through to the content before behaving.

    Also it’s the primary ranking factor in Adwords (i.e. CTR) after the bid. Paid search and organic search are really identical if you think about it – they are just presenting ordered lists of titles and descriptions (in one case called “creatives”, but it’s essentially the same thing).

    So why would Google use relative CTR in one channel as a ranking factor but not in the other if it’s such a great relevance signal?

  3. Sam,

    Google have stated several times in the past that Google Search doesn’t use Google Analytics data.

    Following is a video of Matt Cutts stating in 2011 that the web spam team don’t use Google Analytics data, but that he couldn’t categorically state that other parts of the quality team don’t use it.

    http://www.youtube.com/watch?v=PZoesvNUPDQ

    However, in June 2012 at SMX Advanced in Seattle, Danny Sullivan questioned Matt about this again, and he not only reconfirmed that the web spam team don’t use Google Analytics but also checked before the conference that Google Search isn’t using it either.

    http://searchengineland.com/live-blog-you-a-with-matt-cutts-at-smx-advanced-123513.

    Al.

  4. I remember reading somewhere that Google did a lot of user testing when working on the Panda / page quality updates, and then used some of that information to update the search algorithm (no idea where I read it now though). There was no need to dive into client analytics data (something that Google have indeed stated they will not do) because they could analyse user behaviour and then examine the content, site structure etc. (i.e. things that can be assessed objectively) to see which sorts of sites are liked and which are not. Sorry, got a feeling I am probably not making sense …..

    My point being – it seemed more likely that they created a correlation between real user behaviour and a variety of types of website to determine which ones people liked the most.

    My assumption was that sites with a lot of thin content, repetitive articles and maybe, poorly written (gibberish) were the ones that people disliked the most, and these factors are key to the computerised quality assessment of sites today. In my personal experience, it was not the previously popular pages on the site that led to the Panda penalty, but other pages on the site that were dragging the whole lot down.

    IMO it is always good to review Google’s list of “what counts as a high-quality site?” – http://googlewebmastercentral.blogspot.co.uk/2011/05/more-guidance-on-building-high-quality.html

  5. Hi Alistair, I wouldn’t always go by what Google tells you. I have read through some Google patents I dug up off the seobythesea website, thanks to Bill Slawski, that indicate that Google does indeed look at user behavior as part of page ranking algorithms. This is a very good follow-up read for those studying the evolution of the Google PageRank formula and how pages are ranked: http://www.seobythesea.com/2010/05/googles-reasonable-surfer-how-the-value-of-a-link-may-differ-based-upon-link-and-document-features-and-user-data/ I believe this article holds more interest in the realm of improving internal site search algos vs trying to understand how this applies to Google. It is not very apples to apples to try to use this article to make theoretical assumptions about how to improve rankings in Google. By the way, as a funny side note.. if you look at the meta description for this page in the Google SERP, it might have a funny way of tickling the immaturity out of readers!

  6. Google doesn’t need to use Analytics data when it has the Google SERP (CTR, bounce rate, subsequent click behavior), Toolbar & Chrome (on-page behavior), AdSense code on many sites (can’t swear what that JS is capturing) and tons of other services powered by Google, not to mention 3rd party APIs (number of shares, likes, tweets, etc.).

    The other thing they have that we frequently forget is an enormous IP footprint. Remember the stink Google made about Bing “stealing” search results? It turned out that Bing was actually “exploring” results for unfamiliar queries they detected from HTTP headers being transmitted through datacenters. What percentage of public HTTP sessions have packets flowing through peering with Google owned infrastructure/datacenters? If companies like Hitwise can collect data from packets flowing through ISPs I would assume Google can learn a tremendous amount just by data mining HTTP traffic.
