Pankaj Mathur is the Vice President Sales for the InfoGroup Licensing Division. He has been with InfoGroup since 2005 and is currently managing the POI data licensing relationships with local search engines, navigation partners and LBS players. He works with the product team at InfoGroup and with customers on new technologies and products.
Pankaj has an MBA in finance from the Carlson School of Management at the University of Minnesota. He brings tremendous experience to the table as he has worked in accounting, as a financial analyst and in project management roles in the past.
He attended the prestigious IIT (Indian Institute of Technology) for his undergraduate degree. He majored in naval architecture and supervised construction of merchant ships in China, South Korea and Thailand before deciding to pursue a master’s degree in finance.
Eric Enge: Could you start off by characterizing the nature of the quantity, the number of businesses, and the kinds of things that InfoGroup does to build and maintain its database of businesses in the United States?
Pankaj Mathur: In a nutshell, we have approximately 15 million businesses in the U.S., 1.4 million in Canada, and between 2.5 and 3 million in the UK. For all three geographies, we typically phone-validate the information. In the North American market last year, we made over 25 million phone calls.
Eric Enge: That’s a lot of phone calls; don’t you get tired of dialing the phone after a while?
Pankaj Mathur: Yes, it is a lot of phone calls! Part of this has to do with the constant changes going on with businesses, especially in the SMB segment. In the economic climate that we are currently experiencing, there is a lot of churn in the marketplace. So there is a lot we have to do just to keep pace with these changes. Often we have to try multiple times to get hold of a business to phone-validate the information.
Eric Enge: Do you have live people conducting these phone calls or is it an automated system?
Pankaj Mathur: It’s all operator-driven, although we do use a smart dialer. We separate our call scripts based on the size of the business. If a business is brand new, there would be no point in asking them who is the head of HR or who is in charge of payroll, because most likely it doesn’t have one. When we call large businesses like Fortune 500 companies, however, we do get into deeper-level data, such as who is in charge of finance, or additional attributes that coincide with the size and relevancy of the business.
Eric Enge: Have you ever run into an issue where a business owner simply thinks that you are trying to sell them something and that your call is a spam call?
Pankaj Mathur: When we started phone-validating businesses in the early 1990s, I am pretty sure there must have been at least a few instances of businesses interpreting our phone calls the way you described. Now that we have been doing this for a little under 20 years, most businesses are familiar with us, and they know for sure that we are not trying to sell anything but rather helping them display information correctly on various search channels. But even today, we do come across a handful of scenarios where a business may refuse to give information, specifically businesses that are financially distressed or owners having a hard time on a personal or business front. After all, they are human, so we have to be respectful of their mood, their emotions, and the situations they may be in.
A small percentage of them either refuse to give us information or ask us to remove their listings from Infogroup’s database, for a variety of reasons. What ends up happening once we suppress a listing is that a lot of them actually end up calling us back asking why their listings are not showing up on major search engines or navigation devices. At that point in time, they are usually happy to have their listings included back in the Infogroup database.
Eric Enge: What about some of the other things you do to maintain the data? Can you provide some metrics about other kinds of things you do to collect the data?
Pankaj Mathur: We get approximately 6,000 phonebooks every year, but we do not necessarily compile each phonebook every year. Some are used for comparative audits to ensure coverage, so the number of books actually compiled fluctuates from year to year based on our requirements. When we think about data, we compile keeping four key guidelines in mind, namely:
- Completeness
- Conformance
- Accuracy
- Relevancy
Let us look at each of these aspects in more detail.
Completeness is a measure of the total number of listings and in some sense reflects coverage. There are different numbers regarding businesses in the U.S.: Infogroup has around 15 million companies in the U.S., the IRS claims about 20 million, and the Chamber of Commerce claims about 24 million. These numbers will vary depending on the definition of a business. For example, from a Chamber of Commerce perspective, if a license was filed back in 1959, it is considered a valid business in 2010 as long as the owner is still alive and has not filed for bankruptcy. Infogroup has defined a business as a brick-and-mortar store having a phone number and location address. We do allow a few exceptions for categories like contractors or realtors working from home.
The other way to look at completeness is to look at the fill rates for the information available. For example, we can look at how many businesses have a location address, say 95%. It is usually productive to understand the drivers behind the 5% of blank addresses; a few of these drivers could be no answer on the phone call, the owner working from home, and so on.
But fill rates will not necessarily make sense in every scenario, for example for SIC codes or Yellow Page categories. A restaurant may have a drive-in, a bar and an ATM on premises, so this restaurant record will have multiple lines of business. But a funeral home, library or school may only have one SIC code. Therefore a lack of secondary SIC codes or Yellow Page headings for a library or school is not necessarily indicative of a lack of completeness for these categories.
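The fill-rate idea above can be sketched in a few lines of Python. This is only a minimal illustration, not Infogroup's method: the `fill_rate` helper, the field names, and the sample listings are all hypothetical.

```python
from typing import Iterable, Mapping, Optional


def fill_rate(records: Iterable[Mapping[str, Optional[str]]], field: str) -> float:
    """Return the fraction of records whose `field` is present and non-blank."""
    records = list(records)
    if not records:
        return 0.0
    filled = sum(1 for r in records if (r.get(field) or "").strip())
    return filled / len(records)


# Hypothetical sample listings; names, fields, and code values are illustrative only.
listings = [
    {"name": "Main St Diner", "address": "123 Main St #45", "secondary_sic": "5813"},
    {"name": "City Library", "address": "10 Oak Ave", "secondary_sic": None},
    {"name": "Home Office Tax Help", "address": "", "secondary_sic": None},
    {"name": "Elm Funeral Home", "address": "77 Elm Rd", "secondary_sic": None},
]

address_rate = fill_rate(listings, "address")          # 3 of 4 records have an address
secondary_rate = fill_rate(listings, "secondary_sic")  # 1 of 4 has a secondary code

print(f"address fill rate: {address_rate:.0%}")
print(f"secondary SIC fill rate: {secondary_rate:.0%}")
```

A low address fill rate would be worth investigating, whereas the low secondary-code rate here says little about completeness, since a library or funeral home may legitimately have only one line of business.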
Eric Enge: If someone is working as a plumber out of their house and they use their house as the brick-and-mortar address, are they counted?
Pankaj Mathur: Yes, that will work. They usually have a P.O. Box number, which is usually within the same zip code or city. Therefore, we are able to provide some level of geocoding based on that, which is good enough for geo-targeting because nobody needs a pinpoint location of their plumber’s home. In some cases they are okay sharing the location address, but we will most likely end up flagging it as a home-based business so that our partners do not use home based addresses to give turn-by-turn directions to somebody’s house.
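As a rough sketch of how a downstream partner might honor such a home-based flag, the Python below withholds the street address and marks the listing as not routable. The `Listing` class, its field names, and `location_for_partner` are hypothetical names chosen for illustration; the interview does not describe the actual data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    """Hypothetical listing record; field names are assumptions."""
    name: str
    city: str
    zip_code: str
    street_address: Optional[str] = None
    home_based: bool = False


def location_for_partner(listing: Listing) -> dict:
    """Return the location detail a partner should expose for this listing."""
    base = {"name": listing.name, "city": listing.city, "zip_code": listing.zip_code}
    if listing.home_based or listing.street_address is None:
        # City/ZIP is enough for geo-targeting; do not route users to a home.
        return {**base, "routable": False}
    return {**base, "street_address": listing.street_address, "routable": True}


plumber = Listing("Joe's Plumbing", "Omaha", "68102",
                  street_address="12 Home Ln", home_based=True)
diner = Listing("Main St Diner", "Omaha", "68102", street_address="123 Main St #45")

print(location_for_partner(plumber))
print(location_for_partner(diner))
```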
Eric Enge: I believe you told me that Google has 70 million businesses on its records in the U.S.?
Pankaj Mathur: You are probably referring to their master index, but nobody really knows for sure because Google does not share this information with any outside parties. But I have noticed that for the last several months the search results on Google Local have fewer duplicates and out-of-business records, credit to the Google Local team! A natural outcome of such exercises is that their master index would have shrunk in size as more redundant listings are removed.
Eric Enge: Do you have any idea where it has dropped to, roughly?
Pankaj Mathur: No, I have no such information, but it will be more than the 15 million listings that we have. Part of the reason is that Google, or any search engine, follows a much broader definition of a business or point of interest. So while Infogroup compiles businesses that can be phone-validated, search engines may include beaches, monuments, or landmarks that may or may not have a location address or phone number.
Eric Enge: Why don’t you tell me about the next key guideline, Conformance?
Pankaj Mathur: Sure, but before I move on to conformance, I want to say one last thing on Completeness. Most data providers attempt to achieve coverage by supplementing sources such as phone books, utilities, tourism guides, industry lists, and Internet scraping. But as we will see under Conformance and Accuracy, such efforts usually create more duplicates in the process.
Conformance in some sense implies standardization or adherence to structure. It has different implications for different pieces of information (name, address, category, etc.), even for the same business record. I have not come across any authoritative definition of Conformance, so for the benefit of the readers, I will try to illustrate with a few simple examples.
- Location address: Conformance in the context of a location address means the data has to conform to street number, street name and suite number (like 123 Main St #45), city, state and zip+4. There are several technologies and software packages available that can help with address hygiene.
- Line of business: depending on the sources compiled, we may come across a Hilton listed under “Banquet Hall” with no mention of it under the “Hotel” category. We have quality control rules and audits in place that help ensure that all Hilton locations are assigned “Hotels” as a primary line of business. Some part of Conformance may be automated, and you can set filters to compile Hilton locations as hotels and have fine dining, bar, etc. as secondary lines of business.
But a good data provider will have processes in place to capture exceptions and handle them appropriately. For example, when compiling a phonebook for Orlando, Florida, it is possible that a Hilton shows up as a golf course. Understandably, most people will not think of a Hilton as a golf course, but given its geographical location, this particular Hilton may have a golf course attached to it. In that case, the robustness of a data provider depends on its ability to capture and handle these exceptions; in our case, that would be phone validation. We can call this Hilton and verify objectively whether there is a golf course attached to it, and then assign the appropriate categories to the record.
Thus conformance can take several different meanings depending on the data element and the nature of information compiled. Conformance plays a large role in the de-duping process and has a significant impact on occurrences of duplicates on account of multiple combinations of name, address and phone numbers.
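To make the link between conformance and de-duping concrete, here is a minimal Python sketch that standardizes name, address and phone before grouping records. It assumes a tiny hand-written abbreviation table and US-style phone numbers; the helper names and sample records are hypothetical, and real address-hygiene software does far more than this.

```python
import re
from collections import defaultdict

# Small illustrative abbreviation table; an assumption, not a standard.
STREET_ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "suite": "#"}


def normalize_phone(phone: str) -> str:
    """Keep digits only and drop a leading US country code if present."""
    digits = re.sub(r"\D", "", phone)
    return digits[1:] if len(digits) == 11 and digits.startswith("1") else digits


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation except '#', and abbreviate common street words."""
    cleaned = re.sub(r"[^\w\s#]", "", text.lower())
    cleaned = cleaned.replace("#", " # ")
    words = [STREET_ABBREVIATIONS.get(w, w) for w in cleaned.split()]
    return " ".join(words)


def dedupe_key(record: dict) -> tuple:
    """Build a comparison key from the standardized name, address and phone."""
    return (
        normalize_text(record["name"]),
        normalize_text(record["address"]),
        normalize_phone(record["phone"]),
    )


def group_duplicates(records: list) -> list:
    """Group records that share the same standardized key."""
    groups = defaultdict(list)
    for record in records:
        groups[dedupe_key(record)].append(record)
    return list(groups.values())


# Hypothetical records: the first two describe the same location in different formats.
records = [
    {"name": "McDonald's", "address": "123 Main Street, Suite 45", "phone": "(555) 010-0000"},
    {"name": "McDonalds", "address": "123 Main St #45", "phone": "555-010-0000"},
    {"name": "Hilton", "address": "9 Lake Road", "phone": "1-555-010-9999"},
]

groups = group_duplicates(records)
print(len(groups), "distinct listings after standardization")
```

Without the standardization step, the first two records would look like different businesses, which is how multiple combinations of name, address and phone number turn into duplicates.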
Accuracy. This is probably the easiest of all four to understand, because it is factual, but Accuracy is also the most expensive aspect of data compilation. Accuracy is closely tied to reliability, which is a little harder to grasp. If you toss a coin and guess heads or tails, your results will be accurate half the time, but tossing a coin does not constitute a reliable decision-making process for most things in life.
So just because you have a listing for a McDonald’s at 123 Main Street, the listing may be accurate but still not reliable depending on the underlying processes used to compile this data. At Infogroup, we use phone validation to ensure reliability of listing information and Accuracy automatically follows from it!
Relevancy can be best correlated to intent. So if I am searching for a McDonald’s, the information on John Doe LLC who owns the location is irrelevant (even if it is accurate). Relevancy is becoming more important, and part of the reason is the nature of evolution of the whole LBS (local business search) ecosystem. About 15 years back, people were happy to just find information online. Up until then, everything was either phone based, printed or word of mouth. So users were glad if some even remotely related stuff would show up in their search results.
Thus it was acceptable, while searching for “Sushi in Santa Clara, California,” if there were a few records for books on how to make sushi, Japanese grocery stores or cooking schools. We were all happy to sort through this mess and get to the part that best met our expectations. But in the last decade, consumers’ expectations have risen; people are using keywords and becoming more focused on intent. Relevancy has acquired several contexts, such as keywords used, time of day, location of the user, the social aspect and so on.
The intent is different when I am searching in front of a desktop than when I am searching on my smartphone at 10 o’clock at night. Due to this evolution of LBS, there are additional attributes that are coming to the forefront, like opening and closing hours, credit cards accepted, ratings, reviews, coupons and so on. The list just goes on.
Eric Enge: Yes, you have to capture the extra detail so people understand how it fits with whatever they are looking for?
Pankaj Mathur: Yes, it makes compilation of local business data challenging and exciting at the same time. Also, it would be hard for anyone to say whether one of the four metrics is more important than the others. It does create an interesting challenge: compiling relevant information in a cost-effective manner. Thus, in addition to phone validation, Infogroup has started accepting direct listing submissions from merchants and corporate chains to compile information on the attributes we discussed under relevancy.
Eric Enge: Say a business only has a P.O. Box for an address; is that something that you would count as a valid business?
Pankaj Mathur: Yes, we do. Such a scenario can occur for a financial advisor or a tax consultant working from home. If we can reach them (by phone) and if we have reason to believe that we can objectively verify the information, then we will compile it.
Eric Enge: Similarly, what about a kiosk-based location like an ordering terminal in a shopping mall or at an airport?
Pankaj Mathur: Great question, Eric. This is a very gray area. I will give you two examples. The short answer is “one size does not fit all”. We come across cases like this where a doughnut chain may have a kiosk or a shelf space inside a retailer like Safeway, Target or Wal-Mart.
When you look at the corporate list, they will tell you that there is a Dunkin’ Donuts or Baskin-Robbins at a particular address, which may actually be a retailer. In this case, what we usually do is make a decision on a case-by-case basis. For the example above, there probably isn’t enough evidence to route somebody looking for Dunkin’ Donuts to a grocery store just because the grocery store has a shelf where you could pick up donuts of certain brands. It is more like SKU-level information, and it is not the brand that’s driving the grocery store.
There are cases like an ATM location inside a bank that is still considered a line of business, and we will compile it.
Eric Enge: Or a FedEx and Kinko’s that share space?
Pankaj Mathur: Yes. In all these cases there is no right answer. We usually consult a lot with our customers and our partners, and we usually get good feedback on what should be done in these scenarios, although there is not always a consensus amongst them!
Eric Enge: So it sort of underscores how it gets complicated. Just to take it a little further, we talked about FedEx and Kinko’s that share locations and the donut shelf in a grocery store for Dunkin’ Donuts. Now let’s talk about FedEx drop-boxes, because people want to know where FedEx drop-boxes are, right? Is that a location for a given business?
Pankaj Mathur: Yes, I will actually expand on that. This will also apply to vending machines, ATMs, public transit locations, DVD rental kiosks, taxi pickup points etc. We are witnessing a lot of demand for such categories, especially on the mobile applications.
As a consumer, I myself wish this data were available with a high level of reliability. There are different sources out there, and Infogroup is working on all the above categories, but so far the sources have been less than reliable. For these categories there is a different set of challenges, which makes it hard to stick to the four guidelines (completeness, conformance, accuracy and relevancy) we discussed earlier. Here are a few examples:
- It is not possible to phone-validate the information for an ATM or DVD rental kiosk when there is no address or phone number.
- The information is rather vague. If we know that there is an ATM machine for a particular bank at O’Hare Airport, it does not provide much relevant information from a user perspective unless there is a precise geocode for walking directions.
- There is no reliable way to track changes of this data. So if the vending machine provider decides to move the machine, which can take as little as 30 minutes, it is difficult to know these changes in real-time.
About a year back, we did a test with one of the national financial brand names that operates ATM machines nationwide. They shared a list of ATM locations with us, and after analyzing the data we found that the bank has a tendency to track legal entities because of the financial nature of the transactions involved. For example, in the particular city in which we tested, there were three businesses, Nicky’s Grill, BBQ’s Grand Palace, and Rumor’s Bar, listed as ATM locations. All three businesses were listed at the same address and shared the same phone number, but we found that the actual business at that address is called Little Nikki’s BBQ & Grill and has a different phone number, which was validated by Infogroup.
Eric Enge: You recently wrote an article about how merchant-submitted listings are not the solution to the local search problem.
Pankaj Mathur: The article actually deals with merchant-submitted listings in exhaustive detail; you can see it at http://www.license.infousa.com/MerchantSubmittedListings.aspx. But let me reiterate a few salient points here. Firstly, at Infogroup we believe that merchant-submitted data is a great source, as it allows us to compile attributes that help connect local search users with relevant merchants. Merchant-submitted listings are also a good source of operational-level data; for example, the bar owner knows better whether there is an ATM machine or DVD rental kiosk inside the premises. The intent of the article is to highlight the fact that data coming from corporate chains may not necessarily comply with the four guidelines, namely completeness, conformance, accuracy and relevance. There are lots of reasons for this, and I am not implying that businesses don’t know basic information about their business.
If you are a big chain corporation like McDonald’s or KFC, managing data on over 10,000 locations can be quite a daunting task. It is true that marketing, operations and accounting will each have lists of their locations, but these may not be the same list. So when a particular store location opens or closes, there is some lag time before these lists get updated.
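The mismatch between departmental lists can be illustrated with a small Python sketch that reports which locations each list is missing relative to the others. The department names, store IDs, and the `missing_by_department` helper are hypothetical; the interview does not say how any chain actually stores these lists.

```python
from typing import Dict, Set


def missing_by_department(lists: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each department, return store IDs found in some other list but not its own."""
    all_ids: Set[str] = set().union(*lists.values())
    return {dept: all_ids - ids for dept, ids in lists.items()}


# Hypothetical store IDs held by three departments of the same chain.
department_lists = {
    "marketing": {"S001", "S002", "S003"},
    "operations": {"S001", "S002", "S004"},
    "accounting": {"S001", "S002", "S003", "S004"},
}

gaps = missing_by_department(department_lists)
for dept, ids in gaps.items():
    print(dept, "is missing", sorted(ids))
```

A gap on its own does not say which list is right: S003 may have closed, or S004 may have just opened. That ambiguity is what independent verification, such as a phone call, is meant to resolve.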
These companies would be better off just coming to Infogroup because we can perform all the steps necessary for getting data in the right shape and form for local search.
Eric Enge: You talked about a few different scenarios in your article, like how you can even have a situation where business owners have interest in providing the wrong information.
Pankaj Mathur: Yes, sometimes it does happen. I mean, there is a fine line between misrepresentation and marketing exaggeration. All marketing messages are designed (or at least intended) to create a feel-good factor about products and services. So from a marketing perspective you can live with messages like “best hotel in town for $39 a night.” I am sure you have friends who have signed up for vacation home packages. In some cases the marketing message gets in the way of data compilation requirements.
Eric Enge: Have you run into situations where there is a fine line between marketing and fantasy? Are there business reasons why people would overtly misrepresent their data?
Pankaj Mathur: There is always going to be a very small, hopefully insignificant, number of people who misrepresent data. We have tons of rules to take care of this small portion of the population that does not subscribe to the law anyway.
But these misrepresentations can also happen due to ignorance. Think about it: suppose I am a locksmith, and over the years I have realized that my competition is doing better than me because they do certain things better, like geo-targeting. Based on my rudimentary knowledge of geo-targeting, I may figure out that if I use the address of an existing business near a residential area, it can solve most of my geo-targeting problems.
If you interpret this strictly, it is misrepresentation, but I don’t think malice is the intent here.
But frankly speaking, at the end of the day, we couldn’t care less. Our job, our value proposition in the local search industry is to make sure that we provide data to our customers which is complete, has conformance, is accurate, is reliable, and has relevancy.
Usually there is a perception, largely amongst data compilers who do not invest as much in compilation efforts, that merchant-submitted listings are “gold” and should be taken at face value. My personal opinion is that merchant-submitted listings are at best “okay”; there is a lot of crap in there that needs to be cleansed to make it valuable. Similarly, the data sources we compile are neither 100 percent accurate nor 100 percent wrong; it’s almost always somewhere in the middle.
So the challenge becomes how do we discard the part which is wrong and hold on to the part which is right.
Eric Enge: Thanks Pankaj!
Pankaj Mathur: Yes, thank you Eric!