Location Extractor. Crawl millions of business websites to find their Office Locations

Problem

Our client offers market research to its customers, helping them identify the supply of and demand for certain businesses in a given region. Their database already held location data for 1 billion+ businesses, and they wanted to expand the number of records in it.

Solution

We built an Internet crawler. Much like a typical search engine's crawler, it visited millions of websites every day and used AI and NLP to identify the office locations each business operates from. These office locations were extracted from specific pages of each website, such as Contact Us, About Us, and Office Locations.

Result

Our client was able to expand their database from 1 billion to 1.5 billion business locations in 6 months. This 50% increase was attributed to the Location Extractor Internet crawler we built for them.

Approach

A lot of websites today are built as rich Internet applications. Because these websites rely heavily on JavaScript, typical HTTP crawlers cannot interpret their content. To solve this, we built a Deep Web Crawler. Our crawler loaded each website in a JavaScript-enabled web browser and followed the network of links on the website much like a human would. To speed up the crawler and save on crawling costs, we also implemented Focused Crawling. Instead of going through the entire content of a website, the crawler used AI to identify just the key pages where location data was most likely to be found, such as Contact Us, About Us, and Office Locations, and navigated directly to them, so that it accessed only the information we were interested in. Finally, we used a combination of Natural Language Processing and Computer Vision to pinpoint the region of each page where location details were mentioned.
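
To make the idea of Focused Crawling more concrete, here is a minimal sketch in Python. It is not the production crawler: the AI page classifier described above is replaced with a simple keyword score, and the names KEYWORD_WEIGHTS, LinkCollector, score_link and rank_candidate_pages are hypothetical, introduced only for this illustration. Given the HTML of a homepage, it ranks same-site links by how likely they are to hold location data, so a crawler could visit only the top few.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: hand-picked keyword weights stand in for the AI model
# that decides which pages are likely to contain office locations.
KEYWORD_WEIGHTS = {"contact": 3, "location": 3, "office": 2, "about": 1}


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def score_link(href, text):
    """Sum the weights of every keyword found in the link URL or its text."""
    haystack = f"{href} {text}".lower()
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in haystack)


def rank_candidate_pages(base_url, html, limit=3):
    """Return up to `limit` same-site URLs, most promising first."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    scored = {}
    for href, text in collector.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc != base_host:
            continue  # stay on the business's own website
        score = score_link(href, text)
        if score > 0:
            scored[url] = max(score, scored.get(url, 0))
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:limit]]
```

In a real pipeline the HTML would come from the JavaScript-enabled browser after the page has rendered, and the keyword score would be swapped for a trained classifier. The ranking step stays the same: links with a score of zero are never visited, which is where the savings in crawl time and cost come from.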