How to Scrape Yelp with Python

Scrape yelp business listings

Yelp.com hosts reviews that show how much people like a business. You can scrape Yelp and analyze the reviews to conduct market research. Moreover, the reviews will help you generate ideas for improving your products and services.

This tutorial will show you how to scrape Yelp data using Python. The code uses Python requests to manage HTTP requests and lxml to parse HTML.

The Environment for Web Scraping Yelp

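Both requests and lxml are external Python libraries, so you must install them separately using pip. The script also imports unicodecsv to write the CSV file, and that is an external library as well. You can use this command to install all three.

pip install lxml requests unicodecsv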

Data Scraped from Yelp

The code will scrape Yelp for these details from its search results page.

  • Business name
  • Rank
  • Review count
  • Categories
  • Ratings
  • Price range
  • Yelp URL

Screenshot showing the data extracted from the search results page while web scraping Yelp

These will be in the JSON data inside a script tag of the search results page; there is no need to figure out XPaths for individual data points.

However, the code also makes HTTPS requests to the URL of each business listing extracted in the previous step and extracts more details. The code uses the XPath syntax to locate and extract these details, which include

  • Name
  • Working hours
  • Featured info
  • Phone number
  • Rating
  • Address
  • Price Range
  • Claimed Status
  • Review Count
  • Category
  • Website
  • Longitude and Latitude

Screenshot showing the business details extracted from a business listing page while web scraping Yelp

Screenshot showing data working hours table extracted from a business listing page while web scraping Yelp

Screenshot showing the contact details extracted from a business listing page while web scraping Yelp

Screenshot showing details of amenities extracted from a business listing page while web scraping Yelp

The Code for Web Scraping Yelp

To scrape Yelp using Python, import the libraries mentioned above, namely requests and lxml. These are the core libraries required for scraping Yelp data. The other packages imported are json, argparse, urllib.parse, re, and unicodecsv.

  • The json module is necessary to parse JSON content from Yelp and save the data to a JSON file.
  • The argparse module allows you to pass arguments from the command line.
  • unicodecsv helps you save the scraped data as a CSV file.
  • urllib.parse enables you to manipulate the URL string.
  • The re module handles regular expressions.

from lxml import html
import unicodecsv as csv
import requests
import argparse
import json
import re
import urllib.parse

You will define two functions in this code: parse() and parseBusiness().

parse()

The function parse()

  • Sends HTTP requests to the search results page
  • Parses responses, extracts business listings
  • Returns the scraped data as objects

parse() sends requests to Yelp.com with headers that make the request look like it comes from a regular browser. It uses a loop to retry the HTTP request up to 10 times, stopping as soon as it gets the status code 200.

headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
      'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
      'dpr': '1',
      'sec-fetch-dest': 'document',
      'sec-fetch-mode': 'navigate',
      'sec-fetch-site': 'none',
      'sec-fetch-user': '?1',
      'upgrade-insecure-requests': '1',
      'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
success = False
   
for _ in range(10):
    response = requests.get(url, verify=False, headers=headers)
    if response.status_code == 200:
        success = True
        break
    else:
        print("Response received: %s. Retrying : %s"%(response.status_code, url))
        success = False
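
The retry loop above repeats the request immediately after a failure. A minimal sketch of the same idea with a growing pause between attempts is shown here. The function name fetch_with_retries, its parameters, and the 30-second cap are assumptions made for illustration; fetch stands for any callable that returns an object with a status_code attribute, such as a small wrapper around requests.get.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=10, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or attempts run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute (hypothetical here; a wrapper around requests.get would fit).
    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code == 200:
            return response
        # Wait 1s, 2s, 4s, ... (capped at 30s) before the next attempt.
        if attempt < max_attempts - 1:
            sleep(min(base_delay * 2 ** attempt, 30.0))
    return None
```

Returning None on failure lets the caller decide whether to skip the page or stop, instead of parsing an error response.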

Once you get the response, you can parse it using html.fromstring().

parser = html.fromstring(response.text)

You can now extract the JSON data from the parsed object. After extracting it, you will also clean the data by removing unnecessary characters and spaces.

raw_json = parser.xpath("//script[contains(@data-hypernova-key,'yelpfrontend')]//text()")
cleaned_json = raw_json[0].replace('<!--', '').replace('-->', '').strip()

The code then parses the JSON data using json.loads() and extracts the search results.

json_loaded = json.loads(cleaned_json)
search_results = json_loaded['legacyProps']['searchAppProps']['searchPageProps']['mainContentComponentsListProps']
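
Chained key lookups like the one above raise a KeyError as soon as one level is missing. As a rough sketch, the cleaning and lookup steps could be wrapped in two small helpers. The names load_embedded_json and dig are hypothetical, and the sample payload is made up to mimic the shape described in this tutorial; it is not real Yelp data.

```python
import json


def load_embedded_json(script_text):
    """Strip the HTML comment markers around the payload and parse it."""
    cleaned = script_text.replace('<!--', '').replace('-->', '').strip()
    return json.loads(cleaned)


def dig(data, *keys, default=None):
    """Follow a chain of dictionary keys, returning default if any is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# A made-up payload that mimics the structure used in this tutorial.
sample = ('<!--{"legacyProps": {"searchAppProps": {"searchPageProps": '
          '{"mainContentComponentsListProps": '
          '[{"searchResultBusiness": {"name": "Example Cafe"}}]}}}}-->')
payload = load_embedded_json(sample)
results = dig(payload, 'legacyProps', 'searchAppProps', 'searchPageProps',
              'mainContentComponentsListProps', default=[])
```

With default=[], a changed JSON layout yields an empty result list rather than an exception, which makes the failure easier to spot and handle.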

You can then iterate through the search results and obtain the required data with get().

for results in search_results:
    # Ad entries don't have this key.
    result = results.get('searchResultBusiness')
    if result:
        is_ad = result.get('isAd')
        price_range = result.get('priceRange')
        position = result.get('ranking')
        name = result.get('name')
        ratings = result.get('rating')
        reviews = result.get('reviewCount')
        category_list = result.get('categories')
        url = "https://www.yelp.com" + result.get('businessUrl')

The function then

  • stores the scraped data in a dictionary
  • appends it to an array
  • returns the array

category = []
for categories in category_list:
    category.append(categories['title'])
business_category = ','.join(category)

# Filtering out ads
if not is_ad:
    data = {
        'business_name': name,
        'rank': position,
        'review_count': reviews,
        'categories': business_category,
        'rating': ratings,
        'price_range': price_range,
        'url': url
    }
    scraped_data.append(data)
return scraped_data
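
The steps above assume that every result carries a categories list and a businessUrl. One possible way to make the row-building step tolerant of missing values is sketched below. The function name build_row is an assumption for illustration; the dictionary keys are the ones used in this tutorial.

```python
def build_row(result):
    """Flatten one 'searchResultBusiness' dictionary into a CSV-friendly row.

    Returns None for ads so the caller can skip them.
    """
    if result.get('isAd'):
        return None
    # 'categories' may be missing or None; treat both as an empty list.
    titles = [c.get('title', '') for c in (result.get('categories') or [])]
    business_url = result.get('businessUrl') or ''
    return {
        'business_name': result.get('name'),
        'rank': result.get('ranking'),
        'review_count': result.get('reviewCount'),
        'categories': ','.join(t for t in titles if t),
        'rating': result.get('rating'),
        'price_range': result.get('priceRange'),
        'url': 'https://www.yelp.com' + business_url if business_url else None,
    }
```

A caller could then collect rows with a loop that appends only the values that are not None.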

parseBusiness()

The parseBusiness() function extracts further details from the page of each business that parse() returned.

As before, the function makes an HTTP request to the URL of the Yelp business page and parses the response. However, this time, you will use XPaths.

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
response = requests.get(url, headers=headers, verify=False).text
parser = html.fromstring(response)

You can figure out XPaths by inspecting the page source. To do that, right-click the element on the webpage and select Inspect. For example, this is the XPath for the price range.

//span[@class=' css-14r9eb']/text()

Similarly, you can find all the XPaths and extract the corresponding data.

raw_name = parser.xpath("//h1//text()")
raw_claimed = parser.xpath("//span[@class=' css-1luukq']//text()")[1] if parser.xpath("//span[@class=' css-1luukq']//text()") else None
raw_reviews = parser.xpath("//span[@class=' css-1x9ee72']//text()")
raw_category  = parser.xpath('//span[@class=" css-1xfc281"]//text()')
hours_table = parser.xpath("//table[contains(@class,'hours-table')]//tr")
details_table = parser.xpath("//span[@class=' css-1p9ibgf']/text()")
raw_map_link = parser.xpath("//a[@class='css-1inzsq1']/div/img/@src")
raw_phone = parser.xpath("//p[@class=' css-1p9ibgf']/text()")
raw_address = parser.xpath("//p[@class=' css-qyp8bo']/text()")
raw_wbsite_link = parser.xpath("//p/following-sibling::p/a/@href")
raw_price_range = parser.xpath("//span[@class=' css-14r9eb']/text()")[0] if parser.xpath("//span[@class=' css-14r9eb']/text()") else None
raw_ratings = parser.xpath("//span[@class=' css-1fdy0l5']/text()")[0] if parser.xpath("//span[@class=' css-1fdy0l5']/text()") else None

You can then clean each value by stripping extra spaces.

name = ''.join(raw_name).strip()
phone = ''.join(raw_phone).strip()
address = ' '.join(' '.join(raw_address).split())
price_range = ''.join(raw_price_range).strip() if raw_price_range else None
claimed_status = ''.join(raw_claimed).strip() if raw_claimed else None
reviews = ''.join(raw_reviews).strip()
category = ' '.join(raw_category)
cleaned_ratings = ''.join(raw_ratings).strip() if raw_ratings else None

However, you must iterate through the hours table to find the working hours.

working_hours = []
   
for hours in hours_table:
    if hours.xpath(".//p//text()"):
        day = hours.xpath(".//p//text()")[0]
        timing = hours.xpath(".//p//text()")[1]
        working_hours.append({day:timing})
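
The loop above reads the second text fragment of each row, which fails if a row holds only one. A small sketch of the pairing logic on plain lists of strings follows, so it can be tried without a live page. The helper name hours_to_dict and the sample rows are assumptions for illustration.

```python
def hours_to_dict(rows):
    """Turn rows of text fragments into {day: hours}.

    Each row is the list of strings found in one table row, with the
    day name first and the opening hours second. Rows with fewer than
    two non-empty fragments (spacers, headers) are skipped.
    """
    schedule = {}
    for fragments in rows:
        cleaned = [f.strip() for f in fragments if f.strip()]
        if len(cleaned) >= 2:
            schedule[cleaned[0]] = cleaned[1]
    return schedule


# Made-up rows standing in for the text extracted from each table row.
sample_rows = [['Mon', '9:00 AM - 5:00 PM'], [], ['Tue'], [' Wed ', ' Closed ']]
hours = hours_to_dict(sample_rows)
```

A single dictionary keyed by day is often easier to query than a list of one-entry dictionaries, although either shape serializes to JSON.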

The business’s website URL will be inside another link, so you must decode it using regular expressions and urllib.parse.

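if raw_wbsite_link:
    decoded_raw_website_link = urllib.parse.unquote(raw_wbsite_link[0])
    print(decoded_raw_website_link)
    # The business URL sits between "biz_redir?url=" and "&website_link".
    website_match = re.findall(r"biz_redir\?url=(.*)&website_link", decoded_raw_website_link)
    website = website_match[0] if website_match else ''
else:
    website = ''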
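
As an alternative to the regular expression, the redirect link can be treated as a URL with a query string. The sketch below uses urlsplit and parse_qs from the standard library; the function name extract_website and the sample links are assumptions, and it presumes the target address is carried in a query parameter named url, as the pattern above suggests.

```python
from urllib.parse import urlsplit, parse_qs


def extract_website(redirect_href):
    """Return the target of a biz_redir link, or '' if there is none."""
    query = parse_qs(urlsplit(redirect_href).query)
    # parse_qs percent-decodes values and returns a list per key.
    return query.get('url', [''])[0]


# A made-up redirect link in the shape the regular expression expects.
sample_href = '/biz_redir?url=https%3A%2F%2Fexample.com%2Fmenu&website_link_type=website'
website = extract_website(sample_href)
```

Parsing the query before decoding keeps any ampersands inside the target address from being confused with the separators of the redirect link itself.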

Similarly, you require regular expressions to get the business location’s longitude and latitude.

if raw_map_link:
    decoded_map_url = urllib.parse.unquote(raw_map_link[0])
    # Both dots are escaped so they match a literal decimal point.
    coordinates = re.findall(r"center=([+-]?\d+\.\d+,[+-]?\d+\.\d+)", decoded_map_url)
    if coordinates:
        map_coordinates = coordinates[0].split(',')
        latitude = map_coordinates[0]
        longitude = map_coordinates[1]
    else:
        latitude = ''
        longitude = ''
else:
    latitude = ''
    longitude = ''
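
The same extraction can be sketched as a helper that returns numbers instead of strings. The function name extract_coordinates and the sample map address are hypothetical; the pattern assumes the map image URL contains a center=latitude,longitude parameter, as in the code above.

```python
import re
from urllib.parse import unquote

# Two capture groups: latitude and longitude, each a signed decimal number.
CENTER_RE = re.compile(r"center=([+-]?\d+\.\d+),([+-]?\d+\.\d+)")


def extract_coordinates(map_src):
    """Return (latitude, longitude) as floats, or None if absent."""
    match = CENTER_RE.search(unquote(map_src))
    if not match:
        return None
    return float(match.group(1)), float(match.group(2))


# A made-up map image address with a percent-encoded comma.
sample_src = 'https://maps.example.com/staticmap?center=38.9072%2C-77.0369&zoom=15'
coordinates = extract_coordinates(sample_src)
```

Returning None when no coordinates are found makes the missing case explicit, whereas empty strings can slip into later numeric processing unnoticed.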

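Finally, you will save all the extracted business details to a dict, which the function will return. The info value is the details table extracted earlier, and ratings is the numeric part of the cleaned rating text; the complete code shows both assignments.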

data={'working_hours':working_hours,
        'info':info,
        'name':name,
        'phone':phone,
        'ratings':ratings,
        'address':address,
        'price_range':price_range,
        'claimed_status':claimed_status,
        'reviews':reviews,
        'category':category,
        'website':website,
        'latitude':latitude,
        'longitude':longitude,
        'url':url
    }
return data

Next,

  • set up argparse to accept the zip code and search keywords from the command line.
    argparser = argparse.ArgumentParser()
    argparser.add_argument('place', help='Location/ Address/ zip code')
    search_query_help = """Available search queries are:\n
                            Restaurants,\n
                            Breakfast & Brunch,\n
                            Coffee & Tea,\n
                            Delivery,
                            Reservations"""
    argparser.add_argument('search_query', help=search_query_help)
    args = argparser.parse_args()
    place = args.place
    search_query = args.search_query
    
  • call parse()
    yelp_url = "https://www.yelp.com/search?find_desc=%s&find_loc=%s" % (search_query, place)
    print("Retrieving :", yelp_url)
    # Calling the parse function
    scraped_data = parse(yelp_url)
    
  • use DictWriter() to write the data to a CSV file by writing
    • the header of the CSV file using writeheader()
    • each row with writerow() using a loop
    # Writing the data
    with open("scraped_yelp_results_for_%s_in_%s.csv" % (search_query, place), "wb") as fp:
        fieldnames = ['rank', 'business_name', 'review_count', 'categories', 'rating', 'price_range', 'url']
        writer = csv.DictWriter(fp, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        if scraped_data:
            print("Writing data to output file")
            for data in scraped_data:
                writer.writerow(data)
  • call parseBusiness() in a loop and write the extracted details to a JSON file
    for data in scraped_data:
        bizData = parseBusiness(data.get('url'))
        yelp_id = data.get('url').split('/')[-1].split('?')[0]
        print("extracted " + yelp_id)
        with open(yelp_id + ".json", 'w') as fp:
            json.dump(bizData, fp, indent=4)
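
The search URL in the steps above is built with plain string formatting, so a query such as Breakfast & Brunch would put a raw ampersand and spaces into the address. A minimal sketch that encodes the parameters with urlencode from the standard library follows. The function name build_search_url is an assumption; the parameter names find_desc and find_loc are the ones this tutorial uses.

```python
from urllib.parse import urlencode


def build_search_url(search_query, place):
    """Build a Yelp search URL with properly encoded query parameters."""
    params = urlencode({'find_desc': search_query, 'find_loc': place})
    return 'https://www.yelp.com/search?' + params


search_url = build_search_url('Breakfast & Brunch', '20001')
```

urlencode replaces spaces with plus signs and escapes the ampersand, so the whole search phrase stays inside the find_desc parameter.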
    

Here is the complete code for web scraping Yelp.

from lxml import html
import unicodecsv as csv
import requests
import argparse
import json
import re
import urllib.parse



def parse(url):
    headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
      'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
      'dpr': '1',
      'sec-fetch-dest': 'document',
      'sec-fetch-mode': 'navigate',
      'sec-fetch-site': 'none',
      'sec-fetch-user': '?1',
      'upgrade-insecure-requests': '1',
      'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
    success = False
   
    for _ in range(10):
        response = requests.get(url, verify=False, headers=headers)
        if response.status_code == 200:
            success = True
            break
        else:
            print("Response received: %s. Retrying : %s"%(response.status_code, url))
            success = False
   
    if not success:
        print("Failed to process the URL: ", url)
        # Stop here instead of parsing an error page.
        return []
   
    parser = html.fromstring(response.text)
    raw_json = parser.xpath("//script[contains(@data-hypernova-key,'yelpfrontend')]//text()")
    scraped_data = []
    if raw_json:
        print('Grabbing data from new UI')
        cleaned_json = raw_json[0].replace('<!--', '').replace('-->', '').strip()
        json_loaded = json.loads(cleaned_json)
        search_results = json_loaded['legacyProps']['searchAppProps']['searchPageProps']['mainContentComponentsListProps']
       
        for results in search_results:
            # Ad entries don't have this key.
            result = results.get('searchResultBusiness')
            if result:
                is_ad = result.get('isAd')
                price_range = result.get('priceRange')
                position = result.get('ranking')
                name = result.get('name')
                ratings = result.get('rating')
                reviews = result.get('reviewCount')
                category_list = result.get('categories')
                url = "https://www.yelp.com"+result.get('businessUrl')
               
                category = []
                for categories in category_list:
                    category.append(categories['title'])
                business_category = ','.join(category)


                # Filtering out ads
                if not(is_ad):
                    data = {
                        'business_name': name,
                        'rank': position,
                        'review_count': reviews,
                        'categories': business_category,
                        'rating': ratings,
                        'price_range': price_range,
                        'url': url
                    }
                    scraped_data.append(data)
    # Return the list even when no JSON was found, so the caller can iterate safely.
    return scraped_data
   
def parseBusiness(url):


    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False).text
    parser = html.fromstring(response)
    print("Parsing the page")
    raw_name = parser.xpath("//h1//text()")
    raw_claimed = parser.xpath("//span[@class=' css-1luukq']//text()")[1] if parser.xpath("//span[@class=' css-1luukq']//text()") else None
    raw_reviews = parser.xpath("//span[@class=' css-1x9ee72']//text()")
    raw_category  = parser.xpath('//span[@class=" css-1xfc281"]//text()')
    hours_table = parser.xpath("//table[contains(@class,'hours-table')]//tr")
    details_table = parser.xpath("//span[@class=' css-1p9ibgf']/text()")
    raw_map_link = parser.xpath("//a[@class='css-1inzsq1']/div/img/@src")
    raw_phone = parser.xpath("//p[@class=' css-1p9ibgf']/text()")
    raw_address = parser.xpath("//p[@class=' css-qyp8bo']/text()")
    raw_wbsite_link = parser.xpath("//p/following-sibling::p/a/@href")
    raw_price_range = parser.xpath("//span[@class=' css-14r9eb']/text()")[0] if parser.xpath("//span[@class=' css-14r9eb']/text()") else None
    raw_ratings = parser.xpath("//span[@class=' css-1fdy0l5']/text()")[0] if parser.xpath("//span[@class=' css-1fdy0l5']/text()") else None


    working_hours = []
   
    for hours in hours_table:
        if hours.xpath(".//p//text()"):
            day = hours.xpath(".//p//text()")[0]
            timing = hours.xpath(".//p//text()")[1]
            working_hours.append({day:timing})
       
    info = details_table
   
    name = ''.join(raw_name).strip()
    phone = ''.join(raw_phone).strip()
    address = ' '.join(' '.join(raw_address).split())
    price_range = ''.join(raw_price_range).strip() if raw_price_range else None
    claimed_status = ''.join(raw_claimed).strip() if raw_claimed else None
    reviews = ''.join(raw_reviews).strip()
    category = ' '.join(raw_category)
    cleaned_ratings = ''.join(raw_ratings).strip() if raw_ratings else None


    if raw_wbsite_link:
        decoded_raw_website_link = urllib.parse.unquote(raw_wbsite_link[0])
        print(decoded_raw_website_link)
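        # The business URL sits between "biz_redir?url=" and "&website_link".
        website_match = re.findall(r"biz_redir\?url=(.*)&website_link", decoded_raw_website_link)
        website = website_match[0] if website_match else ''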
    else:
        website = ''
   
    if raw_map_link:
        decoded_map_url =  urllib.parse.unquote(raw_map_link[0])
        # Both dots are escaped so they match a literal decimal point.
        coordinates = re.findall(r"center=([+-]?\d+\.\d+,[+-]?\d+\.\d+)", decoded_map_url)
        if coordinates:
            map_coordinates = coordinates[0].split(',')
            latitude = map_coordinates[0]
            longitude = map_coordinates[1]
        else:
            latitude = ''
            longitude = ''
    else:
        latitude = ''
        longitude = ''


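    # Pull the numeric part out of the rating text; fall back to 0 if none is found.
    rating_match = re.search(r"\d+(?:[.,]\d+)?", cleaned_ratings) if cleaned_ratings else None
    ratings = rating_match.group(0) if rating_match else 0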
   
    data={'working_hours':working_hours,
        'info':info,
        'name':name,
        'phone':phone,
        'ratings':ratings,
        'address':address,
        'price_range':price_range,
        'claimed_status':claimed_status,
        'reviews':reviews,
        'category':category,
        'website':website,
        'latitude':latitude,
        'longitude':longitude,
        'url':url
    }
    return data


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('place', help='Location/ Address/ zip code')
    search_query_help = """Available search queries are:\n
                            Restaurants,\n
                            Breakfast & Brunch,\n
                            Coffee & Tea,\n
                            Delivery,
                            Reservations"""
    argparser.add_argument('search_query', help=search_query_help)
    args = argparser.parse_args()
    place = args.place
    search_query = args.search_query
    yelp_url = "https://www.yelp.com/search?find_desc=%s&find_loc=%s" % (search_query, place)
    print("Retrieving :", yelp_url)
    # Calling the parse function
    scraped_data = parse(yelp_url)
    # Writing the data
    with open("scraped_yelp_results_for_%s_in_%s.csv" % (search_query,place), "wb") as fp:
        fieldnames = ['rank', 'business_name', 'review_count', 'categories', 'rating', 'price_range', 'url']
        writer = csv.DictWriter(fp, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        if scraped_data:
            print ("Writing data to output file")  
            for data in scraped_data:
                writer.writerow(data)
    # Extracting details from business pages
    for data in scraped_data:
        bizData = parseBusiness(data.get('url'))
        yelp_id = data.get('url').split('/')[-1].split('?')[0]
        print("extracted " + yelp_id)
        with open(yelp_id + ".json", 'w') as fp:
            json.dump(bizData, fp, indent=4)

Here is the data extracted from Yelp.

Screenshot showing a sample of data extracted while web scraping
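
The script writes its CSV file with unicodecsv, which expects a file opened in binary mode. On Python 3, the standard library csv module handles Unicode text directly, so a variant without the extra dependency is possible. The sketch below is one way to do it; the function name write_rows and the sample row are assumptions, while the field names are the ones the script uses.

```python
import csv
import io

FIELDNAMES = ['rank', 'business_name', 'review_count', 'categories',
              'rating', 'price_range', 'url']


def write_rows(rows, fp):
    """Write search-result dictionaries to an open text file as CSV."""
    writer = csv.DictWriter(fp, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# In a script you would open a real file in text mode:
#   with open('results.csv', 'w', newline='', encoding='utf-8') as fp:
#       write_rows(scraped_data, fp)
# An in-memory buffer and a made-up row are used here for demonstration.
buffer = io.StringIO()
write_rows([{'rank': 1, 'business_name': 'Example Café', 'review_count': 10,
             'categories': 'Coffee & Tea', 'rating': 4.5,
             'price_range': '$$', 'url': 'https://www.yelp.com/biz/example'}],
           buffer)
```

Opening the file with newline='' lets the csv module control line endings itself, which avoids blank lines between rows on Windows.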

Code Limitations for Web Scraping Yelp

You can collect data from Yelp using this code for the time being. However, Yelp may change the JSON structure or the class names used in the XPaths at any time. When that happens, you must update the code to reflect the changes.

Moreover, this code won’t be sufficient if you want to scrape data from Yelp on a large scale. You must consider advanced techniques like proxy rotation to bypass Yelp’s anti-scraping measures.
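
To give a rough idea of what proxy rotation involves, the sketch below cycles through a pool of proxy addresses and yields a mapping in the format that the proxies argument of requests.get accepts. The addresses are placeholders and the function name rotating_proxies is an assumption; a production setup would also need error handling and pacing between requests.

```python
import itertools

# Placeholder addresses; replace them with proxies you are allowed to use.
PROXY_POOL = [
    'http://proxy-1.example.com:8080',
    'http://proxy-2.example.com:8080',
    'http://proxy-3.example.com:8080',
]


def rotating_proxies(pool):
    """Yield requests-style proxy mappings, cycling through the pool forever."""
    for address in itertools.cycle(pool):
        yield {'http': address, 'https': address}


proxies = rotating_proxies(PROXY_POOL)
# Each request takes the next mapping, for example:
#   requests.get(url, headers=headers, proxies=next(proxies))
```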

Using The Script

You can use the script from the command line with a zip code or location name and a search query. For example,

python yelp_scraper.py 20001 Restaurants

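You can get the syntax by using the -h flag. The output looks similar to this.

usage: yelp_scraper.py [-h] place search_query

positional arguments:
 place         Location/ Address/ zip code
 search_query  Available search queries are: Restaurants, Breakfast & Brunch,
               Coffee & Tea, Delivery, Reservations

optional arguments:
 -h, --help    show this help message and exit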

Wrapping Up

It is possible to scrape Yelp using Python to gather data about your competitors and understand the pain points of your target customers. The Python libraries requests and lxml can do the job.

But remember to watch for any changes to Yelp.com’s HTML structure. Whenever Yelp’s HTML structure changes, you must figure out the new XPaths for data points and JSON data.

However, you don’t have to go to all this trouble to scrape Yelp.

You can try the ScrapeHero Yelp Scraper from the ScrapeHero cloud for free. It offers a no-code solution, so you don’t need to learn how to scrape Yelp data to use it.

You can also forget about modifying the code for large-scale data extraction. ScrapeHero services can help you with that.

ScrapeHero is an enterprise-grade web scraping service provider. Our services range from large-scale web scraping and crawling to custom robotic process automation. Leave all the coding to ScrapeHero; you only need to tell us what you need.
