Tripadvisor Scraping: How to Get Started

If you want reliable tourist data, Tripadvisor is a great resource. However, copying data manually can be a real hassle, which is where Tripadvisor scraping comes in. Here’s how to scrape Tripadvisor using Python. 

Tripadvisor Scraping: Setting Up Your Environment

You’ll need three key Python libraries for Tripadvisor data scraping:

    1. Selenium: To control the browser and perform actions like clicking buttons.
    2. lxml: To parse HTML and locate the data points.
    3. requests: To handle HTTP requests.

Just use Python’s pip to install these packages.

pip install selenium lxml requests

Now that you have everything set up, you’re ready to plan your scraper. And the first step is to know what you’ll scrape.

Tripadvisor Scraping: Data to Scrape

The scraper pulls six data points from Tripadvisor:

  1. Rank
  2. Hotel Name
  3. Price
  4. URL
  5. Review Count
  6. Tripadvisor Rating

You’ll spot the HTML tags holding this data by using your browser’s inspect feature.

Inspect panel showing HTML tags holding the data points

By using the ‘Inspect’ feature you can build XPaths. If you don’t know how to do that, here’s a nice XPath cheat sheet that’ll get you going.

Here you’ll need the following XPaths:

  1. URL: './/div[@data-automation="hotel-card-title"]//a/@href'
  2. Review Count: './/span/span[contains(text(),"reviews")]//text()'
  3. Rating: './/div[contains(@aria-label,"reviews")]/@aria-label'
  4. Name: './/div[@data-automation="hotel-card-title"]//h3/text()'
  5. Price: './/span[contains(@data-automation,"Price")]/text()'

There’s no need for an XPath for getting the rank as you can get it from the hotel name.
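Since the card title embeds the rank (for example, “1. Moxy NYC Times Square”), you can split it with plain Python string methods. A minimal sketch, assuming that title format:

```python
# Hypothetical card title text; Tripadvisor prefixes the hotel name with its rank
title = "1. Moxy NYC Times Square"

# Split on the first period only, so hotel names containing periods stay intact
rank, name = title.split(".", 1)

print(rank.strip())  # 1
print(name.strip())  # Moxy NYC Times Square
```

Splitting on the first period only is slightly safer than a plain split('.') when a hotel name itself contains a period.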

After understanding the XPaths, you can start coding.

Tripadvisor Scraping: Building The Scraper

Start by importing the modules you need:

  1. time: to get the current time
  2. sleep: to pause script execution
  3. json: to parse API responses and write the extracted data to a JSON file
  4. argparse: to create a command-line interface (CLI) for your script
  5. requests: to make HTTP requests
  6. datetime: to change date formats
  7. lxml.html: to parse the HTML source code
  8. selenium.webdriver: to control the browser
  9. selenium.webdriver.common.by: to specify the selectors for finding elements
from time import time, sleep
import json
import argparse, requests
from datetime import datetime
from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By

Next, organize the code by breaking it down into functions. This code uses three key functions:

  1. get_json(): Gets the URL for the list of hotels of a particular location
  2. get_response(): Pulls the HTML source code of the page holding the hotel list
  3. parse(): Calls the above two functions and extracts the data points

get_json()

First, you’ll need to grab the appropriate Tripadvisor URL for a given locality; you can get that from the site’s Content API. This API returns a JSON containing the required URL. So the get_json() function:

  1. Takes the API endpoint URL as input
  2. Makes an HTTP request to this URL using Python requests
  3. Parses the response
  4. Returns the JSON data
def get_json(url):
    headers = {
                 "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,"
                 "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                 "accept-language": "en-GB;q=0.9,en-US;q=0.8,en;q=0.7",
                 "dpr": "1",
                 "sec-fetch-dest": "document",
                 "sec-fetch-mode": "navigate",
                 "sec-fetch-site": "none",
                 "sec-fetch-user": "?1",
                 "upgrade-insecure-requests": "1",
                 "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                 "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
             }
   
    response = requests.get(url, headers=headers, verify=False)
    json_data = json.loads(response.text)

    with open("geo_url.json", "w") as f:
        json.dump(json_data, f, indent=4)

    return json_data
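parse() later pulls the search-results page URL out of this JSON. With a simplified, hypothetical response shape (the real payload carries many more fields per result), the extraction looks like this:

```python
# A simplified, hypothetical shape of the TypeAheadJson response
api_response = {
    "results": [
        {"url": "/Hotels-g60763-New_York_City_New_York-Hotels.html"}
    ]
}

# parse() builds the search-results page URL from the first result
url = "http://www.tripadvisor.com" + api_response["results"][0]["url"]
print(url)
```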

get_response()

The get_response() function fetches the HTML source code of a URL. This function:

  1. Accepts a URL 
  2. Launches the Selenium browser
  3. Navigates to the URL
  4. Extracts and returns the HTML source code
def get_response(url, checkin, checkout):
    driver = webdriver.Chrome()
    driver.get(url)
    sleep(3)
    # Format the dates to match the calendar's aria-label text, e.g. "May 15, 2025"
    checkin = checkin.strftime('%B %d, %Y')
    checkout = checkout.strftime('%B %d, %Y')
    # Open the date picker
    driver.find_element(By.XPATH, '//div[contains(@data-automation,"checkin")]').click()
    sleep(1)
    # Click "Next month" until the check-in date becomes visible, then select it
    while True:
        try:
            driver.find_element(By.XPATH, f'//div[contains(@aria-label,"{checkin}")]').click()
            break
        except Exception:
            driver.find_element(By.XPATH, '//button[contains(@aria-label,"Next month")]').click()
            sleep(1)
    sleep(2)
    # Repeat for the check-out date
    while True:
        try:
            driver.find_element(By.XPATH, f'//div[contains(@aria-label,"{checkout}")]').click()
            break
        except Exception:
            driver.find_element(By.XPATH, '//button[contains(@aria-label,"Next month")]').click()
            sleep(1)
    sleep(10)
    response = driver.page_source
    with open("page.scraped.html", "w", encoding="utf-8") as f:
        f.write(response)
    driver.quit()
    return response

You can see that the code sets the check-in and check-out dates before grabbing the HTML source code. This step is necessary; otherwise, the page won’t load the hotel prices.

The function uses a loop to find each date and select it, clicking ‘Next month’ until the date appears on the calendar. Then, it downloads the HTML after the page loads the new hotel list that includes the prices.

parse()

This function integrates the above functions. It accepts a locality, and:

  1. Constructs a URL using the locality
  2. Calls get_json() with the constructed endpoint
  3. Forms the required URL from the returned JSON data
  4. Calls get_response() with the URL
  5. Parses the returned HTML source code using lxml
  6. Locates all the span elements that hold the hotel details
  7. Loops through the span elements
    1. Locates the required data using the XPaths mentioned previously.
    2. Appends the data to a list
  8. Returns the list
def parse(locality, checkin_date, checkout_date):

    print("Scraper Initiated for Locality: %s" % locality)

    # TA rendering the autocomplete list using this API

    print("Finding search result page URL")

    geo_url =  "https://www.tripadvisor.com/TypeAheadJson?action=API&startTime="+ str(int(time()))+ "&uiOrigin=GEOSCOPE&source=GEOSCOPE&interleaved=true&types=geo,theme_park&neighborhood_geos=true&link_type=hotel&details=true&max=12&injectNeighborhoods=true&query="+ locality
   

    print(geo_url)

    api_response = get_json(geo_url)
    # getting the TA URL for the query from the autocomplete response

    url_from_autocomplete = (
        "http://www.tripadvisor.com" + api_response["results"][0]["url"]
    )

    print("URL found %s" % url_from_autocomplete)

    print("Downloading search results page")
   
    page_response = get_response(url_from_autocomplete,checkin_date,checkout_date)
    print("Parsing results ")
    parser = html.fromstring(page_response)

    hotel_lists = parser.xpath('//span[@class="organic"]')
    hotel_data = []

    for hotel in hotel_lists:
       
        XPATH_HOTEL_LINK = './/div[@data-automation="hotel-card-title"]//a/@href'
        XPATH_REVIEWS  = './/span/span[contains(text(),"reviews")]//text()'
        XPATH_RATING = './/div[contains(@aria-label,"reviews")]/@aria-label'
        XPATH_HOTEL_NAME = './/div[@data-automation="hotel-card-title"]//h3/text()'
        XPATH_HOTEL_PRICE = './/span[contains(@data-automation,"Price")]/text()'

        raw_hotel_link = hotel.xpath(XPATH_HOTEL_LINK)
        raw_no_of_reviews = hotel.xpath(XPATH_REVIEWS)
        raw_rating = hotel.xpath(XPATH_RATING)
        raw_hotel_name = hotel.xpath(XPATH_HOTEL_NAME)
        raw_hotel_price = hotel.xpath(XPATH_HOTEL_PRICE)

        url = 'http://www.tripadvisor.com'+raw_hotel_link[0] if raw_hotel_link else  None
        reviews = raw_no_of_reviews[0].replace("reviews","").replace(",","").strip() if raw_no_of_reviews else "0"
        rank = raw_hotel_name[0].split('.')[0].strip() if raw_hotel_name else None
        rating = raw_rating[0].replace('of 5 bubbles','').split()[0].strip() if raw_rating else None
        name = raw_hotel_name[0].split('.')[1].strip() if raw_hotel_name else None
        price = raw_hotel_price[0] if raw_hotel_price else 'Not Available'
           
        data = {    
                    'rank':rank,
                    'hotel_name':name,
                    'price':price,
                    'url':url,
                    'review count':reviews,
                    'tripadvisor_rating':rating,

        }

        hotel_data.append(data)

    return hotel_data
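The review count and rating arrive as raw strings that need cleaning. Here’s a sketch of the cleanup expressions used in parse(), run against sample strings in the assumed formats:

```python
# Sample raw strings of the kind the XPaths return (assumed formats)
raw_reviews = "3,968 reviews"
raw_aria_label = "4.5 of 5 bubbles. 3,968 reviews"

# Strip the "reviews" label and thousands separators
reviews = raw_reviews.replace("reviews", "").replace(",", "").strip()

# Drop the "of 5 bubbles" phrase, then take the leading number
rating = raw_aria_label.replace("of 5 bubbles", "").split()[0].strip()

print(reviews)  # 3968
print(rating)   # 4.5
```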

Now, it’s time to call these functions.

Since the script will ask the user for location, check-in date, and check-out date when it runs, set up a command-line interface (CLI) using argparse. This lets users easily interact with the script through the terminal:

  1. Initialize the ArgumentParser() object
  2. Add arguments for locality, check-in date, and check-out date
  3. Parse the arguments and store them in a variable
if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('checkin_date', help='Hotel Check In Date (Format: YYYY/MM/DD)')
    parser.add_argument('checkout_date', help='Hotel Check Out Date (Format: YYYY/MM/DD)')
    parser.add_argument("locality", help="Search Locality")
    args = parser.parse_args()

    locality = args.locality
    checkin_date = datetime.strptime(args.checkin_date,"%Y/%m/%d")
    checkout_date = datetime.strptime(args.checkout_date,"%Y/%m/%d")
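Note how the two date formats relate: the CLI accepts YYYY/MM/DD, while get_response() reformats each date to match the calendar day’s aria-label text. A quick sketch of that round trip:

```python
from datetime import datetime

# Parse the CLI-style date, then reformat it the way get_response() does
checkin = datetime.strptime("2025/05/15", "%Y/%m/%d")
print(checkin.strftime("%B %d, %Y"))  # May 15, 2025
```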

You can then call parse() with the locality, checkin_date, and checkout_date as the arguments.

data = parse(locality, checkin_date, checkout_date)

Finally, save the data as a JSON file.

with open("tripadvisor_data.json", "w", encoding="utf-8") as jsonfile:
    json.dump(data, jsonfile, indent=4, ensure_ascii=False)

The results of scraping Tripadvisor will look like this:

{
        "rank": "1",
        "hotel_name": "Moxy NYC Times Square",
        "price": "$335",
        "url": "http://www.tripadvisor.com/Hotel_Review-g60763-d12301470-Reviews-Moxy_NYC_Times_Square-New_York_City_New_York.html",
        "review count": "3968",
        "tripadvisor_rating": "4.5"
    }
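If you prefer CSV output, Python’s built-in csv module can convert the same records. A sketch using the example record above (field names match the keys used in parse()):

```python
import csv

# Sample record matching the scraper's output format (from the example above)
hotels = [{
    "rank": "1",
    "hotel_name": "Moxy NYC Times Square",
    "price": "$335",
    "url": "http://www.tripadvisor.com/Hotel_Review-g60763-d12301470-Reviews-Moxy_NYC_Times_Square-New_York_City_New_York.html",
    "review count": "3968",
    "tripadvisor_rating": "4.5",
}]

fields = ["rank", "hotel_name", "price", "url", "review count", "tripadvisor_rating"]
with open("tripadvisor_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(hotels)
```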

Here’s the complete code to scrape Tripadvisor:

#!/usr/bin/env python
from time import time, sleep
import json
import argparse, requests
from datetime import datetime

from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By

def get_response(url, checkin, checkout):

    driver = webdriver.Chrome()
    driver.get(url)
    sleep(3)
    # Format the dates to match the calendar's aria-label text, e.g. "May 15, 2025"
    checkin = checkin.strftime('%B %d, %Y')
    checkout = checkout.strftime('%B %d, %Y')

    # Open the date picker
    driver.find_element(By.XPATH, '//div[contains(@data-automation,"checkin")]').click()

    sleep(1)
    # Click "Next month" until the check-in date becomes visible, then select it
    while True:
        try:
            driver.find_element(By.XPATH, f'//div[contains(@aria-label,"{checkin}")]').click()
            break
        except Exception:
            driver.find_element(By.XPATH, '//button[contains(@aria-label,"Next month")]').click()
            sleep(1)
    sleep(2)
    # Repeat for the check-out date
    while True:
        try:
            driver.find_element(By.XPATH, f'//div[contains(@aria-label,"{checkout}")]').click()
            break
        except Exception:
            driver.find_element(By.XPATH, '//button[contains(@aria-label,"Next month")]').click()
            sleep(1)
    sleep(10)
    response = driver.page_source

    with open("page.scraped.html", "w", encoding="utf-8") as f:
        f.write(response)

    driver.quit()
    return response

def get_json(url):

    headers = {
                 "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,"
                 "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                 "accept-language": "en-GB;q=0.9,en-US;q=0.8,en;q=0.7",
                 "dpr": "1",
                 "sec-fetch-dest": "document",
                 "sec-fetch-mode": "navigate",
                 "sec-fetch-site": "none",
                 "sec-fetch-user": "?1",
                 "upgrade-insecure-requests": "1",
                 "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                 "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
             }
   
    response = requests.get(url, headers=headers, verify=False)
    json_data = json.loads(response.text)

    with open("geo_url.json", "w") as f:
        json.dump(json_data, f, indent=4)

    return json_data

def parse(locality, checkin_date, checkout_date):

    print("Scraper Initiated for Locality: %s" % locality)

    # TA rendering the autocomplete list using this API

    print("Finding search result page URL")

    geo_url =  "https://www.tripadvisor.com/TypeAheadJson?action=API&startTime="+ str(int(time()))+ "&uiOrigin=GEOSCOPE&source=GEOSCOPE&interleaved=true&types=geo,theme_park&neighborhood_geos=true&link_type=hotel&details=true&max=12&injectNeighborhoods=true&query="+ locality
   

    print(geo_url)

    api_response = get_json(geo_url)
    # getting the TA URL for the query from the autocomplete response

    url_from_autocomplete = (
        "http://www.tripadvisor.com" + api_response["results"][0]["url"]
    )

    print("URL found %s" % url_from_autocomplete)

    print("Downloading search results page")
   
    page_response = get_response(url_from_autocomplete,checkin_date,checkout_date)
    print("Parsing results ")
    parser = html.fromstring(page_response)

    hotel_lists = parser.xpath('//span[@class="organic"]')
    hotel_data = []

    for hotel in hotel_lists:
       
        XPATH_HOTEL_LINK = './/div[@data-automation="hotel-card-title"]//a/@href'
        XPATH_REVIEWS  = './/span/span[contains(text(),"reviews")]//text()'
        XPATH_RATING = './/div[contains(@aria-label,"reviews")]/@aria-label'
        XPATH_HOTEL_NAME = './/div[@data-automation="hotel-card-title"]//h3/text()'
        XPATH_HOTEL_PRICE = './/span[contains(@data-automation,"Price")]/text()'

        raw_hotel_link = hotel.xpath(XPATH_HOTEL_LINK)
        raw_no_of_reviews = hotel.xpath(XPATH_REVIEWS)
        raw_rating = hotel.xpath(XPATH_RATING)
        raw_hotel_name = hotel.xpath(XPATH_HOTEL_NAME)
        raw_hotel_price = hotel.xpath(XPATH_HOTEL_PRICE)

        url = 'http://www.tripadvisor.com'+raw_hotel_link[0] if raw_hotel_link else  None
        reviews = raw_no_of_reviews[0].replace("reviews","").replace(",","").strip() if raw_no_of_reviews else "0"
        rank = raw_hotel_name[0].split('.')[0].strip() if raw_hotel_name else None
        rating = raw_rating[0].replace('of 5 bubbles','').split()[0].strip() if raw_rating else None
        name = raw_hotel_name[0].split('.')[1].strip() if raw_hotel_name else None
        price = raw_hotel_price[0] if raw_hotel_price else 'Not Available'
           
        data = {    
                    'rank':rank,
                    'hotel_name':name,
                    'price':price,
                    'url':url,
                    'review count':reviews,
                    'tripadvisor_rating':rating,

        }

        hotel_data.append(data)

    return hotel_data

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('checkin_date', help='Hotel Check In Date (Format: YYYY/MM/DD)')
    parser.add_argument('checkout_date', help='Hotel Check Out Date (Format: YYYY/MM/DD)')
    parser.add_argument("locality", help="Search Locality")
    args = parser.parse_args()

    locality = args.locality
    checkin_date = datetime.strptime(args.checkin_date,"%Y/%m/%d")
    checkout_date = datetime.strptime(args.checkout_date,"%Y/%m/%d")
   
    data = parse(locality, checkin_date, checkout_date)
    print("Extracted data:", data)
    print("Writing to output file tripadvisor_data.json")

    with open("tripadvisor_data.json", "w",encoding='utf-8') as jsonfile:  
        json.dump(data,jsonfile,indent=4,ensure_ascii=False)

Using the Code

To use this code, copy and paste it into a file with a ‘.py’ extension, then run it from a terminal.

python yourpythoncode.py 2025/05/15 2025/05/20 boston

The command above scrapes details of hotels in Boston for a check-in date of May 15, 2025, and a check-out date of May 20, 2025.

Code Limitations

The code can grab hotel details of a particular locality from Tripadvisor, but it has some limitations:

  1. The code relies on the API endpoint. If that changes, the code will break.
  2. It locates data points using XPaths, which relies on HTML structure; the code will fail if the structure changes.
  3. This scraper doesn’t use any advanced techniques to avoid anti-scraping measures, so it is unsuitable for large-scale scraping.
  4. The code won’t work for all the pages; you need to alter it to scrape other travel, airline, and hotel data from Tripadvisor.

Want to Get Tripadvisor Reviews Instead? Use ScrapeHero Tripadvisor Scraper

ScrapeHero Tripadvisor Ratings and Review Scraper is a no-code scraper from ScrapeHero Cloud. Within a few clicks, you can get ratings and reviews from any hotel’s page. 

Try for free:

  1. Log in or sign up for a ScrapeHero Cloud account
  2. Create a Project and Add details 
  3. Enter the Tripadvisor URLs from which you want to scrape data
  4. Click ‘Gather data’

Wait for the scraper to finish, and you can get the downloaded data from the ‘My Projects’ section.

Advanced Features

Besides giving you a no-code way to get Tripadvisor reviews, ScrapeHero Cloud also lets you:

  • Schedule your scraper to run and pull reviews periodically.
  • Get the reviews delivered to your preferred cloud storage.
  • Integrate this review scraper with your workflow using APIs.

Still confused? Here’s a detailed guide on how to scrape Tripadvisor reviews.

Bottom Line

You can scrape Tripadvisor using the code shown in this tutorial. However, you need to maintain the code yourself; this includes replacing the API endpoint and altering the XPaths. Moreover, you need to alter the code if you intend to perform scraping on a large scale. 

But with a web scraping service, you can avoid coding yourself. A service like ScrapeHero can take care of all the technicalities, including dealing with anti-scraping measures.

ScrapeHero is a fully-managed web scraping service capable of building enterprise-grade scrapers customized to your needs. Contact ScrapeHero now and get all your data needs covered.
