Web Scraping Amazon Product Reviews Using Python

Share:

Scrape amazon product reviews using python

Web scraping Amazon reviews can offer insightful data for market research, sentiment analysis, and competitive analysis. By building a Python scraper, you can extract product review data such as title, content, product name, rating, etc. from Amazon and export it into an Excel spreadsheet.

Through this article, learn to create a simple Amazon product review scraper using Python and SelectorLib. Later, run it on a console in order to extract all the details you need.

Steps for Web Scraping Amazon Reviews

  1. Markup the data fields to scrape using Selectorlib.
  2. Copy and run the code.
  3. Download the data in Excel or CSV file.

Data Fields To Scrape From Amazon

Data fields obtained when scraping Amazon reviews

Here are some of the data fields that are obtained when scraping Amazon product reviews and extracted into a spreadsheet from Amazon:

  1. Product Name
  2. Review Title
  3. Review Content/Review Text
  4. Rating
  5. Date of Publishing Review
  6. Verified Purchase
  7. Author Name
  8. URL

Installation

For web scraping Amazon product reviews, install the required Python3 packages. The libraries mentioned are used:

  • Python Requests – To make requests and download the HTML content of the pages
  • LXML – To parse the HTML tree structure using XPaths
  • Python Dateutil – To parse review dates
  • Selectorlib – To extract data using the YAML file created from the web pages downloaded

Install these libraries using pip3

pip3 install python-dateutil lxml requests selectorlib

The Functionality of the Amazon Product Review Scraper

Here is what a scraper created for scraping Amazon product reviews does:

  1. It reads a list of product review page URLs from a file called urls.txt, which contains the URLs for the Amazon product pages you need.
  2. It uses a selectorlib YAML file that identifies the data on an Amazon page and is saved in a file called selectors.yml.
  3. Scrapes the data.
  4. Saves the data as a CSV spreadsheet called data.csv.

Creating the YAML File – selectors.yml

Using the Selectorlib Web Scraper Chrome Extension, you can mark the data that you need to extract. It also helps in creating the CSS selectors or XPaths needed to extract that data.

You can use the template created by ScrapeHero to extract the details already mentioned. But if you want to scrape more data fields, then add the new field using Selectorlib to the existing template.

Let’s consider an Amazon page: Nike Women’s Reax Run 5 Running Shoes

Mark up the fields for the data you need to scrape from the Amazon Product Reviews Page using the Selectorlib Chrome Extension.

Marking the data fields to scrape from the Amazon Product Reviews Page using the Selectorlib Chrome Extension

Click on ‘Highlight’ to highlight and preview all of your selectors in the template, and then click on ‘Export’ to download the YAML file, selectors.yml.

Exporting the YAML file, selectors.yml when scraping Amazon product reviews

Here is how the template (selectors.yml) file looks:

product_title:
    css: 'h1 a[data-hook="product-link"]'
    type: Text
reviews:
    css: 'div.review div.a-section.celwidget'
    multiple: true
    type: Text
    children:
        title:
            css: a.review-title
            type: Text
        content:
            css: 'div.a-row.review-data span.review-text'
            type: Text
        date:
            css: span.a-size-base.a-color-secondary
            type: Text
        variant:
            css: 'a.a-size-mini'
            type: Text
        images:
            css: img.review-image-tile
            multiple: true
            type: Attribute
            attribute: src
        verified:
            css: 'span[data-hook="avp-badge"]'
            type: Text
        author:
            css: span.a-profile-name
            type: Text
        rating:
            css: 'div.a-row:nth-of-type(2) > a.a-link-normal:nth-of-type(1)'
            type: Attribute
            attribute: title
next_page:
    css: 'li.a-last a'
    type: Link

The Code

Let’s create a file called reviews.py and paste the Python code into it.

from selectorlib import Extractor
import requests 
import json 
from time import sleep
import csv
from dateutil import parser as dateparser

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):    
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Simple check to check if page was blocked (Usually 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the HTML of the page and create 
    return e.extract(r.text)

with open("urls.txt",'r') as urllist, open('data.csv','w') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["title","content","date","variant","images","verified","author","rating","product","url"],quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for url in urllist.readlines():
        data = scrape(url) 
        if data:
            for r in data['reviews']:
                r["product"] = data["product_title"]
                r['url'] = url
                if 'verified' in r:
                    if 'Verified Purchase' in r['verified']:
                        r['verified'] = 'Yes'
                    else:
                        r['verified'] = 'Yes'
                r['rating'] = r['rating'].split(' out of')[0]
                date_posted = r['date'].split('on ')[-1]
                if r['images']:
                    r['images'] = "\n".join(r['images'])
                r['date'] = dateparser.parse(date_posted).strftime('%d %b %Y')
                writer.writerow(r)
            # sleep(5)

Get the complete code for scraping Amazon reviews on GitHub

Running the Amazon Review Scraper

Now you should add the URLs you need to scrape into a text file called urls.txt in the same folder where you saved the code. Then run the scraper using the command:

python3 reviews.py

As mentioned earlier, the Amazon page for Nike Women’s Reax Run 5 Running Shoes is considered here. You can get this URL by clicking on “See more reviews” near the bottom of the product page.

Navigating to the Amazon product page to scrape details by clicking on “See more reviews”

The results obtained after scraping Amazon reviews using Python:

Results obtained in Excel or CSV after scraping Amazon reviews using Python

Use Cases for Scraping Amazon Reviews

The data that you gather from scraping Amazon reviews can help you:

  1. Monitoring customer opinions on products that you sell or manufacture using data analysis
  2. Monitor product quality sold by third-party sellers
  3. Create Amazon review datasets for educational purposes and research

ScrapeHero Amazon Product Reviews Scraper and Amazon Review API

Using ScrapeHero Cloud can be a better alternative to a Python Amazon product review scraper that you have created. This is because the structure of the Amazon website may change over time, and the current Python scraper you have created may have to be updated.

In such scenarios, prebuilt scrapers such as ScrapeHero Amazon Product Reviews Scraper can save you time and money. All you have to do is provide ASINs or product page URLs. That’s it. Moreover, our prebuilt scrapers allow you to configure your needs and fetch the data without any dedicated teams or hosting infrastructure.

You can also try using the ScrapeHero Amazon Product Reviews and Ratings API, which can be easily integrated into your application and stream extracted data seamlessly. It can also avoid the IP bans and captchas that Amazon implements when detecting web crawlers.

If you want to explore our scrapers related to scraping Amazon, try out:

Wrapping Up

Creating a scraper for Amazon product reviews in Python can be challenging for beginners, especially when dealing with complex issues like dynamic content and anti-scraping measures. But you can also scrape Amazon reviews without any coding on your part.

Keep in mind that the scraper discussed in this article is used to scrape Amazon reviews of a relatively small number of ASINs for your personal web scraping needs.

If you plan on running web scrapers on a larger scale, then ScrapeHero web scraping services would be more suitable. Businesses can rely on ScrapeHero to meet their data scraping needs effectively and efficiently as we manage data volume, providing clean, structured, and relevant data.

Frequently Asked Questions

1. Can I scrape Amazon reviews using Python?

Yes, you can scrape Amazon reviews using Python. You can do it either with libraries such as BeautifulSoup, Selenium, or Requests or with APIs such as ScrapeHero’s Amazon reviews API.

2. How to scrape Amazon reviews using Python BeautifulSoup?

Web scraping Amazon reviews using Python with BeautifulSoup includes installing the BeautifulSoup4 and Requests libraries, sending HTTP requests to the Amazon product page, parsing the HTML content, and extracting the desired information.

3. How to extract reviews from Amazon using Python Selenium?

Scraping Amazon product reviews using Python with Selenium involves automating a web browser to interact with the Amazon website, navigating through the product reviews, and extracting the data. It is better to use Selenium when dealing with JavaScript-generated dynamic content, which is standard on Amazon’s review pages.

4. What is an Amazon review scraper Chrome extension?

An Amazon review scraper Chrome extension is a browser add-on designed to extract reviews from Amazon product pages. It can collect review data, such as ratings and comments, which is later used for analysis or monitoring purposes.

5. Where can I find a free Amazon review scraper?

You will get a free Amazon review scraper from ScrapeHero Cloud, which you can avail of using the 25 free credits that you get during signup. ScrapeHero scrapers are fast, reliable, easy to use, and do not require coding on the user side. Apart from that, you can also find free code to scrape Amazon reviews in Python on GitHub.

We can help with your data or automation needs

Turn the Internet into meaningful, structured and usable data



Please DO NOT contact us for any help with our Tutorials and Code using this form or by calling us, instead please add a comment to the bottom of the tutorial page for help

Table of content

Scrape any website, any format, no sweat.

ScrapeHero is the real deal for enterprise-grade scraping.

Ready to turn the internet into meaningful and usable data?

Contact us to schedule a brief, introductory call with our experts and learn how we can assist your needs.

Continue Reading

Transform and map scraped data

How to Transform and Map Scraped Data with Python Libraries

Learn how you can transform and map data using Python.
Using NLP to clean and structure scraped data

How to Use NLP to Clean and Structure Scraped Data

Learn how to use NLP to clean and structure scraped data.
Search engine web crawling

From Crawling to Ranking! This is How Search Engines Use Web Crawling to Index Websites!

Search engine crawling indexes web pages, making it essential for ranking and visibility in search results.
ScrapeHero Logo

Can we help you get some data?