How To Scrape Amazon Product Data and Prices using Python 3

Web scraping helps automate data extraction from websites. In this tutorial, we will build an Amazon scraper that extracts product details and pricing. We will build this simple web scraper using Python and SelectorLib and run it from a console.

Here is how you can scrape product details from an Amazon product page:

  1. Mark up the data fields to be scraped using Selectorlib
  2. Copy and run the code provided

Check out our web scraping tutorials to learn how to scrape Amazon Reviews easily using Google Chrome and how to build an Amazon Review Scraper using Python.

Below, we also cover how to scrape product details from the Amazon search results page, how to avoid getting blocked by Amazon, and how to scrape Amazon on a large scale.

Try the Amazon Product Detail Crawler in ScrapeHero Cloud for free and scrape Amazon easily without having to code.

 

Setting up your computer for Amazon Scraping

We will use Python 3 for this Amazon scraper; the code will not run if you are using Python 2.7. To start, you need a computer with Python 3 and pip installed.

Follow this guide to set up your computer and install packages if you are on Windows:

How To Install Python Packages for Web Scraping in Windows 10

Packages to install for Amazon scraping

  • Python Requests, to make requests and download the HTML content of the Amazon product pages
  • Selectorlib, a Python package to extract data from the downloaded pages using the YAML template we create

Using pip3:

pip3 install requests selectorlib

Scrape product details from the Amazon Product Page

The Amazon product page scraper will scrape the following details from the product page:

  1. Product Name
  2. Price
  3. Short Description
  4. Full Product Description
  5. Image URLs
  6. Rating
  7. Number of Reviews
  8. Variant ASINs
  9. Sales Rank
  10. Link to all Reviews Page

Mark up the data fields using Selectorlib

We have already marked up the data, so you can just skip this step if you want to get right to the data.

Here is what our template looks like. See the file here.

Let’s save this as a file called selectors.yml in the same directory as our code.

name:
    css: '#productTitle'
    type: Text
price:
    css: '#price_inside_buybox'
    type: Text
short_description:
    css: '#featurebullets_feature_div'
    type: Text
images:
    css: '.imgTagWrapper img'
    type: Attribute
    attribute: data-a-dynamic-image
rating:
    css: span.arp-rating-out-of-text
    type: Text
number_of_reviews:
    css: 'a.a-link-normal h2'
    type: Text
variants:
    css: 'form.a-section li'
    multiple: true
    type: Text
    children:
        name:
            css: ""
            type: Attribute
            attribute: title
        asin:
            css: ""
            type: Attribute
            attribute: data-defaultasin
product_description:
    css: '#productDescription'
    type: Text
sales_rank:
    css: 'li#SalesRank'
    type: Text
link_to_all_reviews:
    css: 'div.card-padding a.a-link-emphasis'
    type: Link

 

Here is a preview of the markup:

[Screenshot: Selectorlib Template for Amazon.com]

 

Selectorlib is a combination of tools for developers that makes marking up and extracting data from web pages easy. The Selectorlib Chrome Extension lets you mark the data you need to extract, creates the CSS Selectors or XPaths needed to extract that data, and then previews how the extracted data will look.

You can learn more about Selectorlib and how to use it to mark up data here.
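To make this concrete, here is a minimal sketch of how the selectorlib package is used once you have a template: load the YAML into an Extractor and feed it HTML. The stand-in HTML string below is just for illustration.

from selectorlib import Extractor

# Load the template we marked up above
extractor = Extractor.from_yaml_file('selectors.yml')

# Stand-in HTML for illustration; in practice this is a downloaded product page
html = '<span id="productTitle">Example Product</span>'
print(extractor.extract(html))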

If you don't like or want to code, ScrapeHero Cloud is just right for you!

Skip the hassle of installing software, programming and maintaining the code. Download this data using ScrapeHero cloud within seconds.


The Code

Create a folder called amazon-scraper and save your Selectorlib YAML template file in it as selectors.yml.

Let’s create a file called amazon.py and paste the code below into it. All it does is:

  1. Read a list of Amazon Product URLs from a file called urls.txt
  2. Scrape the data
  3. Save the data as a JSON Lines file

from selectorlib import Extractor
import requests 
import json 
from time import sleep


# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):    
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (blocked pages usually return a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the page HTML to the extractor and return the extracted data
    return e.extract(r.text)

# product_data = []
with open("urls.txt",'r') as urllist, open('output.jsonl','w') as outfile:
    for url in urllist.readlines():
        data = scrape(url) 
        if data:
            json.dump(data,outfile)
            outfile.write("\n")
            # sleep(5)

Running the Amazon Product Page Scraper

You can get the full code from Github – https://github.com/scrapehero-code/amazon-scraper

You can start your scraper by typing the command:

python3 amazon.py

Once the scrape is complete you should see a file called output.jsonl with your data. Here is an example for the URL

https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/dp/B085383P7M/

{
  "name": "2020 HP 15.6\" Laptop Computer, 10th Gen Intel Quard-Core i7 1065G7 up to 3.9GHz, 16GB DDR4 RAM, 512GB PCIe SSD, 802.11ac WiFi, Bluetooth 4.2, Silver, Windows 10, YZAKKA USB External DVD + Accessories",
  "price": "$959.00",
  "short_description": "Powered by latest 10th Gen Intel Core i7-1065G7 Processor @ 1.30GHz (4 Cores, 8M Cache, up to 3.90 GHz); Ultra-low-voltage platform. Quad-core, eight-way processing provides maximum high-efficiency power to go.\n15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768) Display; Intel Iris Plus Graphics\n16GB 2666MHz DDR4 Memory for full-power multitasking; 512GB Solid State Drive (PCI-e), Save files fast and store more data. With massive amounts of storage and advanced communication power, PCI-e SSDs are great for major gaming applications, multiple servers, daily backups, and more.\nRealtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo; 1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\nWindows 10 Home, 64-bit, English; Natural silver; YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad\n› See more product details",
  "images": "{\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX425_.jpg\":[425,425],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX466_.jpg\":[466,466],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY355_.jpg\":[355,355],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX569_.jpg\":[569,569],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY450_.jpg\":[450,450],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX679_.jpg\":[679,679],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX522_.jpg\":[522,522]}",
  "variants": [
    {
      "name": "Click to select 4GB DDR4 RAM, 128GB PCIe SSD",
      "asin": "B01MCZ4LH1"
    },
    {
      "name": "Click to select 8GB DDR4 RAM, 256GB PCIe SSD",
      "asin": "B08537NR9D"
    },
    {
      "name": "Click to select 12GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B08537ZDYH"
    },
    {
      "name": "Click to select 16GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B085383P7M"
    },
    {
      "name": "Click to select 20GB DDR4 RAM, 1TB PCIe SSD",
      "asin": "B08537NDVZ"
    }
  ],
  "product_description": "Capacity:16GB DDR4 RAM, 512GB PCIe SSD\n\nProcessor\n\n  Intel Core i7-1065G7 (1.3 GHz base frequency, up to 3.9 GHz with Intel Turbo Boost Technology, 8 MB cache, 4 cores)\n\nChipset\n\n  Intel Integrated SoC\n\nMemory\n\n  16GB DDR4-2666 SDRAM\n\nVideo graphics\n\n  Intel Iris Plus Graphics\n\nHard drive\n\n  512GB PCIe NVMe M.2 SSD\n\nDisplay\n\n  15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768)\n\nWireless connectivity\n\n  Realtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo\n\nExpansion slots\n\n  1 multi-format SD media card reader\n\nExternal ports\n\n  1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\n\nMinimum dimensions (W x D x H)\n\n  9.53 x 14.11 x 0.70 in\n\nWeight\n\n  3.75 lbs\n\nPower supply type\n\n  45 W Smart AC power adapter\n\nBattery type\n\n  3-cell, 41 Wh Li-ion\n\nBattery life mixed usage\n\n  Up to 11 hours and 30 minutes\n\n  Video Playback Battery life\n\n  Up to 10 hours\n\nWebcam\n\n  HP TrueVision HD Camera with integrated dual array digital microphone\n\nAudio features\n\n  Dual speakers\n\nOperating system\n\n  Windows 10 Home 64\n\nAccessories\n\n  YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad",
  "link_to_all_reviews": "https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/product-reviews/B085383P7M/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
}

Scrape Amazon products from the Search Results Page

The Amazon search results page scraper will scrape the following details from the search results page:

  1. Product Name
  2. Price
  3. URL
  4. Rating
  5. Number of Reviews

The steps and code for scraping search results are very similar to those of the product page scraper.

Mark up the data fields using Selectorlib

Here is our Selectorlib YAML file. Let’s call it search_results.yml.

products:
    css: 'div[data-component-type="s-search-result"]'
    xpath: null
    multiple: true
    type: Text
    children:
        title:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Text
        url:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Link
        rating:
            css: 'div.a-row.a-size-small span:nth-of-type(1)'
            xpath: null
            type: Attribute
            attribute: aria-label
        reviews:
            css: 'div.a-row.a-size-small span:nth-of-type(2)'
            xpath: null
            type: Attribute
            attribute: aria-label
        price:
            css: 'span.a-price:nth-of-type(1) span.a-offscreen'
            xpath: null
            type: Text

The Code

The code is almost identical to the previous scraper, except that we iterate through each product and save each one as a separate line.

Let’s create a file called searchresults.py and paste the code below into it. Here is what the code does:

  1. Open a file called search_results_urls.txt and read search result page URLs
  2. Scrape the data
  3. Save to a JSON Lines file called search_results_output.jsonl

 

from selectorlib import Extractor
import requests 
import json 
from time import sleep


# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('search_results.yml')

def scrape(url):  

    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (blocked pages usually return a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the page HTML to the extractor and return the extracted data
    return e.extract(r.text)

# product_data = []
with open("search_results_urls.txt",'r') as urllist, open('search_results_output.jsonl','w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url) 
        if data:
            for product in data['products']:
                product['search_url'] = url
                print("Saving Product: %s"%product['title'])
                json.dump(product,outfile)
                outfile.write("\n")
                # sleep(5)
    

 

Running the Amazon Scraper to Scrape Search Results

You can start your scraper by typing the command:

python3 searchresults.py

Once the scrape is complete you should see a file called search_results_output.jsonl with your data.

Here is an example for the URL
https://www.amazon.com/s?k=laptops

https://github.com/scrapehero-code/amazon-scraper/blob/master/search_results_output.jsonl

What to do if you get blocked while scraping Amazon

We are adding this extra section to talk about some methods you can use to avoid getting blocked while scraping Amazon. Amazon is very likely to flag you as a “BOT” if you start scraping hundreds of pages using the code above. The idea is to avoid getting flagged as a bot and running into problems. How do we solve such challenges?

Mimic human behavior as much as possible.

While we cannot guarantee that you will never be blocked, here are some tips and tricks to avoid getting blocked by Amazon.

Use proxies and rotate them

Let us say we are scraping hundreds of products on Amazon.com from a laptop, which usually has just one IP address. Amazon will know we are a bot in no time, as no human would ever visit hundreds of product pages in a minute. To look more like a human, make requests to Amazon.com through a pool of IP addresses or proxies. The rule of thumb here is to have each proxy or IP address make no more than 5 requests to Amazon in a minute. If you are scraping about 100 pages per minute, you need about 100/5 = 20 proxies. You can read more about rotating proxies here.
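Here is a minimal sketch of what routing requests through a rotating pool could look like with Requests. The proxy addresses below are hypothetical placeholders, not working proxies:

import random
import requests

# Hypothetical proxy pool; replace these with proxies you actually have
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_through_proxy(url, headers=None):
    # Send each request out through a randomly picked proxy from the pool
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers, proxies={'http': proxy, 'https': proxy})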

Specify the User Agents of latest browsers and rotate them

If you look at the code above, you will see a line where we set the User-Agent string for the request we are making:

 'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'

Just like proxies, it is always good to have a pool of User-Agent strings. Make sure you use the user-agent strings of the latest and most popular browsers, and rotate the strings for each request you make to Amazon. You can learn more about rotating user agent strings in Python here. It is also a good idea to pair each User-Agent with a particular IP address, so that the combination looks more like a human than a bot.
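As a sketch, rotating user agents can be as simple as picking one from a pool for every request. The strings below are examples; keep your own pool stocked with current browser versions:

import random
import requests

# An example pool of user agent strings from popular browsers
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
]

def get_with_random_user_agent(url):
    # Use a different user agent string for each request
    headers = {'user-agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)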

Reduce the number of ASINs scraped per minute

You can try slowing down the scrape a bit to give Amazon fewer chances of flagging you as a bot. That said, about 5 requests per IP per minute isn’t much throttling; if you need to go faster, add more proxies. You can modify the speed by increasing or decreasing the delay in the sleep function.
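For example, you could uncomment the sleep call in amazon.py and randomize the delay so the request pattern looks less mechanical. The 3 to 7 second range here is an arbitrary choice:

import random
from time import sleep

with open("urls.txt", 'r') as urllist, open('output.jsonl', 'w') as outfile:
    for url in urllist.readlines():
        data = scrape(url)   # the scrape() function defined in amazon.py
        if data:
            json.dump(data, outfile)
            outfile.write("\n")
        # Pause for a random 3-7 seconds instead of a fixed interval
        sleep(random.uniform(3, 7))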

Retry, Retry, Retry

When you are blocked by Amazon, make sure you retry that request. The code above simply skips a URL when the request fails; you could do a better job by creating a retry queue using a list and retrying those URLs after all the other products have been scraped from Amazon.
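Here is a sketch of that idea built on the scrape() function from amazon.py: failed URLs are collected in a list and given a second attempt once the first pass is done.

failed_urls = []

with open("urls.txt", 'r') as urllist, open('output.jsonl', 'w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url)
        if data:
            json.dump(data, outfile)
            outfile.write("\n")
        else:
            # Queue the URL for another attempt instead of giving up
            failed_urls.append(url)

    # Second pass: retry everything that failed the first time
    for url in failed_urls:
        data = scrape(url)
        if data:
            json.dump(data, outfile)
            outfile.write("\n")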


How to Solve Amazon Scraping Challenges

This Amazon scraper should work for small-scale scraping and hobby projects, and it can get you started on your road to building bigger and better scrapers. However, if you do want to scrape Amazon for thousands of pages at short intervals, here are some important things to keep in mind:

Use a Web Scraping Framework like PySpider or Scrapy

When you’re crawling a massive site like Amazon.com, you need to spend some time figuring out how to run your entire crawl smoothly. Choose an open-source framework for building your scraper, like Scrapy or PySpider, which are both written in Python. These frameworks have pretty active communities and can handle a lot of the errors that happen while scraping without disturbing the entire scraper. Most of them also let you use multiple threads to speed up scraping if you are using a single computer. You can deploy Scrapy to your own servers using ScrapyD.

If you need speed, Distribute and Scale-Up using a Cloud Provider

There is a limit to the number of pages you can scrape from Amazon when using a single computer. If you’re scraping Amazon on a large scale, you need a lot of servers to get the data within a reasonable time. You could consider hosting your scraper in the cloud and using a scalable version of the framework, like Scrapy-Redis. For broader crawls, use message brokers like Redis, RabbitMQ, or Kafka to run multiple spider instances and speed up crawls.

Use a scheduler if you need to run the scraper periodically

If you are using a scraper to get updated prices of products, you need to refresh your data frequently to keep track of the changes. If you are using the script in this tutorial, use cron (on UNIX) or Task Scheduler (on Windows) to schedule it. If you are using Scrapy, scrapyd+cron can help schedule your spiders so you can refresh the data on a regular interval.
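For example, a crontab entry that runs the script from this tutorial every 6 hours could look like the line below. The paths are hypothetical; point them at your own script and Python interpreter:

0 */6 * * * cd /home/user/amazon-scraper && /usr/bin/python3 amazon.py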

Use a database to store the Scraped Data from Amazon

If you are scraping a large number of products from Amazon, writing data to a file will soon become inconvenient. Retrieving data becomes tough, and you might even end up with gibberish in the file when multiple processes write to it at once. Use a database even if you are scraping from a single computer. MySQL will be just fine for moderate workloads, and you can run simple analytics on the scraped data with tools like Tableau, PowerBI or Metabase by connecting them to your database. For larger write loads you can look into NoSQL databases like MongoDB, Cassandra, etc.
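As a minimal sketch of this, the snippet below stores each scraped product in SQLite (which ships with Python) in place of MySQL, so it runs without a database server. The schema is illustrative:

import sqlite3

conn = sqlite3.connect('amazon_products.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        url TEXT PRIMARY KEY,
        name TEXT,
        price TEXT,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def save_product(url, data):
    # INSERT OR REPLACE keeps one row per product URL with the latest price
    conn.execute(
        "INSERT OR REPLACE INTO products (url, name, price) VALUES (?, ?, ?)",
        (url, data.get('name'), data.get('price'))
    )
    conn.commit()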

Use Request Headers, Proxies, and IP Rotation to prevent getting Captchas from Amazon

Amazon has a lot of anti-scraping measures. If you do not throttle your requests, Amazon will block you in no time and you’ll start seeing captchas instead of product pages. To prevent that, while going through each Amazon product page it’s better to change headers by rotating your User-Agent value. This makes requests look like they’re coming from a browser and not a script.
To crawl Amazon on a very large scale, use proxies and IP rotation to reduce the number of captchas you get. You can learn more techniques to prevent getting blocked by Amazon and other sites here – How to prevent getting blacklisted while scraping. You can also use Python to solve some basic captchas using the Tesseract OCR engine.

Write some simple data quality tests

Scraped data is always messy. An XPath that works for one page might not work for another variation of the same page on the same site. Amazon has LOTS of product page layouts. If you spend an hour writing basic sanity checks for your data – like verifying that the price is a decimal – you’ll know when your scraper breaks, and you’ll also be able to minimize its impact. Incorporating data quality checks into your code is especially helpful if you are scraping Amazon data for price monitoring, seller monitoring, stock monitoring, etc.
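For instance, a hypothetical sanity check you could run against each record before saving it:

def validate_product(data):
    # Basic sanity checks; returns a list of problems found in a scraped record
    errors = []
    if not data.get('name'):
        errors.append('missing product name')
    price = (data.get('price') or '').replace('$', '').replace(',', '').strip()
    try:
        float(price)
    except ValueError:
        errors.append('price is not a decimal: %r' % data.get('price'))
    return errors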

We hope this tutorial gave you a better idea of how to scrape Amazon or similar e-commerce websites. As a company, we understand e-commerce data, having worked with it before. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help.

How to use Amazon Product Data

  1. Monitor Amazon products for change in Price, Stock Count/Availability, Rating, etc.
    By using a web scraper, you can update your data feeds on a timely basis to monitor any product changes. These data feeds can help you form pricing strategies by looking at your competition – other sellers or brands.
  2. Scrape Amazon Product Details that you can’t get with the Product Advertising API
    Amazon provides a Product Advertising API, but like many other APIs, it doesn’t provide all the information that Amazon shows on a product page. A web scraper can help you extract all the details displayed on the product page.
  3. Analyze how a particular Brand sells on Amazon
    If you’re a retailer, you can monitor your competitors’ products and see how well they do in the market, making adjustments to reprice and sell your own products. You could also monitor your distribution channel to identify how your products are sold on Amazon by sellers, and whether it is causing you any harm.
  4. Find Customer Opinions from Amazon Product Reviews
    Reviews offer abundant amounts of information. If you’re targeting an established set of sellers who have been selling reasonable volumes, you can extract the reviews of their products to find what you should avoid and what you could quickly improve on while trying to sell similar products on Amazon.





Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

                    


Responses

Swedha August 14, 2016

I don’t get the output. No error too. In the json file all the values are ‘null’, for eg:
[
{
"CATEGORY": null,
"ORIGINAL_PRICE": null,
"NAME": null,
"URL": "http://www.amazon.com/dp/B0046UR4F4",
"SALE_PRICE": null,
"AVAILABILITY": null
},
]


    ScrapeHero August 16, 2016

    Glad to hear you got it working !


    Subhasis Mukherjee September 1, 2016

    Thanks for this. This solved the issue.


    stedentriplonden January 4, 2017

    Hi, the user agent trick didn’t work for me. ScrapeHero, has something changed on the Amazon side that I get these results:

    [
    {
    "CATEGORY": null,
    "ORIGINAL_PRICE": null,
    "NAME": null,
    "URL": "http://www.amazon.com/dp/B0046UR4F4",
    "SALE_PRICE": null,
    "AVAILABILITY": null
    },
    {
    "CATEGORY": null,
    "ORIGINAL_PRICE": null,
    "NAME": null,
    "URL": "http://www.amazon.com/dp/B00JGTVU5A",
    "SALE_PRICE": null,
    "AVAILABILITY": null
    },


      ScrapeHero January 4, 2017

      We will have a look at the code and see if it still works and get back with a comment as soon as our paying job allows 😉


      Alan February 6, 2018

      thanks bro..


Rakesh Pandey October 15, 2016

Nice python script. Great work for beginner.


Ed November 5, 2016

Are there any cheap web hosting solutions what have Python installed? Hoping I could set up my required Amazon products, update prices daily then point a website/app to the .json file on my new shared hosting.

Maybe even AWS, Azure etc or a Cloud IDE. Just looking for a simple solution to start off with.


    ScrapeHero November 5, 2016

    Most VPSs or shared hosting plans support Python. Just ask them before buying.


Konstantinos Bazakos November 14, 2016

Nice implementation! Very well done! Just a question… What is the purpose of the sleep() functions? How come Amazon does not return a typical robot/spider message to use their API?


    ScrapeHero November 16, 2016

    Hi,
    sleep just pauses the execution for a bit so that we don’t hammer the server.
    Can you clarify the second part of the question – not sure what that means.
    Thanks


      Konstantinos November 17, 2016

      In the beginning I did not use headers in the requests.get() so in the HTML (html.fromstring()) content there was the following message “To discuss automated access to Amazon data please contact mail. For information about migrating to our APIs refer to our Marketplace APIs at link, or our Product Advertising API at link for advertising use cases.” from Amazon.


        ScrapeHero November 17, 2016

        You should mimic the browser as much as possible including headers, cookies and sessions – that with IP rotation will work for small scale data gathering


Dheeraj November 22, 2016

Any way to extract the reviews based on ASIN number for a particular product?


    ScrapeHero November 22, 2016

    Sure but it would need modification to this code.
    The tutorial provides the basis for it but you will need to identify the xpaths for the review and grab the content that way.


Saul Bretado November 28, 2016

Does this code work for extracting 1500 products? … Adding IP rotation of course. Please let me know.


    ScrapeHero November 28, 2016

    Hi Saul,
    The code should work but at those numbers (1500 products) the code is not the problem.
    Everything else related to web scraping that we have written about on our site starts to matter.
    Please try the code by modifying it and let us know.

    Thanks


      Saul Bretado December 9, 2016

      I was trying to read a csv file as:

      AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__),"asinnumbers.csv")))

      But I am getting the error below:

      Traceback (most recent call last):
      File "amazon_scraper.py", line 66, in
      ReadAsin()
      File "amazon_scraper.py", line 57, in ReadAsin
      url = "http://www.amazon.com/dp/"+i
      TypeError: cannot concatenate 'str' and 'dict' objects

      Any recommendations? I already google about, but could not find anything.


        ScrapeHero December 12, 2016

        Hi Saul,

        You are trying to concatenate a dictionary object with "http://www.amazon.com/dp/".

        Can you try replacing

        url = "http://www.amazon.com/dp/"+i

        with

        url = "http://www.amazon.com/dp/"+i['asin']

        This is assuming that your CSV looks like this

        asin,
        B00JGTVU5A
        B00GJYCIVK,
        B00EPGK7CQ,
        B00EPGKA4G,
        B00YW5DLB4,
        B00KGD0628,
        B00O9A48N2,
        B00O9A4MEW,
        B00UZKG8QU


          Saul Bretado December 16, 2016

          Thanks a lot for this amazing tutorial, but, after using the script for few days, now is not working well, I am getting much as bellow:

          "CATEGORY": null,
          "ORIGINAL_PRICE": null,
          "NAME": null,
          "URL": "http://www.amazon.com/dp/B00FF01SSS",
          "SALE_PRICE": null,
          "AVAILABILITY": null

          And as I told you, everything was working amazing well, even I add the code below to switch headers every time…

          navegador = randint(0,2)
          if navegador==0:
              headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
              print 'Using Chrome'
          elif navegador==1:
              headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
              print 'Using Firefox'
          else:
              headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240'}
              print 'Using Edge'

          And, everything was perfect, til today, any ideas why?
          Thanks!


Rob December 13, 2016

Does anyone know of a commercial version of this process? I am looking to scrape Amazon data for an inventory system. We have the ASINs on incoming excel sheets, but need to pull product data and images to populate the inventory. We’d be happy to pay for a pre-existing version of this process rather than build it ourselves or hire a developer.


    muntazirlakhani December 21, 2018

    Great! I am too looking for this please mail me on arlmathsir@gmail.com if you found any solution.


Jamen McGranahan December 20, 2016

The main issue I see with this is that it only gets the offer from the Buy Box, but not every offer available from Amazon. I’m trying to do this now to see if I can get it to work; just not overly familiar with python. But I know the URLs stay pretty much the same: http://www.amazon.com/gp/offer-listing/{ASIN}/ref=olp_f_freeShipping?ie=UTF8&f_freeShipping=true&f_new=true&f_primeEligible=true


    ScrapeHero December 20, 2016

    Hi James,
    You are correct, the tutorial only scrapes the buy box price.
    You will need to modify the code to get the 3rd party sellers.
    Thanks


Dan March 23, 2017

Hello ScrapeHero,
what of I want to get other product details, how can I change the code, I assume it’s the following parts
"XPATH_NAME = '//h1[@id="title"]//text()'
XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
XPATH_AVAILABILITY = '//div[@id="availability"]//text()'"
thanks


    ScrapeHero March 25, 2017

    Hi Dan,
    Yes – you will need to add or update XPATHS to get additional data.


rockemon May 9, 2017

I Try to get Image url using this xpath :

XPATH_IMG = '//div[@class="imgTagWrapper"]/img/@src//text()'

but the result is Null, can you give me the point to achieved this


    syed mustafa September 22, 2017

    Yes i am also getting same error Did you Find the solution if yes please help


Chris June 18, 2017

Hejsan from Sweden,

I am a total “dummie” regarding python. I tried to use this code with Python 3 instead. There you have pip and requests included as I understand. Anyway, I do not get a data.json file respectively the provided code is not running and if i check it through python they mention missing parentheses. I just wonder if the code should work for python 3 as well and if not, why? Is it a different language?

best regards,

Chris


    ScrapeHero June 18, 2017

    Hi Chris,
    Yes it is almost a new language – v2 code will not work in 3 for most cases especially with libraries used.
    Try downloading and running in V2.

    Thanks


    zanzi88 September 23, 2017

    Hi Chris,

    I am running the following version of python:
    Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type “help”, “copyright”, “credits” or “license” for more information.

    I changed the code only a little to fit python 3. Pasted the code below. Let me know if you need any help.

    from lxml import html
    import csv,os,json
    import requests
    #from exceptions import ValueError
    from time import sleep

    def AmzonParser(url):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
        page = requests.get(url,headers=headers)
        while True:
            sleep(3)
            try:
                doc = html.fromstring(page.content)
                XPATH_NAME = '//h1[@id="title"]//text()'
                XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
                XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
                XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
                XPATH_AVAILABILITY = '//div[@id="availability"]//text()'

                RAW_NAME = doc.xpath(XPATH_NAME)
                RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
                RAW_CATEGORY = doc.xpath(XPATH_CATEGORY)
                RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
                RAw_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)

                NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
                SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
                CATEGORY = ' > '.join([i.strip() for i in RAW_CATEGORY]) if RAW_CATEGORY else None
                ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
                AVAILABILITY = ''.join(RAw_AVAILABILITY).strip() if RAw_AVAILABILITY else None

                if not ORIGINAL_PRICE:
                    ORIGINAL_PRICE = SALE_PRICE

                if page.status_code!=200:
                    raise ValueError('captha')
                data = {
                    'NAME':NAME,
                    'SALE_PRICE':SALE_PRICE,
                    'CATEGORY':CATEGORY,
                    'ORIGINAL_PRICE':ORIGINAL_PRICE,
                    'AVAILABILITY':AVAILABILITY,
                    'URL':url,
                }

                return data
            except Exception as e:
                print(e)

    def ReadAsin():
        # AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__),"Asinfeed.csv")))
        AsinList = ['B0046UR4F4',
                    'B00JGTVU5A',
                    'B00GJYCIVK',
                    'B00EPGK7CQ',
                    'B00EPGKA4G',
                    'B00YW5DLB4',
                    'B00KGD0628',
                    'B00O9A48N2',
                    'B00O9A4MEW',
                    'B00UZKG8QU',]
        extracted_data = []
        for i in AsinList:
            url = "http://www.amazon.com/dp/"+i
            print("Processing: "+url)
            extracted_data.append(AmzonParser(url))
            sleep(5)
        f=open('data.json','w')
        json.dump(extracted_data,f,indent=4)

    if __name__ == "__main__":
        ReadAsin()


LtPitt June 21, 2017

Hello there!

What if my item price changes according to its color?

Great script, love it 🙂


Boulahna July 30, 2017

Thanks a lot for this very useful script. I m going to the next step : Scalable do-it-yourself scraping – How to build and run scrapers on a large scale


Neha Sharma August 24, 2017

Hi, in the bulk extraction for product details, is it limited to 10? Would it be possible to extract more than 10 product details?


    pkale1708 September 7, 2017

    Yes, it’s possible for more than 10 IDs.


miku September 14, 2017

AVAILABILITY does not work in .cn website.


pimo October 11, 2017

Hi there!

What if I have a list of Urls in this form (ASIN + Merchant ID) and only want to scrape the actual quantity?

https://www.amazon.co.uk/dp/B00NI02DB8?m=A2XF3BWCLY1PQM
Quantity: 30


Thanks!


Harrison Kenning November 10, 2017

I keep on getting this error: SSLError: HTTPSConnectionPool(host='www.amazon.com', port=443): Max retries exceeded with url: /dp/B00YG0JV96 (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))

What am I missing?


    ScrapeHero November 10, 2017

    Hi Harrison,
    It is most likely an old version of python.


      rijesh ck January 7, 2018

      use verify=False. like this requests.get(url, headers=headers, verify=False)


Vivek Verma November 14, 2017

Getting Captha i.e. the error, do let me know how to fix it.

Processing: http://www.amazon.com/dp/B00J0K55L0
captha
captha
captha
captha
captha
captha
captha
ERROR: execution aborted


NICHOLAS JENKINS November 19, 2017

I am looking to modify this script to also scrape Walmart, Gamestop, Target, etc what resources can you point me to to modify this script to include those?


Ankita March 14, 2018

I get this error:
File "data1.py", line 48
print e
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(e)?


    ScrapeHero March 18, 2018

    You are trying to run a python 2 script using python 3. Try running this script using Python 2 and it should work.


Sophal SUN April 9, 2018

Thanks a lot for your tutorial, the are awesome, I am a total beginner and with some courses with python I success to Scrape Amazon.com with your tutorials 🙂


Sanika Dhongade April 15, 2018

Thanks a lot for the tutorial. Is there any way to save the same output to a .CSV file instead of JSON?


HG kim July 15, 2018

Is someone who has success prime scrape in amazon? please!! advise me, if there are any successful people who scrape whether you are prime of additional items classified by size & color. my head will be crushed.


SmartY August 22, 2018

Amazon URL responding slow, how can I make it fast?


Klewis October 12, 2018

Is there a way I can import my ASINs from an Excel file and export the prices and findings to another Excel file?


    ScrapeHero October 12, 2018

    Sure anything is possible through programming however not in the scope of this article.
    Libraries that manipulate excel or excel macros can help you do that easily.


Jognnner November 28, 2018

Hey, I think your code might need some modifications because it returns no value currently~


Sarah November 28, 2018

how to scrape the rating and # of reviews for a product


Daniel Gaytán April 30, 2019

I am interested on the API, but I need to get all variations from an ASIN, is that possible?


    ScrapeHero April 30, 2019

    Sure – please reach out to us using our website contact form.
    Thanks


shree May 14, 2019

How to scrape the feedback from consumer?
Thanks in advance


    Go November 15, 2020

    just have the same question. And there is still no reply on this. disappointed.


      ScrapeHero November 17, 2020

      These enhancements are exercises for the reader and our code is for learning purposes only.
      Thank you


Bharat Bhushan June 25, 2019

@ ScrapeHero
Can you please give some idea like how to crawl data from amazon for a specific city ?


Tiana August 15, 2019

I am getting this errors:

Amazon_Scraper.py", line 72, in
ReadAsin()
Amazon_Scraper.py", line 67, in ReadAsin
f=open('data.json','w')
PermissionError: [Errno 13] Permission denied: 'data.json'


    ScrapeHero August 16, 2019

    Looks like the output file cannot be written due to lack of permissions.
    Please google for such generic python errors.


    Kashif March 8, 2020

    Hello.

    I want to be able to do the following with python.

    Initiate a search for any category of products using following parameters:

    No. Reviews
    Average review rating
    Average monthly sales
    Average monthly revenues

    Based on the above parameters, I want python to give me products who fall on the above criteria.

    Please tell me if it’s possible?

    If it’s possible, my next question would be how would we use python to access monthly sales and monthly revenue for a particular product?

    Please looking forward to your reply.


jan August 19, 2019

Is there any way to scrape the ASINs automatically? I mean, I want to scrape over 1000+ products and I don’t want to make a list with that many ASIN numbers.


Terry April 2, 2020

So how would one scrape an ecommerce site of their sale/clearance items automatically on a weekly basis and compare to Amazon’s prices?


Andrew July 14, 2020

I just wonder this there any technical way you can track the number of sales of a product from Amazon?


Anu Rani July 20, 2020

I am getting error while reading data in python??

raise JSONDecodeError(“Extra data”, s, end)

JSONDecodeError: Extra data


arka August 3, 2020

often occurs ;
traceback most recent call last at line — > data = scrape(url) and return e.extract(r.text)……


Jim September 22, 2020

The results returned from the search results never match with the results trough searches manually. Usually, the search results are multiple pages. But the search_results_1.jsonl file only contains a few records.


      Jim September 23, 2020

      Thanks. When I tried the tool using the url https://www.amazon.com/s?k=printer,
      it only returns a few records. But you can see that there are at least 20 pages there.


Pritesh Mistry August 2, 2021

Is it possible to download all the images instead of just one ?and if yes How


Vinay July 5, 2022

How can I scrape the ASIN, or how do I select the ASIN in the selectors.yml file?


Comments are closed.
