Create a Python scraper using SelectorLib for scraping Amazon product details and pricing.
Web scraping helps automate data extraction from websites. In this tutorial, we will build an Amazon scraper for extracting product details and pricing. We will build this simple web scraper using Python and SelectorLib and run it in a console.
Here is how you can scrape Amazon product details from an Amazon product page:
- Markup the data fields to be scraped using Selectorlib
- Copy and run the code provided
Check out our web scraping tutorials to learn how to scrape Amazon reviews easily using Google Chrome and how to build an Amazon review scraper using Python.
Below, we also cover how to scrape product details from the Amazon search results page, how to avoid getting blocked by Amazon, and how to scrape Amazon on a large scale.
Setting up your computer for Amazon Scraping
We will use Python 3 for this Amazon scraper; the code will not run if you are using Python 2.7. To start, you need a computer with Python 3 and pip installed.
Follow this guide to set up your computer and install the packages if you are on Windows:
How To Install Python Packages for Web Scraping in Windows 10
Packages to install for Amazon scraping
- Python Requests, to make requests and download the HTML content of the Amazon product pages
- SelectorLib, a Python package to extract data from the downloaded pages using the YAML file we create
Install them using pip3:

```
pip3 install requests selectorlib
```
Scrape product details from the Amazon Product Page
The Amazon product page scraper will extract the following details from a product page:
- Product Name
- Price
- Short Description
- Full Product Description
- Image URLs
- Rating
- Number of Reviews
- Variant ASINs
- Sales Rank
- Link to all Reviews Page
Markup the data fields using Selectorlib
We have already marked up the data, so you can just skip this step if you want to get right to the data.
Here is what our template looks like. Save it as a file called selectors.yml in the same directory as your code.
```yaml
name:
    css: '#productTitle'
    type: Text
price:
    css: '#price_inside_buybox'
    type: Text
short_description:
    css: '#featurebullets_feature_div'
    type: Text
images:
    css: '.imgTagWrapper img'
    type: Attribute
    attribute: data-a-dynamic-image
rating:
    css: span.arp-rating-out-of-text
    type: Text
number_of_reviews:
    css: 'a.a-link-normal h2'
    type: Text
variants:
    css: 'form.a-section li'
    multiple: true
    type: Text
    children:
        name:
            css: ""
            type: Attribute
            attribute: title
        asin:
            css: ""
            type: Attribute
            attribute: data-defaultasin
product_description:
    css: '#productDescription'
    type: Text
sales_rank:
    css: 'li#SalesRank'
    type: Text
link_to_all_reviews:
    css: 'div.card-padding a.a-link-emphasis'
    type: Link
```
Selectorlib is a combination of tools for developers that makes marking up and extracting data from web pages easy. The Selectorlib Chrome extension lets you mark the data you need to extract, creates the CSS selectors or XPaths needed to extract that data, and then previews how the extracted data will look.
You can learn more about Selectorlib and how to use it to mark up data here.
The Code
Create a folder called amazon-scraper and paste your Selectorlib YAML template file into it as selectors.yml.
Let’s create a file called amazon.py and paste the code below into it. All it does is:
- Read a list of Amazon product URLs from a file called urls.txt
- Scrape the data
- Save the data as a JSON Lines file
```python
from selectorlib import Extractor
import requests
import json
from time import sleep

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)

    # Simple check to see if the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None

    # Pass the HTML of the page to the Extractor and return the data
    return e.extract(r.text)

with open("urls.txt", 'r') as urllist, open('output.jsonl', 'w') as outfile:
    for url in urllist.readlines():
        data = scrape(url)
        if data:
            json.dump(data, outfile)
            outfile.write("\n")
            # sleep(5)
```
Running the Amazon Product Page Scraper
You can get the full code from Github – https://github.com/scrapehero-code/amazon-scraper
You can start your scraper by typing the command:
```
python3 amazon.py
```
Once the scrape is complete, you should see a file called output.jsonl with your data. Here is an example for the URL
https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/dp/B085383P7M/
{ "name": "2020 HP 15.6\" Laptop Computer, 10th Gen Intel Quard-Core i7 1065G7 up to 3.9GHz, 16GB DDR4 RAM, 512GB PCIe SSD, 802.11ac WiFi, Bluetooth 4.2, Silver, Windows 10, YZAKKA USB External DVD + Accessories", "price": "$959.00", "short_description": "Powered by latest 10th Gen Intel Core i7-1065G7 Processor @ 1.30GHz (4 Cores, 8M Cache, up to 3.90 GHz); Ultra-low-voltage platform. Quad-core, eight-way processing provides maximum high-efficiency power to go.\n15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768) Display; Intel Iris Plus Graphics\n16GB 2666MHz DDR4 Memory for full-power multitasking; 512GB Solid State Drive (PCI-e), Save files fast and store more data. With massive amounts of storage and advanced communication power, PCI-e SSDs are great for major gaming applications, multiple servers, daily backups, and more.\nRealtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo; 1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\nWindows 10 Home, 64-bit, English; Natural silver; YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad\n› See more product details", "images": "{\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX425_.jpg\":[425,425],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX466_.jpg\":[466,466],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY355_.jpg\":[355,355],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX569_.jpg\":[569,569],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY450_.jpg\":[450,450],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX679_.jpg\":[679,679],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX522_.jpg\":[522,522]}", "variants": [ { "name": "Click to select 4GB DDR4 RAM, 128GB PCIe SSD", "asin": "B01MCZ4LH1" }, { "name": "Click to select 8GB DDR4 RAM, 256GB PCIe SSD", "asin": "B08537NR9D" }, { "name": "Click to select 12GB DDR4 RAM, 512GB PCIe SSD", "asin": "B08537ZDYH" }, { "name": "Click to select 16GB DDR4 RAM, 512GB PCIe SSD", "asin": "B085383P7M" }, { "name": "Click to select 20GB DDR4 RAM, 1TB PCIe SSD", "asin": "B08537NDVZ" } ], "product_description": "Capacity:16GB DDR4 RAM, 512GB PCIe SSD\n\nProcessor\n\n Intel Core i7-1065G7 (1.3 GHz base frequency, up to 3.9 GHz with Intel Turbo Boost Technology, 8 MB cache, 4 cores)\n\nChipset\n\n Intel Integrated SoC\n\nMemory\n\n 16GB DDR4-2666 SDRAM\n\nVideo graphics\n\n Intel Iris Plus Graphics\n\nHard drive\n\n 512GB PCIe NVMe M.2 SSD\n\nDisplay\n\n 15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768)\n\nWireless connectivity\n\n Realtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo\n\nExpansion slots\n\n 1 multi-format SD media card reader\n\nExternal ports\n\n 1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\n\nMinimum dimensions (W x D x H)\n\n 9.53 x 14.11 x 0.70 in\n\nWeight\n\n 3.75 lbs\n\nPower supply type\n\n 45 W Smart AC power adapter\n\nBattery type\n\n 3-cell, 41 Wh Li-ion\n\nBattery life mixed usage\n\n Up to 11 hours and 30 minutes\n\n Video Playback Battery life\n\n Up to 10 hours\n\nWebcam\n\n HP TrueVision HD Camera with integrated dual array digital microphone\n\nAudio 
features\n\n Dual speakers\n\nOperating system\n\n Windows 10 Home 64\n\nAccessories\n\n YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad", "link_to_all_reviews": "https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/product-reviews/B085383P7M/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews" }
Scrape Amazon products from the Search Results Page
The Amazon search results page scraper will extract the following details from a search results page:
- Product Name
- Price
- URL
- Rating
- Number of Reviews
The steps and code for scraping search results are very similar to those of the product page scraper.
Markup the data fields using Selectorlib
Here is our Selectorlib YAML file. Let’s call it search_results.yml:
```yaml
products:
    css: 'div[data-component-type="s-search-result"]'
    xpath: null
    multiple: true
    type: Text
    children:
        title:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Text
        url:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Link
        rating:
            css: 'div.a-row.a-size-small span:nth-of-type(1)'
            xpath: null
            type: Attribute
            attribute: aria-label
        reviews:
            css: 'div.a-row.a-size-small span:nth-of-type(2)'
            xpath: null
            type: Attribute
            attribute: aria-label
        price:
            css: 'span.a-price:nth-of-type(1) span.a-offscreen'
            xpath: null
            type: Text
```
The Code
The code is almost identical to the previous scraper, except that we iterate through each product and save it as a separate line.
Let’s create a file called searchresults.py and paste the code below into it. Here is what the code does:
- Open a file called search_results_urls.txt and read search result page URLs
- Scrape the data
- Save to a JSON Lines file called search_results_output.jsonl
```python
from selectorlib import Extractor
import requests
import json
from time import sleep

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('search_results.yml')

def scrape(url):
    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)

    # Simple check to see if the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None

    # Pass the HTML of the page to the Extractor and return the data
    return e.extract(r.text)

with open("search_results_urls.txt", 'r') as urllist, open('search_results_output.jsonl', 'w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url)
        if data:
            for product in data['products']:
                product['search_url'] = url
                print("Saving Product: %s" % product['title'])
                json.dump(product, outfile)
                outfile.write("\n")
        # sleep(5)
```
Running the Amazon Scraper to Scrape Search Result
You can start your scraper by typing the command:
```
python3 searchresults.py
```
Once the scrape is complete, you should see a file called search_results_output.jsonl with your data.
Here is an example of the output for the URL https://www.amazon.com/s?k=laptops:
https://github.com/scrapehero-code/amazon-scraper/blob/master/search_results_output.jsonl
What to do if you get blocked while scraping Amazon
We are adding this extra section to cover some methods you can use to avoid getting blocked while scraping Amazon. Amazon is very likely to flag you as a "BOT" if you start scraping hundreds of pages using the code above. The idea is to avoid being flagged as a bot and running into problems. How do we solve these challenges?
Mimic human behavior as much as possible.
While we cannot guarantee that you will never be blocked, here are some tips and tricks on how to avoid getting blocked by Amazon:
Use proxies and rotate them
Let us say we are scraping hundreds of products on Amazon.com from a laptop, which usually has just one IP address. Amazon will know we are a bot in no time, as no human would ever visit hundreds of product pages in a minute. To look more like a human, make requests to Amazon.com through a pool of IP addresses or proxies. The rule of thumb here is to have each proxy or IP address make no more than 5 requests to Amazon per minute. If you are scraping about 100 pages per minute, you need about 100/5 = 20 proxies. You can read more about rotating proxies here.
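Here is a minimal sketch of that idea with the requests library. The proxy URLs below are placeholders for a pool you would source yourself:

```python
import random
import requests

# Hypothetical proxy pool - replace with proxies you actually have access to
PROXIES = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
]

def get_with_rotating_proxy(url, headers=None):
    # Pick a different proxy for each request
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        headers=headers,
        proxies={'http': proxy, 'https': proxy},  # route both schemes through it
        timeout=30,
    )
```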
Specify the User Agents of latest browsers and rotate them
If you look at the code above, you will see a line where we set the User-Agent string for the request we are making:

```python
'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
```
Just like proxies, it is always good to have a pool of user-agent strings. Just make sure you use user-agent strings of the latest and most popular browsers, and rotate the strings for each request you make to Amazon. You can learn more about rotating user-agent strings in Python here. It is also a good idea to create combinations of (User-Agent, IP address) so traffic looks more like a human than a bot.
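A minimal sketch of user-agent rotation, assuming a hand-picked pool of strings (the values below are examples you would keep up to date):

```python
import random

# A small, hand-maintained pool of user-agent strings; refresh these
# periodically to match current browser releases (values are examples)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
]

def pick_user_agent():
    # Rotate the user agent on every request
    return random.choice(USER_AGENTS)

# Usage: headers['user-agent'] = pick_user_agent() before calling requests.get()
```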
Reduce the number of ASINs scraped per minute
You can try slowing down the scrape a bit to give Amazon less of a chance to flag you as a bot. But about 5 requests per IP per minute isn't much throttling. If you need to go faster, add more proxies. You can modify the speed by increasing or decreasing the delay in the sleep call (it is commented out in the code above; uncomment it to enable throttling).
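If you want the delay to look less mechanical, one option is a randomized pause instead of a fixed sleep(5); the interval bounds below are arbitrary:

```python
import random
from time import sleep

def polite_pause(min_seconds=5, max_seconds=10):
    # A randomized delay looks less mechanical than a fixed sleep(5)
    sleep(random.uniform(min_seconds, max_seconds))
```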
Retry, Retry, Retry
When you are blocked by Amazon, make sure you retry that request. The code above simply skips a URL when the scrape fails; you could do a better job by creating a retry queue using a list, and retrying those URLs after all the other products have been scraped from Amazon.
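Here is a minimal sketch of that retry-queue idea, reusing the scrape() function from the product page scraper above:

```python
import json

failed_urls = []  # URLs that were blocked on the first pass

with open("urls.txt") as urllist, open("output.jsonl", "w") as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url)  # scrape() defined in the product page scraper above
        if data:
            json.dump(data, outfile)
            outfile.write("\n")
        else:
            failed_urls.append(url)  # queue it instead of giving up

    # Retry everything that failed, after the main pass is done
    for url in failed_urls:
        data = scrape(url)
        if data:
            json.dump(data, outfile)
            outfile.write("\n")
```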
How to Solve Amazon Scraping Challenges
This Amazon scraper should work for small-scale scraping and hobby projects. It can get you started on your road to building bigger and better scrapers. However, if you do want to scrape Amazon for thousands of pages at short intervals here are some important things to keep in mind:
Use a Web Scraping Framework like PySpider or Scrapy
When you’re crawling a massive site like Amazon.com, you need to spend some time figuring out how to run your entire crawl smoothly. Choose an open-source framework for building your scraper, such as Scrapy or PySpider, both written in Python. These frameworks have pretty active communities and can handle a lot of the errors that happen while scraping without disturbing the entire scraper. Most of them also let you use multiple threads to speed up scraping if you are using a single computer. You can deploy Scrapy to your own servers using ScrapyD.
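For illustration, here is a rough sketch of the product page scrape as a Scrapy spider, reusing the Selectorlib template from this tutorial. The file and class names are our own, and in practice you would tune Scrapy's built-in settings (DOWNLOAD_DELAY, RETRY_TIMES, CONCURRENT_REQUESTS) rather than hand-roll retries:

```python
import scrapy
from selectorlib import Extractor

class AmazonProductSpider(scrapy.Spider):
    name = 'amazon_products'
    # Reuse the same YAML template as the plain-requests version
    extractor = Extractor.from_yaml_file('selectors.yml')

    def start_requests(self):
        # Read the same urls.txt input file used earlier in this tutorial
        with open('urls.txt') as f:
            for url in f.read().splitlines():
                yield scrapy.Request(url)

    def parse(self, response):
        # Scrapy's scheduler handles retries, concurrency and politeness
        yield self.extractor.extract(response.text)
```

You could run a sketch like this with scrapy runspider amazon_spider.py -o output.jsonl.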
If you need speed, Distribute and Scale-Up using a Cloud Provider
There is a limit to the number of pages you can scrape from Amazon when using a single computer. If you’re scraping Amazon on a large scale, you need a lot of servers to get the data within a reasonable time. You could consider hosting your scraper in the cloud and using a scalable version of the framework, like Scrapy Redis. For broader crawls, use message brokers like Redis, RabbitMQ, or Kafka to run multiple spider instances and speed up crawls.
Use a scheduler if you need to run the scraper periodically
If you are using a scraper to get updated prices of products, you need to refresh your data frequently to keep track of the changes. If you are using the script in this tutorial, use cron (on UNIX) or Task Scheduler (on Windows) to schedule the crawler. If you are using Scrapy, scrapyd+cron can help schedule your spiders so you can refresh the data on a regular interval.
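For example, a crontab entry along these lines (the paths are placeholders) would run the product page scraper every morning at 6 AM:

```
# minute hour day-of-month month day-of-week command
0 6 * * * cd /path/to/amazon-scraper && python3 amazon.py
```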
Use a database to store the Scraped Data from Amazon
If you are scraping a large number of products from Amazon, writing data to a file will soon become inconvenient. Retrieving data becomes tough, and you might even end up with gibberish in the file when multiple processes write to it. Use a database even if you are scraping from a single computer. MySQL will be just fine for moderate workloads, and you can run simple analytics on the scraped data with tools like Tableau, PowerBI, or Metabase by connecting them to your database. For heavier write loads you can look into NoSQL databases like MongoDB, Cassandra, etc.
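As a minimal sketch of the idea, here is how you might write each scraped product to SQLite (Python's built-in sqlite3 module) instead of a JSON Lines file. The table schema is our own invention; for real workloads you would swap in MySQL or another server database:

```python
import sqlite3

conn = sqlite3.connect('amazon.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        url TEXT PRIMARY KEY,
        name TEXT,
        price TEXT,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def save_product(url, data):
    # INSERT OR REPLACE keeps one row per URL across repeated scrapes
    conn.execute(
        "INSERT OR REPLACE INTO products (url, name, price) VALUES (?, ?, ?)",
        (url, data.get('name'), data.get('price')),
    )
    conn.commit()
```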
Use Request Headers, Proxies, and IP Rotation to prevent getting Captchas from Amazon
Amazon has a lot of anti-scraping measures. If you hit Amazon too hard, it will block you in no time and you’ll start seeing captchas instead of product pages. To prevent that, while going through each Amazon product page, it’s better to change your headers by replacing the User-Agent value. This makes requests look like they’re coming from a browser and not a script.
To crawl Amazon on a very large scale, use proxies and IP rotation to reduce the number of captchas you get. You can learn more techniques for preventing blocks by Amazon and other sites here – How to prevent getting blacklisted while scraping. You can also use Python to solve some basic captchas using an OCR library called Tesseract.
Write some simple data quality tests
Scraped data is always messy. An XPath that works for one page might not work for another variation of the same page on the same site, and Amazon has lots of product page layouts. If you spend an hour writing basic sanity checks for your data, like verifying that the price is a decimal, you’ll know when your scraper breaks and you’ll also be able to minimize its impact. Incorporating data quality checks into your code is especially helpful if you are scraping Amazon data for price monitoring, seller monitoring, stock monitoring, etc.
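A minimal sketch of such a check, along the lines of the price example above (the validation rules are illustrative, not exhaustive):

```python
import re

def price_looks_valid(price):
    # e.g. "$959.00" -> True; None or "" -> False
    if not price:
        return False
    return re.match(r'^\$?\d[\d,]*(\.\d{2})?$', price.strip()) is not None

def validate(record):
    # Return a list of problems found in one scraped record
    problems = []
    if not record.get('name'):
        problems.append('missing name')
    if not price_looks_valid(record.get('price')):
        problems.append('bad price: %r' % record.get('price'))
    return problems
```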
We hope this tutorial gave you a better idea of how to scrape Amazon or similar e-commerce websites. As a company, we understand e-commerce data, having worked with it before. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help.
How to use Amazon Product Data
- Monitor Amazon products for change in price, stock count/availability, rating, etc.: By using a web scraper, you can update your data feeds on a timely basis to monitor any product changes. These data feeds can help you form pricing strategies by looking at your competition – other sellers or brands.
- Scrape Amazon product details that you can’t get with the Product Advertising API: Amazon provides a Product Advertising API, but like most other “API”s, this API doesn’t provide all the information that Amazon has on a product page. A web scraper can help you extract all the details displayed on the product page.
- Analyze how a particular brand sells on Amazon: If you’re a retailer, you can monitor your competitors’ products and see how well they do in the market, and make adjustments to reprice and sell your own products. You could also use it to monitor your distribution channel to identify how your products are sold on Amazon by sellers, and whether it is causing you any harm.
- Find customer opinions from Amazon product reviews: Reviews offer abundant amounts of information. If you’re targeting an established set of sellers who have been selling reasonable volumes, you can extract the reviews of their products to find what you should avoid and what you could quickly improve on while trying to sell similar products on Amazon.
Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
Responses
I don’t get the output. No error either. In the JSON file all the values are null, e.g.:
```json
[
  {
    "CATEGORY": null,
    "ORIGINAL_PRICE": null,
    "NAME": null,
    "URL": "http://www.amazon.com/dp/B0046UR4F4",
    "SALE_PRICE": null,
    "AVAILABILITY": null
  }
]
```
I got my mistake. We need to give our own headers={ }. The user agent is different for different users. This can easily be found using the link given below.
http://www.whoishostingthis.com/tools/user-agent/
Glad to hear you got it working!
Thanks for this. This solved the issue.
Hi, the user agent trick didn’t work for me. ScrapeHero, has something changed in the Amazon code? I get these results:
```json
[
  {
    "CATEGORY": null,
    "ORIGINAL_PRICE": null,
    "NAME": null,
    "URL": "http://www.amazon.com/dp/B0046UR4F4",
    "SALE_PRICE": null,
    "AVAILABILITY": null
  },
  {
    "CATEGORY": null,
    "ORIGINAL_PRICE": null,
    "NAME": null,
    "URL": "http://www.amazon.com/dp/B00JGTVU5A",
    "SALE_PRICE": null,
    "AVAILABILITY": null
  }
]
```
We will have a look at the code and see if it still works and get back with a comment as soon as our paying job allows 😉
thanks bro..
Nice python script. Great work for beginner.
Are there any cheap web hosting solutions that have Python installed? Hoping I could set up my required Amazon products, update prices daily, then point a website/app to the .json file on my new shared hosting.
Maybe even AWS, Azure etc or a Cloud IDE. Just looking for a simple solution to start off with.
Most VPSs or shared hosting plans support Python. Just ask them before buying.
Nice implementation! Very well done! Just a question… What is the purpose of the sleep() functions? How come Amazon does not return a typical robot/spider message to use their API?
Hi,
sleep just pauses the execution for a bit so that we don’t hammer the server.
Can you clarify the second part of the question – not sure what that means.
Thanks
In the beginning I did not use headers in the requests.get() so in the HTML (html.fromstring()) content there was the following message “To discuss automated access to Amazon data please contact mail. For information about migrating to our APIs refer to our Marketplace APIs at link, or our Product Advertising API at link for advertising use cases.” from Amazon.
You should mimic the browser as much as possible including headers, cookies and sessions – that with IP rotation will work for small scale data gathering
Any way to extract the reviews based on the ASIN number for a particular product?
Sure but it would need modification to this code.
The tutorial provides the basis for it but you will need to identify the xpaths for the review and grab the content that way.
Hi Dheeraj, we have put together another tutorial on review extraction – Tutorial: How to scrape amazon product reviews
Does this code work for extracting 1500 products? Adding IP rotation, of course. Please let me know.
Hi Saul,
The code should work but at those numbers (1500 products) the code is not the problem.
Everything else related to web scraping that we have written about on our site starts to matter.
Please try the code by modifying it and let us know.
Thanks
I was trying to read a csv file as:
```python
AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__), "asinnumbers.csv")))
```
But I am getting the error below:
```
Traceback (most recent call last):
  File "amazon_scraper.py", line 66, in <module>
    ReadAsin()
  File "amazon_scraper.py", line 57, in ReadAsin
    url = "http://www.amazon.com/dp/"+i
TypeError: cannot concatenate 'str' and 'dict' objects
```
Any recommendations? I already google about, but could not find anything.
Hi Saul,
You are trying to concatenate a dictionary object with "http://www.amazon.com/dp/".
Can you try replacing

```python
url = "http://www.amazon.com/dp/" + i
```

with

```python
url = "http://www.amazon.com/dp/" + i['asin']
```
This is assuming that your CSV looks like this:

```
asin
B00JGTVU5A
B00GJYCIVK
B00EPGK7CQ
B00EPGKA4G
B00YW5DLB4
B00KGD0628
B00O9A48N2
B00O9A4MEW
B00UZKG8QU
```
Thanks a lot for this amazing tutorial, but after using the script for a few days, it is now not working well. I am getting output like the below:
```json
{
  "CATEGORY": null,
  "ORIGINAL_PRICE": null,
  "NAME": null,
  "URL": "http://www.amazon.com/dp/B00FF01SSS",
  "SALE_PRICE": null,
  "AVAILABILITY": null
}
```
And as I told you, everything was working amazingly well, even though I added the code below to switch headers every time…
```python
navegador = randint(0, 2)
if navegador == 0:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
    print 'Using Chrome'
elif navegador == 1:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    print 'Using Firefox'
else:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240'}
    print 'Using Edge'
```
And everything was perfect until today. Any ideas why?
Thanks!
Does anyone know of a commercial version of this process? I am looking to scrape Amazon data for an inventory system. We have the ASINs on incoming excel sheets, but need to pull product data and images to populate the inventory. We’d be happy to pay for a pre-existing version of this process rather than build it ourselves or hire a developer.
Great! I am looking for this too. Please mail me at arlmathsir@gmail.com if you find any solution.
The main issue I see with this is that it only gets the offer from the Buy Box, but not every offer available from Amazon. I’m trying to do this now to see if I can get it to work; just not overly familiar with python. But I know the URLs stay pretty much the same: http://www.amazon.com/gp/offer-listing/{ASIN}/ref=olp_f_freeShipping?ie=UTF8&f_freeShipping=true&f_new=true&f_primeEligible=true
Hi James,
You are correct, the tutorial only scrapes the buy box price.
You will need to modify the code to get the 3rd party sellers.
Thanks
Hello ScrapeHero,
what if I want to get other product details? How can I change the code? I assume it’s the following part:
```python
XPATH_NAME = '//h1[@id="title"]//text()'
XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
XPATH_AVAILABILITY = '//div[@id="availability"]//text()'
```
thanks
Hi Dan,
Yes – you will need to add or update XPATHS to get additional data.
I tried to get the image URL using this XPath:

```python
XPATH_IMG = '//div[@class="imgTagWrapper"]/img/@src//text()'
```

but the result is null. Can you give me a pointer to achieve this?
Yes, I am also getting the same error. Did you find the solution? If yes, please help.
Hejsan from Sweden,
I am a total “dummie” regarding Python. I tried to use this code with Python 3 instead, where pip and requests are included, as I understand. Anyway, I do not get a data.json file; the provided code is not running, and if I check it through Python, it mentions missing parentheses. I just wonder if the code should work for Python 3 as well, and if not, why? Is it a different language?
best regards,
Chris
Hi Chris,
Yes it is almost a new language – v2 code will not work in 3 for most cases especially with libraries used.
Try downloading and running in V2.
Thanks
Hi Chris,
I am running the following version of python:
```
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
```
I changed the code only a little to fit python 3. Pasted the code below. Let me know if you need any help.
```python
from lxml import html
import csv, os, json
import requests
# from exceptions import ValueError
from time import sleep

def AmzonParser(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(url, headers=headers)
    while True:
        sleep(3)
        try:
            doc = html.fromstring(page.content)
            XPATH_NAME = '//h1[@id="title"]//text()'
            XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
            XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
            XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
            XPATH_AVAILABILITY = '//div[@id="availability"]//text()'

            RAW_NAME = doc.xpath(XPATH_NAME)
            RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
            RAW_CATEGORY = doc.xpath(XPATH_CATEGORY)
            RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
            RAW_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)

            NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
            SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
            CATEGORY = ' > '.join([i.strip() for i in RAW_CATEGORY]) if RAW_CATEGORY else None
            ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
            AVAILABILITY = ''.join(RAW_AVAILABILITY).strip() if RAW_AVAILABILITY else None

            if not ORIGINAL_PRICE:
                ORIGINAL_PRICE = SALE_PRICE
            if page.status_code != 200:
                raise ValueError('captha')

            data = {
                'NAME': NAME,
                'SALE_PRICE': SALE_PRICE,
                'CATEGORY': CATEGORY,
                'ORIGINAL_PRICE': ORIGINAL_PRICE,
                'AVAILABILITY': AVAILABILITY,
                'URL': url,
            }
            return data
        except Exception as e:
            print(e)

def ReadAsin():
    # AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__), "Asinfeed.csv")))
    AsinList = ['B0046UR4F4',
                'B00JGTVU5A',
                'B00GJYCIVK',
                'B00EPGK7CQ',
                'B00EPGKA4G',
                'B00YW5DLB4',
                'B00KGD0628',
                'B00O9A48N2',
                'B00O9A4MEW',
                'B00UZKG8QU']
    extracted_data = []
    for i in AsinList:
        url = "http://www.amazon.com/dp/" + i
        print("Processing: " + url)
        extracted_data.append(AmzonParser(url))
        sleep(5)
    f = open('data.json', 'w')
    json.dump(extracted_data, f, indent=4)

if __name__ == "__main__":
    ReadAsin()
```
Hello there!
What if my item price changes according to its color?
Great script, love it 🙂
Thanks a lot for this very useful script. I’m going to the next step: Scalable do-it-yourself scraping – How to build and run scrapers on a large scale.
Hi! In the bulk extraction of product details, is it limited to 10? Would it be possible to extract details for more than 10 products?
Yes, it’s possible for more than 10 IDs.
AVAILABILITY does not work in .cn website.
Hi there!
What if I have a list of URLs in this form (ASIN + Merchant ID) and only want to scrape the actual quantity?
https://www.amazon.co.uk/dp/B00NI02DB8?m=A2XF3BWCLY1PQM
Quantity: 30
(the page’s quantity dropdown lists options 1 to 30, with 1 selected by default)
Thanks!
I keep on getting this error:

```
SSLError: HTTPSConnectionPool(host='www.amazon.com', port=443): Max retries exceeded with url: /dp/B00YG0JV96 (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))
```

What am I missing?
Hi Harrison,
It is most likely an old version of Python.
Use verify=False, like this: requests.get(url, headers=headers, verify=False)
Getting “captha”, i.e. the error below. Do let me know how to fix it.

```
Processing: http://www.amazon.com/dp/B00J0K55L0
captha
captha
captha
captha
captha
captha
captha
ERROR: execution aborted
```
I am looking to modify this script to also scrape Walmart, Gamestop, Target, etc. What resources can you point me to for modifying this script to include those?
I get this error:

```
File "data1.py", line 48
    print e
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(e)?
```
You are trying to run a python 2 script using python 3. Try running this script using Python 2 and it should work.
Thanks a lot for your tutorials, they are awesome. I am a total beginner, and with some Python courses I managed to scrape Amazon.com using your tutorials 🙂
Thanks a lot for the tutorial. Is there any way to save the same output to a .CSV file instead of JSON?
Has anyone had success scraping Prime information on Amazon? Please advise me if anyone has successfully scraped whether an item is Prime, or additional items classified by size & color. My head will be crushed.
The Amazon URL is responding slowly; how can I make it fast?
Is there a way I can import my ASINs from an Excel file and export the prices and findings to another Excel file?
Sure, anything is possible through programming; however, that is not in the scope of this article.
Libraries that manipulate Excel, or Excel macros, can help you do that easily.
Hey, I think your code might need some modifications because it returns no value currently~
how to scrape the rating and # of reviews for a product
https://scrapehero-amazon-product-info-v1.p.mashape.com/product-details?asin=B01HSIIFQ2 this URL is not working.
And can you please give me similar in PHP…
Thanks
Hi Vinod,
Please go to https://www.scrapehero.com/amazon-api-subscription/ to use the API.
APIs are language agnostic so it will work with PHP too.
I am interested on the API, but I need to get all variations from an ASIN, is that possible?
Sure – please reach out to us using our website contact form.
Thanks
How to scrape the feedback from consumers?
Thanks in advance
just have the same question. And there is still no reply on this. disappointed.
These enhancements are exercises for the reader and our code is for learning purposes only.
Thank you
@ ScrapeHero
Can you please give some idea of how to crawl data from Amazon for a specific city?
I am getting these errors:

```
Amazon_Scraper.py", line 72, in <module>
    ReadAsin()
Amazon_Scraper.py", line 67, in ReadAsin
    f = open('data.json','w')
PermissionError: [Errno 13] Permission denied: 'data.json'
```
Looks like the output file cannot be written due to lack of permissions.
Please google for such generic python errors.
Hello.
I want to be able to do the following with python.
Initiate a search for any category of products using the following parameters:
No. Reviews
Average review rating
Average monthly sales
Average monthly revenues
Based on the above parameters, I want Python to give me products that fall under the above criteria.
Please tell me if it’s possible?
If it’s possible, my next question would be how would we use python to access monthly sales and monthly revenue for a particular product?
Looking forward to your reply.
Is there any way to scrape the ASINs automatically? I mean, I want to scrape over 1000+ products, and I don’t want to make a list with that many ASIN numbers.
You can try our cloud for free for that https://cloud.scrapehero.com
So how would one scrape an ecommerce site of their sale/clearance items automatically on a weekly basis and compare to Amazon’s prices?
I just wonder, is there any technical way to track the number of sales of a product on Amazon?
I am getting an error while reading data in Python:

```
raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data
```

It often occurs; the traceback (most recent call last) points at the lines data = scrape(url) and return e.extract(r.text)……
The results returned from the search results scraper never match the results of searching manually. Usually, the search results span multiple pages, but the search_results_1.jsonl file only contains a few records.
You can use this tool to scrape Amazon search results pages for free – https://www.scrapehero.com/marketplace/amazon-product-search/
Thanks. When I tried the tool using the URL https://www.amazon.com/s?k=printer, it only returned a few records. But you can see that there are at least 20 pages there.
Is it possible to download all the images instead of just one? And if yes, how?
How can I scrape the ASIN, and how do I select the ASIN in the selectors.yml file?