This article outlines a few methods for scraping Amazon product data, which you can then export to Excel or other formats for easier access and use.
There are three methods for scraping Amazon product data:
- Scraping Amazon in Python or JavaScript
- Using the ScrapeHero Cloud, Amazon Product Details and Pricing Scraper, a no-code tool
- Using the Amazon Product Details and Pricing API by ScrapeHero Cloud
Building an Amazon Scraper in Python or JavaScript
In this section, we will guide you on how to scrape Amazon product data using either Python or JavaScript. We will utilize the browser automation framework called Playwright to emulate browser behavior in our code.
You could also use Python Requests, BeautifulSoup, or LXML to build an Amazon scraper without using a browser or a browser automation library. But bypassing the anti-scraping mechanisms put in place can be challenging and is beyond the scope of this article.
Here are the steps for scraping Amazon product data using Playwright:
Step 1: Choose either Python or JavaScript as your programming language.
Step 2: Install Playwright for your preferred language.
# Python
pip install playwright
# to download the necessary browsers
playwright install

# JavaScript
npm install playwright@latest
Step 3: Write your code to emulate browser behavior and extract the desired data from Amazon using the Playwright API. You can use the code:
import asyncio
import json

from playwright.async_api import async_playwright

url = "https://www.amazon.com/Imploding-Kittens-First-Expansion-Exploding/dp/B01HSIIFQ2/?th=1"


async def extract_data(page) -> None:
    """
    Parsing details from the product page and saving them as JSON

    Args:
        page: webpage of the browser
    """
    # Initializing selectors and XPaths
    title_xpath = "h1[id='title']"
    asin_selector = "//td/div[@id='averageCustomerReviews']"
    rating_xpath = "//div[@id='prodDetails']//i[contains(@class,'review-stars')]/span"
    ratings_count_xpath = "//div[@id='prodDetails']//span[@id='acrCustomerReviewText']"
    selling_price_xpath = "//input[@id='priceValue']"
    listing_price_xpath = "//div[@id='apex_desktop_qualifiedBuybox']//span[@class='a-price a-text-price']/span[@class='a-offscreen']"
    img_link_xpath = "//div[contains(@class,'imgTagWrapper')]//img"
    brand_xpath = "//tr[contains(@class,'po-brand')]//span[@class='a-size-base po-break-word']"
    status_xpath = "//div[@id='availabilityInsideBuyBox_feature_div']//div[@id='availability']/span"
    description_ul_xpath = "//ul[@class='a-unordered-list a-vertical a-spacing-mini']/li"
    product_description_xpath = "//div[@id='productDescription']//span"

    # Waiting for the page to finish loading
    await page.wait_for_selector(title_xpath)

    # Extracting the elements
    product_title = (
        await page.locator(title_xpath).inner_text()
        if await page.locator(title_xpath).count()
        else None
    )
    asin = (
        await page.locator(asin_selector).get_attribute("data-asin")
        if await page.locator(asin_selector).count()
        else None
    )
    rating = (
        await page.locator(rating_xpath).inner_text()
        if await page.locator(rating_xpath).count()
        else None
    )
    rating_count = (
        await page.locator(ratings_count_xpath).inner_text()
        if await page.locator(ratings_count_xpath).count()
        else None
    )
    selling_price = (
        await page.locator(selling_price_xpath).get_attribute("value")
        if await page.locator(selling_price_xpath).count()
        else None
    )
    listing_price = (
        await page.locator(listing_price_xpath).inner_text()
        if await page.locator(listing_price_xpath).count()
        else None
    )
    brand = (
        await page.locator(brand_xpath).inner_text()
        if await page.locator(brand_xpath).count()
        else None
    )
    product_description = (
        await page.locator(product_description_xpath).inner_text()
        if await page.locator(product_description_xpath).count()
        else None
    )
    image_link = (
        await page.locator(img_link_xpath).get_attribute("src")
        if await page.locator(img_link_xpath).count()
        else None
    )
    status = (
        await page.locator(status_xpath).inner_text()
        if await page.locator(status_xpath).count()
        else None
    )

    # full_description is found as a list, so iterating the list elements to get the descriptions
    full_description_list = []
    desc_lists = page.locator(description_ul_xpath)
    desc_count = await desc_lists.count()
    for index in range(desc_count):
        li_element = desc_lists.nth(index=index)
        desc = (
            await li_element.locator("//span").inner_text()
            if await li_element.locator("//span").count()
            else None
        )
        # skipping empty bullets so that join() never sees None
        if desc:
            full_description_list.append(desc)
    full_description = " | ".join(full_description_list)

    # cleaning data
    product_title = clean_data(product_title)
    asin = clean_data(asin)
    rating = clean_data(rating)
    rating_count = clean_data(rating_count)
    selling_price = clean_data(selling_price)
    listing_price = clean_data(listing_price)
    brand = clean_data(brand)
    image_link = clean_data(image_link)
    status = clean_data(status)
    product_description = clean_data(product_description)
    full_description = clean_data(full_description)

    data_to_save = {
        "product_title": product_title,
        "asin": asin,
        "rating": rating,
        "rating_count": rating_count,
        "selling_price": selling_price,
        "listing_price": listing_price,
        "brand": brand,
        "image_links": image_link,
        "status": status,
        "product_description": product_description,
        "full_description": full_description,
    }
    save_data(data_to_save, "Data.json")


async def run(playwright) -> None:
    # Initializing the browser and creating a new page.
    browser = await playwright.chromium.launch(headless=False)
    context = await browser.new_context()
    page = await context.new_page()
    await page.set_viewport_size({"width": 1920, "height": 1080})
    page.set_default_timeout(300000)

    # Navigating to the product page
    await page.goto(url, wait_until="domcontentloaded")
    await extract_data(page)
    await context.close()
    await browser.close()


def clean_data(data: str) -> str:
    """
    Cleaning data by removing extra white spaces and Unicode characters

    Args:
        data (str): data to be cleaned

    Returns:
        str: cleaned string
    """
    if not data:
        return None
    cleaned_data = " ".join(data.split()).strip()
    cleaned_data = cleaned_data.encode("ascii", "ignore").decode("ascii")
    return cleaned_data


def save_data(product_page_data: dict, filename: str):
    """Saving the product details as a JSON file

    Args:
        product_page_data (dict): details of the product
        filename (str): name of the JSON file
    """
    with open(filename, "w") as outfile:
        json.dump(product_page_data, outfile, indent=4)


async def main() -> None:
    async with async_playwright() as playwright:
        await run(playwright)


if __name__ == "__main__":
    asyncio.run(main())
The equivalent JavaScript code:

const { chromium } = require('playwright');
const fs = require('fs');

const url = "https://www.amazon.com/Imploding-Kittens-First-Expansion-Exploding/dp/B01HSIIFQ2/?th=1";

/**
 * Save data as a JSON file
 * @param {object} data
 */
function saveData(data) {
    let dataStr = JSON.stringify(data, null, 2);
    fs.writeFile("data.json", dataStr, 'utf8', function (err) {
        if (err) {
            console.log("An error occurred while writing JSON Object to File.");
            return console.log(err);
        }
        console.log("JSON file has been saved.");
    });
}

// Cleaning data by removing extra white spaces and Unicode characters
function cleanData(data) {
    if (!data) {
        return null;
    }
    // removing extra spaces and unicode characters
    let cleanedData = data.split(/\s+/).join(" ").trim();
    cleanedData = cleanedData.replace(/[^\x00-\x7F]/g, "");
    return cleanedData;
}

// The data extraction function used to extract
// necessary data from the element.
async function extractData(data, type) {
    let count = await data.count();
    if (count) {
        if (type == 'innerText') {
            return await data.innerText();
        } else {
            return await data.getAttribute(type);
        }
    }
    return null;
}

async function parsePage(page) {
    // initializing selectors and XPaths
    let titleXPath = "h1[id='title']";
    let asinSelector = "//td/div[@id='averageCustomerReviews']";
    let ratingXPath = "//div[@id='prodDetails']//i[contains(@class,'review-stars')]/span";
    let ratingsCountXPath = "//div[@id='prodDetails']//span[@id='acrCustomerReviewText']";
    let sellingPriceXPath = "//input[@id='priceValue']";
    let listingPriceXPath = "//div[@id='apex_desktop_qualifiedBuybox']//span[@class='a-price a-text-price']/span[@class='a-offscreen']";
    let imgLinkXPath = "//div[contains(@class,'imgTagWrapper')]//img";
    let brandXPath = "//tr[contains(@class,'po-brand')]//span[@class='a-size-base po-break-word']";
    let statusXPath = "//div[@id='availabilityInsideBuyBox_feature_div']//div[@id='availability']/span";
    let descriptionULXPath = "//ul[@class='a-unordered-list a-vertical a-spacing-mini']/li";
    let productDescriptionXPath = "//div[@id='productDescription']//span";

    // wait until the page loads
    await page.waitForSelector(titleXPath);

    // extract data using the selectors
    let productTitle = await extractData(page.locator(titleXPath), 'innerText');
    let asin = await extractData(page.locator(asinSelector), 'data-asin');
    let rating = await extractData(page.locator(ratingXPath), 'innerText');
    let ratingCount = await extractData(page.locator(ratingsCountXPath), 'innerText');
    let sellingPrice = await extractData(page.locator(sellingPriceXPath), 'value');
    let listingPrice = await extractData(page.locator(listingPriceXPath), 'innerText');
    let brand = await extractData(page.locator(brandXPath), 'innerText');
    let productDescription = await extractData(page.locator(productDescriptionXPath), 'innerText');
    let imageLink = await extractData(page.locator(imgLinkXPath), 'src');
    let status = await extractData(page.locator(statusXPath), 'innerText');

    // since the full description is in <li> elements, iteration is needed
    let fullDescriptionList = [];
    let descLists = page.locator(descriptionULXPath);
    let descCount = await descLists.count();
    for (let index = 0; index < descCount; index++) {
        let liElement = descLists.nth(index);
        let desc = await extractData(liElement.locator('//span'), 'innerText');
        fullDescriptionList.push(desc);
    }
    let fullDescription = fullDescriptionList.join(" | ") || null;

    // cleaning data
    productTitle = cleanData(productTitle);
    asin = cleanData(asin);
    rating = cleanData(rating);
    ratingCount = cleanData(ratingCount);
    sellingPrice = cleanData(sellingPrice);
    listingPrice = cleanData(listingPrice);
    brand = cleanData(brand);
    imageLink = cleanData(imageLink);
    status = cleanData(status);
    productDescription = cleanData(productDescription);
    fullDescription = cleanData(fullDescription);

    let dataToSave = {
        productTitle: productTitle,
        asin: asin,
        rating: rating,
        ratingCount: ratingCount,
        sellingPrice: sellingPrice,
        listingPrice: listingPrice,
        brand: brand,
        imageLinks: imageLink,
        status: status,
        productDescription: productDescription,
        fullDescription: fullDescription,
    };
    saveData(dataToSave);
}

/**
 * The main function initiates a browser object and handles the navigation.
 */
async function run() {
    // initializing browser and creating new page
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.setViewportSize({ width: 1920, height: 1080 });
    page.setDefaultTimeout(30000);

    // Navigating to the product page
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await parsePage(page);
    await context.close();
    await browser.close();
}

run();
This code shows how to scrape Amazon using the Playwright library in Python and JavaScript.
The corresponding scripts have two main functions, namely:
- run function: This function takes a Playwright instance as an input and performs the scraping process. It launches a Chromium browser instance, opens a new page, and navigates directly to the Amazon product page. The extract_data function is then called to extract the product details and store the data in a JSON file.
- extract_data function: This function takes a Playwright page object as input, waits for the product title to load, and collects the product details. The details include name, brand, rating, price, availability, description, etc.
Finally, the main function uses the async_playwright context manager to execute the run function. A JSON file containing the Amazon product data you just scraped will be created.
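The clean_data helper in the script simply collapses runs of whitespace and strips non-ASCII characters before the data is saved. A standalone sketch of that normalization, runnable on its own:

```python
# Standalone sketch of the whitespace/Unicode normalization used by clean_data.
def clean_data(data):
    if not data:
        return None
    # str.split() with no arguments splits on any Unicode whitespace,
    # so joining with single spaces collapses runs of spaces and NBSPs
    cleaned = " ".join(data.split()).strip()
    # drop any non-ASCII characters (e.g. trademark symbols)
    return cleaned.encode("ascii", "ignore").decode("ascii")

print(clean_data("  Exploding\u00a0Kittens\u2122   Card  Game  "))
# -> "Exploding Kittens Card Game"
```

Scraped Amazon fields routinely contain non-breaking spaces and trademark symbols, which is why the helper is applied to every extracted value.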
Step 4: Run your code to scrape Amazon product data.
Using No-Code Amazon Product Details and Pricing Scraper by ScrapeHero Cloud
The Amazon Product Details and Pricing Scraper by ScrapeHero Cloud is a convenient method for scraping product details from Amazon. It provides an easy, no-code method for scraping data, making it accessible for individuals with limited technical skills.
This section will guide you through the steps to set up and use the Amazon Product Details and Pricing scraper.
1. Sign up or log in to your ScrapeHero Cloud account.
2. Go to the Amazon Product Details and Pricing Scraper by ScrapeHero Cloud in the marketplace.
3. Add the scraper for scraping Amazon product data to your account. (Don’t forget to verify your email if you haven’t already.)
4. You need to add the product URL or ASIN to start the scraper. If it’s just a single query, enter it in the field provided.
- You can get the product URL from the Amazon search results page.
- You can get the product’s ASIN from the product information section of a product listing page.
5. To scrape results for multiple queries, add multiple product URLs or ASINs to the SearchQuery field and save the settings.
6. To start the scraper, click on the Gather Data button.
7. The scraper will start fetching data for your queries, and you can track its progress under the Jobs tab.
8. Once finished, you can view or download the data from it.
9. You can also extract data from Amazon to Excel from here. Just click on the Download Data button, select “Excel,” and open the downloaded file using Microsoft Excel.
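If you used the Python script from the earlier section instead, you can produce an Excel-friendly file yourself. A minimal sketch, assuming a Data.json shaped like the script's single-product output (the sample values below are illustrative), converts the saved JSON to CSV, which Microsoft Excel opens directly:

```python
import csv
import json

# A sample record shaped like the Playwright script's Data.json output
# (hypothetical values, for illustration only).
product = {
    "product_title": "Imploding Kittens Expansion",
    "asin": "B01HSIIFQ2",
    "selling_price": "14.99",
}

# Round-trip through JSON exactly as the script saves it ...
with open("Data.json", "w") as f:
    json.dump(product, f, indent=4)

# ... then convert the saved JSON to a CSV file that Excel can open.
with open("Data.json") as f:
    record = json.load(f)

with open("Data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()      # column headers: product_title, asin, ...
    writer.writerow(record)   # one row per product
```

For multiple products, collect the dictionaries in a list and call writerow once per record.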
Using Amazon Product Details and Pricing API by ScrapeHero Cloud
The ScrapeHero Cloud Amazon Product Details and Pricing API is an alternate tool for extracting product details from Amazon. This user-friendly API enables those with minimal technical expertise to obtain product data effortlessly from Amazon.
This section will walk you through the steps to configure and utilize the Amazon Product Details and Pricing API provided by ScrapeHero Cloud.
- Sign up or log in to your ScrapeHero Cloud account.
- Go to the Amazon Product Details and Pricing API by ScrapeHero Cloud in the marketplace.
- Click on the subscribe button.
Note: As this is a paid API, you must subscribe to one of the available plans to use the API.
- After subscribing to a plan, head over to the Documentation tab to get the necessary steps to integrate the API into your application.
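The Documentation tab has the exact endpoint, parameters, and response format. As a rough illustration of what such an integration typically looks like, here is a sketch in which the base URL, parameter names, and API key are placeholders, not ScrapeHero's actual API:

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually send the request

API_KEY = "YOUR_API_KEY"  # issued to your account after subscribing to a plan
# Placeholder endpoint; substitute the real one from the Documentation tab.
BASE = "https://example.com/api/v1/amazon-product-details"

def build_request_url(asin: str) -> str:
    # Typical REST pattern: the API key and the product ASIN as query parameters
    return BASE + "?" + urlencode({"x_api_key": API_KEY, "asin": asin})

print(build_request_url("B01HSIIFQ2"))
```

The response would then be parsed as JSON and streamed into your application, replacing the browser-automation code entirely.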
Use Cases of Amazon Product Data
If you’re unsure as to why you should scrape Amazon product data, here are a few use cases where this data would be helpful:
Market Analysis and Competitive Intelligence
By scraping Amazon product data, businesses can analyze market trends, understand consumer preferences, and monitor competitor activities.
Price Optimization
By scraping Amazon prices using an Amazon price scraper, retailers and sellers can use the data obtained to optimize their pricing strategies by analyzing the pricing patterns of similar products.
Product Development and Innovation
Manufacturers and brands can scrape Amazon product data to identify gaps in the market, understand consumer pain points, and gather ideas for product improvements or new product features.
Reputation and Brand Management
The data obtained by the Amazon data scraper can be used to monitor product reviews and ratings on Amazon. It also helps businesses manage their online reputation and respond to customer feedback effectively.
Also Read: Scrape Amazon Reviews using Google Chrome
Inventory and Supply Chain Management
With Amazon scraping, businesses can better forecast demand, optimize stock levels, and reduce inventory holding costs by analyzing sales velocity, seasonal trends, and consumer demand patterns on Amazon.
Frequently Asked Questions
1. Can you scrape data from Amazon?
Yes. You can scrape Amazon product data by using a Python or JavaScript scraper. If you do not want to code, then use the ScrapeHero Amazon Product Details and Pricing Scraper.
You can also choose the Amazon Product Details and Pricing API by ScrapeHero Cloud to integrate with any application to stream product data.
2. How to scrape Amazon product information using BeautifulSoup?
To scrape Amazon product information using BeautifulSoup, send GET requests to the product’s page using the Requests library, then parse the HTML response using BeautifulSoup to extract essential information like name, price, and description.
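As a minimal sketch of that parsing step, run here on an inline HTML snippet (fetching a live Amazon page requires handling anti-bot measures, and the markup and selectors below are illustrative, not Amazon's current page structure):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded product page; with the Requests
# library, the real fetch would be along the lines of:
#   html = requests.get(url, headers={"User-Agent": "..."}).text
html = """
<html><body>
  <h1 id="title"> Imploding Kittens: An Expansion </h1>
  <span class="a-offscreen">$14.99</span>
  <div id="productDescription"><span>A 20-card expansion pack.</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# find() locates the first matching tag; get_text(strip=True) trims whitespace
name = soup.find("h1", id="title").get_text(strip=True)
price = soup.find("span", class_="a-offscreen").get_text(strip=True)
description = soup.find("div", id="productDescription").get_text(strip=True)
print(name, price, description, sep=" | ")
```

Guard each find() against None in real use, since Amazon serves several page layouts and any element may be absent.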
3. How can you scrape Amazon using Selenium (Python)?
To scrape Amazon using Selenium (Python), set up Selenium WebDriver to automate a web browser and navigate to the Amazon product page. Then use locators such as By.XPATH to find and interact with page elements.
4. Does Amazon allow review scraping?
Amazon does not directly support or encourage web scraping. But it is not illegal to scrape publicly available data.
You can scrape Amazon product reviews using Python or JavaScript. ScrapeHero provides an Amazon Product Reviews and Ratings Scraper, which is a no-code tool for this purpose. You can also use the Amazon Reviews API by ScrapeHero Cloud for integrating with applications.
5. What is the subscription fee for the Amazon Product Details and Pricing Scraper by ScrapeHero?
To learn about the pricing, visit the ScrapeHero pricing page.
6. Is it legal to scrape from Amazon?
The legality of web scraping depends on the jurisdiction, but it is generally considered legal if you are scraping publicly available data. Please refer to Legal Information to learn more about the legality of web scraping.
Posted in: Featured, How to, ScrapeHero Cloud
Published On: March 26, 2024