Python's Requests library was not developed specifically for web scraping. So why use it for web scraping in Python? Because Requests lets you send HTTP requests and handle the responses easily through a high-level interface.
In this article, you will learn about Python web scraping with the Requests library. You can use it to build scrapers, collect website data, or automate repetitive tasks.
Note: There are various Python frameworks and libraries that are used for web scraping aside from Requests.
An Overview of Scraping Web Pages With Python Requests
Let’s learn Python web scraping with the Requests library in detail, including how to send GET and POST requests, set headers, handle cookies, and manage sessions.
You will also see how HTTP requests are made, how responses are handled, and how the required data is extracted from the HTML. Additionally, the article covers techniques for parsing HTML data using the LXML library.
Step by Step Installation Process
Before you begin web scraping with Python Requests, you must install Python. Next, install the required libraries, in this case Requests and LXML. To install them, use the commands:
pip install requests
pip install lxml
How to Create Your First Python Scraper
This Python web scraping tutorial explains how the extraction of data is made simpler if Requests is used for web scraping. You can create your own web scraper in Python by following certain steps.
The workflow of the scraper:
- Open the website https://scrapeme.live/shop
- Collect all product URLs by navigating through the first few listing pages
- Collect details such as
  - Name
  - Description
  - Price
  - Stock
  - Image URL
  - Product URL
- Save all the collected data to a CSV file
Importing the Required Libraries
You can begin scraping web pages with Python by importing the required libraries.
import requests
from lxml import html
import csv
Sending a Request to the Website
The Requests module can be used here to collect data from the websites. Note that it is the Requests library that allows Python to send HTTP requests.
Let’s send a request to https://scrapeme.live/shop
url = "https://scrapeme.live/shop/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
response = requests.get(url, headers=headers)
Before proceeding further, validate the response received from the website using its status code. Validation criteria can differ from one website to another.
def verify_response(response):
    return response.status_code == 200
Based on the status code, you determine whether the response is valid. If the status code is 200, the response is considered valid; otherwise it is invalid. For an invalid response, you can retry the request, which often resolves temporary failures.
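def send_request(url):
    max_retry = 3
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        # One attempt used up; loop again until none remain
        max_retry -= 1
    raise Exception("Invalid response received even after retrying.")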
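The retry loop above sends the next attempt immediately. A common refinement is to pause between attempts so a struggling server has time to recover. The sketch below is one way to do that, not part of the original scraper: `fetch_with_retries`, `fetch`, `is_valid`, `delay`, and `sleep` are hypothetical names introduced here for illustration.

```python
import time


def fetch_with_retries(fetch, is_valid, max_retry=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until is_valid(response) is true or the attempts run out.

    All parameter names are assumptions made for this sketch.
    """
    for attempt in range(1, max_retry + 1):
        response = fetch()
        if is_valid(response):
            return response
        if attempt < max_retry:
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
    raise RuntimeError(f"No valid response after {max_retry} attempts")
```

With Requests, this could be called as fetch_with_retries(lambda: requests.get(url, headers=headers), verify_response).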
The next step after receiving a valid response is to parse the HTML response.
You have the response from the listing page. Now you can collect the product URLs.
From the screenshot, it is clear that the a node with the class "woocommerce-LoopProduct-link woocommerce-loop-product__link" contains the URL to the product page. Since this a node sits under an li node, its XPath can be written as //li/a[contains(@class, "product__link")].
The product's URL is in the href attribute of that node. Using the lxml module, you can read the attribute value as shown below:
from lxml import html
parser = html.fromstring(response.text)
product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')
Similarly, the next page URL can be obtained from the next button in the HTML.
The XPath //a[@class="next page-numbers"] produces two results. To get the next page URL from the a node, select the first result: wrap the XPath in parentheses and index it, so that //a[@class="next page-numbers"] becomes (//a[@class="next page-numbers"])[1]/@href.
from lxml import html
parser = html.fromstring(response.text)
next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]
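Indexing the XPath result with [0] assumes a next button is always present. On the last listing page there is presumably no such link, and the lookup would raise an IndexError. The sketch below shows one defensive variant that returns None instead. The HTML fragments are made up for illustration and only stand in for real listing pages.

```python
from lxml import html


def get_next_page_url(page_html):
    """Return the next listing page URL, or None when there is no next link."""
    parser = html.fromstring(page_html)
    matches = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')
    # xpath() returns a list; an empty list means no next button was found.
    return matches[0] if matches else None


# Hypothetical fragments, not copied from the real site.
middle_page = '<nav><a class="next page-numbers" href="https://scrapeme.live/shop/page/3/">Next</a></nav>'
last_page = '<nav><a class="prev page-numbers" href="https://scrapeme.live/shop/page/2/">Prev</a></nav>'

print(get_next_page_url(middle_page))
print(get_next_page_url(last_page))
```

A pagination loop can then stop as soon as the function returns None.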
Paginate through the listing pages, adding the product URLs from each page to a list. When pagination is done, send a request to each product URL.
You might have noticed that parser.xpath() returns a list of strings. The same XPaths are used for all product pages, but not every page contains every data point. For example, a price may be listed for some products and missing for others that are out of stock.
In such a case, parser.xpath() returns an empty list. Indexing an empty list with [0] raises an IndexError, which stops the remaining code from running. The function clean_string is created to handle this situation.
def clean_string(list_or_txt, connector=' '):
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())
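To make the behavior of clean_string concrete, here is a small sketch that runs it on a few inputs. The sample values are invented to mimic what parser.xpath() might return; they are not taken from the site.

```python
def clean_string(list_or_txt, connector=' '):
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())


# Made-up sample values for illustration.
name = clean_string(['\n  Bulbasaur ', '\t'])
images = clean_string(['a.jpg', 'b.jpg'], connector=' | ')
missing = clean_string([])

print(name)     # Bulbasaur
print(images)   # a.jpg | b.jpg
print(missing)  # None
```

An empty XPath result becomes None rather than raising an error, and stray whitespace between text fragments collapses to single spaces.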
Let’s now collect each data point: the name, description, price, stock, and image URL.
Collecting the Name
From the image, it is clear that node h1 contains the name of the product. You can see that the product page does not have any other h1 node. Simply call the XPath //h1 for selecting that particular node.
Use the following code since the text is inside the node:
title = parser.xpath('//h1[contains(@class, "product_title")]/text()')
title = clean_string(title)
Collecting the Description
Here the product description is inside the node p. You can also see that it is inside the div with the class name substring ‘product-details__short-description’. Collect the text inside it as follows:
description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')
description = clean_string(description)
Collecting the Stock
From the image, it is evident that stock is directly present inside the node p, whose class contains the string ‘in-stock’. Use the code to collect data from it:
stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')
stock = clean_string(stock)
if stock:
    stock = stock.replace(' in stock', '')
Collecting the Price
Here the price can be directly seen in the node p having class price. So use the code to get the actual price value of the product:
price = parser.xpath('//p[@class="price"]//text()')
price = clean_string(price)
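The cleaned price is still a string that includes the currency symbol. If you want to sort or compute with prices later, you can convert it to a number. The sketch below is one possible approach. `parse_price` is a hypothetical helper, and it assumes the cleaned text looks something like "£ 63.00", with "," as a thousands separator and "." as the decimal point.

```python
import re
from decimal import Decimal


def parse_price(price_text):
    """Extract the first number from a cleaned price string, or return None."""
    if not price_text:
        return None
    match = re.search(r'\d+(?:[.,]\d+)*', price_text)
    if not match:
        return None
    # Assumption: "," separates thousands and "." marks the decimals.
    return Decimal(match.group().replace(',', ''))


print(parse_price('£ 63.00'))
print(parse_price(None))
```

Decimal is used instead of float so that monetary values keep their exact digits.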
Collecting the Image URL
In the screenshot, the attribute href of the node ‘a’ is highlighted. It is from this href attribute that you will get the image URL.
image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
image_url = clean_string(list_or_txt=image_url, connector=' | ')
Complete Code for Python Web Scraping With the Requests Library
import csv

import requests
from lxml import html


def verify_response(response):
    """
    Verify if we received valid response or not
    """
    return response.status_code == 200


def send_request(url):
    """
    Send request and handle retries.
    :param url:
    :return: Response we received after sending request to the URL.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/113.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.5"
    }
    max_retry = 3
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        else:
            # One attempt used up; loop again until none remain
            max_retry -= 1
    print("Invalid response received even after retrying. URL with the issue is:", url)
    raise Exception("Stopping the code execution as invalid response received.")


def get_next_page_url(response):
    """
    Collect pagination URL.
    :param response:
    :return: next listing page url
    """
    parser = html.fromstring(response.text)
    next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]
    return next_page_url


def get_product_urls(response):
    """
    Collects all product URL from a listing page response.
    :param response:
    :return: list of urls. List of product page urls returned.
    """
    parser = html.fromstring(response.text)
    product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')
    return product_urls


def clean_stock(stock):
    """
    Clean the data stock by removing unwanted text present in it.
    :param stock:
    :return: Stock data. Stock number will be returned by removing extra string.
    """
    stock = clean_string(stock)
    if stock:
        stock = stock.replace(' in stock', '')
        return stock
    else:
        return None


def clean_string(list_or_txt, connector=' '):
    """
    Clean and fix list of objects received. We are also removing unwanted white spaces.
    :param list_or_txt:
    :param connector:
    :return: Cleaned string.
    """
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())


def get_product_data(url):
    """
    Collect all details of a product.
    :param url:
    :return: All data of a product.
    """
    response = send_request(url)
    parser = html.fromstring(response.text)
    title = parser.xpath('//h1[contains(@class, "product_title")]/text()')
    price = parser.xpath('//p[@class="price"]//text()')
    stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')
    description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')
    image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
    product_data = {
        'Title': clean_string(title), 'Price': clean_string(price), 'Stock': clean_stock(stock),
        'Description': clean_string(description), 'Image_URL': clean_string(list_or_txt=image_url, connector=' | '),
        'Product_URL': url}
    return product_data


def save_data_to_csv(data, filename):
    """
    save list of dict to csv.
    :param data: Data to be saved to csv
    :param filename: Filename of csv
    """
    keys = data[0].keys()
    with open(filename, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)


def start_scraping():
    """
    Starting function.
    """
    listing_page_url = 'https://scrapeme.live/shop/'
    product_urls = list()
    # Walk through the first five listing pages
    for listing_page_number in range(1, 6):
        response = send_request(listing_page_url)
        listing_page_url = get_next_page_url(response)
        products_from_current_page = get_product_urls(response)
        product_urls.extend(products_from_current_page)
    results = []
    for url in product_urls:
        results.append(get_product_data(url))
    save_data_to_csv(data=results, filename='scrapeme_live_Python_data.csv')
    print('Data saved as csv')


if __name__ == "__main__":
    start_scraping()
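The save_data_to_csv function in the listing above reads the column names from data[0], so it would raise an IndexError if no products were collected. The sketch below is one possible hardened variant, offered as an assumption rather than as the article's method. It skips writing when the list is empty and sets an explicit UTF-8 encoding, which helps when values contain characters such as a currency symbol.

```python
import csv


def save_data_to_csv(data, filename):
    """Write a list of dicts to CSV. Return False and write nothing if data is empty."""
    if not data:
        return False
    # Explicit encoding keeps non-ASCII characters intact on any platform.
    with open(filename, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
    return True
```

The return value lets the caller decide whether to print a success message or report that nothing was scraped.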
Sending GET Requests Using Cookies and Headers
Now let’s learn how to send the requests using headers and cookies.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
# Placeholder cookie for illustration; replace it with the cookies the site actually sets
cookies = {"cookie_name": "cookie_value"}
url = "https://scrapeme.live/shop/"
response = requests.get(url, headers=headers, cookies=cookies)
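When many requests share the same headers and cookies, a requests.Session can carry them for you instead of passing them on every call. The sketch below shows the idea without touching the network: it prepares a request and prints the headers that would be sent. The cookie name "session_id" and its value are invented for illustration.

```python
import requests

session = requests.Session()
session.headers.update({"Accept-Language": "en-US,en;q=0.5"})
# "session_id" is a made-up cookie name used only for this sketch.
session.cookies.set("session_id", "example-value")

# prepare_request() merges the session's headers and cookies into the request
# without sending anything over the network.
prepared = session.prepare_request(requests.Request("GET", "https://scrapeme.live/shop/"))
print(prepared.headers["Accept-Language"])
print(prepared.headers["Cookie"])
```

In a real scraper you would call session.get(url) instead, and the session would also store any cookies the server sets in its responses.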
Sending POST Requests
Let’s have a look at making POST requests with the Python Requests library.
payload = {"key1": "value1", "key2": "value2"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5"
}
url = "https://scrapeme.live/shop/"
response = requests.post(url, headers=headers, json=payload)
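The json= argument sends the payload as a JSON body, while data= sends it as a URL-encoded form. Which one a site expects depends on the endpoint. The sketch below builds both requests without sending them, so you can compare how Requests encodes the same payload.

```python
import requests

url = "https://scrapeme.live/shop/"
payload = {"key1": "value1", "key2": "value2"}

# Build the requests without sending them so the encoding can be inspected.
as_json = requests.Request("POST", url, json=payload).prepare()
as_form = requests.Request("POST", url, data=payload).prepare()

print(as_json.headers["Content-Type"])  # application/json
print(as_form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(as_form.body)                     # key1=value1&key2=value2
```

If a POST request returns an unexpected response, checking whether the endpoint wants JSON or form data is a reasonable first step.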
Why Web Scraping With Python?
Python is widely considered one of the best programming languages for web scraping, as it has many libraries dedicated to the task. Python syntax is also easy to understand and learn, as it reads much like statements in English.
Scraping web pages with Python is common for several reasons:
- Ease of Use
  Python is a simple and readable programming language that is accessible to both beginners and programming experts. Due to its straightforward syntax, developers can quickly understand the concepts of web scraping.
- Large and Active Community
  The vast and active developer community of Python continuously contributes to open-source libraries and frameworks. Because of this, there are plenty of resources, tutorials, and code snippets for learning web scraping, and you can draw on this collective knowledge to solve problems.
- Abundance of Libraries
  Python libraries such as BeautifulSoup and LXML are designed for parsing and navigating HTML and XML documents. They help you extract data from web pages, manipulate HTML structures, and handle various data formats.
- Requests Library
  Requests is a Python library that lets you make HTTP requests and handle the responses. It provides a high-level interface for sending HTTP requests such as GET and POST, setting headers, handling cookies, and managing sessions.
- Data Manipulation and Analysis
  Python libraries like Pandas and NumPy are powerful and widely used data manipulation and analysis libraries that can process, clean, and analyze data efficiently. You can rely on them to filter, sort, aggregate, and visualize the data for data-driven decision-making.
- Integration With Other Tools and Technologies
  Python integrates well with other web scraping tools and technologies. It can be combined with database systems such as MySQL and MongoDB for storing and managing the scraped data, and it works well with the Django and Flask frameworks for building web applications.
Wrapping Up
This tutorial has given you a detailed explanation of using the Requests library for web scraping and how you can employ it to collect the data you need. For small-scale web scraping projects, the scraper you created through this article will be enough.
If your needs are more specific, like web scraping Amazon product details, then you can use ScrapeHero Cloud, which is a hassle-free, no-code, and affordable means of scraping popular websites.
But what if you need enterprise-grade web scraping? Then you can consider ScrapeHero web scraping services, which are bespoke, custom, and more advanced. Also, only a data service provider like ScrapeHero can provide you with access to valuable data that is otherwise difficult to obtain.
Frequently Asked Questions
1. Which Python library is used for web scraping?
For web scraping in Python, the most commonly used libraries are BeautifulSoup, Requests, and Selenium.