Ultimate Guide to Web Scraping With Python Using the Requests Library

Python's Requests library was not developed specifically for web scraping. So why use it for web scraping in Python? Because Requests provides a high-level interface that makes it easy to send HTTP requests and handle the responses.

In this article, you will learn how to scrape web pages in Python with the Requests library. You can use it to build scrapers, collect website data, or automate repetitive tasks.

Note: There are various Python frameworks and libraries that are used for web scraping aside from Requests.

An Overview of Scraping Web Pages With Python Requests

Let’s look at Python web scraping with the Requests library in detail, including how to send GET and POST requests, set headers, handle cookies, and manage sessions.

You will also see how HTTP requests are made, how responses are handled, and how the required data can be extracted from the returned HTML. Additionally, the article covers techniques for parsing HTML data using the LXML library.

Step by Step Installation Process

Before you begin Python Requests web scraping, you must install Python. Next, install the required libraries, in this case, Requests and LXML. To install them use the commands:

pip install requests
pip install lxml

How to Create Your First Python Scraper

This tutorial shows how Requests simplifies data extraction. You can create your own web scraper in Python by following the steps below.

The workflow of the scraper:

  1. Open the website https://scrapeme.live/shop
  2. Collect all product URLs by navigating through the first few listing pages
  3. Collect details such as
    • Name
    • Description
    • Price
    • Stock
    • Image URL
    • Product URL
  4. Now you can save all the data you collected to a CSV file

Importing the Required Libraries

You can begin scraping web pages with Python by importing the required libraries.

import requests
from lxml import html
import csv

Sending a Request to the Website

Here, the Requests library is used to send HTTP requests from Python and collect data from the website.

Let’s send a request to https://scrapeme.live/shop

url = "https://scrapeme.live/shop/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
}
response = requests.get(url, headers=headers)

Before proceeding further, validate the response received from the website. Here this is done using the response status code; note that validation criteria differ from website to website.

def verify_response(response):
    # A status code of 200 is treated as a valid response.
    return response.status_code == 200

The status code determines whether the response is valid: a value of 200 is treated as valid, and anything else as invalid. For invalid responses, you can add retries, which often resolves the issue.

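def send_request(url, max_retry=3):
    # Try the request up to max_retry times before giving up.
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        # Decrement by one; "max_retry -= max_retry" would drop to 0 after a single attempt.
        max_retry -= 1
    raise Exception("Invalid response received even after retrying: " + url)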
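
The retry idea above can also be written so that it is testable without a network connection. The helper below is a minimal sketch, not part of the original scraper: `fetch_with_retries`, its `fetch` parameter, the `delay` argument, and `FakeResponse` are hypothetical names introduced for illustration. It accepts any callable that returns an object with a `status_code` attribute.

```python
import time


def fetch_with_retries(fetch, url, max_retry=3, delay=0.0):
    """Call fetch(url) up to max_retry times and return the first 200 response.

    `fetch` is any callable returning an object with a `status_code`
    attribute, for example `lambda u: requests.get(u, headers=headers)`.
    """
    for attempt in range(1, max_retry + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_retry:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    raise RuntimeError(f"No valid response from {url} after {max_retry} attempts")


class FakeResponse:
    """Stand-in for requests.Response, used only for this demonstration."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate a server that fails twice and then succeeds.
statuses = iter([500, 503, 200])
calls = []


def fake_fetch(url):
    calls.append(url)
    return FakeResponse(next(statuses))


result = fetch_with_retries(fake_fetch, "https://scrapeme.live/shop/")
print(result.status_code, len(calls))  # 200 3
```

With the real library, the same helper could be called as `fetch_with_retries(lambda u: requests.get(u, headers=headers), url, delay=1.0)`.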

The next step after receiving a valid response is to parse the HTML response.

You have the response from the listing page. Now you can collect the product URLs.

On the listing page, the 'a' node with the class "woocommerce-LoopProduct-link woocommerce-loop-product__link" contains the URL to the product page. Since this 'a' node sits under an 'li' node, its XPath can be written as //li/a[contains(@class, "product__link")].

The product’s URL is in the “href” attribute of that node. Using the lxml module, you can access the attribute value as shown below:

from lxml import html
parser = html.fromstring(response.text)
product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')
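
To see what this XPath selects without sending a request, it can be run against a small inline HTML string. This is a minimal sketch: the markup in `sample_listing` is a hypothetical, trimmed-down imitation of a listing page, not the site's actual HTML.

```python
from lxml import html

# Hypothetical, trimmed-down listing markup used only for this illustration.
sample_listing = """
<ul class="products">
  <li class="product">
    <a href="https://scrapeme.live/shop/Bulbasaur/"
       class="woocommerce-LoopProduct-link woocommerce-loop-product__link">Bulbasaur</a>
    <a href="?add-to-cart=1" class="button add_to_cart_button">Add to basket</a>
  </li>
  <li class="product">
    <a href="https://scrapeme.live/shop/Ivysaur/"
       class="woocommerce-LoopProduct-link woocommerce-loop-product__link">Ivysaur</a>
  </li>
</ul>
"""

parser = html.fromstring(sample_listing)
# contains() matches on a substring of the class attribute,
# so the add-to-cart link (a different class) is not selected.
product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')
print(product_urls)
```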

Similarly, the next page URL can be obtained from the next button in HTML.

The XPath //a[@class="next page-numbers"] produces two results, so select the first one to get the next page URL from the 'a' node. Wrap the XPath in parentheses and index it: //a[@class="next page-numbers"] becomes (//a[@class="next page-numbers"])[1]/@href.

from lxml import html
parser = html.fromstring(response.text)
next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]

Collect all the product URLs in a list. Then paginate through the listing pages, adding the product URLs from each page to that list. Once pagination is complete, send requests to the product URLs.
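
The pagination step can be sketched as a loop that stops when there is no next page, when a page repeats, or when a page limit is reached. This is an illustrative sketch under stated assumptions: `collect_product_urls` and `parse_listing` are hypothetical names, and `fake_site` is an in-memory stand-in so that no request is sent.

```python
def collect_product_urls(start_url, parse_listing, max_pages=5):
    """Walk listing pages and gather product URLs.

    `parse_listing(url)` is a hypothetical callable that returns a tuple
    (product_urls, next_page_url); next_page_url is None on the last page.
    """
    product_urls = []
    visited = set()
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page_urls, url = parse_listing(url)
        product_urls.extend(page_urls)
    return product_urls


# In-memory stand-in for three listing pages.
fake_site = {
    "/shop/": (["/shop/a/", "/shop/b/"], "/shop/page/2/"),
    "/shop/page/2/": (["/shop/c/"], "/shop/page/3/"),
    "/shop/page/3/": (["/shop/d/"], None),
}

all_urls = collect_product_urls("/shop/", fake_site.__getitem__)
first_two_pages = collect_product_urls("/shop/", fake_site.__getitem__, max_pages=2)
print(all_urls)         # ['/shop/a/', '/shop/b/', '/shop/c/', '/shop/d/']
print(first_two_pages)  # ['/shop/a/', '/shop/b/', '/shop/c/']
```

In a real scraper, `parse_listing` could send the request, run the product-link XPath, and return None for the next page when the next-button XPath yields an empty list.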

You might have noticed that parser.xpath() returns a list of strings. The same XPaths are used for all product pages, but some data points may be missing on some of them; for example, a product that is out of stock may not list a price.

In such a case, parser.xpath() returns an empty list. Indexing an empty list with [0] raises an IndexError, which stops the remaining code from running. The function ‘clean_string’ is created to handle this situation.

def clean_string(list_or_txt, connector=' '):
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())
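
A few sample calls show how the helper behaves. The function is repeated here so the example runs on its own, and the input values are made-up stand-ins for what parser.xpath() might return.

```python
def clean_string(list_or_txt, connector=' '):
    # Same helper as in the article, repeated so the example is self-contained.
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())


print(clean_string([]))                                   # None
print(clean_string(['\n  Bulbasaur ', '\t']))             # Bulbasaur
print(clean_string(['a.jpg', 'b.jpg'], connector=' | '))  # a.jpg | b.jpg
print(clean_string('ab'))                                 # a b
```

The last call is a caution: despite the parameter name, a plain string is joined character by character, so it is safer to pass the list returned by parser.xpath().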

Now let’s collect the data points for each product: name, description, price, stock, and image URL.

Collecting the Name

The h1 node contains the name of the product, and the product page has no other h1 node. The XPath //h1 would therefore be enough to select it; the code below also checks the class name to be more specific.

Use the following code since the text is inside the node:

title = parser.xpath('//h1[contains(@class, "product_title")]/text()')
title = clean_string(title)

Collecting the Description

Here the product description is inside the node p. You can also see that it is inside the div with the class name substring ‘product-details__short-description’. Collect the text inside it as follows:

description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')
description = clean_string(description)

Collecting the Stock

The stock is directly inside the node p whose class contains the string ‘in-stock’. Use the following code to collect it:

stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')
stock = clean_string(stock)
if stock:
    stock = stock.replace(' in stock', '')

Collecting the Price

The price is inside the node p with the class price. Use the following code to get the price of the product:

price = parser.xpath('//p[@class="price"]//text()')
price = clean_string(price)
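
If the price is needed as a number rather than text, the cleaned string can be converted afterwards. The following is a minimal sketch, assuming prices look like `£63.00` with an optional thousands comma; `parse_price` is a hypothetical helper and not part of the original scraper.

```python
import re
from decimal import Decimal


def parse_price(price_text):
    """Return the first number found in price_text as a Decimal, or None."""
    if not price_text:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', price_text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(',', ''))


print(parse_price('£63.00'))     # 63.00
print(parse_price('£1,250.50'))  # 1250.50
print(parse_price(None))         # None
```

Decimal is used instead of float so that currency values are not affected by binary rounding.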

Collecting the Image URL

The image URL is in the href attribute of the ‘a’ node inside the product gallery div. Collect it from this attribute as follows:

image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
image_url = clean_string(list_or_txt=image_url, connector=' | ')

Complete Code for Python Web Scraping With Request Library

import csv
from lxml import html
import requests

def verify_response(response):
    """
    Verify if we received valid response or not
    """
    return response.status_code == 200

def send_request(url):
    """
    Send request and handle retries.
    :param url:
    :return: Response we received after sending request to the URL.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.5"
    }
    max_retry = 3
    while max_retry >= 1:
        response = requests.get(url, headers=headers)
        if verify_response(response):
            return response
        else:
            max_retry -= 1
    print("Invalid response received even after retrying. URL with the issue is:", url)
    raise Exception("Stopping the code execution as invalid response received.")

def get_next_page_url(response):
    """
    Collect pagination URL.
    :param response:
    :return: next listing page url
    """
    parser = html.fromstring(response.text)
    next_page_url = parser.xpath('(//a[@class="next page-numbers"])[1]/@href')[0]
    return next_page_url

def get_product_urls(response):
    """
    Collects all product URL from a listing page response.
    :param response:
    :return: list of urls. List of product page urls returned.
    """
    parser = html.fromstring(response.text)
    product_urls = parser.xpath('//li/a[contains(@class, "product__link")]/@href')
    return product_urls

def clean_stock(stock):
    """
    Clean the data stock by removing unwanted text present in it.
    :param stock:
    :return: Stock data. Stock number will be returned by removing extra string.
    """
    stock = clean_string(stock)
    if stock:
        stock = stock.replace(' in stock', '')
        return stock
    else:
        return None

def clean_string(list_or_txt, connector=' '):
    """
    Clean and fix list of objects received. We are also removing unwanted white spaces.
    :param list_or_txt:
    :param connector:
    :return: Cleaned string.
    """
    if not list_or_txt:
        return None
    return ' '.join(connector.join(list_or_txt).split())

def get_product_data(url):
    """
    Collect all details of a product.
    :param url:
    :return: All data of a product.
    """
    response = send_request(url)
    parser = html.fromstring(response.text)
    title = parser.xpath('//h1[contains(@class, "product_title")]/text()')
    price = parser.xpath('//p[@class="price"]//text()')
    stock = parser.xpath('//p[contains(@class, "in-stock")]/text()')
    description = parser.xpath('//div[contains(@class,"product-details__short-description")]//text()')
    image_url = parser.xpath('//div[contains(@class, "woocommerce-product-gallery__image")]/a/@href')
    product_data = {
        'Title': clean_string(title), 'Price': clean_string(price), 'Stock': clean_stock(stock),
        'Description': clean_string(description), 'Image_URL': clean_string(list_or_txt=image_url, connector=' | '),
        'Product_URL': url}
    return product_data

def save_data_to_csv(data, filename):
    """
    save list of dict to csv.
    :param data: Data to be saved to csv
    :param filename: Filename of csv
    """
    keys = data[0].keys()
    with open(filename, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=keys)
        writer.writeheader()
        writer.writerows(data)

def start_scraping():
    """
    Starting function.
    """
    listing_page_url = 'https://scrapeme.live/shop/'
    product_urls = list()
    for listing_page_number in range(1, 6):
        response = send_request(listing_page_url)
        listing_page_url = get_next_page_url(response)
        products_from_current_page = get_product_urls(response)
        product_urls.extend(products_from_current_page)
    results = []
    for url in product_urls:
        results.append(get_product_data(url))
    save_data_to_csv(data=results, filename='scrapeme_live_Python_data.csv')
    print('Data saved as csv')

if __name__ == "__main__":
    start_scraping()
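
One detail to be aware of: `save_data_to_csv` reads the column names from `data[0]`, which raises an IndexError when nothing was scraped. A possible variation, offered as a sketch rather than the author's method, passes the field names explicitly and writes to any file-like object. `write_rows` and `FIELDNAMES` are hypothetical names, and the sample row is made up.

```python
import csv
import io

FIELDNAMES = ['Title', 'Price', 'Stock', 'Description', 'Image_URL', 'Product_URL']


def write_rows(rows, file_obj, fieldnames=FIELDNAMES):
    """Write a list of dicts as CSV; an empty list still produces a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up row; the csv module writes None as an empty cell.
rows = [{'Title': 'Bulbasaur', 'Price': '£63.00', 'Stock': None,
         'Description': 'A sample description', 'Image_URL': None,
         'Product_URL': 'https://scrapeme.live/shop/Bulbasaur/'}]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the file with `open(filename, 'w', newline='', encoding='utf-8')` and pass the file object to `write_rows`.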

Sending GET Requests Using Cookies and Headers

Now let’s learn how to send the requests using headers and cookies.

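headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
}
# Placeholder cookie for illustration; replace it with the cookies your target site expects.
cookies = {"example_cookie": "example_value"}

url = "https://scrapeme.live/shop/"
response = requests.get(url, headers=headers, cookies=cookies)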
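
The introduction mentions managing sessions. A `requests.Session` keeps headers and cookies across requests, so they do not have to be passed on every call. The sketch below only prepares a request and inspects it, so nothing is sent over the network; the cookie name and value are made-up placeholders.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request it prepares.
session.headers.update({"Accept-Language": "en-US,en;q=0.5"})
# Placeholder cookie, purely for illustration.
session.cookies.set("example_cookie", "example_value")

request = requests.Request("GET", "https://scrapeme.live/shop/")
prepared = session.prepare_request(request)

print(prepared.headers["Accept-Language"])
print(prepared.headers.get("Cookie"))
```

Calling `session.get(url)` would send a request carrying the same headers and cookies, and any cookies the server sets in its response are stored on the session for later requests.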

Sending POST Requests

Let’s have a look at making POST requests with the Python Requests library.

payload = {"key1": "value1", "key2": "value2"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
}
url = "https://scrapeme.live/shop/"
response = requests.post(url, headers=headers, json=payload)
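
Requests can send the same payload either as a JSON body (`json=`) or as form-encoded data (`data=`), and the choice changes the Content-Type header. The sketch below prepares both variants without sending them, so the difference can be inspected offline; whether a given endpoint expects JSON or form data is an assumption you would need to check for your target site.

```python
import json
import requests

payload = {"key1": "value1", "key2": "value2"}
url = "https://scrapeme.live/shop/"

# Prepare both variants without sending them.
as_json = requests.Request("POST", url, json=payload).prepare()
as_form = requests.Request("POST", url, data=payload).prepare()

print(as_json.headers["Content-Type"], json.loads(as_json.body))
print(as_form.headers["Content-Type"], as_form.body)
```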

Why Web Scraping With Python?

Python is widely regarded as one of the best programming languages for web scraping, as it has many libraries dedicated to the task. Its syntax is also easy to learn and read.

Scraping web pages with Python is popular for several reasons:

  • Ease of Use

    Python is a simple and readable programming language that is accessible to both beginners and programming experts. Due to its straightforward syntax, developers can quickly understand the concepts of web scraping.

  • Large and Active Community

    The vast and active developer community of Python continuously contributes to open-source libraries and frameworks. Because of this, there are plenty of resources, tutorials, and code snippets to learn web scraping. You can solve problems using this collective knowledge.

  • Abundance of Libraries

    Python libraries such as BeautifulSoup and LXML are well suited to web scraping. They provide powerful tools for parsing and navigating HTML and XML documents.

    The libraries also assist you in extracting data from web pages, manipulating HTML structures, and handling various data formats.

  • Requests Library

    Requests is a Python library that enables you to make HTTP requests and handle the responses. It offers a high-level interface for sending HTTP requests such as GET and POST, setting headers, handling cookies, and managing sessions.

  • Data Manipulation and Analysis

    Python libraries like Pandas and NumPy are powerful and widely used data manipulation and analysis libraries that can be used for processing, cleaning, and analyzing data efficiently.

    You can rely on these libraries to filter, sort, aggregate, and visualize the data for data-driven decision-making.

  • Integration With Other Tools and Technologies

    Python integrates well with other web scraping tools and technologies. It can also be combined with database systems such as MySQL and MongoDB for storing and managing the scraped data, and it works well with the Django or Flask frameworks for building web applications.

Wrapping Up

This tutorial has explained how to use the Requests library for web scraping and how you can employ it to collect the data you need. For small-scale web scraping projects, the scraper you created through this article will be enough.

If your needs are more specific, like web scraping Amazon product details, then you can use ScrapeHero Cloud, which is a hassle-free, no-code, and affordable means of scraping popular websites.

But what if you need enterprise-grade web scraping? Then you can consider ScrapeHero web scraping services, which are bespoke, custom, and more advanced. Also, only a data service provider like ScrapeHero can provide you with access to valuable data that is otherwise difficult to obtain.

Frequently Asked Questions

1. Which Python library is used for web scraping?

For web scraping in Python, the most commonly used libraries are BeautifulSoup, Requests, and Selenium.
