Cheerio is a popular Node.js library for web scraping, known for its jQuery-style API and support for CSS selectors. Want to try it out? This article covers the basics of web scraping using Cheerio and Axios.
Setting Up Cheerio for Web Scraping
You can install Cheerio and other required libraries locally in a directory.
Here are the steps:
1. Create a directory for your Cheerio project.
2. Open the terminal inside the directory and type this command to initialize project files.
npm init
3. Use these commands to install required packages locally.
npm install axios
npm install cheerio
npm install objects-to-csv
These commands install three packages:
- Axios to handle HTTP requests
- Cheerio to parse and extract data from HTML pages
- objects-to-csv to save the extracted data to a CSV file
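You can also install all three packages in one command:
npm install axios cheerio objects-to-csv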
Figuring Out CSS Selectors
The scraper shown in this tutorial extracts data from scrapeme.live’s product listings to illustrate web scraping using Cheerio and Axios.
From each product listing, the scraper extracts the product URL using CSS selectors. You can figure out the selector using your browser’s inspect tool.
You can see that the product URL is the href attribute of the anchor tag inside a li tag. So you can first select the li tag using the CSS selector ‘li.product’.
Then, you can extract the URL from the anchor tag’s href attribute using Cheerio’s find() and attr() methods.
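Here’s a minimal sketch of that extraction, assuming the listing page’s HTML is already in a string named html:
const cheerio = require('cheerio');

// Load the HTML (assumed to be in the variable html) and select every product listing
const parser = cheerio.load(html);
parser('li.product').each(function (i, product) {
    // The product URL is the href of the anchor inside the li tag
    console.log(parser(product).find('a').attr('href'));
});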
Similarly, you can use the browser’s inspect tool on the product page to determine the CSS selectors required to extract product details.
Here are the CSS selectors of the highlighted elements:
- Description: “div.woocommerce-product-details__short-description”
- Title: “h1.product_title”
- Raw Price: “p.price>span.woocommerce-Price-amount.amount”
- Stock: “p.stock.in-stock”
- SKU: “span.sku”
- Image Anchor Tag: “figure>div>a”
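Before writing any scraper code, you can verify these selectors in your browser’s DevTools console on the product page. For example:
// Run in the browser console to confirm a selector matches
document.querySelector('h1.product_title').textContent
document.querySelector('figure>div>a').getAttribute('href')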
Code for Web Scraping Using Cheerio and Axios
The code shown in this tutorial follows these steps:
- Navigate to https://scrapeme.live/shop and extract the product URLs
- Go to the next page and extract the product URLs
- Repeat the process for a specific number of pages
- Loop through all the product URLs and extract the required data
- Save the data to a CSV file
To begin, import the required packages.
const axios = require('axios');
const cheerio = require('cheerio');
const ObjectsToCsv = require("objects-to-csv");
Next, write a function main() that implements the workflow at a high level.
async function main() {
    // Collect product URLs from the first three listing pages
    const urls = await get_urls("https://scrapeme.live/shop/", 3);
    const product_details = [];
    // Visit each product page and collect its details
    for (const url of urls) {
        const details = await extract_data(url);
        product_details.push(details);
    }
    // Write the collected objects to a CSV file
    const csv = new ObjectsToCsv(product_details);
    await csv.toDisk('./products.csv');
}
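Note that the loop uses for...of with await rather than forEach(); forEach() does not wait for async callbacks, so the CSV could otherwise be written before any product details arrive.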
The function performs three tasks:
- Gets product URLs from the listing page using get_urls()
- Loops through the URLs and extracts data from each using extract_data()
- Saves the extracted data to a CSV file using ObjectsToCsv()
You now need to define functions that main() uses.
get_urls()
This function accepts two arguments:
- The product listing URL
- The maximum number of pages to scrape
async function get_urls(url, max_pages) {
    const urls = [];
    for (let i = 1; i <= max_pages; i++) {
        const res = await axios.get(url);
        const parser = cheerio.load(res.data);
        const products = parser('li.product');
        // Cheerio's each() passes (index, element) to the callback
        products.each(function (i, product) {
            const product_url = parser(product).find("a").attr("href");
            urls.push(product_url);
        });
        // Follow the next-page link; stop early if there is none
        url = parser("a.next.page-numbers").attr("href");
        if (!url) break;
    }
    return urls;
}
The function uses a loop, and in each iteration:
- Makes an HTTP request to the listing URL using Axios
- Parses the response text using Cheerio
- Extracts all the elements holding the product URLs
- Loops through the elements, extracts the product URLs, and saves them in an array
- Replaces the listing URL with the next-page URL obtained using CSS selectors
The loop runs until it has processed the maximum number of pages you need to scrape, or until there is no next page. After the loop ends, the function returns the array of product URLs.
extract_data()
This function accepts a product URL and returns the product data as a plain object.
async function extract_data(url) {
    // Fetch and parse the product page
    const product_res = await axios.get(url);
    const product_parser = cheerio.load(product_res.data);

    // Locate each data point using the CSS selectors identified earlier
    const raw_description = product_parser("div.woocommerce-product-details__short-description").text();
    const raw_title = product_parser("h1.product_title").text();
    const raw_price = product_parser("p.price>span.woocommerce-Price-amount.amount").text();
    const raw_stock = product_parser("p.stock.in-stock").text();
    const raw_sku = product_parser("span.sku").text();
    const raw_imageUrl = product_parser("figure>div>a").attr("href");

    // Clean up the raw values
    const description = clean_string(raw_description);
    const title = clean_string(raw_title);
    const price = clean_string(raw_price);
    const stock = clean_stock(raw_stock);
    const sku = clean_string(raw_sku);
    const imageUrl = clean_string(raw_imageUrl);

    const product_details = {
        title,
        description,
        price,
        stock,
        sku,
        imageUrl
    };
    return product_details;
}
extract_data() also uses Axios and Cheerio to fetch and parse HTML data; as before, it uses CSS selectors to locate and extract data points.
The function also cleans the extracted data points by calling either clean_string() or clean_stock():
- clean_string() removes newlines, tabs, and surrounding whitespace
- clean_stock() removes the phrase ‘in stock’
clean_string()
function clean_string(str) {
    return str.replace(/[\n\t\r]/g, "").trim()
}
clean_stock()
function clean_stock(str) {
    return str.replace(' in stock', '').trim()
}
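For example, assuming the stock element’s text reads ‘14 in stock’ (a made-up value):
console.log(clean_stock('14 in stock')); // prints "14"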
Finally, you can end the code by calling main().
if (require.main === module) {
    main()
}
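Assuming you saved the script as scraper.js, run it from the project directory:
node scraper.js
The script will write the extracted data to products.csv.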
How to Use Proxies and Headers in Axios
The code shown above did not use customized HTTP headers. These headers carry data necessary for:
- Authorization: Provides authentication data, such as a user’s credentials or an API key, to the server.
- Protection: Establishes security measures, such as defining the origin of a request or guarding against cross-site scripting (XSS) attacks.
- Content Management: Specifies the format or encoding that the server returns.
Servers examine the header content to check whether a request comes from a bot, making it crucial to use custom headers to disguise your scraper.
When a server detects your scraper, it may block your IP address. That is where proxies come in. By using a proxy, you can hide your IP address; the server will only see the IP address of the proxy used.
Here’s how you can use headers and proxies while web scraping using Axios.
const axios = require('axios');

const res = await axios.get('http://httpbin.org/get?answer=42', {
    proxy: {
        host: '',  // your proxy's host
        port: 0    // your proxy's port
    },
    headers: {
        'content-type': 'application/json'
    }
});
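Many proxies also require authentication, and Axios accepts credentials inside the proxy object. Here’s a minimal sketch, assuming a hypothetical proxy at proxy.example.com:8080:
const axios = require('axios');

const res = await axios.get('http://httpbin.org/get', {
    proxy: {
        protocol: 'http',
        host: 'proxy.example.com', // hypothetical proxy host
        port: 8080,                // hypothetical proxy port
        auth: {
            username: 'user',      // your proxy credentials
            password: 'pass'
        }
    }
});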
How to Send POST Requests Using Axios
Sometimes, you need to make a POST request to scrape data from an endpoint. You can use the axios.post() method to do so; the method takes two main arguments, plus an optional config object for headers and other options:
- Endpoint URL: The URL to which you want to make a POST request.
- Payload: An object that contains the data you want to send with the request.
For example:
const axios = require('axios');

const res = await axios.post('https://httpbin.org/post', { hello: 'world' }, {
    headers: {
        'content-type': 'application/json'
    }
});
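Since httpbin.org/post echoes the request back, you can confirm what was sent by logging the response:
console.log(res.data.json); // { hello: 'world' }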
Cheerio-Axios Web Scraping: Pros and Cons
Pros
- Familiar Syntax: Cheerio implements a portion of core jQuery, stripping out DOM (Document Object Model) inconsistencies and browser-specific cruft to expose a clean API.
- Extremely Fast: Cheerio works with a simple, consistent DOM model, making parsing, manipulation, and rendering very fast.
- Highly Adaptable: Cheerio utilizes the parse5 parser and can also use htmlparser2. It can parse almost any HTML or XML document.
Cons
- No JavaScript Executions: You can only scrape static websites using this method. For dynamic sites, you need to use browser automation libraries like Puppeteer or Playwright.
- Limited CSS-Selector Support: Although you can use various CSS selectors to locate HTML elements, Cheerio doesn’t support all of them.
Why Use a Web Scraping Service
Web scraping using Cheerio and Axios can be an option for small-scale data extraction from static sites. But for large-scale web scraping or scraping dynamic sites, you need to use other libraries. You also need to take care of anti-scraping measures yourself.
If you want to avoid all this burden, use a web scraping service.
A web scraping service like ScrapeHero can take care of all the technical aspects of web scraping. You don’t have to worry about choosing libraries, handling dynamic websites, or even anti-scraping measures. ScrapeHero has got them covered for you.
ScrapeHero is a fully-managed web scraping service capable of building enterprise-grade scrapers and crawlers. Our services also include custom RPA and AI solutions.