What makes web scraping with Cheerio unique compared to other libraries? Its jQuery-based API. Cheerio is built for Node.js and is used for server-side web scraping.
Cheerio can use CSS-style selectors to traverse web pages and gather data. It loads HTML as a string and returns an object with built-in data extraction capabilities.
This article deals with the fundamentals of web scraping using Cheerio. You will learn how to create a Cheerio scraper and then extract the necessary data from a website. Let’s begin.
Building a Cheerio Scraper
For web scraping with Cheerio, you can choose any website. This tutorial uses the website ScrapeMe; you will iterate through each product and extract the necessary data from its page.
How to Set Up Cheerio for Web Scraping
To set up Cheerio for web scraping, install it locally:
Step 1: Create a directory for this Cheerio project
Step 2: Open the terminal inside the directory and type the command:
npm init
This creates a package.json file.
Step 3: Run the following commands in the same terminal:
npm install axios
npm install cheerio
npm install objects-to-csv
Scraper Workflow
The workflow of the Cheerio scraper is as follows (a code sketch appears after this list):
1. Navigate to the listing page
2. Extract the product page URL of each product on the listing page
3. Navigate to the product page
4. Extract the required data fields from the product page
5. Repeat steps 1–4 for each listing page URL
6. Save the data into a CSV file
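Before getting into the details, here is a hedged skeleton of that workflow; the helper functions are stubs standing in for the code developed in the rest of this article:
// Stubs standing in for the logic shown step by step below
async function scrapeListingPage(listingUrl) { /* steps 1-2: return product page URLs */ return []; }
async function scrapeProductPage(productPageUrl) { /* steps 3-4: return the data fields */ return {}; }

async function run(listingUrls) {
  const productDataFields = [];
  for (const listingUrl of listingUrls) {
    const productPageUrls = await scrapeListingPage(listingUrl);
    for (const productPageUrl of productPageUrls) {
      productDataFields.push(await scrapeProductPage(productPageUrl));
    }
  }
  return productDataFields; // step 6, saving to CSV, is covered at the end
}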
Let’s import the required libraries in order to begin web scraping with Axios and Cheerio.
const axios = require('axios');
const cheerio = require('cheerio');
const ObjectsToCsv = require("objects-to-csv");
Next, navigate to the listing page with the following line of code:
const { data } = await axios.get(listingUrl);
This call returns the HTML content of the listing page in the response's data property, along with a status code such as 200 on success. The axios.get() method takes the URL string as a parameter and returns the response. It also supports passing URL parameters, headers, proxies, etc., as function parameters, as illustrated below.
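For illustration, the values below are arbitrary examples rather than anything the target site requires:
const response = await axios.get('https://httpbin.org/get', {
  params: { answer: 42 },                      // appended to the URL as ?answer=42
  headers: { 'User-Agent': 'my-scraper/1.0' }  // example custom header
});
console.log(response.status); // 200 on success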
The await keyword waits for the promise to settle. It is only valid inside an async function, so the execution context must be asynchronous. Prefix the function declaration in which the asynchronous operation runs with async, as shown.
async function main() {
  const res = await axios.post('https://httpbin.org/post', { hello: 'world' }, {
    headers: {
      'content-type': 'application/json'
    }
  });
}
main();
Next, you need to extract the data from the HTML content. For that, use the cheerio.load() method.
const parser = cheerio.load(data);
The load method is the easiest way to parse HTML or XML documents with Cheerio. It takes HTML content as an argument and returns a querying function, here named parser, that accepts CSS selectors.
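For instance, here is a minimal, self-contained use of load() with a made-up HTML string:
const parser = cheerio.load('<ul><li class="product">Pikachu</li></ul>');
console.log(parser('li.product').text()); // prints "Pikachu"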
Now, select all the listed products. To get the data for each product, first find the HTML element that contains the required data. If you inspect the listed products, you can see that every product sits inside a <li> tag with the common class name product.
Select all such products by looking for all <li> tags with a class name product, which can be represented as the CSS selector li.product.
const products = parser("li.product");
From each product listing, let's extract the following data points:
- Product Name
- Product URL
- Price
- Image URL
If you inspect the HTML elements, you can see that each product URL is present in an <a> tag.
const productPageUrl = parser(product).find("a").attr("href");
Here, product is a single element from the products selection. The method find("a") finds all <a> (anchor) tags that are descendants of the current Cheerio object; in this case, the anchors within the li.product element.
The method attr("href") retrieves the value of the href attribute of the first element in the Cheerio object; here, that is the first anchor tag within the li.product element.
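Putting this together, here is a sketch of the iteration using Cheerio's each() method, where product is the raw element passed to the callback:
products.each((index, product) => {
  // Wrap the raw element with parser() before calling Cheerio methods on it
  const productPageUrl = parser(product).find('a').attr('href');
  console.log(productPageUrl);
});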
Next, navigate to the product page. Like on the listing page, send the request to the product URL using Axios and use the Cheerio library to parse the data.
const { data } = await axios.get(productPageUrl);
const parser = cheerio.load(data);
On each product page, extract the following data points:
- Description
- Title
- Price
- Stock
- SKU
- Image URL
By inspecting the data, you can see that the selectors are:
- Description: div.woocommerce-product-details__short-description
- Title: h1.product_title
- Price: p.price>span.woocommerce-Price-amount.amount
- Stock: p.stock.in-stock
- SKU: span.sku
- Image URL: the href attribute of figure>div>a
const description = parser("div.woocommerce-product-details__short-description").text();
const title = parser("h1.product_title").text();
const price = parser("p.price>span.woocommerce-Price-amount.amount").text();
const stock = parser("p.stock.in-stock").text();
const sku = parser("span.sku").text();
const imageUrl = parser("figure>div>a").attr("href");
The method text() retrieves the text content of the first element in the Cheerio object. Save these data fields in an object on each iteration and push it to the productDataFields[] array:
productDataFields.push({ title, price, stock, sku, imageUrl, description });
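Here is a hedged sketch tying these steps together; it assumes productPageUrls holds the URLs collected from the listing page and that the loop runs inside an async function:
const productDataFields = [];

for (const productPageUrl of productPageUrls) {
  // Steps 3-4: fetch each product page and extract its data fields
  const { data } = await axios.get(productPageUrl);
  const parser = cheerio.load(data);
  productDataFields.push({
    title: parser('h1.product_title').text(),
    price: parser('p.price>span.woocommerce-Price-amount.amount').text(),
    stock: parser('p.stock.in-stock').text(),
    sku: parser('span.sku').text(),
    imageUrl: parser('figure>div>a').attr('href'),
    description: parser('div.woocommerce-product-details__short-description').text()
  });
}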
After completing the iterations, save the data into a CSV file. Pass productDataFields[] to the ObjectsToCsv() constructor and pass the output path as a string to the toDisk() method, which returns a promise:
const csv = new ObjectsToCsv(productDataFields);
await csv.toDisk('./products.csv'); // example output path
Access the complete code for web scraping using Cheerio on GitHub.
How to Use Proxies and Headers in Axios?
When web scraping with Cheerio, HTTP headers convey additional information between the client and the server along with HTTP requests and responses. A few applications of headers include:
- Authorization: Headers serve as a means for transmitting authentication data, like a user’s credentials or an API key, to the server.
- Protection: Headers play a key role in establishing security measures, such as defining the origin of a request or guarding against cross-site scripting (XSS) attacks.
- Content Management: Through headers, clients can negotiate the format or encoding of the content that the server returns, making it possible to request specific content.
Similarly, proxies play an important role in web scraping with Cheerio: routing requests through other IP addresses helps you avoid rate limits and IP-based blocking. In Axios, both proxies and headers are passed as request options:
const axios = require('axios');

const res = await axios.get('http://httpbin.org/get?answer=42', {
  proxy: {
    host: '',   // your proxy host
    port: 8080  // your proxy port (example value)
  },
  headers: {
    'content-type': 'application/json'
  }
});
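If your proxy requires authentication, Axios also accepts an auth object inside the proxy option; all the values below are placeholders:
const res = await axios.get('http://httpbin.org/get?answer=42', {
  proxy: {
    protocol: 'http',
    host: 'proxy.example.com', // placeholder proxy host
    port: 8080,                // placeholder proxy port
    auth: { username: 'user', password: 'pass' } // placeholder credentials
  }
});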
How to Send POST Requests Using Axios?
You can make a POST request with Axios to a given endpoint to submit data or trigger server-side events. To perform an HTTP POST request, use the axios.post() method, which takes two parameters: the endpoint URL and an object containing the data you want to send to the server.
The importance of POST requests in web scraping with Cheerio:
- POST requests are often used to submit data to a server, such as search queries. This can be useful when you need to submit data to a server in order to retrieve specific information.
- POST requests don't expose data in the URL, unlike GET query parameters.
- Websites use POST requests to dynamically load content on a page, like pagination and filters.
const axios = require('axios');

const res = await axios.post('https://httpbin.org/post', { hello: 'world' }, {
  headers: {
    'content-type': 'application/json'
  }
});
Features of Cheerio
- Familiar Syntax: Cheerio implements a subset of the jQuery core, removing DOM inconsistencies and browser-specific cruft to present a cleaner API.
- Extremely Fast: Cheerio works with a simple, consistent DOM model, so parsing, manipulating, and rendering are very fast.
- Highly Adaptable: Cheerio uses the parse5 parser and can also use htmlparser2; it can parse almost any HTML or XML document (see the sketch after this list).
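As a small sketch of that adaptability, switching to XML parsing is just a load() option:
// xmlMode tells Cheerio to parse the input with htmlparser2's XML rules
const parser = cheerio.load('<items><item id="1">first</item></items>', { xmlMode: true });
console.log(parser('item').attr('id')); // prints "1"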
Wrapping Up
Web scraping with Axios and Cheerio can be a good option for small-scale data extraction. Cheerio offers an easy-to-use API for parsing and modifying HTML, but it has limitations when it comes to anti-scraping measures, dynamic websites, and performance.
So it is often better to switch to more robust methods of web scraping, such as the ones ScrapeHero provides. If you need an affordable, fast, and reliable product with a no-code approach, consider ScrapeHero Cloud. It is instant and easy to use, comes with predefined data points, and involves no coding on your part. You even get 25 free credits when you sign up.
If your needs are much larger, say enterprise-grade scraping, you can use ScrapeHero web scraping services. They are bespoke, custom-made, advanced, and cater to all industries globally.
Frequently Asked Questions
Can Cheerio run in the browser?
Cheerio cannot run directly in a web browser the way client-side JavaScript libraries (e.g., jQuery) do. Cheerio does not rely on a browser DOM; instead, it parses markup into its own document model and implements a subset of the jQuery core on top of it, which makes it well suited to server-side processing where an actual DOM is not present.
Is Node.js good for web scraping?
Yes, Node.js is quite effective for web scraping. Web scraping with Node.js and Cheerio is popular among developers because it offers several advantages for this purpose: a rich set of libraries, a large JavaScript ecosystem, and tools such as headless browsers for handling dynamic content.