BeautifulSoup can parse web pages and extract meaningful information from them. It supports HTML and XML documents. Want to know how to use BeautifulSoup? Read on. This tutorial will show you how to begin web scraping with BeautifulSoup.
How Does BeautifulSoup Work?
BeautifulSoup provides a user-friendly interface on top of parsers such as lxml and html.parser. It has no built-in parsing capabilities of its own. Instead, it builds a navigable tree from the parser’s output, which makes it easy to traverse the HTML structure.
How to Begin Web Scraping Using BeautifulSoup
This tutorial uses the following HTML code to illustrate web scraping with Python BeautifulSoup.
<!DOCTYPE html>
<html lang="en">
<head>
<title>Demo Page</title>
</head>
<body>
<div class="product-container">
<h3>Products</h3>
<div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Abra </span></p>
<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
<a href="/pokemon/abra">Buy</a>
</div>
<div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Absol </span></p>
<p class="price"> <b> Price: </b> <span> 80.00 </span> </p>
<a href="/pokemon/absol">Buy</a>
</div>
<div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Altaria </span></p>
<p class="price"> <b> Price: </b> <span> 120.00 </span> </p>
<a href="/pokemon/altaria">Buy</a>
</div>
<div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Arctozolt </span></p>
<p class="price"> <b> Price: </b> <span> 110.00 </span> </p>
<a href="/pokemon/arctozolt">Buy</a>
</div>
<div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Barbaracle </span></p>
<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
<a href="/pokemon/barbaracle">Buy</a>
</div>
</div>
</body>
</html>
Install BeautifulSoup
Since BeautifulSoup is an external library, you must install it separately. You can use Python’s package manager, pip, for installation.
pip install beautifulsoup4
Import BeautifulSoup Library
The name of the BeautifulSoup library inside Python is bs4. It has several classes with different capabilities; here, you will use the BeautifulSoup class. You can use it in two ways:
- Import bs4 using the following statement and call bs4.BeautifulSoup()
import bs4
- Import the BeautifulSoup class directly using the following statement, allowing you to call BeautifulSoup().
from bs4 import BeautifulSoup
This tutorial uses the second method.
You can use the code below to parse the HTML sample. Here, sample_html stands for the sample HTML above, stored as a string.
sample_html = ''
soup = BeautifulSoup(sample_html, 'html.parser')
The BeautifulSoup class accepts two arguments: the HTML code you want to parse and the name of the parser you want to use. The second argument matters because BeautifulSoup does not parse markup itself. It relies on other Python libraries, such as lxml, html.parser, and html5lib, for parsing HTML.
You can skip the second argument. However, it is best to specify a parser; otherwise, BeautifulSoup picks one of the parsers installed on the system, so the result may vary between environments.
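As a minimal sketch of the parsing step, the snippet below feeds a shortened version of the sample page to BeautifulSoup with the parser named explicitly. The trimmed sample_html string is an assumption made to keep the example short; it is not the full page shown earlier.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the tutorial's sample page (assumption: one product only).
sample_html = """
<html>
<head><title>Demo Page</title></head>
<body>
<div class="product-container">
<h3>Products</h3>
<div class="product">
<p class="product-name"> <b> Name: </b> <span> Abra </span></p>
<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
<a href="/pokemon/abra">Buy</a>
</div>
</div>
</body>
</html>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.string)  # text of the <title> tag
print(soup.h3.string)     # text of the first <h3> tag
```

html.parser ships with Python, so this sketch needs no package other than beautifulsoup4.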
Access Tags from HTML
The next step is to access the data from the HTML tags.
- You can access individual tags as attributes of the soup object. For instance, the following code gets the h3 tag:
soup.h3
Output: <h3>Products</h3>
- You can use the .string attribute to extract the text within a tag:
soup.h3.string
Output: 'Products'
- You can use the dot operator on the parent tag to access child tags. For instance, the following code gets the h3 tag inside the root div tag.
soup.div.h3
Output: <h3>Products</h3>
Access the nth Tag from HTML
It is tough to pick out one tag from a tree with many siblings. However, the find_all method of BeautifulSoup makes it easy. For example, you can use the following code to access all the <div> tags nested inside the root <div> of the above HTML.
div_tags = soup.div.find_all('div')
div_tags
Output: [<div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Abra </span></p>
<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
<a href="/pokemon/abra">Buy</a>
</div>,
...]
The find_all method accepts a tag name and returns all the matching tags as a Python list. You can access a specific child tag by specifying the list index.
The following code gets the child tag at index 0.
div_tags[0]
Output: <div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Abra </span></p>
<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
<a href="/pokemon/abra">Buy</a>
</div>
Each returned tag is itself a BeautifulSoup node and supports the same navigation and search methods as the soup object. For example, the following code gets the name of the first product from the first child tag.
div_tags[0].p.span.string
Output: ' Abra '
Do you want to print the names of all the products? Use a for loop as shown in the following code.
for div_tag in div_tags:
    print(div_tag.p.span.string)
Output: Abra
Absol
Altaria
Arctozolt
Barbaracle
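The printed names above still carry the padding spaces from the markup. A minimal sketch of one way to collect clean names into a list is shown below; the two-product HTML string is a shortened assumption, not the full sample.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page (assumption: two products only).
sample_html = """
<div class="product-container">
<h3>Products</h3>
<div class="product">
<p class="product-name"> <b> Name: </b> <span> Abra </span></p>
</div>
<div class="product">
<p class="product-name"> <b> Name: </b> <span> Absol </span></p>
</div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all searches the descendants of the outer div, so it returns the product divs.
div_tags = soup.div.find_all("div")

# get_text(strip=True) removes the padding spaces around each name.
names = [div_tag.p.span.get_text(strip=True) for div_tag in div_tags]
print(names)
```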
Filter Tags by Attributes
HTML tags often have attributes such as class, id, name, etc. You can filter tags using these attributes. For example, the following code filters the <p> tag, and selects those with a class attribute of ‘price’:
- Filter tags by class attribute
soup.find_all('p', attrs={'class': 'price'})
Or it can be simplified like this:
soup.find_all('p', class_='price')
- Filter tags by id attribute
soup.find_all('p', attrs={'id': 'id of the tag'})
Or
soup.find_all('p', id='id of the tag')
- Filter elements by attribute without tag name
soup.find_all(id='id of the tag')
- Filter elements by non-standard attributes
soup.find_all('div', attrs={'data-class': 'value to search'})
- Filter elements by multiple attributes
soup.find_all('div', attrs={'class': 'class to search', 'id': 'id to search'})
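The filters listed above can be tried on a small document. In this sketch, the id "catalog" and the data-sku attributes are hypothetical additions for illustration; the tutorial's sample page does not contain them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the id and data-sku attributes are not in the tutorial's sample.
sample_html = """
<div id="catalog" class="product-container">
<div class="product" data-sku="abra-01">
<p class="product-name"><span>Abra</span></p>
<p class="price"><span>100.00</span></p>
</div>
<div class="product" data-sku="absol-02">
<p class="product-name"><span>Absol</span></p>
<p class="price"><span>80.00</span></p>
</div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Filter by class (class_ avoids the clash with Python's class keyword).
price_tags = soup.find_all("p", class_="price")

# Filter by id without naming a tag.
catalog = soup.find_all(id="catalog")

# Non-standard attributes contain a hyphen, so they must go through attrs.
abra = soup.find_all("div", attrs={"data-sku": "abra-01"})

# Several attributes at once: every condition must match.
both = soup.find_all("div", attrs={"class": "product", "data-sku": "absol-02"})

print(len(price_tags), len(catalog), len(abra), len(both))
```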
Extract the Text Inside Tags
You can scrape text from a website using BeautifulSoup with the following methods:
- Text attribute
- get_text() method
- String attribute
To begin, filter the price tags from the HTML:
price_tags = soup.find_all('p', class_='price')
Here are the examples showing how you get the text from the HTML tags using the methods mentioned above.
Using text attribute
for price_tag in price_tags:
    print(price_tag.span.text)
Output: 100.00
80.00
120.00
110.00
100.00
Using the get_text() Method
for price_tag in price_tags:
    print(price_tag.span.get_text())
Output: 100.00
80.00
120.00
110.00
100.00
Using the string attribute
You can also use the string attribute of a node to extract text, provided the node has exactly one child: either a text string or a single tag that itself resolves to one string. If the node has more than one child, the string attribute returns None.
for price_tag in price_tags:
    print(price_tag.span.string)
Output: 100.00
80.00
120.00
110.00
100.00
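The three approaches give the same result on the <span> tags, but they differ on a tag with several children. The sketch below, built on one price paragraph from the sample, illustrates that difference and shows the strip option of get_text(), which the examples above do not use.

```python
from bs4 import BeautifulSoup

html = '<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>'
soup = BeautifulSoup(html, "html.parser")
price_tag = soup.p

# <p> has several children (whitespace, <b>, <span>), so .string gives None.
print(price_tag.string)

# <span> has a single text child, so .string and .text return it, padding included.
print(repr(price_tag.span.string))
print(repr(price_tag.span.text))

# get_text(strip=True) trims each piece of text.
clean_price = price_tag.span.get_text(strip=True)

# A separator joins the stripped pieces found anywhere below the tag.
label_and_price = price_tag.get_text(" ", strip=True)
print(clean_price, "|", label_and_price)
```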
Extract URLs from Anchor Tags
First, find an anchor tag to extract the URL from it.
anchor_tag = soup.find('a')
Note: The find method is very similar to the find_all method. However, find returns only the first match, whereas find_all returns all the matches.
Now, you can extract the corresponding URL using the following code.
anchor_tag['href']
Output: '/pokemon/abra'
You can read any other attribute of a tag the same way.
To get all the attributes of a tag as a dict, use the following code:
anchor_tag.attrs
Output: {'href': '/pokemon/abra'}
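To collect the URLs of every anchor rather than just the first, one option is to combine find_all with the standard library's urljoin. This is only a sketch: BASE_URL is a hypothetical site address, and the third anchor without an href is added to show how missing attributes behave.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # hypothetical base address of the scraped site

html = """
<a href="/pokemon/abra">Buy</a>
<a href="/pokemon/absol">Buy</a>
<a>No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the anchors that actually carry an href attribute.
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]
print(links)

# .get() returns None instead of raising KeyError when an attribute is absent.
missing = soup.find_all("a")[2].get("href")
print(missing)
```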
Get HTML of Tags
In certain cases, you may need the raw HTML code of a particular tag. For that, you can use the Python str() function.
anchor_tag = soup.find('a')
str(anchor_tag)
Output: '<a href="/pokemon/abra">Buy</a>'
BeautifulSoup has a method called prettify() that returns the formatted HTML.
print(anchor_tag.prettify())
Output: <a href="/pokemon/abra">
 Buy
</a>
Use CSS Selectors with BeautifulSoup
BeautifulSoup allows you to use CSS selectors to extract elements. You can use the select method on a soup object as follows.
soup.select('div p.product-name span')
Output: [<span> Abra </span>,
<span> Absol </span>,
<span> Altaria </span>,
<span> Arctozolt </span>,
<span> Barbaracle </span>]
The select() method returns a list of matching elements. If you need only the first match, you can use the select_one() method.
soup.select_one('div p.product-name span')
Output: <span> Abra </span>
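Selectors can also be applied to an individual tag, which makes it straightforward to turn each product block into a record. The following is a sketch under the assumption that every product has a name, a price, and a link; the dictionary keys are illustrative choices.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page (assumption: two products only).
html = """
<div class="product">
<p class="product-name"> <b> Name: </b> <span> Abra </span></p>
<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
<a href="/pokemon/abra">Buy</a>
</div>
<div class="product">
<p class="product-name"> <b> Name: </b> <span> Absol </span></p>
<p class="price"> <b> Price: </b> <span> 80.00 </span> </p>
<a href="/pokemon/absol">Buy</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for product in soup.select("div.product"):
    # select_one searches only inside the current product block.
    products.append({
        "name": product.select_one("p.product-name span").get_text(strip=True),
        "price": float(product.select_one("p.price span").get_text(strip=True)),
        "url": product.select_one("a")["href"],
    })
print(products)
```

On real pages, select_one returns None when nothing matches, so production code would need to check for that before calling get_text().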
XPath is another common way to filter tags, but BeautifulSoup does not support it, so you have to rely on CSS selectors. If you need XPath support, you can look into the Python lxml library. For installation and to get started with lxml, go to its official documentation.
How to Choose the Best Parser for BeautifulSoup
As mentioned earlier, BeautifulSoup supports various parsing libraries, including html.parser and lxml. Look at the following table. It summarizes the advantages and disadvantages of each parser.
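Parser | Usage | Advantages | Disadvantages |
Python’s html.parser | BeautifulSoup(markup, "html.parser") | Included with Python; decent speed | Slower than lxml; less lenient than html5lib |
lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast | External C dependency |
lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only XML parser currently supported | External C dependency |
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the way a web browser does; creates valid HTML5 | Very slow; external Python dependency |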
If you want faster parsing, go with lxml’s HTML parser. However, you must first install the lxml library by running the following command in a terminal.
pip install lxml
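If a script has to run both on machines with and without lxml, one possible approach is to check for the library and fall back to the built-in parser. This is a sketch of that idea, not a recommendation from the BeautifulSoup documentation; the unclosed markup is a made-up test string.

```python
import importlib.util

from bs4 import BeautifulSoup

# Prefer lxml when it is installed; otherwise fall back to Python's built-in parser.
parser = "lxml" if importlib.util.find_spec("lxml") else "html.parser"

# Deliberately unclosed markup: both parsers repair it while building the tree.
soup = BeautifulSoup("<p>Unclosed paragraph<b>bold", parser)

print(parser)
print(soup.b.string)
```

Keep in mind that different parsers can build different trees from invalid markup, so a fallback like this trades reproducibility for convenience.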
Best Practices While Web Scraping With BeautifulSoup
You saw how to scrape a website using BeautifulSoup. Here are some things to keep in mind while using the library.
Increase the Parsing Speed
BeautifulSoup is great for web scraping. However, it adds a layer on top of the parsing libraries it uses, so it is slower than they are. If parsing time is critical, BeautifulSoup may not be the best choice; use the parsers directly in these situations.
That said, the following measures can improve the performance of BeautifulSoup.
- Use lxml as the underlying parser. With it, BeautifulSoup parses HTML significantly faster than with the other parsers. Check out this installation and usage guide.
- Use the cchardet library for faster encoding detection.
Reduce Memory Usage
You can reduce memory consumption by only parsing a part of the document while using BeautifulSoup. This will also accelerate the search process.
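BeautifulSoup's SoupStrainer class is one way to do this: passed as the parse_only argument, it tells BeautifulSoup which tags to keep while building the tree. The sketch below keeps only the anchor tags of a shortened, two-product sample; note that the BeautifulSoup documentation says this feature does not work with the html5lib parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened stand-in for the sample page (assumption: two products only).
html = """
<div class="product">
<p class="product-name"> <span> Abra </span></p>
<a href="/pokemon/abra">Buy</a>
</div>
<div class="product">
<p class="product-name"> <span> Absol </span></p>
<a href="/pokemon/absol">Buy</a>
</div>
"""

# Keep only <a> tags; everything else is discarded during parsing.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)

# The <p> tags were never added to the tree, so this search finds nothing.
print(soup.find("p"))
```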
Conclusion
You can perform web scraping using Python BeautifulSoup. It has an intuitive syntax to select the HTML/XML elements. Moreover, you can choose your favorite parsers to use with BeautifulSoup.
However, it is still challenging to analyze the structure of a web page and understand the elements to extract. If you are looking for a no-code solution, try ScrapeHero Cloud. It has affordable web scrapers you can try for free. For example, this ScrapeHero Amazon Scraper can get all the product details from Amazon.
It is also challenging to write a scalable web scraper. Check out ScrapeHero services. We offer a wide range of enterprise-grade web scraping services, including large-scale web crawling.