Requests and BeautifulSoup are excellent for static web scraping, but they are two separate libraries to manage. Web scraping with MechanicalSoup avoids that overhead by letting you fetch and parse HTML with a single library.
This article shows you how to use MechanicalSoup for web scraping.
Web Scraping with MechanicalSoup: Understand the Elements
Understanding the elements you wish to scrape is essential for deciding whether you can use MechanicalSoup for web scraping. The library cannot handle JavaScript, so it cannot extract elements that become visible only after JavaScript executes.
This tutorial shows how to scrape Google using MechanicalSoup by extracting three details from the search results:
- Title
- Description
- URL
All these details are available without JavaScript execution.
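If you're unsure whether a detail is available without JavaScript, one quick check is to fetch the page and look for the element in the raw HTML. Here's a minimal sketch of that idea; the URL and the h3 tag are only illustrative, and Google may still block or redirect automated requests:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.google.com/search?q=laptops")
# browser.page is BeautifulSoup's parse of the raw HTML, before any JavaScript runs;
# if find() returns None here, the element probably needs JavaScript to appear
print(browser.page.find("h3") is not None)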
Web Scraping with MechanicalSoup: Set Up the Environment
MechanicalSoup is an external library, so you must install it using pip:
pip install mechanicalsoup
You also need the json module, which allows you to save the extracted data to a JSON file. It's part of the Python standard library, so you don't have to install it.
Web Scraping with MechanicalSoup: Write the Code
Start the code by importing the packages mentioned above:
import mechanicalsoup
import json
Now, you can begin writing the code that:
- Searches a term on Google
- Navigates a specified number of pages
- Extracts the title, description, and URL of each result from each page
- Saves the extracted data to a JSON file
To keep the code clean, create a function to navigate and a function to extract details.
Define extract() to Extract Details From a Page
The extraction function accepts two arguments: a list of the results on a page and a list in which to store the details extracted from those results.
def extract(results, extracted_details):
    for result in results:
        try:
            title = result.h3.text
            description = result.find('div', {'class': 'BNeawe s3v9rd AP7Wnd'}).text
            url = result.a['href']
        except (AttributeError, TypeError):
            # skip results missing the title, description, or URL
            continue
        extracted_details.append(
            {
                'Title': title,
                'Description': description,
                'Url': url.replace('/url?q=', '')
            }
        )
This code snippet iterates through a list containing results, and in each iteration:
- Tries to extract:
  - Title from an h3 tag
  - Description from a div tag
  - URL from an anchor tag
- Appends the extracted details to extracted_details
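As a side note, the same lookups can be written with CSS selectors using BeautifulSoup's select_one(). The helper below is only an illustrative variant; the class names (taken from the snippet above) may change whenever Google updates its markup:
def extract_with_css(result):
    # same extraction as above, expressed with CSS selectors
    title_tag = result.select_one('h3')
    desc_tag = result.select_one('div.BNeawe.s3v9rd.AP7Wnd')
    link_tag = result.select_one('a[href]')
    if not (title_tag and desc_tag and link_tag):
        return None
    return {
        'Title': title_tag.text,
        'Description': desc_tag.text,
        'Url': link_tag['href'].replace('/url?q=', '')
    }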
Define paginate() to Navigate the Pages
To navigate pages, create a function that loops once per page, extracting the results and following the link to the next page until it has covered the specified number of pages.
def paginate():
    extracted_details = []
    for page in range(pages):
        soup = browser.page
        result_area = soup.find('div', {'id': 'main'})
        try:
            results = result_area.find_all('div', {'class': 'Gx5Zad xpd EtOod pkphOe'})
        except AttributeError:
            # the results container wasn't found on this page
            continue
        extract(results, extracted_details)
        next_page = soup.find('a', {'aria-label': 'Next page'})
        if next_page is None:
            # no next-page link; stop paginating
            break
        browser.follow_link(next_page)
    return extracted_details
This code snippet defines an empty list to hold the details extracted from the results related to one term and uses a loop to paginate. In each iteration, the code:
- Gets the parsed HTML code using MechanicalSoup’s page attribute
- Locates the div element containing all the results
- Extracts all the div elements containing individual results
- Calls extract()
- Locates and navigates to the next page
After the loop is complete, the function returns extracted_details.
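Note that follow_link() doesn't require the anchor tag itself; it also accepts a regex string matched against link URLs. As an illustrative alternative (Google's next-page links contain a start parameter, though that detail may change):
# follow the first link whose href matches the regex "start=" instead of passing the tag
browser.follow_link("start=")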
Call the Functions
You can call the functions to navigate and extract after defining them, but you need to perform specific steps before that.
First, create an object of the StatefulBrowser class of MechanicalSoup. This object allows you to maintain a persistent session, handle cookies, and follow redirects.
browser = mechanicalsoup.StatefulBrowser(
soup_config = {'features':'lxml'}, # use lxml
)
In the above code, the soup_config argument accepts configurations for BeautifulSoup; here, it tells BeautifulSoup to use lxml for parsing.
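If lxml isn't installed, you can fall back to Python's built-in parser by changing only the features value:
browser = mechanicalsoup.StatefulBrowser(
    soup_config = {'features':'html.parser'}, # standard-library parser; slower than lxml but dependency-free
)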
You can also update the headers of the object using the session.headers.update() method.
# define headers
headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                     '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
           'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
           'dpr': '1',
           'sec-fetch-dest': 'document',
           'sec-fetch-mode': 'navigate',
           'sec-fetch-site': 'none',
           'sec-fetch-user': '?1',
           'upgrade-insecure-requests': '1',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
# update headers
browser.session.headers.update(headers)
Next, store the search terms in a list.
search_terms = ["masks","laptops","mobiles","cycles","dumbbells","ropes"]
Set the number of pages you wish to scrape for each term.
pages = 5
Now, you only have to loop through each search term to extract the details. But before starting the loop, define an empty dict to store the extracted details for all the search terms.
all_results = {}
And in the loop:
1. Visit “Google.com” using the .open() method of MechanicalSoup.
browser.open('https://www.google.com')
2. Select the form that allows you to input the search term. MechanicalSoup has a select_form() method for that; once a form is selected, you can set its inputs like entries in a dict (see the sketch after this list).
browser.select_form('form[action="/search"]')
3. Enter the search term by using the name of the input element as the key.
browser['q'] = term
4. Submit the selected form using the submit_selected() method, which returns a response object.
response = browser.submit_selected()
5. Check whether the status code is 429 (Too Many Requests) and exit the program if it is. Otherwise, the code moves on to the next step.
if response.status_code == 429:
    exit("Too Many Requests")
6. Call paginate() and store the details in the empty dict defined outside the loop with the term as the key.
all_results[term] = paginate()
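As mentioned in step 2, you can inspect the selected form to discover input names such as q. Here's a short sketch using MechanicalSoup's print_summary(), which lists the inputs of the current form:
browser.open('https://www.google.com')
browser.select_form('form[action="/search"]')
# print_summary() lists the form's inputs, which is how you find field names like "q"
browser.get_current_form().print_summary()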
Finally, write the extracted results to a JSON file.
with open("googleSearchResults.json",'w',encoding='utf-8') as f:
json.dump(all_results,f,indent=4,ensure_ascii=False)
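As an optional sanity check, you can reload the file and count the results saved per term:
# optional: reload the JSON file and report how many results each term produced
with open("googleSearchResults.json", encoding='utf-8') as f:
    saved = json.load(f)
for term, results in saved.items():
    print(term, "->", len(results), "results")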
Here’s the complete code:
import mechanicalsoup
import json

def extract(results, extracted_details):
    for result in results:
        if result.h3:
            try:
                title = result.h3.text
                description = result.find('div', {'data-snf': 'nke7rc'}).text
                url = result.a['href']
            except Exception as e:
                print("extract error:", e)
                continue
            extracted_details.append(
                {
                    'Title': title,
                    'Description': description,
                    'Url': url.replace('/url?q=', '')
                }
            )

def paginate():
    extracted_details = []
    for page in range(pages):
        soup = browser.page
        result_area = soup.find('div', {'id': 'main'})
        try:
            results = result_area.find_all('div', {'class': 'MjjYud'})
        except Exception as e:
            print("paginate error:", e)
            continue
        extract(results, extracted_details)
        next_page = soup.find('a', {'id': 'pnnext'})
        if next_page is None:
            # no next-page link; stop paginating
            break
        browser.follow_link(next_page)
    return extracted_details

if __name__ == "__main__":
    headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                         '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
               'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
               'dpr': '1',
               'sec-fetch-dest': 'document',
               'sec-fetch-mode': 'navigate',
               'sec-fetch-site': 'none',
               'sec-fetch-user': '?1',
               'upgrade-insecure-requests': '1',
               'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
    browser = mechanicalsoup.StatefulBrowser(
        soup_config={'features': 'lxml'},
    )
    browser.session.headers.update(headers)
    pages = 2
    search_terms = ["masks", "laptops", "mobiles", "cycles", "dumbbells", "ropes"]
    all_results = {}
    # the slice limits this demo run to the first two terms; use search_terms to scrape all
    for term in search_terms[:2]:
        browser.open('https://www.google.com')
        browser.select_form('form[action="/search"]')
        browser['q'] = term
        response = browser.submit_selected()
        if response.status_code == 429:
            exit("Too Many Requests")
        all_results[term] = paginate()
        print(term, "extracted")
    with open("googleSearchResults.json", 'w', encoding='utf-8') as f:
        json.dump(all_results, f, indent=4, ensure_ascii=False)
MechanicalSoup Limitations
MechanicalSoup is an excellent library to use for Python web scraping in place of requests and BeautifulSoup. However, consider its limitations:
- MechanicalSoup may add an extra layer, making it slower than directly using Python requests and BeautifulSoup
- It can't manage form inputs if the forms are generated using JavaScript
- MechanicalSoup is incapable of performing advanced browser interactions, such as scrolling
How Can a Web Scraping Service Help?
Web scraping with MechanicalSoup allows you to replace two libraries—requests and BeautifulSoup—with one. It also enables you to handle forms and links conveniently, as shown in this tutorial.
However, the code shown is only suitable for small-scale scraping. For large-scale projects, it's better to get help from professional web scraping services.
A web scraping service, like ScrapeHero, can take care of all the technicalities, including choosing the libraries. You only need to give your data requirements. ScrapeHero is a fully managed web scraping service provider capable of building enterprise-grade web scrapers and crawlers.
FAQ
How is MechanicalSoup different from Selenium?
MechanicalSoup is built on top of Requests and BeautifulSoup, which lets you perform static scraping more conveniently, while Selenium is a full-fledged browser automation library for complex browser interactions and executing JavaScript.
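For contrast, here's a minimal Selenium sketch; it assumes Selenium is installed (pip install selenium) with a Chrome driver available, and it renders the page's JavaScript before reading elements:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.com")
# unlike MechanicalSoup, Selenium executes the page's JavaScript,
# so elements rendered by scripts are reachable here
print([e.text for e in driver.find_elements(By.TAG_NAME, "h3")])
driver.quit()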