Requests and BeautifulSoup are excellent for static web scraping, but they are two separate libraries to manage. Web scraping with MechanicalSoup avoids that overhead by letting you fetch and parse HTML with a single library.
This article shows you how to use MechanicalSoup for web scraping.
Web Scraping with MechanicalSoup: Understand the Elements
Understanding the elements you wish to scrape is essential for deciding whether you can use MechanicalSoup for web scraping. The library cannot handle JavaScript, so it cannot extract elements that become visible only after JavaScript executes.
This tutorial shows how to scrape Google using MechanicalSoup by extracting three details from the search results:
- Title
- Description
- URL
All these details are available without JavaScript execution.
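If you're unsure whether a detail is available without JavaScript, one quick check is to fetch the page and look for the element in the raw HTML. Here's a minimal sketch of that idea; the URL and the h3 tag are only illustrative, and Google may still block or redirect automated requests:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.google.com/search?q=laptops")
# browser.page is BeautifulSoup's parse of the raw HTML, before any JavaScript runs;
# if find() returns None here, the element probably needs JavaScript to appear
print(browser.page.find("h3") is not None)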
Web Scraping with MechanicalSoup: Set Up the Environment
MechanicalSoup is an external library, so you must install it using pip:
pip install mechanicalsoup
You also need the json module, which allows you to save the extracted data to a JSON file. It's part of the Python standard library, so you don't have to install it.
Web Scraping with MechanicalSoup: Write the Code
Start the code by importing the packages mentioned above:
import mechanicalsoup
import json
Now, you can begin writing the code that:
- Searches a term on Google
- Navigates a specified number of pages
- Extracts the title, description, and URL of each result from each page
- Saves the extracted data to a JSON file
To keep the code clean, create a function to navigate and a function to extract details.
Define extract() to Extract Details From a Page
The extraction function accepts two arguments: a list of the results on a page and a list in which to store the details extracted from those results.
def extract(results, extracted_details):
    for result in results:
        try:
            title = result.h3.text
            description = result.find('div', {'class': 'BNeawe s3v9rd AP7Wnd'}).text
            url = result.a['href']
        except (AttributeError, TypeError):
            # skip results missing the title, description, or URL
            continue
        extracted_details.append(
            {
                'Title': title,
                'Description': description,
                'Url': url.replace('/url?q=', '')
            }
        )
This code snippet iterates through a list containing results, and in each iteration:
- Tries to extract:
  - Title from an h3 tag
  - Description from a div tag
  - URL from an anchor tag
- Appends the extracted details to extracted_details
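As a side note, the same lookups can be written with CSS selectors using BeautifulSoup's select_one(). The helper below is only an illustrative variant; the class names (taken from the snippet above) may change whenever Google updates its markup:
def extract_with_css(result):
    # same extraction as above, expressed with CSS selectors
    title_tag = result.select_one('h3')
    desc_tag = result.select_one('div.BNeawe.s3v9rd.AP7Wnd')
    link_tag = result.select_one('a[href]')
    if not (title_tag and desc_tag and link_tag):
        return None
    return {
        'Title': title_tag.text,
        'Description': desc_tag.text,
        'Url': link_tag['href'].replace('/url?q=', '')
    }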
Define paginate() to Navigate the Pages
To navigate pages, create a function that loops once per page, extracting the results and following the link to the next page until it has covered the specified number of pages.
def paginate():
    extracted_details = []
    for page in range(pages):
        soup = browser.page
        result_area = soup.find('div', {'id': 'main'})
        try:
            results = result_area.find_all('div', {'class': 'Gx5Zad xpd EtOod pkphOe'})
        except AttributeError:
            # the results container wasn't found on this page
            continue
        extract(results, extracted_details)
        next_page = soup.find('a', {'aria-label': 'Next page'})
        if next_page is None:
            # no next-page link; stop paginating
            break
        browser.follow_link(next_page)
    return extracted_details
This code snippet defines an empty list to hold the details extracted from the results related to one term and uses a loop to paginate. In each iteration, the code:
- Gets the parsed HTML code using MechanicalSoup’s page attribute
- Locates the div element containing all the results
- Extracts all the div elements containing individual results
- Calls extract()
- Locates and navigates to the next page
After the loop is complete, the function returns extracted_details.
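Note that follow_link() doesn't require the anchor tag itself; it also accepts a regex string matched against link URLs. As an illustrative alternative (Google's next-page links contain a start parameter, though that detail may change):
# follow the first link whose href matches the regex "start=" instead of passing the tag
browser.follow_link("start=")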
Call the Functions
You can call the functions to navigate and extract after defining them, but you need to perform specific steps before that.
First, create an object of the StatefulBrowser class of MechanicalSoup. This object allows you to maintain a persistent session, handle cookies, and follow redirects.
browser = mechanicalsoup.StatefulBrowser(
soup_config = {'features':'lxml'}, # use lxml
)
In the above code, the soup_config argument accepts configurations for BeautifulSoup; here, it tells BeautifulSoup to use lxml for parsing.
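If lxml isn't installed, you can fall back to Python's built-in parser by changing only the features value:
browser = mechanicalsoup.StatefulBrowser(
    soup_config = {'features':'html.parser'}, # standard-library parser; slower than lxml but dependency-free
)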
You can also update the headers of the object using the session.headers.update() method.
# define headers
headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                     '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
           'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
           'dpr': '1',
           'sec-fetch-dest': 'document',
           'sec-fetch-mode': 'navigate',
           'sec-fetch-site': 'none',
           'sec-fetch-user': '?1',
           'upgrade-insecure-requests': '1',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
# update headers
browser.session.headers.update(headers)
Next, store the search terms in a list.
search_terms = ["masks","laptops","mobiles","cycles","dumbbells","ropes"]
Set the number of pages you wish to scrape for each term.
pages = 5
Now, you only have to loop through each search term to extract the details. But before starting the loop, define an empty dict to store the extracted details for all the search terms.
all_results = {}
And in the loop:
1. Visit “Google.com” using the .open() method of MechanicalSoup.
browser.open('https://www.google.com')
2. Select the form that allows you to input the search term. MechanicalSoup has a select_form() method for that; once a form is selected, you can set its inputs like entries in a dict (see the sketch after this list).
browser.select_form('form[action="/search"]')
3. Enter the search term by using the name of the input element as the key.
browser['q'] = term
4. Submit the selected form using the submit_selected() method, which returns a response object.
response = browser.submit_selected()
5. Check whether the status code is 429 (Too Many Requests) and exit the program if it is. Otherwise, the code moves on to the next step.
if response.status_code == 429:
    exit("Too Many Requests")
6. Call paginate() and store the details in the empty dict defined outside the loop with the term as the key.
all_results[term] = paginate()
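As mentioned in step 2, you can inspect the selected form to discover input names such as q. Here's a short sketch using MechanicalSoup's print_summary(), which lists the inputs of the current form:
browser.open('https://www.google.com')
browser.select_form('form[action="/search"]')
# print_summary() lists the form's inputs, which is how you find field names like "q"
browser.get_current_form().print_summary()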
Finally, write the extracted results to a JSON file.
with open("googleSearchResults.json",'w',encoding='utf-8') as f:
json.dump(all_results,f,indent=4,ensure_ascii=False)
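As an optional sanity check, you can reload the file and count the results saved per term:
# optional: reload the JSON file and report how many results each term produced
with open("googleSearchResults.json", encoding='utf-8') as f:
    saved = json.load(f)
for term, results in saved.items():
    print(term, "->", len(results), "results")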
Here’s the complete code:
import mechanicalsoup
import json

def extract(results, extracted_details):
    for result in results:
        if result.h3:
            try:
                title = result.h3.text
                description = result.find('div', {'data-snf': 'nke7rc'}).text
                url = result.a['href']
            except Exception as e:
                print("extract error:", e)
                continue
            extracted_details.append(
                {
                    'Title': title,
                    'Description': description,
                    'Url': url.replace('/url?q=', '')
                }
            )

def paginate():
    extracted_details = []
    for page in range(pages):
        soup = browser.page
        result_area = soup.find('div', {'id': 'main'})
        try:
            results = result_area.find_all('div', {'class': 'MjjYud'})
        except Exception as e:
            print("paginate error:", e)
            continue
        extract(results, extracted_details)
        next_page = soup.find('a', {'id': 'pnnext'})
        if next_page is None:
            # no next-page link; stop paginating
            break
        browser.follow_link(next_page)
    return extracted_details

if __name__ == "__main__":
    headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                         '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
               'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
               'dpr': '1',
               'sec-fetch-dest': 'document',
               'sec-fetch-mode': 'navigate',
               'sec-fetch-site': 'none',
               'sec-fetch-user': '?1',
               'upgrade-insecure-requests': '1',
               'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
    browser = mechanicalsoup.StatefulBrowser(
        soup_config={'features': 'lxml'},
    )
    browser.session.headers.update(headers)
    pages = 2
    search_terms = ["masks", "laptops", "mobiles", "cycles", "dumbbells", "ropes"]
    all_results = {}
    # the slice limits this demo run to the first two terms; use search_terms to scrape all
    for term in search_terms[:2]:
        browser.open('https://www.google.com')
        browser.select_form('form[action="/search"]')
        browser['q'] = term
        response = browser.submit_selected()
        if response.status_code == 429:
            exit("Too Many Requests")
        all_results[term] = paginate()
        print(term, "extracted")
    with open("googleSearchResults.json", 'w', encoding='utf-8') as f:
        json.dump(all_results, f, indent=4, ensure_ascii=False)
MechanicalSoup Limitations
MechanicalSoup is an excellent library to use for Python web scraping in place of requests and BeautifulSoup. However, consider its limitations:
- MechanicalSoup may add an extra layer, making it slower than directly using Python requests and BeautifulSoup
- It can't manage form inputs if the forms are generated using JavaScript
- MechanicalSoup is incapable of performing advanced browser interactions, such as scrolling
How Can a Web Scraping Service Help?
Web scraping with MechanicalSoup allows you to replace two libraries—requests and BeautifulSoup—with one. It also enables you to handle forms and links conveniently, as shown in this tutorial.
However, the code shown is only suitable for small-scale scraping. For large-scale projects, it's better to get help from professional web scraping services.
A web scraping service, like ScrapeHero, can take care of all the technicalities, including choosing the libraries. You only need to give your data requirements. ScrapeHero is a fully managed web scraping service provider capable of building enterprise-grade web scrapers and crawlers.
FAQ
How is MechanicalSoup different from Selenium?
MechanicalSoup is built on top of Requests and BeautifulSoup, which lets you perform static scraping more conveniently, while Selenium is a full-fledged browser automation library for complex browser interactions and executing JavaScript.
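For contrast, here's a minimal Selenium sketch; it assumes Selenium is installed (pip install selenium) with a Chrome driver available, and it renders the page's JavaScript before reading elements:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.com")
# unlike MechanicalSoup, Selenium executes the page's JavaScript,
# so elements rendered by scripts are reachable here
print([e.text for e in driver.find_elements(By.TAG_NAME, "h3")])
driver.quit()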