Ditch Multiple Libraries by Web Scraping with MechanicalSoup

Requests and BeautifulSoup are excellent for static web scraping, but using them means handling two libraries. MechanicalSoup avoids that by letting you fetch and parse HTML with a single library.

This article shows you how to use MechanicalSoup for web scraping.

Web Scraping with MechanicalSoup: Understand the Elements

Before you start, check which elements you want to scrape, because that determines whether MechanicalSoup is suitable. The library cannot execute JavaScript, so it cannot extract elements that only appear after JavaScript runs.

This tutorial shows how to scrape Google using MechanicalSoup by extracting three details from the search results:

  1. Title
  2. Description
  3. URL

All these details are available without JavaScript execution. 
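One way to check this for a page is to parse its raw HTML and see whether the target element is already there. The sketch below is only an illustration: it uses Python's built-in html.parser instead of MechanicalSoup so that it runs offline, and the sample HTML and the H3Collector class are made up for this example.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collect the text of every h3 tag found in static HTML."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3 = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False

    def handle_data(self, data):
        # Only keep text that sits inside an h3 tag
        if self.in_h3:
            self.titles[-1] += data


# Hypothetical page: one h3 is in the HTML, another would be added by a script
raw_html = """
<html><body>
  <div id="main"><h3>Static title</h3></div>
  <script>
    var h = document.createElement('h3');
    h.textContent = 'Dynamic title';
    document.body.appendChild(h);
  </script>
</body></html>
"""

collector = H3Collector()
collector.feed(raw_html)
print(collector.titles)  # ['Static title']
```

Only the title present in the raw HTML is found. The one the script would create never appears, which is the situation a static scraper cannot handle.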

Web Scraping with MechanicalSoup: Set Up the Environment

MechanicalSoup is an external library, so you must install it using pip.

pip install mechanicalsoup

You also need the json module to save the extracted data to a JSON file. It ships with the Python standard library, so there is nothing to install.

Web Scraping with MechanicalSoup: Write the Code

Start the code by importing the packages mentioned above:

import mechanicalsoup, json

Now, you can begin writing the code that:

  1. Searches a term on Google
  2. Navigates a specified number of pages
  3. Extracts the title, description, and URL of each result from each page
  4. Saves the extracted data to a JSON file

To keep the code clean, create a function to navigate and a function to extract details.

Define extract() to Extract Details From a Page

The function to extract details will accept a list containing results on a page and a list to store the details extracted from the results.

def extract(results, extracted_details):
    for result in results:
        try:
            title = result.h3.text
            description = result.find('div',{'class':'BNeawe s3v9rd AP7Wnd'}).text
            url = result.a['href']
        except (AttributeError, TypeError, KeyError):
            # Skip results that lack a title, description, or link
            continue

        extracted_details.append(
            {
                'Title':title,
                'Description':description,
                'Url':url.replace('/url?q=','')
            }
        )

This code snippet iterates through a list containing results, and in each iteration:

  1. Tries to extract
    1. Title from an h3 tag
    2. Description from a div tag
    3. URL from an anchor tag
  2. Appends the extracted details to extracted_details
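The extraction code removes the '/url?q=' prefix with replace(), which can leave extra query parameters at the end of the link. As a possible refinement, the sketch below reads the q parameter with urllib.parse instead. It assumes the href has the form /url?q=<target>&<other parameters>; the helper name clean_google_url and the sample links are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def clean_google_url(href):
    """Return the target of a '/url?q=...' redirect link, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path == '/url':
        # parse_qs maps each query parameter to a list of its values
        target = parse_qs(parsed.query).get('q')
        if target:
            return target[0]
    return href


print(clean_google_url('/url?q=https://example.com/page&sa=U&ved=abc'))
# https://example.com/page
print(clean_google_url('https://example.org/'))
# https://example.org/
```

To try it in extract(), 'Url' could be set to clean_google_url(url) in place of the replace() call.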

Define paginate() to Navigate the Pages

To navigate pages, create a function that runs a loop and follows the link to the next page until the loop count becomes equal to the specified number of pages to extract.

def paginate():
    extracted_details = []
    for page in range(pages):
        soup = browser.page
        result_area = soup.find('div',{'id':'main'})
        if result_area is None:
            # No results container on this page, so stop paginating
            break
        results = result_area.find_all('div',{'class':'Gx5Zad xpd EtOod pkphOe'})
        extract(results, extracted_details)
        next_page = soup.find('a',{'aria-label':'Next page'})
        if next_page is None:
            # No further pages for this search term
            break
        browser.follow_link(next_page)
    return extracted_details

This code snippet defines an empty list to hold the details extracted from the results related to one term and uses a loop to paginate. In each iteration, the code:

  1. Gets the parsed HTML code using MechanicalSoup’s page attribute
  2. Locates the div element containing all the results
  3. Extracts all the div elements containing individual results
  4. Calls extract()
  5. Locates and navigates to the next page

After the loop is complete, the function returns extracted_details.
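The pagination pattern itself can be shown without a browser: visit pages up to a limit and stop early when there is no next page. This is a minimal sketch with made-up names (walk_pages, fake_site); a dict stands in for the "next page" links, so it illustrates only the control flow, not MechanicalSoup's API.

```python
def walk_pages(first_page, next_page_of, max_pages):
    """Yield up to max_pages pages, stopping early when no next page exists."""
    page = first_page
    for _ in range(max_pages):
        if page is None:
            break
        yield page
        page = next_page_of(page)


# Hypothetical stand-in for a site with three linked result pages
fake_site = {'page1': 'page2', 'page2': 'page3', 'page3': None}

print(list(walk_pages('page1', fake_site.get, max_pages=5)))
# ['page1', 'page2', 'page3']
print(list(walk_pages('page1', fake_site.get, max_pages=2)))
# ['page1', 'page2']
```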

Call the Functions

You can call the functions to navigate and extract after defining them, but you need to perform specific steps before that. 

First, create an object of the StatefulBrowser class of MechanicalSoup. This object allows you to maintain a persistent session, handle cookies, and follow redirects.

browser = mechanicalsoup.StatefulBrowser(
        soup_config = {'features':'lxml'}, # use lxml
)

In the above code, the soup_config argument accepts configurations for BeautifulSoup; here, it tells BeautifulSoup to use lxml for parsing.

You can also update the headers of the object using the session.headers.update() method.

#define headers
headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                    '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
          'dpr': '1',
          'sec-fetch-dest': 'document',
          'sec-fetch-mode': 'navigate',
          'sec-fetch-site': 'none',
          'sec-fetch-user': '?1',
          'upgrade-insecure-requests': '1',
          'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}

#update headers
browser.session.headers.update(headers)

Next, store the search terms in a list.

search_terms = ["masks","laptops","mobiles","cycles","dumbbells","ropes"]

Set the number of pages you wish to scrape for each term. 

pages = 5

Now, you only have to loop through each search term to extract details. But before starting a loop, define an empty dict to store extracted details related to all the search terms.

all_results = {}

Then, for each term in search_terms, perform these steps inside the loop:

1. Visit “Google.com” using the .open() method of MechanicalSoup.

browser.open('https://www.google.com')

2. Select the form that allows you to input the search term. MechanicalSoup has a select_form() method for that; it accepts a CSS selector and lets you set the form inputs using dict-like syntax.

browser.select_form('form[action="/search"]')

3. Enter the search term by using the name of the input element as the key.

browser['q'] = term

4. Submit the selected form using the submit_selected() method.

response = browser.submit_selected()

5. Check whether the status code of the response is 429 (Too Many Requests) and, if it is, exit the program. Otherwise, the code moves on to the next step.

if response.status_code == 429:
    exit("Too Many Requests")

6. Call paginate() and store the details in the empty dict defined outside the loop with the term as the key.

all_results[term] = paginate()
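Step 5 exits as soon as a 429 arrives. An alternative, not used in this tutorial, is to wait and retry with increasing delays. The sketch below shows that idea under stated assumptions: submit_with_backoff and FakeResponse are hypothetical names, and fake responses replace real requests so the example runs instantly and offline.

```python
import time


def submit_with_backoff(submit, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it returns a non-429 response or attempts run out."""
    for attempt in range(max_attempts):
        response = submit()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Double the wait after every rate-limited attempt
            sleep(base_delay * 2 ** attempt)
    return response


class FakeResponse:
    """Hypothetical stand-in for a response object with a status code."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two rate-limited attempts followed by a success
responses = iter([FakeResponse(429), FakeResponse(429), FakeResponse(200)])
delays = []  # record the waits instead of actually sleeping
result = submit_with_backoff(lambda: next(responses), sleep=delays.append)
print(result.status_code, delays)  # 200 [1.0, 2.0]
```

In the tutorial's loop, submit would likely need to be a small function that repeats steps 1 to 4 for the term, since the form probably has to be selected again after each submission.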

Finally, write the extracted results to a JSON file. 

with open("googleSearchResults.json",'w',encoding='utf-8') as f:
        json.dump(all_results,f,indent=4,ensure_ascii=False)
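To confirm the shape of the saved data, you could write a small sample and read it back. The sketch below assumes the structure built above, a dict that maps each search term to a list of result dicts. The sample values are invented, and a temporary directory is used so no real file is overwritten.

```python
import json
import os
import tempfile

# Invented sample in the same shape as all_results
all_results = {
    'masks': [
        {
            'Title': 'Example title',
            'Description': 'Example description',
            'Url': 'https://example.com'
        }
    ]
}

path = os.path.join(tempfile.mkdtemp(), 'googleSearchResults.json')

with open(path, 'w', encoding='utf-8') as f:
    json.dump(all_results, f, indent=4, ensure_ascii=False)

# Read the file back and count the results stored for each term
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

for term, results in loaded.items():
    print(term, len(results))  # masks 1
```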

Here’s the complete code:

import mechanicalsoup, json

def extract(results, extracted_details):

    for result in results:
        if result.h3:
            try:
                title = result.h3.text
                description = result.find('div',{'data-snf':'nke7rc'}).text
                url = result.a['href']
            except Exception as e:
                print("extract error: ",e)
                continue

            extracted_details.append(
                {
                    'Title':title,
                    'Description':description,
                    'Url':url.replace('/url?q=','')
                }
            )

def paginate():
    extracted_details = []

    for page in range(pages):
        soup = browser.page
        result_area = soup.find('div',{'id':'main'})

        if result_area is None:
            # No results container on this page, so stop paginating
            print("paginate error: results container not found")
            break

        results = result_area.find_all('div',{'class':'MjjYud'})
        extract(results,extracted_details)

        next_page = soup.find('a',{'id':'pnnext'})
        if next_page is None:
            # No further pages for this search term
            break
        browser.follow_link(next_page)
    return extracted_details

if __name__ == "__main__":
    headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                    '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
          'dpr': '1',
          'sec-fetch-dest': 'document',
          'sec-fetch-mode': 'navigate',
          'sec-fetch-site': 'none',
          'sec-fetch-user': '?1',
          'upgrade-insecure-requests': '1',
          'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}

    browser = mechanicalsoup.StatefulBrowser(
        soup_config={'features':'lxml'},
        )

    browser.session.headers.update(headers)
    pages = 2
    search_terms = ["masks","laptops","mobiles","cycles","dumbbells","ropes"]
    all_results ={}
    for term in search_terms[:2]:

        browser.open('https://www.google.com')
        browser.select_form('form[action="/search"]')

        browser['q'] = term
        response = browser.submit_selected()

        if response.status_code == 429:
            exit("Too Many Requests")

        all_results[term] = paginate()

        print(term,"extracted")

    with open("googleSearchResults.json",'w',encoding='utf-8') as f:
        json.dump(all_results,f,indent=4,ensure_ascii=False)

MechanicalSoup Limitations

MechanicalSoup is an excellent library to use for Python web scraping in place of requests and BeautifulSoup. However, consider its limitations:

  • MechanicalSoup adds a layer on top of requests and BeautifulSoup, which may make it slower than using them directly
  • It can’t manage form inputs if the forms are generated using JavaScript
  • It can’t perform advanced browser interactions, such as scrolling

How Can a Web Scraping Service Help?

Web scraping with MechanicalSoup allows you to replace two libraries—requests and BeautifulSoup—with one. It also enables you to handle forms and links conveniently, as shown in this tutorial.

However, the code shown is only suitable for small-scale scraping. For large-scale projects, it’s better to get help from professional web scraping services.

A web scraping service, like ScrapeHero, can take care of all the technicalities, including choosing the libraries. You only need to give your data requirements. ScrapeHero is a fully managed web scraping service provider capable of building enterprise-grade web scrapers and crawlers. 

FAQ

What is the difference between MechanicalSoup and Selenium?

MechanicalSoup is built on top of requests and BeautifulSoup, which makes static scraping more convenient. Selenium, in contrast, is a full-fledged browser automation library that handles complex browser interactions and executes JavaScript.
