How to Easily Scrape Multimedia Content


You can scrape multimedia content from websites much as you scrape text, with one key difference: you need to handle multimedia files, such as images, audio, and video, as binary data.

Confused about how to get started? Read on.

How to Scrape Multimedia Content: The Environment

Depending on the website, you might need to switch between request-based methods and browser automation libraries. 

For static sites—those that don’t use JavaScript to generate HTML—you can use Python’s requests library to fetch the HTML code. However, to scrape a dynamic website, opt for browser automation libraries like Selenium.

To parse the web page and extract content, BeautifulSoup, along with a parser like lxml, is a great choice. 

You can install all these packages using the Python package manager, pip, for example: pip install requests beautifulsoup4 lxml selenium.

How to Scrape Multimedia Content: The Code

The complexity of the code depends on how you want to extract multimedia.

From a Multimedia URL
The code is relatively straightforward if you have the URL of the required multimedia content. 

First, import requests and make an HTTP request to the URL.

import requests

# url holds the address of the multimedia file you want to download
response = requests.get(url)

Save the response content in binary mode.

# extract the file name
name = url.split('/')[-1]
# save the file
with open(name, 'wb') as f:
    f.write(response.content)

If the multimedia content is large, it’s better to save it in chunks to avoid exhausting your RAM. 

size = 1024 * 1024  # 1 MB per chunk
# stream=True keeps requests from loading the whole body into memory at once
response = requests.get(url, stream=True)
if response.ok:
    with open(name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=size):
            if chunk:
                f.write(chunk)
else:
    print(f"Failed: {response.status_code}")

From a Web Page

To gather all multimedia content from a specific page, create a function that searches for all tag names that typically hold multimedia and retrieves the corresponding URLs. This function—which we’ll call extract()—will accept a BeautifulSoup object and extract multimedia from the page. 

It starts by iterating through a list of tag names that usually contain multimedia on a web page.

In each iteration of the loop, extract() finds all the elements with the current tag name.

for name in tag_names:
    tags = soup.find_all(name)

Then, the code iterates through all the tags and: 

1. Gets the URL from the tag’s src attribute.

src = tag.get('src')  # .get() avoids a KeyError when the src attribute is missing
url = urljoin(site_url, src) if src else None

2. Makes an HTTP request to the URL.

file = requests.get(url)

3. Saves the response to a file in binary mode.

name = url.split('/')[-1].split('?')[0]
with open(f"{domain}/{name}", 'wb') as f:
    f.write(file.content)
print('File downloaded:', name)

You can now call extract(), but first, you’ll need to get the URL of the target web page from which you want to scrape multimedia. Use the argparse package to obtain this URL from the user:

1. Create an ArgumentParser() object.

argparser = argparse.ArgumentParser()

2. Add an argument for the URL.

argparser.add_argument('-u','--url',help='Add the URL of the target web page',required=True)

To determine whether to use Selenium or requests, add an optional boolean argument; if specified during script execution, this will trigger Selenium.

argparser.add_argument('-d','--dynamic',action='store_true',help='Using dynamic mode')

Parse both arguments:

args = argparser.parse_args()

Then, store the target URL in a variable called site_url:

site_url = args.url

Finally, store the boolean value in a variable called mode:

mode = args.dynamic

Define headers that may be necessary to hide your scraper from any anti-scraping measures employed by the target website.

headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                    '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
          'dpr': '1',
          'sec-fetch-dest': 'document',
          'sec-fetch-mode': 'navigate',
          'sec-fetch-site': 'none',
          'sec-fetch-user': '?1',
          'upgrade-insecure-requests': '1',
          'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}

Next, check whether the variable mode is true or false. If it is false, use the Python requests library to make an HTTP request to the URL extracted from the arguments.

if not mode:
    response = requests.get(site_url, headers=headers).text

If it is true:

  1. Launch the Selenium browser.
  2. Navigate to the URL stored in site_url.
  3. Retrieve the HTML source code using the page_source attribute.

else:
    browser = webdriver.Chrome()
    browser.get(site_url)
    time.sleep(3)
    response = browser.page_source

Want to learn more about web scraping with Selenium? Check this informative article on Selenium web scraping.

Now, parse the retrieved HTML source code using BeautifulSoup and lxml, creating a BeautifulSoup object. The function extract() will extract the multimedia URLs from this object.

soup = BeautifulSoup(response,'lxml')

Define an array containing the tag names; extract() will iterate through this array while searching for multimedia URLs.

tag_names = ['img', 'source']

The downloaded media files can clutter the directory containing the script. Therefore, create a separate directory for each target webpage:

1. Derive a name from the target website’s URL.

domain = site_url.split('/')[-1].split('?')[0]

2. Use os.makedirs() to create this folder.

os.makedirs(domain,exist_ok=True)

Finally, call extract(), passing the tag names, the BeautifulSoup object, the target URL, and the directory name: extract(tag_names, soup, site_url, domain).

Using a CSV File

So far, this code allows you to perform multimedia scraping from a single webpage. However, you can also scrape multimedia content from multiple web pages with a single script by getting their URLs from a CSV file. 

To do this, create a CSV file containing URLs and a boolean value indicating whether or not Selenium should be used. 
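For instance, a urls.csv file might look like the following; the column names and URLs are placeholders for illustration:

```csv
url,dynamic
https://example.com/gallery,False
https://example.com/spa-page,True
```

The first row is a header, which pandas skips automatically when reading the file.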

Now, you can move your current scraping logic to a function, scrape_multimedia(), which you can call in a loop for each URL extracted from the CSV file. 

This function:

  1. Accepts a URL and a boolean value.
  2. Uses requests to fetch the HTML page if the value is false; otherwise, Selenium.
  3. Parses the HTML page.
  4. Creates a directory.
  5. Calls extract() with the tag names, BeautifulSoup object, URL, and directory name as arguments.

def scrape_multimedia(site_url, mode=False):
    if not mode:
        response = requests.get(site_url, headers=headers).text
    else:
        browser = webdriver.Chrome()
        browser.get(site_url)
        time.sleep(3)
        response = browser.page_source
    soup = BeautifulSoup(response, 'lxml')
    tag_names = ['img', 'source']
    domain = site_url.split('/')[-1].split('?')[0]
    os.makedirs(domain, exist_ok=True)
    extract(tag_names, soup, site_url, domain)

You can then read your previously created CSV file and iterate through its rows. In each iteration:

  1. Extract the URL and boolean value.
  2. Call scrape_multimedia() with these extracted values as arguments.

urls = pandas.read_csv('urls.csv')

for _, row in urls.iterrows():
    site_url = row.iloc[0]
    mode = row.iloc[1]
    scrape_multimedia(site_url, mode)

Code Limitations

Although this code is suitable for web scraping multimedia files at a small scale, it has some limitations:

  • No techniques to bypass anti-scraping measures: Websites may block your scraper. This code lacks advanced techniques like proxy rotation or CAPTCHA-solving to overcome the block, making it unsuitable for large-scale web scraping.
  • Cannot scrape streaming videos: Streaming videos operate differently than static videos on websites. This code doesn’t capture them.
  • Does not check the content type before downloading: The code assumes that source URLs contain valid extensions; URLs without extensions will download unidentified files.
  • Might download duplicate files: Videos might have multiple source URLs for different formats. This code will download all of them, leading to duplicates.
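To work around the missing content-type check, one possible approach (a sketch, not part of the original script; the helper name filename_for() is our own) is to fall back on the response's Content-Type header whenever the URL lacks a usable extension:

```python
import mimetypes


def filename_for(url, content_type):
    """Derive a file name from the URL; fall back to the Content-Type header
    (e.g. response.headers.get('Content-Type')) when the URL has no extension."""
    name = url.split('/')[-1].split('?')[0]
    if '.' not in name:
        # strip any parameters, e.g. 'image/png; charset=binary' -> 'image/png'
        ext = mimetypes.guess_extension(content_type.split(';')[0].strip()) or ''
        name = (name or 'download') + ext
    return name
```

This keeps files with valid extensions untouched while giving extension-less URLs a sensible guess instead of an unidentified file.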

Wrapping Up

Python libraries like requests, Selenium, and BeautifulSoup allow you to scrape multimedia content. Simply find your target multimedia URL, make an HTTP request to it, and save the result by writing it in binary mode.

However, you need to handle the technical details yourself, and you must be especially careful about compliance when scraping multimedia content.

A web scraping service like ScrapeHero will take care of all these hassles for you.

ScrapeHero offers a fully managed web scraping service capable of building high-quality scrapers and crawlers. We can ensure compliance and take care of all the technical aspects of web scraping.

