How to Easily Scrape Multimedia Content


You can scrape multimedia content from websites much as you scrape text, with one key difference: you need to handle multimedia files, such as images, audio, and video, as binary data.

Confused about how to get started? Read on.

How to Scrape Multimedia Content: The Environment

Depending on the website, you might need to switch between request-based methods and browser automation libraries. 

For static sites—those that don’t use JavaScript to generate HTML—you can use Python’s requests library to fetch the HTML code. However, to scrape a dynamic website, opt for browser automation libraries like Selenium.

To parse the web page and extract content, BeautifulSoup, along with a parser like lxml, is a great choice. 

You can install all these packages using the Python package manager, pip, for example: pip install requests beautifulsoup4 lxml selenium.

How to Scrape Multimedia Content: The Code

The complexity of the code depends on how you want to extract multimedia.

From a Multimedia URL
The code is relatively straightforward if you have the URL of the required multimedia content. 

First, import requests and make an HTTP request to the URL.

import requests

# url holds the address of the multimedia file you want to download
response = requests.get(url)

Save the response content in binary mode.

# extract the file name
name = url.split('/')[-1]
# save the file
with open(name, 'wb') as f:
    f.write(response.content)

If the multimedia content is large, it’s better to save it in chunks to avoid exhausting your RAM. 

size = 1024 * 1024  # 1 MB per chunk
# stream=True keeps requests from loading the whole body into memory at once
response = requests.get(url, stream=True)
if response.ok:
    with open(name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=size):
            if chunk:
                f.write(chunk)
else:
    print(f"Failed: {response.status_code}")

From a Web Page

To gather all multimedia content from a specific page, create a function that searches for all tag names that typically hold multimedia and retrieves the corresponding URLs. This function—which we’ll call extract()—will accept a BeautifulSoup object and extract multimedia from the page. 

It starts by iterating through a list of tag names that usually contain multimedia on a web page.

In each iteration of the loop, extract() finds all the elements with the current tag name.

for name in tag_names:
    tags = soup.find_all(name)

Then, the code iterates through all the tags and: 

1. Gets the URL from the tag’s src attribute.

src = tag.get('src')  # .get() avoids a KeyError when the src attribute is missing
url = urljoin(site_url, src) if src else None

2. Makes an HTTP request to the URL.

file = requests.get(url)

3. Saves the response to a file in binary mode.

name = url.split('/')[-1].split('?')[0]
with open(f"{domain}/{name}", 'wb') as f:
    f.write(file.content)
print('File downloaded:', name)

You can now call extract(), but first, you’ll need to get the URL of the target web page from which you want to scrape multimedia. Use the argparse package to obtain this URL from the user:

1. Create an ArgumentParser() object.

argparser = argparse.ArgumentParser()

2. Add an argument for the URL.

argparser.add_argument('-u','--url',help='Add the URL of the target web page',required=True)

To determine whether to use Selenium or requests, add an optional boolean argument; if specified during script execution, this will trigger Selenium.

argparser.add_argument('-d','--dynamic',action='store_true',help='Using dynamic mode')

Parse both arguments:

args = argparser.parse_args()

Then, store the target URL in a variable called site_url:

site_url = args.url

Finally, store the boolean value in a variable called mode:

mode = args.dynamic

Define headers that may be necessary to hide your scraper from any anti-scraping measures employed by the target website.

headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                    '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
          'dpr': '1',
          'sec-fetch-dest': 'document',
          'sec-fetch-mode': 'navigate',
          'sec-fetch-site': 'none',
          'sec-fetch-user': '?1',
          'upgrade-insecure-requests': '1',
          'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}

Next, check whether the variable mode is true or false. If it is false, use the Python requests library to make an HTTP request to the URL extracted from the arguments.

if not mode:
    response = requests.get(site_url, headers=headers).text

If it is true:

  1. Launch the Selenium browser.
  2. Navigate to the URL stored in site_url.
  3. Retrieve the HTML source code using the page_source attribute.

else:
    browser = webdriver.Chrome()
    browser.get(site_url)
    time.sleep(3)
    response = browser.page_source

Want to learn more about web scraping with Selenium? Check this informative article on Selenium web scraping.

Now, parse the retrieved HTML source code using BeautifulSoup and lxml, creating a BeautifulSoup object. The function extract() will extract the multimedia URLs from this object.

soup = BeautifulSoup(response,'lxml')

Define an array containing the tag names; extract() will iterate through this array while searching for multimedia URLs.

tag_names = ['img', 'source']

The downloaded media files can clutter the directory containing the script. Therefore, create a separate directory for each target webpage:

1. Derive a name from the target website’s URL.

domain = site_url.split('/')[-1].split('?')[0]

2. Use os.makedirs() to create this folder.

os.makedirs(domain,exist_ok=True)

Finally, call extract(), passing the tag names, the BeautifulSoup object, the target URL, and the directory name: extract(tag_names, soup, site_url, domain).

Using a CSV File

So far, this code allows you to perform multimedia scraping from a single webpage. However, you can also scrape multimedia content from multiple web pages with a single script by getting their URLs from a CSV file. 

To do this, create a CSV file containing URLs and a boolean value indicating whether or not Selenium should be used. 
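For instance, a urls.csv file might look like the following; the column names and URLs are placeholders for illustration:

```csv
url,dynamic
https://example.com/gallery,False
https://example.com/spa-page,True
```

The first row is a header, which pandas skips automatically when reading the file.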

Now, you can move your current scraping logic to a function, scrape_multimedia(), which you can call in a loop for each URL extracted from the CSV file. 

This function:

  1. Accepts a URL and a boolean value.
  2. Uses requests to fetch the HTML page if the value is false; otherwise, Selenium.
  3. Parses the HTML page.
  4. Creates a directory.
  5. Calls extract() with the tag names, BeautifulSoup object, URL, and directory name as arguments.

def scrape_multimedia(site_url, mode=False):
    if not mode:
        response = requests.get(site_url, headers=headers).text
    else:
        browser = webdriver.Chrome()
        browser.get(site_url)
        time.sleep(3)
        response = browser.page_source
    soup = BeautifulSoup(response, 'lxml')
    tag_names = ['img', 'source']
    domain = site_url.split('/')[-1].split('?')[0]
    os.makedirs(domain, exist_ok=True)
    extract(tag_names, soup, site_url, domain)

You can then read your previously created CSV file and iterate through its rows. In each iteration:

  1. Extract the URL and boolean value.
  2. Call scrape_multimedia() with these extracted values as arguments.

urls = pandas.read_csv('urls.csv')

for _, row in urls.iterrows():
    site_url = row.iloc[0]
    mode = row.iloc[1]
    scrape_multimedia(site_url, mode)

Code Limitations

Although this code is suitable for web scraping multimedia files at a small scale, it has some limitations:

  • No techniques to bypass anti-scraping measures: Websites may block your scraper. This code lacks advanced techniques like proxy rotation or CAPTCHA-solving to overcome the block, making it unsuitable for large-scale web scraping.
  • Cannot scrape streaming videos: Streaming videos operate differently than static videos on websites. This code doesn’t capture them.
  • Does not check the content type before downloading: The code assumes that source URLs contain valid extensions; URLs without extensions will download unidentified files.
  • Might download duplicate files: Videos might have multiple source URLs for different formats. This code will download all of them, leading to duplicates.
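To work around the missing content-type check, one possible approach (a sketch, not part of the original script; the helper name filename_for() is our own) is to fall back on the response's Content-Type header whenever the URL lacks a usable extension:

```python
import mimetypes


def filename_for(url, content_type):
    """Derive a file name from the URL; fall back to the Content-Type header
    (e.g. response.headers.get('Content-Type')) when the URL has no extension."""
    name = url.split('/')[-1].split('?')[0]
    if '.' not in name:
        # strip any parameters, e.g. 'image/png; charset=binary' -> 'image/png'
        ext = mimetypes.guess_extension(content_type.split(';')[0].strip()) or ''
        name = (name or 'download') + ext
    return name
```

This keeps files with valid extensions untouched while giving extension-less URLs a sensible guess instead of an unidentified file.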

Wrapping Up

Python libraries like requests, Selenium, and BeautifulSoup allow you to scrape multimedia content. Simply find your target multimedia URL, make an HTTP request to it, and save the result by writing it in binary mode.

However, you need to handle the technical details yourself, and you must be especially careful about compliance when scraping multimedia content.

A web scraping service like ScrapeHero will take care of all these hassles for you.

ScrapeHero offers a fully managed web scraping service capable of building high-quality scrapers and crawlers. We can ensure compliance and take care of all the technical aspects of web scraping.

