You can scrape multimedia content from websites much as you would text, with one key difference: you need to handle multimedia files (images, audio, or video) as binary data.
Confused how to get started? Read on.
How to Scrape Multimedia Content: The Environment
Depending on the website, you might need to switch between request-based methods and browser automation libraries.
For static sites—those that don’t use JavaScript to generate HTML—you can use Python’s requests library to fetch the HTML code. However, to scrape a dynamic website, opt for browser automation libraries like Selenium.
To parse the web page and extract content, BeautifulSoup, along with a parser like lxml, is a great choice.
You can install all these packages using the Python package manager, pip:
pip install requests selenium beautifulsoup4 lxml
How to Scrape Multimedia Content: The Code
The complexity of the code depends on how you want to extract multimedia.
From a Multimedia URL
The code is relatively straightforward if you have the URL of the required multimedia content.
First, import requests and make an HTTP request to the URL.
import requests
response = requests.get(url)
Save the response content in binary mode.
# extract the file name
name = url.split('/')[-1]
# save the file
with open(name, 'wb') as f:
    f.write(response.content)
If the multimedia content is large, it’s better to save it in chunks to avoid exhausting your RAM. Request the URL with stream=True so the body isn’t loaded into memory all at once, and check the status code so that failed requests don’t produce empty files.
size = 1024 * 1024
response = requests.get(url, stream=True)
if response.status_code == 200:
    with open(name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=size):
            if chunk:
                f.write(chunk)
else:
    print(f"Failed: {response.status_code}")
From a Web Page
To gather all multimedia content from a specific page, create a function that searches for all tag names that typically hold multimedia and retrieves the corresponding URLs. This function—which we’ll call extract()—will accept a BeautifulSoup object and extract multimedia from the page.
It starts by iterating through a list of tag names that usually contain multimedia on a web page.
In each iteration of the loop, extract() finds all the elements with the current tag name.
for name in tag_names:
    tags = soup.find_all(name)
Then, the code iterates through all the tags and:
1. Gets the URL from the tag’s src attribute.
url = urljoin(site_url, tag.get('src')) if tag.get('src') else None
2. Makes an HTTP request to the URL.
file = requests.get(url)
3. Saves the response to a file in binary mode.
name = url.split('/')[-1].split('?')[0]
with open(f"{domain}/{name}", 'wb') as f:
    f.write(file.content)
print('File downloaded:', name)
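Putting these steps together, extract() might look like the following sketch. Splitting the URL collection into a helper, collect_media_urls(), is our own addition for clarity; tag_names, site_url, and domain are defined later in the article.

```python
import requests
from urllib.parse import urljoin

def collect_media_urls(tag_names, soup, site_url):
    """Collect absolute multimedia URLs from the src attributes of the given tags."""
    urls = []
    for name in tag_names:
        for tag in soup.find_all(name):
            src = tag.get('src')  # .get() avoids a KeyError when src is missing
            if src:
                urls.append(urljoin(site_url, src))
    return urls

def extract(tag_names, soup, site_url, domain):
    """Download every multimedia file found on the page into the domain directory."""
    for url in collect_media_urls(tag_names, soup, site_url):
        file = requests.get(url)
        name = url.split('/')[-1].split('?')[0]  # drop the path and any query string
        with open(f"{domain}/{name}", 'wb') as f:
            f.write(file.content)
        print('File downloaded:', name)
```

Keeping collect_media_urls() separate also makes the URL-gathering logic easy to test without making any HTTP requests.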
You can now call extract(), but first, you’ll need to get the URL of the target web page from which you want to scrape multimedia. Use the argparse package to obtain this URL from the user:
1. Create an ArgumentParser() object.
argparser = argparse.ArgumentParser()
2. Add an argument for the URL.
argparser.add_argument('-u','--url',help='Add the URL of the target web page',required=True)
To determine whether to use Selenium or requests, add an optional boolean argument; if specified during script execution, this will trigger Selenium.
argparser.add_argument('-d','--dynamic',action='store_true',help='Using dynamic mode')
3. Parse both arguments.
args = argparser.parse_args()
- Store the target URL in a variable called site_url
site_url = args.url
- Store the boolean value in the variable dynamic
mode = args.dynamic
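Wrapped into a single helper (the name parse_cli is our own choice), the argument handling looks like this:

```python
import argparse

def parse_cli(argv=None):
    """Parse the target URL and the optional dynamic-mode flag."""
    argparser = argparse.ArgumentParser()
    argparser.add_argument('-u', '--url',
                           help='Add the URL of the target web page', required=True)
    argparser.add_argument('-d', '--dynamic', action='store_true',
                           help='Use dynamic (Selenium) mode')
    return argparser.parse_args(argv)

# Typical invocation from the shell:
#   python scraper.py -u https://example.com -d
```

You would then read args.url and args.dynamic into site_url and mode, as shown above.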
Define request headers that mimic a real browser; these may help your scraper evade basic anti-scraping measures employed by the target website.
headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
'*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
'dpr': '1',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
Next, check whether the variable mode is true or false. If it is false, use the Python requests library to make an HTTP request to the URL extracted from the arguments.
if not mode:
    response = requests.get(site_url, headers=headers).text
If it is true:
- Launch the Selenium browser.
- Navigate to the URL stored in site_url.
- Retrieve the HTML source code using the page_source attribute.
else:
    browser = webdriver.Chrome()
    browser.get(site_url)
    time.sleep(3)
    response = browser.page_source
Now, parse the retrieved HTML source code using BeautifulSoup and lxml, creating a BeautifulSoup object. The function extract() will extract the multimedia URLs from this object.
soup = BeautifulSoup(response,'lxml')
Define an array containing the tag names; extract() will iterate through this array while searching for multimedia URLs.
tag_names = ['img', 'source']
The downloaded media files can clutter the directory containing the script. Therefore, create a separate directory for each target webpage:
1. Derive a name from the target website’s URL.
domain = site_url.split('/')[-1].split('?')[0]
2. Use os.makedirs() to create this folder.
os.makedirs(domain,exist_ok=True)
Finally, call extract(), passing the tag names, the BeautifulSoup object, the target URL, and the directory name.
extract(tag_names, soup, site_url, domain)
Using a CSV File
So far, this code allows you to perform multimedia scraping from a single webpage. However, you can also scrape multimedia content from multiple web pages with a single script by getting their URLs from a CSV file.
To do this, create a CSV file containing URLs and a boolean value indicating whether or not Selenium should be used.
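For example, urls.csv might look like this (the header names are our own choice; the loop below reads the columns by position, and pandas parses the True/False values as booleans):

```
url,dynamic
https://example.com/static-gallery,False
https://example.com/js-gallery,True
```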
Now, you can move your current scraping logic to a function, scrape_multimedia(), which you can call in a loop for each URL extracted from the CSV file.
This function:
- Accepts a URL and a boolean value.
- Uses requests to fetch the HTML page if the value is false; otherwise, uses Selenium.
- Parses the HTML page.
- Creates a directory.
- Calls extract() with the tag names, BeautifulSoup object, URL, and directory name as arguments.
def scrape_multimedia(site_url, mode=False):
    if not mode:
        response = requests.get(site_url, headers=headers).text
    else:
        browser = webdriver.Chrome()
        browser.get(site_url)
        time.sleep(3)
        response = browser.page_source
    soup = BeautifulSoup(response, 'lxml')
    tag_names = ['img', 'source']
    domain = site_url.split('/')[-1].split('?')[0]
    os.makedirs(domain, exist_ok=True)
    extract(tag_names, soup, site_url, domain)
You can then read your previously created CSV file and iterate through its rows. In each iteration:
- Extract the URL and boolean value
- Call scrape_multimedia() with these extracted values as arguments
urls = pandas.read_csv('urls.csv')
for _, row in urls.iterrows():
    site_url = row.iloc[0]
    mode = row.iloc[1]
    scrape_multimedia(site_url, mode)
Code Limitations
Although this code is suitable for web scraping multimedia files at a small scale, it has some limitations:
- No techniques to bypass anti-scraping measures: Websites may block your scraper. This code lacks advanced techniques like proxy rotation or CAPTCHA-solving to overcome the block, making it unsuitable for large-scale web scraping.
- Cannot scrape streaming videos: Streaming videos operate differently than static videos on websites. This code doesn’t capture them.
- Does not check the content type before downloading: The code assumes that source URLs contain valid extensions; URLs without extensions will download unidentified files.
- Might download duplicate files: Videos might have multiple source URLs for different formats. This code will download all of them, leading to duplicates.
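As a mitigation sketch for the third limitation, you could derive the file extension from the response’s Content-Type header instead of the URL. The helper extension_for() below is our own, built on Python’s standard mimetypes module; the mapping it relies on is not exhaustive.

```python
import mimetypes

def extension_for(content_type):
    """Map a Content-Type header value to a file extension, e.g. 'image/png' -> '.png'."""
    media_type = content_type.split(';')[0].strip()  # drop parameters like charset
    return mimetypes.guess_extension(media_type) or ''

# If a URL's last segment has no extension, you could append one derived
# from the header before saving, for example:
#   name = name + extension_for(response.headers.get('Content-Type', ''))
```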
Wrapping Up
Python libraries like requests, Selenium, and BeautifulSoup allow you to scrape multimedia content. Simply find your target multimedia URL, make an HTTP request to it, and save the response in binary mode.
However, you need to handle these technicalities yourself, and you should be especially careful about compliance when scraping multimedia content.
A web scraping service like ScrapeHero will take care of all these hassles for you.
ScrapeHero offers a fully managed web scraping service capable of building high-quality scrapers and crawlers. We can ensure compliance and take care of all the technical aspects of web scraping.