Web Scraping Netflix: How to do it?

Share:

Web Scraping Netflix

Netflix is a dynamic website, with most of its dynamic content only appearing after you log in. However, web scraping Netflix focuses on public pages that are not dynamic. This makes it possible to scrape data from Netflix using HTTP requests; this tutorial will guide you through the process.

The tutorial shows you how to scrape Netflix data with Python using two libraries: Python requests and BeautifulSoup.

Note: This article covers scraping movie details, not the movies themselves. Extracting movies from Netflix without permission is illegal as it violates copyright laws. 

Data Scraped from Netflix

This tutorial will help you extract details of movies and series. Initially, it extracts the links to movies and series from these URLs:

You can use F12 to inspect the webpage to identify the tags containing the links to movies/series. 

Netflix page showing links on the inspect panel

Once the code extracts the URLs, it extracts specific details from each one. Here is what it collects:

  1. Name
  2. Genre
  3. Synopsis
  4. Related Genres
  5. Audio
  6. Subtitles
  7. Release Date
  8. Rating
  9. Mood
  10. Duration/Seasons
  11. Cast
  12. URL

Movie details scraped from Netflix

Web Scraping Netflix: The Environment

This code requires two external packages, which you need to install:

Python requests fetches the HTML source code of the target web page. Meanwhile, BeautifulSoup helps you parse and extract the required data. 

You can install these packages using the Python package manager pip.

pip install bs4 requests

Additionally, the code also uses the json module to save the extracted data to a JSON file. 

Want to learn more about BeautifulSoup? Check this tutorial: Web Scraping Using BeautifulSoup.

Web Scraping Netflix: The Code

Flow chart showing code flow

Start the code for web scraping Netflix by importing the necessary packages. 

import requests
import json


from bs4 import BeautifulSoup

The code can scrape both series and movie details, but it doesn’t do both at the same time. Users will see a prompt asking whether they want to scrape series or movies. 

The input function allows you to display such a prompt.

searchType = input("Enter \n1) for series\n2) for movies")

Based on the user’s choice, the function picks the correct URL from the dict variable. 

selector = {
        "1": series_url,
        "2": movies_url
    }
try:
        url = selectvor[searchType]
    except:
        print("Wrong entry. Enter either 1 or 2.")
        getType()

The try-except block prevents errors if you enter an invalid character. In that case, the function notifies you and calls itself again, making getType() a recursive function. 

Once you have the correct URL, you can start extracting the data. 

Create a function getDetails() that accepts a URL as input and scrapes the details of the movies/series available on that page. Begin by sending an HTTP request to the URL using Python requests.

response = requests.get(url)

Next, parse the response, but only if the status code is 200. 

if response.status_code == 200:
    soup = BeautifulSoup(response.text,'lxml')
else:
    return {"Error":"Server blocked your scraper or the URL is not correct."}

Netflix organizes movies in various sections, so it’s easier to extract each section separately. 

You can select the sections using the find_all method of BeautifulSoup.

sections = soup.find_all('section',{'class':'nm-collections-row'})

Loop through the sections and extract the details. Before the loop, define an empty dict all_sections to store the extracted details. 

all_sections = {}

While looping, use try-except blocks to avoid errors. 

Start by extracting the section name. This is a bit tricky as some sections use the span tag while others use the h2 tag. 

So, use an extra try-except block to handle both cases. 

try:
    section_name = section.find('span',{'class':'nm-collections-row-name'}).text    
except:
    section_name = section.find('h2',{'class':'nm-collections-row-name'}).text

Next, find the URLs that lead to the movies/series pages. 

movie_urls = [item.a['href'] for item in section.ul.find_all('li')]

Now, loop through the URLs to extract movie details. Store the extracted information in an empty dict all_details. 

Inside the loop, the function will:

  1. Send an HTTP request to each URL using Python requests
    movie_response = requests.get(url)
  2. Parse the response using BeautifulSoup
    movie_soup = BeautifulSoup(movie_response.text,'lxml')
  3. Extract the required data using the find or find_all method of BeautifulSoup:
    1. Audio from a span tag with the class ‘item-audio’
      audio = movie_soup.find('span',{'class':'item-audio'})
    2. Subtitles from span tags with the class ‘item-subtitle’
      subtitles = movie_soup.find_all('span',{'class':'item-subtitle'})
    3. Synopsis from a div tag with the class ‘title-info-synopsis’
      synopsis = movie_soup.find('div',{'class':'title-info-synopsis'})
    4. Date from a span tag with the attribute data-uia ‘item-year’
      date = movie_soup.find('span',{'data-uia':'item-year'})
    5. Rating from a span tag with the class ‘maturity-number’
      rating = movie_soup.find('span',{'class':'maturity-number'})
    6. Moods from span tags with the class ‘item-mood-tag’
      moods = movie_soup.find_all('span',{'class':'item-mood-tag'})
    7. Duration/Seasons from a span tag with the class ‘duration’
      duration = movie_soup.find('span',{'class':'duration'})
    8. Cast from span tags with the class ‘item-cast’
      cast = movie_soup.find_all('span',{'class':'item-cast'})
    9. Genre from a span tag with the class ‘item-genre’
      genre = movie_soup.find('a',{'data-uia':'item-genre'})
    10. Related genres from span tags with the class ‘item-genres’
      related_genres = movie_soup.find_all('a',{'class':'item-genres'})
  4. Add the extracted details to the dict all details with the movie name as the key.
all_details[movie_soup.h1.text] = {      
   
        "Genre":genre.text if genre else "Unknown",
        "Synopsis":synopsis.text if synopsis else "Unknown",
        "Related Genres": [sub_genre.text for sub_genre in related_genres] if related_genres else "Unknown",
        "Audio":audio.text if audio else "Unknown",
        "Subtitles":[subtitle.text for subtitle in subtitles] if subtitles else "Unknown",                        
        "Release Date":date.text if date else "Unknown",
        "Rating":rating.text if rating else "Unknown",
        "Mood":[mood.text for mood in moods] if moods else "Unknown",
        "Duration/Seasons":duration.text if duration else "Unknown",
        "Cast":[item.text for item in cast] if cast else "Unknown",
        "URL":url
    }

And outside the loop, add the extracted details of all the movies/series in a section to the dict all_sections. Use the section name as the key. 

all_sections[section_name] = all_details

Finally, this function returns the dict all_sections. 

return all_sections

After extracting data, it’s time to save it. The saveData() function does this for you. It accepts the extracted data as the argument and uses the json.dump() method to save it to a JSON file. 

def saveData(data):
    with open("netflix.json","w",encoding="utf-8") as f:
        json.dump(data,f,indent=4,ensure_ascii=False)

Now, put it all together:

  1. Define variables holding the URLs for listing pages of movies and series.
    series_url = "https://www.netflix.com/in/browse/genre/83"
    movies_url = "https://www.netflix.com/in/browse/genre/34399"
  2. Call getType() to get the URL based on the user’s input
    url = getType()
  3. Call getDetails() with the URL as the argument to retrieve movie details.
    movieDetails = getDetails(url)
  4. Call saveData() to save the extracted data.
    saveData(movieDetails)

Code Limitations

You can use this code to make your own Netflix scraper. But there are some limitations to keep in mind:

  • HTML Structure Changes: The code relies on Netflix.com’s current HTML structure to find data. If Netflix changes the structure, the scraper might break. You’ll then need to update the code to adapt to the changes.
  • Anti-Scraping Measures: This code isn’t built to scrape without getting blocked. It lacks techniques to bypass anti-scraping measures, which could be a problem if you send too many requests. Netflix may block your IP; this also makes it unsuitable for large-scale web scraping.

Wrapping Up: Why Use a Web Scraping Service?

You saw how to scrape Netflix using Python requests and BeautifulSoup. These tools allow you to fetch HTML source code and extract details of movies and series. 

However, the limitations mentioned, like the need to handle anti-scraping measures and continuous monitoring, can make web scraping Netflix challenging. So, if you only need data, why not avoid the hassle by using a web scraping service like ScrapeHero? 

ScrapeHero is a full-service web scraping service provider. we can build enterprise-grade web scrapers according to your specifications. This will let you focus on using the data rather than worrying about technical details. 

We can help with your data or automation needs

Turn the Internet into meaningful, structured and usable data



Please DO NOT contact us for any help with our Tutorials and Code using this form or by calling us, instead please add a comment to the bottom of the tutorial page for help

Table of content

Scrape any website, any format, no sweat.

ScrapeHero is the real deal for enterprise-grade scraping.

Ready to turn the internet into meaningful and usable data?

Contact us to schedule a brief, introductory call with our experts and learn how we can assist your needs.

Continue Reading

Transform and map scraped data

How to Transform and Map Scraped Data with Python Libraries

Learn how you can transform and map data using Python.
Using NLP to clean and structure scraped data

How to Use NLP to Clean and Structure Scraped Data

Learn how to use NLP to clean and structure scraped data.
Search engine web crawling

From Crawling to Ranking! This is How Search Engines Use Web Crawling to Index Websites!

Search engine crawling indexes web pages, making it essential for ranking and visibility in search results.
ScrapeHero Logo

Can we help you get some data?