Static scraping is pretty simple; you send an HTTP request and parse the response. However, it’s a different story when you want to scrape JavaScript-rich websites. Why? Because the HTTP response may lack the data you want.
Websites that use a lot of JavaScript typically generate HTML elements by executing that JavaScript. Since request-based methods don't run JavaScript, you need alternative strategies to extract data.
This guide will outline three main approaches for scraping content from JavaScript-heavy websites.
Scrape JavaScript-rich Websites: Methods
The three main ways to scrape a dynamic website are:
- API Endpoint
- Script Tags
- Headless Browsers
From an API Endpoint
Many websites fetch dynamic data by executing JavaScript that sends a GET request to an API endpoint. By calling the same endpoint yourself, you can extract that data with plain GET requests.
Typically, the response from this GET request is in JSON format, which allows you to retrieve data without needing HTML parsers.
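For example, here's a minimal sketch using Python's requests library, assuming you've already spotted the endpoint in your browser's Network tab (the URL and JSON keys below are hypothetical):
import requests

# Hypothetical endpoint discovered via the browser's Network tab
url = "https://www.example.com/api/movies?page=1"
headers = {"User-Agent": "Mozilla/5.0"}  # some endpoints reject clients without a user agent

response = requests.get(url, headers=headers)
response.raise_for_status()

# The response is JSON, so no HTML parsing is needed
data = response.json()
for movie in data.get("movies", []):
    print(movie.get("title"))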
Pros
- Requires less coding
- Works with request-based methods
- Low resource usage
Cons
- Finding the right endpoint can be tricky
- Extracting data from JSON can be tedious
- Not all websites provide API endpoints
From Script Tags
Another way websites display HTML content via JavaScript is by embedding all the necessary information within a script tag in JSON format. The website then executes JavaScript to render the content when needed. In this case, you can still use HTTP requests to extract data.
However, this approach requires a bit more effort. You need to:
- Parse the HTTP response using an HTML parser like BeautifulSoup.
- Identify the script tag containing the JSON string.
- Extract that string.
Finding the right script tag may vary in difficulty depending on the HTML structure; both cases are sketched after this list:
- If the script tag has a clear ID or class indicating it contains JSON data, you can use BeautifulSoup’s find method directly.
- If not, you’ll need to loop through all script tags and search for one that contains a specific keyword identifying it as the JSON source.
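Here's a rough sketch covering both cases; the URL and keyword are hypothetical, while the __NEXT_DATA__ id is a common real-world pattern on Next.js sites:
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com/movies").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")

# Case 1: the script tag has a recognizable id
tag = soup.find("script", id="__NEXT_DATA__")

# Case 2: loop through all script tags and match on a known keyword
if tag is None:
    for script in soup.find_all("script"):
        if script.string and "movieList" in script.string:  # hypothetical keyword
            tag = script
            break

data = json.loads(tag.string)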
Pros
- Compatible with request-based methods
- Low resource consumption
Cons
- Extracting the JSON string can be challenging
- Finding the specific data within JSON can be challenging
Using Browser Automation Libraries
You can also use browser automation libraries that execute JavaScript, enabling you to scrape JavaScript-rich websites. Popular options for scraping JavaScript-based websites include:
- Selenium
- Playwright
- Puppeteer
Web Scraping With Selenium
You can install Selenium with a single command using pip.
pip install selenium
To use Selenium for web scraping, you need to import two of its modules:
- webdriver: Controls the browser.
- By: Specifies how to locate elements.
from selenium import webdriver
from selenium.webdriver.common.by import By
Selenium starts in headful mode (with a visible browser window) by default:
browser = webdriver.Chrome()
To run it in headless mode (without a window), pass the '--headless' argument through the browser options. For example, for Chrome:
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
The next step is to navigate to the target website using get().
browser.get("https://www.youtube.com/feed/trending")
You can then extract data using either of these methods:
1. Use find_element() or find_elements() to directly extract data
movie_titles = browser.find_elements(By.XPATH, "//div[@id='title-wrapper']")
2. Retrieve the HTML code of the web page and use other parsers, like BeautifulSoup, to extract data
from bs4 import BeautifulSoup
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
movie_titles = soup.find_all('div', {'id': 'title-wrapper'})
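Putting the pieces together, here's a minimal end-to-end sketch that reuses the XPath and URL from the examples above:
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

browser.get("https://www.youtube.com/feed/trending")

# Print the text of each matching element
movie_titles = browser.find_elements(By.XPATH, "//div[@id='title-wrapper']")
for title in movie_titles:
    print(title.text)

browser.quit()  # release the browser when done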
Web Scraping With Playwright
To install Playwright, run:
pip install playwright
playwright install
Playwright offers both synchronous and asynchronous APIs; the asynchronous API is better suited to scraping, especially when you want to handle multiple pages concurrently. Therefore, import these packages:
from playwright.async_api import async_playwright
import asyncio
To scrape asynchronously, you need to build an asynchronous function:
async def your_function():
The Playwright browser starts in headless mode by default. If you want to run it in headful mode, just pass the headless=False argument.
async with async_playwright() as p:
    browser = await p.chromium.launch(headless=False)
Once the browser is up and running, you can open a new page and navigate to your desired website. The browser renders JavaScript and displays content just as it would during normal browsing.
page = await browser.new_page()
await page.goto("https://www.youtube.com/feed/trending")
To select elements, you can use the query_selector_all() method in Playwright. This method supports various selector types, including CSS and XPath. Here’s how to use XPath:
movie_titles = await page.query_selector_all("xpath=//div[@id='title-wrapper']")
The code above finds all elements that match the selector. Finally, you can execute this function using the asyncio.run() method.
asyncio.run(your_function())
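Assembled into a single runnable sketch, the whole flow looks like this (inner_text() reads each element's rendered text):
import asyncio
from playwright.async_api import async_playwright

async def scrape_titles():
    async with async_playwright() as p:
        browser = await p.chromium.launch()  # headless by default
        page = await browser.new_page()
        await page.goto("https://www.youtube.com/feed/trending")
        elements = await page.query_selector_all("xpath=//div[@id='title-wrapper']")
        titles = [await element.inner_text() for element in elements]
        await browser.close()
        return titles

print(asyncio.run(scrape_titles()))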
Web Scraping Using Puppeteer
Both of the automation libraries mentioned above work with popular languages like Python and JavaScript. Puppeteer, however, is designed specifically for JavaScript (Node.js) web scraping.
You can install it using the Node package manager, npm:
npm install puppeteer
Puppeteer actions—such as launching browsers and navigating pages—are asynchronous; thus, you’ll need to define an asynchronous function.
const puppeteer = require('puppeteer');
async function scrapeTitles() {
// code to scrape
}
You can perform web scraping with Puppeteer in either headless or headful mode; headless is the default. To run in headful mode, pass the headless: false option.
const window = await puppeteer.launch({
  headless: false,
  defaultViewport: { width: 1280, height: 800 }
});
After launching Puppeteer, open a new page and navigate to your target web page.
const tab = await window.newPage();
await tab.goto("https://www.youtube.com/feed/trending", {
  waitUntil: 'networkidle0'
});
Next, extract data using evaluate(), which executes code within the context of the opened tab.
const titles = await tab.evaluate(() => {
  // Helper: run an XPath query and collect all matching nodes
  const getAllElementsByXPath = (path) => {
    const elements = [];
    const query = document.evaluate(
      path,
      document,
      null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
      null
    );
    for (let i = 0; i < query.snapshotLength; i++) {
      elements.push(query.snapshotItem(i));
    }
    return elements;
  };

  const titleElements = getAllElementsByXPath("//div[@id='title-wrapper']");
  return titleElements.map(element => element.textContent.trim());
});
The above code passes an arrow function to evaluate() that:
- Evaluates the XPath expression with document.evaluate() to locate matching elements
- Collects those elements in an array named elements
- Extracts and trims the text of each element
Pros
- Allows for interactions like clicking and filling forms
- Works on most websites
Cons
- More resource intensive
- Slower than request-based methods
Why Use a Web Scraping Service?
The techniques discussed here will help you scrape JavaScript-rich websites: you can call API endpoints directly, extract JSON embedded in script tags, or employ a headless browser.
However, you don’t have to handle all this coding yourself. A web scraping service like ScrapeHero can assist you.
ScrapeHero is a fully managed web scraping service provider capable of developing enterprise-grade scrapers and crawlers. We’ll take care of selecting the appropriate method for your needs, allowing you to focus on using data.