Upgrade Your Web Scraping Skills: Scrape JavaScript-Rich Websites

Static scraping is pretty simple; you send an HTTP request and parse the response. However, it’s a different story when you want to scrape JavaScript-rich websites. Why? Because the HTTP response may lack the data you want.

Websites that use a lot of JavaScript typically generate HTML elements by executing that JavaScript. Since requests-based methods don’t run JavaScript, you need alternative strategies to extract data.

This guide will outline three main approaches for scraping content from JavaScript-heavy websites. 

Scrape JavaScript-rich Websites: Methods

The three main ways to scrape a dynamic website are:

  • API Endpoint
  • Script Tags
  • Headless Browsers

From an API Endpoint

Many websites load dynamic data by executing JavaScript that sends a GET request to an API endpoint. By calling the same endpoint yourself, you can extract that data directly.

Typically, the response from this GET request is in JSON format, which allows you to retrieve data without needing HTML parsers. 

For more details on this method, check out the article on scraping Stocktwits data.
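
As a minimal sketch of the idea in Python, assuming a hypothetical endpoint and JSON layout that you would discover in your browser's developer tools (Network tab):

import requests

# Hypothetical endpoint and parameters; find the real ones in your
# browser's developer tools under the Network tab (XHR/Fetch requests).
url = "https://example.com/api/products"
params = {"page": 1, "limit": 20}
headers = {"User-Agent": "Mozilla/5.0"}  # some endpoints reject bare requests

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()

data = response.json()  # already-structured JSON, no HTML parsing needed
for product in data.get("products", []):
    print(product.get("name"), product.get("price"))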

Pros

  • Requires less coding
  • Works with request-based methods
  • Low resource usage

Cons

  • Finding the right endpoint can be tricky
  • Extracting data from JSON can be tedious
  • Not all websites provide API endpoints

From Script Tags

Another way websites render content via JavaScript is by embedding the necessary data as JSON inside a script tag. The page then executes JavaScript to render that data when needed. In this case, you can still use HTTP requests to extract the data.

However, this approach requires a bit more effort. You need to:

  1. Parse the HTTP response using an HTML parser like BeautifulSoup. 
  2. Identify the script tag containing the JSON string.
  3. Extract that string.

Finding the right script tag may vary in difficulty depending on the HTML structure:

  • If the script tag has a clear ID or class indicating it contains JSON data, you can use BeautifulSoup’s find method directly. 
  • If not, you’ll need to loop through all script tags and search for one that contains a specific keyword identifying it as the JSON source. 
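
As a rough sketch of this workflow in Python (the URL, keyword, and JSON layout here are illustrative assumptions):

import json

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")  # hypothetical page
soup = BeautifulSoup(response.text, "html.parser")

data = None
for script in soup.find_all("script"):
    # Identify the script tag that holds the JSON by a known keyword
    if script.string and "productList" in script.string:
        raw = script.string.strip()
        # Trim any surrounding JavaScript, e.g. 'var data = {...};'
        data = json.loads(raw[raw.find("{"): raw.rfind("}") + 1])
        break

if data:
    for product in data.get("productList", []):
        print(product.get("name"))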

Want to learn more about this method? Read this article on web scraping YouTube.

Pros

  • Compatible with request-based methods
  • Low resource consumption

Cons

  • Extracting the JSON string can be challenging
  • Finding the specific data within JSON can be challenging

Using Browser Automation Libraries

You can also use browser automation libraries that execute JavaScript, enabling you to scrape JavaScript-rich websites. Popular options include:

  • Selenium
  • Playwright
  • Puppeteer

Web Scraping With Selenium

You can install Selenium with a single command using pip.

pip install selenium

To use Selenium for web scraping, you need to import two of its modules:

  1. webdriver: Controls the browser.
  2. By: Specifies how to locate elements.

from selenium import webdriver
from selenium.webdriver.common.by import By

Selenium starts in headful mode (with a visible browser window) by default. To run it in headless mode (without a window), add an argument to the browser options. For example, for Chrome:

# Default: launches with a visible browser window
browser = webdriver.Chrome()

# Headless: pass the --headless flag via ChromeOptions
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

The next step is to navigate to the target website using get().

browser.get("https://www.youtube.com/feed/trending")

You can then extract data using either of these methods:

1. Use find_element() or find_elements() to directly extract data

movie_titles = browser.find_elements(By.XPATH,"//div[@id='title-wrapper']")
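
Note that find_elements() returns WebElement objects rather than strings, so to get the visible titles you can read each element's text property:

titles = [element.text for element in movie_titles]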

2. Retrieve the HTML code of the web page and use other parsers, like BeautifulSoup, to extract data

from bs4 import BeautifulSoup

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
movie_titles = soup.find_all('div', {'id': 'title-wrapper'})

Want more detail? Read this article on Selenium web scraping.

Web Scraping With Playwright

To install Playwright, run:

pip install playwright
playwright install

Playwright offers both synchronous and asynchronous APIs; the asynchronous API is generally better suited to scraping because it can handle multiple pages concurrently. To use it, import these packages:

from playwright.async_api import async_playwright
import asyncio

To scrape asynchronously, you need to build an asynchronous function:

async def your_function():

The Playwright browser starts in headless mode by default. If you want to run it in headful mode, just add the argument ‘headless=False’.

async with async_playwright() as p:
    browser = await p.chromium.launch(headless=False)

Once the browser is up and running, you can open a new page and navigate to your target website. The browser renders JavaScript and displays content just as it would during normal browsing.

page = await browser.new_page()
await page.goto("https://www.youtube.com/feed/trending")

To select elements, you can use the query_selector_all() method in Playwright. This method supports various selector types, including CSS and XPath. Here’s how to use XPath:

movie_titles = await page.query_selector_all("xpath=//div[@id='title-wrapper']")

The code above finds all elements that match the selector. Finally, you can execute this function using asyncio.run().

asyncio.run(your_function())
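
Putting the pieces together, here's a minimal end-to-end sketch; the function name scrape_titles is illustrative, and the browser runs in headless mode by default:

import asyncio
from playwright.async_api import async_playwright

async def scrape_titles():
    async with async_playwright() as p:
        browser = await p.chromium.launch()  # headless by default
        page = await browser.new_page()
        await page.goto("https://www.youtube.com/feed/trending")
        # Collect the text of every element matching the XPath selector
        elements = await page.query_selector_all("xpath=//div[@id='title-wrapper']")
        titles = [await element.inner_text() for element in elements]
        await browser.close()
        return titles

print(asyncio.run(scrape_titles()))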

Confused about what to use? Check out this article on Playwright vs Selenium.

Web Scraping Using Puppeteer

Both of the browser automation libraries above offer bindings for popular languages, including Python and JavaScript. Puppeteer, however, is designed specifically for web scraping in JavaScript (Node.js).

You can install it using the Node package manager, npm:

npm install puppeteer

Puppeteer actions—such as launching browsers and navigating pages—are asynchronous; thus, you’ll need to define an asynchronous function.

const puppeteer = require('puppeteer');

async function scrapeTitles() {
	// code to scrape
}

You can perform web scraping with Puppeteer in either headless or headful mode; headless is the default. To run in headful mode, pass the ‘headless: false’ option.

const window = await puppeteer.launch({
    headless: false,
    defaultViewport: { width: 1280, height: 800 }
});

After launching Puppeteer, open a new page and navigate to your target web page.

const tab = await window.newPage();

await tab.goto("https://www.youtube.com/feed/trending", {
    waitUntil: 'networkidle0'
});

Next, extract data using evaluate(), which executes code within the context of the opened tab.

const titles = await tab.evaluate(() => {
    const getAllElementsByXPath = (path) => {
        const elements = [];
        const query = document.evaluate(
            path,
            document,
            null,
            XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
            null
        );

        for (let i = 0; i < query.snapshotLength; i++) {
            elements.push(query.snapshotItem(i));
        }
        return elements;
    };

    const titleElements = getAllElementsByXPath("//div[@id='title-wrapper']");
    return titleElements.map(element => element.textContent.trim());
});

The above code uses an arrow function to:

  1. Evaluate the XPath and locate matching elements using document.evaluate()
  2. Store those elements in an array named elements
  3. Extract and return the trimmed text of each element
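
Finally, assuming the snippets above sit inside scrapeTitles(), you would close the browser with await window.close() and then invoke scrapeTitles() to run the scraper.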

Pros

  • Allows for interactions like clicking and filling forms
  • Works on most websites

Cons

  • More resource intensive
  • Slower than request-based methods

Want to know about other tools for web scraping in JavaScript? Here’s an article on JavaScript web scraping tools and frameworks.

Why Use a Web Scraping Service?

The techniques discussed here will help you scrape JavaScript-rich websites. Depending on the site, you can use API endpoints, extract data from script tags, or employ a headless browser.

However, you don’t have to handle all this coding yourself. A web scraping service like ScrapeHero can assist you. 

ScrapeHero is a fully managed web scraping service provider capable of developing enterprise-grade scrapers and crawlers. We’ll take care of selecting the appropriate method for your needs, allowing you to focus on using the data.


Ready to turn the internet into meaningful and usable data?

Contact us to schedule a brief, introductory call with our experts and learn how we can assist your needs.
