Need to Scrape Data from Authenticated Sites? Here’s How



To scrape data from authenticated sites, you need to consider three things: the authentication mechanisms the sites use, the tools you need to negotiate those mechanisms, and the legal framework governing your access.

This guide explores tools, techniques, and ethical considerations to responsibly scrape authenticated websites. Let’s dive in.

Note: This article is for informational purposes. ScrapeHero does not scrape data from authenticated sites.

Understanding Authenticated Websites

Authenticated web scraping involves extracting data from websites that require user authentication, such as logging in with a username and password. Some common authentication mechanisms include:

  • Basic Authentication: This method requires users to provide a username and password.
  • CSRF Tokens: The site also requires a dynamically generated token in addition to the username and password.
  • Web Application Firewalls (WAF): A WAF analyzes incoming requests before they reach the web application. It checks for patterns to detect attacks, which also poses significant challenges while scraping.
  • Multi-Factor Authentication (MFA): Some websites require additional verification steps, such as Time-based One-Time Passwords (TOTP).

Understanding these mechanisms will tell you which tools you need to scrape data from authenticated sites.

Before you learn how to scrape data from authenticated sites, it’s crucial to understand the legalities surrounding web scraping. Many websites have specific rules regarding automated data extraction, and violating these rules can lead to serious consequences. Here are some key points to consider:

  • Robots.txt: Most websites have a robots.txt file that outlines which parts of the site bots can access. Always check this file to see if scraping is allowed; a quick way to do that programmatically is sketched after this list. If it explicitly forbids scraping, it’s best to steer clear.
  • Terms of Service: Many websites have terms that prohibit automated data extraction. Ignoring these can lead to account bans or legal trouble. It’s essential to read and understand these terms before proceeding with your scraping efforts.
  • Privacy Laws: Laws like GDPR in Europe and CCPA in California have strict rules on handling personal data. If you’re planning to scrape password-protected sites, obtain consent and handle personal data responsibly.
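As mentioned in the first point, you can check robots.txt programmatically before scraping. Here’s a minimal sketch using Python’s built-in urllib.robotparser; the URL and user-agent string are placeholders.

from urllib.robotparser import RobotFileParser

# Placeholder URL and user agent; substitute the site you plan to scrape
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

if robots.can_fetch("MyScraperBot", "https://example.com/protected-page"):
    print("robots.txt allows fetching this page")
else:
    print("robots.txt disallows fetching this page; steer clear")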

Want to learn more about the ethics in web scraping? Read this article on ethical web scraping and its legalities in the US.

Scrape Data from Authenticated Sites: Python Libraries

Here are some popular Python libraries for web scraping:

  • Requests: This library allows you to send HTTP requests easily and manage cookies, making it straightforward to log in and maintain a session while scraping.
  • BeautifulSoup: Ideal for parsing HTML once you’ve accessed the content, BeautifulSoup makes it easy to navigate and extract the information you need from web pages.
  • Playwright: Essential for interacting with dynamic content that loads via JavaScript; Playwright automates browser actions, allowing you to simulate user interactions on the site.

Besides these web scraping libraries, the code in this tutorial also uses pyotp for handling MFA.

Basic Authentication

For sites using basic authentication, you can simply make a POST request to the login endpoint with the username and password as the payload.

However, you first need to find out which fields to include in the payload. To do so, open the browser’s developer tools on the login page and go to the Network tab.

The inspect panel on the login page

Enter your username and password and click ‘Login’. The inspect panel will show many requests; click the one named login. Then, go to the Payload section to see which fields to include.

The inspect panel showing the payload

Here’s a function that accepts a login URL, target URL, username, and password and returns the HTML code of the protected page.

import requests

def basic_auth_scrape(login_url, target_url, username, password):
    """Basic authentication using requests."""
    session = requests.Session()
   
    # Login payload
    login_data = {
        'email': username,
        'password': password
    }
   
    # Perform login
    response = session.post(login_url, data=login_data)

    if response.ok:
        # Scrape the target page
        protected_page = session.get(target_url)
        return protected_page.text
    return None

This function:

  1. Initializes a requests.Session() object, which will automatically handle cookies.
  2. Makes a POST request to the login URL with the payload as you saw in the inspect panel.
  3. Makes a GET request to the target URL if the login is successful.
  4. Retrieves the target URL’s HTML source code.  
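For example, you could call the function like this; the URLs and credentials below are placeholders:

# Hypothetical usage example; replace the URLs and credentials with your own
html = basic_auth_scrape(
    login_url="https://example.com/login",       # placeholder login endpoint
    target_url="https://example.com/dashboard",  # placeholder protected page
    username="user@example.com",
    password="secret",
)
if html:
    print(html[:500])  # preview the first part of the protected page's HTML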

Handling CSRF Tokens

Many sites also use Cross-Site Request Forgery (CSRF) tokens to prevent malicious sites from making requests on behalf of an already-authenticated user. These tokens change with every login page load, so you need to scrape a fresh one each time.

You can check the payload after logging in to determine whether the site uses a CSRF token. This time, in addition to the username and password, there will also be a token under the Payload tab.

Inspect panel showing payload with the CSRF token.

This CSRF token was already present on the login page.

Login Page
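If you’d rather discover such hidden fields programmatically, you can fetch the login page and list every hidden input, which is where CSRF tokens usually live. Here’s a minimal sketch; the URL is a placeholder.

import requests
from bs4 import BeautifulSoup

# Placeholder URL; fetch your target site's login page instead
login_page = requests.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "lxml")

# Print the name and value of every hidden input field on the login page
for hidden in soup.find_all("input", {"type": "hidden"}):
    print(hidden.get("name"), hidden.get("value"))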

That means you can scrape this CSRF token and add it to the payload. Here’s a function that’ll do precisely that:

import requests
from bs4 import BeautifulSoup

def csrf_auth_scrape(login_url, target_url, username, password):
    """Authentication with CSRF token handling."""
    session = requests.Session()
   
    # Get the login page first to extract CSRF token
    login_page = session.get(login_url)
    soup = BeautifulSoup(login_page.text, 'lxml')
   
    # Find CSRF token (adjust selector based on website)
    csrf_token = soup.find('input', {'name': 'csrf_token'})['value']
   
    # Login payload with CSRF token
    login_data = {
        'email': username,
        'password': password,
        'csrf_token': csrf_token
    }
   
    # Perform login
    response = session.post(login_url, data=login_data)
   
    if response.ok:
        # Scrape the target page
        protected_page = session.get(target_url)
        return protected_page.text
    return None

This function:

  1. Initializes a requests.Session() object.
  2. Makes a GET request to the login URL.
  3. Parses the login page using BeautifulSoup.
  4. Finds the CSRF token from an input tag.
  5. Makes a POST request to the login URL with username, password, and the extracted CSRF token.
  6. Makes a GET request to the target URL and extracts the HTML page.

Web Application Firewall

Some websites also use a web application firewall (WAF) that monitors the incoming traffic, checking for patterns using rule-based, statistical, or machine-learning algorithms to prevent attacks. However, these WAFs can also detect and block your scraper.

In such cases, you need to log in using a browser automation library like Playwright. These libraries can mimic human behavior, allowing your web scraper to remain undetected. 

Here’s a function that uses Playwright to input the credentials.

import logging
from playwright.sync_api import sync_playwright

def playwright_auth_scrape(login_url, target_url, username, password):
    """Scraping using Playwright for JavaScript-heavy sites."""
    content = None
   
    with sync_playwright() as p:
        try:
            browser = p.chromium.launch(headless=False)
            context = browser.new_context()
            page = context.new_page()
           
            # Navigate to login page
            page.goto(login_url, wait_until="networkidle")
           
            # Fill login form (adjust selectors based on website)
            page.fill('input[name="email"]', username)
            page.fill('input[name="password"]', password)
           
            # Click submit and wait for the navigation
            page.click('button[type="submit"]')
            page.wait_for_load_state("networkidle")
           
            # Navigate to target URL
            page.goto(target_url, wait_until="networkidle")
           
            # Get the content
            content = page.inner_html('body')
           
        except Exception as e:
            logging.error(f"Playwright error: {str(e)}")
            return None
           
        finally:
            if 'browser' in locals():
                browser.close()
               
    return content

This code: 

  1. Launches a Playwright browser with headless=False, so a visible browser window opens (this can help avoid detection).
  2. Creates a new context and a page within that context.
  3. Navigates to the login URL and waits until the page completely loads.
  4. Fills in the credentials using the fill() method.
  5. Clicks the submit button using the click() method and waits for the page to load.
  6. Navigates to the target URL.
  7. Extracts the inner HTML of the body tag.

Multi-Factor Authentication (MFA)

MFA adds another layer of complexity when trying to scrape data from authenticated websites. After entering your credentials, you also need to enter a one-time password. This password is either delivered to you (via email or SMS) or generated using a time-based algorithm (time-based one-time password or TOTP). 
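If you have the TOTP secret (the base32 string shown when you first set up an authenticator app), the pyotp library can generate the current code for you. Here’s a minimal sketch; the secret below is just a placeholder.

import pyotp

# Placeholder base32 secret; use the secret from your own authenticator setup
totp = pyotp.TOTP("JBSWY3DPEHPK3PXP")
print(totp.now())  # six-digit code valid for the current time window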

Here is a function to log in to a website using MFA with a TOTP:

import logging
import pyotp
from playwright.sync_api import sync_playwright

def mfa_auth_scrape(login_url, target_url, username, password, mfa_secret=None):
    """Authentication with MFA support using Playwright."""
    logger = logging.getLogger(__name__)
   
    with sync_playwright() as p:
        try:
            browser = p.chromium.launch(headless=False)
            page = browser.new_page()
           
            # Handle login
            logger.info("Attempting login with MFA...")
            page.goto(login_url, wait_until="networkidle")
            page.fill('input[name="email"]', username)
            page.fill('input[name="password"]', password)
            page.click('button[type="submit"]')
           
            # Handle MFA if secret provided
            if mfa_secret:
                logger.info("Handling MFA...")
                try:
                    totp = pyotp.TOTP(mfa_secret)
                    mfa_code = totp.now()
                    page.wait_for_selector('input[name="totp_code"]')
                    page.fill('input[name="totp_code"]', mfa_code)
                    page.press('input[name="totp_code"]', "Enter")
                except Exception as e:
                    logger.error(f"MFA failed: {str(e)}")
                    return None

            # Get target page content
            logger.info("Navigating to target page...")
            page.wait_for_load_state("networkidle")
            page.goto(target_url, wait_until="networkidle")
            content = page.content()
            return content
           
        except Exception as e:
            logger.error(f"MFA auth error: {str(e)}")
            return None
           
        finally:
            if 'browser' in locals():
                browser.close()

This function is a modified version of playwright_auth_scrape(). After logging in with credentials, the function:

  1. Passes the secret to pyotp.TOTP()
  2. Generates the current code using the now() method
  3. Selects the appropriate input field and enters the TOTP code
  4. Presses ‘Enter’
  5. Navigates to the target web page and returns the HTML code

MFA verification screen

Below, you can view the complete code to scrape data from authenticated sites. This code scrapes article information from a mock server.

The mock server page containing a list of articles this code scrapes

import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright
import argparse
import logging
import json
import pyotp

def extract_data(content):
    """Extract data from HTML content."""
    soup = BeautifulSoup(content, 'lxml')
    # Extract data (adjust selectors based on website)
    articles = soup.find_all('div', {'class': 'card-body'})
    data = []

    for article in articles:
        title = article.find('h5', {'class': 'card-title'}).text
        subtitle = article.find('h6', {'class': 'card-subtitle'}).text
        content = article.find('p', {'class': 'card-text'}).text

        data.append({
            'title': title,
            'subtitle': subtitle,
            'content': content
        })

    return data

def basic_auth_scrape(login_url, target_url, username, password):
    """Basic authentication using requests."""
    session = requests.Session()
   
    # Login payload
    login_data = {
        'email': username,
        'password': password
    }
   
    # Perform login
    response = session.post(login_url, data=login_data)

    if response.ok:
        # Scrape the target page
        protected_page = session.get(target_url)
        return protected_page.text
    return None

def csrf_auth_scrape(login_url, target_url, username, password):
    """Authentication with CSRF token handling."""
    session = requests.Session()
   
    # Get the login page first to extract CSRF token
    login_page = session.get(login_url)
    soup = BeautifulSoup(login_page.text, 'lxml')
   
    # Find CSRF token (adjust selector based on website)
    csrf_token = soup.find('input', {'name': 'csrf_token'})['value']
   
    # Login payload with CSRF token
    login_data = {
        'email': username,
        'password': password,
        'csrf_token': csrf_token
    }
   
    # Perform login
    response = session.post(login_url, data=login_data)
   
    if response.ok:
        # Scrape the target page
        protected_page = session.get(target_url)
        return protected_page.text
    return None

def playwright_auth_scrape(login_url, target_url, username, password):
    """Scraping using Playwright for JavaScript-heavy sites."""
    content = None
   
    with sync_playwright() as p:
        try:
            browser = p.chromium.launch(headless=False)
            context = browser.new_context()
            page = context.new_page()
           
            # Navigate to login page
            page.goto(login_url, wait_until="networkidle")
           
            # Fill login form (adjust selectors based on website)
            page.fill('input[name="email"]', username)
            page.fill('input[name="password"]', password)
           
            # Click submit and wait for navigation
            page.click('button[type="submit"]')
            page.wait_for_load_state("networkidle")
           
            # Navigate to target URL
            page.goto(target_url, wait_until="networkidle")
           
            # Get the content
            content = page.inner_html('body')
           
        except Exception as e:
            logging.error(f"Playwright error: {str(e)}")
            return None
           
        finally:
            if 'browser' in locals():
                browser.close()
               
    return content

def mfa_auth_scrape(login_url, target_url, username, password, mfa_secret=None):
    """Authentication with MFA support using Playwright."""
    logger = logging.getLogger(__name__)
   
    with sync_playwright() as p:
        try:
            browser = p.chromium.launch(headless=False)
            page = browser.new_page()
           
            # Handle login
            logger.info("Attempting login with MFA...")
            page.goto(login_url, wait_until="networkidle")
            page.fill('input[name="email"]', username)
            page.fill('input[name="password"]', password)
            page.click('button[type="submit"]')
           
            # Handle MFA if secret provided
            if mfa_secret:
                logger.info("Handling MFA...")
                try:
                    totp = pyotp.TOTP(mfa_secret)
                    mfa_code = totp.now()
                    page.wait_for_selector('input[name="totp_code"]')
                    page.fill('input[name="totp_code"]', mfa_code)
                    page.press('input[name="totp_code"]', "Enter")
                except Exception as e:
                    logger.error(f"MFA failed: {str(e)}")
                    return None

            # Get target page content
            logger.info("Navigating to target page...")
            page.wait_for_load_state("networkidle")
            page.goto(target_url, wait_until="networkidle")
            content = page.content()
            return content
           
        except Exception as e:
            logger.error(f"MFA auth error: {str(e)}")
            return None
           
        finally:
            if 'browser' in locals():
                browser.close()

def setup_logging():
    """Setup logging configuration."""
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

def get_auth_method(method_name):
    """Get the authentication method based on name."""
    methods = {
        'basic': basic_auth_scrape,
        'csrf': csrf_auth_scrape,
        'playwright': playwright_auth_scrape,
        'mfa': mfa_auth_scrape
    }
    return methods.get(method_name)

def main():
    parser = argparse.ArgumentParser(description='Authenticate and scrape website content')
    parser.add_argument('--login-url', required=True, help='URL of the login page')
    parser.add_argument('--target-url', required=True, help='URL of the page to scrape')
    parser.add_argument('--username', required=True, help='Login username')
    parser.add_argument('--password', required=True, help='Login password')
    parser.add_argument(
        '--method',
        required=True,
        choices=['basic', 'csrf', 'playwright', 'mfa'],
        help='Authentication method to use'
    )
    parser.add_argument('--mfa-secret', help='TOTP secret key for MFA authentication')
   
    args = parser.parse_args()
   
    # Setup logging
    setup_logging()
    logger = logging.getLogger(__name__)
   
    # Get the selected authentication method
    auth_method = get_auth_method(args.method)
    logger.info(f"Using {args.method} authentication method")
   
    try:
        # Execute the selected method
        if args.method == 'mfa':
            content = auth_method(args.login_url, args.target_url,
                               args.username, args.password, args.mfa_secret)
        else:
            content = auth_method(args.login_url, args.target_url,
                                args.username, args.password)
           
        if content:
            logger.info("Authentication and scraping successful")
            extracted_data = extract_data(content)
            with open('data.json', 'w') as f:
                json.dump(extracted_data, f, indent=4)
            logger.info("Data saved to data.json")
        else:
            logger.error("Authentication or scraping failed")
    except Exception as e:
        logger.error(f"Error: {str(e)}")

if __name__ == "__main__":
    main()

This code also includes two additional functions:

  • extract_data(): This function accepts the HTML code returned by other functions mentioned above and extracts the required data. 
  • main(): This function ties everything together and sets up the script to accept command-line arguments. It also writes the extracted data to a JSON file (data.json); see the example invocation below.
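For example, assuming you saved the script as auth_scrape.py (the file name, URLs, and credentials here are placeholders), you could run it like this:

python auth_scrape.py --login-url https://example.com/login --target-url https://example.com/articles --username user@example.com --password secret --method csrf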

Conclusion

You can scrape data from authenticated sites by understanding the various authentication mechanisms and using the appropriate tools. However, you need to be especially careful about compliance when scraping authenticated websites; ensure you have permission to do so.

If you need public data, ScrapeHero has you covered. Our web scraping service also takes care of compliance when scraping data from unauthenticated sites.

ScrapeHero is a fully managed web scraping service provider. We can build enterprise-grade scraping solutions just for you. Our services cover the entire data pipeline, from data extraction to custom AI solutions.
