A guide to scraping real-time data with Python.
The ability to access and analyze real-time data has become crucial for businesses. Real-time data scraping extracts information from websites as they are updated, enabling you to make informed decisions based on the current information.
However, real-time data extraction can be quite challenging because of its dynamic nature. This article aims to teach you how to scrape real-time data effectively using Python.
Real-Time Data Scraping with Python
Real-time data scraping involves repeatedly scraping a website at a high frequency, allowing you to gather updated data with minimal delay.
However, since websites often display real-time data using JavaScript, you need browser automation libraries capable of executing JavaScript.
Here are the steps to scrape data in real time using the browser automation library Selenium:
1. Identify Data Sources
The first step is to identify the target websites:
- Relevance: Choose websites that have data relevant to your project.
- Update Frequency: Look for sites that update their content frequently.
- Accessibility: Check the target website’s scraping restrictions.
This tutorial shows you how to scrape data from worldometers.info, a website that provides real-time statistics.
2. Set Up Your Scraper
The next step is to write the scraping code. Here’s how to write a Python script that extracts data from worldometers.info using Selenium.
- Install Selenium and Pandas
Selenium is used to extract real-time data, and Pandas to manipulate and save it.
pip install selenium pandas
- Import WebDriver and By from Selenium, Pandas, and datetime
# the webdriver module lets you control the browser
from selenium import webdriver
# the By module lets you specify how to locate elements (by class or tag)
from selenium.webdriver.common.by import By
# pandas lets you manipulate the extracted data, and datetime lets you handle
# date- and time-related tasks
import pandas
import datetime
- Create a ChromeOptions object
The ChromeOptions() object lets you add the --headless argument, which launches the Selenium browser in headless mode (without a GUI).
options = webdriver.ChromeOptions()
options.add_argument("--headless")
- Launch the Selenium browser with the options defined above
browser = webdriver.Chrome(options=options)
- Navigate to ‘https://worldometers.info’
browser.get('https://worldometers.info')
- Find all the elements with the class name ‘counter-group’
elements = browser.find_elements(By.CLASS_NAME,'counter-group')
- Extract the counter name and the value from each element
data = [
    {'Counter': element.text.split('\n')[1], 'Count': element.text.split('\n')[0]}
    for element in elements
]
- Create a Pandas DataFrame using the extracted data
df = pandas.DataFrame(data)
- Save the extracted data to a JSON file
df.to_json('Latest.json', orient="records", indent=4)
If you want to track the changes in the dataset, you need to store the data obtained in all the executions. You can use a CSV format for that.
- Format the dataset so that the scraper stores the extracted data as a single row after each execution.
# transpose the DataFrame to get keys and values as rows
newDf = df.transpose()
# Use the items of the first row for column names and delete the first row
newDf.columns = newDf.iloc[0]
newDf = newDf[1:]
# add a timestamp in a new column
now = datetime.datetime.now().timestamp()
newDf.insert(0,'Time_Stamp',now)
- Handle various scenarios, including first-time execution and mismatched columns.
# try reading and updating the existing dataset
try:
    oldDf = pandas.read_csv('worldometers.csv')
    # Check whether the columns of the existing dataset match those of the newly
    # extracted one. If they match, update the existing CSV file; otherwise,
    # archive the old file and start a new one.
    if set(oldDf.columns) == set(newDf.columns):
        oldDf = pandas.concat([oldDf, newDf], ignore_index=True)
        oldDf.to_csv('worldometers.csv', index=False)
    else:
        oldDf.to_csv(f'worldometers_old_{now}.csv', index=False)
        newDf.to_csv('worldometers.csv', index=False)
# if no CSV file exists yet, create it
except FileNotFoundError:
    newDf.to_csv('worldometers.csv', index=False)
except Exception as e:
    print(e)
3. Automate the Scraper
Finally, you need to execute the scraper repeatedly. There are various methods to do that; here are a few of them:
- Using Windows Task Scheduler: You can use the task scheduler that comes with Windows. Read this article on How to Build a Price Tracker to learn how to use a task scheduler.
- Using Apache Airflow: It’s an automation platform you can use for any workflow, including web scraping real-time data. Learn all about automation using Apache Airflow in this article on building a scraping pipeline using Airflow.
- Using Cloud Services: You can use a cloud service, like AWS or Azure, to run your Python script continuously.
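For quick local testing, a simple polling loop can also re-run the scraper at a fixed interval. Here’s a minimal sketch, assuming the script above is saved as scraper.py (a hypothetical filename):
import subprocess
import time

INTERVAL_SECONDS = 300  # re-run the scraper every five minutes; adjust as needed

while True:
    # run the scraper in a child process so a crashed run doesn't kill the loop
    result = subprocess.run(['python', 'scraper.py'])
    if result.returncode != 0:
        print(f'Scraper exited with code {result.returncode}')
    time.sleep(INTERVAL_SECONDS)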
Best Practices for Effective Real-Time Scraping
For efficient and accurate web scraping, follow these best practices:
1. Throttling Requests: By managing the frequency of your requests, you can avoid overwhelming the target server and getting blocked (see the sketch after this list):
- Set delays between requests to give the server breathing room.
- Randomize the request intervals to mimic human behavior.
- Limit concurrent requests to reduce the load on the server.
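Here’s a minimal sketch of these three ideas combined; the URLs and the scrape() function are placeholders for your own targets and extraction logic:
import random
import time
from concurrent.futures import ThreadPoolExecutor

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

def scrape(url):
    # placeholder for your own extraction logic
    print(f'Scraping {url}')

def polite_scrape(url):
    # a randomized delay between 2 and 5 seconds mimics human browsing
    time.sleep(random.uniform(2, 5))
    return scrape(url)

# max_workers caps the number of concurrent requests hitting the server
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(polite_scrape, urls))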
2. Error Handling: Implementing error-handling mechanisms can prevent unexpected issues from stopping your scraper and ease debugging (see the sketch below):
- Set up retries for failed requests.
- Maintain logs of errors encountered during scraping.
- Have fallback strategies to use if your primary strategy fails.
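Here’s a minimal sketch of retries with logging and a fallback; scrape() is a placeholder for your own extraction logic:
import logging
import time

logging.basicConfig(filename='scraper.log', level=logging.INFO)

def scrape():
    # placeholder for your own extraction logic; raises on failure
    raise ConnectionError('simulated failure')

def scrape_with_retries(max_retries=3, delay=10):
    for attempt in range(1, max_retries + 1):
        try:
            return scrape()
        except Exception as e:
            # log each failure so recurring issues are easy to trace
            logging.error('Attempt %d failed: %s', attempt, e)
            time.sleep(delay)
    # fallback strategy: return None (or cached data) after exhausting retries
    logging.warning('All %d attempts failed; using fallback', max_retries)
    return None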
3. Data Quality Assurance: Ensuring data quality and accuracy is crucial for effective analysis (see the sketch below):
- Validate scraped data to ensure it meets the expected formats and values.
- Remove duplicate records to maintain the dataset’s integrity.
- Monitor the scraping results and update the logic to handle structural changes in websites.
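For example, with the Worldometers DataFrame df built in step 2, validation and deduplication might look like this; the exact checks depend on your data:
# keep only rows whose Count matches the expected digits-and-commas format
df = df[df['Count'].str.fullmatch(r'[\d,]+')]

# drop duplicate counters, keeping the first occurrence
df = df.drop_duplicates(subset='Counter')

# fail loudly if nothing valid was extracted; the page structure may have changed
if df.empty:
    raise ValueError('No valid rows scraped; check the page structure')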
Challenges in Real-Time Data Scraping
Understanding the challenges in real-time data scraping is crucial for developing strategies to overcome them:
1. Technical Challenges
Real-time scraping involves navigating a range of technical hurdles that can complicate the data extraction process:
- Real-time data-provider websites use JavaScript to load content, requiring you to use automated browsers, which are more resource-intensive.
- Websites often try to detect and block bots using various anti-scraping mechanisms, including CAPTCHAs, which require countermeasures like proxy rotation and CAPTCHA solvers (see the sketch after this list).
- Changes in a website’s HTML structure may require you to update your scraping logic to avoid failure.
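As an example of one such countermeasure, Selenium can route traffic through a proxy via a Chrome argument. Here’s a minimal sketch; the proxy addresses are placeholders for ones you’d obtain from a proxy provider:
import random
from selenium import webdriver

proxies = ['203.0.113.1:8080', '203.0.113.2:8080']  # placeholder proxy addresses

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# pick a different proxy on each run to spread requests across IP addresses
options.add_argument(f'--proxy-server={random.choice(proxies)}')
browser = webdriver.Chrome(options=options)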
2. Legal Considerations
The legality of web scraping depends on local laws; however, a few guidelines apply broadly:
- Ensure you don’t scrape data that is protected by copyright laws.
- Avoid scraping private information without consent.
3. Ethical Considerations
You also need to consider implementing ethical scraping practices:
- Ensure your web scraping project does not overwhelm the target server by implementing techniques like request delays.
- Consider being transparent about your web scraping activities with the website owner.
Implementing these measures while building your scraper will increase the likelihood of success.
Why Use a Web Scraping Service?
The technique discussed in this article can help you gather real-time data. However, you must overcome the challenges yourself, including anti-scraping measures and changes to a site’s HTML structure.
But you don’t have to do the scraping yourself: you can use a web scraping service.
A web scraping service, like ScrapeHero, can take care of all the challenges and legal issues that come with web scraping. We can manage the anti-scraping measures and have the infrastructure to handle dynamic content smoothly.
ScrapeHero is a fully managed web scraping service capable of building enterprise-grade web scrapers and crawlers. Contact ScrapeHero to get real-time data without hassle.