Fighting Misinformation: Web Scraping for Fake News Detection

Fake news has become a massive problem in our society. LLMs and social media have made it easy to create and spread fake news. However, you can leverage web scraping for fake news detection.

How? Read on. 

This article tells you how to use web scraping to create datasets for training AI/ML models using Python.

Web scraping for fake news detection flow chart

Web Scraping for Fake News Detection: The Environment

The code for scraping news articles requires two external packages:

  1. Python requests: Handles HTTP requests to fetch the HTML code of the article page.
  2. BeautifulSoup: Manages parsing and extracting the required data points.

You can install these packages using Python’s PIP.
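
For example, a standard installation with pip looks like this:

pip install requests beautifulsoup4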

What to Scrape

You need training data for the AI/ML models. Here, the code builds a dataset for fake news detection by scraping three details from each article (a sample record is shown after the list):

  • Article Content
  • Headline
  • Author’s Name
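
Each record in the finished dataset combines these fields with the article's URL and a real/fake label. The keys below match what the code produces; the values are only placeholders:

{
    "url": "https://example.com/news/some-article",
    "title": "Example headline",
    "content": "Full article text gathered from the page's paragraph tags...",
    "author": "Jane Doe",
    "label": "real"
}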

Web Scraping for Fake News Detection: The Code

Start by importing the necessary packages.

import requests
from bs4 import BeautifulSoup

import time
import json

In addition to requests and BeautifulSoup mentioned above, you also need to import time and json. The time module allows you to add delays between requests, ensuring you don’t overload the web server when making multiple HTTP requests, while the json module lets you save the finished dataset to a JSON file.

Next, create a class NewsScraper. This class contains all the methods required to build a dataset using web scraping for fake news detection:

  1. read_urls_from_file()
  2. scrape_article()
  3. build_dataset()

read_urls_from_file()

This function reads URLs from a text file and returns them as a list.

    @staticmethod
    def read_urls_from_file(filepath):
        try:
            with open(filepath, "r", encoding="utf-8") as file:
                urls = [line.strip() for line in file if line.strip()]
            return urls
        except Exception as e:
            print(f"Error reading URLs from {filepath}: {str(e)}")
            return []
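
The input file is expected to hold one URL per line; blank lines are skipped. A placeholder example:

https://example.com/news/article-1
https://example.com/news/article-2
https://example.com/news/article-3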

scrape_article()

This function accepts a URL and returns a dict containing the article details, including the content, headline, and the author’s name.

First, the function makes an HTTP request to the URL. The response will contain the HTML code of the article page.

response = requests.get(url, headers=self.headers)

Next, it parses the HTML from the response using BeautifulSoup, creating a soup object.

soup = BeautifulSoup(response.text, "html.parser")

You can use this object to extract the required data points. The function extracts the title from an h1 tag and the article content from all the p tags.

title = soup.find("h1").get_text().strip()
content = " ".join([p.get_text().strip() for p in soup.find_all("p")])

However, web pages use various tags and attributes to store author names, so you need to consider the ones used most commonly. 

That’s why the function defines a list of possible CSS selectors to fetch author names and tries all of them one by one.

author = None
author_selectors = [
    ".byline",
    ".author",
    ".article-author",
    ".post-author",
    ".author-name",
    ".author-byline",
    ".byline-text",
    ".credit",
    "[author]",
    "[data-author]",
    "[data-byline]",
    "[class*='byline']",
    "[class*='author']",
    "[rel='author']",

    "p.byline",
    "span.byline",
    "div.byline",
    "p.author",
    "span.author",

    "article .byline",
    "header .byline",
    ".post-meta .byline",
    ".entry-meta .byline",
    "#author",
    "#article-byline",
    "cite",
]

for selector in author_selectors:
    author_elem = soup.select_one(selector)
    if author_elem:
        if author_elem.name == "meta":
            # Meta tags store the author in the content attribute
            author = " ".join(author_elem.get("content", "").split())
        else:
            author = " ".join(author_elem.get_text().split())
        break

print(author)
return {"url": url, "title": title, "content": content, "author": author}

The full method wraps these steps in a try/except block, so any failure simply logs an error and returns None; you can see this in the complete code below.

build_dataset()

This function iterates through the URLs under each label (real or fake), and for each URL, it:

  1. Calls scrape_article() with the URL as the argument.
  2. Adds a label (fake or real) to the returned dict.
  3. Appends the dict to a list defined outside the loop.

The function returns a list of dicts containing details of all articles.

def build_dataset(self, sources):
    data = []

    for label, urls in sources.items():
        for url in urls:
            article = self.scrape_article(url)
            if article:
                article["label"] = label
                data.append(article)

    return data

After defining these functions, you can call them.

First, build a dict mapping each label to a list of URLs. Use the read_urls_from_file() method to read the URLs of both real and fake news articles.

sources = {
    "real": NewsScraper.read_urls_from_file("real_urls.txt"),
    "fake": NewsScraper.read_urls_from_file("fake_urls.txt"),
}

Next, create a NewsScraper() object and call build_dataset() with the URL dict as an argument. This function will return a list of dict objects containing details of all the articles.

scraper = NewsScraper()
dataset = scraper.build_dataset(sources)

Finally, save this list using the json.dump() method.

with open("news_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=4)

Here’s the complete code.

import requests
from bs4 import BeautifulSoup

import time
import json

class NewsScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        self.delay = 3  # seconds between requests

    def scrape_article(self, url):
        try:
            time.sleep(self.delay)
            response = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(response.text, "html.parser")

            title = soup.find("h1").get_text().strip()
            content = " ".join([p.get_text().strip() for p in soup.find_all("p")])

            author = None
            author_selectors = [
                ".byline",
                ".author",
                ".article-author",
                ".post-author",
                ".author-name",
                ".author-byline",
                ".byline-text",
                ".credit",
                "[author]",  
                "[data-author]",
                "[data-byline]",
                "[class*='byline']",
                "[class*='author']",  
                "[rel='author']",
               
                "p.byline",
                "span.byline",
                "div.byline",  
                "p.author",  
                "span.author",
               
                "article .byline",  
                "header .byline",
                ".post-meta .byline",
                ".entry-meta .byline",  
                "#author",
                "#article-byline",
                "cite",  
            ]

            for selector in author_selectors:
                author_elem = soup.select_one(selector)
                if author_elem:
                    if author_elem.name == "meta":
                        # Meta tags store the author in the content attribute
                        author = " ".join(author_elem.get("content", "").split())
                    else:
                        author = " ".join(author_elem.get_text().split())
                    break
            print(author)
            return {"url": url, "title": title, "content": content, "author": author}
        except Exception as e:
            print(f"Error scraping {url}: {str(e)}")
            return None

    def build_dataset(self, sources):
        data = []

        for label, urls in sources.items():
            for url in urls:
                article = self.scrape_article(url)
                if article:
                    article["label"] = label
                    data.append(article)

        return data

    @staticmethod
    def read_urls_from_file(filepath):
        try:
            with open(filepath, "r", encoding="utf-8") as file:
                urls = [line.strip() for line in file if line.strip()]
            return urls
        except Exception as e:
            print(f"Error reading URLs from {filepath}: {str(e)}")
            return []

if __name__ == "__main__":
    sources = {
        "real": NewsScraper.read_urls_from_file("real_urls.txt"),
        "fake": NewsScraper.read_urls_from_file("fake_urls.txt"),
    }

    scraper = NewsScraper()
    dataset = scraper.build_dataset(sources)
    with open("news_dataset.json", "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=4)

A Note on Fake News Detection

After using the scraped dataset to train the model, you can use it to check whether a particular news piece is fake. 
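
Training itself is outside the scope of this article, but here is a minimal sketch of what that step could look like, assuming scikit-learn and joblib are installed. It trains a TF-IDF plus logistic regression pipeline on the article content alone and saves it as fake_news_model.joblib; a production model would use richer features and proper evaluation.

import json

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Load the dataset built earlier by NewsScraper
with open("news_dataset.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

texts = [article["content"] for article in dataset]
labels = [article["label"] for article in dataset]

# TF-IDF features feeding a logistic regression classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=50000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Save the trained pipeline for the prediction script below
joblib.dump(model, "fake_news_model.joblib")

Note that this particular pipeline expects raw text, so a prediction step built on it would call model.predict([article["content"]]) rather than passing the whole dict, as the sample below assumes.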

In this case, web scraping comes in handy because you can build an automated pipeline, which:

  1. Reads a list of URLs from a text file
  2. For each URL the code:
    1. Scrapes the article details
    2. Runs the details through the trained model
    3. Returns whether or not the article is fake

Here’s a sample code for the above pipeline:

import joblib

def predict_fake_news(url):
    try:
        # Load the trained model saved earlier with joblib
        model = joblib.load('fake_news_model.joblib')
       
        # Scrape the article
        article = scrape_article(url)
        if not article:
            return None
       
        # assuming the model accepts a dictionary of article details
        prediction = model.predict(article)
       
        return prediction == 'fake'
       
    except Exception as e:
        print(f"Error in prediction: {str(e)}")
        return None

if __name__ == "__main__":
    with open("urls.txt", "r") as file:
        urls = file.readlines()
    urls = [url.strip() for url in urls if url.strip()]

    for url in urls:
        print(f"Analyzing URL: {url}")
        result = predict_fake_news(url)
       
        if result is None:
            print("Could not analyze the article.")
        elif result:
            print("⚠️ This article is likely FAKE news!")
        else:
            print("✅ This article appears to be legitimate.")

The above code defines a predict_fake_news() function that accepts a URL and returns True if the model flags the article as fake, False if it appears real, and None if the analysis fails.

The function starts by loading the trained model, which the code assumes is in a ‘.joblib’ format. 

Then, it calls a scrape_article() function that accepts a URL and returns a dict with article details; this function is similar to the scrape_article() function that was used to build the dataset.

Finally, it passes the article details to the model, which predicts whether or not the article is fake.

After defining the function, the code

  1. Reads a list of URLs from a file.
  2. Iterates through the list and calls predict_fake_news() for each URL.

Code Limitations

The code can help you get started with web scraping for fake news detection. However,

  1. The selectors are not universal; you might need to tweak them for specific websites.
  2. Websites may also change the HTML structure frequently, requiring you to determine the new selectors.
  3. Some websites may use anti-scraping measures; this code doesn’t deal with them (see the sketch after this list for a starting point).
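
For the third point, a complete solution is beyond this article, but checking the response status and retrying failed requests is a reasonable starting point. Here is a sketch of a helper (the function name, retry counts, and delays are illustrative assumptions) that could replace the plain requests.get() call inside scrape_article():

import time

import requests

def fetch_with_retries(url, headers, max_retries=3, backoff=5):
    """Fetch a URL, retrying transient failures with a growing delay."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()  # raise on 4xx/5xx responses
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed for {url}: {e}")
            if attempt < max_retries:
                time.sleep(backoff * attempt)  # simple linear backoff
    return None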

Wrapping Up: Why Use a Web Scraping Service

You can use Python to scrape data for fake news detection models. However, training the model is itself a tedious task, so why take on scraping as well when all you want is the data?

Use a web scraping service.

A web scraping service like ScrapeHero can handle everything related to web scraping, including managing selectors and anti-scraping measures. We also provide custom AI solutions, including natural language processing and data classification, that can help you detect fake news. Let us join in your battle against fake news. Contact ScrapeHero now!
