Fake news has become a massive problem in our society. LLMs and social media have made it easy to create and spread fake news. However, you can leverage web scraping for fake news detection.
How? Read on.
This article tells you how to use web scraping to create datasets for training AI/ML models using Python.
Web Scraping for Fake News Detection: The Environment
The code for scraping news articles requires two external packages:
- Python requests: Handles HTTP requests to fetch the HTML code of the article page.
- BeautifulSoup: Manages parsing and extracting the required data points.
You can install both packages using Python’s package manager, pip.
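For example, a single pip command installs both packages (on PyPI, BeautifulSoup is published as beautifulsoup4):

pip install requests beautifulsoup4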
What to Scrape
You need to scrape the training data for AI/ML models. Here, the code creates a dataset for fake news detection by scraping these three details:
- Article Content
- Headline
- Author’s Name
Web Scraping for Fake News Detection: The Code
Start by importing the necessary packages.
import requests
from bs4 import BeautifulSoup
import time
import json
In addition to requests and BeautifulSoup mentioned above, you also need to import time and json. The time module lets you add delays between requests so that you don’t overload the web server when making multiple HTTP requests, and the json module lets you save the finished dataset to a JSON file.
Next, create a class NewsScraper. This class contains all the methods required to build a dataset using web scraping for fake news detection:
- read_urls_from_file()
- scrape_article()
- build_dataset()
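Before looking at each method, note the class constructor (it appears again in the complete code below): it sets a browser-like User-Agent header and a delay, in seconds, between consecutive requests.

class NewsScraper:
    def __init__(self):
        # Browser-like User-Agent so servers are less likely to reject the requests
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        # Pause, in seconds, between consecutive requests
        self.delay = 3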
read_urls_from_file()
This function reads URLs from a text file, one per line, and returns them as a list.
@staticmethod
def read_urls_from_file(filepath):
    try:
        with open(filepath, "r", encoding="utf-8") as file:
            urls = [line.strip() for line in file if line.strip()]
        return urls
    except Exception as e:
        print(f"Error reading URLs from {filepath}: {str(e)}")
        return []
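The expected input is a plain text file with one URL per line. For example, real_urls.txt might look like this (the URLs below are placeholders):

https://example.com/news/economy-report
https://example.com/news/city-council-vote
https://example.com/news/weather-update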
scrape_article()
This function accepts a URL and returns a dict containing the article details, including the content, headline, and the author’s name.
First, the function waits for the configured delay (time.sleep(self.delay) in the complete code) and then makes an HTTP request to the URL. The response will contain the HTML code of the article page.
response = requests.get(url, headers=self.headers)
Next, it parses the code from the response using BeautifulSoup, creating an object.
soup = BeautifulSoup(response.text, "html.parser")
You can use this object to extract the required data points. The function extracts the title from an h1 tag and the article content from all the p tags.
title = soup.find("h1").get_text().strip()
content = " ".join([p.get_text().strip() for p in soup.find_all("p")])
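Note that soup.find("h1") returns None when a page has no h1 tag, so the .get_text() call raises an exception and the surrounding try/except skips that article. If you’d rather keep such pages, a slightly more defensive variant (an optional tweak, not part of the original code) could fall back to the page title:

# Optional, more defensive title extraction (falls back to the <title> tag)
h1 = soup.find("h1")
if h1:
    title = h1.get_text().strip()
elif soup.title:
    title = soup.title.get_text().strip()
else:
    title = ""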
However, web pages use various tags and attributes to store author names, so you need to consider the ones used most commonly.
That’s why the function defines a list of possible CSS selectors to fetch author names and tries all of them one by one.
author = None
author_selectors = [
    ".byline",
    ".author",
    ".article-author",
    ".post-author",
    ".author-name",
    ".author-byline",
    ".byline-text",
    ".credit",
    "[author]",
    "[data-author]",
    "[data-byline]",
    "[class*='byline']",
    "[class*='author']",
    "[rel='author']",
    "p.byline",
    "span.byline",
    "div.byline",
    "p.author",
    "span.author",
    "article .byline",
    "header .byline",
    ".post-meta .byline",
    ".entry-meta .byline",
    "#author",
    "#article-byline",
    "cite",
]
for selector in author_selectors:
    author_elem = soup.select_one(selector)
    if author_elem:
        if author_elem.name == "meta":
            author = " ".join(author_elem.get("content", "").split())
        else:
            author = " ".join(author_elem.get_text().split())
        break

print(author)
return {"url": url, "title": title, "content": content, "author": author}
except Exception as e:
    print(f"Error scraping {url}: {str(e)}")
    return None
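Many pages also declare the author in a standard meta tag. As an optional extra fallback (not part of the original selector list), you could check meta name="author" when none of the CSS selectors match:

# Optional fallback: the standard <meta name="author"> tag
if not author:
    meta_author = soup.find("meta", attrs={"name": "author"})
    if meta_author and meta_author.get("content"):
        author = " ".join(meta_author["content"].split())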
build_dataset()
This function iterates through the URLs under each label ("real" or "fake"), and for each URL, it:
- Calls scrape_article() with the URL as the argument.
- Adds a label (fake or real) to the returned dict.
- Appends the dict to a list defined outside the loop.
The function returns a list of dicts containing details of all articles.
def build_dataset(self, sources):
    data = []
    for label, urls in sources.items():
        for url in urls:
            article = self.scrape_article(url)
            if article:
                article["label"] = label
                data.append(article)
    return data
After defining the functions, you can call them.
First, build a dict consisting of URLs. Use the read_urls_from_file() method to read URLs of both real and fake news articles.
sources = {
    "real": NewsScraper.read_urls_from_file("real_urls.txt"),
    "fake": NewsScraper.read_urls_from_file("fake_urls.txt"),
}
Next, create a NewsScraper() object and call build_dataset() with the URL dict as an argument. This function will return a list of dict objects containing details of all the articles.
scraper = NewsScraper()
dataset = scraper.build_dataset(sources)
Finally, save this list using the json.dump() method.
with open("news_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=4)
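Each entry in news_dataset.json contains one article’s details plus its label. A single record looks roughly like this (the values below are placeholders):

{
    "url": "https://example.com/news/some-article",
    "title": "Example headline",
    "content": "Full article text extracted from the p tags...",
    "author": "Jane Doe",
    "label": "real"
}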
Here’s the complete code.
import requests
from bs4 import BeautifulSoup
import time
import json


class NewsScraper:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        self.delay = 3  # seconds between requests

    def scrape_article(self, url):
        try:
            time.sleep(self.delay)
            response = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(response.text, "html.parser")

            # Headline and article body
            title = soup.find("h1").get_text().strip()
            content = " ".join([p.get_text().strip() for p in soup.find_all("p")])

            # Try common author selectors one by one
            author = None
            author_selectors = [
                ".byline",
                ".author",
                ".article-author",
                ".post-author",
                ".author-name",
                ".author-byline",
                ".byline-text",
                ".credit",
                "[author]",
                "[data-author]",
                "[data-byline]",
                "[class*='byline']",
                "[class*='author']",
                "[rel='author']",
                "p.byline",
                "span.byline",
                "div.byline",
                "p.author",
                "span.author",
                "article .byline",
                "header .byline",
                ".post-meta .byline",
                ".entry-meta .byline",
                "#author",
                "#article-byline",
                "cite",
            ]
            for selector in author_selectors:
                author_elem = soup.select_one(selector)
                if author_elem:
                    if author_elem.name == "meta":
                        author = " ".join(author_elem.get("content", "").split())
                    else:
                        author = " ".join(author_elem.get_text().split())
                    break

            print(author)
            return {"url": url, "title": title, "content": content, "author": author}
        except Exception as e:
            print(f"Error scraping {url}: {str(e)}")
            return None

    def build_dataset(self, sources):
        data = []
        for label, urls in sources.items():
            for url in urls:
                article = self.scrape_article(url)
                if article:
                    article["label"] = label
                    data.append(article)
        return data

    @staticmethod
    def read_urls_from_file(filepath):
        try:
            with open(filepath, "r", encoding="utf-8") as file:
                urls = [line.strip() for line in file if line.strip()]
            return urls
        except Exception as e:
            print(f"Error reading URLs from {filepath}: {str(e)}")
            return []


if __name__ == "__main__":
    sources = {
        "real": NewsScraper.read_urls_from_file("real_urls.txt"),
        "fake": NewsScraper.read_urls_from_file("fake_urls.txt"),
    }

    scraper = NewsScraper()
    dataset = scraper.build_dataset(sources)

    with open("news_dataset.json", "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=4)
A Note on Fake News Detection
After training a model on the scraped dataset, you can use the model to check whether a particular news piece is fake.
Web scraping comes in handy here as well because you can build an automated pipeline that:
- Reads a list of URLs from a text file
- For each URL the code:
- Scrapes the article details
- Runs the details through the trained model
- Returns whether or not the article is fake
Here’s sample code for the above pipeline:
import joblib


def predict_fake_news(url):
    try:
        # Load the trained model
        model = joblib.load('fake_news_model.joblib')

        # Scrape the article
        article = scrape_article(url)
        if not article:
            return None

        # Assuming the model accepts a dictionary of article details
        prediction = model.predict(article)
        return prediction == 'fake'
    except Exception as e:
        print(f"Error in prediction: {str(e)}")
        return None


if __name__ == "__main__":
    with open("urls.txt", "r") as file:
        urls = file.readlines()
    urls = [url.strip() for url in urls if url.strip()]

    for url in urls:
        print(f"Analyzing URL: {url}")
        result = predict_fake_news(url)
        if result is None:
            print("Could not analyze the article.")
        elif result:
            print("⚠️ This article is likely FAKE news!")
        else:
            print("✅ This article appears to be legitimate.")
The above code defines a predict_fake_news() function that accepts a URL and returns a prediction: fake or real.
The function starts by loading the trained model, which the code assumes is in a ‘.joblib’ format.
Then, it calls a scrape_article() function that accepts a URL and returns a dict with article details; this function is similar to the scrape_article() function that was used to build the dataset.
Finally, it’ll pass the article details to the model, which will predict whether or not the article is fake.
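Note that most scikit-learn text classifiers expect vectorized text rather than a raw dict, so in practice the prediction step also involves whatever feature pipeline was used during training. A minimal sketch, assuming a TF-IDF vectorizer was saved alongside the model (fake_news_vectorizer.joblib is a hypothetical filename):

# Minimal sketch: vectorize the scraped text before predicting
# (assumes the vectorizer used during training was saved with joblib)
vectorizer = joblib.load("fake_news_vectorizer.joblib")
text = article["title"] + " " + article["content"]
features = vectorizer.transform([text])
prediction = model.predict(features)[0]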
After defining the function, the code
- Reads a list of URLs from a file.
- Iterates through the list and calls predict_fake_news() for each URL.
Code Limitations
The code can help you get started with web scraping for fake news detection. However,
- The selectors are not universal; you might need to tweak them for specific websites.
- Websites may also change the HTML structure frequently, requiring you to determine the new selectors.
- Some websites may use anti-scraping measures; this code doesn’t deal with them.
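For instance, the request in scrape_article() uses no timeout and gives up on the first failure. A minimal sketch of a more robust fetch helper, with a timeout and simple retries (fetch_html() is a hypothetical addition, not a full anti-bot solution):

# Hypothetical helper: fetch HTML with a timeout and basic retries
def fetch_html(url, headers, retries=3, timeout=15):
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            time.sleep(2 ** attempt)  # simple backoff before retrying
    return None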
Wrapping Up: Why Use a Web Scraping Service
You can use Python to scrape data for fake news detection models. However, building and training the model is itself a tedious task, so why bother with scraping yourself if you only want the data?
Use a web scraping service.
A web scraping service like ScrapeHero can handle everything related to web scraping, including managing selectors and anti-scraping measures. We also provide custom AI solutions, including natural language processing and data classification, that can help you detect fake news. Let us join in your battle against fake news. Contact ScrapeHero now!