How to Use NLP to Clean and Structure Scraped Data


NLP, or natural language processing, deals with understanding and generating human language, which makes it a natural fit for processing scraped data that contains text. Curious about how to do it? Keep reading to learn about using NLP to clean and structure scraped data.

Using NLP to Clean and Structure Scraped Data

Web scraping extracts unstructured web data and structures it into usable formats. However, extracted text—such as reviews and articles—remains largely unstructured. Natural Language Processing (NLP) then analyzes this text to extract meaningful features and create a structured representation.

Here are some ways you can clean the scraped text and create structure:

Named Entity Recognition

Named Entity Recognition (NER) is one of the most useful NLP techniques for cleaning and structuring scraped data. NER extracts named entities from text and categorizes them into predefined groups such as people, organizations, and locations.

The process involves two main steps: identification and classification. First, an NER model identifies the named entities; then, it classifies them into different categories.

You can implement NER using various methods:

  • Rule-based methods use predefined rules to identify and classify named entities; for example, treating capitalized words as proper nouns is a simple rule (see the sketch after this list).
  • Machine learning methods use models such as decision trees that learn from labeled data. A trained model can then classify vast datasets; however, it needs a significant amount of labeled data for training.
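
To illustrate the rule-based approach, here's a minimal sketch that treats runs of capitalized words as candidate entities. It's only a rough heuristic and will also pick up sentence-initial words, unlike a trained NER model:

import re

def rule_based_entities(text):
    # Treat one or more consecutive capitalized words as a candidate named entity
    pattern = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
    return re.findall(pattern, text)

print(rule_based_entities("Steve Jobs founded Apple in Cupertino."))
# ['Steve Jobs', 'Apple', 'Cupertino']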

You can use NER for these purposes:

  • As a structuring technique, to organize extracted entities into tables or databases
  • As a cleaning technique, to remove irrelevant entities from the dataset

Here’s how you can perform NER:

import spacy

# 1. Named Entity Recognition (NER)
def perform_named_entity_recognition(text):
    """
    Extract and classify named entities from text using spaCy
    """
    # Load the English language model
    # (run `python -m spacy download en_core_web_sm` once beforehand)
    nlp = spacy.load("en_core_web_sm")
    
    # Process the text
    doc = nlp(text)
    
    # Extract named entities
    
    entities = {}
    for ent in doc.ents:
        if ent.label_ not in entities:
            entities[ent.label_] = []
        entities[ent.label_].append(ent.text)
    
    return entities

# Example usage
sample_text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
print(perform_named_entity_recognition(sample_text))

This code defines a function perform_named_entity_recognition(). It accepts text as input and returns the recognized entities. It starts by loading the spaCy model for NER, ‘en_core_web_sm’, and then passes the text through it. Once the text runs through the model, spaCy identifies and classifies the named entities. The code then

  1. Creates an empty dict to store the extracted entities
  2. Loops through the recognized entities and adds them to the dict under their labels
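
Because the function returns a plain dictionary, you can flatten it into rows and load it into a table, which covers the structuring use case mentioned earlier. Here's a minimal sketch using pandas (assumed to be installed), reusing perform_named_entity_recognition() and sample_text from the example above:

import pandas as pd

# Flatten the {label: [entities]} dictionary into one row per entity
entities = perform_named_entity_recognition(sample_text)
rows = [
    {"entity": entity_text, "label": label}
    for label, entity_texts in entities.items()
    for entity_text in entity_texts
]

# Load the rows into a DataFrame for further analysis or export
entity_table = pd.DataFrame(rows)
print(entity_table)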

Text Summarization

Text summarization allows you to condense a large volume of scraped data, helping you identify important information. There are two types: extractive and abstractive summarization.

  • Extractive summarization selects the most important sentences verbatim from the source text.
  • Abstractive summarization generates new sentences to summarize the long-form text (see the sketch after this list).
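
Abstractive summarization typically relies on pretrained sequence-to-sequence models. Here's a minimal sketch using the Hugging Face transformers library, which is an additional dependency not used elsewhere in this article; the default model downloads on first use:

from transformers import pipeline

# Load a pretrained summarization pipeline (downloads a model on first run)
summarizer = pipeline("summarization")

long_text = (
    "Machine learning is a method of data analysis that automates analytical model building. "
    "It is a branch of artificial intelligence based on the idea that systems can learn from data, "
    "identify patterns, and make decisions with minimal human intervention."
)

# The model writes a new, shorter version of the text rather than copying sentences
summary = summarizer(long_text, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])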

Text summarization allows you to:

  • Identify core themes in your dataset and organize the information accordingly.
  • Clean your dataset by removing unwanted information.

Here’s how to perform extractive summarization:

import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer



def extractive_summarization(text, num_sentences=3):

    nltk.download('punkt', quiet=True)
    
    sentences = nltk.sent_tokenize(text)
    
    if len(sentences) <= num_sentences:
        return sentences
    
    # Use TF-IDF to score sentences
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)
    
    # Get top sentences based on TF-IDF scores
    sentence_scores = np.array(sentence_vectors.sum(axis=1)).flatten()
    
    # Get indices of top sentences
    top_sentence_indices = sentence_scores.argsort()[-num_sentences:][::-1]
    
    # Return summarized text (sorted in original order)
    return [sentences[i] for i in sorted(top_sentence_indices)]

# Example usage

long_text = """
Machine learning is a method of data analysis that automates analytical model building. 
It is a branch of artificial intelligence based on the idea that systems can learn from data, 
identify patterns and make decisions with minimal human intervention. 
Deep learning, a subset of machine learning, is based on artificial neural networks.
Artificial intelligence continues to evolve rapidly, transforming various industries 
and creating new possibilities for technological innovation.
"""

# Perform summarization
summary = extractive_summarization(long_text)
print("Original Text Length:", len(nltk.sent_tokenize(long_text)))
print("Summary Length:", len(summary))
print("\nSummary:")
for sentence in summary:
    print("- " + sentence)

This code uses NLTK to perform extractive summarization. It first tokenizes the text into sentences using sent_tokenize. If the text contains no more sentences than the requested summary length, it simply returns the original sentences.

Otherwise, the code creates sentence vectors using TfidfVectorizer().fit_transform(), which weighs each term by its term frequency (how often the word appears in a sentence) and inverse document frequency (how rare the word is across all the sentences) to find the most important sentences.

Then, the code selects the top-scoring sentences and returns them in their original order.
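
To see the TF-IDF scores that drive this selection in isolation, here's a minimal sketch on two short sentences (the exact weights depend on scikit-learn's smoothing and normalization defaults):

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["The cat sat on the mat.", "The dog chased the cat."]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

# One row per sentence, one column per term; higher scores mean more distinctive terms
for word, score in zip(vectorizer.get_feature_names_out(), tfidf_matrix.toarray()[0]):
    print(f"{word}: {score:.3f}")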

Sentiment Analysis

Sentiment analysis refers to analyzing the emotions behind the text. There are two primary ways to perform sentiment analysis: rule-based systems and machine learning.

Rule-based systems compare the words in a text against positive, negative, and neutral lexicons and assign each word a sentiment score based on which lexicon it matches. The sum of the word scores gives the sentiment score of the text.
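
Here's a minimal sketch of the rule-based idea using a tiny, made-up lexicon; real systems such as VADER rely on much larger lexicons plus rules for negation and intensity:

# Tiny illustrative lexicons; real lexicons contain thousands of scored words
POSITIVE_WORDS = {"amazing", "great", "perfect", "good"}
NEGATIVE_WORDS = {"disappointed", "bad", "poor", "broken"}

def lexicon_sentiment(text):
    words = text.lower().split()
    # +1 for every positive word, -1 for every negative word
    score = sum(word in POSITIVE_WORDS for word in words) - sum(word in NEGATIVE_WORDS for word in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(lexicon_sentiment("This product is amazing"))  # Positive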

Machine learning models learn from training data that consists of words and their associated sentiments. The main advantage of using machine learning for sentiment analysis is that you can analyze the sentiment of out-of-vocabulary words by employing techniques like sub-word tokenization and character-level models.
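
Here's a minimal sketch of the machine learning approach using scikit-learn, training a logistic regression classifier on a few hand-labeled examples (one possible choice of model; a usable classifier would need far more labeled data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny hand-labeled training set; real training data runs into thousands of examples
train_texts = [
    "I love this product, it works perfectly",
    "Absolutely fantastic experience",
    "Terrible quality, very disappointed",
    "This is the worst purchase I have made",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Vectorize the text and train the classifier in a single pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["very disappointed with the quality"]))  # likely ['negative']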

Sentiment analysis allows you to

  • Filter your scraped dataset for specific sentiments.
  • Create a structured dataset based on the results of sentiment analysis.

Here’s how to perform sentiment analysis:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon (no-op if already present)
nltk.download('vader_lexicon', quiet=True)

# 3. Sentiment Analysis
def analyze_sentiment(text):
    """
    Perform sentiment analysis using NLTK's VADER
    """
    # Initialize sentiment analyzer
    sia = SentimentIntensityAnalyzer()
    
    # Get sentiment scores
    sentiment_scores = sia.polarity_scores(text)
    
    # Determine overall sentiment
    if sentiment_scores['compound'] >= 0.05:
        return 'Positive'
    elif sentiment_scores['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Example usage
reviews = [
    "This product is amazing and works perfectly!",
    "I'm disappointed with the quality of this item.",
    "It's an okay product, nothing special."
]
for review in reviews:
    print(f"Review: {review}")
    print(f"Sentiment: {analyze_sentiment(review)}\n")

This code uses NLTK’s VADER to analyze sentiment. It first calculates the sentiment scores and then categorizes the text as positive, negative, or neutral: a compound score of 0.05 or higher is positive, a score of -0.05 or lower is negative, and anything in between is neutral.
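
To turn these results into a structured dataset, you can pair each review with its predicted sentiment and then filter by label. A minimal sketch using pandas (assumed to be installed), reusing reviews and analyze_sentiment() from the example above:

import pandas as pd

# One row per review with its predicted sentiment label
structured_reviews = pd.DataFrame(
    {"review": reviews, "sentiment": [analyze_sentiment(review) for review in reviews]}
)

# Filter for a specific sentiment, for example keeping only the negative reviews
negative_reviews = structured_reviews[structured_reviews["sentiment"] == "Negative"]
print(negative_reviews)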

Curious how scraped data is useful for sentiment analysis? Check this article on sentiment analysis using web scraping.

Topic Modeling

Topic modeling helps categorize text into various topics. For instance, you could use topic modeling to categorize scraped customer reviews into numerous topics, enabling you to grasp key points made by each customer.

Topic modeling works by clustering similar words or documents by topic. It is an unsupervised learning method, meaning it doesn’t need labeled training data.

You can use topic modeling to

  • Organize documents based on their topics
  • Remove unwanted topics from your dataset

Here’s how to perform topic modeling:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def perform_topic_modeling(documents, num_topics=3):
    """
    Perform topic modeling using Latent Dirichlet Allocation
    """
    # Vectorize the documents
    vectorizer = TfidfVectorizer(
        stop_words='english',  # Remove common stop words
        max_df=0.95,  # Ignore terms that appear in more than 95% of documents
        min_df=1,     # Keep terms that appear in at least one document
    )
    
    try:
        # Transform documents to document-term matrix
        doc_term_matrix = vectorizer.fit_transform(documents)
        
        # Check if we have enough unique terms
        if doc_term_matrix.shape[1] < num_topics:
            print("Warning: Not enough unique terms for topic modeling")
            return {}
        
        # Perform topic modeling
        lda_model = LatentDirichletAllocation(
            n_components=min(num_topics, doc_term_matrix.shape[1]),
            random_state=42,
            max_iter=10
        )
        lda_model.fit(doc_term_matrix)
        
        # Get feature names
        feature_names = vectorizer.get_feature_names_out()
        
        # Extract top words for each topic
        topics = {}
        for topic_idx, topic in enumerate(lda_model.components_):
            # Get indices of top words
            top_word_indices = topic.argsort()[:-10 - 1:-1]
            
            # Use these indices to get actual words
            top_words = [feature_names[i] for i in top_word_indices]
            topics[f"Topic {topic_idx + 1}"] = top_words
        
        return topics
    
    except Exception as e:
        print(f"Error in topic modeling: {e}")
        return {}

# Example usage
sample_documents = [
    "An axe is a great tool for cutting trees. You can use a chainsaw also, but it requires a power source. In contrast, axe does not require a power source even if it is slower",
    "A lion is considered the king of a jungle, but tigers have a more majestic roar. However, this is just a notion that humans have. It would be surprising if the animals feel this way.",
    "Machine learning is transforming artificial intelligence. An awesome application of machine learning is ChatGPT, which uses NLP",
    "Chocolates have several nutrients that improve cognitive function. However, unlike sugarless black coffee, which also have cognitive benefits, chocolates contain sugar",
]

# Perform topic modeling
topics = perform_topic_modeling(sample_documents)

# Print results
for topic, words in topics.items():
    print(f"{topic}: {', '.join(words)}")

This code uses the scikit-learn library to perform topic modeling with Latent Dirichlet Allocation:

  1. Creates a mathematical representation of the documents using TfidfVectorizer()
  2. Checks whether there are enough unique terms for topic modeling; if there are fewer terms than requested topics, it returns an empty dict
  3. Uses LatentDirichletAllocation() to identify the topics
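
To use LDA for structuring, you can also look at the document-topic matrix it produces and assign each document its dominant topic. A minimal standalone sketch, reusing sample_documents from the example above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Vectorize the documents and fit LDA directly
vectorizer = TfidfVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(sample_documents)

lda_model = LatentDirichletAllocation(n_components=3, random_state=42)
doc_topic_matrix = lda_model.fit_transform(doc_term_matrix)

# Each row holds one document's topic distribution; argmax gives its dominant topic
for document, distribution in zip(sample_documents, doc_topic_matrix):
    print(f"Topic {distribution.argmax() + 1}: {document[:50]}...")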

Want to learn more? Check this article on analyzing Amazon product reviews using LDA topic modeling.

Tips for Using NLP Effectively

Now that you’re familiar with using NLP to clean and structure scraped data, keep these points in mind for using NLP effectively:

  • Data Preprocessing: NLP performs best when punctuation and special characters are removed. Additionally, eliminate stop words, which are common words that add little meaning (see the sketch after this list).
  • Model Selection: With numerous NLP models available, choose one carefully based on your task’s complexity and dataset’s size.
  • Evaluation: Measure your NLP model’s performance using metrics such as accuracy, precision, and recall.
  • Routine Updates: Regularly analyze the results of your NLP tasks and update your processes to enhance accuracy and efficiency.
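
For example, here's a minimal preprocessing sketch that lowercases text, strips punctuation and special characters, and removes NLTK's English stop words (the stopwords corpus is downloaded on first use):

import re
import nltk
from nltk.corpus import stopwords

# Download the stop word list (no-op if already present)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase and keep only letters, digits, and spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Drop stop words and collapse extra whitespace
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(preprocess("This product is AMAZING -- it works perfectly!"))
# product amazing works perfectly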

Wrapping Up

Using NLP to clean and structure data enhances the value of web scraping. While scraping provides a basic structure by organizing data into fields and records, NLP further structures the textual content within those fields, enabling more detailed analysis.

However, the effectiveness of NLP relies on the quality of the scraped data. Inconsistent formatting and missing data can reduce the accuracy of your results, and a web scraping service like ScrapeHero can take care of providing high-quality, error-free data. ScrapeHero is a fully managed web scraping service. We can build enterprise-grade web scrapers and crawlers customized to your specific needs, leaving you free to focus on using the data.
