Data obtained from web scraping is not immediately ready for analysis. It might be riddled with problems: missing values, unsuitable formats, and features that are not very useful. The solution? Transform and map the scraped data.
This guide shows you how to do that using Python.
Understanding Data Transformation and Mapping
Before learning how to transform and map scraped data, let's clarify what each process involves:
Data transformation involves modifying, cleaning, and restructuring raw data into a format suitable for analysis. This might also include converting data types, handling missing values, or creating new features from existing data.
Data mapping, on the other hand, creates relationships between different data elements and establishes how fields from different sources correspond to each other. It is crucial when integrating multiple sources or standardizing data formats.
Usually, both these processes are a part of a data pipeline.
Data Transformation Fundamentals
Data Cleaning
Data cleaning is typically the first and most crucial step in the transformation process; it handles:
- Missing Values
- Outliers
- Non-Standard Formats
- Duplicates
Handling Missing Values
Pandas DataFrames have several methods for dealing with missing values. You can use:
- isnull() to check for null values
- dropna() to remove them
- fillna() to insert data in place of the null values
import pandas as pd
df = pd.read_csv('data.csv')  # Load the scraped data (replace with your file)
# Count missing values in each column
df.isnull().sum()
# Remove rows with missing values
df.dropna()
# Fill missing values
df.fillna(0)  # Fill with a constant value
df.fillna(df.mean(numeric_only=True))  # Fill with the column mean
df.ffill()  # Forward fill
df.bfill()  # Backward fill
# Interpolate missing values
df.interpolate(method='linear')  # Linear interpolation
df.interpolate(method='polynomial', order=2)  # Polynomial interpolation
Removing Duplicate Entries
You can use the drop_duplicates() method of Pandas to remove duplicate entries.
df.drop_duplicates() # Remove duplicate rows
df.drop_duplicates(subset=['column']) # Remove duplicates based on specific columns
df.drop_duplicates(keep='first') # Keep first occurrence
df.drop_duplicates(keep='last') # Keep last occurrence
Standardizing Formats
How you standardize a format depends on the data type. Pandas has built-in methods for handling date-time objects, but you need to apply custom logic for text and numbers.
# Dates
df['date'] = pd.to_datetime(df['date']) # Convert to datetime
df['date'] = df['date'].dt.strftime('%Y-%m-%d') # Format as string
# Text
text = text.lower() # Lowercase
text = text.strip() # Remove whitespace
import re
text = re.sub(r'[^\w\s]', '', text) # Remove special chars
# Numbers
df['price'] = df['price'].round(2) # Round decimals
df['price'] = df['price'].astype(int) # Convert to integer
df['price'] = df['price'].apply(lambda x: f"${x:,.2f}") # Currency format
# Categories
df['category'] = df['category'].astype('category') # Convert to categorical
df['category'] = df['category'].map({'a': 'Category A', 'b': 'Category B'}) # Map values
Correcting Inconsistencies in Data
Inconsistencies in data include inconsistent casing, typos, extra spaces, variant spellings, etc. The fix depends on the specific inconsistency:
# Standardize case
df['name'] = df['name'].str.title() # Proper case
df['email'] = df['email'].str.lower()
# Remove extra spaces
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
# Standardize formats
df['phone'] = df['phone'].replace(r'\D+', '', regex=True) # Keep only digits
df['phone'] = df['phone'].apply(lambda x: f"{x[:3]}-{x[3:6]}-{x[6:]}")
# Handle variations
df['status'] = df['status'].replace(['Y', 'yes', 'Yes'], 'Active')
df['status'] = df['status'].replace(['N', 'no', 'No'], 'Inactive')
# Validate data
assert df['price'].between(0, 1000).all() # Check valid range
assert df['email'].str.contains('@').all() # Basic email validation
Dealing with Outliers Based on Statistical Analysis
There are several methods to identify outliers.
- Z-Score Method: The z-score tells you how many standard deviations a value is from the mean. Usually, values with an absolute z-score greater than 3 are treated as outliers.
from scipy import stats
z_scores = stats.zscore(df['column'])
df_no_outliers = df[abs(z_scores) < 3]  # Keep rows with |z| < 3
- IQR (Interquartile Range) Method: This method sorts the data and splits it into four equal parts. If Q1 is the value below which the first quarter of the data falls and Q3 is the value below which the first three quarters fall, the IQR is the difference between them. Typically, outliers are values that lie more than 1.5 times the IQR below Q1 or above Q3.
Q1 = df['column'].quantile(0.25) # 25th percentile
Q3 = df['column'].quantile(0.75) # 75th percentile
IQR = Q3 - Q1 # Distance between Q1 and Q3
lower = Q1 - 1.5 * IQR # Lower fence
upper = Q3 + 1.5 * IQR # Upper fence
- Isolation Forest: This method repeatedly picks a random split point between the minimum and maximum of a feature, dividing the values into two groups, and keeps splitting each sub-group until every value is isolated. Outliers are the values that require the fewest splits to isolate.
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.1) # Expect 10% outliers
outliers = iso.fit_predict(df[['col1', 'col2']]) == -1
Once you detect outliers, you can remove them, cap them, or flag them:
# Option 1: Remove them
df_clean = df[~outliers]
# Option 2: Cap them (Winsorization)
df['column'] = df['column'].clip(lower, upper)
# Option 3: Flag them for review
df['is_outlier'] = outliers
Feature Engineering
Feature engineering is the process of creating new meaningful features from existing data. Here are the primary approaches:
Mathematical Transformations: Applying various mathematical operations to the existing dataset to create new features:
- Arithmetic operations: These include addition, multiplication, etc., which let you combine existing features into new ones.
- Statistical transformations: These transformations, such as logarithmic and exponential, are useful in statistical analysis.
- Rolling calculations: The result of these transformations, such as moving averages and cumulative sums, at any point in the dataset depends on a range of previous data points.
# Example data
numbers = [1, 2, 3, 4, 5]
# Addition
addition_result = [x + 1 for x in numbers]
print("Addition:", addition_result)
# Multiplication
multiplication_result = [x * 2 for x in numbers]
print("Multiplication:", multiplication_result)
import numpy as np
# Logarithmic transformation
log_transformed = [np.log(x) for x in numbers if x > 0] # Log of 0 or negative numbers is undefined
print("Logarithmic transformation:", log_transformed)
# Exponential transformation
exp_transformed = [np.exp(x) for x in numbers]
print("Exponential transformation:", exp_transformed)
import pandas as pd
# Create a pandas Series for easier rolling calculations
s = pd.Series(numbers)
# Moving Average (3-period)
moving_avg = s.rolling(window=3).mean()
print("Moving Average:", moving_avg.tolist())
# Cumulative Sum
cumulative_sum = s.cumsum()
print("Cumulative Sum:", cumulative_sum.tolist())
Discretization: Making continuous data discrete. It allows you to simplify datasets for analysis. Its types include:
- Equal-width binning: Divides range into equal-sized intervals; each bin might have an uneven number of samples.
- Equal-frequency binning: Creates bins with an equal number of samples; each bin might be of uneven width.
- Custom binning based on the domain: Widths and the number of samples depend on the domain.
import pandas as pd
# equal width binning
n_bins = 3
df['equal_width_bin'] = pd.cut(df['column_name'], bins=n_bins, labels=False)
print(df['equal_width_bin'])
# equal frequency binning
n_bins = 3
df['equal_freq_bin'] = pd.qcut(df['column_name'], q=n_bins, labels=False, duplicates='drop')
print(df['equal_freq_bin'])
# custom binning
custom_bins = [0, 10, 20, 30, 50]
df['custom_bin'] = pd.cut(df['column_name'], bins=custom_bins, labels=False, include_lowest=True)
print(df['custom_bin'])
Normalization and Scaling: This involves multiple techniques:
- Min-max scaling: Transforms features by scaling them to a given range, such as 0 to 1.
- Standard scaling (z-score normalization): Subtracts the mean from each value and divides the result by the standard deviation.
- Robust scaling: Subtracts the median from each value and divides the result by the interquartile range (IQR).
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
# min max scaler
scaler = MinMaxScaler()
df['min_max_scaled'] = scaler.fit_transform(df[['column_name']])
print(df['min_max_scaled'])
# standard scaler
scaler = StandardScaler()
df['standard_scaled'] = scaler.fit_transform(df[['column_name']])
print(df['standard_scaled'])
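RobustScaler, imported above, follows the same pattern; here is a brief sketch using the same placeholder column:
# robust scaler
scaler = RobustScaler()
df['robust_scaled'] = scaler.fit_transform(df[['column_name']])
print(df['robust_scaled'])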
Extracting Sentiment: This involves detecting the tone behind text. Here's how to perform sentiment analysis with VADER, a lexicon- and rule-based tool:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
def analyze_sentiment(text):
    analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores(text)
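A quick usage sketch, assuming the vaderSentiment package is installed; the sample sentence is made up:
# polarity_scores() returns a dict with 'neg', 'neu', 'pos', and 'compound' scores
print(analyze_sentiment("The product arrived quickly and works great!"))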
Data Structure Transformation
You may need to change the structure of your dataset depending on your needs. This may involve data reshaping and data format conversion.
Data Reshaping:
- Between Long and Wide Datasets: In wide format, each related variable will have its own column. In a long format, these variables are represented as observations within a single column. Converting a dataset from long to wide formats is called pivoting, and the reverse is called melting. Pandas has methods for both these operations:
import pandas as pd
# Wide format data
df_wide = pd.DataFrame({
'id': [1, 2],
'A': [5, 3],
'B': [6, 4]
})
df_long = pd.melt(df_wide, id_vars=['id'], value_vars=['A', 'B'], var_name='variable', value_name='value')
df_wide = df_long.pivot(index='id', columns='variable', values='value').reset_index().rename_axis(None, axis=1)
- Between Nested and Flat Datasets: The fundamental difference between nested and flat datasets is that in a nested dataset, a record may have sub-records. While you can use Pandas to interchange between nested and flat records, the logic depends on data arrangement.
import pandas as pd
# Example nested data
data = [
{"id": 1, "name": "John", "contacts": [{"type": "email", "value": "john@example.com"}, {"type": "phone", "value": "123456"}]},
{"id": 2, "name": "Doe", "contacts": [{"type": "email", "value": "doe@example.com"}]}
]
df = pd.DataFrame(data)
# Flatten contacts
contacts = df['contacts'].apply(pd.Series).stack().reset_index(level=1, drop=True).apply(pd.Series)
contacts.columns = ['type', 'value']
# Merge back with the original DataFrame (joining on the preserved index repeats each row once per contact)
df_flat = df.drop('contacts', axis=1).join(contacts)
# Flat to nested
nested_data = df_flat.groupby(['id', 'name']).apply(
lambda x: x[['type', 'value']].to_dict('records')
).reset_index().rename(columns={0: 'contacts'})
nested_data = nested_data.to_dict('records')
Data Format Conversion: Various formats serve different purposes:
- JSON: Flexible, hierarchical data representation
- CSV: Simple tabular data
- Parquet: Efficient columnar storage
- SQL: Relational database storage
Pandas has inbuilt methods to convert to and from these data formats:
import pandas as pd
# Suppose your dataset is in CSV
data = pd.read_csv('dataset.csv')
# Convert to JSON
data.to_json(orient='records')
# Convert to Parquet (requires pyarrow or fastparquet)
data.to_parquet('data.parquet')
# Convert to SQL
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
data.to_sql('data_table', engine, index=False)
Besides these, Pandas also supports several other data formats, including Excel, Feather, and Markdown.
Best Practices for Data Transformation
Follow these best practices to ensure efficiency and reliability in data transformation:
Version Control for Data: Version your datasets and document each transformation step. If a transformation goes wrong, you can quickly trace which step introduced the error.
Performance Optimization: Use techniques like parallel processing and chunking to improve performance; this helps you finish your analysis faster.
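For example, chunking with Pandas keeps memory usage flat on large files. Here is a minimal sketch, where the file name and chunk size are placeholders:
import pandas as pd
# Process a large scraped CSV in manageable chunks instead of loading it all at once
total_rows = 0
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    chunk = chunk.dropna()  # apply your per-chunk transformations here
    total_rows += len(chunk)
print(total_rows)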
Data Quality Assurance: Ensure that the transformed data is of high quality by implementing validation checks and automated testing pipelines; the quality of data transformation affects the quality of the overall analysis.
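Here is a minimal sketch of such checks on a hypothetical DataFrame with price and email columns; in practice these would run inside an automated test suite:
def validate(df):
    # Collect human-readable issues instead of failing on the first one
    issues = []
    if df['price'].lt(0).any():
        issues.append('Negative prices found')
    if df['email'].isnull().any():
        issues.append('Missing email addresses')
    if df.duplicated().any():
        issues.append('Duplicate rows found')
    return issues
print(validate(df))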
Ethical Considerations: Handle data ethically by complying with data protection regulations and anonymizing data when necessary; this will avoid unnecessary legal troubles.
Data Mapping: Creating Relationships Between Data Elements
Data mapping is a critical process that establishes how data fields from different sources relate to each other. This is especially important for scraped data, which often comes from multiple sources.
There are two types of data mapping:
- Field Mapping: Maps fields of one dataset to those of another so that integrated data ends up in the appropriate fields.
- Value Mapping: Maps the content of one dataset to that of another dataset to standardize values across all the datasets.
Suppose you have two datasets: one has customer information, and the other has sales data.
import pandas as pd
# Mock Customer Information
customer_info = pd.DataFrame({
'CustID': [1, 2, 3, 4],
'GenderCode': ['M', 'F', 'M', 'O'],
'EducationLevel': ['HS', 'Bach', 'Mast', 'HS']
})
# Mock Sales Data
sales_data = pd.DataFrame({
'Customer_ID': [1, 2, 3, 4],
'Sex': ['Male', 'Female', 'Male', 'Other'],
'Education': ['High School', 'Bachelor', 'Master', 'PhD'],
'Total_Sales': [1000, 2000, 3000, 4000]
})
The following code merges the two datasets by:
- Creating a field mapping
- Applying the field mapping using the rename() method
- Creating value mappings
- Applying the value mapping using the map method on appropriate columns
- Using the merge() method to combine two datasets
# Define field mapping
field_mapping = {
'CustID': 'Customer_ID', # Mapping 'CustID' from customer_info to 'Customer_ID' in sales_data
'GenderCode': 'Sex', # Although 'Sex' exists in sales_data, rename to keep consistency
'EducationLevel': 'Education' # Similarly, 'EducationLevel' to 'Education'
}
# Apply field mapping to customer info
customer_info = customer_info.rename(columns=field_mapping)
# Define value mappings
gender_mapping = {
'M': 'Male',
'F': 'Female',
'O': 'Other'
}
education_mapping = {
'HS': 'High School',
'Bach': 'Bachelor',
'Mast': 'Master'
}
# Apply value mappings to customer_info
customer_info['Sex'] = customer_info['Sex'].map(gender_mapping)
customer_info['Education'] = customer_info['Education'].map(education_mapping)
# Merge datasets on 'Customer_ID'
integrated_data = pd.merge(customer_info, sales_data, on='Customer_ID', how='outer')
print(integrated_data)
Data Mapping Best Practices
When implementing data mapping for web-scraped data, consider these essential practices:
Schema Validation: Ensure that both source and destination data structures match their expected formats. This prevents mapping errors and data corruption. Implement validation checks before applying mappings to catch inconsistencies early.
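Here is a minimal sketch of such a check against the sales_data frame from the example above; the expected schema is an assumption for illustration:
# Hypothetical expected schema: column names mapped to expected dtypes
expected_schema = {'Customer_ID': 'int64', 'Sex': 'object', 'Education': 'object'}
def check_schema(df, schema):
    # Report missing columns and unexpected dtypes before applying any mappings
    problems = []
    for column, dtype in schema.items():
        if column not in df.columns:
            problems.append(f"Missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    return problems
print(check_schema(sales_data, expected_schema))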
Type Handling: Address differences in data types between sources. For example, one source might represent prices as strings with currency symbols, while another uses decimal numbers. Your mapping logic should handle these conversions gracefully.
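Here is a minimal sketch of such a conversion, with made-up price strings:
import pandas as pd
# One source stores prices as strings with currency symbols, another as plain numbers
prices = pd.Series(['$1,299.00', ' $45.50', 899.99])
prices_clean = (
    prices.astype(str)
          .str.replace(r'[^\d.]', '', regex=True)  # strip '$', ',' and spaces
          .astype(float)
)
print(prices_clean)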
Null Handling: Determine how to proceed when source fields are missing or contain null values. Establish clear rules for these cases – whether to skip the mapping, use a default value, or raise an error.
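A small sketch of two such rules, using a made-up status field: fall back to a default, or fail fast instead of silently producing nulls:
import pandas as pd
df_source = pd.DataFrame({'status_code': ['Y', 'N', None, 'X']})  # hypothetical source data
status_mapping = {'Y': 'Active', 'N': 'Inactive'}
# Rule 1: use a default value for unmapped or missing codes
df_source['status'] = df_source['status_code'].map(status_mapping).fillna('Unknown')
# Rule 2 (alternative): raise an error instead of filling
# if df_source['status_code'].map(status_mapping).isnull().any():
#     raise ValueError('Unmapped or missing status codes found')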
Error Logging: Keep track of mapping failures and provide meaningful error messages. This helps in debugging and maintaining data quality. Consider implementing a process to quarantine problematic records for manual review.
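A minimal sketch of logging and quarantining mapping failures with Python's built-in logging module; the records below are made up:
import logging
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger('mapping')
gender_mapping = {'M': 'Male', 'F': 'Female', 'O': 'Other'}
records = [{'CustID': 1, 'GenderCode': 'M'}, {'CustID': 2, 'GenderCode': 'X'}]
quarantine = []  # problematic records set aside for manual review
for record in records:
    try:
        record['Sex'] = gender_mapping[record['GenderCode']]
    except KeyError:
        logger.warning("Unknown gender code %r in record %s", record['GenderCode'], record['CustID'])
        quarantine.append(record)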
Wrapping Up
You can transform and map scraped data using Python. Libraries such as Pandas, NumPy, and scikit-learn provide excellent methods for efficiently transforming and mapping your scraped data.
However, the quality of scraped data determines the efficiency of data mapping and transformations. This is where a web scraping service like ScrapeHero shines. It can provide high-quality data according to your specifications.
ScrapeHero is a fully-managed web scraping service capable of building enterprise-grade web scrapers and crawlers. Our services also include custom RPA and AI solutions. Let us handle the scraping part. Contact ScrapeHero now.