Web scraping using Python without large frameworks like Scrapy

Scrapy is a well-established framework for scraping, but it is also a heavy one. For smaller jobs it can be overkill, and for extremely large jobs it can be slow. If you would like to roll up your sleeves and perform web scraping in Python yourself, continue reading.

If you need publicly available data from the Internet, check whether it is already offered through public data sources or APIs before building a web scraper. Check the site’s FAQ section or search for its API endpoints and public datasets. Even if an API is available, you will still have to write some code to fetch the data and structure it according to your needs.

Here are some basic steps performed by most web spiders:

  1. Start with a URL and use an HTTP GET request to access it
  2. Fetch all the contents of the page and parse the data
  3. Store the data in a database or push it to a data warehouse
  4. Enqueue all the URLs found on the page
  5. Take the next URL from the queue and repeat from step 1 (a minimal sketch of this loop follows)
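
Here is a minimal Python 3 sketch of that loop, using only the standard library. The start URL, page limit, link regex and delay are illustrative assumptions, not a production crawler:

```python
import re
import time
import urllib.request
from collections import deque
from urllib.parse import urljoin

# Illustrative values; replace with the site you want to crawl
START_URL = "https://example.com/"
MAX_PAGES = 10

queue = deque([START_URL])
seen = {START_URL}
pages_fetched = 0

while queue and pages_fetched < MAX_PAGES:
    url = queue.popleft()

    # Step 1: HTTP GET request to the URL
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    pages_fetched += 1

    # Steps 2-3: parse the data and store it (here we just print the title)
    match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    if match:
        print(url, "->", match.group(1).strip())

    # Step 4: enqueue all the URLs found on the page
    for link in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, link)
        if link.startswith("http") and link not in seen:
            seen.add(link)
            queue.append(link)

    # Step 5: repeat, with a small delay to be polite (see the spider rules below)
    time.sleep(1)
```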

Here are the 3 major modules in every web crawler:

  1. Request/Response handler.
  2. Data parsing/data cleansing/data munging process.
  3. Data serialization/data pipelines.

Let’s look at each of these modules and see what they do and how to use them.

Request/Response Handler

The request/response handler is the component that makes HTTP requests to a URL or a group of URLs, fetches the response as HTML content, and passes that data on to the next module. In Python, the following libraries are most commonly used for this request/response handling:

  1. urllib (20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) – a basic Python library that provides a high-level interface for fetching data across the World Wide Web.
  2. urllib2 (20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – an extensible library that builds on urllib and handles basic HTTP requests, digest authentication, redirections, cookies and more.
  3. requests (Requests: HTTP for Humans) – a much more advanced request library, built on top of the basic request-handling libraries.
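
As a quick sketch of this module (assuming the third-party requests package is installed and using an illustrative URL), fetching a page and handing its HTML to the next module looks like this:

```python
import requests

# Illustrative target URL
url = "https://example.com/"

response = requests.get(url, timeout=10)
response.raise_for_status()   # raise an error for 4xx/5xx responses

html = response.text          # decoded HTML body, passed on to the parsing module
print(response.status_code, len(html))
```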

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned, transforming unstructured data into structured data. Usually this processing is done with a set of regular expressions (regexes) that perform pattern matching and text processing on the HTML data.

In addition to regexes, basic string manipulation and search methods are also used for this cleaning and transformation. You need a thorough knowledge of regular expressions to design the regex patterns.
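
For illustration, here is a small sketch that combines a regex with basic string methods to turn an HTML snippet into structured records. The markup and field names are assumptions, not taken from a real site:

```python
import re

# Assumed sample HTML; a real page's markup will differ
html = """
<div class="item"><h2>Blue Widget</h2><span class="price"> $19.99 </span></div>
<div class="item"><h2>Red Widget</h2><span class="price"> $24.50 </span></div>
"""

# Regex with named groups to pull out each name/price pair
pattern = re.compile(
    r'<h2>(?P<name>.*?)</h2>.*?<span class="price">(?P<price>.*?)</span>',
    re.S,
)

items = []
for match in pattern.finditer(html):
    items.append({
        "name": match.group("name").strip(),
        # basic string cleanup: strip whitespace and the currency symbol
        "price": float(match.group("price").strip().lstrip("$")),
    })

print(items)  # [{'name': 'Blue Widget', 'price': 19.99}, ...]
```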

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module serializes it according to the data models you require. This is the final module; it outputs data in a standard format that can be stored in databases or JSON/CSV files, or passed to a data warehouse for storage. For web scraping in Python, these tasks are usually performed by the libraries listed below:

  1. pickle (pickle – Python object serialization) –  This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure
  2. JSON (JSON encoder and decoder)
  3. CSV (https://docs.python.org/2/library/csv.html)
  4. Basic database interface libraries like pymongo (Tutorial – PyMongo), mysqldb (on python.org), sqlite3 (sqlite3 – DB-API interface for SQLite databases)

And many more such libraries, depending on the output format and the database or data store you use.
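
A brief sketch of this final step, writing the same kind of parsed records to JSON and CSV files with the standard library (the file names and the sample items are illustrative):

```python
import csv
import json

# Assume `items` is the list of dicts produced by the parsing module
items = [
    {"name": "Blue Widget", "price": 19.99},
    {"name": "Red Widget", "price": 24.50},
]

# JSON output
with open("items.json", "w") as f:
    json.dump(items, f, indent=2)

# CSV output
with open("items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(items)
```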

Basic spider rules

The main rule to follow while building a spider is to be nice to the sites you are scraping and to respect the crawling policies outlined in the site’s robots.txt.

Limit the number of requests you make per second and build enough delays into your spider so that you don’t adversely affect the site.

It just makes sense to be nice.
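
One simple way to follow these rules is a sketch like the one below, using the standard library’s robots.txt parser and a fixed delay between requests. The URL, user agent and delay value are assumptions; adjust them to the site you are crawling:

```python
import time
import urllib.robotparser

# Load and parse the site's robots.txt (illustrative URL)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

DELAY_SECONDS = 2  # assumed delay; adjust to the site's policy


def polite_fetch_allowed(url, user_agent="MySpider"):
    """Check robots.txt before requesting, then wait between requests."""
    if not rp.can_fetch(user_agent, url):
        return False
    time.sleep(DELAY_SECONDS)
    return True
```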

To learn more about web scraping in Python, check out our web scraping tutorials page.
