Scrapy is a well-established framework for scraping, but it is also a very heavy one. For smaller jobs it may be overkill, and for extremely large jobs it can be slow. If you would like to roll up your sleeves and perform web scraping in Python, continue reading.
If you need publicly available data from the Internet, check whether it is already available from public data sources or APIs before creating a web scraper. Check the site’s FAQ section, or search Google for its API endpoints and public data. Even when API endpoints are available, you still have to write a parser to fetch the data and structure it according to your needs.
Here are some basic steps performed by most web spiders:
- Start with a URL and use an HTTP GET or POST request to access the URL
- Fetch all the contents in it and parse the data
- Store the data in any database or put it into any data warehouse
- Enqueue all the URLs in a page
- Take the next URL from the queue and repeat from step 1
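The steps above can be sketched as a small breadth-first loop. This is only an illustration under stated assumptions: `crawl`, `LinkCollector`, `FAKE_SITE` and the `max_pages` limit are hypothetical names, the "database" is a plain dictionary, and the `fetch` argument is any callable that returns HTML text for a URL (here a fake in-memory site, so the sketch runs without network access).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to its HTML text."""
    queue = deque([start_url])
    seen = {start_url}
    store = {}  # stand-in for a database: URL -> raw HTML
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)            # steps 1-2: request the URL, fetch its contents
        store[url] = html            # step 3: store the data
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:  # step 4: enqueue every new URL on the page
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store                     # step 5 is the next iteration of the loop


# Hypothetical in-memory "site" used instead of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": "<p>no links here</p>",
}
pages = crawl("http://example.com/", lambda url: FAKE_SITE.get(url, ""))
```

The `seen` set keeps the spider from enqueueing the same URL twice, and `max_pages` is a simple guard against crawling without end.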
Here are the 3 major modules in every web crawler:
- Request/Response handler.
- Data parsing/data cleansing/data munging process.
- Data serialization/data pipelines.
Let’s look at each of these modules and see what they do and how to use them.
Request/Response Handler
Request/response handlers are managers that make HTTP requests to a URL or a group of URLs, fetch the response objects as HTML content, and pass this data to the next module. In Python, the following libraries are most commonly used for this request/response step:
- urllib (Python 2.7 documentation: “urllib – Open arbitrary resources by URL”) – a basic standard-library module that offers a high-level interface for fetching data across the World Wide Web.
- urllib2 (Python 2.7 documentation: “urllib2 – extensible library for opening URLs”) – an extensible companion to urllib that handles basic HTTP requests, digest authentication, redirections, cookies and more.
- requests (“Requests: HTTP for Humans”) – a more advanced, higher-level request library built on top of the basic request handling libraries.
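As a minimal sketch of a request/response handler, the function below uses only the standard library. It assumes Python 3, where the functionality of urllib and urllib2 lives in `urllib.request`. The `fetch` name, the `USER_AGENT` string and the 10-second timeout are illustrative choices, not part of any library.

```python
import urllib.request

# Hypothetical identifier for this spider; use one that describes your own project.
USER_AGENT = "example-spider/0.1"


def fetch(url, timeout=10):
    """Open `url` and return the response body decoded to text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access, so it is left commented out):
# html = fetch("https://example.com/")
```

The returned text is what gets handed to the parsing module. With the requests library, the equivalent call would be `requests.get(url, timeout=10).text`.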
Data parsing/data cleansing/data munging process
This is the module where the fetched data is processed and cleaned, and where unstructured data is transformed into structured data. Usually this is done with a set of regular expressions (regexes) that perform pattern matching and text processing on the HTML.
In addition to regexes, basic string manipulation and search methods are also used for this cleaning and transformation. You need a thorough knowledge of regular expressions to design the regex patterns.
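To illustrate the regex-plus-string-methods approach, here is a small sketch that turns a fragment of HTML into structured records. The markup in `SAMPLE_HTML`, the `ITEM_PATTERN` expression and the `parse_items` helper are all made up for this example; a real page needs patterns designed for its own markup.

```python
import html
import re

# Hypothetical markup standing in for a fetched page.
SAMPLE_HTML = """
<ul>
  <li class="item"><span class="name">  Blue Mug </span><span class="price">$4.50</span></li>
  <li class="item"><span class="name">Salt &amp; Pepper Set</span><span class="price">$12.00</span></li>
</ul>
"""

# One named group per field we want to extract.
ITEM_PATTERN = re.compile(
    r'<span class="name">(?P<name>.*?)</span>\s*'
    r'<span class="price">\$(?P<price>\d+\.\d{2})</span>',
    re.DOTALL,
)


def parse_items(page):
    """Return a list of {"name": str, "price": float} dicts found in `page`."""
    items = []
    for match in ITEM_PATTERN.finditer(page):
        name = html.unescape(match.group("name"))  # decode entities such as &amp;
        name = " ".join(name.split())              # trim and collapse whitespace
        items.append({"name": name, "price": float(match.group("price"))})
    return items


items = parse_items(SAMPLE_HTML)
```

The regex does the pattern matching, while `html.unescape` and the `split`/`join` pair handle the cleaning that plain string methods are well suited for.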
Data serialization/data pipelines
Once you get the cleaned data from the parsing and cleaning module, the data serialization module serializes it according to the data models you require. This is the final module, and it outputs data in a standard format that can be stored in databases, written to JSON/CSV files, or passed to a data warehouse for storage. In Python, these tasks are usually performed by the libraries listed below:
- pickle (“pickle – Python object serialization”) – implements a fundamental but powerful algorithm for serializing and de-serializing a Python object structure
- json (“JSON encoder and decoder”)
- csv (https://docs.python.org/2/library/csv.html)
- Basic database interface libraries such as pymongo (“Tutorial – PyMongo”), mysqldb (on python.org) and sqlite3 (“sqlite3 – DB-API interface for SQLite databases”)
Many more such libraries exist, depending on the output format and the database or data store you use.
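As a rough sketch of the serialization step, the snippet below writes the same records as JSON and as CSV with the standard `json` and `csv` modules. The `records` list and the `to_json` and `to_csv` helpers are hypothetical; the output is built in memory here, and in practice you would write it to a file or hand it to a database driver.

```python
import csv
import io
import json

# Hypothetical cleaned records coming out of the parsing module.
records = [
    {"name": "Blue Mug", "price": 4.5},
    {"name": "Salt & Pepper Set", "price": 12.0},
]


def to_json(rows):
    """Serialize the rows as a JSON array of objects."""
    return json.dumps(rows, indent=2, sort_keys=True)


def to_csv(rows, fieldnames):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records, ["name", "price"])
```

Passing an explicit `fieldnames` list fixes the column order of the CSV, which keeps the output stable when a record happens to list its keys in a different order.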
Basic spider rules
The rules to follow while building a spider are to be nice to the sites you are scraping and to follow the spider policies outlined in the site’s robots.txt.
Limit the number of requests per second and build enough delays into your spiders so that you don’t adversely affect the site.
It just makes sense to be nice.
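Both rules can be sketched with the standard `urllib.robotparser` module. The robots.txt content, the `USER_AGENT` name, the one-second default delay and the `build_rules` and `polite_fetch` helpers are assumptions made for this illustration. A real spider would load the live file with `RobotFileParser.set_url()` and `read()` instead of parsing a string.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-spider"  # hypothetical spider name

# Hypothetical robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_fetch(url, rules, fetch, default_delay=1.0, sleep=time.sleep):
    """Fetch `url` only if robots.txt allows it, then pause before returning."""
    if not rules.can_fetch(USER_AGENT, url):
        return None  # disallowed by the site's spider policy
    body = fetch(url)
    # Honour the site's Crawl-delay if it declares one, else use our own delay.
    sleep(rules.crawl_delay(USER_AGENT) or default_delay)
    return body


rules = build_rules(ROBOTS_TXT)
```

Here `fetch` is any function that returns the page body for a URL, and `sleep` is a parameter only so that the pause can be swapped out in tests.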
To learn more on web scraping in Python check out our web scraping tutorials page.