Are you wondering whether you can perform web scraping using RegEx? Yes, you can. However, RegEx-based scraping is more error-prone, so dedicated parsing libraries, like BeautifulSoup, are generally preferred.
But there is no harm in learning how to use RegEx for web scraping. It can solidify your RegEx and web scraping knowledge.
This article shows you how to use RegEx for web scraping without using any parser.
How RegEx works
Before using RegEx for web scraping, let’s be clear on the fundamentals.
RegEx, or regular expressions, work by searching for a pattern in a string. For example, suppose you want to find email addresses in a string. A pattern for that could be
\S+@\S+\.\S+
Here,
- \S is a non-whitespace character
- + means the preceding token should repeat one or more times
- @ matches the character itself
- \. matches the period. The period is a special character, requiring a backslash to escape it.
In short, the above pattern searches for a string that has one or more non-whitespace characters before and after the character ‘@’, followed by a period and another run of non-whitespace characters.
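The email pattern above can be tried out with a minimal sketch like the following. The sample sentences and addresses are made up for illustration. The second example shows a caveat: because \S matches any non-whitespace character, trailing punctuation can end up inside the match.

```python
import re

# The pattern from the text: non-whitespace, '@', non-whitespace, '.', non-whitespace
email_pattern = r'\S+@\S+\.\S+'

# Hypothetical sample text; the addresses are placeholders
text = "Contact sales@example.com or support@example.org for help."
emails = re.findall(email_pattern, text)
print(emails)

# Caveat: \S also matches punctuation, so a sentence-ending period is captured too
trailing = re.findall(email_pattern, "Write to admin@example.net.")
print(trailing)
```

A stricter pattern would be needed for production use; this one only illustrates how the individual tokens combine.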
Here is a RegEx cheat sheet you can use while using RegEx for web scraping.
Data Scraped Using RegEx
The tutorial shows web scraping using regular expressions with Python. The Python code uses RegEx to scrape eBay product data from its search results page. It extracts three data points:
- Name
- Price
- URL
Use the browser’s inspect tool to find the HTML source code of these data points. Right-click on a data point and click ‘Inspect’.
Web Scraping using RegEx: The Environment
The code in this tutorial uses three Python packages.
- The re module: This module enables you to use RegEx
- The json module: This module allows you to write the extracted data to a JSON file
- Python requests: This library has methods to manage HTTP requests
The re and json modules come with the Python standard library. So you don’t need to install them.
However, the Python requests library is not part of the standard library, so you must install it; you can do that using pip.
pip install requests
Web Scraping using RegEx: The Code
Import the packages mentioned above; you can do that with a single code line.
import re, requests, json
Make an HTTP request using the Python requests package to the eBay search results page; the request’s response will contain the HTML source code. You can use the get() method of Python requests to make the HTTP request.
response = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p4432023.m570.l1313&_nkw=smartphones&_sacat=0")
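The search URL above carries its query in the `_nkw` parameter. As a rough sketch, the URL can be assembled from a keyword with the standard library. The helper name `build_search_url` is hypothetical, and dropping the `_from` and `_trksid` tracking parameters is an assumption that they are not needed for the page to load.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebay.com/sch/i.html"

def build_search_url(keyword, category=0):
    """Build a search URL; build_search_url is a hypothetical helper name."""
    # _nkw holds the search keyword and _sacat the category, as in the URL above
    params = {"_nkw": keyword, "_sacat": category}
    return f"{BASE_URL}?{urlencode(params)}"

search_url = build_search_url("smartphones")
print(search_url)
```

The resulting string can be passed to requests.get() in place of the hard-coded URL.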
Extract all the div elements containing the product details from the response text. From these div elements, you can then extract the name, URL, and price. The findall() method of the re module can help you find the div elements.
The findall() method takes two arguments, a pattern, and a string. It checks for the pattern in the string and returns the matched values. Here, the pattern matches a string that
- Starts with ‘<div class="s-item__wrapper’
- Contains ‘<span class=s-item__price’
- Ends with ‘</div>’
products = re.findall(r'<div class="s-item__wrapper.+?>.+?<span class=s-item__price.+?<\/div>',response.text)
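To see how the lazy quantifier `.+?` keeps each match confined to one product, here is a small sketch run against a hand-written HTML string instead of a live response. The sample markup is an assumption modeled on the snippets in this article, not real eBay output. It also shows that `.` does not match a newline by default, so markup containing line breaks needs the re.DOTALL flag.

```python
import re

product_pattern = r'<div class="s-item__wrapper.+?>.+?<span class=s-item__price.+?<\/div>'

def make_product(name, price):
    # Hypothetical product markup, simplified from the article's snippets
    return (
        '<div class="s-item__wrapper clearfix">'
        f'<span role=heading aria-level=3><!--F#f_0-->{name}<!--F/--></span>'
        f'<span class=s-item__price><!--F#f_0--><!--F#f_0-->{price}<!--F/--><!--F/--></span>'
        '</div>'
    )

sample_html = make_product("Phone A", "$79.99") + make_product("Phone B", "$99.00")

# Lazy matching stops at the first '</div>' after each price span
products = re.findall(product_pattern, sample_html)
print(len(products))

# The same markup with a line break between the two spans of each product
multi_line_html = sample_html.replace('</span><span', '</span>\n<span')
without_flag = re.findall(product_pattern, multi_line_html)
with_flag = re.findall(product_pattern, multi_line_html, flags=re.DOTALL)
print(len(without_flag), len(with_flag))
```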
The extracted div elements will be within a list; you can iterate through this list and extract the required data points. A different RegEx pattern is required for each data point.
Extracting Name
The name will be inside a span tag with the role ‘heading’
<span role=heading aria-level=3> <!--F#f_0-->HP Chromebook 11 G6 11.6" Intel 2.40 GHz 4GB RAM 16GB eMMC Bluetooth Webcam<!--F/--> </span>
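Therefore, the RegEx pattern to extract the name should
- Start with `<span role=heading`
- Skip the comment tags around the name and capture the text between them
- End with `<\/span>`.
name_pattern = r'<span role=heading.*?>\s*<.+?>(.+?)<.+?>\s*<\/span>'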
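As a quick check, the sketch below runs a name pattern against the sample span shown earlier. The pattern here is a variant that is anchored on `<span role=heading` and tolerates optional whitespace around the comment tags. Treat it as one possible way to write the pattern, since the exact markup may differ on the live page.

```python
import re

# Sample span copied from the article's example markup
sample_span = (
    '<span role=heading aria-level=3> <!--F#f_0-->HP Chromebook 11 G6 11.6" Intel '
    '2.40 GHz 4GB RAM 16GB eMMC Bluetooth Webcam<!--F/--> </span>'
)

# Variant pattern: anchored on the heading span; \s* absorbs optional spaces
name_pattern = r'<span role=heading.*?>\s*<.+?>(.+?)<.+?>\s*<\/span>'

match = re.search(name_pattern, sample_span)
name = match.group(1) if match else None
print(name)
```

Checking the match object before calling group() avoids an AttributeError when a snippet has no heading span.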
Extracting Price
The price will be inside a span element with the class ‘s-item__price’
<span class=s-item__price> <!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--> </span>
Therefore, the RegEx pattern to extract the price should
- Start with `<span class=s-item__price>`
- End with `<\/span>`
price_pattern = r'<span class=s-item__price><.+?><.+?>(.+?)<.+?><.+?><\/span>'
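The price extraction can be sketched the same way against the sample span above. The only change from the article's pattern is the added `\s*`, an assumption that makes the pattern tolerate the spaces visible in the sample markup.

```python
import re

# Sample span copied from the article's example markup
sample_span = '<span class=s-item__price> <!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--> </span>'

# Two comment tags before the price and two after it; \s* absorbs optional spaces
price_pattern = r'<span class=s-item__price>\s*<.+?><.+?>(.+?)<.+?><.+?>\s*<\/span>'

match = re.search(price_pattern, sample_span)
price = match.group(1) if match else None
print(price)
```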
Extracting URL
The URL will be inside an anchor tag as the href attribute.
<a data-interactions='[{"actionKind":"NAVSRC","interaction":"wwFVrK2vRE0lhQQ0MDFKNDlCQ0VOSzQzR001MTNZUEFBWUVYOVY0MDFKNDlCQ0VKWVBDVlhCNDBIUVJGNjFRU0MAAAg3NDAwDE5BVlNSQwA="}]' target=_blank data-s-03wf764='{"eventFamily":"LST","eventAction":"ACTN","actionKind":"NAVSRC","actionKinds":["NAVSRC"],"operationId":"2351460","flushImmediately":false,"eventProperty":{"$l":"62869668414928"}}' _sp=p2351460.m1686.l7400 class=s-item__link href=https://www.ebay.com/itm/115603286837?hash=item1aea7e2b35:g:nV0AAOSw~wdjfE2o&itmprp=enc%3AAQAJAAAA4AsnK4hoYAelsNVt8vNwmOQEIEKRSBTpVOI4Fzbr7cY0wK4o3g%2BrtWEJhLd2tsPKiwUrnIGyzEdYgBJOcmzcLc1%2FC4tkCFZqpT5nMaUS3UHfDgnh%2FeiHBaoBh%2BjUmuHYeZzx45Agc8Zvj897LpZpEWGXKSH%2BHigaqb%2BZETNr3mFR9d7CbpBPZ%2BxtTxJxa6HdFw%2BaFHzqxi3xQ3hBXOP9NuoOZo631pXyyFqCMy4eTMG1UcJrvFr5eAspGuquv8tgpImBsQ2ndtFEiB6zKuMfsFvQQI3hdTAvWd926KbKYgqF%7Ctkp%3ABFBM1Omxq6Jk>
Therefore, the RegEx pattern to extract the URL should
- Start with `href=`
- End at the first space or at the `>` that closes the tag, whichever comes first.
url_pattern = r'href=(https:[^\s>]+)'
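Because the href value is unquoted, the URL ends at the first whitespace character or at the `>` that closes the tag. The following sketch uses a character class for that boundary, which is one possible way to write it. The two anchor tags are shortened, hypothetical versions of the sample above.

```python
import re

# Variant pattern: capture everything after 'href=' up to whitespace or '>'
url_pattern = r'href=(https:[^\s>]+)'

# Hypothetical, shortened anchor tags: href as the last attribute, then as a middle one
href_last = '<a target=_blank class=s-item__link href=https://www.ebay.com/itm/115603286837?hash=item1aea7e2b35>'
href_middle = '<a href=https://www.ebay.com/itm/115603286837?hash=item1aea7e2b35 target=_blank>'

url_last = re.search(url_pattern, href_last).group(1)
url_middle = re.search(url_pattern, href_middle).group(1)
print(url_last)
print(url_middle)
```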
Note: The above patterns are specific to the eBay search results page. Analyze the HTML source code to determine the appropriate RegEx patterns in each project.
You can use the above patterns to extract data from each div element. Create an empty list, iterate through the extracted div elements, and in each iteration
nameAndUrl = []
for product in products:
1. Extract name, price, and URL
    name = re.search(name_pattern, product).group(1)
    price = re.search(price_pattern, product).group(1)
    url = re.search(url_pattern, product).group(1)
2. Store them in a dict and append it to the list. Here, the patterns also match strings that are not required. So use a conditional statement while appending; specifically, it should not append the values if the name is ‘Shop on eBay’ or contains the character ‘<’
    if name != 'Shop on eBay' and '<' not in name:
        nameAndUrl.append(
            {
                "Name": name,
                "Price": price,
                "URL": url
            }
        )
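The two steps above can be combined into one function that also guards against a missing match, since re.search() returns None when a pattern is not found. This is a minimal sketch under stated assumptions: the helper name `parse_product`, the sample snippets, and the whitespace-tolerant pattern variants are illustrative and not taken from the live page.

```python
import re

# Pattern variants assumed for this sketch (anchored, whitespace-tolerant)
name_pattern = r'<span role=heading.*?>\s*<.+?>(.+?)<.+?>\s*<\/span>'
price_pattern = r'<span class=s-item__price>\s*<.+?><.+?>(.+?)<.+?><.+?>\s*<\/span>'
url_pattern = r'href=(https:[^\s>]+)'

def parse_product(product):
    """Return a dict for one product snippet, or None if it should be skipped."""
    matches = [re.search(p, product) for p in (name_pattern, price_pattern, url_pattern)]
    if not all(matches):
        return None  # at least one data point is missing
    name, price, url = (m.group(1).strip() for m in matches)
    if name == "Shop on eBay" or "<" in name:
        return None  # placeholder entry or leftover markup
    return {"Name": name, "Price": price, "URL": url}

def make_product(name, item_id, price=None):
    # Hypothetical markup modeled on the article's snippets
    price_span = (
        f'<span class=s-item__price><!--F#f_0--><!--F#f_0-->{price}<!--F/--><!--F/--></span>'
        if price else ''
    )
    return (
        '<div class="s-item__wrapper clearfix">'
        f'<a class=s-item__link href=https://www.ebay.com/itm/{item_id} target=_blank>'
        f'<span role=heading aria-level=3><!--F#f_0-->{name}<!--F/--></span></a>'
        f'{price_span}</div>'
    )

products = [
    make_product("Phone A", 1, "$79.99"),
    make_product("Shop on eBay", 2, "$20.00"),  # placeholder entry, skipped
    make_product("Phone C", 3),                 # no price span, skipped
]

nameAndUrl = [record for record in map(parse_product, products) if record]
print(nameAndUrl)
```

Returning None for incomplete snippets keeps one malformed product from stopping the whole run.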
Finally, you can save the list as a JSON file using the json module. To do so, use json.dump().
with open("regEx.json","w",encoding="utf-8") as f:
json.dump(nameAndUrl,f,indent=4,ensure_ascii=False)
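The ensure_ascii=False argument matters when product names contain non-ASCII characters. The sketch below uses json.dumps(), which returns a string instead of writing a file, to compare both settings. The record is made up for illustration.

```python
import json

# Hypothetical record with accented characters in the name
records = [{"Name": "Téléphone X", "Price": "$79.99", "URL": "https://www.ebay.com/itm/1"}]

# ensure_ascii=False keeps the characters readable in the output
readable = json.dumps(records, indent=4, ensure_ascii=False)

# The default (ensure_ascii=True) escapes them as \uXXXX sequences
escaped = json.dumps(records, indent=4)

print(readable)
print(escaped)
```

Both forms load back to the same data; the difference only affects how the file looks when opened in an editor.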
Code Limitations
The code shown in this tutorial is only efficient if the HTML source code is well-structured. For complex, highly nested HTML source code, web scraping using RegEx can become slow.
Moreover, a slight change in the HTML code can break the code. For example, a change in spacing or the order of attributes may render the code unusable even if the attributes and the tag names of the data points remain unchanged.
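This fragility is easy to reproduce. The sketch below applies the price pattern from this tutorial to three hand-written variants of the same span: the original form, one with the class value in quotes, and one with extra spaces. The variants are hypothetical examples of the kind of change described above.

```python
import re

price_pattern = r'<span class=s-item__price><.+?><.+?>(.+?)<.+?><.+?><\/span>'

variants = {
    "original": '<span class=s-item__price><!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--></span>',
    "quoted":   '<span class="s-item__price"><!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--></span>',
    "spaced":   '<span class=s-item__price> <!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--> </span>',
}

results = {}
for label, html in variants.items():
    match = re.search(price_pattern, html)
    # None means the pattern no longer matches, even though the price is still there
    results[label] = match.group(1) if match else None

print(results)
```

Only the original form matches; the other two return None although the price text is unchanged.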
The code does not bypass anti-scraping measures. Hence, it is not appropriate for large-scale web scraping, as the massive number of requests makes your scraper more susceptible to these measures.
Why Code Yourself? Use ScrapeHero’s Web Scraping Service
The code can scrape three data points from an eBay search results page, showing web scraping with RegEx in Python.
However, maintaining RegEx code can be challenging, as slight changes in the HTML can break it. Moreover, trying to scrape additional data points requires complex RegEx that can slow down the process.
Therefore, it is better to use a professional web scraping service, like ScrapeHero, for large-scale projects where scalability is important.
ScrapeHero’s web scraping service can build enterprise-grade web scrapers and crawlers according to your specifications. This way, you can focus on using the data to derive insights rather than gathering it. Contact ScrapeHero now for high-quality data.