You no longer need to manage servers or infrastructure to run web scrapers. With serverless web scraping, you can run scrapers on a schedule or only when needed, and you pay only for the time they run.
In this article, you’ll learn how to deploy web scrapers using AWS Lambda, Azure Functions, and Google Cloud Functions step by step.
Deploying Web Scrapers on Serverless Platforms
Serverless platforms help run scrapers in a modern way without the hassle of managing servers or scaling infrastructure, saving both time and money.
With serverless services like AWS Lambda, Azure Functions, and Google Cloud Functions, your code runs in response to triggers such as timers or HTTP requests.
These cloud providers handle the provisioning, scaling, and maintenance, making them ideal for scheduled scrapes, event-based triggers, and lightweight data extraction jobs.
Deploying Web Scrapers on AWS Lambda
Why AWS Lambda?
AWS Lambda is ideal for short, recurring scraping tasks. Since it supports Python (among other languages) and integrates easily with CloudWatch Events, you can schedule jobs without needing a separate task scheduler or server.
1. Develop the Scraper
This is a basic scraper using requests and BeautifulSoup that extracts an <h1> tag from a web page:
import requests
from bs4 import BeautifulSoup

def lambda_handler(event, context):
    url = "https://example.com"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    data = soup.find("h1").text if soup.find("h1") else "No H1 tag found"
    return {
        "statusCode": 200,
        "body": data
    }
When AWS Lambda invokes this function, it returns the scraped data as the response.
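Before packaging, you can invoke the handler locally to confirm it works. This is a minimal sketch, assuming the code above is saved as lambda_function.py:

from lambda_function import lambda_handler

if __name__ == "__main__":
    # Lambda normally supplies a real event and context; empty values are enough for a quick smoke test
    print(lambda_handler({}, None))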
2. Package Dependencies
AWS Lambda does not install packages at runtime, so you need to bundle your dependencies together with your code into a .zip file:
mkdir scraper && cd scraper
pip install requests beautifulsoup4 -t .
cp ../lambda_function.py .
zip -r function.zip .
The -t . flag tells pip to install packages into the current directory. When uploading to Lambda, remember to include all your files and libraries in the zip file.
3. Create and Deploy the Lambda Function
Go to the AWS Management Console, navigate to Lambda, and follow these steps (an AWS CLI alternative is shown after the list):
- Click Create function
- Choose “Author from scratch”
- Select the Python runtime (e.g., Python 3.11)
- Upload your function.zip under the Code section
- Set the handler name to lambda_function.lambda_handler (or adjust based on your filename)
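If you prefer the command line, you can create the same function with the AWS CLI instead of the console. This is a rough sketch, assuming a function name of my-scraper, the function.zip from the previous step, and an existing IAM execution role (the account ID and role name are placeholders):

aws lambda create-function \
  --function-name my-scraper \
  --runtime python3.11 \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://function.zip \
  --role arn:aws:iam::<ACCOUNT_ID>:role/<LAMBDA_EXECUTION_ROLE> \
  --timeout 30

The --timeout flag raises the default 3-second limit, which is usually too short for network requests.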
4. Set Triggers Using CloudWatch
To automate scraping, set up a CloudWatch Events rule:
- Go to CloudWatch > Rules > Create rule
- Under Event Source, select “Schedule”
- Use a cron or rate expression like rate(1 hour) or cron(0 12 * * ? *)
- Under Targets, select your Lambda function
Now, your scraper runs automatically at fixed intervals without any extra servers or schedulers.
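The same schedule can also be wired up from the CLI using the CloudWatch Events/EventBridge commands. A hedged sketch, with the rule name, function name, region, and account ID as placeholder values:

# Create a rule that fires every hour
aws events put-rule --name scrape-hourly --schedule-expression "rate(1 hour)"

# Allow the rule to invoke the function
aws lambda add-permission \
  --function-name my-scraper \
  --statement-id scrape-hourly-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:<REGION>:<ACCOUNT_ID>:rule/scrape-hourly

# Attach the function as the rule's target
aws events put-targets --rule scrape-hourly \
  --targets '[{"Id":"1","Arn":"arn:aws:lambda:<REGION>:<ACCOUNT_ID>:function:my-scraper"}]'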
Deploying Web Scrapers on Azure Functions
Azure Functions offers a solid serverless option, especially for those who are already using Microsoft tools or hosting in Azure.
With Azure Functions, you get built-in scaling, support for HTTP and timer-based triggers, and tight integration with services like Azure Storage, Event Grid, and Application Insights.
It’s an excellent choice for lightweight web scraping jobs that run periodically or need to be triggered by HTTP requests.
1. Develop the Scraper
Azure Functions require a specific function signature. Here’s an example that uses requests to fetch a page and is exposed as an HTTP endpoint:
import requests
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    url = "https://example.com"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    return func.HttpResponse(response.text, status_code=200)
The function runs when it receives an HTTP request. You can modify it to parse specific elements from the HTML using BeautifulSoup.
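For instance, here is a minimal sketch that parses the <h1> tag, mirroring the Lambda example above (it assumes beautifulsoup4 is listed in requirements.txt):

import requests
import azure.functions as func
from bs4 import BeautifulSoup

def main(req: func.HttpRequest) -> func.HttpResponse:
    url = "https://example.com"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    # Return the first <h1> text, or a fallback message if none is found
    h1 = soup.find("h1")
    return func.HttpResponse(h1.text if h1 else "No H1 tag found", status_code=200)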
2. Project Structure
Azure Functions follow a specific folder layout. A basic Python function will look like this:
MyScraperFunction/
├── __init__.py        # Your main Python code
├── function.json      # Trigger and binding configuration
└── requirements.txt   # Dependencies
function.json defines the trigger and return type:
{
  "bindings": [
    {
      "authLevel": "function",
      "type": "httpTrigger",
      "direction": "in",
      "name": "req",
      "methods": ["get"]
    },
    {
      "type": "http",
      "direction": "out",
      "name": "$return"
    }
  ]
}
requirements.txt lists your dependencies. For this example:
requests
If you plan to parse HTML, add beautifulsoup4 as well.
3. Deploy Using Azure CLI
Azure provides a CLI tool to set up and deploy your function app.
# Initialize a new Azure Function app with Python
func init MyScraperFunction --python
cd MyScraperFunction
# Create a new function named "scraper" using the HTTP trigger template
func new --name scraper --template "HTTP trigger"
# Log in to your Azure account
az login
# Publish the function app to your Azure subscription
func azure functionapp publish <FUNCTION_APP_NAME>
Before publishing, make sure that you have already created an Azure Function App in the Azure Portal.
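If you would rather create the Function App from the CLI as well, something along these lines should work on a consumption plan; the resource group, storage account, region, and app name are placeholders you need to replace:

az functionapp create \
  --resource-group <RESOURCE_GROUP> \
  --consumption-plan-location <REGION> \
  --runtime python \
  --runtime-version 3.11 \
  --functions-version 4 \
  --os-type Linux \
  --name <FUNCTION_APP_NAME> \
  --storage-account <STORAGE_ACCOUNT>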
4. Set Triggers
Azure Functions created using the HTTP template respond to web requests by default. But you can also run your scraper on a schedule using a timer trigger.
To use a timer, update function.json:
{
  "bindings": [
    {
      "name": "mytimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 */1 * * * *"
    }
  ]
}
The schedule field uses a six-field CRON expression; the one above runs every minute.
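Switching to a timer trigger also changes the Python signature, because the function now receives a TimerRequest instead of an HttpRequest and returns nothing. A minimal sketch:

import logging
import requests
import azure.functions as func

def main(mytimer: func.TimerRequest) -> None:
    # Runs on the schedule defined in function.json
    url = "https://example.com"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    logging.info("Scraped %d bytes from %s", len(response.content), url)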
Deploying Web Scrapers on Google Cloud Functions
Google Cloud Functions is ideal for lightweight web scraping tasks, especially if you’re already using other Google Cloud services like BigQuery, Cloud Storage, or Pub/Sub.
These functions are event-driven and serverless. They can be triggered via HTTP requests or scheduled jobs using Cloud Scheduler.
Google Cloud Functions also support Python and are easy to deploy with minimal setup.
1. Develop the Scraper
You can write a simple function that accepts an HTTP request and returns extracted content. Here’s a basic example using requests:
import requests

def scrape(request):
    url = "https://example.com"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    return response.text
You can expand this to parse specific HTML elements using BeautifulSoup, or return JSON based on extracted data.
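For example, here is a hedged sketch that extracts the <h1> tag and returns JSON; Python Cloud Functions receive a Flask request object, and returning a (body, status, headers) tuple is a standard way to shape the response (beautifulsoup4 would need to be added to requirements.txt):

import json
import requests
from bs4 import BeautifulSoup

def scrape(request):
    url = "https://example.com"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    h1 = soup.find("h1")
    data = {"url": url, "h1": h1.text if h1 else None}
    # The tuple maps to (response body, status code, headers)
    return json.dumps(data), 200, {"Content-Type": "application/json"}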
2. Project Structure
Google Cloud Functions require a minimal set of files:
my-scraper/
├── main.py # Your main function logic
├── requirements.txt # Your Python dependencies
main.py must include the scrape(request) function.
requirements.txt should include:
requests
If you’re parsing HTML, add beautifulsoup4 as well.
3. Deploy Using gcloud CLI
Before deploying your function, make sure that you enable billing on your Google Cloud project and select it using gcloud config set project [PROJECT_ID].
Then deploy your function:
gcloud functions deploy scrape \
  --runtime python311 \
  --trigger-http \
  --allow-unauthenticated \
  --entry-point scrape
Note that:
- --runtime python311 sets the Python version
- --trigger-http makes the function respond to HTTP requests
- --allow-unauthenticated allows public access; remove it if you want to restrict access
After deployment, you’ll get a public URL for your function.
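You can confirm the function works with a quick request to that URL (the hostname below is a placeholder; use the URL printed by gcloud):

curl "https://REGION-PROJECT.cloudfunctions.net/scrape"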
4. Set Triggers with Cloud Scheduler
You can use Cloud Scheduler, Google Cloud’s equivalent of a cron job, to run the scraper on a schedule.
gcloud scheduler jobs create http scrape-job \
  --schedule="0 * * * *" \
  --uri=https://REGION-PROJECT.cloudfunctions.net/scrape \
  --http-method=GET
Replace the --uri value with the actual function URL from your deployment. Cloud Scheduler cannot invoke the function if it isn’t publicly accessible, so if you restrict access, you must grant Cloud Scheduler permission to call it.
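If you keep the function private, one common option is to let Cloud Scheduler authenticate with a service account via an OIDC token. A hedged sketch, assuming a service account that has been granted the Cloud Functions Invoker role (the email is a placeholder):

gcloud scheduler jobs create http scrape-job \
  --schedule="0 * * * *" \
  --uri=https://REGION-PROJECT.cloudfunctions.net/scrape \
  --http-method=GET \
  --oidc-service-account-email=scheduler-invoker@PROJECT_ID.iam.gserviceaccount.com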
Why ScrapeHero Web Scraping Service?
Even with serverless technology, scraping at a large scale isn’t straightforward. Function timeouts (AWS Lambda: 15 minutes, Google Cloud Functions: 9 minutes, Azure Functions: 5 minutes by default) can kill long jobs.
In serverless web scraping, you may also face challenges like IP bans and rate limits, which demand expert handling, meticulous setup, and monitoring.
ScrapeHero can take this burden off your hands. Using the ScrapeHero web scraping service, you outsource your data collection, and we handle everything from anti-bot measures to data formatting.
We are a fully managed, enterprise-grade web scraping service with a decade of experience ensuring high-quality, error-free data.
Frequently Asked Questions
What is serverless web scraping?
Serverless web scraping means running scrapers on cloud platforms such as AWS Lambda, Azure Functions, or Google Cloud Functions without managing infrastructure. Since you only pay for execution time, it is cost-efficient, but it may not be suitable for large-scale web scraping.
Which programming languages can you use for serverless web scraping?
You can use programming languages like Python, Node.js, and Go for serverless web scraping.
Can you run web scrapers on Azure Functions?
Yes. Azure Functions supports small-scale jobs with proper rate limits. Its default timeout is 5 minutes.
Can you run web scrapers on Google Cloud Functions?
Yes, you can use GCP Functions, which is ideal for modular tasks split across multiple functions. It allows up to 9 minutes of execution per invocation.