Scraper Blocked by Amazon? IP Rotation for Scraping Can be the Answer


IP rotation for scraping is one way to evade Amazon’s anti-scraping measures. It involves masking your IP address by cycling through different proxies while accessing Amazon’s website, whether you make plain HTTP requests or drive an automated browser. Confused about how to proceed? Read on.

This tutorial shows you how to manage IP rotation using various Python libraries.

IP Rotation for Scraping Static Amazon Pages

The simplest way to scrape Amazon with IP rotation is to use Python’s urllib or the requests library. Here’s how you can rotate proxies with these libraries:

Using Python requests

Import requests and, from itertools, import cycle.

import requests
from itertools import cycle

The cycle class lets you iterate through a list cyclically: after the last item in the list, you get the first item again.
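
For instance, this quick check shows how next() wraps back to the start of the list:

# cycle repeats the sequence indefinitely: a, b, c, a, b, c, ...
pool = cycle(['a', 'b', 'c'])
print(next(pool), next(pool), next(pool), next(pool))  # a b c a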

Next, store the available proxies in a list.

# Map both HTTP and HTTPS traffic through each proxy
proxies = [
    {'http': 'http://proxy1:8080', 'https': 'http://proxy1:8080'},
    {'http': 'http://proxy2:8080', 'https': 'http://proxy2:8080'},
    {'http': 'http://proxy3:8080', 'https': 'http://proxy3:8080'}
]

Use this list to create an object of the class cycle.

proxy_pool = cycle(proxies)

The next step is to make HTTP requests. For convenience, you can create a dedicated function to make an HTTP request. This function will

  1. Get the next proxy from the pool using next().
  2. Make an HTTP request to Amazon’s URL.
# Using requests
def make_request(url):
    # Get the next proxy from the pool
    proxy = next(proxy_pool)
    # Request the URL through the proxy
    response = requests.get(url, proxies=proxy, timeout=10)
    if response.status_code != 200:
        raise Exception('Failed')
    return response.text

Want to learn more about using Python requests for web scraping? Read this article on web scraping with Python requests.

Using urllib

When using urllib, you need to create a custom opener for each proxy:

Start by importing these from urllib.request:

  • ProxyHandler: To handle proxies
  • build_opener: Builds an opener that sends HTTP requests through handlers like ProxyHandler
from urllib.request import ProxyHandler, build_opener

Next, define a function to accept a URL and make an HTTP request with the proxy. This function:

  1. Gets the next proxy from the list defined previously
  2. Creates a ProxyHandler object using the obtained proxy
  3. Builds an opener using the proxy handler object
  4. Tries to make an HTTP request using the opener
  5. Raises an exception if the response status is not 200
def make_urllib_request(url):
    # Get the next proxy from the pool
    proxy = next(proxy_pool)
    # Create a handler and an opener that route requests through the proxy
    handler = ProxyHandler(proxy)
    opener = build_opener(handler)
    response = opener.open(url, timeout=10)
    # urllib responses expose the HTTP status via getcode(), not status_code
    if response.getcode() != 200:
        raise Exception('Failed')
    return response.read().decode('utf-8')

You can now call either function whenever you need to make a request to the URL with a new IP address. Use a loop, and in each iteration, try calling make_request():

for i in range(len(proxies) * 2):
    try:
        make_request(url)
        break
    except Exception:
        continue

The number of iterations depends on the number of attempts you want to make with each proxy.

For instance, if you want to try each proxy twice, the number of iterations should be twice the number of proxies in your pool.
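
To make that relationship explicit, here’s a minimal sketch of a retry wrapper; the fetch_with_retries name and the attempts_per_proxy parameter are illustrative additions, not part of the original code:

def fetch_with_retries(url, attempts_per_proxy=2):
    # Total attempts = number of proxies x attempts per proxy
    for _ in range(len(proxies) * attempts_per_proxy):
        try:
            return make_request(url)
        except Exception:
            continue  # Move on to the next proxy in the cycle
    raise Exception('All proxies failed')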

IP Rotation for Scraping Dynamic Amazon Pages

Suppose you want to scroll and load dynamic elements on Amazon’s page. You need to use a browser automation library, like Playwright. Here’s how you would rotate proxies in that case.

from itertools import cycle

from playwright.sync_api import sync_playwright

def scrape_with_playwright():
    proxies = ['http://proxy1:8080', 'http://proxy2:8080']
    proxy_cycle = cycle(proxies)

    with sync_playwright() as p:
        success = False
        for i in range(len(proxies)):
            proxy = next(proxy_cycle)
            # Launch a Chromium instance that routes traffic through the proxy
            browser = p.chromium.launch(
                proxy={
                    'server': proxy,
                }
            )
            try:
                page = browser.new_page()
                page.goto('https://amazon.com')
                # Your scraping logic here
                success = True
                break
            except Exception:
                continue
            finally:
                browser.close()
        if not success:
            print('All proxies failed')

The above code cycles through proxies as before. It uses a loop, and in each iteration, it starts a Chromium instance with a proxy and tries to navigate to the target page. 

If the navigation is successful, the loop breaks; otherwise, it moves on to the next proxy. 
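
Relaunching Chromium for every proxy can be slow. Playwright also accepts a proxy option at the browser-context level, so you can launch one browser and rotate proxies per context. Here’s a minimal sketch under that assumption; note that some Chromium versions may require a placeholder browser-level proxy (such as 'per-context') before per-context proxies take effect:

def scrape_with_context_proxies():
    proxies = ['http://proxy1:8080', 'http://proxy2:8080']

    with sync_playwright() as p:
        # Placeholder browser-level proxy; some Chromium versions need this
        # before per-context proxies are honored
        browser = p.chromium.launch(proxy={'server': 'per-context'})
        try:
            for proxy in proxies:
                # Each context routes its traffic through its own proxy
                context = browser.new_context(proxy={'server': proxy})
                try:
                    page = context.new_page()
                    page.goto('https://amazon.com')
                    # Your scraping logic here
                    break
                except Exception:
                    continue
                finally:
                    context.close()
        finally:
            browser.close()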

Need to know how to scrape Amazon using Playwright? Read this article on scraping Amazon product offers and sellers.

Adding Robustness with Delays and Validation

Here’s a more practical implementation that includes proxy validation and request delays. This method uses a custom class, ProxyManager.

This class accepts a proxy list, a minimum delay, and a maximum delay as arguments when creating its object. It provides two methods:

  1. validate_proxy(): Ensures that the proxy works
  2. get_next_proxy(): Gets the next proxy from the proxy pool after ensuring that it works.
import time
import random
import requests
from itertools import cycle

class ProxyManager:
    def __init__(self, proxies, min_delay=1, max_delay=5):
        self.proxies = cycle(proxies)
        self.min_delay = min_delay
        self.max_delay = max_delay

    def validate_proxy(self, proxy):
        # Check that the proxy works by fetching a lightweight test URL
        try:
            response = requests.get(
                'https://httpbin.org/ip',
                proxies={'http': proxy, 'https': proxy},
                timeout=5
            )
            return response.status_code == 200
        except requests.RequestException:
            return False

    def get_next_proxy(self):
        proxy = next(self.proxies)
        if self.validate_proxy(proxy):
            # Random delay between requests to avoid a predictable pattern
            time.sleep(random.uniform(self.min_delay, self.max_delay))
            return proxy
        return self.get_next_proxy()  # Try the next proxy

Now, you can use the class ProxyManager while web scraping Amazon. Just initialize the class with a proxy list and the minimum and maximum delays between requests.

proxies = ['http://proxy1:8080', 'http://proxy2:8080']
manager = ProxyManager(proxies, min_delay=2, max_delay=4)

# Get a new, validated proxy
proxy = manager.get_next_proxy()
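
Putting the pieces together, here’s a minimal sketch that fetches a page through a validated proxy; fetch_page is an illustrative helper, not part of the original code:

def fetch_page(url, manager, max_attempts=5):
    # Try up to max_attempts validated proxies before giving up
    for _ in range(max_attempts):
        proxy = manager.get_next_proxy()
        try:
            response = requests.get(
                url,
                proxies={'http': proxy, 'https': proxy},
                timeout=10
            )
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            continue
    raise Exception('All attempts failed')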

Important Considerations When Using IP Rotation

Each approach has its strengths. Using urllib/requests is simple and good for basic needs; however, Playwright is necessary for handling dynamic websites.

Remember to implement appropriate delays between requests and validate proxies before use to maintain a stable and respectful scraping operation. Consider using a proxy service that provides an API for rotating IPs automatically, as this can be more reliable than managing your own proxy pool.
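
Such services typically expose a single gateway endpoint that rotates the exit IP for you, so your code doesn’t need a proxy pool at all. A minimal sketch, where the gateway host, port, and credentials are placeholders rather than a real service:

# Hypothetical rotating-proxy gateway; host, port, and credentials are placeholders
gateway = 'http://username:password@gateway.example.com:8000'

response = requests.get(
    'https://amazon.com',
    proxies={'http': gateway, 'https': gateway},
    timeout=10
)
# Each request may exit from a different IP chosen by the service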

Also, implement proper error handling and logging in production environments to track proxy performance and identify issues quickly.
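
For instance, here’s a minimal sketch of per-proxy success and failure logging with Python’s standard logging module; the proxy_stats counter and the record_result helper are illustrative:

import logging
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('proxy_rotation')

# Success/failure counts per proxy
proxy_stats = defaultdict(lambda: {'success': 0, 'failure': 0})

def record_result(proxy, ok):
    key = 'success' if ok else 'failure'
    proxy_stats[proxy][key] += 1
    logger.info('proxy=%s result=%s totals=%s', proxy, key, dict(proxy_stats[proxy]))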

Want to know more about IP rotation? Read this article on using proxies and rotating IP addresses.

Why Use a Web Scraping Service

You can manage proxies yourself while scraping Amazon. Just create a list of proxies and use itertools to cycle through them. 

However, choosing the right proxies and managing them yourself can be cumbersome, especially if you only need the data. It’s better to use a web scraping service in that case.

With a web scraping service like ScrapeHero, you won’t have to bother about choosing and managing proxies or other technical aspects of web scraping. We’ll take care of all that.

ScrapeHero is an enterprise-grade web scraping service that can build high-quality scrapers and crawlers for you. Our services can also handle your complete data pipeline, including robotic process automation and custom AI solutions. 


Ready to turn the internet into meaningful and usable data?

Contact us to schedule a brief, introductory call with our experts and learn how we can assist your needs.
