Web Scraping Alibaba Using Python

Are you thinking of web scraping Alibaba? Search no more. Here is how to scrape data from Alibaba using Python.

Python has several modules for web scraping. This tutorial shows how to scrape Alibaba products using Python’s Playwright module, which lets you browse websites programmatically by controlling a browser that Playwright installs for itself.

Setting up the Environment for Web Scraping Alibaba

Install Playwright and SelectorLib using the pip package manager.

pip install selectorlib playwright

Then, install the Playwright browser.

playwright install

Use SelectorLib to Get CSS Elements

Here, you will use a YAML file to provide information about the elements you want to scrape. SelectorLib will help you create this YAML file. It is a convenient tool to select elements from a web page to get their CSS selectors.

You can install SelectorLib as a Chrome extension. After installation, you can find it in your browser’s developer tools.

General steps to get CSS elements using SelectorLib:

  1. Go to the page you want to scrape on Alibaba.com and open SelectorLib in Developer Tools.
    Screenshot showing SelectorLib in Developer Tools
  2. Click on Create Template and enter the desired name.
    Screenshot showing how to create a template in SelectorLib
  3. Click Add, and select the type of the selector.
    Screenshot showing how to add a selector in SelectorLib
  4. Click Select Element and then select the HTML element on the web page. To select an HTML element, you must hover over it; this will highlight the element. Then click, and you will see the CSS selector in the corresponding text box.
    Screenshots showing how to select elements from Alibaba
  5. Click Save, and you will get the following screen.
    Screenshot showing a saved selector in SelectorLib

You can then create child elements by clicking on the plus sign on a parent element.

Screenshot showing parent and child elements in SelectorLib

Finally, you can export the content as a YAML file.

Products:
    css: 'div.fy23-search-card:nth-of-type(1)'
    xpath: null
    type: Text
    children:
        name:
            css: 'h2.search-card-e-title span'
            xpath: null
            type: Text
        price:
            css: div.search-card-e-price-main
            xpath: null
            type: Text
        seller_name:
            css: a.search-card-e-company
            xpath: null
            type: Text
        Link:
            css: a.search-card-e-slider__link
            xpath: null
            type: Link

You can see from the YAML file that this tutorial scrapes four data points:

  • Name
  • Price
  • Seller Name
  • Link

Here, the code only scrapes data from the product search results page, which has these details. However, you can also write code to go to the product page and extract more information. Keep in mind that the code will then require more time to finish.

The Code for Web Scraping Alibaba

The code has several defined functions. Here is their basic logical flow.

Flowchart showing the logical flow of Web Scraping Alibaba

Here are the steps to write the code:

  1. Import the modules necessary for Alibaba web scraping:
    1. Asyncio: for asynchronous programming that allows the code to execute the next step while the previous step is still waiting for the results.
    2. Playwright: to browse the internet programmatically
    3. CSV: to save the result as a CSV file
    4. SelectorLib: to get the selectors for locating data points
    5. Re: for Regular Expression support
      import asyncio
      import re
      from playwright.async_api import async_playwright
      import csv
      from selectorlib import Extractor
  2. Create a function parse() to
    1. Get HTML content from a web page.
    2. Extract the required data from the HTML content
      async def parse(page, search_text, extractor):
          html_content = await page.content()
          return extractor.extract(html_content, base_url=page.url)
  3. Write a function process_page() to
    1. Use the Playwright browser to go to Alibaba’s website
    2. Call parse() function to get data
    3. Write the data into the created CSV file
    4. Call process_page() again for the next page if the current page number is less than the allowed maximum.
      async def process_page(page, url, search_text, extractor, max_pages, writer, current_page_no=1):
          try:
              await page.goto(url)
              data = await parse(page, search_text, extractor)
              product = data.get('Products')
              if not product:
                  # No product matched the selectors; stop paginating.
                  return

              # Write to CSV
              writer.writerow([product['name'], product['price'], product['seller_name'], product['Link']])

              # Pagination logic: request the next page until max_pages is reached
              if current_page_no < max_pages:
                  next_page_no = current_page_no + 1
                  if re.search(r'\bpage=\d+', url):
                      # Replace the existing page parameter
                      next_page_url = re.sub(r'\bpage=\d+', f'page={next_page_no}', url)
                  else:
                      # No page parameter yet, so append one
                      next_page_url = f'{url}&page={next_page_no}'
                  await process_page(page, next_page_url, search_text, extractor, max_pages, writer, next_page_no)

          except Exception as e:
              print(f"Error processing page: {e}")
      
  4. Define a function start_requests() to
    1. Read the search keywords from the “keyword” column of “keywords.csv”
    2. Create a CSV file to write the results
    3. Run the process_page() function.
      async def start_requests(page, extractor, max_pages):
          with open("keywords.csv") as search_keywords, open("alibaba_products.csv", "w", newline="") as csvfile:
              writer = csv.writer(csvfile)
              # Write CSV header
              writer.writerow(["Name", "Price", "Seller Name", "Link"])
      
      
              reader = csv.DictReader(search_keywords)
              for keyword in reader:
                  search_text = keyword["keyword"]
                  url = f"https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText={search_text}&viewtype=G&page=1"
                  await process_page(page, url, search_text, extractor, max_pages, writer)
  5. Finally, the main() function integrates the functions defined above. The function
    1. Launches the Playwright browser
    2. Extracts CSS selectors from the YAML file
    3. Sets the maximum number of pages the code can scrape
    4. Calls the start_requests() function
      async def main():
          async with async_playwright() as p:
              browser = await p.chromium.launch()
              page = await browser.new_page()
      
      
              extractor = Extractor.from_yaml_file("search_results.yml")
              max_pages = 20
      
      
              await start_requests(page, extractor, max_pages)
      
      
              await browser.close()
      # Run the main function (in a Jupyter notebook, use "await main()" instead)
      asyncio.run(main())
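
The search URL in start_requests() inserts the keyword as-is, and process_page() rewrites the page parameter with a regular expression. As a minimal sketch, the same two steps can be done with the standard library's urllib.parse, which also encodes keywords that contain spaces or symbols. The helper names build_search_url() and with_page() are hypothetical and not part of the tutorial code; the query parameters are copied from the URL used above, and the sample keywords are made up.

```python
import csv
import io
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = "https://www.alibaba.com/trade/search"


def build_search_url(search_text, page=1):
    """Build a search URL, percent-encoding the keyword (hypothetical helper)."""
    params = {
        "fsb": "y",
        "IndexArea": "product_en",
        "CatId": "",
        "SearchText": search_text,
        "viewtype": "G",
        "page": page,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def with_page(url, page_no):
    """Return url with its page parameter set to page_no (added if absent)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["page"] = str(page_no)
    return urlunsplit(parts._replace(query=urlencode(query)))


# keywords.csv needs a header row named "keyword", because
# start_requests() reads keyword["keyword"] from each row.
sample_keywords = io.StringIO("keyword\nled strip lights\nphone case\n")
urls = [build_search_url(row["keyword"]) for row in csv.DictReader(sample_keywords)]

print(urls[0])
print(with_page(urls[0], 2))
```

Parsing the query string instead of pattern-matching it avoids accidentally appending a second page parameter when one is already present.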

Here is the scraped data:

Screenshot showing the results of web scraping Alibaba

Limitations of the Code

This code uses CSS selectors to locate elements on Alibaba’s website. These selectors can change whenever Alibaba updates its page structure, so you may need to use SelectorLib again to get the new ones.
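
One way to notice stale selectors early is to check each extracted record for empty fields before writing it to the CSV file. The sketch below assumes the dictionary shape implied by the YAML file above (a Products entry with name, price, seller_name, and Link). The helper missing_fields() is hypothetical, and the sample records are made up for illustration.

```python
REQUIRED_FIELDS = ("name", "price", "seller_name", "Link")


def missing_fields(data, required=REQUIRED_FIELDS):
    """Return the names of required fields that are absent or empty."""
    product = (data or {}).get("Products")
    if not product:
        # The parent selector matched nothing, so every field is missing.
        return list(required)
    return [field for field in required if not product.get(field)]


# Made-up examples of what the extractor might return.
good = {
    "Products": {
        "name": "LED strip",
        "price": "$1.20",
        "seller_name": "Example Co.",
        "Link": "https://www.alibaba.com/product-detail/example",
    }
}
stale = {"Products": {"name": "LED strip", "price": None, "seller_name": None, "Link": None}}
empty = {"Products": None}

for label, record in (("good", good), ("stale", stale), ("empty", empty)):
    print(label, missing_fields(record))
```

If the same fields come back empty across many pages, that is a hint the corresponding selectors need to be regenerated.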

Moreover, the code might fail for large-scale scraping because it can’t bypass anti-scraping measures like rate limiting.
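
A common way to soften rate limiting is to retry failed requests with an increasing delay. The following is a minimal sketch of that idea using only asyncio. The names fetch_with_backoff() and flaky() are hypothetical, the delay values are arbitrary, and this approach does not bypass anti-scraping measures; it only makes the scraper more patient.

```python
import asyncio
import random


async def fetch_with_backoff(fetch, retries=3, base_delay=1.0):
    """Await fetch(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == retries:
                raise  # out of retries, let the caller handle the error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus a little jitter.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def demo():
    calls = {"count": 0}

    async def flaky():
        # Stand-in for page.goto(url): fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise RuntimeError("temporary failure")
        return "ok"

    result = await fetch_with_backoff(flaky, base_delay=0.01)
    return result, calls["count"]


print(asyncio.run(demo()))
```

In process_page(), the navigation step could then be wrapped as `await fetch_with_backoff(lambda: page.goto(url))`.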

Conclusion

You can scrape product data from a website, such as Alibaba, using Python. This tutorial showed you how to scrape Alibaba using Playwright. Further, you saw how to use SelectorLib to get the CSS selectors required to instruct Playwright on what to scrape.

However, CSS selectors can change frequently. Therefore, you must keep checking for changes in the website structure; otherwise, your code will fail to locate the data points.

This code can also only scrape a modest amount of data. To get thousands of product details, you need more robust code. Try ScrapeHero Services; we can build enterprise-grade web scrapers for you.

ScrapeHero is a fully managed web scraping service that can scrape eCommerce websites for product and brand monitoring and more.
