Need to Know Fashion Trends? Learn How to Scrape Product Data from H&M

Learn how to use MechanicalSoup to fetch and parse H&M product data.

Are you interested in fashion trends, price monitoring, or market research? If so, you can scrape product data from H&M’s website and analyze it.

Although H&M uses lazy loading to display results, meaning you need to scroll to load all items, you can still use HTTP requests to gather the data. How? Simply extract the data from the JSON string found within one of the script tags.

This article will guide you through building an H&M product data scraper using MechanicalSoup, a request-based web scraping library for Python.

Data Scraped From H&M

H&M stores the product data available on a search results page as a JSON string inside a script tag with the id __NEXT_DATA__.
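
To make the idea concrete, here is a minimal sketch that pulls a JSON payload out of a script tag using only the Python standard library. The HTML snippet is invented for illustration and is far smaller than a real H&M page; the tutorial itself performs this step with MechanicalSoup, and the NextDataParser class is a hypothetical helper, not part of any library.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('id') == '__NEXT_DATA__':
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


# Invented sample markup; not real H&M output.
sample_html = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"srpProps": {"hits": '
    '[{"pdpUrl": "/productpage.1.html"}]}}}}'
    '</script>'
    '</body></html>'
)

parser = NextDataParser()
parser.feed(sample_html)
parser.close()

# The script body is plain JSON, so json.loads turns it into dicts and lists.
next_data = json.loads(''.join(parser.chunks))
print(next_data['props']['pageProps']['srpProps']['hits'][0]['pdpUrl'])
```

Because the data is already structured, there is no need to render the page or scroll to trigger lazy loading.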

First, the code retrieves each product’s URL and price from the search results page. Then, it visits the product pages at those URLs to complete these four details:

  1. Name
  2. Price
  3. Description
  4. URL

All of this information is also available as a JSON string within a script tag on the product page. 

Scrape Product Data from H&M: The Environment

This code for H&M data extraction requires just two packages:

1. MechanicalSoup: A web scraping library built on top of Python requests and BeautifulSoup, which you must install using pip.

pip install mechanicalsoup

2. json: A module for handling JSON, including saving a dict object to a JSON file, which is available in the Python standard library.

Want to learn more about web scraping libraries? Read this article on Python libraries for web scraping.

Scrape Product Data from H&M: The Code

Start your code to scrape H&M by importing the necessary packages. For MechanicalSoup, you only need to import the StatefulBrowser class.

from mechanicalsoup.stateful_browser import StatefulBrowser
import json

This code will search for a term on H&M’s website and extract details from the results, so you need a search term:

search_keyword = "suits"

Create an object of the StatefulBrowser class; this object will have methods for fetching and parsing. You can also specify which parser to use using the ‘soup_config=’ argument. 

browser = StatefulBrowser(
    soup_config={'features':'lxml'}
)

To ensure that your scraper isn’t blocked, use headers. Store the headers in a dict and update the headers of your MechanicalSoup object using session.headers.update(). 

headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                    '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
          'dpr': '1',
          'sec-fetch-dest': 'document',
          'sec-fetch-mode': 'navigate',
          'sec-fetch-site': 'none',
          'sec-fetch-user': '?1',
          'upgrade-insecure-requests': '1',
          'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
browser.session.headers.update(headers)

Now, fetch H&M’s search results page by passing its URL to the open() method. You can find this URL by visiting H&M and searching for a product.

Use an f-string to insert the search term defined earlier into the URL.

browser.open(f'https://www2.hm.com/en_us/search-results.html?q={search_keyword}')
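
The f-string above works for a single word such as "suits". For search terms that contain spaces or special characters, a small helper along these lines could encode the query first. This is a sketch: build_search_url is a hypothetical name, and it assumes H&M accepts standard URL-encoded queries.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://www2.hm.com/en_us/search-results.html'


def build_search_url(keyword):
    """Return the search URL with the keyword safely URL-encoded."""
    query = urlencode({'q': keyword})  # spaces become '+', '&' becomes '%26'
    return f'{SEARCH_URL}?{query}'


print(build_search_url('linen shirts'))
```

You could then pass the result to browser.open() in place of the hand-built f-string.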

If the request is successful, MechanicalSoup will fetch and parse the HTML code of the search results page. You can access this parsed page through the page attribute.

soup = browser.page

This parsed page is a BeautifulSoup object, so you can use BeautifulSoup’s methods on it; for example, you can find the script tag containing the product data as JSON using the find() method.

product_script = soup.find('script',{'id':'__NEXT_DATA__'}).text

You can now parse the JSON string using the json module, allowing you to navigate keys and values.

product_json = json.loads(product_script)

Navigate through this parsed JSON data to get URLs and prices of products listed on the search results page.

products = product_json['props']['pageProps']['srpProps']['hits']

product_details = [[product['pdpUrl'], product['regularPrice']] for product in products]
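
Chained key lookups like the one above raise a KeyError as soon as H&M renames or removes a key. The following sketch shows one more defensive way to do the same navigation. The dig and collect_product_details helpers are hypothetical, the sample data is invented, and the base URL is an assumption used only in case pdpUrl turns out to be a relative path.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www2.hm.com'  # assumed base for relative links


def dig(data, *keys, default=None):
    """Follow a chain of dict keys, returning default if any is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


def collect_product_details(page_json):
    """Return [absolute_url, price] pairs for hits that have a product URL."""
    hits = dig(page_json, 'props', 'pageProps', 'srpProps', 'hits', default=[])
    return [
        [urljoin(BASE_URL, hit['pdpUrl']), hit.get('regularPrice')]
        for hit in hits
        if hit.get('pdpUrl')
    ]


# Invented sample data shaped like the path used in this tutorial.
sample = {'props': {'pageProps': {'srpProps': {'hits': [
    {'pdpUrl': '/en_us/productpage.1091186001.html', 'regularPrice': '$ 19.99'},
    {'regularPrice': '$ 5.00'},  # no URL, so it is skipped
]}}}}

print(collect_product_details(sample))
print(collect_product_details({}))  # missing keys give an empty list
```

urljoin leaves absolute URLs unchanged, so the helper is safe to use whether the site returns full or relative links.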

Define a list to store data extracted from H&M:

all_product_data = []

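Now, you can iterate through product_details, the list of URL and price pairs built above (the complete code at the end of this article iterates over only the first ten entries). In each iteration: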

1. Make an HTTP request to each product’s URL

browser.open(product[0])

2. Get the parsed data through the page attribute.

souplet = browser.page

3. Extract and parse the JSON string.

script = souplet.find('script', {'id': '__NEXT_DATA__'}).text
json_data = json.loads(script)

4. Extract the required details and save them in a dict.

product_info = json_data['props']['pageProps']['productPageProps']['aemData']['productArticleDetails']
firstKey = list(product_info['variations'].keys())[0]

data = {
    'Name': product_info['productName'],
    'Price': product[1],
    'Description': product_info['variations'][firstKey]['description'],
    'URL': product_info['productUrl']
}

5. Append this data to your previously defined list.

all_product_data.append(data)
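
If one product page lacks an expected key, the steps above stop the whole run with an exception. As a hedged alternative, steps 4 and 5 could be wrapped in a function that returns None for pages it cannot parse. The parse_product name and the sample data are invented for this sketch; the key path is the one used in this tutorial.

```python
def parse_product(json_data, price):
    """Build the result dict, or return None if an expected key is missing."""
    try:
        info = (json_data['props']['pageProps']['productPageProps']
                ['aemData']['productArticleDetails'])
        # Take the first variation without building a full list of keys.
        first_key = next(iter(info['variations']))
        return {
            'Name': info['productName'],
            'Price': price,
            'Description': info['variations'][first_key]['description'],
            'URL': info['productUrl'],
        }
    except (KeyError, TypeError, StopIteration):
        return None


# Invented sample data that mimics the structure described above.
sample_page = {'props': {'pageProps': {'productPageProps': {'aemData': {
    'productArticleDetails': {
        'productName': 'Sample Pants',
        'productUrl': 'https://www2.hm.com/en_us/productpage.0000000001.html',
        'variations': {'0000000001': {'description': 'Sample description.'}},
    }}}}}}

all_product_data = []
for page, price in [(sample_page, '$ 19.99'), ({}, '$ 5.00')]:
    data = parse_product(page, price)
    if data is not None:  # skip pages that could not be parsed
        all_product_data.append(data)

print(all_product_data)
```

Skipping unparseable pages keeps the rest of the results, at the cost of silently dropping products; logging the skipped URLs would make those gaps visible.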

Finally, save the extracted H&M product data to a JSON file. 

with open('hNm.json','w',encoding='utf-8') as f:
    json.dump(all_product_data,f,indent=4,ensure_ascii=False)

The extracted data will look like this:

{
    "Name": "High-waist Dress Pants",
    "Price": "$ 19.99",
    "Description": "Relaxed-fit, dressy pants in jersey with a high waist. Elasticized waistband, diagonal side pockets, and wide legs with pleats at top and creases at front.",
    "URL": "https://www2.hm.com/en_us/productpage.1091186001.html"
}
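
The price is stored as a display string such as "$ 19.99". For price monitoring, you will probably want a number instead. The sketch below shows one possible conversion; parse_price is a hypothetical helper, and it assumes US-style formatting where a comma is a thousands separator.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first number from a price string such as '$ 19.99'."""
    match = re.search(r'\d+(?:[.,]\d+)*', text or '')
    if match is None:
        return None
    # Assumption: commas are thousands separators, as in US prices.
    return Decimal(match.group().replace(',', ''))


print(parse_price('$ 19.99'))
```

Decimal avoids the rounding surprises of float arithmetic when you later compare or sum prices.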

And here’s the complete code to scrape H&M data.

from mechanicalsoup.stateful_browser import StatefulBrowser
import json

search_keyword = "suits"


browser = StatefulBrowser(
    soup_config={'features':'lxml'}
)

headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                    '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
          'dpr': '1',
          'sec-fetch-dest': 'document',
          'sec-fetch-mode': 'navigate',
          'sec-fetch-site': 'none',
          'sec-fetch-user': '?1',
          'upgrade-insecure-requests': '1',
          'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}

browser.session.headers.update(headers)
browser.open(f'https://www2.hm.com/en_us/search-results.html?q={search_keyword}')

soup = browser.page
product_script = soup.find('script',{'id':'__NEXT_DATA__'}).text

product_json = json.loads(product_script)

products = product_json['props']['pageProps']['srpProps']['hits']

product_details = [[product['pdpUrl'], product['regularPrice']] for product in products]

all_product_data = []

# Visit only the first ten products; each entry is [url, price]
for product in product_details[:10]:
    # Fetch and parse the product page
    browser.open(product[0])
    souplet = browser.page

    # The product details are stored as JSON in the __NEXT_DATA__ script tag
    script = souplet.find('script', {'id': '__NEXT_DATA__'}).text
    json_data = json.loads(script)
    product_info = json_data['props']['pageProps']['productPageProps']['aemData']['productArticleDetails']

    # Use the first variation listed for the description
    firstKey = list(product_info['variations'].keys())[0]

    data = {
        'Name': product_info['productName'],
        'Price': product[1],
        'Description': product_info['variations'][firstKey]['description'],
        'URL': product_info['productUrl']
    }
    all_product_data.append(data)

with open('hNm.json','w',encoding='utf-8') as f:
    json.dump(all_product_data,f,indent=4,ensure_ascii=False)

Code Limitations

This code can scrape product data from H&M; however, you’ll need to:

  • Modify it if you want additional data points beyond those covered in this tutorial.
  • Monitor H&M’s website for any changes in HTML structure; otherwise, the code may fail to retrieve data.
  • Implement techniques to bypass anti-scraping measures—such as request delays and proxy rotation—for large-scale web scraping; otherwise, you may get blocked.
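
As a rough illustration of the last point, the sketch below spaces out requests with a random pause and cycles through a list of proxies. The polite_iter helper is hypothetical, the proxy addresses are placeholders, and the URLs are dummy strings, so nothing here contacts H&M.

```python
import random
import time
from itertools import cycle


def polite_iter(items, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield items one by one, pausing a random interval between them."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(random.uniform(min_delay, max_delay))
        yield item


# Hypothetical proxy addresses; replace with proxies you actually control.
proxy_pool = cycle(['http://proxy-a.example:8080', 'http://proxy-b.example:8080'])

# Tiny delays here so the demo finishes quickly; use longer ones in practice.
for url in polite_iter(['url-1', 'url-2', 'url-3'], min_delay=0.0, max_delay=0.01):
    proxy = next(proxy_pool)
    # In the scraper, this is where you might call
    # browser.session.proxies.update({'https': proxy}) before browser.open(url).
    print(url, 'via', proxy)
```

Random intervals look less mechanical than a fixed delay, but suitable values depend on the site and on how much data you need.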

Why Use a Web Scraping Service?

You can use the H&M data scraping code shown here to fetch data for a few hundred products, but for large-scale data extraction, it is better to use a web scraping service.

A web scraping service, like ScrapeHero, can take care of:

  1. Bypassing anti-scraping measures
  2. Executing JavaScript 
  3. Monitoring site changes

This allows you to focus on utilizing the data instead of managing technical challenges. 

ScrapeHero is a fully managed web scraping service provider capable of building enterprise-grade web scrapers and crawlers. Our services also include custom robotic process automation and developing tailored AI models.
