Deploying web scraping tasks for a mission-critical application takes a lot more than writing a scraper and running it periodically. To extract e-commerce listings such as Amazon product data, you need infrastructure that offers scraping as a service and assumes end-to-end responsibility, from executing the crawler to managing the scraped data.
This post was originally published on Apify.
Use Case: Scraping Amazon Product Data
Problem Statement
Scraping product data from e-commerce websites at scale is tough because of restrictions on requests and the risk of getting blocked. Websites also employ countermeasures against automated scraping, which increase both the difficulty and the risk of detection.
Realization Approach
A hosted scraping service offers infrastructure scaling and sophisticated anti-blocking approaches, so developers and data analysts can build and deploy mission-critical web scraping tasks.
Solution Space
Scraping Amazon product data with this approach means the scrapers leverage a large pool of servers with intelligent, human-like browser fingerprints to avoid blocking. Furthermore, the crawled data can be stored and shared, and the scraping process can be monitored for performance over time.
Amazon is one of the most complex websites to scrape. That’s why we built an Amazon scraper you can use on the Apify cloud platform. It provides the infrastructure you need to scrape Amazon.
Approach for Scraping Amazon Product Data
Amazon Product Scraper is one of many ready-made e-commerce scraping tools available on Apify Store. This tool effectively creates an unofficial Amazon scraper API that enables you to get the Amazon product data you need without limits.
Here’s how you can use it to scrape Amazon in 7 simple steps:
Step 1. Go to Amazon Product Scraper on Apify Store
Click on Try for free. If you already have an Apify account, you’ll be taken straight to Apify Console, so you can skip ahead to step 3.
Step 2. Sign up for a free Apify account
If you don’t have an Apify account, you can sign up for free using your email address, Google, or GitHub.
Step 3. Copy and paste the Amazon URL you want to scrape
Once you’re in Apify Console, insert the Amazon category or product URL from which you want to extract data. In the example below, we’ve copied and pasted the URL for the Headphones, Earbuds & Accessories category on Amazon.com. You can click on the + Add button to insert more categories or product URLs.
Step 4. Select the maximum number of results you want to scrape
Insert the maximum number of items you want to scrape in the Max items field. In our example, we have set the number low and opted for just 10 results.
Note: You can also enable optional settings to get better results:
If you enable Captcha solver, the scraper will automatically solve captchas thrown by Amazon. This will decrease the number of request retries and increase the speed of the scraper. However, this option works well only for the '.com' Amazon domain, and even then, Amazon doesn’t show a few product fields after a captcha has been solved (specifically: ‘attributes’, ‘manufacturer attributes’, and ‘bestseller ranks’).
Enabling the Scrape product variant prices option lets you extract the prices of different variations of a product. This is useful when you need prices for each variant. But be warned: this will increase the number of requests and extend the scraping time.
Step 5. Select the proxy option you want to use
You won’t get far scraping Amazon without a proxy. You can set proxy groups from specific countries. Amazon shows you the products that can be shipped to your address based on the proxy you use. You don’t need to worry about it if globally shipped products are enough for you.
The default setting is Residential proxy, as this is the most effective for bypassing anti-scraping technologies. But you can also opt for Datacenter or your Own proxies.
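The choices made in steps 3 to 5 can be thought of as a single input object. The sketch below is only an illustration of that idea: the function name `build_scraper_input` and every key name (`startUrls`, `maxItems`, `proxy`, `type`, `country`) are hypothetical placeholders, not the scraper's documented input schema, and the URL is a dummy value.

```python
import json

# NOTE: all key names below are hypothetical placeholders for this sketch.
# Check the scraper's input schema in Apify Console for the real field names.
def build_scraper_input(urls, max_items=10, proxy_type="RESIDENTIAL", proxy_country=None):
    """Collect the choices from steps 3 to 5 into one JSON-serializable dict."""
    if not urls:
        raise ValueError("at least one Amazon category or product URL is required")
    if max_items < 1:
        raise ValueError("max_items must be a positive integer")

    proxy = {"type": proxy_type}
    if proxy_country is not None:
        # Optional: restrict proxies to one country (see step 5).
        proxy["country"] = proxy_country

    return {
        "startUrls": [{"url": u} for u in urls],
        "maxItems": max_items,
        "proxy": proxy,
    }


# Dummy URL used purely for illustration.
example_input = build_scraper_input(["https://www.amazon.com/example-category"], max_items=10)
print(json.dumps(example_input, indent=2))
```

Validating the input before a run starts is a cheap way to catch an empty URL list or a zero item limit without spending any scraping time.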
Step 6. Start Amazon Product Scraper
Now just click Start and wait for your results to come in. Your task will change from Running to Succeeded when it has finished.
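If you were to wait for a run from a script rather than by watching the Console, the same Running to Succeeded transition could be handled with a small polling loop. This is a generic sketch, not Apify's API: `fetch_status` is a hypothetical callable you would supply, "Running" and "Succeeded" are the labels mentioned above, and "Failed" is an assumed extra terminal state.

```python
import time

# "Succeeded" comes from the article; "Failed" is an assumption for this sketch.
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_run(fetch_status, poll_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status or time runs out.

    fetch_status is a hypothetical zero-argument callable returning the
    current status string of the run.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_seconds} s")
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.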
Step 7. Get your data
Go to the Export results tab to see your results. You can preview and download your Amazon data in several formats: HTML table, JSON, JSONL, CSV, Excel, XML, and RSS feed.
Here’s just some of the data from our scraping example in CSV:
Now you can download and keep the data to use it in spreadsheets, reports, or other apps. You can create as many variations on the input parameters as you like and schedule the scraper to extract Amazon product data as often as you need it.
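As one example of reusing a download in another tool, the sketch below converts a JSONL export (one JSON object per line) into CSV with only the columns you care about, using Python's standard library. The field names `title` and `price` and the sample rows are invented for illustration; they are not real scraper output or the scraper's actual field names.

```python
import csv
import io
import json


def jsonl_to_csv(jsonl_text, columns):
    """Convert JSONL text to CSV text, keeping only the given columns.

    Fields missing from a record become empty cells; extra fields are ignored.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        writer.writerow(json.loads(line))
    return out.getvalue()


# Invented sample rows with hypothetical field names.
sample = (
    '{"title": "Example headphones", "price": 19.99, "extra": "ignored"}\n'
    '{"title": "Example earbuds"}\n'
)
print(jsonl_to_csv(sample, ["title", "price"]))
```

JSONL is convenient for this kind of post-processing because each line can be parsed independently, so large exports do not need to be loaded as one document.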
This video goes into more detail on how to use Amazon Product Scraper:
The Legalities of Scraping Amazon Product Data
It is legal to scrape publicly available data on the internet, and that includes scraping Amazon. Scraping information such as product descriptions, details, ratings, prices, or the number of reactions to a particular product is perfectly legal. You just need to be careful with personal data and copyright protection.
For instance, you may need to consider these when scraping product reviews, as the name and avatar of the reviewer may constitute personal data, while the text of the review itself may, in some cases, be copyright-protected. Always use extra caution and possibly consult with a lawyer when scraping this kind of data.
Restrictions From Amazon
While scraping publicly available data is legal, Amazon sometimes takes action to prevent scraping by rate-limiting requests, banning IP addresses, and engaging in browser fingerprinting to detect scraping bots.
Amazon will generally block web scraping in one of two ways: with a 200 OK success status response code and a requirement to pass a CAPTCHA, or with an HTTP 503 Service Unavailable error and a message to contact sales for a paid API.
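Because one of these block responses arrives with a success status, a scraper cannot rely on the status code alone. The following is a minimal sketch of a check covering both cases described above. The function name `looks_blocked` is hypothetical, and the marker string used to spot a CAPTCHA page is an assumption; real pages may need different markers.

```python
# Assumption: a CAPTCHA page mentions "captcha" somewhere in its HTML.
CAPTCHA_MARKERS = ("captcha",)


def looks_blocked(status_code, body):
    """Return a short reason if the response looks like a block, else None."""
    if status_code == 503:
        return "503 Service Unavailable"
    if status_code == 200:
        lowered = body.lower()
        if any(marker in lowered for marker in CAPTCHA_MARKERS):
            return "CAPTCHA page served with 200 OK"
    return None


print(looks_blocked(200, "<html>Enter the characters: CAPTCHA</html>"))
print(looks_blocked(200, "<html>Product title</html>"))
```

A caller could use the returned reason to decide whether to retry later, switch proxies, or stop the run.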
There are ways to circumvent these measures, but ethical web scraping can help avoid triggering them in the first place. This includes limiting the frequency of requests, using appropriate user agents, and avoiding excessive scraping that could impact website performance.
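Limiting request frequency can be as simple as enforcing a minimum gap between consecutive requests. The class below is a sketch of that idea only; the name `Throttle` and the two-second interval in the usage comment are illustrative assumptions, not recommended values for Amazon.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage idea: call throttle.wait() immediately before each request.
# throttle = Throttle(min_interval=2.0)  # illustrative value
```

Using a monotonic clock keeps the spacing correct even if the system clock is adjusted while the scraper is running.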
Ethical web scraping reduces the risk of getting banned or facing legal consequences while still letting you extract useful data at scale from Amazon.