Scraping Amazon: ASINs, Reviews, and Pricing at Scale


The Scraper
Amazon is simultaneously the most valuable scraping target for e-commerce intelligence and one of the most aggressive anti-bot deployments on the web. Their detection stack combines IP reputation, behavioral analysis, device fingerprinting, and CAPTCHA challenges into a system that has been tuned against decades of scraping attempts.
Getting reliable data from Amazon at scale requires understanding their detection model and building around it, not against it.
What Amazon Data Is Worth Extracting
Before the technical setup, define your extraction targets:
Product data — title, ASIN, category, brand, dimensions, weight. Relatively stable; changes slowly.
Pricing — list price, current price, sale price, Buy Box price, third-party seller prices. Changes frequently; high-value for competitive intelligence.
Reviews and ratings — star rating, review count, review text, verified purchase status, review date. Aggregated for sentiment analysis; granular for NLP tasks.
Seller data — who holds the Buy Box, fulfilled by Amazon vs. third-party, seller rating, shipping options.
Inventory signals — "Only X left in stock", availability status, delivery estimates. Useful for supply chain intelligence.
Sponsored placements — which products appear in sponsored positions for which keywords. Competitive advertising intelligence.
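Because these targets change at different speeds, it can help to give each one its own refresh interval instead of re-fetching everything on one schedule. The sketch below is one way to express that. The names REFRESH_INTERVALS and is_due are hypothetical, and the intervals are placeholders for illustration, not measured values.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder intervals for illustration only; tune them to your own needs.
REFRESH_INTERVALS = {
    "product": timedelta(days=7),     # stable attributes
    "pricing": timedelta(hours=1),    # changes frequently
    "reviews": timedelta(days=1),
    "seller": timedelta(hours=6),
    "inventory": timedelta(hours=6),
    "sponsored": timedelta(days=1),
}


def is_due(target: str, last_fetched: Optional[datetime],
           now: Optional[datetime] = None) -> bool:
    """Return True when `target` has never been fetched or its interval has elapsed."""
    if last_fetched is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_fetched >= REFRESH_INTERVALS[target]
```

A scheduler could call is_due("pricing", last_seen) per ASIN and only enqueue the fetches that return True.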
Amazon's Detection Stack
Amazon runs their own anti-bot system (not a third-party vendor), which means they have first-party behavioral data at a scale no external vendor can match. Their detection has evolved significantly from simple IP blocks to a multi-layer system:
IP reputation — Amazon maintains extensive IP intelligence. Datacenter IPs from AWS, GCP, and Azure are particularly well-catalogued (obviously). Known proxy ranges are flagged. Residential IPs fare better, but heavily used proxy pool IPs accumulate reputation.
TLS and HTTP fingerprinting — Amazon checks JA3 fingerprints and HTTP/2 settings. Python's requests or default httpx produce known fingerprints that don't match real browser behavior.
Browser fingerprinting — When JavaScript executes, Amazon collects canvas, WebGL, and behavioral signals. Their ue_sid and related tracking scripts run on every page.
CAPTCHA challenges — Amazon uses their own CAPTCHA system. Soft blocks return a "To discuss automated access to Amazon data please contact api-services-support@amazon.com" page with a CAPTCHA.
Rate limiting — Velocity-based blocks at both IP and account level. Signed-in sessions have lower block rates but higher scrutiny on account-level patterns.
The Extraction Approach: Direct HTTP with curl_cffi
For product pages and pricing data that doesn't require JavaScript execution, the most efficient approach is direct HTTP requests with a browser-matching TLS fingerprint.
    from curl_cffi.requests import AsyncSession
    import asyncio
    import random
    import re
    from bs4 import BeautifulSoup

    EVOMI_PROXY = "http://USERNAME:PASSWORD@rp.evomi.com:1001"

    HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Ch-Ua": '"Chromium";v="124", "Not_A Brand";v="99"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Upgrade-Insecure-Requests": "1",
    }


    async def fetch_amazon_page(asin: str) -> str | None:
        """Fetch an Amazon product page using browser-matching TLS fingerprint."""
        url = f"https://www.amazon.com/dp/{asin}"
        async with AsyncSession(impersonate="chrome124") as session:
            try:
                response = await session.get(
                    url,
                    headers=HEADERS,
                    proxies={"https": EVOMI_PROXY},
                    timeout=30,
                    allow_redirects=True,
                )
                if response.status_code == 200:
                    # Check for CAPTCHA page
                    if "api-services-support@amazon.com" in response.text:
                        print(f"CAPTCHA triggered for ASIN {asin}")
                        return None
                    if "Sorry, we just need to make sure" in response.text:
                        print(f"Bot check page for ASIN {asin}")
                        return None
                    return response.text
                else:
                    print(f"Status {response.status_code} for ASIN {asin}")
                    return None
            except Exception as e:
                print(f"Fetch error for {asin}: {e}")
                return None
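Since the fetch function signals a block or failure by returning None, a caller will usually want to retry with a growing pause rather than give up on the first miss. The following is a minimal sketch of that wrapper, not part of the original pipeline. fetch_with_retries, max_attempts, and base_delay are hypothetical names, and the function accepts any coroutine with the same contract as fetch_amazon_page.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional

FetchFn = Callable[[str], Awaitable[Optional[str]]]


async def fetch_with_retries(
    fetch: FetchFn,
    asin: str,
    max_attempts: int = 3,
    base_delay: float = 5.0,
) -> Optional[str]:
    """Call `fetch` until it returns HTML or the attempts run out.

    `fetch` is any coroutine function that returns page HTML, or None when
    the request was blocked or failed.
    """
    for attempt in range(max_attempts):
        html = await fetch(asin)
        if html is not None:
            return html
        if attempt < max_attempts - 1:
            # Exponential backoff with +/-20% jitter between attempts.
            delay = base_delay * (2 ** attempt) * random.uniform(0.8, 1.2)
            await asyncio.sleep(delay)
    return None
```

Usage would look like `html = await fetch_with_retries(fetch_amazon_page, asin)`. Whether each retry leaves through a different exit IP depends on how the proxy endpoint is configured, so treat that as something to verify rather than assume.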
Parsing Amazon Product Data
Amazon's HTML structure is complex and changes frequently. The __NEXT_DATA__ trick doesn't apply here because Amazon uses their own rendering stack, not Next.js, so DOM parsing is the path.
    def parse_amazon_product(html: str, asin: str) -> dict | None:
        soup = BeautifulSoup(html, "lxml")

        def get_text(selector: str) -> str | None:
            el = soup.select_one(selector)
            return el.get_text(strip=True) if el else None

        def get_attr(selector: str, attr: str) -> str | None:
            el = soup.select_one(selector)
            return el.get(attr) if el else None

        # Title
        title = get_text("#productTitle") or get_text("span#productTitle")

        # Pricing: takes the first price block on the page; the layout
        # varies with Amazon's A/B testing, so expect to extend this
        price_whole = get_text(".a-price-whole")
        price_fraction = get_text(".a-price-fraction")
        price = None
        if price_whole:
            price_str = price_whole.replace(",", "").rstrip(".")
            if price_fraction:
                price_str += f".{price_fraction}"
            try:
                price = float(price_str)
            except ValueError:
                pass

        # Rating
        rating_text = get_text("span.a-icon-alt")
        rating = None
        if rating_text:
            match = re.search(r"([\d.]+) out of", rating_text)
            if match:
                rating = float(match.group(1))

        # Review count
        review_count = None
        review_el = soup.select_one("#acrCustomerReviewText")
        if review_el:
            match = re.search(r"([\d,]+)", review_el.get_text())
            if match:
                review_count = int(match.group(1).replace(",", ""))

        # Availability
        availability_el = soup.select_one("#availability span, #outOfStock")
        availability = availability_el.get_text(strip=True) if availability_el else None

        # Brand: strip the "Brand:" prefix or the "Visit the ... Store" wrapper
        brand = get_text("#bylineInfo") or get_text("a#bylineInfo")
        if brand:
            brand = re.sub(r"^(?:Brand:|Visit the)\s*|\s*Store$", "", brand).strip()

        # ASIN as reported by the page, falling back to the requested one
        asin_from_page = get_attr('input[name="ASIN"]', "value") or asin

        if not title or price is None:
            return None  # Likely a soft block or layout change

        return {
            "asin": asin_from_page,
            "title": title,
            "price": price,
            "rating": rating,
            "review_count": review_count,
            "availability": availability,
            "brand": brand,
        }
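A parser that returns a dict can still return a wrong one, for example when the page served belongs to a different listing than the ASIN requested. A small validation pass catches that before the record reaches storage. The sketch below assumes records shaped like the dict returned by parse_amazon_product. validate_product and the specific range checks are illustrative assumptions, not rules Amazon publishes.

```python
import re

ASIN_RE = re.compile(r"^[A-Z0-9]{10}$")


def validate_product(record: dict, requested_asin: str) -> list:
    """Return a list of human-readable problems found in a parsed record."""
    problems = []

    asin = record.get("asin") or ""
    if not ASIN_RE.match(asin):
        problems.append(f"malformed ASIN: {asin!r}")
    elif asin != requested_asin:
        # The page may belong to a different listing than the one requested.
        problems.append(f"ASIN mismatch: requested {requested_asin}, got {asin}")

    price = record.get("price")
    if price is None or price <= 0:
        problems.append(f"implausible price: {price!r}")

    rating = record.get("rating")
    if rating is not None and not 1.0 <= rating <= 5.0:
        problems.append(f"rating out of range: {rating!r}")

    review_count = record.get("review_count")
    if review_count is not None and review_count < 0:
        problems.append(f"negative review count: {review_count!r}")

    return problems
```

An empty list means the record passed every check. Anything else is worth logging alongside the raw HTML so layout changes can be diagnosed later.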
Scraping Amazon Reviews
Reviews live on a separate paginated endpoint and have their own extraction logic:
    async def fetch_reviews_page(asin: str, page: int = 1) -> list[dict]:
        """Fetch a page of reviews for an ASIN."""
        url = (
            f"https://www.amazon.com/product-reviews/{asin}"
            f"?pageNumber={page}&sortBy=recent&reviewerType=all_reviews"
        )
        async with AsyncSession(impersonate="chrome124") as session:
            response = await session.get(
                url,
                # The Referer is the product page on the same origin, so
                # Sec-Fetch-Site is overridden to stay consistent with it
                headers={
                    **HEADERS,
                    "Referer": f"https://www.amazon.com/dp/{asin}",
                    "Sec-Fetch-Site": "same-origin",
                },
                proxies={"https": EVOMI_PROXY},
                timeout=30,
            )
        if response.status_code != 200 or "api-services-support" in response.text:
            return []

        soup = BeautifulSoup(response.text, "lxml")
        reviews = []
        for review_div in soup.select('[data-hook="review"]'):
            rating_el = review_div.select_one('[data-hook="review-star-rating"] span.a-icon-alt')
            body_el = review_div.select_one('[data-hook="review-body"] span')
            title_el = review_div.select_one('[data-hook="review-title"] span')
            date_el = review_div.select_one('[data-hook="review-date"]')
            verified_el = review_div.select_one('[data-hook="avp-badge"]')
            review_id = review_div.get("id", "")

            rating = None
            if rating_el:
                match = re.search(r"([\d.]+) out of", rating_el.get_text())
                if match:
                    rating = float(match.group(1))

            date_str = date_el.get_text(strip=True) if date_el else None
            date_match = re.search(
                r"(January|February|March|April|May|June|July|August|"
                r"September|October|November|December)\s+\d+,\s+\d{4}",
                date_str or "",
            )

            reviews.append({
                "review_id": review_id,
                "asin": asin,
                "rating": rating,
                "title": title_el.get_text(strip=True) if title_el else None,
                "body": body_el.get_text(strip=True) if body_el else None,
                "date_str": date_match.group(0) if date_match else None,
                "verified_purchase": verified_el is not None,
            })
        return reviews


    async def scrape_all_reviews(asin: str, max_pages: int = 10) -> list[dict]:
        """Scrape up to max_pages of reviews for an ASIN."""
        all_reviews = []
        for page in range(1, max_pages + 1):
            reviews = await fetch_reviews_page(asin, page)
            if not reviews:
                break
            all_reviews.extend(reviews)
            # Jittered delay between pages: never under 3 seconds, about 5 on average
            await asyncio.sleep(max(3.0, random.gauss(5.0, 1.0)))
        return all_reviews
ASIN Discovery at Scale
Rather than scraping individual product pages, seed your pipeline with ASINs from category and search pages:
    async def discover_asins_from_category(category_url: str) -> list[str]:
        """Extract ASINs from an Amazon category listing page."""
        async with AsyncSession(impersonate="chrome124") as session:
            response = await session.get(
                category_url,
                headers=HEADERS,
                proxies={"https": EVOMI_PROXY},
                timeout=30,
            )
        # Treat blocks and errors as "nothing found" instead of parsing them
        if response.status_code != 200 or "api-services-support" in response.text:
            return []

        soup = BeautifulSoup(response.text, "lxml")
        asins = set()

        # ASINs appear in data-asin attributes on product cards
        for el in soup.select("[data-asin]"):
            asin = el.get("data-asin", "").strip()
            if len(asin) == 10 and asin.isalnum():
                asins.add(asin)

        # Also extract from product links
        for link in soup.select("a[href*='/dp/']"):
            href = link.get("href", "")
            match = re.search(r"/dp/([A-Z0-9]{10})", href)
            if match:
                asins.add(match.group(1))

        return list(asins)
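ASINs also arrive as raw URLs from sitemaps, search exports, or earlier crawls, and those need normalizing before they enter the queue. Here is a small standard-library sketch of that step. extract_asin and dedupe_asins are hypothetical helpers, and the three path shapes in the pattern are common ones rather than an exhaustive list.

```python
import re
from typing import Iterable, List, Optional

# Common URL path shapes that carry an ASIN; not an exhaustive list.
# The lookahead rejects IDs longer than ten characters.
_ASIN_IN_PATH = re.compile(
    r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})(?=[/?#]|$)"
)


def extract_asin(url: str) -> Optional[str]:
    """Return the ASIN embedded in an Amazon URL, or None if there is none."""
    match = _ASIN_IN_PATH.search(url)
    return match.group(1) if match else None


def dedupe_asins(urls: Iterable[str]) -> List[str]:
    """Return unique ASINs from `urls`, preserving first-seen order."""
    seen = {}
    for url in urls:
        asin = extract_asin(url)
        if asin:
            seen.setdefault(asin, None)
    return list(seen)
```

Keeping first-seen order makes crawl runs reproducible, which helps when comparing results between days.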
Rate and Rotation Strategy
Amazon is sensitive to per-IP velocity. The rotation strategy that works:
One IP per session — don't rotate IPs within a single product page load
Rotate between products — new IP for each ASIN (or small batch)
Delay between requests — minimum 3 seconds, jittered, target ~5 seconds average
Vary entry points — sometimes enter via Google (with Referer header), sometimes direct
Respect business hours — scraping at 3am local time for the target geography is a signal; distribute across normal hours
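The delay rule above (a hard 3-second floor with an average near 5 seconds) can be sketched with a shifted exponential distribution, which never goes below the floor and has a long tail that looks less mechanical than a fixed interval. This is one possible choice of distribution, not the only one, and next_delay is a hypothetical name.

```python
import random


def next_delay(minimum: float = 3.0, mean: float = 5.0) -> float:
    """Return a delay in seconds that is never below `minimum` and averages `mean`."""
    if mean <= minimum:
        raise ValueError("mean must be greater than minimum")
    # Shifted exponential: the random part has expected value (mean - minimum),
    # so the total expected delay is `mean`.
    return minimum + random.expovariate(1.0 / (mean - minimum))
```

In an async pipeline this would be used as `await asyncio.sleep(next_delay())` between requests on the same IP.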
Evomi's residential proxies with US geo-targeting are the baseline for Amazon US. For Amazon EU sites (.de, .fr, .it), use the corresponding country targeting. Amazon's geo-detection is sophisticated: a US IP scraping amazon.de can trigger different treatment than a DE IP.
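One way to keep geography consistent is to derive the proxy country and the Accept-Language header from the marketplace domain in a single place. The sketch below covers only the four marketplaces named above. MARKETPLACES and geo_profile are hypothetical names, the Accept-Language values are plausible examples rather than captured browser traffic, and how the country code is passed to the proxy is provider-specific and deliberately left out.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Marketplace host -> (ISO country code for proxy targeting, Accept-Language).
# The Accept-Language strings are illustrative assumptions.
MARKETPLACES = {
    "amazon.com": ("US", "en-US,en;q=0.9"),
    "amazon.de": ("DE", "de-DE,de;q=0.9,en;q=0.8"),
    "amazon.fr": ("FR", "fr-FR,fr;q=0.9,en;q=0.8"),
    "amazon.it": ("IT", "it-IT,it;q=0.9,en;q=0.8"),
}


def geo_profile(url: str) -> Tuple[str, str]:
    """Return (country_code, accept_language) for an Amazon URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in MARKETPLACES:
        # Fail loudly rather than silently scraping through the wrong country.
        raise ValueError(f"no geo profile configured for {host!r}")
    return MARKETPLACES[host]
```

Raising on an unknown host is a deliberate choice here: a missing entry should stop the job instead of falling back to a US exit.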
Common Pitfalls
Pitfall 1: Scraping during high-traffic periods. Around peak events such as Black Friday and Prime Day, Amazon's anti-bot systems are tuned tighter. Schedule intensive scraping runs outside these windows.
Pitfall 2: Ignoring price history complexity. Amazon's current price vs. list price vs. sale price vs. third-party Buy Box price are four different numbers. Your schema needs to distinguish them explicitly.
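A schema that keeps the four prices apart might look like the sketch below. PriceSnapshot and its field names are hypothetical, and the comments describe one reasonable reading of each price type rather than Amazon's own definitions. Decimal is used instead of float so stored prices don't pick up rounding noise.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class PriceSnapshot:
    """One observation of an ASIN's prices at a point in time."""
    asin: str
    currency: str
    captured_at: datetime
    list_price: Optional[Decimal] = None     # reference price shown struck through
    current_price: Optional[Decimal] = None  # price displayed for the main offer
    sale_price: Optional[Decimal] = None     # temporary deal price, if any
    buy_box_price: Optional[Decimal] = None  # price of the offer holding the Buy Box
    buy_box_seller: Optional[str] = None

    def discount_pct(self) -> Optional[Decimal]:
        """Percent off list price, or None when either price is missing."""
        if not self.list_price or self.current_price is None:
            return None
        off = (self.list_price - self.current_price) / self.list_price * 100
        return off.quantize(Decimal("0.1"))
```

Storing one snapshot per observation, rather than overwriting a single price column, is what makes later price-history analysis possible.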
Pitfall 3: Review pagination limits. Amazon caps reviews at 10 pages regardless of the total count. If a product has 50,000 reviews, you'll only get the first ~100 (10 pages of roughly 10 reviews each) via standard pagination. Use the sort parameter (sortBy=helpful, sortBy=recent) to access different review subsets.
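Fetching under several sort orders produces overlapping results, so the batches need to be merged by review ID before analysis. The sketch below assumes each batch is a list of dicts with a "review_id" key, as in the review records built earlier, and that you have collected one batch per sort order (for example by making sortBy a parameter of the fetch function). merge_review_batches is a hypothetical helper.

```python
from typing import Iterable, List


def merge_review_batches(*batches: Iterable[dict]) -> List[dict]:
    """Combine review lists, keeping the first copy seen of each review_id."""
    merged = {}
    for batch in batches:
        for review in batch:
            review_id = review.get("review_id")
            if not review_id:
                # Without an ID there is no safe way to dedupe, so skip it.
                continue
            merged.setdefault(review_id, review)
    return list(merged.values())
```

Called as `merge_review_batches(recent_reviews, helpful_reviews)`, this returns each review once, in the order it was first encountered.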
Conclusion
Amazon scraping in 2026 is achievable with curl_cffi for TLS fingerprint matching and Evomi's residential proxies for clean IP access. The key variables are TLS fingerprint (don't use requests/httpx defaults), geo-matching, and per-IP rate control. Get those right and you have a reliable foundation for Amazon competitive intelligence.
Test the setup with Evomi's free trial against your target ASIN set before scaling volume.

Author
The Scraper
Engineer and Webscraping Specialist
About Author
The Scraper is a software engineer and web scraping specialist, focused on building production-grade data extraction systems. His work centers on large-scale crawling, anti-bot evasion, proxy infrastructure, and browser automation. He writes about real-world scraping failures, silent data corruption, and systems that operate at scale.



