Scraping Amazon: ASINs, Reviews, and Pricing at Scale

Amazon is simultaneously the most valuable scraping target for e-commerce intelligence and one of the most aggressive anti-bot deployments on the web. Their detection stack combines IP reputation, behavioral analysis, device fingerprinting, and CAPTCHA challenges into a system that's been tuned by decades of scraping attempts.

Getting reliable data from Amazon at scale requires understanding their detection model and building around it, not against it.


What Amazon Data Is Worth Extracting

Before the technical setup, define your extraction targets:

Product data — title, ASIN, category, brand, dimensions, weight. Relatively stable; changes slowly.

Pricing — list price, current price, sale price, Buy Box price, third-party seller prices. Changes frequently; high-value for competitive intelligence.

Reviews and ratings — star rating, review count, review text, verified purchase status, review date. Aggregated for sentiment analysis; granular for NLP tasks.

Seller data — who holds the Buy Box, fulfilled by Amazon vs. third-party, seller rating, shipping options.

Inventory signals — "Only X left in stock", availability status, delivery estimates. Useful for supply chain intelligence.

Sponsored placements — which products appear in sponsored positions for which keywords. Competitive advertising intelligence.


Amazon's Detection Stack

Amazon runs their own anti-bot system (not a third-party vendor), which means they have first-party behavioral data at a scale no external vendor can match. Their detection has evolved significantly from simple IP blocks to a multi-layer system:

IP reputation — Amazon maintains extensive IP intelligence. Datacenter IPs from AWS, GCP, and Azure are particularly well-catalogued (obviously). Known proxy ranges are flagged. Residential IPs fare better, but heavily used proxy pool IPs accumulate reputation.

TLS and HTTP fingerprinting — Amazon checks JA3 fingerprints and HTTP/2 settings. Python's requests or default httpx produce known fingerprints that don't match real browser behavior.

Browser fingerprinting — When JavaScript executes, Amazon collects canvas, WebGL, and behavioral signals. Their ue_sid and related tracking scripts run on every page.

CAPTCHA challenges — Amazon uses their own CAPTCHA system. Soft blocks return a "To discuss automated access to Amazon data please contact api-services-support@amazon.com" page with a CAPTCHA.

Rate limiting — Velocity-based blocks at both IP and account level. Signed-in sessions have lower block rates but higher scrutiny on account-level patterns.


The Extraction Approach: curl_cffi

For product pages and pricing data that doesn't require JavaScript execution, the most efficient approach is direct HTTP requests with a browser-matching TLS fingerprint.


from curl_cffi.requests import AsyncSession
import asyncio
import random
import re
from bs4 import BeautifulSoup

EVOMI_PROXY = "http://USERNAME:PASSWORD@rp.evomi.com:1001"

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Sec-Ch-Ua": '"Chromium";v="124", "Not_A Brand";v="99"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Upgrade-Insecure-Requests": "1",
}

async def fetch_amazon_page(asin: str) -> str | None:
    """Fetch an Amazon product page using browser-matching TLS fingerprint."""
    url = f"https://www.amazon.com/dp/{asin}"

    async with AsyncSession(impersonate="chrome124") as session:
        try:
            response = await session.get(
                url,
                headers=HEADERS,
                proxies={"https": EVOMI_PROXY},
                timeout=30,
                allow_redirects=True,
            )

            if response.status_code == 200:
                # Check for CAPTCHA page
                if "api-services-support@amazon.com" in response.text:
                    print(f"CAPTCHA triggered for ASIN {asin}")
                    return None
                if "Sorry, we just need to make sure" in response.text:
                    print(f"Bot check page for ASIN {asin}")
                    return None
                return response.text
            else:
                print(f"Status {response.status_code} for ASIN {asin}")
                return None

        except Exception as e:
            print(f"Fetch error for {asin}: {e}")
            return None
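
The fetcher above returns None on a CAPTCHA page or a non-200 status, which leaves the retry policy to the caller. One option, shown here as a minimal sketch, is to wrap any fetch coroutine in exponential backoff with jitter. It assumes the proxy endpoint hands out a fresh exit IP per request, which is an assumption about your proxy configuration and not something the code checks. The helper name fetch_with_retries and its parameters are illustrative and are not part of curl_cffi.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


async def fetch_with_retries(
    fetch: Callable[[str], Awaitable[Optional[str]]],
    asin: str,
    max_attempts: int = 3,
    base_delay: float = 3.0,
) -> Optional[str]:
    """Call fetch(asin) until it returns HTML or the attempts run out."""
    for attempt in range(max_attempts):
        html = await fetch(asin)
        if html is not None:
            return html
        if attempt < max_attempts - 1:
            # Exponential backoff with +/-20% jitter: roughly 3s, then 6s, ...
            delay = base_delay * (2 ** attempt) * random.uniform(0.8, 1.2)
            await asyncio.sleep(delay)
    return None
```

In this article's terms the call would look like await fetch_with_retries(fetch_amazon_page, asin). Passing the fetcher in as an argument keeps the retry logic testable without network access.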



Parsing Amazon Product Data

Amazon's HTML structure is complex and changes frequently. The __NEXT_DATA__ trick doesn't apply here: Amazon uses its own rendering stack, not Next.js, so there is no embedded JSON blob to read. DOM parsing with CSS selectors is the way in.


def parse_amazon_product(html: str, asin: str) -> dict | None:
    soup = BeautifulSoup(html, "lxml")

    def get_text(selector: str) -> str | None:
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else None

    def get_attr(selector: str, attr: str) -> str | None:
        el = soup.select_one(selector)
        return el.get(attr) if el else None

    # Title
    title = get_text("#productTitle") or get_text("span#productTitle")

    # Pricing: Amazon A/B tests price layouts, so this selector pair can miss; add fallbacks as needed
    price_whole = get_text(".a-price-whole")
    price_fraction = get_text(".a-price-fraction")
    price = None
    if price_whole:
        price_str = price_whole.replace(",", "").rstrip(".")
        if price_fraction:
            price_str += f".{price_fraction}"
        try:
            price = float(price_str)
        except ValueError:
            pass

    # Rating
    rating_text = get_text("span.a-icon-alt")
    rating = None
    if rating_text:
        match = re.search(r"([\d.]+) out of", rating_text)
        if match:
            rating = float(match.group(1))

    # Review count
    review_count = None
    review_el = soup.select_one("#acrCustomerReviewText")
    if review_el:
        match = re.search(r"([\d,]+)", review_el.get_text())
        if match:
            review_count = int(match.group(1).replace(",", ""))

    # Availability
    availability_el = soup.select_one("#availability span, #outOfStock")
    availability = availability_el.get_text(strip=True) if availability_el else None

    # Brand
    brand = get_text("#bylineInfo") or get_text("a#bylineInfo")
    if brand:
        # Strip the "Brand: X" prefix or the "Visit the X Store" wrapper
        brand = re.sub(r"^(?:Brand:|Visit the)\s*|\s*Store$", "", brand).strip()

    # Prefer the ASIN embedded in the page; fall back to the requested one
    asin_from_page = get_attr('input[name="ASIN"]', "value") or asin

    if not title or price is None:
        return None  # Likely a soft block or layout change

    return {
        "asin": asin_from_page,
        "title": title,
        "price": price,
        "rating": rating,
        "review_count": review_count,
        "availability": availability,
        "brand": brand,
    }
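
The price logic in parse_amazon_product treats commas as thousands separators. That fits amazon.com, but not marketplaces that display prices like "1.299,99 €". As a hedged sketch, a standalone helper could take the decimal separator as a parameter. The names parse_price and decimal_sep are hypothetical, and Decimal is used instead of float here to avoid rounding surprises in currency values.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(text: str, decimal_sep: str = ".") -> Optional[Decimal]:
    """Parse a displayed price such as "$1,299.99" or "1.299,99 €"."""
    # Keep only digits and the two possible separator characters.
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch in ".,")
    thousands_sep = "," if decimal_sep == "." else "."
    # Drop thousands separators, then normalize the decimal separator to ".".
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Nothing numeric survived, e.g. "Currently unavailable".
        return None
```

Which separator a given marketplace uses has to be configured per site. This sketch does not try to detect it from the text.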



Scraping Amazon Reviews

Reviews live on a separate paginated endpoint and have their own extraction logic:


async def fetch_reviews_page(asin: str, page: int = 1) -> list[dict]:
    """Fetch a page of reviews for an ASIN."""
    url = (
        f"https://www.amazon.com/product-reviews/{asin}"
        f"?pageNumber={page}&sortBy=recent&reviewerType=all_reviews"
    )

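    try:
        async with AsyncSession(impersonate="chrome124") as session:
            response = await session.get(
                url,
                headers={
                    **HEADERS,
                    # Arriving from the product page is a same-origin navigation
                    "Referer": f"https://www.amazon.com/dp/{asin}",
                    "Sec-Fetch-Site": "same-origin",
                },
                proxies={"https": EVOMI_PROXY},
                timeout=30,
            )
    except Exception as e:
        # Network and proxy errors end pagination instead of crashing the run
        print(f"Review fetch error for {asin} page {page}: {e}")
        return []

    # Non-200 responses and CAPTCHA pages are treated as an empty page
    if response.status_code != 200 or "api-services-support" in response.text:
        return []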

    soup = BeautifulSoup(response.text, "lxml")
    reviews = []

    for review_div in soup.select('[data-hook="review"]'):
        rating_el = review_div.select_one('[data-hook="review-star-rating"] span.a-icon-alt')
        body_el = review_div.select_one('[data-hook="review-body"] span')
        title_el = review_div.select_one('[data-hook="review-title"] span')
        date_el = review_div.select_one('[data-hook="review-date"]')
        verified_el = review_div.select_one('[data-hook="avp-badge"]')
        review_id = review_div.get("id", "")

        rating = None
        if rating_el:
            match = re.search(r"([\d.]+) out of", rating_el.get_text())
            if match:
                rating = float(match.group(1))

        date_str = date_el.get_text(strip=True) if date_el else None
        date_match = re.search(r"(January|February|March|April|May|June|July|August|"
                               r"September|October|November|December)\s+\d+,\s+\d{4}",
                               date_str or "")

        reviews.append({
            "review_id": review_id,
            "asin": asin,
            "rating": rating,
            "title": title_el.get_text(strip=True) if title_el else None,
            "body": body_el.get_text(strip=True) if body_el else None,
            "date_str": date_match.group(0) if date_match else None,
            "verified_purchase": verified_el is not None,
        })

    return reviews

async def scrape_all_reviews(asin: str, max_pages: int = 10) -> list[dict]:
    """Scrape up to max_pages of reviews for an ASIN."""
    all_reviews = []
    for page in range(1, max_pages + 1):
        reviews = await fetch_reviews_page(asin, page)
        if not reviews:
            break
        all_reviews.extend(reviews)
        # Jittered delay between pages: ~5s average, clamped to a 3s minimum
        await asyncio.sleep(max(3.0, random.gauss(5.0, 1.0)))
    return all_reviews
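
The review records above keep the date as a string such as "March 5, 2024". For time-series work it usually helps to convert that to a real date. A minimal sketch follows, assuming the English month-name format that the regex in fetch_reviews_page extracts. Other marketplaces use other formats and would need their own handling. The function name parse_review_date is illustrative.

```python
from datetime import date, datetime
from typing import Optional


def parse_review_date(date_str: Optional[str]) -> Optional[date]:
    """Convert "March 5, 2024" to a date; return None if it doesn't parse."""
    if not date_str:
        return None
    try:
        # %B matches the full month name in the process's current locale,
        # which is English unless the program has changed it.
        return datetime.strptime(date_str, "%B %d, %Y").date()
    except ValueError:
        return None
```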



ASIN Discovery at Scale

Rather than discovering products one page at a time, seed your pipeline with ASINs harvested from category and search pages:


async def discover_asins_from_category(category_url: str) -> list[str]:
    """Extract ASINs from an Amazon category listing page."""
    async with AsyncSession(impersonate="chrome124") as session:
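        response = await session.get(
            category_url,
            headers=HEADERS,
            proxies={"https": EVOMI_PROXY},
            timeout=30,
        )

    # Treat non-200 responses and CAPTCHA pages as "no ASINs found"
    if response.status_code != 200 or "api-services-support" in response.text:
        return []

    soup = BeautifulSoup(response.text, "lxml")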
    asins = set()

    # ASINs appear in data-asin attributes on product cards
    for el in soup.select("[data-asin]"):
        asin = el.get("data-asin", "").strip()
        if re.fullmatch(r"[A-Z0-9]{10}", asin):
            asins.add(asin)

    # Also extract from product links
    for link in soup.select("a[href*='/dp/']"):
        href = link.get("href", "")
        match = re.search(r"/dp/([A-Z0-9]{10})", href)
        if match:
            asins.add(match.group(1))

    return list(asins)
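
ASINs also turn up in URLs collected outside the category crawl, for example in sitemaps or in links gathered from search results. The sketch below extends the /dp/ pattern used above to the /gp/product/ form and requires a delimiter after the ten characters so that longer tokens are not truncated into a false match. The helper name asin_from_url is hypothetical, and B0EXAMPLE1 is a placeholder, not a real product.

```python
import re
from typing import Optional

# Ten uppercase alphanumerics after /dp/ or /gp/product/, followed by a
# path, query, or fragment delimiter, or by the end of the string.
ASIN_IN_URL = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?#]|$)")


def asin_from_url(url: str) -> Optional[str]:
    """Return the ASIN embedded in an Amazon product URL, if any."""
    match = ASIN_IN_URL.search(url)
    return match.group(1) if match else None
```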



Rate and Rotation Strategy

Amazon is sensitive to per-IP velocity. The rotation strategy that works:

  • One IP per session — don't rotate IPs within a single product page load

  • Rotate between products — new IP for each ASIN (or small batch)

  • Delay between requests — minimum 3 seconds, jittered, target ~5 seconds average

  • Vary entry points — sometimes enter via Google (with Referer header), sometimes direct

  • Respect business hours — scraping at 3am local time for the target geography is a signal; distribute across normal hours

Evomi's residential proxies with US geo-targeting are the baseline for Amazon US. For Amazon EU sites (.de, .fr, .it), use the corresponding country targeting. Amazon's geo-detection is sophisticated: a US IP scraping amazon.de can trigger different treatment than a DE IP.
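
Two of the rules above reduce to small pure functions: the jittered delay with a 3-second floor and a roughly 5-second average, and the mapping from marketplace to proxy country. The sketch below is one way to write them. The names and the lookup table are illustrative, and how the country code is passed to the proxy is provider-specific and deliberately left out.

```python
import random
from typing import Optional
from urllib.parse import urlparse

MIN_DELAY_SECONDS = 3.0
MEAN_DELAY_SECONDS = 5.0

# Hypothetical lookup table; extend it for the marketplaces you target.
MARKETPLACE_COUNTRY = {
    "www.amazon.com": "US",
    "www.amazon.de": "DE",
    "www.amazon.fr": "FR",
    "www.amazon.it": "IT",
}


def next_delay(rng: Optional[random.Random] = None) -> float:
    """Gaussian jitter around the mean, clamped to the minimum delay."""
    source = rng if rng is not None else random
    return max(MIN_DELAY_SECONDS, source.gauss(MEAN_DELAY_SECONDS, 1.0))


def proxy_country_for(url: str) -> str:
    """Return the country code whose IPs should fetch this URL."""
    host = urlparse(url).hostname or ""
    if host not in MARKETPLACE_COUNTRY:
        raise ValueError(f"No proxy country configured for {host!r}")
    return MARKETPLACE_COUNTRY[host]
```

Raising on an unknown host, instead of silently defaulting to US, makes a geo mismatch visible before any request is sent.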


Common Pitfalls

Pitfall 1: Scraping during high-traffic periods. Around Black Friday and Prime Day, Amazon's anti-bot systems tend to be tuned tighter. Schedule intensive scraping runs outside these windows.

Pitfall 2: Ignoring price complexity. Current price, list price, sale price, and third-party Buy Box price are four different numbers. Your schema needs to distinguish them explicitly.
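
One way to keep the four numbers apart is to give each its own nullable field instead of a single "price" column. The dataclass below is a sketch of such a record. The class name, field names, and the discount helper are assumptions for illustration, not a schema taken from Amazon.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class PriceSnapshot:
    """One observation of an ASIN's prices at a point in time."""
    asin: str
    currency: str
    list_price: Optional[Decimal] = None     # struck-through reference price
    current_price: Optional[Decimal] = None  # price shown as the main offer
    sale_price: Optional[Decimal] = None     # promotional price, if any
    buy_box_price: Optional[Decimal] = None  # price of the Buy Box offer
    buy_box_seller: Optional[str] = None

    def discount_pct(self) -> Optional[float]:
        """Percent off list price, or None when either price is missing."""
        if self.list_price is None or self.current_price is None:
            return None
        if self.list_price == 0:
            return None
        return float((self.list_price - self.current_price) / self.list_price * 100)
```

Leaving missing prices as None, not 0, keeps "not observed" distinct from "free" in downstream aggregates.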

Pitfall 3: Review pagination limits. Amazon caps reviews at 10 pages regardless of the total count. If a product has 50,000 reviews, you'll only get the first ~100 (10 pages of roughly 10 reviews each) via standard pagination. Running separate passes with different sort parameters (sortBy=helpful, sortBy=recent) reaches different review subsets.
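
Separate passes with different sort orders will overlap, so the results need deduplicating before analysis. A minimal sketch, assuming each review dict carries the review_id field produced by fetch_reviews_page: merge the batches and keep the first copy of each ID. Note that fetch_reviews_page as written hardcodes sortBy=recent, so a sort parameter would have to be added to it first. The function name merge_review_batches is illustrative.

```python
from typing import Iterable


def merge_review_batches(*batches: Iterable[dict]) -> list[dict]:
    """Combine review lists, keeping the first occurrence of each review_id."""
    merged: dict[str, dict] = {}
    for batch in batches:
        for review in batch:
            review_id = review.get("review_id")
            # Skip records without an ID; they cannot be deduplicated safely.
            if review_id and review_id not in merged:
                merged[review_id] = review
    return list(merged.values())
```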


Conclusion

Amazon scraping in 2026 is achievable with curl_cffi for TLS fingerprint matching and Evomi's residential proxies for clean IP access. The key variables are TLS fingerprint (don't use requests/httpx defaults), geo-matching, and per-IP rate control. Get those right and you have a reliable foundation for Amazon competitive intelligence.

Test the setup with Evomi's free trial against your target ASIN set before scaling volume.


                               r"September|October|November|December)\s+\d+,\s+\d{4}",
                               date_str or "")

        reviews.append({
            "review_id": review_id,
            "asin": asin,
            "rating": rating,
            "title": title_el.get_text(strip=True) if title_el else None,
            "body": body_el.get_text(strip=True) if body_el else None,
            "date_str": date_match.group(0) if date_match else None,
            "verified_purchase": verified_el is not None,
        })

    return reviews

async def scrape_all_reviews(asin: str, max_pages: int = 10) -> list[dict]:
    """Scrape up to max_pages of reviews for an ASIN."""
    all_reviews = []
    for page in range(1, max_pages + 1):
        reviews = await fetch_reviews_page(asin, page)
        if not reviews:
            break
        all_reviews.extend(reviews)
        # Jittered delay between pages: ~5 s average, clamped so it never drops below 3 s
        await asyncio.sleep(max(3.0, random.gauss(5.0, 1.0)))
    return all_reviews



ASIN Discovery at Scale

Rather than scraping individual product pages, seed your pipeline with ASINs from category and search pages:


async def discover_asins_from_category(category_url: str) -> list[str]:
    """Extract ASINs from an Amazon category listing page."""
    async with AsyncSession(impersonate="chrome124") as session:
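        response = await session.get(
            category_url,
            headers=HEADERS,
            proxies={"https": EVOMI_PROXY},
            timeout=30,
        )

    # Treat non-200 responses and block pages as "no ASINs found"
    if response.status_code != 200 or "api-services-support" in response.text:
        return []

    soup = BeautifulSoup(response.text, "lxml")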
    asins = set()

    # ASINs appear in data-asin attributes on product cards
    for el in soup.select("[data-asin]"):
        asin = el.get("data-asin", "").strip()
        if len(asin) == 10 and asin.isalnum():
            asins.add(asin)

    # Also extract from product links
    for link in soup.select("a[href*='/dp/']"):
        href = link.get("href", "")
        match = re.search(r"/dp/([A-Z0-9]{10})", href)
        if match:
            asins.add(match.group(1))

    return list(asins)
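
ASINs also turn up in URLs collected from other sources (sitemaps, search results, existing datasets), and the same product can appear under several URL shapes. The sketch below deduplicates ASINs from a list of URLs while keeping first-seen order. The function name `asins_from_urls` is hypothetical, the `/gp/product/` path is an assumption about an alternative product URL shape alongside the `/dp/` and `/product-reviews/` paths used above, and the ASIN values in the usage example are placeholders.

```python
import re

# Ten uppercase alphanumerics after a known product path, not followed by more of the same
ASIN_IN_URL = re.compile(
    r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})(?![A-Z0-9])"
)


def asins_from_urls(urls: list[str]) -> list[str]:
    """Return unique ASINs found in the given URLs, in first-seen order."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for url in urls:
        match = ASIN_IN_URL.search(url)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


urls = [
    "https://www.amazon.com/dp/B00EXAMPLE?th=1",
    "/Some-Product-Name/dp/B00EXAMPLE/ref=sr_1_1",
    "/gp/product/B000000001",
    "/help/customer/display.html",
]
unique_asins = asins_from_urls(urls)  # two distinct ASINs from four URLs
```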



Rate and Rotation Strategy

Amazon is sensitive to per-IP velocity. The rotation strategy that works:

  • One IP per session — don't rotate IPs within a single product page load

  • Rotate between products — new IP for each ASIN (or small batch)

  • Delay between requests — minimum 3 seconds, jittered, target ~5 seconds average

  • Vary entry points — sometimes enter via Google (with Referer header), sometimes direct

  • Respect business hours — scraping at 3am local time for the target geography is a signal; distribute across normal hours

Evomi's residential proxies with US geo-targeting are the baseline for Amazon US. For Amazon EU sites (.de, .fr, .it), use the corresponding country targeting. Amazon's geo-detection is sophisticated: a US IP scraping amazon.de can trigger different treatment than a DE IP.


Common Pitfalls

Pitfall 1: Scraping during high-traffic periods. Amazon's anti-bot systems are tuned tighter during peak-traffic events such as Black Friday and Prime Day. Schedule intensive scraping runs outside these windows.

Pitfall 2: Ignoring price complexity. Amazon's list price, current price, sale price, and third-party Buy Box price are four different numbers. Your schema needs to distinguish them explicitly.

Pitfall 3: Review pagination limits. Amazon caps reviews at 10 pages regardless of the total count. If a product has 50,000 reviews, you'll only get the first ~100 via standard pagination. Use the sort parameter (sortBy=helpful, sortBy=recent) to access different review subsets.
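
Pulling reviews under more than one sort order returns overlapping subsets, so the results need deduplicating before analysis. A minimal sketch, keyed on the `review_id` field produced by the review scraper above; `merge_review_sets` is a hypothetical helper. It assumes you have extended `fetch_reviews_page` to accept a sort parameter, which the version shown earlier does not (it hardcodes `sortBy=recent`).

```python
def merge_review_sets(*review_sets: list[dict]) -> list[dict]:
    """Combine review lists, keeping the first copy of each review_id."""
    merged: dict[str, dict] = {}
    for reviews in review_sets:
        for review in reviews:
            review_id = review.get("review_id")
            # Skip reviews with no ID: they cannot be deduplicated reliably
            if review_id and review_id not in merged:
                merged[review_id] = review
    return list(merged.values())


recent = [{"review_id": "R1"}, {"review_id": "R2"}]
helpful = [{"review_id": "R2"}, {"review_id": "R3"}, {"review_id": ""}]
combined = merge_review_sets(recent, helpful)
```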


Conclusion

Amazon scraping in 2026 is achievable with curl_cffi for TLS fingerprint matching and Evomi's residential proxies for clean IP access. The key variables are TLS fingerprint (don't use requests/httpx defaults), geo-matching, and per-IP rate control. Get those right and you have a reliable foundation for Amazon competitive intelligence.

Test the setup with Evomi's free trial against your target ASIN set before scaling volume.

Author

The Scraper

Engineer and Webscraping Specialist

About Author

The Scraper is a software engineer and web scraping specialist, focused on building production-grade data extraction systems. His work centers on large-scale crawling, anti-bot evasion, proxy infrastructure, and browser automation. He writes about real-world scraping failures, silent data corruption, and systems that operate at scale.
