When Your Scraper Becomes a Product: SLAs, Versioning & Data Contracts

The Scraper


There's a moment in the lifecycle of most scraping projects when the scraper stops being internal infrastructure and becomes something other teams depend on. An analyst builds a dashboard on top of it. A product team starts building features on the data. An ML model uses it as a feature store. A partner integration consumes the feed.

At that moment, you've stopped running a scraper and started running a data product. And data products have requirements that scrapers traditionally ignore.

This transition breaks things. Silently, gradually, and then all at once.


What Changes When Consumers Exist

A scraper that only serves its author has informal contracts. If the schema changes, the author knows because they changed it. If the scraper goes down, the author knows because they're watching it. If the data is wrong, the author catches it before it matters.

A scraper that serves downstream consumers has external contracts. If the schema changes, someone's dashboard breaks and they find out when their Monday report is empty. If the scraper goes down, the product manager asking for the numbers is the one who notices. If the data is wrong, a decision gets made on it.

The gap between "it works for me" and "it's reliable for others" is what product engineering addresses. Applied to scrapers, this means:

SLA definition — what uptime, freshness, and accuracy are you promising? If you don't define it, consumers will assume the best. You need to set explicit expectations and design to meet them.

Schema versioning — when the data structure changes, consumers need warning, not surprise. Schema changes need to be versioned, communicated, and ideally backward-compatible for a migration window.

Data contracts — explicit documentation of what each field means, its data type, its null behavior, its update frequency. The consumer needs to know what they're building on.

Incident management — when the scraper breaks, who knows first? What's the notification path? How long before it's fixed?


Defining Your SLAs

Start with the freshness SLA, the maximum acceptable age of the data. Everything else follows from this.


# scraper-sla.yaml: a template worth formalizing
service: product_catalog_scraper
version: 2.1

freshness:
  target: 4 hours        # Data should be no more than 4 hours stale
  critical_threshold: 8 hours  # Alert if data exceeds this age

coverage:
  target_pct: 98         # 98% of seed URLs should return valid records
  critical_threshold: 90

accuracy:
  null_rate_max: 0.03    # Max 3% null price fields
  anomaly_rate_max: 0.01 # Max 1% price anomalies

availability:
  uptime_target: 99.5    # % of scheduled runs completing successfully
  max_consecutive_failures: 2  # Alert after 2 consecutive job failures


These numbers should come from the consumer's requirements, not the scraper operator's comfort. Talk to the people using the data. What's the worst freshness they can tolerate before their use case breaks? That's your SLA.
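
Once the thresholds are written down, they can be checked after every run. The sketch below is one minimal way to do that, not a definitive implementation: `RunMetrics` and `evaluate_sla` are hypothetical names, and the per-run counts are assumed to come from your own pipeline. Only the threshold values are taken from the YAML above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RunMetrics:
    """Hypothetical per-run summary; field names are illustrative."""
    latest_scraped_at: datetime   # newest record timestamp (UTC)
    seed_urls: int                # URLs the run attempted
    valid_records: int            # URLs that produced a valid record
    null_prices: int              # valid records with a null price


# Thresholds copied from scraper-sla.yaml above.
FRESHNESS_TARGET = timedelta(hours=4)
FRESHNESS_CRITICAL = timedelta(hours=8)
COVERAGE_TARGET_PCT = 98.0
COVERAGE_CRITICAL_PCT = 90.0
NULL_RATE_MAX = 0.03


def evaluate_sla(m: RunMetrics, now: datetime) -> dict[str, str]:
    """Return 'ok', 'warn', or 'critical' for each SLA dimension."""
    age = now - m.latest_scraped_at
    coverage_pct = 100.0 * m.valid_records / m.seed_urls if m.seed_urls else 0.0
    null_rate = m.null_prices / m.valid_records if m.valid_records else 1.0

    def grade(breaches_target: bool, breaches_critical: bool) -> str:
        if breaches_critical:
            return "critical"
        return "warn" if breaches_target else "ok"

    return {
        "freshness": grade(age > FRESHNESS_TARGET, age > FRESHNESS_CRITICAL),
        "coverage": grade(coverage_pct < COVERAGE_TARGET_PCT,
                          coverage_pct < COVERAGE_CRITICAL_PCT),
        # The YAML defines a single null-rate limit, so a breach is graded "warn".
        "accuracy": grade(null_rate > NULL_RATE_MAX, False),
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
metrics = RunMetrics(
    latest_scraped_at=now - timedelta(hours=5),
    seed_urls=1000,
    valid_records=985,
    null_prices=10,
)
print(evaluate_sla(metrics, now))
```

In this example the data is five hours old, so freshness is past its target but not yet critical, while coverage and null rate are within bounds. Wiring the result into an alerting channel is left out on purpose; that part depends on your stack.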


Schema Versioning

A schema change that breaks a consumer is an outage, even if the scraper itself ran fine. Treat schema changes with the same rigor as API changes.

Additive changes (new fields, new enum values) should be safe, but inform consumers anyway. New fields won't break existing code that ignores them, but a new enum value can still surprise a consumer that matches on a fixed set of values, so announce both.

Breaking changes (renamed fields, removed fields, type changes) need a deprecation period. Keep the old field populated alongside the new one; give consumers a migration window; remove the old field in a later version.


# Versioned record schema
from pydantic import BaseModel
from typing import Optional
from datetime import datetime

class ProductRecordV2(BaseModel):
    """
    v2.1 schema changelog:
    - 2.1: added price_currency (was implicit USD)
    - 2.0: renamed 'cost' to 'price' (BREAKING; v1.x deprecated)
    - 1.x: legacy schema, no longer supported
    """
    schema_version: str = "2.1"
    product_id: str
    name: str
    price: float                       # Renamed from 'cost' in v2.0
    price_currency: str = "USD"        # Added in v2.1
    availability: str
    scraped_at: datetime

    # Deprecated fields kept for backward compat, remove in v3.0
    cost: Optional[float] = None       # Deprecated in v2.0, alias for price

    def __init__(self, **data):
        super().__init__(**data)
        # Backward compat: populate deprecated alias
        if self.cost is None:
            object.__setattr__(self, 'cost', self.price)
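
The additive-versus-breaking distinction can also be checked mechanically before a release. The sketch below is a simplified illustration under stated assumptions: `classify_schema_change` is a hypothetical helper, and schemas are reduced to plain `{field_name: type_name}` dictionaries rather than real model classes.

```python
def classify_schema_change(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {field_name: type_name} mappings.

    New fields are reported as additive; removed or retyped fields as breaking.
    A hard rename shows up as one removal plus one addition.
    """
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    retyped = sorted(f for f in set(old) & set(new) if old[f] != new[f])
    return {
        "additive": added,
        "breaking": [f"removed: {f}" for f in removed]
                    + [f"type changed: {f}" for f in retyped],
    }


v1 = {"product_id": "str", "name": "str", "cost": "float"}
v2_hard_rename = {"product_id": "str", "name": "str", "price": "float"}
v2_with_alias = {**v1, "price": "float"}   # old 'cost' kept alongside 'price'

print(classify_schema_change(v1, v2_hard_rename))
print(classify_schema_change(v1, v2_with_alias))
```

The first comparison reports `cost` as removed, which is the breaking case. The second keeps `cost` alongside `price`, so nothing is flagged as breaking, which is exactly what the deprecated alias buys you during the migration window.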



DB Schema Evolution Without Breaking Consumers

In your database (for example, BigQuery or Athena), additive schema changes are generally safe: you can add a nullable column to an existing table without breaking existing queries. For breaking changes, follow this pattern: new table version, dual-write period, consumer migration, old table deprecation.


-- Step 1: Create new versioned table
CREATE TABLE scraper.products_v2 AS
SELECT * FROM scraper.products_v1;

ALTER TABLE scraper.products_v2
ADD COLUMN price_currency STRING;

-- Step 2: Publish the new table; maintain dual-write to both
-- (pipeline writes to products_v1 and products_v2 simultaneously)

-- Step 3: Create a migration-period view that consumers can use
-- (you repoint it at whichever table is the current production version)
CREATE OR REPLACE VIEW scraper.products AS
SELECT * FROM scraper.products_v2;
-- Consumers using the view get the new schema automatically

-- Step 4: After migration window, stop writing to v1
-- Step 5: Drop v1


The view pattern decouples consumers from the physical table version. Consumers query scraper.products; you control which physical table backs it.
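
On the pipeline side, the dual-write period from Step 2 might look roughly like the sketch below. Everything here is illustrative: `dual_write`, `to_v1_row`, and `to_v2_row` are hypothetical helpers, the v1 column list is an assumption based on the SQL above (v2 is v1 plus `price_currency`), and in-memory lists stand in for the real table writers.

```python
from typing import Callable

Row = dict[str, object]

# Assumption: v1 has these columns; v2 is v1 plus price_currency, as in the SQL above.
V1_COLUMNS = ("product_id", "name", "price", "availability", "scraped_at")


def to_v1_row(record: Row) -> Row:
    """Project a record onto the legacy v1 column set."""
    return {col: record.get(col) for col in V1_COLUMNS}


def to_v2_row(record: Row) -> Row:
    """v2 carries every v1 column plus price_currency (USD when absent)."""
    row = to_v1_row(record)
    row["price_currency"] = record.get("price_currency") or "USD"
    return row


def dual_write(record: Row,
               write_v1: Callable[[Row], None],
               write_v2: Callable[[Row], None]) -> None:
    """Write the same record to both table versions during the migration window."""
    write_v1(to_v1_row(record))
    write_v2(to_v2_row(record))


# In-memory lists stand in for the two physical tables.
products_v1: list[Row] = []
products_v2: list[Row] = []

record = {
    "product_id": "B08N5WRWNW",
    "name": "Example product",
    "price": 19.99,
    "price_currency": "EUR",
    "availability": "in_stock",
    "scraped_at": "2024-01-01T00:00:00Z",
}
dual_write(record, products_v1.append, products_v2.append)
print(products_v1[0])
print(products_v2[0])
```

A real pipeline also has to decide what happens when one of the two writes fails. That is deliberately left out here, but it is the part that usually needs the most care.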


Data Contracts as Documentation

A data contract is a machine-readable (or at minimum clearly human-readable) specification of what a dataset contains. It answers the questions every consumer will eventually ask.


# data-contract-product-catalog.yaml
name: product_catalog
owner: data-engineering@company.com
version: 2.1
sla_ref: scraper-sla.yaml

description: >
  Daily scraped product catalog from e-commerce targets.
  Each record represents one product at one point in time.
  Multiple records may exist per product_id; use the latest scraped_at for current state.

fields:
  product_id:
    type: STRING
    nullable: false
    description: Source platform's product identifier. Stable across scrapes.
    example: "B08N5WRWNW"

  name:
    type: STRING
    nullable: false
    description: Product display name as shown on source site.
    max_length: 500

  price:
    type: NUMERIC
    nullable: false
    description: List price at time of scraping. Does not include sale price.
    unit: currency (see price_currency field)
    range: [0.01, 100000]

  price_currency:
    type: STRING
    nullable: false
    description: ISO 4217 currency code for price field.
    enum: [USD, EUR, GBP, JPY, CAD, AUD]
    default: USD

  availability:
    type: STRING
    nullable: false
    description: Product availability status.
    enum: [in_stock, out_of_stock, unknown]

  scraped_at:
    type: TIMESTAMP
    nullable: false
    description: UTC timestamp when this record was scraped.

known_issues:
  - "availability='unknown' rate may spike to ~5% during target site maintenance windows"
  - "price field may reflect sale price on some targets during promotional periods"

changelog:
  "2.1": "Added price_currency field"
  "2.0": "Renamed cost → price (BREAKING)"
  "1.0": "Initial schema"
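
A contract is more useful when the pipeline enforces it. As a rough sketch, the field rules above can be checked against each record before it is published. The `CONTRACT` dictionary below is a hand-copied subset of the YAML (a real setup would load the file instead), `validate_record` is a hypothetical helper, and type checks are omitted to keep it short.

```python
# Hand-copied subset of the contract above; a real setup would parse the YAML file.
CONTRACT = {
    "product_id": {"nullable": False},
    "name": {"nullable": False, "max_length": 500},
    "price": {"nullable": False, "range": (0.01, 100000)},
    "price_currency": {"nullable": False,
                       "enum": {"USD", "EUR", "GBP", "JPY", "CAD", "AUD"}},
    "availability": {"nullable": False,
                     "enum": {"in_stock", "out_of_stock", "unknown"}},
    "scraped_at": {"nullable": False},
}


def validate_record(record: dict, contract: dict = CONTRACT) -> list[str]:
    """Return human-readable contract violations (empty list if none)."""
    errors = []
    for field, rules in contract.items():
        value = record.get(field)
        if value is None:
            if not rules.get("nullable", True):
                errors.append(f"{field}: null not allowed")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: {value!r} not in allowed values")
        if "range" in rules:
            low, high = rules["range"]
            if not low <= value <= high:
                errors.append(f"{field}: {value} outside [{low}, {high}]")
        if "max_length" in rules and len(value) > rules["max_length"]:
            errors.append(f"{field}: longer than {rules['max_length']} characters")
    return errors


good = {
    "product_id": "B08N5WRWNW", "name": "Example product", "price": 19.99,
    "price_currency": "USD", "availability": "in_stock",
    "scraped_at": "2024-01-01T00:00:00Z",
}
bad = {**good, "name": None, "price": 0, "availability": "maybe"}

print(validate_record(good))
for problem in validate_record(bad):
    print(problem)
```

Records that fail can be quarantined or counted toward the null and anomaly rates in the SLA, depending on how strict the consumers need you to be.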



Incident Runbooks

When the scraper breaks, the on-call engineer shouldn't have to work out the debugging process from scratch. A runbook reduces mean time to resolution.


# Product Catalog Scraper Incident Runbook

## Alert: ScraperStaleData (data age > 8 hours)

**Impact:** Downstream dashboards show stale prices. Price alert system may miss changes.

**First check:** Was the scheduled job triggered?
  BigQuery: `SELECT MAX(scraped_at) FROM scraper.products`
  If NULL or > 8h: job didn't run or wrote no records

**Second check:** Did the job fail?
  Cloud Logging: filter for `scraper.job_status = ERROR`
  Common causes: proxy authentication expired, target site outage, schema change

**Third check:** Is it a soft block?
  Check Grafana: `scraper_null_field_rate{field="price"}`; if > 5%, a soft block is suspected
  Action: rotate proxy pool, re-run job

**Escalation:** If not resolved in 2 hours, page team lead.

## Alert: HighNullFieldRate

**Likely cause:** Target site layout change OR soft block from proxy issues
**Action:** Pull 5 recent raw HTML files from GCS, check whether price element is present
  If present but wrong selector: update parser, re-parse from GCS
  If absent: target is serving a bot-detection page; check proxy pool health
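
The HighNullFieldRate branch can be partly automated as a first-pass triage. The sketch below is a crude heuristic, not a definitive detector: `triage_null_prices` is a hypothetical helper, and it assumes prices appear in the raw HTML as a currency symbol followed by digits, which will not hold for every target.

```python
import re

# Assumption: prices appear as a currency symbol followed by digits, e.g. "$19.99".
PRICE_PATTERN = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")


def triage_null_prices(html_samples: list[str]) -> str:
    """Suggest a runbook branch from a handful of raw HTML pages."""
    if not html_samples:
        return "no_samples"
    with_price = sum(1 for html in html_samples if PRICE_PATTERN.search(html))
    # Most pages still contain a price: the parser's selector is probably stale.
    if with_price > len(html_samples) / 2:
        return "update_parser"
    # Most pages contain no price at all: likely a block or challenge page.
    return "check_proxy_pool"


layout_changed = ['<div class="new-price-box">$19.99</div>'] * 5
blocked = ["<html><body>Please verify you are human</body></html>"] * 5

print(triage_null_prices(layout_changed))
print(triage_null_prices(blocked))
```

The result only points the on-call engineer at the more likely branch; someone should still open a few of the raw pages before changing the parser or rotating proxies.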



The Mindset Shift

The scraper-to-product transition requires one fundamental mindset shift: you are no longer the primary customer of your own data.

When you're the only consumer, you can absorb variability and fix things as they break. When others depend on you, their tolerance is your constraint. Their use cases define your SLAs. Their migration timelines constrain your schema changes.

This shift makes scraping harder. It also makes it more valuable. Data that's reliable enough to build products on is worth an order of magnitude more than data that sometimes works.

Build the reliability in from the moment you know someone else is depending on it. Retrofitting reliability into an existing data product is much harder than building it in from the start.

Author

The Scraper

Engineer and Webscraping Specialist

About Author

The Scraper is a software engineer and web scraping specialist, focused on building production-grade data extraction systems. His work centers on large-scale crawling, anti-bot evasion, proxy infrastructure, and browser automation. He writes about real-world scraping failures, silent data corruption, and systems that operate at scale.
