The Compliance Squeeze: Scraping Between GDPR, the AI Act, and Platform TOS

The Scraper

Last updated on April 20, 2026


The legal landscape around web scraping has shifted more in the last two years than in the previous ten. Three forces are converging simultaneously: GDPR enforcement is maturing and getting teeth, the EU AI Act is adding new data provenance requirements for AI training pipelines, and platform terms of service have escalated from boilerplate to litigation instruments.

None of this means scraping is becoming illegal. It means the compliance cost of scraping carelessly is rising, and the teams building sustainable data operations are the ones taking the legal layer seriously before they receive a letter.


What the Courts Have Actually Said

The legal status of web scraping is more settled than the headlines suggest. The foundational US case, hiQ Labs v. LinkedIn, survived multiple appeals with the conclusion that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). Public data, accessible without authentication, is not "unauthorized access."

But "not illegal under the CFAA" is not the same as "legally unrestricted." The cases that are moving the needle now are different in character:

Contract-based claims. Platforms are increasingly treating terms-of-service violations as breach of contract rather than CFAA violations. If you created an account to access data and the TOS prohibits automated access, the exposure shifts. The data might be public; the access method was through a contractual relationship.

Data protection claims. In the EU and UK, GDPR creates rights around personal data regardless of whether that data is technically public. Email addresses, names, and employment histories scraped from LinkedIn or company websites are personal data under GDPR. Processing them without a lawful basis (and "the website was public" is not one) creates regulatory exposure.

Copyright claims. The Google v. SerpAPI dispute is the most prominent recent example of a platform arguing that scraped data itself is protected by database rights or copyright. This argument has had mixed success, but it's becoming a standard tool in platform legal arsenals.


GDPR in Practice for Scrapers

The GDPR compliance question for scraping isn't whether you can collect the data; it's what data you're collecting and what you do with it.

Personal data triggers GDPR. Any data about an identifiable natural person (name, email, phone number, photo, social profile, employment record) is personal data. If your scraping pipeline collects it about EU residents, GDPR applies to you, regardless of where you're based.

Lawful basis is required for every processing activity. The two bases most relevant to scraping are legitimate interests and public interest / research. Legitimate interests requires a balancing test: your interest in the data must not be overridden by the subject's rights. Commercial competitive intelligence is a harder case than public health research.

Retention limits and data minimization. You cannot store personal data indefinitely. You must collect only what you need. For scrapers, this translates to: don't scrape personal data you're not actively using, don't retain it past a defined TTL, and don't replicate it across storage systems without justification.
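A defined TTL only helps if something actually enforces it. Here is a minimal sketch of a scheduled purge job, assuming a hypothetical record shape where each scraped record carries a `collected_at` timestamp; the 90-day window is an illustrative policy choice, not a GDPR-mandated number.

```python
from datetime import datetime, timedelta, timezone

RETENTION_TTL = timedelta(days=90)  # assumption: your documented retention policy

def purge_expired(records, now=None):
    """Return only records still inside the retention window.

    Run this on a schedule against any store holding personal data;
    expired records are dropped, not archived to a second system.
    """
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r["collected_at"] <= RETENTION_TTL]

records = [
    {"email": "a@example.com",
     "collected_at": datetime.now(timezone.utc) - timedelta(days=10)},
    {"email": "b@example.com",
     "collected_at": datetime.now(timezone.utc) - timedelta(days=120)},
]
kept = purge_expired(records)
print(len(kept))  # the 120-day-old record is gone
```

The important design point is that the purge operates on every store holding the data, which is exactly why replicating personal data across storage systems without justification multiplies your compliance surface.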

The right to erasure. If a person asks you to delete their data, you must be able to comply. If your scraped dataset is an append-only log in BigQuery, you have a structural problem with GDPR compliance. Think about erasure paths at architecture time, not when you receive a request.
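What an erasure path looks like at architecture time can be sketched simply: key personal data by a stable subject identifier so a deletion request maps to one delete, not a scan of an append-only log. The class and key format below are illustrative.

```python
# A minimal erasure-friendly store: personal data is indexed by a
# stable subject identifier, so a right-to-erasure request is a
# single keyed delete rather than a full-table rewrite.

class SubjectKeyedStore:
    def __init__(self):
        self._by_subject = {}  # subject_id -> list of records

    def ingest(self, subject_id, record):
        self._by_subject.setdefault(subject_id, []).append(record)

    def erase(self, subject_id):
        """Honor an erasure request; returns True if data existed."""
        return self._by_subject.pop(subject_id, None) is not None

store = SubjectKeyedStore()
store.ingest("linkedin:jdoe", {"title": "Engineer"})
print(store.erase("linkedin:jdoe"))  # request honored
print(store.erase("linkedin:jdoe"))  # already gone
```

The same idea applies to real warehouses: if every table carrying personal data has a subject-ID column and a documented delete procedure, erasure requests become routine instead of structural emergencies.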


The EU AI Act and Data Provenance

The EU AI Act introduces a new compliance dimension that scraping teams building AI training pipelines need to understand now.

For general-purpose AI models (the kind trained on large web corpora), the Act requires providers to maintain documentation about the training data, specifically whether it was subject to copyright restrictions or personal data protections. For models deployed within the EU, this creates a provenance requirement: you need to know where your training data came from and whether you had the right to use it.

This is not yet fully enforced (the relevant provisions phase in during 2026), but the direction is clear. Training data scraped without regard to copyright status or personal data content is going to create liability for the AI products trained on it.

For scraping operations that feed AI training pipelines, the practical implication is documentation: what did you scrape, from where, under what terms, and what did you do with personal data encountered during collection?
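Those four questions can be captured as one structured record per crawl batch. A sketch, with field names that are illustrative conventions rather than anything mandated by the AI Act's text:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ProvenanceRecord:
    """One entry per crawl batch, appended to a provenance log."""
    source_url: str
    collected_at: str
    license_status: str          # e.g. "CC-BY-4.0", "TOS-restricted", "unknown"
    personal_data_handling: str  # e.g. "emails stripped before storage"
    robots_txt_allowed: bool

record = ProvenanceRecord(
    source_url="https://example.com/articles",
    collected_at=datetime.now(timezone.utc).isoformat(),
    license_status="unknown",
    personal_data_handling="emails stripped before storage",
    robots_txt_allowed=True,
)
print(json.dumps(asdict(record)))  # one JSON line per batch in the log
```

A plain JSON-lines log of these records, written at collection time, is cheap to produce and exactly the artifact a later compliance review will ask for.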


Platform TOS Escalation

The legal sophistication of platform TOS has increased significantly. The boilerplate "no automated access" clauses of 2015 have been replaced by detailed, heavily lawyered prohibitions with specific technical definitions.

LinkedIn's current User Agreement, for instance, explicitly prohibits scraping even public profile data, defines "scraping" to include any automated means, and reserves the right to technical countermeasures. Reddit's API terms introduced in 2023, and the subsequent legal action against third-party clients, established the template that other platforms have been adopting.

The pattern: platforms are asserting contractual rights over data access as a complement to technical anti-bot measures. The technical layer tries to stop you; the legal layer provides recourse if it doesn't.

This doesn't make scraping non-viable. It means the risk profile varies significantly by target. Scraping a public government database is different from scraping a social platform that has explicitly prohibited automated access in a contract you may have agreed to.


What Ethical Proxy Sourcing Has to Do With Compliance

There's a compliance dimension to proxy sourcing that gets less attention than it deserves.

Residential proxy pools sourced through consent-based opt-in programs have a fundamentally different legal profile than pools assembled through bundled software, browser extensions installed without clear disclosure, or malware-compromised devices. For operations subject to GDPR or conducting due diligence for enterprise clients, the sourcing practices of your proxy provider are part of your compliance picture.

The wave of enforcement actions against proxy providers in 2024, including the seizures and shutdowns of several mid-tier providers, was driven precisely by inadequate consent frameworks. Using a provider whose IP pool is legally clean isn't just an ethical preference; it's risk management.

Evomi's proxy pools are consent-based and ethically sourced. When a legal or compliance review of your data infrastructure arrives, and for AI teams operating in the EU, it will, "our proxy provider operates a legitimate, consent-based residential network" is a much better answer than the alternative. You can verify the approach and test the service with a completely free trial.


A Practical Compliance Framework

For teams building scraping operations that will survive regulatory scrutiny:

Tier your targets. Publicly accessible data from government, academic, or openly licensed sources is lowest risk. Platform data from social networks with explicit automated-access prohibitions is highest risk. Build your risk assessment into target selection, not as an afterthought.

Personal data hygiene. If you encounter personal data, decide immediately whether you need it. If you don't, filter it before storage. If you do, document your lawful basis and build retention/erasure into the pipeline.
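The "filter it before storage" step can be a simple pre-storage pass. The regexes below are rough, illustrative patterns for obvious cases (emails, phone-like numbers); they will miss plenty and should be treated as a first pass, not a compliance guarantee.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Redact obvious personal-data patterns before records hit disk."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or +41 44 123 45 67"))
# -> "Contact [EMAIL] or [PHONE]"
```

Running this at ingestion, rather than on already-stored data, is the difference between never holding the personal data and having to prove you cleaned it up.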

Document your data provenance. Especially for AI training pipelines. Source, date, license status, personal data handling decisions. The EU AI Act's documentation requirements are directionally correct regardless of when enforcement matures.

Respect robots.txt, seriously. Courts have not uniformly held that violating robots.txt creates legal exposure, but the trend toward honoring it is clear. More practically, platforms that detect robots.txt violations treat it as a hostile signal, and the technical and legal response escalates together.
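Honoring robots.txt is cheap: Python ships a parser in the standard library. The rules and user agent below are illustrative.

```python
from urllib import robotparser

# Parse a robots.txt (here inlined; in production, fetch the target
# site's /robots.txt) and check each URL before crawling it.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("mybot/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("mybot/1.0", "https://example.com/private/data"))  # False
```

Wiring `can_fetch` into the crawl scheduler, so disallowed URLs never enter the queue, also keeps the check from being skipped under load.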

Use a compliant proxy provider. The legal provenance of your proxy pool is part of your compliance posture.


The Bottom Line

The compliance landscape isn't eliminating scraping. It's separating teams that think about it from teams that don't. The data operations that will still be running in five years are the ones being built today with legal sustainability in mind.

The rule of thumb: if you wouldn't be comfortable explaining your data collection practices to a regulator, redesign them. It's cheaper now than later.

Author

The Scraper

Engineer and Webscraping Specialist

About Author

The Scraper is a software engineer and web scraping specialist, focused on building production-grade data extraction systems. His work centers on large-scale crawling, anti-bot evasion, proxy infrastructure, and browser automation. He writes about real-world scraping failures, silent data corruption, and systems that operate at scale.
