Essential Playwright Web Scraping & Proxy Strategies (2025)





Sarah Whitmore
Scraping Techniques
Getting Started with Playwright for Web Scraping
Playwright is gaining serious traction in the web scraping world, and for good reason. This framework packs a suite of handy features that simplify extracting data from the web. It stands out by offering APIs for multiple popular programming languages like Python, JavaScript (via Node.js), Java, and .NET.
But multi-language support isn't the only perk. Playwright shines with advanced capabilities that streamline web scraping tasks, allowing deep customization of browser behavior to match the quirks of different websites.
So, What Exactly is Playwright?
Playwright is a relatively young open-source project, actively developed since 2020 and backed by Microsoft. Its core mission is to provide a unified API for automating browsers built on Chromium (like Chrome & Edge), WebKit (like Safari), and Firefox.
Its cross-platform, cross-language nature quickly won over developers. This versatility has made it a go-to tool for a variety of automation needs.
The two most prominent uses for Playwright are automated website testing and web scraping. Both benefit immensely from the framework's powerful features, enabling developers to automate complex browser interactions efficiently.
Using Playwright for web scraping is especially effective due to its knack for controlling multiple browser types while offering fine-grained control over each session. Moreover, Playwright is designed with performance in mind, making it a solid choice for demanding scraping jobs.
Is Playwright the Right Tool for Your Scraping Needs?
Generally speaking, Playwright offers a remarkably potent, flexible, and versatile solution for web scraping. Many automation libraries struggle with things like handling asynchronous operations smoothly or intelligently waiting for specific page elements to appear – areas where Playwright excels.
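For instance, Playwright locators auto-wait: an action like `click()` only proceeds once the target element is attached, visible, and stable, which removes most manual sleeps from scraping scripts. Here's a minimal sketch of that behavior (the URL and selector are placeholders):

// Auto-waiting sketch: click() waits for the element on its own
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // No waitForSelector or sleep needed; the locator waits until
  // the link is visible and stable before clicking
  await page.locator('a').first().click();
  await browser.close();
})();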
These built-in features are a huge boon for scraping developers, often making Playwright a more compelling option than many alternative browser automation tools.
However, it's not without its considerations. There's a bit of a learning curve; mastering its more advanced features requires some effort.
Also, the Playwright library itself is a bit on the larger side because it includes drivers for multiple browsers in both headed (visible) and headless (invisible) modes. While storage is cheap, this might be a factor in constrained environments.
Finally, being newer than giants like Selenium means that while its documentation is excellent, the community knowledge base isn't quite as vast yet, so finding pre-made solutions for niche problems might take a bit more digging.
Diving Into Web Scraping with Playwright
Playwright lets you write your automation scripts in several languages, but for this guide, we'll focus on JavaScript (Node.js) and Python examples.
The core concepts and Playwright methods are very similar across languages, so you should be able to adapt these examples even if you prefer Java or C#.
Setting Up Your Development Environment
Node.js
First things first, you'll need Node.js itself and a code editor. You can grab the Node.js installer from the official website. For an editor, popular choices include VS Code or IntelliJ IDEA.
Once Node.js is installed, install the Playwright library and its necessary browser binaries. Open your project directory in your terminal and run these commands:
npm init -y
npm install playwright
npx playwright install
Python
For Python enthusiasts, you'll need Python installed, plus an editor. Download Python from its official site. Good free editor options include VS Code or PyCharm Community Edition.
With Python ready, open your project in your editor's terminal and install Playwright:
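pip install playwright
playwright install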
Finding the Right Elements
Playwright offers several strategies for pinpointing the data you need on a page. Each has its pros and cons, and experience will guide your choice. Here are the common ones:
CSS Selectors: A very popular method. You target elements using their CSS class, ID, attributes, or structure.
XPath: Another powerful option. XPath lets you navigate the HTML document's tree structure (the DOM) to select elements based on their position and relationships.
Text Content: Playwright allows you to locate elements based on the visible text they contain. This is great for clicking specific buttons or links.
Let's look at a few examples using a practice website like books.toscrape.com.
Example #1: CSS Selectors (Finding Book Titles)
// JavaScript Example
const bookTitles = await page.$$eval(
  'article.product_pod h3 a',
  links => links.map(link => link.getAttribute('title'))
);
console.log('Book Titles:', bookTitles);
# Python Example
title_elements = page.query_selector_all('article.product_pod h3 a')
titles = [element.get_attribute('title') for element in title_elements]
print('Book Titles:', titles)
Here, Playwright finds all `<a>` tags that are descendants of an `<h3>` within an `<article>` having the class `product_pod`. It then extracts the `title` attribute from each found link and stores them.
Example #2: XPath (Finding Book Prices)
// JavaScript Example
const bookPrices = await page.$$eval(
  '//article[@class="product_pod"]//p[@class="price_color"]',
  priceElements => priceElements.map(el => el.textContent)
);
console.log('Book Prices:', bookPrices);
# Python Example
price_elements = page.query_selector_all('//article[@class="product_pod"]//p[@class="price_color"]')
prices = [element.text_content() for element in price_elements]
print('Book Prices:', prices)
This time, we use an XPath expression to locate `<p>` elements with the class `price_color` anywhere inside an `<article>` with the class `product_pod`. The text content (the price) is then extracted. Note that Playwright automatically treats selectors starting with `//` as XPath, so no `xpath=` prefix is required.
Example #3: Text-based Locations (Finding a Category Link)
// JavaScript Example
const travelLink = page.locator('a:has-text("Travel")');
await travelLink.click(); // Example action: clicking the link
console.log('Clicked on the "Travel" category link.');
# Python Example
travel_link = page.locator('a:has-text("Travel")')
travel_link.click() # Example action: clicking the link (no await in the sync API)
print('Clicked on the "Travel" category link.')
Locating by text is often quite intuitive. We simply tell Playwright to find an anchor tag (`<a>`) whose visible text contains "Travel" (note that `:has-text()` matches substrings, not exact text). We could then interact with it, like clicking it as shown.
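Recent Playwright versions (1.27+) also ship dedicated role-based locators, which tend to be more resilient than raw CSS pseudo-classes. A short sketch of the same interaction:

// JavaScript: role-based locator for the same "Travel" link
const travelLink = page.getByRole('link', { name: 'Travel' });
await travelLink.click();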
Scraping Text Content
Let's put this together. We'll use the tried-and-true method of CSS selectors to scrape text. First, we need to initialize a Playwright browser instance and navigate to our target page (books.toscrape.com).
Here's the basic Node.js setup:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch(); // Launches headless by default
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');
  console.log('Page loaded:', await page.title());

  // Scraping logic will go here

  await browser.close();
})();
This script launches a Chromium browser, opens a new page, and navigates to the specified URL. Since it's headless, you won't see a browser window pop up.
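If you'd rather watch what's happening while debugging, you can launch in headed mode and optionally slow each action down:

// Headed mode for debugging; slowMo (in ms) throttles every action
const browser = await chromium.launch({ headless: false, slowMo: 250 });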
Now, let's add the CSS selector logic from Example #1 to find and print the book titles:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');

  // Select all book title links and extract the 'title' attribute
  const bookTitles = await page.$$eval(
    'article.product_pod h3 a',
    links => links.map(link => link.getAttribute('title'))
  );
  console.log('Found Titles:', bookTitles);

  await browser.close();
})();
We use the `$$eval` method with our CSS selector (`article.product_pod h3 a`). The second argument is a function that runs in the browser context, collecting the `title` attribute from each selected link.
Finding the right selector often involves using your browser's Developer Tools. Right-click the element you want, select "Inspect" or "Inspect Element", and examine the HTML structure. You can often right-click the HTML in the DevTools and find options like "Copy > Copy selector" or "Copy > Copy XPath".
Sometimes the automatically copied selector is overly specific. It pays to examine the HTML yourself to find a simpler, more robust selector, like we did by identifying the `article.product_pod h3 a` pattern.
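To illustrate the difference, DevTools might hand you something long and position-dependent, while a hand-written selector survives layout changes far better (the copied selector below is made up for illustration, not taken from an actual page):

// Illustrative DevTools-copied selector: brittle, tied to exact DOM position
'#default > div > div > section > ol > li:nth-child(1) > article > h3 > a'
// Hand-simplified selector: targets the stable class structure instead
'article.product_pod h3 a'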
Here’s the equivalent code for Python:
from playwright.sync_api import sync_playwright

def scrape_book_titles():
    with sync_playwright() as p:
        browser = p.chromium.launch()  # Launches headless by default
        page = browser.new_page()
        page.goto('https://books.toscrape.com/')

        # Use CSS selector to get title attributes
        title_elements = page.query_selector_all('article.product_pod h3 a')
        titles = [element.get_attribute('title') for element in title_elements]
        print('Found Titles:', titles)

        browser.close()

scrape_book_titles()
Scraping Images
Playwright isn't limited to text; you can easily grab image URLs and even download the images themselves. Keep in mind that images consume significantly more bandwidth and storage than text, so plan your scraping strategy accordingly, especially for large-scale tasks.
Let's modify our script to download the book cover images from the first page:
const { chromium } = require('playwright');
const fs = require('fs'); // Node.js file system module
const path = require('path'); // Node.js path module
const https = require('https'); // Needed for downloading images often served over https

// Helper function to download image
function downloadImage(url, filepath) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      if (res.statusCode === 200) {
        res.pipe(fs.createWriteStream(filepath))
          .on('error', reject)
          .once('close', () => resolve(filepath));
      } else {
        // Consume response data to free up memory
        res.resume();
        reject(new Error(`Request Failed With Status Code: ${res.statusCode} for ${url}`));
      }
    });
  });
}

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // We need the base URL to construct full image paths
  const baseUrl = 'https://books.toscrape.com/';
  await page.goto(baseUrl);

  // Select image elements and get their 'src' attribute
  const imageRelPaths = await page.$$eval('article.product_pod .image_container img', imgs =>
    imgs.map(img => img.getAttribute('src'))
  );

  // Create a directory to save images if it doesn't exist
  const imageDir = path.join(__dirname, 'book_images');
  if (!fs.existsSync(imageDir)) {
    fs.mkdirSync(imageDir);
  }

  console.log(`Found ${imageRelPaths.length} images. Downloading...`);

  for (let i = 0; i < imageRelPaths.length; i++) {
    const imgRelPath = imageRelPaths[i];
    // Construct the full URL (relative paths are common)
    const imgUrl = new URL(imgRelPath, baseUrl).href;
    const filename = path.basename(imgUrl); // Extract filename from URL
    const filepath = path.join(imageDir, `${i}_${filename}`); // Prepend index to avoid name collisions

    try {
      await downloadImage(imgUrl, filepath);
      console.log(`Downloaded: ${filepath}`);
    } catch (error) {
      console.error(`Failed to download ${imgUrl}: ${error.message}`);
    }
  }

  await browser.close();
  console.log('Image download process complete.');
})();
Key changes compared to the previous script:
We include Node.js's `fs` (File System), `path`, and `https` modules.
We define a helper function `downloadImage` to handle fetching the image data via HTTPS and writing it to a file.
We target the `img` tags within the `div.image_container`.
We extract the `src` attribute, which often contains a relative path.
We construct the absolute URL using the `baseUrl`.
We create a directory `book_images` if it doesn't exist.
We loop through the image URLs, download each one using our helper function, and save it with a unique filename in the created directory.
And here's the Python equivalent:
import os
import requests # Easier for downloading files in Python
from urllib.parse import urljoin
from playwright.sync_api import sync_playwright
def scrape_book_images():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        base_url = 'https://books.toscrape.com/'
        page.goto(base_url)

        # Get relative image paths
        image_elements = page.query_selector_all('article.product_pod .image_container img')
        rel_image_paths = [img.get_attribute('src') for img in image_elements]

        # Create directory for images
        image_dir = 'book_images_py'
        os.makedirs(image_dir, exist_ok=True)

        print(f'Found {len(rel_image_paths)} images. Downloading...')

        for i, rel_path in enumerate(rel_image_paths):
            # Construct absolute URL
            img_url = urljoin(base_url, rel_path)
            filename = os.path.basename(img_url)
            filepath = os.path.join(image_dir, f'{i}_{filename}')

            try:
                # Download image using requests
                response = requests.get(img_url, stream=True)
                response.raise_for_status()  # Raise an exception for bad status codes
                with open(filepath, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                print(f'Downloaded: {filepath}')
            except requests.exceptions.RequestException as e:
                print(f'Failed to download {img_url}: {e}')

        browser.close()
        print('Image download process complete.')

scrape_book_images()
The Python version uses the popular `requests` library for cleaner file downloads and `os` and `urllib.parse` for path manipulation.
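Conversely, when you only need text and metadata, you can save the bandwidth mentioned earlier by blocking image requests entirely through Playwright's request interception. A minimal sketch:

// Block image requests to save bandwidth when only text is needed
await page.route('**/*.{png,jpg,jpeg,gif,webp}', route => route.abort());
await page.goto('https://books.toscrape.com/'); // Page loads without fetching images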
Integrating Proxies with Playwright
When performing web scraping at any significant scale, proxies become essential. Websites employ anti-bot measures, and making too many requests from a single IP address is a surefire way to get blocked. No matter how cleverly you mimic human behavior, sooner or later, you'll likely encounter IP bans or CAPTCHAs.
Proxies act as intermediaries, masking your real IP address and allowing you to route requests through different ones. This bypasses IP-based blocking. Thankfully, Playwright has built-in support for proxies, making setup quite straightforward.
You just need to add proxy details to the `launch` options:
// JavaScript Example with Evomi Residential Proxy Placeholder
const browser = await playwright.chromium.launch({
  proxy: {
    server: 'http://rp.evomi.com:1000', // Example: Evomi Residential HTTP endpoint
    username: 'YOUR_EVOMI_USERNAME',
    password: 'YOUR_EVOMI_PASSWORD'
  }
});
Simply provide the server address (hostname and port) and your credentials. This example uses placeholder details for an Evomi residential proxy – you'd replace `rp.evomi.com:1000`, `YOUR_EVOMI_USERNAME`, and `YOUR_EVOMI_PASSWORD` with your actual service details. Evomi offers various proxy types like Residential, Mobile, Datacenter, and Static ISP, each with different endpoints and ideal use cases.
Before using a proxy in your script, it's wise to verify it's working correctly. You can use a tool like the Evomi Free Proxy Tester to check its status and location.
Here’s the Python equivalent:
# Python Example with Evomi Datacenter Proxy Placeholder
from playwright.sync_api import sync_playwright
def launch_with_proxy():
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy={
            "server": "http://dc.evomi.com:2000",  # Example: Evomi Datacenter HTTP endpoint
            "username": "YOUR_EVOMI_USERNAME",
            "password": "YOUR_EVOMI_PASSWORD"
        })
        page = browser.new_page()

        # Test navigation through proxy
        try:
            # Use a site that shows your IP
            page.goto('https://geo.evomi.com/', timeout=60000)
            print("Page loaded via proxy. Check IP details.")
            print(page.content())  # Print page content to see IP info
        except Exception as e:
            print(f"Failed to load page via proxy: {e}")
        finally:
            browser.close()

launch_with_proxy()
This Python example uses placeholder details for an Evomi datacenter proxy. We also added a visit to Evomi's IP Geolocation Checker to help confirm the proxy is active. Remember to replace the placeholder credentials and endpoint (`dc.evomi.com:2000`) with your specific Evomi plan details.
Choosing a reliable proxy provider like Evomi, known for ethically sourced proxies and Swiss quality standards, ensures better success rates and stability for your Playwright scraping projects. Many providers, including Evomi, offer rotating proxies that automatically change the IP address for you, simplifying management for large-scale scraping.
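If you manage your own pool of endpoints instead, Playwright also lets you assign a proxy per browser context, so a single browser instance can rotate IPs across contexts. Here's a minimal sketch with hypothetical proxy addresses (note that Chromium expects a placeholder `proxy` option at launch when proxies are set per context):

// Rotating across per-context proxies (proxy URLs below are hypothetical)
const { chromium } = require('playwright');

(async () => {
  // Launch-level placeholder; each context overrides it with its own proxy
  const browser = await chromium.launch({ proxy: { server: 'http://per-context' } });

  const proxies = ['http://proxy1.example.com:8000', 'http://proxy2.example.com:8000'];
  for (const server of proxies) {
    const context = await browser.newContext({ proxy: { server } });
    const page = await context.newPage();
    await page.goto('https://books.toscrape.com/');
    console.log(`Loaded via ${server}:`, await page.title());
    await context.close(); // Each context has its own proxy (and cookies)
  }

  await browser.close();
})();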

Author
Sarah Whitmore
Digital Privacy & Cybersecurity Consultant
About Author
Sarah is a cybersecurity strategist with a passion for online privacy and digital security. She explores how proxies, VPNs, and encryption tools protect users from tracking, cyber threats, and data breaches. With years of experience in cybersecurity consulting, she provides practical insights into safeguarding sensitive data in an increasingly digital world.