Essential Playwright Web Scraping & Proxy Strategies (2025)

Sarah Whitmore

Last edited on May 4, 2025

Scraping Techniques

Getting Started with Playwright for Web Scraping

Playwright is gaining serious traction in the web scraping world, and for good reason. This framework packs a suite of handy features that simplify extracting data from the web. It stands out by offering APIs for multiple popular programming languages like Python, JavaScript (via Node.js), Java, and .NET.

But multi-language support isn't the only perk. Playwright shines with advanced capabilities that streamline web scraping tasks, allowing deep customization of browser behavior to match the quirks of different websites.

So, What Exactly is Playwright?

Playwright is a relatively young open-source project, actively developed since 2020 and backed by Microsoft. Its core mission is to provide a unified API for automating browsers built on Chromium (like Chrome & Edge), WebKit (like Safari), and Firefox.

Its cross-platform, cross-language nature quickly won over developers. This versatility has made it a go-to tool for a variety of automation needs.

The two most prominent uses for Playwright are automated website testing and web scraping. Both benefit immensely from the framework's powerful features, enabling developers to automate complex browser interactions efficiently.

Using Playwright for web scraping is especially effective due to its knack for controlling multiple browser types while offering fine-grained control over each session. Moreover, Playwright is designed with performance in mind, making it a solid choice for demanding scraping jobs.

Is Playwright the Right Tool for Your Scraping Needs?

Generally speaking, Playwright offers a remarkably potent, flexible, and versatile solution for web scraping. Many automation libraries struggle with things like handling asynchronous operations smoothly or intelligently waiting for specific page elements to appear – areas where Playwright excels.

These built-in features are a huge boon for scraping developers, often making Playwright a more compelling option than many alternative browser automation tools.

However, it's not without its considerations. There's a bit of a learning curve; mastering its more advanced features requires some effort.

Also, the Playwright library itself is a bit on the larger side because it includes drivers for multiple browsers in both headed (visible) and headless (invisible) modes. While storage is cheap, this might be a factor in constrained environments.

Finally, being newer than giants like Selenium means that while its documentation is excellent, the community knowledge base isn't quite as vast yet, so finding pre-made solutions for niche problems might take a bit more digging.

Diving Into Web Scraping with Playwright

Playwright lets you write your automation scripts in several languages, but for this guide, we'll focus on JavaScript (Node.js) and Python examples.

The core concepts and Playwright methods are very similar across languages, so you should be able to adapt these examples even if you prefer Java or C#.

Setting Up Your Development Environment

Node.js

First things first, you'll need Node.js itself and a code editor. You can grab the Node.js installer from the official website. For an editor, popular choices include VS Code or IntelliJ IDEA.

Once Node.js is installed, install the Playwright library and its necessary browser binaries. Open your project directory in your terminal and run these commands:

npm init -y
npm install playwright
npx playwright install

Python

For Python enthusiasts, you'll need Python installed, plus an editor. Download Python from its official site. Good free editor options include VS Code or PyCharm Community Edition.

With Python ready, open your project in your editor's terminal and install Playwright:
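
pip install playwright
playwright install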

Finding the Right Elements

Playwright offers several strategies for pinpointing the data you need on a page. Each has its pros and cons, and experience will guide your choice. Here are the common ones:

  • CSS Selectors: A very popular method. You target elements using their CSS class, ID, attributes, or structure.

  • XPath: Another powerful option. XPath lets you navigate the HTML document's tree structure (the DOM) to select elements based on their position and relationships.

  • Text Content: Playwright allows you to locate elements based on the visible text they contain. This is great for clicking specific buttons or links.

Let's look at a few examples using a practice website like books.toscrape.com.

Example #1: CSS Selectors (Finding Book Titles)

// JavaScript Example
const bookTitles = await page.$$eval(
  'article.product_pod h3 a',
  links => links.map(link => link.getAttribute('title'))
);
console.log('Book Titles:', bookTitles);
# Python Example
title_elements = page.query_selector_all('article.product_pod h3 a')
titles = [element.get_attribute('title') for element in title_elements]
print('Book Titles:', titles)

Here, Playwright finds all <a> tags that are descendants of an <h3> within an <article> having the class `product_pod`. It then extracts the `title` attribute from each found link and stores them.

Example #2: XPath (Finding Book Prices)

// JavaScript Example
const bookPrices = await page.$$eval(
  '//article[@class="product_pod"]//p[@class="price_color"]',
  priceElements => priceElements.map(el => el.textContent)
);
console.log('Book Prices:', bookPrices);
# Python Example
price_elements = page.query_selector_all('//article[@class="product_pod"]//p[@class="price_color"]')
prices = [element.text_content() for element in price_elements]
print('Book Prices:', prices)

This time, we use an XPath expression to locate <p> elements with the class `price_color` anywhere inside an <article> with the class `product_pod`. The text content (the price) is then extracted.

Example #3: Text-based Locations (Finding a Category Link)

// JavaScript Example
const travelLink = page.locator('a:has-text("Travel")');
await travelLink.click(); // Example action: clicking the link
console.log('Clicked on the "Travel" category link.');
# Python Example
travel_link = page.locator('a:has-text("Travel")')
travel_link.click()  # Sync API: no await needed
print('Clicked on the "Travel" category link.')

Locating by text is often quite intuitive. We simply tell Playwright to find an anchor tag (<a>) that contains the text "Travel". We could then interact with it, like clicking it as shown.
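
One nuance: `:has-text()` matches substrings, so it would also match a link labeled "Travel Guides". If you need an exact match, recent Playwright versions offer the `getByText` locator with an `exact` option; here's a minimal JavaScript sketch:

// Match the link text exactly (surrounding whitespace is trimmed automatically)
const travelLink = page.getByText('Travel', { exact: true });
await travelLink.click();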

Scraping Text Content

Let's put this together. We'll use the tried-and-true method of CSS selectors to scrape text. First, we need to initialize a Playwright browser instance and navigate to our target page (books.toscrape.com).

Here's the basic Node.js setup:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch(); // Launches headless by default
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');
  console.log('Page loaded:', await page.title());
  // Scraping logic will go here
  await browser.close();
})();

This script launches a Chromium browser, opens a new page, and navigates to the specified URL. Since it's headless, you won't see a browser window pop up.
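
If you'd like to watch the script work, for instance while debugging selectors, you can launch a visible browser instead. Both options below are part of Playwright's standard launch API; `slowMo` is optional and simply delays each action:

// Headed mode: opens a visible browser window and slows each action by 250 ms
const browser = await chromium.launch({ headless: false, slowMo: 250 });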

Now, let's add the CSS selector logic from Example #1 to find and print the book titles:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto('https://books.toscrape.com/');

  // Select all book title links and extract the 'title' attribute
  const bookTitles = await page.$$eval(
    'article.product_pod h3 a',
    links => links.map(link => link.getAttribute('title'))
  );

  console.log('Found Titles:', bookTitles);

  await browser.close();
})();

We use the $$eval method with our CSS selector (`article.product_pod h3 a`). The second argument is a function that runs in the browser context, collecting the `title` attribute from each selected link.

Finding the right selector often involves using your browser's Developer Tools. Right-click the element you want, select "Inspect" or "Inspect Element", and examine the HTML structure. You can often right-click the HTML in the DevTools and find options like "Copy > Copy selector" or "Copy > Copy XPath".

Sometimes the automatically copied selector is overly specific. It pays to examine the HTML yourself to find a simpler, more robust selector, like we did by identifying the `article.product_pod h3 a` pattern.
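
For example, DevTools' "Copy selector" can produce a long, position-dependent chain like the first selector below (a hypothetical illustration of typical auto-generated output), which breaks as soon as the page layout shifts. The hand-picked pattern beneath it targets the same links far more robustly:

/* Hypothetical auto-copied selector: brittle, tied to exact DOM position */
#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > h3 > a

/* Hand-picked selector: short and resilient to layout changes */
article.product_pod h3 a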

Here’s the equivalent code for Python:

from playwright.sync_api import sync_playwright

def scrape_book_titles():
    with sync_playwright() as p:
        browser = p.chromium.launch() # Launches headless by default
        page = browser.new_page()
        page.goto('https://books.toscrape.com/')
        # Use CSS selector to get title attributes
        title_elements = page.query_selector_all('article.product_pod h3 a')
        titles = [element.get_attribute('title') for element in title_elements]
        print('Found Titles:', titles)
        browser.close()

scrape_book_titles()
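
A side note: `$$eval` and `query_selector_all` work fine, but newer Playwright releases steer users toward the auto-waiting locator API. Here's a sketch of the same title extraction using locators, assuming a reasonably recent Playwright version:

// Locator-based alternative: locators auto-wait and re-query the DOM
const titleLocator = page.locator('article.product_pod h3 a');
const titles = await titleLocator.evaluateAll(
  links => links.map(link => link.getAttribute('title'))
);
console.log('Found Titles:', titles);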

Scraping Images

Playwright isn't limited to text; you can easily grab image URLs and even download the images themselves. Keep in mind that images consume significantly more bandwidth and storage than text, so plan your scraping strategy accordingly, especially for large-scale tasks.
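
The reverse also holds: when you only need text, you can save bandwidth by intercepting requests and aborting image downloads before they happen. A short sketch using Playwright's routing API; the glob pattern is one common way to match image URLs:

// Abort requests for common image formats before navigating
await page.route('**/*.{png,jpg,jpeg,gif,webp,svg}', route => route.abort());
await page.goto('https://books.toscrape.com/');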

Let's modify our script to download the book cover images from the first page:

const { chromium } = require('playwright');
const fs = require('fs'); // Node.js file system module
const path = require('path'); // Node.js path module
const https = require('https'); // Needed for downloading images often served over https

// Helper function to download image
function downloadImage(url, filepath) {
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            if (res.statusCode === 200) {
                res.pipe(fs.createWriteStream(filepath))
                    .on('error', reject)
                    .once('close', () => resolve(filepath));
            } else {
                // Consume response data to free up memory
                res.resume();
                reject(new Error(`Request Failed With Status Code: ${res.statusCode} for ${url}`));
            }
        }).on('error', reject); // Reject on network-level errors (e.g., DNS failures)
    });
}

(async () => {
    const browser = await chromium.launch();
    const page = await browser.newPage();

    // We need the base URL to construct full image paths
    const baseUrl = 'https://books.toscrape.com/';
    await page.goto(baseUrl);

    // Select image elements and get their 'src' attribute
    const imageRelPaths = await page.$$eval('article.product_pod .image_container img', imgs =>
        imgs.map(img => img.getAttribute('src'))
    );

    // Create a directory to save images if it doesn't exist
    const imageDir = path.join(__dirname, 'book_images');
    if (!fs.existsSync(imageDir)) {
        fs.mkdirSync(imageDir);
    }

    console.log(`Found ${imageRelPaths.length} images. Downloading...`);

    for (let i = 0; i < imageRelPaths.length; i++) {
        const imgRelPath = imageRelPaths[i];
        // Construct the full URL (relative paths are common)
        const imgUrl = new URL(imgRelPath, baseUrl).href;
        const filename = path.basename(imgUrl); // Extract filename from URL
        const filepath = path.join(imageDir, `${i}_${filename}`); // Prepend index to avoid name collisions

        try {
            await downloadImage(imgUrl, filepath);
            console.log(`Downloaded: ${filepath}`);
        } catch (error) {
            console.error(`Failed to download ${imgUrl}: ${error.message}`);
        }
    }

    await browser.close();
    console.log('Image download process complete.');
})();

Key changes:

  • We include Node.js's `fs` (File System), `path`, and `https` modules.

  • We define a helper function `downloadImage` to handle fetching the image data via HTTPS and writing it to a file.

  • We target the `img` tags within the `div.image_container`.

  • We extract the `src` attribute, which often contains a relative path.

  • We construct the absolute URL using the `baseUrl`.

  • We create a directory `book_images` if it doesn't exist.

  • We loop through the image URLs, download each one using our helper function, and save it with a unique filename in the created directory.

And here's the Python equivalent:

import os
import requests  # Easier for downloading files in Python
from urllib.parse import urljoin
from playwright.sync_api import sync_playwright


def scrape_book_images():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        base_url = 'https://books.toscrape.com/'
        page.goto(base_url)

        # Get relative image paths
        image_elements = page.query_selector_all('article.product_pod .image_container img')
        rel_image_paths = [img.get_attribute('src') for img in image_elements]

        # Create directory for images
        image_dir = 'book_images_py'
        os.makedirs(image_dir, exist_ok=True)

        print(f'Found {len(rel_image_paths)} images. Downloading...')
        for i, rel_path in enumerate(rel_image_paths):
            # Construct absolute URL
            img_url = urljoin(base_url, rel_path)
            filename = os.path.basename(img_url)
            filepath = os.path.join(image_dir, f'{i}_{filename}')

            try:
                # Download image using requests
                response = requests.get(img_url, stream=True)
                response.raise_for_status()  # Raise an exception for bad status codes
                with open(filepath, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                print(f'Downloaded: {filepath}')
            except requests.exceptions.RequestException as e:
                print(f'Failed to download {img_url}: {e}')

        browser.close()
        print('Image download process complete.')


scrape_book_images()

The Python version uses the popular `requests` library for cleaner file downloads and `os` and `urllib.parse` for path manipulation.

Integrating Proxies with Playwright

When performing web scraping at any significant scale, proxies become essential. Websites employ anti-bot measures, and making too many requests from a single IP address is a surefire way to get blocked. No matter how cleverly you mimic human behavior, sooner or later, you'll likely encounter IP bans or CAPTCHAs.

Proxies act as intermediaries, masking your real IP address and allowing you to route requests through different ones. This bypasses IP-based blocking. Thankfully, Playwright has built-in support for proxies, making setup quite straightforward.

You just need to add proxy details to the `launch` options:

// JavaScript Example with Evomi Residential Proxy Placeholder
const playwright = require('playwright');

const browser = await playwright.chromium.launch({
  proxy: {
    server: 'http://rp.evomi.com:1000', // Example: Evomi Residential HTTP endpoint
    username: 'YOUR_EVOMI_USERNAME',
    password: 'YOUR_EVOMI_PASSWORD'
  }
});

Simply provide the server address (hostname and port) and your credentials. This example uses placeholder details for an Evomi residential proxy – you'd replace `rp.evomi.com:1000`, `YOUR_EVOMI_USERNAME`, and `YOUR_EVOMI_PASSWORD` with your actual service details. Evomi offers various proxy types like Residential, Mobile, Datacenter, and Static ISP, each with different endpoints and ideal use cases.

Before using a proxy in your script, it's wise to verify it's working correctly. You can use a tool like the Evomi Free Proxy Tester to check its status and location.

Here’s the Python equivalent:

# Python Example with Evomi Datacenter Proxy Placeholder
from playwright.sync_api import sync_playwright


def launch_with_proxy():
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy={
            "server": "http://dc.evomi.com:2000",  # Example: Evomi Datacenter HTTP endpoint
            "username": "YOUR_EVOMI_USERNAME",
            "password": "YOUR_EVOMI_PASSWORD"
        })
        page = browser.new_page()
        # Test navigation through proxy
        try:
            # Use a site that shows your IP
            page.goto('https://geo.evomi.com/', timeout=60000)
            print("Page loaded via proxy. Check IP details.")
            print(page.content())  # Print page content to see IP info
        except Exception as e:
            print(f"Failed to load page via proxy: {e}")
        finally:
            browser.close()


launch_with_proxy()

This Python example uses placeholder details for an Evomi datacenter proxy. We also added a visit to Evomi's IP Geolocation Checker to help confirm the proxy is active. Remember to replace the placeholder credentials and endpoint (`dc.evomi.com:2000`) with your specific Evomi plan details.

Choosing a reliable proxy provider like Evomi, known for ethically sourced proxies and Swiss quality standards, ensures better success rates and stability for your Playwright scraping projects. Many providers, including Evomi, offer rotating proxies that automatically change the IP address for you, simplifying management for large-scale scraping.
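
If your plan provides a list of static endpoints rather than a rotating gateway, you can approximate rotation yourself by giving each browser context its own proxy. Below is a minimal JavaScript sketch, assuming placeholder endpoints and that your Playwright version supports the per-context `proxy` option (with Chromium, some versions also require a proxy to be set at launch):

const { chromium } = require('playwright');

// Placeholder proxy list: replace with your provider's actual endpoints
const proxies = [
  { server: 'http://proxy1.example.com:8000', username: 'USER', password: 'PASS' },
  { server: 'http://proxy2.example.com:8000', username: 'USER', password: 'PASS' },
];

(async () => {
  const browser = await chromium.launch();
  for (const [i, proxy] of proxies.entries()) {
    // Each context gets its own proxy, cookies, and cache
    const context = await browser.newContext({ proxy });
    const page = await context.newPage();
    await page.goto('https://books.toscrape.com/');
    console.log(`Context ${i} loaded: ${await page.title()}`);
    await context.close();
  }
  await browser.close();
})();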

Author

Sarah Whitmore

Digital Privacy & Cybersecurity Consultant

About Author

Sarah is a cybersecurity strategist with a passion for online privacy and digital security. She explores how proxies, VPNs, and encryption tools protect users from tracking, cyber threats, and data breaches. With years of experience in cybersecurity consulting, she provides practical insights into safeguarding sensitive data in an increasingly digital world.
