Stay Unblocked: Web Scraping with JavaScript & Node.js

Nathan Reynolds

Last edited on May 4, 2025


Tackling Web Scraping with JavaScript and Node.js

Using JavaScript and Node.js for web scraping? That's a smart move. This combination lets you gather data even from websites that load content dynamically, a common hurdle for simpler scraping methods.

Web scraping itself offers a ton of advantages. Think about monitoring competitor pricing, tracking supplier stock levels, automating repetitive online tasks, or even gauging brand sentiment across the web. It's a powerful technique for gathering large datasets automatically and transforming them into actionable insights for your business or project.

When you bring JavaScript into the mix via Node.js, you gain the ability to mimic user interactions almost perfectly, all while benefiting from the speed and convenience of automation, a rich ecosystem of libraries, and reusable code.

However, navigating the world of web scraping isn't without its challenges.

The biggest one? Getting blocked. While scraping publicly available data is generally legal, most websites actively try to detect and block automated scrapers. Staying undetected is key.

Secondly, you need the right tools for the job – tools that are both efficient and dependable. With numerous libraries available, choosing the best fit, especially when starting out with JavaScript and Node.js scraping, can feel a bit overwhelming.

So, this post aims to be your practical guide to web scraping using JavaScript and Node.js. We'll cover the essentials, show you how to keep your scrapers running smoothly without getting blocked, and explore how to perform various actions like clicking buttons, handling logins, and capturing screenshots.

Let's dive in!

Getting Started: Web Scraping with Node.js

Node.js provides the runtime environment that lets you execute JavaScript code outside the confines of a web browser, right on your server or local machine. This is the foundation for our scraping tasks.

By using Node.js, you can programmatically launch and control a web browser instance. This allows you to load web pages just like a real user, interact with elements, and extract the necessary data.

Several libraries facilitate this, including Puppeteer, Playwright, and others. These tools allow you to manage "headless" browsers – browsers without a graphical user interface, controlled entirely by your code. This enables actions like button clicks, form submissions, data extraction, and screenshot generation.

Each library comes with its own set of strengths and weaknesses, often balancing ease of use against the depth of features available. Simpler libraries might be quicker to learn but lack the advanced capabilities needed for complex scenarios.

Why Choose Node.js for Scraping?

Node.js is exceptionally well-suited for web scraping tasks. Its asynchronous, event-driven architecture handles network requests efficiently, and its vast package ecosystem (npm) provides powerful libraries like Puppeteer, which significantly streamlines the process of controlling headless browsers.
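
To make the asynchronous angle concrete, here's a minimal sketch of fetching several pages concurrently with Promise.all (assuming Node 18+ for the built-in fetch; the URLs are placeholders):

// Minimal sketch: fetch several pages concurrently (assumes Node 18+ for built-in fetch).
// The URLs below are placeholders.
const urls = [
  'https://example.com/page-1',
  'https://example.com/page-2',
  'https://example.com/page-3'
];

async function fetchAll() {
  // All requests start at once; we wait for them together instead of one by one
  const responses = await Promise.all(urls.map(url => fetch(url)));
  const bodies = await Promise.all(responses.map(res => res.text()));
  bodies.forEach((html, i) => console.log(`${urls[i]}: ${html.length} characters`));
}

fetchAll().catch(console.error);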

Selecting Your JavaScript Scraping Toolkit

When it comes to scraping web pages with JavaScript, you have a few different approaches you could take:

  • Fetching HTML + Regex: Trying to parse HTML with regular expressions might seem straightforward initially, but it's fragile. Minor changes in website structure or dynamic content loading can easily break your scraper. It's generally not recommended for anything beyond the simplest static pages.

  • Fetching HTML + Parser Libraries: Libraries like Cheerio or JSDom parse the HTML and create a Document Object Model (DOM) structure you can navigate. This is more robust than regex but still struggles with pages that rely heavily on client-side JavaScript to render content, since the parser doesn't execute that JavaScript (see the short Cheerio sketch after this list).

  • Intercepting XHR/API Requests: Sometimes, the data you need is loaded via background requests (often called XHR or Fetch requests). You can inspect these in your browser's developer tools (Network tab). If you find a request returning the data directly (often in JSON format), you might be able to fetch this data endpoint directly, potentially bypassing the need to render the full page. This is efficient but relies on finding such an endpoint and understanding its structure.

  • Using a Headless Browser: This is often the most reliable and versatile method, especially for modern websites. Libraries like Puppeteer or Playwright automate a real browser engine (like Chromium) behind the scenes. This means JavaScript is executed, content is rendered dynamically, and you can interact with the page almost exactly like a human user would – clicking, typing, scrolling, and extracting the fully rendered content.
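
Before moving on, here's a minimal sketch of the parser-library approach for comparison. It assumes you've run npm install cheerio and uses Node 18+'s built-in fetch, targeting the server-rendered version of the example site (without /js/), since Cheerio won't execute client-side JavaScript:

// Parser-library sketch: fetch static HTML and query it with Cheerio (no browser involved).
// Assumes Node 18+ (built-in fetch) and `npm install cheerio`.
const cheerio = require('cheerio');

async function scrapeStaticQuotes() {
  const response = await fetch('http://quotes.toscrape.com/'); // server-rendered page
  const html = await response.text();
  const $ = cheerio.load(html);

  // jQuery-style selectors, similar to what we'll use later with Puppeteer
  const quotes = $('.quote').map((_, el) => ({
    text: $(el).find('.text').text(),
    author: $(el).find('.author').text()
  })).get();

  console.log(quotes);
}

scrapeStaticQuotes().catch(console.error);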

Given its power and reliability for handling complex, dynamic websites, we'll focus on the headless browser approach in this guide, specifically using Puppeteer.

Puppeteer, developed by Google, provides a high-level API to control Chrome or Chromium over the DevTools Protocol. While other similar libraries exist (like Playwright), the fundamental concepts and workflow are quite similar. Understanding Puppeteer will give you a solid foundation applicable to other tools as well.
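
For a sense of how transferable those concepts are, here's roughly the same launch-and-navigate flow written with Playwright instead (a sketch, assuming npm install playwright, which also downloads its browser builds):

// Rough Playwright equivalent of the Puppeteer workflow used in this guide.
// Assumes `npm install playwright`.
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('http://quotes.toscrape.com/js/');
  console.log(await page.title()); // e.g. "Quotes to Scrape"
  await browser.close();
})();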

A Step-by-Step Approach to JavaScript Web Scraping

Performing web scraping with JavaScript generally follows a logical sequence. Think of it as instructing a robot step-by-step:

  1. Setup: Initialize Node.js and install your chosen library (Puppeteer).

  2. Launch & Navigate: Start a browser instance (ideally configured with a proxy) and open a new tab to load your target URL.

  3. Interact & Extract: Locate elements on the page, perform necessary actions (clicks, input), and extract the desired data.

  4. Process & Close: Format or process the collected data, save it, and properly close the browser instance to free up resources.

Let's break down each step using Puppeteer.

Step 1: Setting Up Your Node.js Project with Puppeteer

First things first, ensure you have Node.js installed on your system. If not, head over to the official Node.js website and download the installer for your operating system.

Once Node.js is ready, create a new folder for your project. Open your terminal or command prompt, navigate into that folder, and initialize a new Node.js project:

npm init -y

This creates a package.json file. Now, install Puppeteer:

npm install puppeteer

This command downloads Puppeteer and a compatible version of Chromium. Create a file named scraper.js (or any name you prefer) in your project folder. Let's start with the basic setup:

// Import the Puppeteer library
const puppeteer = require('puppeteer');

// Get URL from command line arguments, or use a default
let targetUrl = process.argv[2];

if (!targetUrl) {
  // Using a site designed for scraping practice
  targetUrl = "http://quotes.toscrape.com/js/";
  console.log(`No URL provided. Using default: ${targetUrl}`);
}

// Define our main scraping function (asynchronous)
async function runScraper() {
  console.log(`Starting scraper for: ${targetUrl}`);

  // --- Browser launch and scraping logic will go here ---

  console.log('Scraper finished.');
}

// Execute the main function
runScraper();

This initial code imports Puppeteer and sets up a basic structure to accept a URL from the command line (e.g., node scraper.js https://example.com) or fall back to a default if none is provided. The core scraping logic will be placed inside the runScraper async function.

Step 2: Launching the Browser and Navigating (Avoiding Blocks with Proxies)

Now, let's launch the browser within our runScraper function. Simply doing this:

// Inside runScraper()
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(targetUrl);
// ... scraping logic ...
await browser.close();

...is functional, but it's a quick way to get your IP address blocked. Websites monitor incoming requests, and numerous rapid requests from the same IP are a major red flag for scraping activity.

This is where proxies become essential. By routing your requests through different IP addresses, you make your scraper appear as multiple distinct users, significantly reducing the chances of detection and blocking.

Integrating Evomi Proxies with Puppeteer

Using a reliable proxy service is crucial for sustained scraping. Evomi provides ethically sourced residential, mobile, datacenter, and static ISP proxies designed for performance and reliability. Residential proxies are often preferred for scraping as they use real IP addresses assigned by ISPs to homeowners, making them look like genuine users.

Let's modify the browser launch configuration to use an Evomi residential proxy. You'll need your Evomi proxy credentials and the appropriate endpoint.

// Inside runScraper() before browser launch
const proxyServer = 'rp.evomi.com:1000'; // Evomi Residential HTTP endpoint example
const proxyUser = 'YOUR_EVOMI_USERNAME'; // Replace with your actual username
const proxyPass = 'YOUR_EVOMI_PASSWORD'; // Replace with your actual password

console.log(`Launching browser via proxy: ${proxyServer}`);
const browser = await puppeteer.launch({
    headless: true, // Run headless (no GUI). Set to false for debugging.
    args: [
        `--proxy-server=${proxyServer}`
        // Add other arguments if needed, e.g., '--no-sandbox' on Linux
    ]
});

const page = await browser.newPage();

// Authenticate the proxy request
await page.authenticate({
    username: proxyUser,
    password: proxyPass
});

console.log(`Navigating to ${targetUrl}...`);
try {
    // Increase timeout for potentially slower proxy connections or complex pages
    await page.goto(targetUrl, { waitUntil: 'networkidle2', timeout: 60000 });
    console.log('Page loaded successfully.');

    // --- Take a screenshot to verify ---
    await page.screenshot({ path: 'page_screenshot.png' });
    console.log('Screenshot saved as page_screenshot.png');

    // --- Add scraping logic here ---

} catch (error) {
    console.error(`Error navigating to ${targetUrl}: ${error}`);
} finally {
    console.log('Closing browser...');
    await browser.close();
}

In this enhanced code:

  • We define variables for the proxy server address/port and your credentials (remember to replace the placeholders!). We used `rp.evomi.com:1000` as an example for Evomi's residential HTTP proxies.

  • The `--proxy-server` argument is passed during `puppeteer.launch`.

  • Crucially, `page.authenticate()` is called before navigating to handle proxy authentication.

  • We added a `try...catch...finally` block for better error handling and ensure the browser always closes.

  • `waitUntil: 'networkidle2'` tells Puppeteer to wait until the network is relatively quiet, indicating the page has likely finished loading dynamic content.

  • We increased the navigation timeout to 60 seconds to accommodate potentially slower load times.

  • A simple screenshot confirms the page loaded correctly through the proxy.

Using rotating residential proxies like those from Evomi means each request (or session, depending on configuration) can originate from a different IP, drastically improving your scraper's stealthiness. Evomi also offers a free trial if you want to test the waters first!
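
A quick way to confirm that traffic is really leaving through the proxy is to load an IP-echo service with the same page object and log what it reports (a small sketch; api.ipify.org is just one example of such a service):

// Inside the try block: sanity-check which IP the outside world sees.
// api.ipify.org simply echoes back the IP address of the incoming request.
await page.goto('https://api.ipify.org?format=json', { waitUntil: 'networkidle2' });
const reportedIp = await page.evaluate(() => document.body.innerText);
console.log(`Exit IP reported by ipify: ${reportedIp}`);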

Step 3: Interacting with the Web Page

Puppeteer allows you to do virtually anything a human user can do in a browser. Let's explore common interactions:

Targeting Elements and Extracting Data

You need to tell Puppeteer which elements contain the data you want. CSS selectors are a common and effective way to do this. You can find selectors using your browser's developer tools (right-click an element -> Inspect -> right-click in the Elements panel -> Copy -> Copy selector).

Let's try extracting quotes and authors from our example site (`http://quotes.toscrape.com/js/`):

// Inside the try block, after page.goto()
console.log('Extracting quotes...');
// Use page.$$eval to select multiple elements and run code in the browser context
const quotesData = await page.$$eval('.quote', quotes => {
  // This function runs in the browser, not Node.js
  return quotes.map(quote => {
    const text = quote.querySelector('.text').innerText;
    const author = quote.querySelector('.author').innerText;
    const tags = Array.from(quote.querySelectorAll('.tag')).map(tag => tag.innerText);
    return { text, author, tags };
  });
});
console.log(`Found ${quotesData.length} quotes:`);
console.log(JSON.stringify(quotesData, null, 2)); // Pretty print the data
// Example of targeting a single element's text:
const firstAuthor = await page.$eval('.quote .author', element => element.innerText);
console.log(`First author found: ${firstAuthor}`);

Here:

  • page.$$eval(selector, pageFunction) finds all elements matching the selector (.quote) and executes the provided function within the browser's context. The function receives the found elements as an argument.

  • Inside the browser function, we use standard DOM methods (querySelector, querySelectorAll, innerText) to extract the text, author, and tags for each quote.

  • Array.from(...) converts the NodeList from querySelectorAll into an array we can map over.

  • The results are returned back to our Node.js script.

  • page.$eval(selector, pageFunction) works similarly but for a single element.

Performing User Actions

Puppeteer can simulate clicks, typing, scrolling, and more.

// --- Example: Clicking the 'Next' button (if it exists) ---
// Inside the try block
const nextButtonSelector = '.pager .next a'; // Selector for the 'Next >' link

try {
    // Check if the button exists before trying to click
    const nextButton = await page.$(nextButtonSelector);
    if (nextButton) {
        console.log('Clicking the "Next" button...');
        // Wait for navigation after click
        await Promise.all([
            page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 60000 }),
            page.click(nextButtonSelector),
        ]);
        console.log('Navigated to the next page.');
        await page.screenshot({ path: 'next_page_screenshot.png' });
        console.log('Screenshot of next page saved.');
        // You could add logic here to scrape the new page
    } else {
        console.log('No "Next" button found.');
    }
} catch (error) {
    console.error('Error clicking "Next" button or navigating:', error);
}

// --- Example: Typing into a search box (hypothetical) ---
/*
const searchBoxSelector = '#search-input'; // Replace with actual selector
const searchTerm = 'web scraping';
if (await page.$(searchBoxSelector)) {
    console.log(`Typing "${searchTerm}" into search box...`);
    await page.type(searchBoxSelector, searchTerm, { delay: 50 }); // Add slight delay
    // await page.keyboard.press('Enter'); // Simulate pressing Enter
    // await page.waitForNavigation(); // Wait for results page
}
*/

Key methods used:

  • page.click(selector): Clicks the element matching the selector.

  • page.type(selector, text, options): Types text into an input field. The delay option adds a small pause between keystrokes, making it appear more human-like.

  • page.waitForNavigation(options): Waits for the page to navigate after an action like a click. Using Promise.all is a common pattern to initiate the click and wait for navigation concurrently.

  • page.$(selector): Checks if an element exists without throwing an error if it doesn't (returns null).

  • Other useful methods include page.focus(), page.select() (for dropdowns), page.mouse (for complex mouse actions), page.keyboard (for specific key presses), and page.evaluate() (to run arbitrary JavaScript on the page, e.g., window.scrollBy(0, window.innerHeight) to scroll down).
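
Here's a short sketch of a few of those helpers in action (the selectors and coordinates are hypothetical placeholders, so those lines are commented out for you to adapt):

// Assorted interaction helpers (selectors and coordinates below are hypothetical).
// Scroll down by one viewport height:
await page.evaluate(() => window.scrollBy(0, window.innerHeight));

// Pick an option from a <select> dropdown by its value attribute:
// await page.select('#sort-order', 'price-asc');

// Press a specific key, e.g. to submit a form or dismiss a dialog:
// await page.keyboard.press('Escape');

// Move the mouse and click at specific coordinates (useful for canvas-heavy pages):
// await page.mouse.move(200, 150);
// await page.mouse.click(200, 150);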

Taking Screenshots and PDFs

We already saw basic screenshots. You can customize them further:

// Inside the try block
// Screenshot the full scrollable page
await page.screenshot({ path: 'full_page.png', fullPage: true });
console.log('Full page screenshot saved.');

// Screenshot a specific element (e.g., the first quote)
const firstQuoteElement = await page.$('.quote');
if (firstQuoteElement) {
    await firstQuoteElement.screenshot({ path: 'first_quote.png' });
    console.log('First quote screenshot saved.');
}

// Save the page as a PDF
await page.pdf({ path: 'page_export.pdf', format: 'A4' });
console.log('Page saved as PDF.');

  • page.screenshot({ path: '...', fullPage: true }) captures the entire page, scrolling if necessary.

  • You can call .screenshot() directly on an element handle obtained via page.$().

  • page.pdf({ path: '...', format: '...' }) generates a PDF document.

Step 4: Processing, Saving, and Closing

Once you've extracted the data (like our `quotesData` array), you'll typically want to process it (clean up text, format numbers) and save it. This could involve writing to a JSON file, a CSV file, or inserting it into a database.
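
As a simple example, the quotesData array could be written straight to a JSON file with Node's built-in fs module (a minimal sketch; the filename is arbitrary):

// Save the extracted quotes as JSON using Node's built-in fs module.
const fs = require('fs');

// quotesData comes from the extraction step; 'quotes.json' is an arbitrary output name.
fs.writeFileSync('quotes.json', JSON.stringify(quotesData, null, 2), 'utf-8');
console.log(`Saved ${quotesData.length} quotes to quotes.json`);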

Finally, always ensure you close the browser instance using `browser.close()` within a `finally` block to release system resources, even if errors occurred during scraping.

Here’s a more complete `scraper.js` structure incorporating these steps:

const puppeteer = require('puppeteer');

let targetUrl = process.argv[2];
if (!targetUrl) {
  targetUrl = "http://quotes.toscrape.com/js/"; // Default example site
  console.log(`No URL provided. Using default: ${targetUrl}`);
}

// --- Proxy Configuration ---
const useProxy = true; // Set to false to run without proxy
const proxyServer = 'rp.evomi.com:1000'; // Evomi Residential HTTP Example
const proxyUser = 'YOUR_EVOMI_USERNAME'; // Replace
const proxyPass = 'YOUR_EVOMI_PASSWORD'; // Replace

async function runScraper() {
  console.log(`Starting scraper for: ${targetUrl}`);
  let browser; // Defined outside the try block so finally can always close it
  let page;    // Defined outside the try block so the catch handler can screenshot it

  try {
    const launchOptions = {
      headless: true, // Use true for production, false for debugging
      args: []
    };

    if (useProxy) {
      console.log(`Launching via proxy: ${proxyServer}`);
      launchOptions.args.push(`--proxy-server=${proxyServer}`);
      // Add other args like '--no-sandbox' if needed for your environment
    } else {
      console.log('Launching without proxy.');
    }

    browser = await puppeteer.launch(launchOptions);
    page = await browser.newPage();

    // Set a realistic viewport
    await page.setViewport({ width: 1366, height: 768 });

    // Set a longer navigation timeout
    page.setDefaultNavigationTimeout(60000); // 60 seconds

    if (useProxy) {
      await page.authenticate({
        username: proxyUser,
        password: proxyPass
      });
    }

    console.log(`Navigating to ${targetUrl}...`);
    await page.goto(targetUrl, { waitUntil: 'networkidle2' });
    console.log('Page loaded successfully.');

    // --- Interaction and Extraction ---
    console.log('Extracting quotes...');
    const quotesData = await page.$$eval('.quote', quotes => {
      return quotes.map(quote => {
        const text = quote.querySelector('.text')?.innerText; // Optional chaining
        const author = quote.querySelector('.author')?.innerText;
        const tags = Array.from(quote.querySelectorAll('.tag')).map(tag => tag.innerText);
        return { text, author, tags };
      });
    });

    console.log(`Found ${quotesData.length} quotes.`);
    // In a real scenario, save this data (e.g., to a file or database)
    console.log(JSON.stringify(quotesData[0], null, 2)); // Log first quote as example

    // --- Optional Actions ---
    await page.screenshot({ path: 'final_screenshot.png' });
    console.log('Final screenshot saved.');

    // Example: Click 'Next' if available
    const nextButtonSelector = '.pager .next a';
    const nextButton = await page.$(nextButtonSelector);
    if (nextButton) {
      console.log('Found "Next" button (not clicking in this example).');
      // Add click logic here if needed for multi-page scraping
    }

  } catch (error) {
    console.error(`Scraping failed: ${error}`);
    // Capture the state of the page that failed, to help with debugging
    if (page) {
      try {
        await page.screenshot({ path: 'error_screenshot.png' });
      } catch (screenshotError) {
        console.error('Could not take error screenshot:', screenshotError);
      }
    }
  } finally {
    if (browser) {
      console.log('Closing browser...');
      await browser.close();
    }
    console.log('Scraper finished.');
  }
}

runScraper();

Wrapping Up

You've now seen how to leverage JavaScript, Node.js, and Puppeteer for effective web scraping. We covered setting up your environment, launching headless browsers, the critical importance of using proxies (like Evomi's residential options) to avoid blocks, interacting with page elements, and extracting the data you need.

Remember that successful scraping often requires adapting to specific website structures and implementing robust error handling. But with these techniques and tools, you're well-equipped to automate data collection from even complex, dynamic websites. Happy scraping!

Tackling Web Scraping with JavaScript and Node.js

Using JavaScript and Node.js for web scraping? That's a smart move. This combination lets you gather data even from websites that load content dynamically, a common hurdle for simpler scraping methods.

Web scraping itself offers a ton of advantages. Think about monitoring competitor pricing, tracking supplier stock levels, automating repetitive online tasks, or even gauging brand sentiment across the web. It's a powerful technique for gathering large datasets automatically and transforming them into actionable insights for your business or project.

When you bring JavaScript into the mix via Node.js, you gain the ability to mimic user interactions almost perfectly, all while benefiting from the speed and convenience of automation, a rich ecosystem of libraries, and reusable code.

However, navigating the world of web scraping isn't without its challenges.

The biggest one? Getting blocked. While scraping publicly available data is generally legal, most websites actively try to detect and block automated scrapers. Staying undetected is key.

Secondly, you need the right tools for the job – tools that are both efficient and dependable. With numerous libraries available, choosing the best fit, especially when starting out with JavaScript and Node.js scraping, can feel a bit overwhelming.

So, this post aims to be your practical guide to web scraping using JavaScript and Node.js. We'll cover the essentials, show you how to keep your scrapers running smoothly without getting blocked, and explore how to perform various actions like clicking buttons, handling logins, and capturing screenshots.

Let's dive in!

Getting Started: Web Scraping with Node.js

Node.js provides the runtime environment that lets you execute JavaScript code outside the confines of a web browser, right on your server or local machine. This is the foundation for our scraping tasks.

By using Node.js, you can programmatically launch and control a web browser instance. This allows you to load web pages just like a real user, interact with elements, and extract the necessary data.

Several libraries facilitate this, including Puppeteer, Playwright, and others. These tools allow you to manage "headless" browsers – browsers without a graphical user interface, controlled entirely by your code. This enables actions like button clicks, form submissions, data extraction, and screenshot generation.

Each library comes with its own set of strengths and weaknesses, often balancing ease of use against the depth of features available. Simpler libraries might be quicker to learn but lack the advanced capabilities needed for complex scenarios.

Why Choose Node.js for Scraping?

Node.js is exceptionally well-suited for web scraping tasks. Its asynchronous, event-driven architecture handles network requests efficiently, and its vast package ecosystem (npm) provides powerful libraries like Puppeteer, which significantly streamlines the process of controlling headless browsers.

Selecting Your JavaScript Scraping Toolkit

When it comes to scraping web pages with JavaScript, you have a few different approaches you could take:

  • Fetching HTML + Regex: Trying to parse HTML with regular expressions might seem straightforward initially, but it's fragile. Minor changes in website structure or dynamic content loading can easily break your scraper. It's generally not recommended for anything beyond the simplest static pages.

  • Fetching HTML + Parser Libraries: Libraries like Cheerio or JSDom parse the HTML and create a Document Object Model (DOM) structure you can navigate. This is more robust than regex but still struggles with pages heavily reliant on client-side JavaScript to render content, as it doesn't execute that JavaScript.

  • Intercepting XHR/API Requests: Sometimes, the data you need is loaded via background requests (often called XHR or Fetch requests). You can inspect these in your browser's developer tools (Network tab). If you find a request returning the data directly (often in JSON format), you might be able to fetch this data endpoint directly, potentially bypassing the need to render the full page. This is efficient but relies on finding such an endpoint and understanding its structure.

  • Using a Headless Browser: This is often the most reliable and versatile method, especially for modern websites. Libraries like Puppeteer or Playwright automate a real browser engine (like Chromium) behind the scenes. This means JavaScript is executed, content is rendered dynamically, and you can interact with the page almost exactly like a human user would – clicking, typing, scrolling, and extracting the fully rendered content.

Given its power and reliability for handling complex, dynamic websites, we'll focus on the headless browser approach in this guide, specifically using Puppeteer.

Puppeteer, developed by Google, provides a high-level API to control Chrome or Chromium over the DevTools Protocol. While other similar libraries exist (like Playwright), the fundamental concepts and workflow are quite similar. Understanding Puppeteer will give you a solid foundation applicable to other tools as well.

A Step-by-Step Approach to JavaScript Web Scraping

Performing web scraping with JavaScript generally follows a logical sequence. Think of it as instructing a robot step-by-step:

  1. Setup: Initialize Node.js and install your chosen library (Puppeteer).

  2. Launch & Navigate: Start a browser instance (ideally configured with a proxy) and open a new tab to load your target URL.

  3. Interact & Extract: Locate elements on the page, perform necessary actions (clicks, input), and extract the desired data.

  4. Process & Close: Format or process the collected data, save it, and properly close the browser instance to free up resources.

Let's break down each step using Puppeteer.

Step 1: Setting Up Your Node.js Project with Puppeteer

First things first, ensure you have Node.js installed on your system. If not, head over to the official Node.js website and download the installer for your operating system.

Once Node.js is ready, create a new folder for your project. Open your terminal or command prompt, navigate into that folder, and initialize a new Node.js project:

npm init -y

This creates a package.json file. Now, install Puppeteer:

npm

This command downloads Puppeteer and a compatible version of Chromium. Create a file named scraper.js (or any name you prefer) in your project folder. Let's start with the basic setup:

// Import the Puppeteer library
const puppeteer = require('puppeteer');

// Get URL from command line arguments, or use a default
let targetUrl = process.argv[2];

if (!targetUrl) {
  // Using a site designed for scraping practice
  targetUrl = "http://quotes.toscrape.com/js/";
  console.log(`No URL provided. Using default: ${targetUrl}`);
}

// Define our main scraping function (asynchronous)
async function runScraper() {
  console.log(`Starting scraper for: ${targetUrl}`);

  // --- Browser launch and scraping logic will go here ---

  console.log('Scraper finished.');
}

// Execute the main function
runScraper();

This initial code imports Puppeteer and sets up a basic structure to accept a URL from the command line (e.g., node scraper.js https://example.com) or fall back to a default if none is provided. The core scraping logic will be placed inside the runScraper async function.

Step 2: Launching the Browser and Navigating (Avoiding Blocks with Proxies)

Now, let's launch the browser within our runScraper function. Simply doing this:

// Inside runScraper()
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(targetUrl);
// ... scraping logic ...
await browser.close();

...is functional, but it's a quick way to get your IP address blocked. Websites monitor incoming requests, and numerous rapid requests from the same IP are a major red flag for scraping activity.

This is where proxies become essential. By routing your requests through different IP addresses, you make your scraper appear as multiple distinct users, significantly reducing the chances of detection and blocking.

Integrating Evomi Proxies with Puppeteer

Using a reliable proxy service is crucial for sustained scraping. Evomi provides ethically sourced residential, mobile, datacenter, and static ISP proxies designed for performance and reliability. Residential proxies are often preferred for scraping as they use real IP addresses assigned by ISPs to homeowners, making them look like genuine users.

Let's modify the browser launch configuration to use an Evomi residential proxy. You'll need your Evomi proxy credentials and the appropriate endpoint.

// Inside runScraper() before browser launch
const proxyServer = 'rp.evomi.com:1000'; // Evomi Residential HTTP endpoint example
const proxyUser = 'YOUR_EVOMI_USERNAME'; // Replace with your actual username
const proxyPass = 'YOUR_EVOMI_PASSWORD'; // Replace with your actual password

console.log(`Launching browser via proxy: ${proxyServer}`);
const browser = await puppeteer.launch({
    headless: true, // Run headless (no GUI). Set to false for debugging.
    args: [
        `--proxy-server=${proxyServer}`
        // Add other arguments if needed, e.g., '--no-sandbox' on Linux
    ]
});

const page = await browser.newPage();

// Authenticate the proxy request
await page.authenticate({
    username: proxyUser,
    password: proxyPass
});

console.log(`Navigating to ${targetUrl}...`);
try {
    // Increase timeout for potentially slower proxy connections or complex pages
    await page.goto(targetUrl, { waitUntil: 'networkidle2', timeout: 60000 });
    console.log('Page loaded successfully.');

    // --- Take a screenshot to verify ---
    await page.screenshot({ path: 'page_screenshot.png' });
    console.log('Screenshot saved as page_screenshot.png');

    // --- Add scraping logic here ---

} catch (error) {
    console.error(`Error navigating to ${targetUrl}: ${error}`);
} finally {
    console.log('Closing browser...');
    await browser.close();
}

In this enhanced code:

  • We define variables for the proxy server address/port and your credentials (remember to replace the placeholders!). We used `rp.evomi.com:1000` as an example for Evomi's residential HTTP proxies.

  • The `--proxy-server` argument is passed during `puppeteer.launch`.

  • Crucially, `page.authenticate()` is called before navigating to handle proxy authentication.

  • We added a `try...catch...finally` block for better error handling and ensure the browser always closes.

  • `waitUntil: 'networkidle2'` tells Puppeteer to wait until the network is relatively quiet, indicating the page has likely finished loading dynamic content.

  • We increased the navigation timeout to 60 seconds to accommodate potentially slower load times.

  • A simple screenshot confirms the page loaded correctly through the proxy.

Using rotating residential proxies like those from Evomi means each request (or session, depending on configuration) can originate from a different IP, drastically improving your scraper's stealthiness. Evomi also offers a free trial if you want to test the waters first!

Step 3: Interacting with the Web Page

Puppeteer allows you to do virtually anything a human user can do in a browser. Let's explore common interactions:

Targeting Elements and Extracting Data

You need to tell Puppeteer which elements contain the data you want. CSS selectors are a common and effective way to do this. You can find selectors using your browser's developer tools (right-click an element -> Inspect -> right-click in the Elements panel -> Copy -> Copy selector).

Let's try extracting quotes and authors from our example site (`http://quotes.toscrape.com/js/`):

// Inside the try block, after page.goto()
console.log('Extracting quotes...');
// Use page.$$eval to select multiple elements and run code in the browser context
const quotesData = await page.$$eval('.quote', quotes => {
  // This function runs in the browser, not Node.js
  return quotes.map(quote => {
    const text = quote.querySelector('.text').innerText;
    const author = quote.querySelector('.author').innerText;
    const tags = Array.from(quote.querySelectorAll('.tag')).map(tag => tag.innerText);
    return { text, author, tags };
  });
});
console.log(`Found ${quotesData.length} quotes:`);
console.log(JSON.stringify(quotesData, null, 2)); // Pretty print the data
// Example of targeting a single element's text:
const firstAuthor = await page.$eval('.quote .author', element => element.innerText);
console.log(`First author found: ${firstAuthor}`);

Here:

  • page.$$eval(selector, pageFunction) finds all elements matching the selector (.quote) and executes the provided function within the browser's context. The function receives the found elements as an argument.

  • Inside the browser function, we use standard DOM methods (querySelector, querySelectorAll, innerText) to extract the text, author, and tags for each quote.

  • Array.from(...) converts the NodeList from querySelectorAll into an array we can map over.

  • The results are returned back to our Node.js script.

  • page.$eval(selector, pageFunction) works similarly but for a single element.

Performing User Actions

Puppeteer can simulate clicks, typing, scrolling, and more.

// --- Example: Clicking the 'Next' button (if it exists) ---
// Inside the try block
const nextButtonSelector = '.pager .next a'; // Selector for the 'Next >' link

try {
    // Check if the button exists before trying to click
    const nextButton = await page.$(nextButtonSelector);
    if (nextButton) {
        console.log('Clicking the "Next" button...');
        // Wait for navigation after click
        await Promise.all([
            page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 60000 }),
            page.click(nextButtonSelector),
        ]);
        console.log('Navigated to the next page.');
        await page.screenshot({ path: 'next_page_screenshot.png' });
        console.log('Screenshot of next page saved.');
        // You could add logic here to scrape the new page
    } else {
        console.log('No "Next" button found.');
    }
} catch (error) {
    console.error('Error clicking "Next" button or navigating:', error);
}

// --- Example: Typing into a search box (hypothetical) ---
/*
const searchBoxSelector = '#search-input'; // Replace with actual selector
const searchTerm = 'web scraping';
if (await page.$(searchBoxSelector)) {
    console.log(`Typing "${searchTerm}" into search box...`);
    await page.type(searchBoxSelector, searchTerm, { delay: 50 }); // Add slight delay
    // await page.keyboard.press('Enter'); // Simulate pressing Enter
    // await page.waitForNavigation(); // Wait for results page
}
*/

Key methods used:

  • page.click(selector): Clicks the element matching the selector.

  • page.type(selector, text, options): Types text into an input field. The delay option adds a small pause between keystrokes, making it appear more human-like.

  • page.waitForNavigation(options): Waits for the page to navigate after an action like a click. Using Promise.all is a common pattern to initiate the click and wait for navigation concurrently.

  • page.$(selector): Checks if an element exists without throwing an error if it doesn't (returns null).

  • Other useful methods include page.focus(), page.select() (for dropdowns), page.mouse (for complex mouse actions), page.keyboard (for specific key presses), and page.evaluate() (to run arbitrary JavaScript on the page, e.g., window.scrollBy(0, window.innerHeight) to scroll down).

Taking Screenshots and PDFs

We already saw basic screenshots. You can customize them further:

// Inside the try block
// Screenshot the full scrollable page
await page.screenshot({ path: 'full_page.png', fullPage: true });
console.log('Full page screenshot saved.');

// Screenshot a specific element (e.g., the first quote)
const firstQuoteElement = await page.$('.quote');
if (firstQuoteElement) {
    await firstQuoteElement.screenshot({ path: 'first_quote.png' });
    console.log('First quote screenshot saved.');
}

// Save the page as a PDF
await page.pdf({ path: 'page_export.pdf', format: 'A4' });
console.log('Page saved as PDF.');
  • page.screenshot({ path: '...', fullPage: true }) captures the entire page, scrolling if necessary.

  • You can call .screenshot() directly on an element handle obtained via page.$().

  • page.pdf({ path: '...', format: '...' }) generates a PDF document.

Step 4: Processing, Saving, and Closing

Once you've extracted the data (like our `quotesData` array), you'll typically want to process it (clean up text, format numbers) and save it. This could involve writing to a JSON file, a CSV file, or inserting it into a database.

Finally, always ensure you close the browser instance using `browser.close()` within a `finally` block to release system resources, even if errors occurred during scraping.

Here’s a more complete `scraper.js` structure incorporating these steps:

const puppeteer = require('puppeteer');

let targetUrl = process.argv[2];
if (!targetUrl) {
  targetUrl = "http://quotes.toscrape.com/js/"; // Default example site
  console.log(`No URL provided. Using default: ${targetUrl}`);
}

// --- Proxy Configuration ---
const useProxy = true; // Set to false to run without proxy
const proxyServer = 'rp.evomi.com:1000'; // Evomi Residential HTTP Example
const proxyUser = 'YOUR_EVOMI_USERNAME'; // Replace
const proxyPass = 'YOUR_EVOMI_PASSWORD'; // Replace

async function runScraper() {
  console.log(`Starting scraper for: ${targetUrl}`);
  let browser; // Define browser outside try block for finally access

  try {
    const launchOptions = {
      headless: true, // Use true for production, false for debugging
      args: []
    };

    if (useProxy) {
      console.log(`Launching via proxy: ${proxyServer}`);
      launchOptions.args.push(`--proxy-server=${proxyServer}`);
      // Add other args like '--no-sandbox' if needed for your environment
    } else {
      console.log('Launching without proxy.');
    }

    browser = await puppeteer.launch(launchOptions);
    const page = await browser.newPage();

    // Set a realistic viewport
    await page.setViewport({ width: 1366, height: 768 });

    // Set a longer navigation timeout
    page.setDefaultNavigationTimeout(60000); // 60 seconds

    if (useProxy) {
      await page.authenticate({
        username: proxyUser,
        password: proxyPass
      });
    }

    console.log(`Navigating to ${targetUrl}...`);
    await page.goto(targetUrl, { waitUntil: 'networkidle2' });
    console.log('Page loaded successfully.');

    // --- Interaction and Extraction ---
    console.log('Extracting quotes...');
    const quotesData = await page.$$eval('.quote', quotes => {
      return quotes.map(quote => {
        const text = quote.querySelector('.text')?.innerText; // Optional chaining
        const author = quote.querySelector('.author')?.innerText;
        const tags = Array.from(quote.querySelectorAll('.tag')).map(tag => tag.innerText);
        return { text, author, tags };
      });
    });

    console.log(`Found ${quotesData.length} quotes.`);
    // In a real scenario, save this data (e.g., to a file or database)
    console.log(JSON.stringify(quotesData[0], null, 2)); // Log first quote as example

    // --- Optional Actions ---
    await page.screenshot({ path: 'final_screenshot.png' });
    console.log('Final screenshot saved.');

    // Example: Click 'Next' if available
    const nextButtonSelector = '.pager .next a';
    const nextButton = await page.$(nextButtonSelector);
    if (nextButton) {
      console.log('Found "Next" button (not clicking in this example).');
      // Add click logic here if needed for multi-page scraping
    }

  } catch (error) {
    console.error(`Scraping failed: ${error}`);
    // Log the error, maybe take a screenshot on error
    if (browser) {
      try {
        const errorPage = await browser.newPage();
        await errorPage.screenshot({ path: 'error_screenshot.png' });
      } catch (screenshotError) {
        console.error('Could not take error screenshot:', screenshotError);
      }
    }
  } finally {
    if (browser) {
      console.log('Closing browser...');
      await browser.close();
    }
    console.log('Scraper finished.');
  }
}

runScraper();

Wrapping Up

You've now seen how to leverage JavaScript, Node.js, and Puppeteer for effective web scraping. We covered setting up your environment, launching headless browsers, the critical importance of using proxies (like Evomi's residential options) to avoid blocks, interacting with page elements, and extracting the data you need.

Remember that successful scraping often requires adapting to specific website structures and implementing robust error handling. But with these techniques and tools, you're well-equipped to automate data collection from even complex, dynamic websites. Happy scraping!

Tackling Web Scraping with JavaScript and Node.js

Using JavaScript and Node.js for web scraping? That's a smart move. This combination lets you gather data even from websites that load content dynamically, a common hurdle for simpler scraping methods.

Web scraping itself offers a ton of advantages. Think about monitoring competitor pricing, tracking supplier stock levels, automating repetitive online tasks, or even gauging brand sentiment across the web. It's a powerful technique for gathering large datasets automatically and transforming them into actionable insights for your business or project.

When you bring JavaScript into the mix via Node.js, you gain the ability to mimic user interactions almost perfectly, all while benefiting from the speed and convenience of automation, a rich ecosystem of libraries, and reusable code.

However, navigating the world of web scraping isn't without its challenges.

The biggest one? Getting blocked. While scraping publicly available data is generally legal, most websites actively try to detect and block automated scrapers. Staying undetected is key.

Secondly, you need the right tools for the job – tools that are both efficient and dependable. With numerous libraries available, choosing the best fit, especially when starting out with JavaScript and Node.js scraping, can feel a bit overwhelming.

So, this post aims to be your practical guide to web scraping using JavaScript and Node.js. We'll cover the essentials, show you how to keep your scrapers running smoothly without getting blocked, and explore how to perform various actions like clicking buttons, handling logins, and capturing screenshots.

Let's dive in!

Getting Started: Web Scraping with Node.js

Node.js provides the runtime environment that lets you execute JavaScript code outside the confines of a web browser, right on your server or local machine. This is the foundation for our scraping tasks.

By using Node.js, you can programmatically launch and control a web browser instance. This allows you to load web pages just like a real user, interact with elements, and extract the necessary data.

Several libraries facilitate this, including Puppeteer, Playwright, and others. These tools allow you to manage "headless" browsers – browsers without a graphical user interface, controlled entirely by your code. This enables actions like button clicks, form submissions, data extraction, and screenshot generation.

Each library comes with its own set of strengths and weaknesses, often balancing ease of use against the depth of features available. Simpler libraries might be quicker to learn but lack the advanced capabilities needed for complex scenarios.

Why Choose Node.js for Scraping?

Node.js is exceptionally well-suited for web scraping tasks. Its asynchronous, event-driven architecture handles network requests efficiently, and its vast package ecosystem (npm) provides powerful libraries like Puppeteer, which significantly streamlines the process of controlling headless browsers.

Selecting Your JavaScript Scraping Toolkit

When it comes to scraping web pages with JavaScript, you have a few different approaches you could take:

  • Fetching HTML + Regex: Trying to parse HTML with regular expressions might seem straightforward initially, but it's fragile. Minor changes in website structure or dynamic content loading can easily break your scraper. It's generally not recommended for anything beyond the simplest static pages.

  • Fetching HTML + Parser Libraries: Libraries like Cheerio or JSDom parse the HTML and create a Document Object Model (DOM) structure you can navigate. This is more robust than regex but still struggles with pages heavily reliant on client-side JavaScript to render content, as it doesn't execute that JavaScript.

  • Intercepting XHR/API Requests: Sometimes, the data you need is loaded via background requests (often called XHR or Fetch requests). You can inspect these in your browser's developer tools (Network tab). If you find a request returning the data directly (often in JSON format), you might be able to fetch this data endpoint directly, potentially bypassing the need to render the full page. This is efficient but relies on finding such an endpoint and understanding its structure.

  • Using a Headless Browser: This is often the most reliable and versatile method, especially for modern websites. Libraries like Puppeteer or Playwright automate a real browser engine (like Chromium) behind the scenes. This means JavaScript is executed, content is rendered dynamically, and you can interact with the page almost exactly like a human user would – clicking, typing, scrolling, and extracting the fully rendered content.

Given its power and reliability for handling complex, dynamic websites, we'll focus on the headless browser approach in this guide, specifically using Puppeteer.

Puppeteer, developed by Google, provides a high-level API to control Chrome or Chromium over the DevTools Protocol. While other similar libraries exist (like Playwright), the fundamental concepts and workflow are quite similar. Understanding Puppeteer will give you a solid foundation applicable to other tools as well.

A Step-by-Step Approach to JavaScript Web Scraping

Performing web scraping with JavaScript generally follows a logical sequence. Think of it as instructing a robot step-by-step:

  1. Setup: Initialize Node.js and install your chosen library (Puppeteer).

  2. Launch & Navigate: Start a browser instance (ideally configured with a proxy) and open a new tab to load your target URL.

  3. Interact & Extract: Locate elements on the page, perform necessary actions (clicks, input), and extract the desired data.

  4. Process & Close: Format or process the collected data, save it, and properly close the browser instance to free up resources.

Let's break down each step using Puppeteer.

Step 1: Setting Up Your Node.js Project with Puppeteer

First things first, ensure you have Node.js installed on your system. If not, head over to the official Node.js website and download the installer for your operating system.

Once Node.js is ready, create a new folder for your project. Open your terminal or command prompt, navigate into that folder, and initialize a new Node.js project:

npm init -y

This creates a package.json file. Now, install Puppeteer:

npm

This command downloads Puppeteer and a compatible version of Chromium. Create a file named scraper.js (or any name you prefer) in your project folder. Let's start with the basic setup:

// Import the Puppeteer library
const puppeteer = require('puppeteer');

// Get URL from command line arguments, or use a default
let targetUrl = process.argv[2];

if (!targetUrl) {
  // Using a site designed for scraping practice
  targetUrl = "http://quotes.toscrape.com/js/";
  console.log(`No URL provided. Using default: ${targetUrl}`);
}

// Define our main scraping function (asynchronous)
async function runScraper() {
  console.log(`Starting scraper for: ${targetUrl}`);

  // --- Browser launch and scraping logic will go here ---

  console.log('Scraper finished.');
}

// Execute the main function
runScraper();

This initial code imports Puppeteer and sets up a basic structure to accept a URL from the command line (e.g., node scraper.js https://example.com) or fall back to a default if none is provided. The core scraping logic will be placed inside the runScraper async function.

Step 2: Launching the Browser and Navigating (Avoiding Blocks with Proxies)

Now, let's launch the browser within our runScraper function. Simply doing this:

// Inside runScraper()
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(targetUrl);
// ... scraping logic ...
await browser.close();

...is functional, but it's a quick way to get your IP address blocked. Websites monitor incoming requests, and numerous rapid requests from the same IP are a major red flag for scraping activity.

This is where proxies become essential. By routing your requests through different IP addresses, you make your scraper appear as multiple distinct users, significantly reducing the chances of detection and blocking.

Integrating Evomi Proxies with Puppeteer

Using a reliable proxy service is crucial for sustained scraping. Evomi provides ethically sourced residential, mobile, datacenter, and static ISP proxies designed for performance and reliability. Residential proxies are often preferred for scraping as they use real IP addresses assigned by ISPs to homeowners, making them look like genuine users.

Let's modify the browser launch configuration to use an Evomi residential proxy. You'll need your Evomi proxy credentials and the appropriate endpoint.

// Inside runScraper() before browser launch
const proxyServer = 'rp.evomi.com:1000'; // Evomi Residential HTTP endpoint example
const proxyUser = 'YOUR_EVOMI_USERNAME'; // Replace with your actual username
const proxyPass = 'YOUR_EVOMI_PASSWORD'; // Replace with your actual password

console.log(`Launching browser via proxy: ${proxyServer}`);
const browser = await puppeteer.launch({
    headless: true, // Run headless (no GUI). Set to false for debugging.
    args: [
        `--proxy-server=${proxyServer}`
        // Add other arguments if needed, e.g., '--no-sandbox' on Linux
    ]
});

const page = await browser.newPage();

// Authenticate the proxy request
await page.authenticate({
    username: proxyUser,
    password: proxyPass
});

console.log(`Navigating to ${targetUrl}...`);
try {
    // Increase timeout for potentially slower proxy connections or complex pages
    await page.goto(targetUrl, { waitUntil: 'networkidle2', timeout: 60000 });
    console.log('Page loaded successfully.');

    // --- Take a screenshot to verify ---
    await page.screenshot({ path: 'page_screenshot.png' });
    console.log('Screenshot saved as page_screenshot.png');

    // --- Add scraping logic here ---

} catch (error) {
    console.error(`Error navigating to ${targetUrl}: ${error}`);
} finally {
    console.log('Closing browser...');
    await browser.close();
}

In this enhanced code:

  • We define variables for the proxy server address/port and your credentials (remember to replace the placeholders!). We used `rp.evomi.com:1000` as an example for Evomi's residential HTTP proxies.

  • The `--proxy-server` argument is passed during `puppeteer.launch`.

  • Crucially, `page.authenticate()` is called before navigating to handle proxy authentication.

  • We added a `try...catch...finally` block for better error handling and ensure the browser always closes.

  • `waitUntil: 'networkidle2'` tells Puppeteer to wait until the network is relatively quiet, indicating the page has likely finished loading dynamic content.

  • We increased the navigation timeout to 60 seconds to accommodate potentially slower load times.

  • A simple screenshot confirms the page loaded correctly through the proxy.

Using rotating residential proxies like those from Evomi means each request (or session, depending on configuration) can originate from a different IP, drastically improving your scraper's stealthiness. Evomi also offers a free trial if you want to test the waters first!

Step 3: Interacting with the Web Page

Puppeteer allows you to do virtually anything a human user can do in a browser. Let's explore common interactions:

Targeting Elements and Extracting Data

You need to tell Puppeteer which elements contain the data you want. CSS selectors are a common and effective way to do this. You can find selectors using your browser's developer tools (right-click an element -> Inspect -> right-click in the Elements panel -> Copy -> Copy selector).

Let's try extracting quotes and authors from our example site (`http://quotes.toscrape.com/js/`):

// Inside the try block, after page.goto()
console.log('Extracting quotes...');
// Use page.$$eval to select multiple elements and run code in the browser context
const quotesData = await page.$$eval('.quote', quotes => {
  // This function runs in the browser, not Node.js
  return quotes.map(quote => {
    const text = quote.querySelector('.text').innerText;
    const author = quote.querySelector('.author').innerText;
    const tags = Array.from(quote.querySelectorAll('.tag')).map(tag => tag.innerText);
    return { text, author, tags };
  });
});
console.log(`Found ${quotesData.length} quotes:`);
console.log(JSON.stringify(quotesData, null, 2)); // Pretty print the data
// Example of targeting a single element's text:
const firstAuthor = await page.$eval('.quote .author', element => element.innerText);
console.log(`First author found: ${firstAuthor}`);

Here:

  • page.$$eval(selector, pageFunction) finds all elements matching the selector (.quote) and executes the provided function within the browser's context. The function receives the found elements as an argument.

  • Inside the browser function, we use standard DOM methods (querySelector, querySelectorAll, innerText) to extract the text, author, and tags for each quote.

  • Array.from(...) converts the NodeList from querySelectorAll into an array we can map over.

  • The results are returned back to our Node.js script.

  • page.$eval(selector, pageFunction) works similarly but for a single element.

Performing User Actions

Puppeteer can simulate clicks, typing, scrolling, and more.

// --- Example: Clicking the 'Next' button (if it exists) ---
// Inside the try block
const nextButtonSelector = '.pager .next a'; // Selector for the 'Next >' link

try {
    // Check if the button exists before trying to click
    const nextButton = await page.$(nextButtonSelector);
    if (nextButton) {
        console.log('Clicking the "Next" button...');
        // Wait for navigation after click
        await Promise.all([
            page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 60000 }),
            page.click(nextButtonSelector),
        ]);
        console.log('Navigated to the next page.');
        await page.screenshot({ path: 'next_page_screenshot.png' });
        console.log('Screenshot of next page saved.');
        // You could add logic here to scrape the new page
    } else {
        console.log('No "Next" button found.');
    }
} catch (error) {
    console.error('Error clicking "Next" button or navigating:', error);
}

// --- Example: Typing into a search box (hypothetical) ---
/*
const searchBoxSelector = '#search-input'; // Replace with actual selector
const searchTerm = 'web scraping';
if (await page.$(searchBoxSelector)) {
    console.log(`Typing "${searchTerm}" into search box...`);
    await page.type(searchBoxSelector, searchTerm, { delay: 50 }); // Add slight delay
    // await page.keyboard.press('Enter'); // Simulate pressing Enter
    // await page.waitForNavigation(); // Wait for results page
}
*/

Key methods used:

  • page.click(selector): Clicks the element matching the selector.

  • page.type(selector, text, options): Types text into an input field. The delay option adds a small pause between keystrokes, making it appear more human-like.

  • page.waitForNavigation(options): Waits for the page to navigate after an action like a click. Using Promise.all is a common pattern to initiate the click and wait for navigation concurrently.

  • page.$(selector): Checks if an element exists without throwing an error if it doesn't (returns null).

  • Other useful methods include page.focus(), page.select() (for dropdowns), page.mouse (for complex mouse actions), page.keyboard (for specific key presses), and page.evaluate() (to run arbitrary JavaScript on the page, e.g., window.scrollBy(0, window.innerHeight) to scroll down). A scrolling sketch follows right after this list.
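
For pages that load more content as you scroll (infinite scroll), combining page.evaluate() with a short pause is a common pattern. Here's a minimal sketch; the number of scrolls and the delay are illustrative values you would tune for your target site:

// Inside the try block: scroll down a few times to trigger lazy-loaded content
for (let i = 0; i < 5; i++) {
  await page.evaluate(() => window.scrollBy(0, window.innerHeight));
  // Give the page a moment to fetch and render the newly revealed content
  await new Promise(resolve => setTimeout(resolve, 1000));
}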

Taking Screenshots and PDFs

We already saw basic screenshots. You can customize them further:

// Inside the try block
// Screenshot the full scrollable page
await page.screenshot({ path: 'full_page.png', fullPage: true });
console.log('Full page screenshot saved.');

// Screenshot a specific element (e.g., the first quote)
const firstQuoteElement = await page.$('.quote');
if (firstQuoteElement) {
    await firstQuoteElement.screenshot({ path: 'first_quote.png' });
    console.log('First quote screenshot saved.');
}

// Save the page as a PDF
await page.pdf({ path: 'page_export.pdf', format: 'A4' });
console.log('Page saved as PDF.');

  • page.screenshot({ path: '...', fullPage: true }) captures the entire page, scrolling if necessary.

  • You can call .screenshot() directly on an element handle obtained via page.$().

  • page.pdf({ path: '...', format: '...' }) generates a PDF document.

Step 4: Processing, Saving, and Closing

Once you've extracted the data (like our `quotesData` array), you'll typically want to process it (clean up text, format numbers) and save it. This could involve writing to a JSON file, a CSV file, or inserting it into a database.
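
For instance, here's a minimal sketch that writes the `quotesData` array to a JSON file using Node's built-in fs module (the quotes.json filename is just an example):

const fs = require('fs/promises');

// After extracting quotesData inside your async scraper function
await fs.writeFile('quotes.json', JSON.stringify(quotesData, null, 2), 'utf-8');
console.log(`Saved ${quotesData.length} quotes to quotes.json`);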

Finally, always ensure you close the browser instance using `browser.close()` within a `finally` block to release system resources, even if errors occurred during scraping.

Here’s a more complete `scraper.js` structure incorporating these steps:

const puppeteer = require('puppeteer');

let targetUrl = process.argv[2];
if (!targetUrl) {
  targetUrl = "http://quotes.toscrape.com/js/"; // Default example site
  console.log(`No URL provided. Using default: ${targetUrl}`);
}

// --- Proxy Configuration ---
const useProxy = true; // Set to false to run without proxy
const proxyServer = 'rp.evomi.com:1000'; // Evomi Residential HTTP Example
const proxyUser = 'YOUR_EVOMI_USERNAME'; // Replace
const proxyPass = 'YOUR_EVOMI_PASSWORD'; // Replace

async function runScraper() {
  console.log(`Starting scraper for: ${targetUrl}`);
  let browser; // Defined outside the try block so it's reachable in finally
  let page;    // Defined outside the try block so the catch block can screenshot it

  try {
    const launchOptions = {
      headless: true, // Use true for production, false for debugging
      args: []
    };

    if (useProxy) {
      console.log(`Launching via proxy: ${proxyServer}`);
      launchOptions.args.push(`--proxy-server=${proxyServer}`);
      // Add other args like '--no-sandbox' if needed for your environment
    } else {
      console.log('Launching without proxy.');
    }

    browser = await puppeteer.launch(launchOptions);
    page = await browser.newPage();

    // Set a realistic viewport
    await page.setViewport({ width: 1366, height: 768 });

    // Set a longer navigation timeout
    page.setDefaultNavigationTimeout(60000); // 60 seconds

    if (useProxy) {
      await page.authenticate({
        username: proxyUser,
        password: proxyPass
      });
    }

    console.log(`Navigating to ${targetUrl}...`);
    await page.goto(targetUrl, { waitUntil: 'networkidle2' });
    console.log('Page loaded successfully.');

    // --- Interaction and Extraction ---
    console.log('Extracting quotes...');
    const quotesData = await page.$$eval('.quote', quotes => {
      return quotes.map(quote => {
        const text = quote.querySelector('.text')?.innerText; // Optional chaining
        const author = quote.querySelector('.author')?.innerText;
        const tags = Array.from(quote.querySelectorAll('.tag')).map(tag => tag.innerText);
        return { text, author, tags };
      });
    });

    console.log(`Found ${quotesData.length} quotes.`);
    // In a real scenario, save this data (e.g., to a file or database)
    console.log(JSON.stringify(quotesData[0], null, 2)); // Log first quote as example

    // --- Optional Actions ---
    await page.screenshot({ path: 'final_screenshot.png' });
    console.log('Final screenshot saved.');

    // Example: Click 'Next' if available
    const nextButtonSelector = '.pager .next a';
    const nextButton = await page.$(nextButtonSelector);
    if (nextButton) {
      console.log('Found "Next" button (not clicking in this example).');
      // Add click logic here if needed for multi-page scraping
    }

  } catch (error) {
    console.error(`Scraping failed: ${error}`);
    // Capture the page as it looked when the error occurred, to help with debugging
    if (page) {
      try {
        await page.screenshot({ path: 'error_screenshot.png' });
        console.log('Error screenshot saved.');
      } catch (screenshotError) {
        console.error('Could not take error screenshot:', screenshotError);
      }
    }
  } finally {
    if (browser) {
      console.log('Closing browser...');
      await browser.close();
    }
    console.log('Scraper finished.');
  }
}

runScraper();

Wrapping Up

You've now seen how to leverage JavaScript, Node.js, and Puppeteer for effective web scraping. We covered setting up your environment, launching headless browsers, the critical importance of using proxies (like Evomi's residential options) to avoid blocks, interacting with page elements, and extracting the data you need.

Remember that successful scraping often requires adapting to specific website structures and implementing robust error handling. But with these techniques and tools, you're well-equipped to automate data collection from even complex, dynamic websites. Happy scraping!

Author

Nathan Reynolds

Web Scraping & Automation Specialist

About Author

Nathan specializes in web scraping techniques, automation tools, and data-driven decision-making. He helps businesses extract valuable insights from the web using ethical and efficient scraping methods powered by advanced proxies. His expertise covers overcoming anti-bot mechanisms, optimizing proxy rotation, and ensuring compliance with data privacy regulations.
