Scraping JavaScript Sites: Puppeteer, Node.js & Proxies





Michael Chen
Scraping Techniques
Tackling Dynamic Websites: Scraping with Puppeteer, Node.js, and Proxies
When you need to scrape data from simple, static websites, libraries like Axios or tools like Cheerio often do the trick. They are great for parsing plain HTML. However, the modern web is dynamic; many sites rely heavily on JavaScript to load and display content. Basic HTML scrapers fall short here because they can't execute this client-side code.
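To see the limitation in practice, here's a rough sketch of the static approach with Axios and Cheerio (assuming both packages are installed); on a JavaScript-heavy page it only receives the initial HTML shell, so selectors for content rendered later typically match nothing:
const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
  // Fetch the raw HTML only; no JavaScript runs on our side
  const response = await axios.get('https://www.youtube.com/');
  const $ = cheerio.load(response.data);
  // On a dynamic site this will usually report 0 matches
  console.log('Video title links found:', $('a#video-title').length);
})();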
To effectively scrape dynamic websites, you need a tool capable of running a full browser environment, executing JavaScript just like a real user's browser would.
Enter Puppeteer: a powerful Node.js library developed by Google. It provides a clean, high-level API to control Chrome or Chromium browsers programmatically.
This guide will walk you through Puppeteer, demonstrating how to build a simple scraper to extract top video results for specific keywords from YouTube.
So, What Exactly is Puppeteer?
Puppeteer is essentially a browser automation framework. While often used for automated testing of web applications, its ability to mimic browser actions makes it invaluable for web scraping tasks involving JavaScript-rendered content.
It operates using the Chrome DevTools Protocol, giving you programmatic access to the inner workings of the browser.
With Puppeteer, you can automate nearly anything you'd do manually in a browser: navigating pages, clicking buttons, filling out forms, scrolling, typing text, and even executing custom JavaScript snippets within the page's context.
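As a quick taste of what that looks like in code, here's a mini-sketch of a few such actions (assuming page is an already-open Puppeteer page and the selectors are placeholders):
await page.click('#some-button'); // click an element
await page.type('#some-input', 'hello world', { delay: 50 }); // type like a user
await page.keyboard.press('Enter'); // send a key press
await page.evaluate(() => window.scrollBy(0, 1000)); // scroll via in-page JavaScript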
By default, Puppeteer uses Chromium (the open-source base for Google Chrome), but it's flexible enough to be configured with full Chrome or even Firefox (though support might vary).
Tutorial: Scraping YouTube Search Results with Puppeteer
In this hands-on section, we'll build a Node.js script using Puppeteer. The goal is to take a search term, query YouTube, and extract the titles and URLs of the top video results.
Later, we'll enhance the script by integrating a proxy service to help manage our digital footprint during scraping.
Setting Up Your Environment
Before we start coding, ensure you have Node.js installed. If not, head over to the official Node.js website for download and installation instructions.
First, create a dedicated directory for our project, navigate into it, and initialize a new Node.js project using npm (Node Package Manager):
mkdir evomi-puppeteer-demo
cd evomi-puppeteer-demo
npm init -y
Next, install the core Puppeteer package. This command also downloads a compatible version of Chromium for Puppeteer to control:
npm install puppeteer
Great! Now create a file named scraper.js (or any name you prefer) and open it in your text editor. Let's start coding.
Launching Puppeteer and Opening a Page
Let's begin with a fundamental Puppeteer script:
const puppeteer = require('puppeteer');
(async () => {
// Launch the browser
const browserInstance = await puppeteer.launch({
headless: false, // Show the browser window
defaultViewport: null // Use the browser's default viewport size
});
// Open a new tab
const page = await browserInstance.newPage();
// Navigate to YouTube
await page.goto('https://www.youtube.com/', { waitUntil: 'networkidle2' });
console.log('YouTube loaded!');
// We'll add more logic here later...
// await browserInstance.close(); // Keep it open for now
})();
This script initializes Puppeteer, launches a visible browser instance, opens a new page (tab), and navigates to YouTube's homepage.
The headless: false option is crucial during development. It lets you visually observe the browser performing the automated actions, making debugging much easier. For production or unattended scripts, setting this to true (or omitting it, as true is the default) conserves server resources by not rendering the UI. The waitUntil: 'networkidle2' option tells Puppeteer to consider navigation successful when there are no more than 2 network connections for at least 500 ms, which is often a good indicator that the main content has loaded.
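For reference, 'networkidle2' is just one of the load heuristics Puppeteer accepts; the same option works in page.waitForNavigation. A quick sketch of the alternatives (same page object as above):
// 'load'             - wait for the load event
// 'domcontentloaded' - wait for the DOMContentLoaded event
// 'networkidle0'     - no network connections for at least 500 ms
// 'networkidle2'     - no more than 2 network connections for at least 500 ms
await page.goto('https://www.youtube.com/', { waitUntil: 'domcontentloaded' });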
To execute this script, save your changes to scraper.js and run it from your terminal:
node scraper.js
You should see a Chromium browser window pop up and load YouTube.
If you encounter issues launching the browser, the official Puppeteer Troubleshooting guide is an excellent resource.
Next, we'll automate interaction with the page elements.
Handling the Cookie Consent Banner
Most websites, including YouTube, present a cookie consent banner on the first visit. We need to accept it to proceed.
Puppeteer allows element selection using standard CSS selectors and also supports XPath selectors, which are sometimes more convenient for finding elements based on their text content.
Let's use an XPath selector to find the button containing the text "Accept all". The page.waitForXPath method waits for the element to appear in the DOM. (Note: recent Puppeteer releases deprecate waitForXPath; if you're on a newer version, page.waitForSelector('xpath/...') is the equivalent.)
// Wait for the cookie consent button using XPath and click it
try {
const acceptButtonXPath = '//span[contains(text(), "Accept all")]/ancestor::button';
const cookieAcceptButton = await page.waitForXPath(acceptButtonXPath, { timeout: 5000 }); // Wait max 5 seconds
if (cookieAcceptButton) {
await cookieAcceptButton.click();
console.log('Accepted cookies.');
// Wait for potential page reload or navigation after accepting
await page.waitForNavigation({ waitUntil: 'networkidle0', timeout: 10000 });
console.log('Page reloaded after cookie acceptance.');
} else {
console.log('Cookie banner accept button not found or timed out.');
}
} catch (error) {
console.log('Cookie banner not found or already accepted.', error.message);
// Continue script execution even if the banner isn't found
}
This snippet looks for a button associated with the "Accept all" text. We wrap it in a try-catch block because the banner might not always appear (e.g., on subsequent runs). After clicking, we use page.waitForNavigation to pause the script until the page finishes loading again, as clicking the consent button often triggers a reload or state change. The waitUntil: 'networkidle0' option waits until there have been no network connections for at least 500 ms.
Here’s the updated script incorporating the cookie handling:
const puppeteer = require('puppeteer');
(async () => {
const browserInstance = await puppeteer.launch({
headless: false,
defaultViewport: null
});
const page = await browserInstance.newPage();
await page.goto('https://www.youtube.com/', {
waitUntil: 'networkidle2'
});
console.log('YouTube loaded!');
// Handle Cookie Banner
try {
const acceptButtonXPath = '//span[contains(text(), "Accept all")]/ancestor::button';
const cookieAcceptButton = await page.waitForXPath(acceptButtonXPath, {
timeout: 5000
});
if (cookieAcceptButton) {
await cookieAcceptButton.click();
console.log('Accepted cookies.');
await page.waitForNavigation({
waitUntil: 'networkidle0',
timeout: 10000
});
console.log('Page reloaded after cookie acceptance.');
} else {
console.log('Cookie banner accept button not found or timed out.');
}
} catch (error) {
console.log('Cookie banner not found or already accepted.', error.message);
}
console.log('Ready for next steps...');
// await browserInstance.close();
})();
Automating the Search
With the cookie banner out of the way, let's focus on the search bar. We need to locate it, type our search query, and then click the search button.
We can use CSS selectors here. YouTube's search input often has an ID like search within a specific form or container. We'll use page.waitForSelector to ensure the element exists before interacting with it.
// Find and interact with the search bar
const searchInputSelector = 'input#search';
const searchButtonSelector = 'button#search-icon-legacy';
const searchQuery = 'web scraping best practices'; // Our search term
try {
await page.waitForSelector(searchInputSelector, { visible: true, timeout: 10000 });
await page.type(searchInputSelector, searchQuery, { delay: 50 }); // Type slowly like a human
console.log(`Typed "${searchQuery}" into search bar.`);
await page.waitForSelector(searchButtonSelector, { visible: true });
await page.click(searchButtonSelector);
console.log('Clicked search button.');
// Wait for search results page to load
await page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 15000 });
console.log('Search results page loaded.');
} catch (error) {
console.error('Error during search:', error.message);
await browserInstance.close();
return; // Stop script if search fails
}
This code waits for the search input (input#search) to be visible, types our query ("web scraping best practices") with a slight delay between keystrokes (making it look less robotic), waits for the search button (button#search-icon-legacy), clicks it, and finally waits for the results page to load using waitForNavigation again.
Let's integrate this into our script:
const puppeteer = require('puppeteer');
(async () => {
const browserInstance = await puppeteer.launch({
headless: false,
defaultViewport: null
});
const page = await browserInstance.newPage();
await page.goto('https://www.youtube.com/', { waitUntil: 'networkidle2' });
console.log('YouTube loaded!');
// Handle Cookie Banner (same code as before)
try {
const acceptButtonXPath = '//span[contains(text(), "Accept all")]/ancestor::button';
const cookieAcceptButton = await page.waitForXPath(acceptButtonXPath, { timeout: 5000 });
if (cookieAcceptButton) {
await cookieAcceptButton.click();
console.log('Accepted cookies.');
await page.waitForNavigation({ waitUntil: 'networkidle0', timeout: 10000 });
console.log('Page reloaded after cookie acceptance.');
} else {
console.log('Cookie banner accept button not found or timed out.');
}
} catch (error) {
console.log('Cookie banner not found or already accepted.', error.message);
}
// Perform Search
const searchInputSelector = 'input#search';
const searchButtonSelector = 'button#search-icon-legacy';
const searchQuery = 'web scraping best practices';
try {
await page.waitForSelector(searchInputSelector, { visible: true, timeout: 10000 });
await page.type(searchInputSelector, searchQuery, { delay: 50 });
console.log(`Typed "${searchQuery}" into search bar.`);
await page.waitForSelector(searchButtonSelector, { visible: true });
await page.click(searchButtonSelector);
console.log('Clicked search button.');
await page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 15000 });
console.log('Search results page loaded.');
} catch (error) {
console.error('Error during search:', error.message);
await browserInstance.close();
return;
}
console.log('Ready to scrape results...');
// await browserInstance.close();
})();
Running this should now navigate to YouTube, accept cookies (if present), perform the search, and land you on the results page.
Extracting Video Information
Now for the core scraping logic. We need to identify the video entries on the results page and extract their titles and links.
The page.evaluate() method is perfect for this. It allows us to run JavaScript code within the context of the browser page, effectively giving us access to the page's DOM as if we were using the browser's developer console.
Here's a function using page.evaluate() to grab the video titles and links. The selectors target specific elements commonly used by YouTube for video results.
// Scrape video data
const videoData = await page.evaluate(() => {
const results = [];
// Selector for video links/titles - targets the title link within a result item
const videoElements = document.querySelectorAll('ytd-video-renderer h3 a#video-title');
videoElements.forEach(element => {
const title = element.innerText.trim();
const link = element.href;
if (title && link) { // Ensure we have both title and link
results.push({ title, link });
}
});
return results.slice(0, 5); // Return top 5 results
});
console.log('Scraped Video Data:');
console.log(videoData);
await browserInstance.close(); // Close the browser now
Inside page.evaluate:
- We initialize an empty array, results.
- We use document.querySelectorAll with the selector ytd-video-renderer h3 a#video-title. This targets the anchor (a) tag with the ID video-title, which typically sits inside an h3 tag within the main container (ytd-video-renderer) for each video result. This is generally more specific than just selecting all h3 tags.
- We iterate through the found elements (videoElements).
- For each element, we extract its visible text content (innerText) and trim whitespace to get the title.
- We read the href attribute for the video link.
- We create an object with the title and link and push it to our results array.
- Finally, we return the first 5 results using slice(0, 5). The value returned from the function inside evaluate is assigned to the videoData variable in our Node.js script.
After extracting the data, we log it to the console and close the browser instance using browserInstance.close().
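A side note: if you'd rather not hard-code the result limit inside the browser context, page.evaluate accepts extra arguments that are serialized and passed to the page function. A minimal sketch (maxResults is a hypothetical variable defined in the Node.js script):
const maxResults = 5; // lives in Node.js, not in the page
const videoData = await page.evaluate((limit) => {
  const results = [];
  document.querySelectorAll('ytd-video-renderer h3 a#video-title').forEach(element => {
    const title = element.innerText.trim();
    const link = element.href;
    if (title && link) {
      results.push({ title, link });
    }
  });
  return results.slice(0, limit); // use the value passed in from Node.js
}, maxResults);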
Here is the complete script:
const puppeteer = require('puppeteer');
(async () => {
console.log('Launching browser...');
const browserInstance = await puppeteer.launch({
headless: false, // Keep false for debugging, set true for production
defaultViewport: null
});
const page = await browserInstance.newPage();
console.log('Navigating to YouTube...');
await page.goto('https://www.youtube.com/', { waitUntil: 'networkidle2' });
console.log('YouTube loaded!');
// Handle Cookie Banner
try {
const acceptButtonXPath = '//span[contains(text(), "Accept all")]/ancestor::button';
const cookieAcceptButton = await page.waitForXPath(acceptButtonXPath, { timeout: 5000 });
if (cookieAcceptButton) {
await cookieAcceptButton.click();
console.log('Accepted cookies.');
await page.waitForNavigation({ waitUntil: 'networkidle0', timeout: 10000 });
console.log('Page reloaded after cookie acceptance.');
} else {
console.log('Cookie banner accept button not found or timed out.');
}
} catch (error) {
console.log('Cookie banner not found or already accepted.', error.message);
}
// Perform Search
const searchInputSelector = 'input#search';
const searchButtonSelector = 'button#search-icon-legacy';
const searchQuery = 'web scraping best practices';
try {
console.log('Waiting for search input...');
await page.waitForSelector(searchInputSelector, { visible: true, timeout: 10000 });
console.log(`Typing "${searchQuery}"...`);
await page.type(searchInputSelector, searchQuery, { delay: 50 });
console.log('Waiting for search button...');
await page.waitForSelector(searchButtonSelector, { visible: true });
console.log('Clicking search button...');
await page.click(searchButtonSelector);
console.log('Waiting for search results page navigation...');
await page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 15000 });
console.log('Search results page loaded.');
} catch (error) {
console.error('Error during search:', error.message);
await browserInstance.close();
return;
}
// Scrape video data
console.log('Scraping video results...');
const videoData = await page.evaluate(() => {
const results = [];
const videoElements = document.querySelectorAll('ytd-video-renderer h3 a#video-title');
videoElements.forEach(element => {
const title = element.innerText.trim();
const link = element.href;
if (title && link) {
results.push({ title, link });
}
});
return results.slice(0, 5); // Return top 5 results
});
console.log('--- Scraped Video Data ---');
console.log(videoData);
console.log('--------------------------');
console.log('Closing browser...');
await browserInstance.close();
console.log('Script finished.');
})();
Running this script should now print an array of objects, each containing the title and link of the top search results, similar to this (titles/links will vary):
// --- Scraped Video Data ---
[
{
title: 'Web Scraping Tutorial For Beginners | Scrape Anything!',
link: 'https://www.youtube.com/watch?v=someVideoId1'
},
{
title: 'Ethical Web Scraping: Best Practices & Legal Considerations',
link: 'https://www.youtube.com/watch?v=someVideoId2'
},
{
title: 'How to Avoid Getting Blocked While Web Scraping',
link: 'https://www.youtube.com/watch?v=someVideoId3'
},
{
title: 'Advanced Web Scraping Techniques in Python',
link: 'https://www.youtube.com/watch?v=someVideoId4'
},
{
title: 'Web Scraping with Node.js and Puppeteer - Full Course',
link: 'https://www.youtube.com/watch?v=someVideoId5'
}
]
// --------------------------
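If you'd like to keep these results instead of just printing them, one simple option is to write them to disk with Node's built-in fs module before closing the browser; a small sketch (the results.json filename is arbitrary):
const fs = require('fs');
// Persist the scraped data as pretty-printed JSON
fs.writeFileSync('results.json', JSON.stringify(videoData, null, 2));
console.log('Saved results to results.json');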
Integrating Proxies for Reliable Scraping
While occasional light scraping might go unnoticed, performing extensive or frequent scraping from a single IP address is a surefire way to get rate-limited or even permanently blocked by the target website. Imagine losing access to YouTube entirely from your home network!
This is where proxies come in. A proxy acts as an intermediary between your script and the website. Your requests are routed through the proxy server, masking your real IP address. If the proxy IP gets flagged or blocked, your own IP remains safe.
Using a pool of proxies, especially residential or mobile ones, further enhances reliability and stealth. Services like Evomi offer access to vast pools of ethically sourced IPs. By rotating through different IPs (often automatically handled by the proxy service endpoint), your scraping activity appears as traffic from many different, legitimate users rather than a single automated bot.
For scraping dynamic sites like YouTube, Evomi's Residential Proxies are an excellent choice, offering IPs from real user devices worldwide, making requests look highly authentic. They start at a competitive price point of just $0.49 per GB.
Configuring Puppeteer to use a proxy is straightforward. You'll need your proxy provider's details: the server address (endpoint) and port, plus username/password credentials if authentication is required.
Let's modify the puppeteer.launch() options to include the proxy server details via the args array:
// Example using Evomi Residential Proxy endpoint (replace with your details)
const proxyServer = 'rp.evomi.com:1000'; // Example: HTTP endpoint for Evomi Residential
const proxyUsername = 'your-evomi-username'; // Replace with your Evomi username
const proxyPassword = 'your-evomi-password'; // Replace with your Evomi password
console.log(`Launching browser with proxy: ${proxyServer}...`);
const browserInstance = await puppeteer.launch({
headless: false, // Keep false for debugging
defaultViewport: null,
args: [
`--proxy-server=${proxyServer}`
]
});
Replace rp.evomi.com:1000, your-evomi-username, and your-evomi-password with your actual Evomi credentials and the appropriate endpoint/port (e.g., 1001 for HTTPS, 1002 for SOCKS5 with residential proxies).
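Rather than hard-coding credentials in the script, it's generally safer to read them from environment variables; a minimal sketch (the variable names below are just an example convention, not something the proxy service requires):
// Read proxy settings from the environment, with placeholders as fallbacks
const proxyServer = process.env.PROXY_SERVER || 'rp.evomi.com:1000';
const proxyUsername = process.env.PROXY_USERNAME || 'your-evomi-username';
const proxyPassword = process.env.PROXY_PASSWORD || 'your-evomi-password';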
Since most quality proxy services require authentication, we need to tell Puppeteer how to authenticate with the proxy. This is done using the page.authenticate() method, called right after creating the new page:
const page = await browserInstance.newPage();
// Authenticate with the proxy server
await page.authenticate({
username: proxyUsername,
password: proxyPassword
});
console.log('Proxy authentication set.');
Here's how the initial part of the script looks with proxy integration:
const puppeteer = require('puppeteer');
(async () => {
// --- Proxy Configuration ---
// Example using Evomi Residential Proxy endpoint (replace with your details)
const proxyServer = 'rp.evomi.com:1000'; // Example: HTTP endpoint for Evomi Residential
const proxyUsername = 'your-evomi-username'; // Replace with your Evomi username
const proxyPassword = 'your-evomi-password'; // Replace with your Evomi password
// -------------------------
console.log(`Launching browser with proxy: ${proxyServer}...`);
const browserInstance = await puppeteer.launch({
headless: false, // Keep false for debugging, set true for production
defaultViewport: null,
args: [
`--proxy-server=${proxyServer}`
]
});
const page = await browserInstance.newPage();
// Authenticate with the proxy server
await page.authenticate({
username: proxyUsername,
password: proxyPassword
});
console.log('Proxy authentication set.');
console.log('Navigating to YouTube via proxy...');
await page.goto('https://www.youtube.com/', { waitUntil: 'networkidle2' });
console.log('YouTube loaded via proxy!');
// ... (rest of the script: cookie handling, search, scraping) ...
console.log('Closing browser...');
await browserInstance.close();
console.log('Script finished.');
})();
Now, when you run the script, all traffic to YouTube will be routed through your configured Evomi proxy, significantly reducing the risk of your personal IP being flagged or blocked.
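To confirm the proxy is actually in use, you can point the page at an IP-echo service before heading to YouTube; a quick sketch (api.ipify.org is used here only as an example endpoint, and page is the proxied page from the script above):
// Check which IP target sites will see
await page.goto('https://api.ipify.org', { waitUntil: 'networkidle2' });
const exitIp = await page.evaluate(() => document.body.innerText);
console.log('Current exit IP:', exitIp);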
Wrapping Up
In this guide, we explored Puppeteer, a versatile library for browser automation that shines when scraping JavaScript-heavy websites. You learned how to launch Puppeteer, navigate pages, interact with elements like buttons and input fields, execute JavaScript within the page context to extract data, and critically, how to integrate proxies like those from Evomi to protect your IP and improve scraping reliability.
Puppeteer opens up possibilities for interacting with almost any website, no matter how dynamic. Try applying these techniques to other complex sites to further hone your skills!

Author
Michael Chen
AI & Network Infrastructure Analyst
About Author
Michael bridges the gap between artificial intelligence and network security, analyzing how AI-driven technologies enhance proxy performance and security. His work focuses on AI-powered anti-detection techniques, predictive traffic routing, and how proxies integrate with machine learning applications for smarter data access.