Puppeteer Sharp, XPath & Proxies for Advanced Web Scraping

Michael Chen

Last edited on May 3, 2025

Scraping Techniques

Mastering Web Scraping with Puppeteer Sharp, XPath, and Proxies

Getting started with web scraping? Puppeteer Sharp is a seriously powerful tool in your C# arsenal. It brings the user-friendly headless browser control of Node.js's Puppeteer into the high-performance world of C#.

However, pulling data effectively from the modern web isn't always a walk in the park. You face a few key hurdles.

First off, pinpointing the exact data you need can be tricky. While various selectors exist, CSS alone can sometimes lead you down the wrong path or grab unintended elements.

Next, you've got to navigate the digital minefield of anti-scraping measures. Websites actively try to detect and block automated data collection, so staying under the radar is crucial for consistent results.

Finally, successful scraping often requires more than just grabbing static text. You need the know-how to interact with pages dynamically – filling forms, clicking buttons, executing JavaScript, and generally mimicking how a real person uses a site.

Good news: this guide will show you how to tackle these challenges. We'll combine the precision of XPath selectors with the stealth of reliable proxies, all orchestrated using Puppeteer Sharp in C#. We'll go beyond just data extraction, covering interactions, screenshots, and more.

With these techniques, you'll be equipped to build robust scrapers for various tasks, especially for dynamic, JavaScript-heavy websites.

This knowledge also pairs nicely with building programmatic websites, a topic we recently explored in depth in a four-part series. You could certainly adapt the scraping methods in part two of that series to use Puppeteer Sharp instead of other libraries.

Let's dive in!

So, What Exactly is Puppeteer Sharp?

Puppeteer Sharp is essentially a .NET port of the popular Node.js Puppeteer library. It gives you programmatic control over headless Chrome or Chromium browsers using C# code. Think of it as having a robot arm that can drive a web browser just like a human would, but automatically.

This capability makes it incredibly useful for automated testing, task automation, and, as we're focusing on here, web scraping. We'll explore how its features allow you to extract precisely the data you need from websites.

Getting Started with Puppeteer Sharp

To begin using Puppeteer Sharp, you'll need a C# development environment like Visual Studio. Start by creating a new project (e.g., a Console App or Web Application):

Creating a new project in Visual Studio

Next, navigate to Tools > NuGet Package Manager > Manage NuGet Packages for Solution..., then search for "PuppeteerSharp" in the Browse tab.

Finding PuppeteerSharp in NuGet Package Manager

Select the package, check the box next to your project, and click "Install". Accept any license agreements, and you're ready to code.
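
Alternatively, if you prefer working from the command line, the same package can be added with the .NET CLI from your project directory:

dotnet add package PuppeteerSharp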

If you're exploring alternatives, Playwright is another strong contender. You can learn more about it in our Web Scraping with Playwright and C# guide.

Leveraging XPath in Puppeteer Sharp

Puppeteer Sharp needs instructions on which elements to interact with or extract data from. These instructions come in the form of "locators". Locators can be standard CSS selectors, text content matches, layout-based selectors (e.g., finding an element to the left of another), or, crucially for precision, XPath expressions.

XPath (XML Path Language) defines a path to navigate through the structure (the DOM tree) of an HTML or XML document. While it can select multiple elements like CSS, XPath excels at uniquely identifying a single, specific element, even in complex structures where CSS might be ambiguous.

Consider a typical CSS selector:

html body div#main-content section article p.highlight span

Now, look at a comparable absolute XPath expression:

/html/body/div[2]/section[1]/article/p[3]/span[1]

The XPath explicitly defines each step down the DOM tree, including the index (like the 3rd paragraph p[3]), leading directly to the target element without ambiguity.
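
A note on semantics: XPath indices are 1-based and count positions among sibling elements of the same tag name. A few illustrative expressions (the targets here are hypothetical):

/html/body/div[2]                 selects the second div child of body
//p[3]                            selects any p that is the third p among its siblings
//a[@href='/wiki/Data_scraping']  matches a link by its attribute instead of its position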

Finding an Element's XPath

You can easily find the XPath for any element using your browser's developer tools. Right-click the element on the web page and choose "Inspect" or "Inspect Element". In the Elements panel that appears, right-click the highlighted HTML code for your element, navigate through the "Copy" submenu, and select "Copy XPath":

Copying XPath from browser developer tools

Selecting Elements with XPath in Code

With the XPath copied, you can use it in Puppeteer Sharp. Here’s a C# snippet demonstrating this:

// Requires: using PuppeteerSharp;

// Ensure Chromium is downloaded (cached after the first run)
using var browserFetcher = new BrowserFetcher();
await browserFetcher.DownloadAsync();

// Launch the browser
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();

// Navigate to a target page
await page.GoToAsync("https://en.wikipedia.org/wiki/Web_scraping");

// Select an element using XPath
var elements = await page.XPathAsync("/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/ul[1]/li[3]/a");

if (elements.Length > 0)
{
    // XPathAsync returns an array, access the first element
    var element = elements[0];
    var textHandle = await element.GetPropertyAsync("textContent");
    var text = await textHandle.JsonValueAsync<string>();

    // Output the text content
    Console.WriteLine($"Found link text: {text}");
}
else
{
    Console.WriteLine("Element not found using the specified XPath.");
}

await browser.CloseAsync();

This code initializes Puppeteer, opens a page, goes to Wikipedia, and then uses XPathAsync to find a specific link within a list. Important note: XPathAsync always returns an array (IElementHandle[]) of matching elements, even if only one is found. That's why we access the first result using elements[0] before interacting with it.

The CSS selector equivalent for targeting based on an attribute might look like this:

var element = await page
    .QuerySelectorAsync("a[href='/wiki/Data_scraping']");

Alternatives to XPath

While XPath offers precision, you have other options. CSS selectors are often simpler for less complex selections but might return multiple elements if not specific enough. You can achieve XPath-like specificity with CSS using combinations of the direct child combinator (>) and pseudo-classes like :nth-child() or :nth-of-type().

For instance, the XPath:

/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/ul[1]/li[3]/a

Could be represented in CSS (though it becomes quite verbose) like this:

html >
  body >
  div:nth-of-type(2) >
  div >
  div:nth-of-type(3) >
  main >
  div:nth-of-type(3) >
  div:nth-of-type(3) >
  div:nth-of-type(1) >
  ul:nth-of-type(1) >
  li:nth-of-type(3) >
  a

Often, a more concise CSS selector using IDs, classes, or attributes is preferable if possible, but XPath remains a reliable tool for complex DOM navigation.
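
As a quick sketch of that trade-off, here's how the same link could be grabbed with a short attribute-based selector in Puppeteer Sharp (assuming the link's href stays stable):

// Attribute-based CSS selector: shorter and less brittle than a positional path
var link = await page.QuerySelectorAsync("a[href='/wiki/Data_scraping']");
if (link != null)
{
    // Read the text content directly from the element handle
    var text = await link.EvaluateFunctionAsync<string>("el => el.textContent");
    Console.WriteLine($"Found link text: {text}");
}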

Integrating Proxies with Puppeteer Sharp

To scrape websites effectively and reliably, avoiding detection is paramount. While web scraping itself is generally legal for publicly accessible data, many websites employ mechanisms to block scrapers, often to protect server resources or maintain a competitive edge.

Websites detect scrapers based on various signals. One common check involves analyzing HTTP request headers. Basic scraping scripts might send incomplete or unusual headers. However, Puppeteer Sharp uses a real browser engine (Chromium), so its default request headers appear legitimate, mimicking actual user traffic.

A more significant detection vector is the IP address making the requests. If a site sees hundreds or thousands of requests originating from the same IP address in a short period, or accessing pages in a predictable pattern, it's a strong indicator of automated activity, often leading to a block.

This is where proxies become essential. By routing your scraping traffic through a proxy service like Evomi, you can mask your actual IP address. Using Evomi's residential proxies, for example, allows each request (or batches of requests) to originate from a different, genuine residential IP address. To the target website, it looks like distinct users browsing normally from various locations, making your scraper much harder to detect and block. Evomi prides itself on ethically sourced proxies and Swiss-based reliability, ensuring high-quality connections for your scraping tasks.

Once you have an Evomi proxy plan, you'll find your connection details (like endpoint address, port, username, and password) in your client dashboard.

Now, let's integrate this into your Puppeteer Sharp code.

As shown earlier, LaunchOptions allow you to configure the browser instance. You can specify a proxy server using the Args parameter:

// Example using Evomi's residential proxy endpoint (HTTP)
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true,
    Args = new[] { "--proxy-server=rp.evomi.com:1000" } // Replace with your specific Evomi endpoint/port
});

If your Evomi proxy plan uses username/password authentication (instead of IP whitelisting), you need to authenticate the connection for each new page:

await using var page = await browser.NewPageAsync();
await page.AuthenticateAsync(new Credentials 
{ 
    Username = "your-evomi-username", 
    Password = "your-evomi-password" 
});
// Now you can navigate
await page.GoToAsync("https://some-target-website.com"

Here’s a combined snippet showing launch with proxy and authentication, then checking the resulting IP:

// Launch browser with proxy settings
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true,
    Args = new[] { "--proxy-server=rp.evomi.com:1000" } // Use your Evomi endpoint
});

await using var page = await browser.NewPageAsync();

// Authenticate if needed
await page.AuthenticateAsync(new Credentials
{
    Username = "your-evomi-username",
    Password = "your-evomi-password"
});

// Navigate to an IP checker site
await page.GoToAsync("https://check.evomi.com/"); // Using Evomi's IP checker

// Optionally, take a screenshot to verify
await page.ScreenshotAsync("proxy_check_screenshot.png");

Console.WriteLine("Navigated via proxy. Check screenshot.");

await browser.CloseAsync();

With the proxy configured, you can proceed with your scraping logic on the page object. Remember to consider setting appropriate timeout values, especially when using proxies, to allow sufficient time for connections and page loads.

Practical Scraping Tasks with Puppeteer Sharp

Let's explore some common web scraping actions you can perform using Puppeteer Sharp, often combined with XPath for targeting elements. Remember to include your proxy configuration (as shown above) if needed for reliable scraping. While these examples use XPath, you could adapt them to use QuerySelectorAsync with CSS selectors where appropriate.

Taking Screenshots

Capturing a visual snapshot of a page or element is straightforward:

await using var page = await browser.NewPageAsync(); // Assuming browser is launched with proxy if needed
await page.GoToAsync("https://en.wikipedia.org/wiki/Web_scraping");

// Take a screenshot of the entire visible viewport
await page.ScreenshotAsync("fullpage_screenshot.png");

// Take a screenshot of a specific element (e.g., the main content area)
var contentArea = await page.XPathAsync("/html/body/div[2]/div/div[3]/main/div[3]");
if (contentArea.Length > 0)
{
    await contentArea[0].ScreenshotAsync("element_screenshot.png");
    Console.WriteLine("Element screenshot saved."

You can adjust the browser's viewport size before taking the screenshot to control the dimensions of the resulting image.
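
For example, here's a hedged sketch of setting the viewport and then capturing the full scrollable page rather than just the visible area (the dimensions are arbitrary):

// Widen the viewport before capturing
await page.SetViewportAsync(new ViewPortOptions { Width = 1920, Height = 1080 });

// FullPage captures the entire scrollable document, not just the viewport
await page.ScreenshotAsync("full_page.png", new ScreenshotOptions { FullPage = true });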

Converting HTML to PDF

Puppeteer Sharp can easily render a webpage directly into a PDF file:

await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://en.wikipedia.org/wiki/Web_scraping");

// Save the page as a PDF
await page.PdfAsync("web_scraping_wiki.pdf");

Console.WriteLine("PDF generated."

Similar to screenshots, the browser's viewport settings can influence the layout and pagination of the generated PDF.
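
For finer control than the viewport alone, PdfOptions lets you fix the paper size and orientation. A minimal sketch:

// Requires: using PuppeteerSharp.Media; (for PaperFormat)
await page.PdfAsync("web_scraping_wiki_a4.pdf", new PdfOptions
{
    Format = PaperFormat.A4,  // fixed paper size instead of viewport-driven layout
    Landscape = true,         // rotate the page orientation
    PrintBackground = true    // include CSS background colors and images
});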

Setting Timeouts

Controlling timeouts is crucial for handling slow-loading pages or network latency, especially with proxies. Use the NavigationOptions class with methods like GoToAsync:

await using var page = await browser.NewPageAsync();
// Navigate with a custom timeout (e.g., 60 seconds)
await page.GoToAsync("https://some-potentially-slow-site.com", new NavigationOptions { Timeout = 60000 }); // Timeout in milliseconds
Console.WriteLine("Page loaded or timeout reached."

The default timeout is typically 30 seconds (30000 ms). Adjust this value based on expected page load times and network conditions.
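
Rather than passing a timeout on every call, you can also change the page-wide defaults once. A brief sketch:

// Applies to GoToAsync, WaitForNavigationAsync, and similar navigation calls
page.DefaultNavigationTimeout = 60000;

// Applies to most other waiting operations (e.g., WaitForSelectorAsync)
page.DefaultTimeout = 45000;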

Filling and Submitting Forms

Interacting with forms involves locating input fields, typing text, and triggering the submission (usually by clicking a button).

Setting Input Field Values

The TypeAsync method is commonly used for entering text into input fields. Note that TypeAsync generally requires a CSS selector, not an XPath. Here's how you might type into Wikipedia's search bar (using its CSS selector):

await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://en.wikipedia.org/wiki/Main_Page");

// CSS Selector for the search input
string searchInputSelector = "#searchInput"; // Simpler selector than the long path

// Type into the search box
await page.TypeAsync(searchInputSelector, "PuppeteerSharp");

// You would then typically simulate pressing Enter or clicking the search button
// Example: Clicking the search button using its XPath
var searchButton = await page.XPathAsync("//*[@id='searchform']/div/button"); // Example XPath
if (searchButton.Length > 0)
{
    await searchButton[0].ClickAsync();
    await page.WaitForNavigationAsync(); // Wait for search results page to load
    Console.WriteLine("Form submitted."

After typing, you'd locate the submit button (using XPath or CSS selector via QuerySelectorAsync or XPathAsync) and use the ClickAsync method, often followed by WaitForNavigationAsync to ensure the next page loads.
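
An alternative to clicking the button is pressing Enter in the focused input. A minimal sketch, assuming the search box was just typed into as above; starting the navigation wait before the keypress avoids missing a fast page transition:

// Submit by pressing Enter in the focused search box.
// Task.WhenAll starts the navigation wait before the keypress,
// so a quick navigation isn't missed.
await Task.WhenAll(
    page.WaitForNavigationAsync(),
    page.Keyboard.PressAsync("Enter")
);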

Evaluating JavaScript on the Page

Puppeteer Sharp lets you execute arbitrary JavaScript code within the context of the page. This is powerful for interacting with dynamic content or extracting data generated by client-side scripts.

Here are a few examples:

await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://some-javascript-heavy-site.com"); // Example site

// Example 1: Evaluate a simple expression
var result = await page.EvaluateExpressionAsync<int>("5 * 8");
Console.WriteLine($"JavaScript evaluated: 5 * 8 = {result}");

// Example 2: Run a function and return an object
var dimensions = await page.EvaluateFunctionAsync<dynamic>(@"() => {
    return {
        width: document.documentElement.clientWidth,
        height: document.documentElement.clientHeight
    };
}");
Console.WriteLine($"Page dimensions: Width={dimensions.width}, Height={dimensions.height}");

// Example 3: Extract text content using JS
string elementSelector = "#some-dynamic-element"; // CSS selector for target element
var elementText = await page.EvaluateFunctionAsync<string>("sel => document.querySelector(sel)?.textContent", elementSelector);
Console.WriteLine($"Text from dynamic element: {elementText ?? "Not found"}");

EvaluateExpressionAsync executes a simple JavaScript expression, while EvaluateFunctionAsync runs a JS function, optionally passing arguments from your C# code. This allows you to directly access and manipulate the DOM, call existing JavaScript functions on the page, or extract complex data structures, potentially pre-processing them before returning the results to your C# application.
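
Building on that, a single EvaluateFunctionAsync call can return whole collections, avoiding one round-trip per element. A sketch that gathers all section headings (the h2 selector is an assumption about the page's markup):

// Collect every h2 heading's text in one evaluation round-trip
var headings = await page.EvaluateFunctionAsync<string[]>(@"() =>
    Array.from(document.querySelectorAll('h2')).map(h => h.textContent.trim())");

foreach (var heading in headings)
{
    Console.WriteLine(heading);
}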

Puppeteer Sharp vs. Playwright for .NET

Choosing between Puppeteer Sharp and Playwright for .NET? Honestly, both are excellent libraries for browser automation and scraping in C#. The best choice often comes down to personal preference or specific project needs.

Playwright is developed by Microsoft and generally receives more frequent updates, potentially offering slightly broader browser support (including WebKit/Safari and Firefox alongside Chromium) and arguably a more modern API design. However, Puppeteer Sharp is mature, widely used, and perfectly capable for most scraping and automation tasks, especially if you're already familiar with the original Node.js Puppeteer.

Wrapping Up

We've journeyed through using Puppeteer Sharp in C#, demonstrating how to combine it with the precision of XPath for selecting elements and the necessity of proxies (like those offered by Evomi) for reliable, large-scale web scraping. You've learned how to navigate, take screenshots, generate PDFs, handle forms, execute JavaScript, and manage timeouts.

Armed with this knowledge, you're well-equipped to build sophisticated web scrapers capable of tackling dynamic websites and integrating data into your projects, whether for programmatic SEO, market research, or countless other applications.

Happy scraping!


Author

Michael Chen

AI & Network Infrastructure Analyst

About Author

Michael bridges the gap between artificial intelligence and network security, analyzing how AI-driven technologies enhance proxy performance and security. His work focuses on AI-powered anti-detection techniques, predictive traffic routing, and how proxies integrate with machine learning applications for smarter data access.
