Rust Web Scraping in 2025: Steps, Tools & Proxies





David Foster
Scraping Techniques
Diving into Rust for Web Scraping in 2025
Python often gets the spotlight when web scraping comes up, but truth be told, many languages are perfectly capable of pulling data from the web. If a language can fetch HTML and make sense of its structure, you're generally good to go. Rust, a language celebrated for its performance and safety, might not be the first choice that springs to mind for scraping, often being pigeonholed into backend system tasks.
However, don't let that fool you. Scraping data with Rust can be surprisingly straightforward, much like using Python. The libraries available echo the design philosophies found in Python counterparts, leading to code that's quite readable and efficient.
This guide will walk you through the process of web scraping using Rust. We'll specifically use two popular Rust crates (that's Rust lingo for libraries): Reqwest for making HTTP requests and Scraper for parsing HTML. Our goal? To extract the top posts and their scores from the popular tech news site, Hacker News.
Setting Up Your Rust Scraping Environment
Before we write any Rust code, we need to ensure Rust is installed. If you haven't got it set up yet, you can grab the installer from the official Rust website.
With Rust ready, let's create a new project. Open your terminal or command prompt and run `cargo new rust_hacker_news_scraper`, then open the newly created `rust_hacker_news_scraper` directory in your preferred code editor.
Our project relies on three external crates:
Reqwest: Handles the task of sending HTTP requests to fetch web pages.
Scraper: Helps us parse the downloaded HTML and extract specific data.
Tokio: An asynchronous runtime for Rust; Reqwest leans on this for handling network operations efficiently.
We need to tell Rust's package manager, Cargo, about these dependencies. Open the `Cargo.toml` file in your project's root directory and add the following lines under the `[dependencies]` section:
[dependencies]
reqwest = "0.11"
scraper = "0.13.0"
tokio = { version = "1.22.0", features = ["full"] }
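If you prefer the command line, newer Cargo releases can add these dependencies for you. The commands below are an optional alternative to editing `Cargo.toml` by hand (the resolved versions may be newer than the ones pinned above):
cargo add reqwest@0.11
cargo add scraper@0.13
cargo add tokio@1 --features full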
Next, open the `src/main.rs` file. Replace the default "Hello, world!" code with this basic structure for an asynchronous Rust program using Tokio:
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Our scraping logic will go here
println!("Starting the Hacker News scraper...");
// Placeholder for future code
println!("Scraping finished.");
Ok(())
}
This setup gives us an asynchronous `main` function, necessary because Reqwest operates asynchronously by default, allowing our program to handle network waits without blocking everything else. With the groundwork laid, let's fetch some HTML.
Fetching Web Content with Reqwest
Reqwest is a robust and ergonomic HTTP client library for Rust, analogous to the popular Requests library in Python. It simplifies sending various HTTP requests like GET and POST.
First, we create a Reqwest client. This client instance will manage connections and settings for our requests.
let client = reqwest::Client::builder()
.build()?;
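The builder also accepts a few settings worth knowing about before calling `build()`. Here is a small optional sketch; the user-agent string and the 10-second timeout are arbitrary choices for illustration, not values the rest of this guide depends on:
let client = reqwest::Client::builder()
    // Send an explicit User-Agent header; the exact string is up to you
    .user_agent("rust-hn-scraper/0.1")
    // Abort requests that take longer than 10 seconds
    .timeout(std::time::Duration::from_secs(10))
    .build()?;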
Now, using this client, we can send a GET request to Hacker News. The following code fetches the HTML content of the front page and stores it as text in the `html_content` variable.
let url = "https://news.ycombinator.com/";
println!("Fetching page: {}", url);
let html_content = client
.get(url)
.send()
.await?
.text()
.await?;
println!("Page fetched successfully. Length: {} bytes", html_content.len());
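If you'd like the program to fail fast on non-success responses (for example a 403 from an anti-bot rule), Reqwest can turn HTTP error statuses into Rust errors. An optional variation on the fetch above:
let response = client.get(url).send().await?;
// error_for_status() converts 4xx/5xx status codes into an Err
let html_content = response.error_for_status()?.text().await?;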
And that's essentially it for retrieving a webpage's content in Rust! Let's see the `main.rs` file so far:
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Starting the Hacker News scraper...");
let client = reqwest::Client::builder().build()?;
let url = "https://news.ycombinator.com/";
println!("Fetching page: {}", url);
let html_content = client
.get(url)
.send()
.await?
.text()
.await?;
println!(
"Page fetched successfully. Length: {} bytes",
html_content.len()
);
// Parsing logic will come next
println!("Scraping finished.");
Ok(())
}
With the HTML in hand, our next step is to parse it and extract the data we need.
Extracting Data with Scraper
The Scraper library is our tool for navigating and extracting information from the HTML structure, similar to how BeautifulSoup works in Python. It uses CSS selectors to pinpoint specific elements.
First, parse the fetched HTML string into a Scraper document object:
let document = scraper::Html::parse_document(&html_content);
Now, we can query this `document` using CSS selectors. Selectors are patterns that match elements in an HTML document.
If you inspect the Hacker News front page HTML, you'll notice that post titles are located within an anchor (`<a>`) tag, which itself is inside a `<span>` tag with the class `titleline`. The CSS selector `span.titleline > a` targets exactly these anchor tags.
Let's create this selector in Rust using `scraper::Selector::parse`:
let title_selector = scraper::Selector::parse("span.titleline > a")
.expect("Failed to parse title selector");
The `parse` function returns a `Result` because the input string might not be a valid CSS selector. We use `expect` here for simplicity: it unwraps the successful result, or panics with the provided message if parsing fails. In production code you might handle this error more gracefully.
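If you'd rather avoid the panic entirely, one option is to convert the parse error into a string so it flows through the `?` operator and `main`'s `Box<dyn std::error::Error>` return type. A minimal sketch, using the same selector as above:
let title_selector = scraper::Selector::parse("span.titleline > a")
    .map_err(|e| format!("failed to parse title selector: {e:?}"))?;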
With the selector ready, we apply it to our parsed document and iterate through the matched elements, extracting the text content (the title) from each:
let titles = document
.select(&title_selector)
.map(|element| element.inner_html());
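The same element handle also exposes attributes, so if you wanted each post's URL as well, you could read the anchor's `href`. This is a side sketch and not part of the final program below:
let titles_and_links = document
    .select(&title_selector)
    .map(|element| {
        let title = element.inner_html();
        // attr() returns Option<&str>; treat a missing href as an empty string
        let link = element.value().attr("href").unwrap_or("").to_string();
        (title, link)
    });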
Extracting the scores is slightly trickier because some items on Hacker News (like job postings) don't have scores. We need to handle this possibility.
First, we find a common parent element for each post's metadata (score, author, time). On Hacker News this is a table cell (`<td>`) with the class `subtext`.
let subtext_selector = scraper::Selector::parse("td.subtext").expect("Failed to parse subtext selector");
let subtexts = document.select(&subtext_selector);
Next, within each `subtext` element, we look for the score, which is usually inside a `<span>` with the class `score`. We need a selector for that too:
let score_selector = scraper::Selector::parse("span.score")
.expect("Failed to parse score selector");
Now, we iterate through the `subtexts`. For each one, we try to find the score element. If a score element exists, we extract its text. If it doesn't (e.g., for a job post), we provide a default value like "0 points".
let scores = subtexts.map(|subtext| {
subtext
.select(&score_selector) // Try to find the score span within this subtext
.next() // Get the first match (there should only be one)
.map(|score| score.text().collect::<String>()) // If found, get its text
.unwrap_or_else(|| "0 points".to_string()) // Otherwise, use "0 points"
});
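If you'd rather work with numbers than strings like "123 points", you can parse the leading integer out of each score. A small optional follow-up (it consumes the `scores` iterator, so you'd use either this numeric form or the text form, not both):
let numeric_scores = scores.map(|score| {
    // "123 points" -> 123; the "0 points" default (or anything unparsable) becomes 0
    score
        .split_whitespace()
        .next()
        .and_then(|n| n.parse::<u32>().ok())
        .unwrap_or(0)
});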
We've now got iterators for both titles and scores! Let's combine them and print the pairs.
println!("\n--- Hacker News Top Posts ---");
titles.zip(scores).for_each(|(title, score)| {
println!("Title: {} - Score: {}", title, score);
});
println!("---------------------------\n");
Complete Rust Scraper Code
Here is the full code for `src/main.rs`:
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
println!("Starting the Hacker News scraper...");
let client = reqwest::Client::builder().build()?;
let url = "https://news.ycombinator.com/";
println!("Fetching page: {}", url);
let html_content = client
.get(url)
.send()
.await?
.text()
.await?;
println!("Page fetched successfully. Length: {} bytes", html_content.len());
let document = scraper::Html::parse_document(&html_content);
// Selector for post titles
let title_selector =
scraper::Selector::parse("span.titleline > a").expect("Failed to parse title selector");
let titles = document
.select(&title_selector)
.map(|element| element.inner_html());
// Selectors for metadata and scores
let subtext_selector =
scraper::Selector::parse("td.subtext").expect("Failed to parse subtext selector");
let score_selector =
scraper::Selector::parse("span.score").expect("Failed to parse score selector");
let subtexts = document.select(&subtext_selector);
// Extract scores, providing a default if missing
let scores = subtexts.map(|subtext| {
subtext
.select(&score_selector)
.next()
.map(|score| score.text().collect::<String>())
.unwrap_or_else(|| "0 points".to_string()) // Provide default score
});
println!("\n--- Hacker News Top Posts ---");
// Combine titles and scores and print them
titles.zip(scores).for_each(|(title, score)| {
// Basic cleaning of potential HTML entities in titles for cleaner output
let clean_title = title.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">");
println!("Title: {} - Score: {}", clean_title, score);
});
println!("---------------------------\n");
println!("Scraping finished.");
Ok(())
}
You can run this program from your terminal using `cargo run`. It should fetch the Hacker News front page and print a list of titles paired with their scores, looking something like this (titles and scores will vary):
Title: Show HN: I built a tool to visualize Git history - Score: 255 points
Title: The Quiet Crisis Engulfing Stanford University - Score: 310 points
Title: Why RAM is Getting Faster (And Why It Matters) - Score: 198 points
Title: Ask HN: What are your favorite niche blogs? - Score: 402 points
Title: Understanding B-Trees: The Database Indexing Workhorse - Score: 150 points
Integrating Proxies with Reqwest
When scraping websites frequently, your IP address can get flagged and potentially blocked. Websites employ various measures to detect and stop automated scraping activities. Using a proxy server is a common technique to mitigate this risk.
A proxy acts as an intermediary, routing your request through its own IP address, effectively masking your original IP from the target website. This is crucial for larger or more frequent scraping tasks.
For effective scraping, especially at scale, using reliable proxies is key. Services like Evomi offer various proxy types, including residential proxies which are particularly useful as they originate from real user devices, making them harder to detect and block compared to datacenter IPs. Evomi sources its proxies ethically and provides robust infrastructure, ensuring quality and reliability – attributes often associated with Swiss-based services. They even offer a free trial if you want to test the waters.
To configure Reqwest to use a proxy, you need the proxy server's address and credentials (if required). The format typically looks like `protocol://username:password@proxy_host:proxy_port`.
Let's modify our client setup to include an HTTP and an HTTPS proxy. You'll need to replace the placeholder URL with your actual proxy details.
// Example proxy URL format - replace with your actual proxy details
// For Evomi residential proxies, it might look like:
// "http://YOUR_USERNAME:YOUR_PASSWORD@rp.evomi.com:1000"
let proxy_url = "http://your_proxy_user:your_proxy_pass@your_proxy_server:port";
let http_proxy = reqwest::Proxy::http(proxy_url)?;
let https_proxy = reqwest::Proxy::https(proxy_url)?; // Use same proxy for HTTPS
let client = reqwest::Client::builder()
.proxy(http_proxy)
.proxy(https_proxy)
.build()?;
// ... rest of the code remains the same ...
With this change, all requests made by this `client` instance will be routed through the specified proxy server, helping to protect your IP address and improve the chances of successful scraping. Remember to handle potential errors during proxy creation, perhaps using `?` as shown or more detailed error handling.
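If you use the same endpoint for both schemes, Reqwest also provides `reqwest::Proxy::all`, which routes every request through a single proxy and lets you collapse the two calls above into one:
let proxy = reqwest::Proxy::all(proxy_url)?;
let client = reqwest::Client::builder()
    .proxy(proxy)
    .build()?;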
Concluding Thoughts
We've explored how to perform fundamental web scraping tasks using the Rust programming language, leveraging the `Reqwest` library for fetching web content and `Scraper` for parsing HTML. We also covered the importance of using proxies and how to integrate them into your Rust scraper using Reqwest.
While this setup works well for static HTML pages like Hacker News, many modern websites rely heavily on JavaScript to load content dynamically. For those scenarios, simply fetching the initial HTML won't be enough. Rust has solutions for this too, notably the Thirtyfour crate, which provides bindings for Selenium WebDriver. This allows you to control a real web browser programmatically, enabling interaction with dynamic elements, clicking buttons, filling forms, and executing JavaScript – essential for scraping complex, interactive sites.

Author
David Foster
Proxy & Network Security Analyst
About Author
David is an expert in network security, web scraping, and proxy technologies, helping businesses optimize data extraction while maintaining privacy and efficiency. With a deep understanding of residential, datacenter, and rotating proxies, he explores how proxies enhance cybersecurity, bypass geo-restrictions, and power large-scale web scraping. David’s insights help businesses and developers choose the right proxy solutions for SEO monitoring, competitive intelligence, and anonymous browsing.