Ruby Web Scraping: Avoid Blocks with Proxies

Michael Chen

Last edited on May 15, 2025

Scraping Techniques

Tapping into Web Data with Ruby: Staying Undetected

Ruby, with its elegant syntax and robust libraries, is a fantastic tool for automatically gathering and processing web data at scale. Think of the possibilities: tracking competitor prices, aggregating news feeds, monitoring job boards, or collecting customer sentiment from reviews. This process, known as web scraping, offers significant advantages regardless of your chosen method – be it Ruby, another language, or even no-code solutions.

However, there's a common roadblock: getting blocked. Most websites employ measures to prevent automated scraping, preferring to serve content exclusively to human visitors, even though web scraping itself is generally legal.

Our mission today is to explore how you can harness Ruby for web scraping effectively while minimizing the risk of detection. We'll journey from setting up your Ruby environment to capturing screenshots, extracting valuable data, and navigating web pages, all with an eye on staying under the radar.

Let's dive in!

What Exactly is Ruby Web Scraping?

At its core, Ruby web scraping involves using Ruby code, often leveraging specialized libraries (called 'gems'), to fetch and interpret content from websites automatically.

The fundamental process mirrors scraping in other languages: you request a web page, download its HTML (and sometimes other resources), parse this content to find the data you need, and finally, store or process that data.

Ruby offers several different approaches to achieve this, each with its own strengths and weaknesses. Let's explore the common paths.

Is Ruby a Solid Choice for Web Scraping?

Absolutely. Ruby is an excellent language for web scraping tasks. Its relatively gentle learning curve makes it accessible for beginners, and it boasts a supportive community. Furthermore, Ruby offers powerful gems specifically designed for web scraping and related tasks, including popular choices like Nokogiri, Kimurai, and the versatile Selenium WebDriver.

When it comes to running your scraper, Ruby provides flexibility. You can easily test scripts on your local machine, deploy them to various online hosting platforms, or integrate them into cloud-based services.

Common Methods for Scraping Websites with Ruby

There are four primary techniques you might encounter or employ when scraping with Ruby:

  • Requesting pages and using regular expressions (regex) for data extraction.

  • Requesting pages and utilizing an HTML/XML parser gem for structured data extraction.

  • Intercepting and making direct XHR (XMLHttpRequest) requests that load dynamic data.

  • Employing a headless browser to simulate full user interaction.
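
To make the simpler approaches concrete, here's a tiny sketch of the first technique: fetching a page with Ruby's built-in net/http and pulling out its <title> with a regular expression. It's fragile (regex and HTML don't mix well), but it illustrates the idea; the URL is just a placeholder:

require 'net/http'

# Fetch the raw HTML of a page
html = Net::HTTP.get(URI('https://example.com/'))

# Extract the <title> tag with a regular expression (brittle, demo only)
if html =~ %r{<title>(.*?)</title>}m
  puts "Page title: #{Regexp.last_match(1)}"
end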

Using a headless browser tool like Selenium often proves to be the most reliable and versatile option, especially for modern websites. These tools can handle tasks that simpler methods struggle with, effectively mimicking a real user's browser.

Let's briefly touch upon why we often lean towards Selenium for web scraping, especially when dealing with complex sites.

Parsing with Nokogiri: A Quick Look

Nokogiri is a very popular Ruby gem for parsing HTML and XML documents. You typically use another gem (like `net/http` or `HTTParty`) to fetch the web page's content first, and then feed that content to Nokogiri for analysis and data extraction.

Nokogiri excels at navigating the structure of an HTML document, even if it's not perfectly formed. This is a significant improvement over relying solely on regular expressions. However, Nokogiri primarily works with the static HTML source code received from the server.

Many modern websites heavily rely on JavaScript to load or modify content after the initial page load. Since Nokogiri doesn't execute JavaScript, it might miss data that isn't present in the initial HTML source. This limitation makes headless browsers a more robust choice for comprehensive scraping.
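
For a feel of this workflow, here's a minimal fetch-and-parse sketch. It assumes Nokogiri is installed (`gem install nokogiri`) and uses Ruby's built-in net/http; the selectors are illustrative and would need adjusting per site:

require 'net/http'
require 'nokogiri'

# Fetch the static HTML of a page
html = Net::HTTP.get(URI('https://en.wikipedia.org/wiki/Ruby_(programming_language)'))

# Parse it and extract data with CSS selectors
doc = Nokogiri::HTML(html)
puts doc.css('h1').first&.text            # Main page heading
doc.css('p').take(3).each do |paragraph|  # First few paragraphs, truncated
  puts paragraph.text.strip[0..80]
end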

Why Consider Selenium for Scraping?

Selenium WebDriver stands out as a powerful tool because it automates actual web browsers (like Chrome, Firefox, etc.). It allows your Ruby script to interact with a web page just as a human user would: clicking buttons, filling forms, scrolling, and importantly, executing JavaScript.

This means Selenium can access content loaded dynamically, overcoming the main limitation of static parsers like Nokogiri. If a human can see the data in their browser, Selenium can likely access it too.

While Selenium is a go-to, Ruby has other headless browser automation options:

  • Kimurai: A full-fledged web scraping framework built on top of headless browsers.

  • Watir: Stands for "Web Application Testing in Ruby," focusing on browser automation for testing, but applicable to scraping.

  • Apparition: A driver specifically for using Chrome via the Capybara acceptance testing framework.

  • Poltergeist: A driver for the older PhantomJS headless browser, also often used with Capybara (though less common now with Chrome/Firefox headless modes).

These tools are often used beyond scraping, finding application in automated testing and various web automation tasks.

Can Your Scraper Get Blocked?

Yes, definitely. Websites actively try to detect and block automated scraping traffic. To navigate this, combining headless browsers with high-quality proxies is often essential.

Websites typically look for two main types of indicators to identify bots:

  1. Request Characteristics: They analyze HTTP headers, TLS fingerprints, and how a browser requests and renders resources. Simple scripts might miss sending certain headers that real browsers include automatically. Using a real browser engine via Selenium helps mitigate this, as it sends requests much like a standard browser.

  2. Behavioral Patterns: Making too many requests from a single IP address in a short period, accessing pages in a non-human sequence, or exhibiting predictable timing patterns can flag an account or IP address.
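
On the behavioral side, one small habit that helps is randomizing the pauses between page loads so your scraper's timing doesn't look machine-perfect. A minimal sketch (the URLs are placeholders, and `driver` refers to the Selenium session set up later in this guide):

urls = [
  'https://example.com/page/1',
  'https://example.com/page/2'
]

urls.each do |url|
  driver.get(url)
  sleep(rand(2.0..6.0)) # Pause a random 2-6 seconds between requests
end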

This is where proxies become crucial. By routing your requests through different IP addresses, you make it much harder for websites to track your scraper's activity based on its origin IP. Using residential proxies, like those offered by Evomi, is particularly effective. These proxies use IP addresses assigned by ISPs to real home users, making your scraper's traffic blend in seamlessly with legitimate human traffic, unlike easily identifiable datacenter IPs.

At Evomi, we prioritize ethically sourced proxies and provide reliable residential, mobile, datacenter, and static ISP options, ensuring you can scrape effectively and responsibly. We even offer a free trial for our residential, mobile and datacenter proxies if you want to test the waters!

With the 'why' covered, let's get practical and build our Ruby scraper.

Ruby Web Scraping: A Practical Walkthrough

Follow along with the steps below, or jump to the sections most relevant to your needs. We'll cover:

  • Setting up Ruby and an editor

  • Installing Selenium

  • Taking a basic screenshot

  • Configuring Selenium to use a proxy

  • Extracting specific data

  • Simulating clicks

  • Filling out web forms

Setting Up Your Ruby Environment

First, check if you already have Ruby installed. Open your terminal or command prompt and type:

ruby -v

If you see a version number, you're good to go! If not, you'll need to install it. The best method depends on your operating system.

  • Windows: The RubyInstaller is generally the easiest way. Alternatively, package managers like Chocolatey can be used:
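
    choco install ruby   # Chocolatey package (run from an elevated prompt)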

  • macOS: macOS usually comes with a system version of Ruby, but it's often recommended to manage versions using tools like Homebrew:
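
    brew install ruby    # Installs a current Ruby alongside the system version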

  • Linux: Use your distribution's package manager. Examples include:

    sudo apt-get install ruby-full # Debian/Ubuntu
    sudo yum install ruby        # CentOS/Fedora (older)
    sudo dnf install ruby        # Fedora (newer)
    sudo pacman -S ruby        # Arch Linux

For detailed instructions, refer to the official Ruby installation guide.

You'll also need a code editor. Popular choices include VS Code, Sublime Text, RubyMine, Atom, or even simpler text editors like Notepad++ or TextMate.

Installing the Selenium Gem

With Ruby installed, open your terminal and install the Selenium WebDriver gem:
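
gem install selenium-webdriver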

This command downloads and installs the necessary library. To use it in your Ruby script, you'll add this line at the top:

require 'selenium-webdriver'

For using authenticated proxies later, you'll also need the `selenium-devtools` gem:
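
gem install selenium-devtools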

Taking a Simple Screenshot with Selenium

Let's start with a basic script. Create a file named `scraper.rb` in your editor and add the following code:

require 'selenium-webdriver'

# Configure Chrome options
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless') # Run without opening a visible browser window
options.add_argument('--no-sandbox') # Often needed in Linux environments
options.add_argument('--disable-dev-shm-usage') # Overcomes resource limits in Docker/Linux
options.add_argument('--ignore-certificate-errors')

# Create a new Selenium WebDriver instance for Chrome
driver = Selenium::WebDriver.for :chrome, options: options

begin
  # Navigate to a page that shows our IP
  driver.get('https://geo.evomi.com/')

  # Wait a bit for the page to load (optional, but good practice)
  sleep(2)

  # Save a screenshot
  driver.save_screenshot('page_snapshot.png')
  puts "Screenshot saved as page_snapshot.png"

  # Print the page source (HTML content)
  puts "Page Source:"
  puts driver.page_source[0..500] + "..." # Print first 500 chars
ensure
  # Always close the browser session
  driver.quit
end

Let's break down this script:

  • require 'selenium-webdriver': Loads the library.

  • Selenium::WebDriver::Chrome::Options.new: Creates an object to hold browser configuration settings.

  • options.add_argument(...): Adds command-line flags when launching Chrome (e.g., `headless` runs it without a GUI).

  • Selenium::WebDriver.for :chrome, options: options: Initializes the Chrome browser controlled by Selenium, using our defined options.

  • driver.get('...'): Navigates the browser to the specified URL (we're using Evomi's Geo IP checker here).

  • sleep(2): Pauses the script for 2 seconds. Useful for letting pages (especially those with JavaScript) finish loading.

  • driver.save_screenshot(...): Captures the current browser view and saves it as an image file.

  • driver.page_source: Returns the full HTML source code of the current page.

  • ensure driver.quit: This crucial block ensures the browser process is closed properly, even if errors occur earlier in the script.

Run this script from your terminal: `ruby scraper.rb`. You should see a file `page_snapshot.png` created in the same directory and some HTML output in your terminal, including the IP address detected by the website.

Terminal output showing an IP address

(Image illustrates typical output showing a detected IP address)

This confirms Selenium is working and accessing the web. The IP shown will be your machine's public IP address.
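
A side note on timing: the fixed sleep(2) above is fine for a demo, but Selenium's explicit waits are usually more reliable because they poll until a condition is met rather than pausing blindly. A small sketch using the same driver setup (the selector is just an example):

# Wait up to 10 seconds for an element to appear instead of sleeping a fixed time
wait = Selenium::WebDriver::Wait.new(timeout: 10)
driver.get('https://geo.evomi.com/')
element = wait.until { driver.find_element(css: 'body') }
puts element.text[0..200]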

Integrating Authenticated Proxies with Selenium

To avoid blocks and scrape effectively, using proxies is key. Let's modify the script to use an Evomi residential proxy. When you sign up for an Evomi proxy plan, you'll get access to your dashboard where you find your proxy endpoints (like `rp.evomi.com`), port numbers, and authentication credentials (username/password). You can often also whitelist your IP for password-less access from specific locations.

We'll use username/password authentication here. Selenium's DevTools protocol integration allows us to handle this:

require 'selenium-webdriver'
require 'selenium-devtools' # Needed for authentication

# --- Evomi Proxy Configuration ---
proxy_host = 'rp.evomi.com'
proxy_port = 1000 # Example: HTTP port for residential proxies
proxy_user = 'YOUR_EVOMI_USERNAME' # Replace with your actual username
proxy_pass = 'YOUR_EVOMI_PASSWORD' # Replace with your actual password
proxy_url = "#{proxy_host}:#{proxy_port}"
# --------------------------------

# Configure the proxy within Selenium options
proxy = Selenium::WebDriver::Proxy.new(
  http: proxy_url,
  ssl: proxy_url # Apply proxy for both HTTP and HTTPS
)

# Configure Chrome options, including the proxy
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')
options.proxy = proxy # Assign the proxy configuration

# Create a new Selenium WebDriver instance for Chrome
driver = Selenium::WebDriver.for :chrome, options: options

# Set up proxy authentication through the DevTools Fetch domain
devtools = driver.devtools

# Answer the proxy's authentication challenge with our credentials
devtools.fetch.on(:auth_required) do |params|
  puts "Proxy authentication required for: #{params.dig('request', 'url')}"
  devtools.fetch.continue_with_auth(
    request_id: params['requestId'],
    auth_challenge_response: {
      response: 'ProvideCredentials',
      username: proxy_user,
      password: proxy_pass
    }
  )
end

# With Fetch enabled, intercepted requests must be resumed explicitly
devtools.fetch.on(:request_paused) do |params|
  devtools.fetch.continue_request(request_id: params['requestId'])
end

# Enable interception, pausing requests that need authentication
devtools.fetch.enable(handle_auth_requests: true)

begin
  # Navigate to the same IP checking page
  puts "Navigating via proxy..."
  driver.get('https://geo.evomi.com/')

  sleep(3) # Allow slightly more time for proxy connection/page load

  # Save a screenshot
  driver.save_screenshot('proxied_snapshot.png')
  puts "Screenshot saved as proxied_snapshot.png"

  # Print the page source (should show the proxy's IP)
  puts "Page Source via Proxy:"
  puts driver.page_source[0..500] + "..."
ensure
  # Always close the browser session
  driver.quit
end

Key changes:

  • We define Evomi proxy details (replace placeholders!).

  • Selenium::WebDriver::Proxy.new configures the proxy settings.

  • options.proxy = proxy applies the proxy to the Chrome options.

  • We `require 'selenium-devtools'`.

  • We get the `devtools` object and enable the Fetch domain with `handle_auth_requests: true`, which pauses authentication challenges so our script can answer them.

  • We listen for the proxy's authentication challenge with `on(:auth_required)` and answer it via `continue_with_auth`, then resume paused requests with `continue_request`. (Selenium also provides a higher-level `driver.register(username:, password:)` helper that wraps these same DevTools calls.)

  • The target URL remains `https://geo.evomi.com/`.

Make sure to replace 'YOUR_EVOMI_USERNAME' and 'YOUR_EVOMI_PASSWORD' with your actual Evomi credentials.

Run this updated script (`ruby scraper.rb`). Now, the screenshot and the printed page source should reflect the IP address of the Evomi proxy server, not your own IP.

Terminal output showing a different IP address via proxy

(Image illustrates the concept of the IP address changing when using a proxy)

Success! You're now browsing through a proxy. You can adapt the subsequent examples by starting with this proxy-enabled setup and just changing the `driver.get(...)` URL and the interaction logic.

Extracting Specific Data from Pages

Selenium allows you to find specific elements on a page using various locators (CSS selectors, XPath, ID, name, link text, etc.) and then extract information from them.

Let's grab the main title from a Wikipedia page:

# Assuming 'driver' is initialized and potentially proxied
begin
  driver.get('https://en.wikipedia.org/wiki/Ruby_(programming_language)')

  # Find the main heading element using its CSS class
  title_element = driver.find_element(css: '.mw-page-title-main')

  # Extract and print the text content, removing leading/trailing whitespace
  puts "Page Title: #{title_element.text.strip}"
ensure
  driver.quit
end

This code navigates to the Ruby language Wikipedia page, uses `find_element` with a CSS selector (`.mw-page-title-main`) to locate the main heading, and then prints its `.text` content.
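
When you need several elements, `find_elements` (plural) returns an array you can iterate over. A quick sketch that lists the page's h2 headings, assuming the same driver and page as above:

# Collect every <h2> heading on the current page
headings = driver.find_elements(css: 'h2')
headings.each_with_index do |heading, index|
  puts "#{index + 1}. #{heading.text.strip}"
end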

Clicking Links Programmatically

You can simulate clicks on links or buttons. First, find the element, then call the `.click` method on it.

# Assuming 'driver' is initialized and potentially proxied
begin
  driver.get('https://en.wikipedia.org/wiki/Ruby_(programming_language)')

  # Find a link element by its exact visible text
  history_link = driver.find_element(link_text: 'History')

  # Click the link
  history_link.click

  # Wait briefly for the new page section/page to load
  sleep(1)

  # Print the current URL to verify navigation
  puts "Current URL after click: #{driver.current_url}"
  # (Might just append #History or navigate if it was a full page link)
ensure
  driver.quit
end

This script loads the Ruby Wikipedia page, finds the link with the text "History", clicks it, and then prints the browser's current URL, which might now include `#History` or be a different page entirely depending on the link type.

Filling and Submitting Forms

Selenium can also interact with form fields using the `.send_keys` method (simulates typing) and `.submit` (simulates pressing Enter on a form field or clicking a submit button).

Let's search Wikipedia:

# Assuming 'driver' is initialized and potentially proxied
begin
  driver.get('https://en.wikipedia.org/wiki/Main_Page')

  # Find the search input field by its ID
  search_box = driver.find_element(id: 'searchInput')

  # Type text into the search box
  search_box.send_keys('Web scraping')

  # Submit the form (like pressing Enter)
  search_box.submit

  # Wait for search results page to load
  sleep(2)

  # Save a screenshot of the results page
  driver.save_screenshot('search_results.png')
  puts "Screenshot saved as search_results.png"

  # Print the URL of the results page
  puts "Search Results URL: #{driver.current_url}"
ensure
  driver.quit
end

This code goes to the Wikipedia main page, finds the search input (`#searchInput`), types "Web scraping" into it, submits the search, waits, takes a screenshot of the results, and prints the new URL.

Screenshot of Wikipedia search results page

(Image illustrates a typical search results page after automated form submission)

Wrapping Up

Today we explored how to leverage the Ruby language for web scraping tasks. We covered the journey from setting up your environment and installing Selenium to performing essential scraping actions like taking screenshots, extracting data, clicking links, and filling forms. Crucially, we saw how integrating reliable residential proxies, like those from Evomi, is vital for avoiding detection and ensuring your scraper can access the data it needs.

With these techniques, you're well-equipped to start building your own powerful and robust web scrapers using Ruby. Happy scraping!

Author

Michael Chen

AI & Network Infrastructure Analyst

About Author

Michael bridges the gap between artificial intelligence and network security, analyzing how AI-driven technologies enhance proxy performance and security. His work focuses on AI-powered anti-detection techniques, predictive traffic routing, and how proxies integrate with machine learning applications for smarter data access.
