Python Web Crawling w/ Scrapy: Leverage Proxies for Data





David Foster
Scraping Techniques
Exploring the Web with Python: Building a Scrapy Crawler
Ever wondered how search giants like Google or Bing seem to know about almost every corner of the internet? They don't have magic crystal balls; they rely heavily on a process called web crawling.
At their core, search engines deploy armies of automated bots, known as web crawlers (or spiders), that tirelessly navigate the vast expanse of the web. These bots jump from link to link, meticulously indexing page content and analyzing the connections between them. This constant exploration builds the massive databases that power our search results.
In this guide, we'll dive into the world of web crawling using Python and Scrapy, a powerful open-source framework designed for exactly this purpose. While we won't be indexing the entire internet today, we'll build a functional crawler capable of exploring a specific slice of Wikipedia.
So, What Exactly is Web Crawling?
Web crawling is the systematic, automated process of browsing websites. It's carried out by web crawlers or spiders – software programs specifically designed to discover URLs (links) on web pages and follow them to find more pages.
Distinguishing Web Crawling from Web Scraping
People often use "web crawling" and "web scraping" interchangeably, but they represent distinct, though related, activities.
Think of it this way: web crawling is about discovering pathways (URLs) across the web, like mapping out a city's streets. Web scraping, on the other hand, is about extracting specific information from the locations found on that map, like noting down the names and addresses of shops on those streets. Often, a data-gathering project involves both: first crawling to find the relevant pages, then scraping to pull out the desired data.
The typical output of a crawling process is a list of URLs or perhaps the full HTML of the pages found. Scraping, however, usually aims for a more structured dataset – think spreadsheets or databases filled with specific pieces of information.
Where Does Web Crawling Come into Play?
As mentioned, search engines are prime examples of heavy web crawler users. They need to build and maintain a near-complete snapshot of the accessible web to deliver relevant search results. Algorithms then process this massive index to rank pages effectively for user queries.
However, crawling isn't just for search engines. Imagine building a service that compares product prices across various online stores. You'd first need to crawl these e-commerce sites to discover all their product pages. Then, you'd periodically scrape these identified pages to extract current prices, descriptions, and availability, allowing you to present the best deals to your users.
Building Your First Web Crawler with Python and Scrapy
Let's get hands-on! In this tutorial, we'll construct a Python script using Scrapy to crawl a portion of Wikipedia.
Our goal is ambitious yet achievable: create a crawler that identifies all Wikipedia articles reachable within two clicks starting from a chosen article. Essentially, we're mapping out the pages that are two "degrees of separation" away from our starting point within the Wikipedia network.
Prerequisites
Before we start coding, ensure you have Python installed on your system. If not, you can grab the latest version from the official Python website.
Next, you'll need to install the Scrapy library. Open your terminal or command prompt and run:
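pip install scrapy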
A bit of familiarity with basic web scraping concepts will be helpful. If you're new to this, you might want to check out our comprehensive guide to Python web scraping first.
Crafting the Web Crawler
Let's begin by setting up our Scrapy project. Navigate to your desired working directory in the terminal and execute:
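scrapy startproject wikinav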
This command generates a directory named `wikinav` containing the basic structure for a Scrapy project:
.
├── scrapy.cfg
└── wikinav
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
Now, create a new Python file named `wikispider.py` inside the `wikinav/spiders/` directory. Open this file in your preferred code editor.
First, we import the necessary components from Scrapy and Python's regular expression module (`re`):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re
In Scrapy, crawlers (spiders) are defined using Python classes. The class attributes and methods dictate the crawler's behavior.
We'll define a class named `WikiNavSpider` that inherits from Scrapy's specialized `CrawlSpider` class, which is designed for following links:
class WikiNavSpider(CrawlSpider):
    # Spider configuration and logic will go here
For our specific task, we need to define a few key class attributes: `name`, `start_urls`, `rules`, and configure the crawl depth via `custom_settings`.
Let's set the `name` (a unique identifier for the spider) to "wikinav" and define the `start_urls` list with a single starting Wikipedia page. Feel free to choose a different page!
name = 'wikinav'

# Starting point for our crawl
start_urls = ['https://en.wikipedia.org/wiki/Web_crawler']
The `rules` attribute is where the magic of `CrawlSpider` happens. It defines how the crawler should find and follow links.
Here's the rule set for our Wikipedia crawler:
rules = (
    # Rule to extract, process, and follow links
    Rule(
        LinkExtractor(
            # Allow article links; exclude the Main Page and colon-prefixed namespaces
            allow=r"https:\/\/en\.wikipedia\.org\/wiki\/(?!Main_Page$)[^:]*$",
            # Deny specific non-article namespaces
            deny=(
                r"Special:",
                r"Portal:",
                r"Help:",
                r"Wikipedia:",
                r"Wikipedia_talk:",
                r"Talk:",
                r"Category:",
            ),
        ),
        callback='parse_page_content',  # Method called for each followed link's response
        follow=True,  # Allow the spider to follow links found on these pages
    ),
)
This might look a bit dense, so let's break down the `Rule` object:
It uses a `LinkExtractor`, which is responsible for finding links on a page based on specified criteria.
`allow`: A regular expression matching URLs that the extractor should consider. Here, it targets English Wikipedia article pages (`/wiki/Something`) but specifically excludes the Main Page and pages containing colons (like `File:`, `Template:`, etc.) often used for non-article content.
`deny`: Regular expressions for URLs that should be explicitly ignored, even if they match the `allow` pattern. We're excluding various Wikipedia administrative and meta namespaces.
`callback`: Specifies the name of the method (as a string) within our spider class that should be called to process the response received from each followed link. We'll name ours `parse_page_content`.
`follow`: A boolean indicating whether the crawler should continue searching for links on the pages fetched according to this rule. Setting it to `True` enables multi-level crawling.
Defining these rules is often the trickiest part of setting up a `CrawlSpider`. It takes practice, but it provides a powerful declarative way to control crawling behavior without complex manual link management.
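Since the allow pattern does most of the filtering work, it's worth sanity-checking it before launching a full crawl. Here's a quick standalone snippet (separate from the spider itself) that tests the pattern against a few representative URLs using Python's re module:

import re

allow_pattern = r"https:\/\/en\.wikipedia\.org\/wiki\/(?!Main_Page$)[^:]*$"

test_urls = [
    "https://en.wikipedia.org/wiki/Web_crawler",        # article page: should match
    "https://en.wikipedia.org/wiki/Main_Page",          # explicitly excluded: should not match
    "https://en.wikipedia.org/wiki/Category:Software",  # non-article namespace: should not match
]

for url in test_urls:
    print(url, "->", bool(re.match(allow_pattern, url)))

Running it should print True for the article URL and False for the other two, confirming the pattern behaves as intended.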
Finally, we limit the crawling depth. We only want pages within two clicks of the start page. We achieve this using `custom_settings`:
custom_settings = {
    'DEPTH_LIMIT': 2,  # Limit crawl depth to 2 levels from start_urls
}
Now, we need to implement the `callback` function we specified: `parse_page_content`. This method receives the `response` object for each crawled page and is where we can extract data (perform scraping).
Our `parse_page_content` method will simply extract the page title and its URL, yielding them as a Python dictionary:
def parse_page_content(self, response):
    # Extract the main title from the <title> tag, removing the " - Wikipedia" suffix
    page_title = response.css('title::text').get().replace(" - Wikipedia", "")
    page_url = response.url

    # Yield the data as a dictionary
    yield {
        'title': page_title,
        'url': page_url,
    }
Note the use of `yield` instead of `return`. Scrapy operates asynchronously, processing multiple requests concurrently. `yield` allows Scrapy to handle the produced data efficiently without blocking the spider's progress.
Here is the complete code for our `wikispider.py` file:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re


class WikiNavSpider(CrawlSpider):
    name = 'wikinav'

    # Starting point for our crawl
    start_urls = ['https://en.wikipedia.org/wiki/Web_crawler']

    # Rules define how to follow links and which pages to process
    rules = (
        Rule(
            LinkExtractor(
                # Allow article links; exclude the Main Page and colon-prefixed namespaces
                allow=r"https:\/\/en\.wikipedia\.org\/wiki\/(?!Main_Page$)[^:]*$",
                # Deny specific non-article namespaces
                deny=(
                    r"Special:",
                    r"Portal:",
                    r"Help:",
                    r"Wikipedia:",
                    r"Wikipedia_talk:",
                    r"Talk:",
                    r"Category:",
                ),
            ),
            callback='parse_page_content',  # Method called for each followed link's response
            follow=True,  # Allow the spider to follow links found on these pages
        ),
    )

    # Custom settings specific to this spider
    custom_settings = {
        'DEPTH_LIMIT': 2,  # Limit crawl depth to 2 levels from start_urls
    }

    # Callback method to process the response from each crawled page
    def parse_page_content(self, response):
        # Extract the main title from the <title> tag, removing the " - Wikipedia" suffix
        page_title = response.css('title::text').get().replace(" - Wikipedia", "")
        page_url = response.url

        # Yield the data as a dictionary
        yield {
            'title': page_title,
            'url': page_url,
        }
Avoiding Blocks: Delays and Proxies
Our crawler is functional, but running it "as is" against a live site like Wikipedia will likely get our IP address temporarily blocked very quickly. Websites employ anti-bot measures to prevent aggressive crawling that could overload their servers.
One simple approach is to slow down. We can instruct Scrapy to wait a bit between requests by adding a setting to the main `settings.py` file in our project (`wikinav/settings.py`):
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1.5 # Wait 1.5 seconds between requests
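If you'd prefer Scrapy to adapt the pace on its own, you can also enable the built-in AutoThrottle extension in the same settings.py. A minimal configuration might look like this (the values are illustrative; tune them for your target site):

# Optional: let Scrapy adjust the delay based on server response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.5         # initial delay between requests
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one concurrent request per server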
However, for larger crawls or when speed is a factor, delays alone aren't sufficient. This is where proxies become essential.
Proxies act as intermediaries, routing your requests through different IP addresses. This masks the fact that all requests originate from a single source, making it much harder for websites to detect and block your crawler. If one proxy IP gets blocked, you simply rotate to another. Using proxies also protects your own IP address from being blacklisted.
At Evomi, we provide ethically sourced residential, mobile, datacenter, and static ISP proxies suitable for various crawling and scraping tasks. Our residential proxies, for example, offer IPs from real user devices across the globe, making your requests appear highly authentic. Coupled with our commitment to quality (backed by our Swiss base) and dedicated support, we aim to provide reliable proxy solutions.
Integrating Evomi Proxies with Your Scrapy Spider
Let's see how to configure our Scrapy crawler to use Evomi proxies. We'll focus on using rotating residential proxies, which automatically change the IP address for each request, ideal for large-scale crawling.
To use Evomi proxies, you'll need your proxy credentials and the correct endpoint address. For our rotating residential proxies, the endpoint is `rp.evomi.com`, and you can choose ports like `1000` for HTTP, `1001` for HTTPS, or `1002` for SOCKS5.
Important Security Note: Your proxy access details (username, password, endpoint, port) are sensitive. Never hardcode them directly into your script, especially if you plan to share your code or store it in version control like Git.
The recommended and safest way to provide proxy details to Scrapy is through environment variables. Scrapy automatically looks for standard environment variables named `http_proxy` and `https_proxy`.
Setting these variables depends on your operating system and shell. For Linux/macOS using bash or zsh, you can use the `export` command in the terminal *before* running your spider:
# Replace YOUR_USERNAME and YOUR_PASSWORD with your actual Evomi credentials
export http_proxy="http://YOUR_USERNAME:YOUR_PASSWORD@rp.evomi.com:1000"
export https_proxy="http://YOUR_USERNAME:YOUR_PASSWORD@rp.evomi.com:1001"
(Note: Adjust the port number based on the protocol you prefer - `1000` for HTTP, `1001` for HTTPS targeted sites). For Windows Command Prompt, use `set http_proxy=...`, and for PowerShell, use `$env:http_proxy=...`.
Once the environment variables are set, Scrapy will automatically route its requests through the specified Evomi proxy. No changes are needed in the Python code itself!
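If you'd rather keep the proxy configuration inside the project, Scrapy also honours a per-request proxy set via request.meta["proxy"]. Below is a minimal sketch of a downloader middleware that injects it from an environment variable (EVOMI_PROXY_URL is simply a name chosen for this example), which you would then enable through DOWNLOADER_MIDDLEWARES in settings.py:

import os


class EnvProxyMiddleware:
    """Attach a proxy (read from the environment) to every outgoing request."""

    def __init__(self):
        # Expected format: "http://YOUR_USERNAME:YOUR_PASSWORD@rp.evomi.com:1000"
        self.proxy_url = os.environ.get("EVOMI_PROXY_URL")

    def process_request(self, request, spider):
        if self.proxy_url and "proxy" not in request.meta:
            request.meta["proxy"] = self.proxy_url
        return None  # let Scrapy continue processing the request normally

Registering it with an entry such as 'wikinav.middlewares.EnvProxyMiddleware': 350 in DOWNLOADER_MIDDLEWARES should run it ahead of Scrapy's built-in HttpProxyMiddleware, which then handles the credentials embedded in the proxy URL.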
Now, you're ready to run your proxy-enabled crawler. Execute the following command in the same terminal session where you set the environment variables. This command runs the `wikinav` spider and saves the yielded data into a JSON file named `wiki_output.json`:
scrapy crawl wikinav -o wiki_output.json
Depending on the starting page and network conditions, the crawl might take some time to complete.
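When it's done, wiki_output.json will contain a JSON array of the dictionaries yielded by parse_page_content. A few lines of Python are enough to take a quick look at the results:

import json

with open("wiki_output.json", "r", encoding="utf-8") as f:
    pages = json.load(f)

print(f"Discovered {len(pages)} pages")
for page in pages[:5]:
    print(f"{page['title']} -> {page['url']}")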
Wrapping Up
Congratulations! You've learned the fundamentals of web crawling with Python using the powerful Scrapy framework. We covered creating a `CrawlSpider`, defining rules for link extraction and following, parsing page content, and crucially, integrating proxies using Evomi to handle anti-bot measures and enable more robust crawling.
Scrapy excels at crawling and scraping traditional websites. Keep in mind, however, that it doesn't execute JavaScript. For modern, dynamic websites that heavily rely on JavaScript to load content, tools like Playwright or Selenium, which control actual web browsers, might be a better fit.

Author
David Foster
Proxy & Network Security Analyst
About Author
David is an expert in network security, web scraping, and proxy technologies, helping businesses optimize data extraction while maintaining privacy and efficiency. With a deep understanding of residential, datacenter, and rotating proxies, he explores how proxies enhance cybersecurity, bypass geo-restrictions, and power large-scale web scraping. David’s insights help businesses and developers choose the right proxy solutions for SEO monitoring, competitive intelligence, and anonymous browsing.