Async Web Scraping with Aiohttp: A Proxy Integration Guide





David Foster
Scraping Techniques
Diving into Asynchronous Web Scraping with Aiohttp and Proxies
Extracting data from the web, or web scraping, is a cornerstone technique for everything from tracking e-commerce prices to aggregating news feeds or monitoring financial markets. It's about using code to fetch information automatically and efficiently. If you're venturing into building your own scraper, getting comfortable with some coding is essential.
Python is a fantastic choice for this, largely thanks to its straightforward syntax and powerful libraries. When it comes to making HTTP requests, Python offers several popular options, including aiohttp, httpx, and the classic requests library. Each has its own strengths, which you can explore further in our comparison of httpx vs aiohttp vs requests.
This guide focuses on aiohttp, an asynchronous library. What does "asynchronous" mean here? It means aiohttp can juggle multiple web requests simultaneously without getting stuck waiting for each one to finish. This makes it incredibly efficient for tasks requiring many concurrent connections. However, to truly harness this power without running into roadblocks like IP bans, integrating proxies is crucial. Proxies help mask your origin and distribute your requests, making your scraping activities less likely to be flagged by target websites.
Here’s what we'll cover:
- The basics of how aiohttp achieves concurrency.
- Getting aiohttp set up on your machine.
- Integrating proxies (like Evomi's residential proxies) with aiohttp.
- Smart strategies for using proxies with aiohttp effectively.
Let's get started!
What is Aiohttp Anyway?
Think about needing to grab real-time data from several different online sources, perhaps fetching currency exchange rates from multiple financial websites at once. With traditional, synchronous libraries (like the standard `requests` library), your program would make a request, wait for the response, process it, then move to the next request, one by one. This sequential process can become a significant bottleneck, especially when dealing with many slow or unresponsive servers.
Aiohttp tackles this differently using Python's asyncio framework. It allows your program to initiate a request and then immediately move on to other tasks (like starting another request) without waiting for the first one to complete. When a response arrives, aiohttp handles it. This non-blocking approach means you can manage numerous HTTP operations concurrently, drastically speeding up I/O-bound tasks like web scraping.
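To make the difference concrete, here is a minimal sketch of firing several requests concurrently with asyncio.gather. It uses httpbin.org's delay endpoint purely as a stand-in for a slow server; it is not part of the scraping target used later in this guide.

import asyncio
import aiohttp

async def fetch_status(session, url):
    # Each request yields control to the event loop while waiting on the network
    async with session.get(url) as response:
        return response.status

async def main():
    # httpbin.org/delay/2 simulates a server that takes ~2 seconds to respond
    urls = ["https://httpbin.org/delay/2"] * 5
    async with aiohttp.ClientSession() as session:
        # All five requests run concurrently: total wall time is ~2s, not ~10s
        statuses = await asyncio.gather(*(fetch_status(session, url) for url in urls))
        print(statuses)

if __name__ == "__main__":
    asyncio.run(main())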
Setting Up Aiohttp with Proxies
So, aiohttp lets you fire off requests rapidly. That's great for speed, but it also increases the chances of overwhelming a target server or triggering its defenses. Websites often monitor incoming traffic, and a sudden flood of requests from a single IP address is a classic sign of automated scraping. This can lead to rate limiting (slowing you down), CAPTCHAs, or outright IP blocks.
This is where proxies become indispensable. By routing your aiohttp requests through proxy servers, you change the source IP address for each request or group of requests. For high-volume scraping, simple datacenter proxies might not be enough. Rotating residential proxies, which use IP addresses assigned by ISPs to real home users, are often the gold standard. They provide a high degree of anonymity and legitimacy, making it harder for websites to distinguish your scraper from genuine user traffic. Evomi offers ethically sourced residential proxies starting at just $0.49/GB, perfect for these kinds of tasks.
Let's translate this into practical code.
What You'll Need
Before diving into the code, ensure you have:
Python 3.7 or a newer version installed.
Access to proxy servers. For robust scraping, consider rotating residential proxies. Evomi provides these, along with mobile and datacenter options, and even offers a completely free trial to test them out.
Got everything? Great, let's proceed.
Installing Aiohttp
Getting aiohttp is simple using pip, Python's package installer. Open your terminal or command prompt and type:
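pip install aiohttp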
To verify the installation, you can check the installed package details:
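pip show aiohttp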
This command should display information about the installed aiohttp library, including its version.
A Basic Script Using Aiohttp with a Proxy
Now that the setup is complete, let's write a simple Python script. We'll use `aiohttp` to fetch product names from a test e-commerce site, routing the request through an Evomi residential proxy. For this example, we'll target a test scraping site.
Step 1: Import Libraries
First, we need to import the necessary Python libraries: aiohttp for the web requests and asyncio to run our asynchronous code.
import aiohttp
import asyncio
Step 2: Configure Your Proxy
Define the proxy server details. We'll use Evomi's residential proxy endpoint format. Remember to replace placeholders with your actual credentials.
# Evomi Residential Proxy Configuration (replace with your details)
# Format: http://username:password@hostname:port
proxy_url = "http://user-xyz:pass123@rp.evomi.com:1000"
# If your proxy doesn't require auth embedded in the URL,
# you might use BasicAuth like this (adjust accordingly):
# proxy_url = "http://rp.evomi.com:1000"
# proxy_auth = aiohttp.BasicAuth('user-xyz', 'pass123')
# For this example, we embed auth in the proxy_url.
Note: Evomi offers different ports for HTTP (1000), HTTPS (1001), and SOCKS5 (1002) for residential proxies. Ensure you use the correct one for your needs.
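To illustrate that note, the connection string changes only in the port. These are placeholder credentials mirroring the example above; also note that SOCKS5 proxies are not supported by aiohttp out of the box and need an extra connector package such as aiohttp-socks.

# Residential endpoints by protocol (placeholder credentials)
proxy_http = "http://user-xyz:pass123@rp.evomi.com:1000"   # HTTP (used in this guide)
proxy_https = "http://user-xyz:pass123@rp.evomi.com:1001"  # HTTPS
# SOCKS5 on port 1002 requires an additional package (e.g. aiohttp-socks)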
Step 3: Write the Async Fetching Function
Let's create the asynchronous function that performs the actual web request via the proxy.
# Async function to fetch product names
async def fetch_product_names(session):
    target_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
    print(f"Attempting to fetch data from: {target_url}")
    # Use the proxy configured earlier
    try:
        async with session.get(target_url, proxy=proxy_url) as response:
            # Check if the request was successful
            response.raise_for_status()  # Raises an exception for bad status codes (4xx or 5xx)
            html_content = await response.text()
            # Basic extraction: find lines containing product titles (links with the "title" class)
            # This is a simplified approach; a real scraper would use libraries like BeautifulSoup
            product_lines = [line.strip() for line in html_content.splitlines() if 'class="title"' in line]
            # Further clean-up might be needed depending on HTML structure
            product_names = [line for line in product_lines if 'href' in line]  # Filter for lines likely containing titles
            print("Successfully fetched data.")
            return product_names
    except aiohttp.ClientError as e:
        print(f"An error occurred: {e}")
        return []  # Return empty list on error
- async def fetch_product_names(session): Defines an asynchronous function. async signals it can perform non-blocking operations.
- target_url: The web page we want to scrape.
- session.get(target_url, proxy=proxy_url): Makes an HTTP GET request using the provided session, directing it through our configured proxy_url.
- response.raise_for_status(): A good practice to check if the request succeeded (status code 2xx).
- await response.text(): Asynchronously gets the response body as text. await pauses this function until the text is ready, allowing other tasks to run.
- Extracting Names: This example uses basic string searching ('class="title"') to find lines likely containing product titles. For robust scraping, libraries like Beautiful Soup or lxml are recommended for parsing HTML.
- Error Handling: The try...except block catches potential connection or HTTP errors.
Step 4: Create the Main Async Function
This function orchestrates the process: it creates the aiohttp session and calls our fetching function.
# Main async function to manage the process
async def main():
    # Create a client session to manage connections
    async with aiohttp.ClientSession() as session:
        print("Client session started.")
        product_titles = await fetch_product_names(session)
        if product_titles:
            print("\n--- Extracted Product Title Lines ---")
            for title_line in product_titles:
                # Print the raw line containing the title - further parsing needed for clean titles
                print(title_line)
            print("------------------------------------")
        else:
            print("No product titles extracted.")
    print("Client session closed.")
- async def main(): The main entry point for our async operations.
- aiohttp.ClientSession(): Creates a session object. Using a session is efficient as it can reuse connections and manage cookies.
- await fetch_product_names(session): Calls our fetching function and waits for it to complete.
- Printing Results: Loops through the returned list and prints the lines identified as potentially containing titles.
Step 5: Run the Async Code
Finally, use asyncio.run() to execute the main function.
# Entry point to run the main async function
if __name__ == "__main__":
    print("Starting scraper...")
    asyncio.run(main())
    print("Scraper finished.")
Step 6: Putting It All Together
Here’s the complete script:
import aiohttp
import asyncio

# Evomi Residential Proxy Configuration (replace with your details)
# Format: http://username:password@hostname:port
proxy_url = "http://user-xyz:pass123@rp.evomi.com:1000"

# Async function to fetch product names
async def fetch_product_names(session):
    target_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
    print(f"Attempting to fetch data from: {target_url} via proxy {proxy_url.split('@')[-1]}")  # Hide credentials in log
    try:
        # Use the proxy configured earlier
        async with session.get(target_url, proxy=proxy_url) as response:
            # Check if the request was successful
            response.raise_for_status()  # Raises an exception for bad status codes (4xx or 5xx)
            html_content = await response.text()
            # Basic extraction: find lines containing product titles (links with the "title" class)
            product_lines = [line.strip() for line in html_content.splitlines() if 'class="title"' in line]
            product_names = [line for line in product_lines if 'href' in line]  # Filter for lines likely containing titles
            print(f"Successfully fetched data from {target_url}.")
            return product_names
    except aiohttp.ClientError as e:
        print(f"An error occurred while fetching {target_url}: {e}")
        return []  # Return empty list on error

# Main async function to manage the process
async def main():
    # Create a client session to manage connections
    async with aiohttp.ClientSession() as session:
        print("Client session started.")
        product_titles = await fetch_product_names(session)
        if product_titles:
            print("\n--- Extracted Product Title Lines ---")
            for title_line in product_titles:
                print(title_line)
            print("------------------------------------")
        else:
            print("No product titles extracted.")
    print("Client session closed.")

# Entry point to run the main async function
if __name__ == "__main__":
    print("Starting scraper...")
    asyncio.run(main())
    print("Scraper finished.")
If everything runs correctly, the script will connect through the specified Evomi proxy and print the HTML lines containing the product titles from the target page. This basic example demonstrates proxy integration; next, we'll explore more advanced techniques like rotation.
Advanced Aiohttp Proxy Strategies
The simple script works, but for serious scraping, relying on a single proxy IP isn't ideal. Let's enhance our script to handle multiple proxies and rotate them, significantly improving robustness and reducing the likelihood of blocks.
Managing and Rotating Multiple Proxies
We'll modify the code to use a list of proxies and select one randomly for each request. This distributes the load and makes the scraping pattern less predictable. We'll aim to scrape product names from the first few pages of the laptop category on our test site.
Step 1: Import Additional Libraries
We'll need the random library for selecting proxies, and potentially re (regular expressions) or a parsing library like BeautifulSoup for better data extraction. BeautifulSoup is recommended for real projects, but we'll stick to basic string methods for simplicity here; install it with pip install beautifulsoup4 if you want to use it.
import aiohttp
import asyncio
import random
# import re # If using regex for extraction
# from bs4 import BeautifulSoup # If using BeautifulSoup for parsing
Step 2: Define Your Proxy List
Create a list containing your proxy connection strings. We'll use Evomi's datacenter proxy endpoint format as an example. Datacenter proxies (starting at $0.30/GB with Evomi) can be cost-effective for some tasks, though residential might be needed for stricter sites.
# List of Evomi Datacenter Proxies (replace with your actual proxies)
# Format: http://username:password@hostname:port
proxy_list = [
"http://user-dc1:pass123@dc.evomi.com:2000",
"http://user-dc2:pass456@dc.evomi.com:2000",
"http://user-dc3:pass789@dc.evomi.com:2000",
"http://user-dc4:passabc@dc.evomi.com:2000",
"http://user-dc5:passdef@dc.evomi.com:2000",
]
Reminder: Evomi datacenter proxies use ports 2000 (HTTP), 2001 (HTTPS), and 2002 (SOCKS5).
Step 3: Update the Fetching Function
Modify the function to accept a URL and a specific proxy from the list for each call. We'll also add error handling specific to proxy connections.
# Updated async function to fetch data from a specific URL using a specific proxy
async def fetch_page_data(session, page_url, proxy):
    # Log proxy host, hide credentials
    proxy_host = "N/A"
    if proxy:
        try:
            # Attempt to extract host safely
            proxy_host = proxy.split('@')[-1].split(':')[0]
        except IndexError:
            proxy_host = "Invalid Format"  # Or handle as needed
    print(f"Fetching {page_url} using proxy {proxy_host}...")
    try:
        async with session.get(page_url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=15)) as response:
            response.raise_for_status()  # Check for HTTP errors
            html_content = await response.text()
            # --- Extraction Logic ---
            # Replace this with more robust parsing (e.g., BeautifulSoup)
            product_lines = [line.strip() for line in html_content.splitlines() if 'class="title"' in line]
            product_names = [line for line in product_lines if 'href' in line]
            # --- End Extraction Logic ---
            print(f"Successfully fetched {page_url}")
            return product_names
    # Handle proxy-specific connection errors
    except aiohttp.ClientProxyConnectionError as e:
        print(f"Proxy Connection Error for {proxy_host}: {e}")
        return None  # Indicate failure for this proxy/page
    # Handle other potential client errors (timeout, DNS issues, etc.)
    except aiohttp.ClientError as e:
        print(f"Client Error fetching {page_url} via {proxy_host}: {e}")
        return None
    # Handle timeouts specifically
    except asyncio.TimeoutError:
        print(f"Timeout fetching {page_url} via {proxy_host}")
        return None
- The function now takes page_url and proxy as arguments. random.choice(proxy_list) will be used in the main loop to pick a proxy.
- We added specific error handling for aiohttp.ClientProxyConnectionError.
- A timeout (e.g., 15 seconds) is added to prevent hanging on unresponsive proxies/servers.
The extraction logic remains basic; consider using BeautifulSoup for real-world scenarios:
# Example with BeautifulSoup (install first: pip install beautifulsoup4)
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(html_content, 'html.parser')
# titles = [a['title'] for a in soup.select('a.title')]
# return titles
Step 4: Update the Main Function
The main function will now manage the loop for multiple pages, select proxies randomly, and gather results.
# Updated main function for rotating proxies and multiple pages
async def main():
    base_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
    # Let's try to scrape the first 3 pages (assuming pagination exists or modify URLs accordingly)
    # NOTE: This site might not have simple pagination; adjust target URLs as needed.
    # For this example, we'll just fetch the same page multiple times with different proxies.
    num_requests = 5  # Make 5 requests in total
    target_urls = [base_url] * num_requests  # Re-use the base URL for demo purposes
    all_results = {}  # Dictionary to store results per proxy

    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(num_requests):
            selected_proxy = random.choice(proxy_list)
            page_url = target_urls[i]  # In a real case, this would be page_url_1, page_url_2 etc.
            # Create an asyncio task for each request
            task = asyncio.create_task(fetch_page_data(session, page_url, selected_proxy))
            tasks.append((selected_proxy, task))  # Store proxy with its task

        # Wait for all tasks to complete
        results = await asyncio.gather(*(task for _, task in tasks))

        # Process results
        print("\n--- Scraping Results ---")
        for i, (selected_proxy, _) in enumerate(tasks):
            proxy_host = selected_proxy.split('@')[-1]
            if results[i] is not None:
                print(f"Proxy {proxy_host}: Successfully fetched {len(results[i])} items.")
                if proxy_host not in all_results:
                    all_results[proxy_host] = []
                all_results[proxy_host].extend(results[i])  # Append results for this proxy
            else:
                print(f"Proxy {proxy_host}: Failed to fetch data.")

        # Optional: Print combined results per proxy
        print("\n--- Combined Data Per Proxy ---")
        for proxy_host, data in all_results.items():
            print(f"\nData from Proxy: {proxy_host}")
            # Print first few items as example
            for item in data[:3]:
                print(f"  - {item}")
            if len(data) > 3:
                print(f"  ... and {len(data) - 3} more items")
    print("\nClient session closed.")
- We define the number of requests/pages to fetch.
- random.choice(proxy_list) selects a proxy for each request.
- asyncio.create_task() creates tasks for concurrent execution.
- asyncio.gather() runs all tasks concurrently and collects their results.
- The results are processed, showing which proxy fetched what data (or if it failed).
Step 5: Run the Updated Code
Use the standard asyncio entry point:
# Run the main async function
if __name__ == "__main__":
    if not proxy_list:
        print("Error: Proxy list is empty. Please add proxies.")
    else:
        print("Starting multi-proxy scraper...")
        asyncio.run(main())
        print("Scraper finished.")
Step 6: The Complete Rotating Proxy Script
Here is the full code combining these changes:
import aiohttp
import asyncio
import random
# import re  # Uncomment if using regex
# from bs4 import BeautifulSoup  # Uncomment if using BeautifulSoup

# List of Evomi Datacenter Proxies (replace with your actual proxies)
proxy_list = [
    "http://user-dc1:pass123@dc.evomi.com:2000",
    "http://user-dc2:pass456@dc.evomi.com:2000",
    "http://user-dc3:pass789@dc.evomi.com:2000",
    "http://user-dc4:passabc@dc.evomi.com:2000",
    "http://user-dc5:passdef@dc.evomi.com:2000",
]

# Updated async function to fetch data from a specific URL using a specific proxy
async def fetch_page_data(session, page_url, proxy):
    proxy_host_for_log = proxy.split('@')[-1] if '@' in proxy else proxy  # Log proxy host, hide credentials
    print(f"Fetching {page_url} using proxy {proxy_host_for_log}...")
    try:
        # Increased timeout
        async with session.get(
            page_url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=20)
        ) as response:
            response.raise_for_status()  # Check for HTTP errors
            html_content = await response.text()
            # --- Basic Extraction Logic ---
            product_lines = [
                line.strip()
                for line in html_content.splitlines()
                if 'class="title"' in line
            ]
            product_names = [line for line in product_lines if 'href' in line]
            # --- End Extraction Logic ---
            print(f"Successfully fetched {page_url} via {proxy_host_for_log}")
            return product_names
    except aiohttp.ClientProxyConnectionError as e:
        print(f"Proxy Connection Error for {proxy_host_for_log}: {e}")
        return None
    except aiohttp.ClientError as e:
        print(f"Client Error fetching {page_url} via {proxy_host_for_log}: {e}")
        return None
    except asyncio.TimeoutError:
        print(f"Timeout fetching {page_url} via {proxy_host_for_log}")
        return None

# Updated main function for rotating proxies and multiple pages
async def main():
    base_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
    num_requests = 5
    target_urls = [base_url] * num_requests
    all_results = {}

    # Use a TCPConnector to limit concurrent connections if needed
    # connector = aiohttp.TCPConnector(limit=10)  # Limit to 10 concurrent connections in total
    # async with aiohttp.ClientSession(connector=connector) as session:
    async with aiohttp.ClientSession() as session:  # Default connector
        tasks = []
        # Select proxies for this run
        proxies_in_use = random.sample(proxy_list, min(num_requests, len(proxy_list)))
        for i in range(num_requests):
            # Cycle through the selected proxies if num_requests > len(proxies_in_use)
            selected_proxy = proxies_in_use[i % len(proxies_in_use)]
            page_url = target_urls[i]
            task = asyncio.create_task(
                fetch_page_data(session, page_url, selected_proxy)
            )
            tasks.append((selected_proxy, task))

        # Capture exceptions too
        results = await asyncio.gather(*(task for _, task in tasks), return_exceptions=True)

        print("\n--- Scraping Results ---")
        successful_fetches = 0
        failed_fetches = 0
        for i, (selected_proxy, _) in enumerate(tasks):
            proxy_host = selected_proxy.split('@')[-1] if '@' in selected_proxy else selected_proxy
            if isinstance(results[i], Exception):
                print(f"Proxy {proxy_host}: Task failed with exception: {results[i]}")
                failed_fetches += 1
            elif results[i] is not None:
                print(f"Proxy {proxy_host}: Successfully fetched {len(results[i])} items.")
                if proxy_host not in all_results:
                    all_results[proxy_host] = []
                all_results[proxy_host].extend(results[i])
                successful_fetches += 1
            else:
                # This case might happen if fetch_page_data returned None without an exception
                print(f"Proxy {proxy_host}: Failed to fetch data (returned None).")
                failed_fetches += 1

        print(f"\nSummary: {successful_fetches} successful fetches, {failed_fetches} failed fetches.")

        # Optional: Print combined results per proxy
        print("\n--- Combined Data Per Proxy (Sample) ---")
        for proxy_host, data in all_results.items():
            print(f"\nData from Proxy: {proxy_host} ({len(data)} total items)")
            for item in data[:3]:
                print(f"  - {item}")
            if len(data) > 3:
                print(f"  ... and {len(data) - 3} more items")
    print("\nClient session closed.")

# Run the main async function
if __name__ == "__main__":
    if not proxy_list:
        print("Error: Proxy list is empty. Please add proxies.")
    else:
        print(f"Starting multi-proxy scraper with {len(proxy_list)} proxies...")
        asyncio.run(main())
        print("Scraper finished.")
When you run this script, you'll see output indicating which proxy is being used for each request. The final summary will show the data collected via each proxy IP, demonstrating successful rotation. This approach significantly enhances the resilience of your scraper against IP-based blocks.
Handling Proxy Authentication
Our examples already incorporate proxy authentication directly within the proxy URL string:
http://USERNAME:PASSWORD@HOSTNAME:PORT
aiohttp automatically parses this format. When you pass a URL like "http://user-dc1:pass123@dc.evomi.com:2000" to the proxy parameter in session.get(), aiohttp handles the necessary Proxy-Authorization header for Basic Authentication.
Alternatively, if your proxy provider requires it, or if you prefer keeping credentials separate, you can use aiohttp.BasicAuth:
proxy_url_no_auth = "http://dc.evomi.com:2000"
auth = aiohttp.BasicAuth("user-dc1", "pass123")

# ... inside your async function ...
async with session.get(target_url, proxy=proxy_url_no_auth, proxy_auth=auth) as response:
    # ... rest of the code
Both methods achieve the same result. Using the embedded format is often more convenient when managing lists of proxies.
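If you do keep credentials separate, one convenient middle ground is to build the embedded-auth URLs programmatically. A small sketch, where the credential pairs and endpoint are placeholders rather than real accounts:

# Hypothetical credential pairs and endpoint, for illustration only
credentials = [("user-dc1", "pass123"), ("user-dc2", "pass456")]
endpoint = "dc.evomi.com:2000"

# Build embedded-auth proxy URLs from the separate credential pairs
proxy_list = [f"http://{user}:{password}@{endpoint}" for user, password in credentials]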
Securing Connections with SSL
When scraping sites over HTTPS or handling sensitive data, ensuring your connection is encrypted via SSL/TLS is vital. aiohttp handles SSL verification by default when connecting to HTTPS URLs.
Our examples used HTTP URLs (http://...). If you target HTTPS sites (https://...), aiohttp will automatically attempt an SSL handshake. By default, it verifies the server's SSL certificate against a trusted set of Certificate Authorities (CAs), usually provided by the certifi library.
You generally don't need to manually configure SSL unless:
You need to trust a self-signed certificate (common in testing environments).
You want to disable SSL verification (strongly discouraged for production as it opens you to man-in-the-middle attacks).
You need to specify a particular set of CAs.
To customize SSL behavior, you create an ssl.SSLContext:
import ssl
# Create a default SSL context (recommended starting point)
ssl_context = ssl.create_default_context()
# Example: Load custom CA bundle (if needed)
# ssl_context.load_verify_locations(cafile='/path/to/custom/ca.crt')
# Example: Disable verification (DANGEROUS - for testing only)
# ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
# ssl_context.check_hostname = False
# ssl_context.verify_mode = ssl.CERT_NONE
# Pass the context to the session.get call
async with session.get(https_url, proxy=proxy, ssl=ssl_context) as response:
    # ... process the response ...
For most scraping tasks involving standard HTTPS websites, the default SSL handling in aiohttp is sufficient and secure.
Best Practices for Aiohttp Proxy Usage
You've now got the technical skills to integrate and rotate proxies with aiohttp. To maximize effectiveness and minimize disruptions, consider these best practices:
Tips for Staying Under the Radar
Choose Quality Proxies: Not all proxies are equal. Opt for reputable providers like Evomi, known for reliable and ethically sourced residential or mobile proxies. These blend in better with normal user traffic compared to datacenter IPs, especially on stricter websites. Our Swiss base also reflects a commitment to quality and privacy. You can always verify proxy performance using tools like our free Proxy Tester.
Implement Smart Rotation: Don't just use multiple proxies; rotate them intelligently. Avoid hitting the same domain repeatedly with the same IP in a short period. The random rotation shown earlier is a good start. For large-scale scraping, consider session-based rotation (keeping one IP for a user's "session" on a site) or geographic targeting if needed.
Mimic Human Behavior: Automation is fast, humans aren't always. Introduce random delays between requests to avoid predictable, machine-like patterns.
# random and asyncio are already imported in the scripts above
# Inside your loop, before making a request:
sleep_time = random.uniform(1.5, 4.5)  # Wait 1.5 to 4.5 seconds
print(f"Sleeping for {sleep_time:.2f} seconds...")
await asyncio.sleep(sleep_time)
# Now make the request...
Manage Headers and Fingerprints: Send realistic User-Agent strings and other HTTP headers that match common browsers (see the sketch after this list). Be aware of browser fingerprinting techniques websites might use. Tools like Evomi's Browser Fingerprint Checker can show what sites see, and our antidetect browser, Evomium (free for customers), is designed to manage these fingerprints effectively.
Respect robots.txt: While not technically related to anonymity, respecting a site's robots.txt file (which outlines scraping rules) is good practice and can prevent legal or ethical issues.
Use SSL/TLS Correctly: Always use HTTPS where available and ensure SSL verification is enabled unless you have a very specific, understood reason to disable it.
Handle CAPTCHAs Gracefully: If you encounter CAPTCHAs, integrate a solving service. Don't just give up or hammer the site. Check out options in our review of top CAPTCHA solvers.
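Tying the header tip above into code, here is a minimal sketch of attaching browser-like default headers to an aiohttp session. It reuses the placeholder target and proxy from the earlier examples, and the User-Agent string is just an illustrative value, not a requirement.

import asyncio
import aiohttp

# Placeholder target and proxy, reused from the earlier examples
target_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
proxy_url = "http://user-xyz:pass123@rp.evomi.com:1000"

# Browser-like headers applied to every request made with this session
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

async def fetch_with_headers():
    async with aiohttp.ClientSession(headers=browser_headers) as session:
        async with session.get(target_url, proxy=proxy_url) as response:
            return await response.text()

if __name__ == "__main__":
    asyncio.run(fetch_with_headers())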
Common Problems and Fixes
Even with best practices, you might hit snags. Here are common aiohttp proxy-related errors and how to approach them:
aiohttp.ClientProxyConnectionError: This usually means your script couldn't even reach the proxy server.
- Check Proxy Details: Double-check the IP/hostname, port, username, and password. Typos are common!
- Verify Proxy Status: Is the proxy online and working? Use a tool like Evomi's Proxy Tester or a simple curl -x http://user:pass@proxy:port http://example.com from your terminal.
- Firewall Issues: Ensure no local or network firewall is blocking the connection to the proxy port.

aiohttp.ClientHttpProxyError / Status Code 407 Proxy Authentication Required: The connection to the proxy worked, but authentication failed.
- Check Credentials: Verify the username and password again.
- Authentication Format: Ensure you're using the correct authentication method (Basic Auth is common, handled by the URL format or aiohttp.BasicAuth). Check your provider's documentation.
- IP Authorization: Some providers require you to authorize the IP address *from which* you are connecting to the proxy. Check your Evomi dashboard or provider's settings.

aiohttp.ClientHttpProxyError / Other 4xx/5xx Status Codes from the Proxy: The proxy responded, but with an error (e.g., 403 Forbidden, 502 Bad Gateway).
- Proxy Restrictions: The proxy itself might be blocked from accessing the target site, or it might have internal issues. Try a different proxy from your pool.
- Provider Issue: There might be a temporary problem with the proxy service. Check the provider's status page or contact support.

asyncio.TimeoutError or aiohttp.ServerTimeoutError: The request took too long.
- Increase Timeout: The default timeout might be too short for slow proxies or target sites. Increase it in aiohttp.ClientTimeout(total=...) passed to the request or session.
- Proxy Performance: The specific proxy might be slow or overloaded. Rotate to a different one.
- Target Server Slow: The website you're scraping might be slow to respond.

aiohttp.ClientSSLError: An issue occurred during the SSL handshake with the *target* server (when using HTTPS).
- Outdated Certificates: Ensure your system's CA certificates (often managed by `certifi`) are up to date (`pip install --upgrade certifi`).
- Server Configuration: The target website might have an invalid or misconfigured SSL certificate. You might need to investigate further or, as a last resort (and if you understand the risks), customize the SSL context to be less strict (see the SSL section above).
- Proxy Interference (Less Common): Some proxies (especially transparent ones, not typically used for scraping this way) might interfere with SSL. Ensure you're using appropriate HTTP/S or SOCKS proxies designed for this.
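One practical way to absorb several of these failures automatically is to retry a failed page through a different proxy before giving up. Below is a minimal sketch of that pattern; it assumes the fetch_page_data function and proxy_list from the rotating-proxy script above, and the attempt limit is an arbitrary choice, not a recommendation.

async def fetch_with_retries(session, page_url, proxies, max_attempts=3):
    # Try up to max_attempts different proxies before giving up on this page
    tried = set()
    for _ in range(max_attempts):
        # Prefer proxies not yet tried for this page; fall back to the full pool
        remaining = [p for p in proxies if p not in tried] or list(proxies)
        proxy = random.choice(remaining)
        tried.add(proxy)
        result = await fetch_page_data(session, page_url, proxy)
        if result is not None:  # fetch_page_data returns None on any failure
            return result
    print(f"All {max_attempts} attempts failed for {page_url}")
    return None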
Wrapping Up
Hopefully, this guide provides a solid foundation for using `aiohttp` with proxies for your web scraping projects. The asynchronous nature of `aiohttp` offers significant performance benefits, while proxies provide the necessary means to scrape responsibly and avoid interruptions. Remember that successful scraping often involves combining the right tools (`aiohttp`, quality proxies like those from Evomi) with smart strategies (rotation, delays, header management).
Python's ecosystem offers many tools for web scraping beyond `aiohttp`. To explore other options, take a look at our overview of the best Python web scraping libraries.
Diving into Asynchronous Web Scraping with Aiohttp and Proxies
Extracting data from the web, or web scraping, is a cornerstone technique for everything from tracking e-commerce prices to aggregating news feeds or monitoring financial markets. It's about using code to fetch information automatically and efficiently. If you're venturing into building your own scraper, getting comfortable with some coding is essential.
Python is a fantastic choice for this, largely thanks to its straightforward syntax and powerful libraries. When it comes to making HTTP requests, Python offers several popular options, including aiohttp, httpx, and the classic requests library. Each has its own strengths, which you can explore further in our comparison of httpx vs aiohttp vs requests.
This guide focuses on aiohttp
, an asynchronous library. What does "asynchronous" mean here? It means aiohttp
can juggle multiple web requests simultaneously without getting stuck waiting for each one to finish. This makes it incredibly efficient for tasks requiring many concurrent connections. However, to truly harness this power without running into roadblocks like IP bans, integrating proxies is crucial. Proxies help mask your origin and distribute your requests, making your scraping activities less likely to be flagged by target websites.
Here’s what we'll cover:
The basics of how
aiohttp
achieves concurrency.Getting
aiohttp
set up on your machine.Integrating proxies (like Evomi's residential proxies) with
aiohttp
.Smart strategies for using proxies with
aiohttp
effectively.
Let's get started!
What is Aiohttp Anyway?
Think about needing to grab real-time data from several different online sources, perhaps fetching currency exchange rates from multiple financial websites at once. With traditional, synchronous libraries (like the standard `requests` library), your program would make a request, wait for the response, process it, then move to the next request, one by one. This sequential process can become a significant bottleneck, especially when dealing with many slow or unresponsive servers.
Aiohttp
tackles this differently using Python's asyncio
framework. It allows your program to initiate a request and then immediately move on to other tasks (like starting another request) without waiting for the first one to complete. When a response arrives, aiohttp
handles it. This non-blocking approach means you can manage numerous HTTP operations concurrently, drastically speeding up I/O-bound tasks like web scraping.
Setting Up Aiohttp with Proxies
So, aiohttp
lets you fire off requests rapidly. That's great for speed, but it also increases the chances of overwhelming a target server or triggering its defenses. Websites often monitor incoming traffic, and a sudden flood of requests from a single IP address is a classic sign of automated scraping. This can lead to rate limiting (slowing you down), CAPTCHAs, or outright IP blocks.
This is where proxies become indispensable. By routing your aiohttp
requests through proxy servers, you change the source IP address for each request or group of requests. For high-volume scraping, simple datacenter proxies might not be enough. Rotating residential proxies, which use IP addresses assigned by ISPs to real home users, are often the gold standard. They provide a high degree of anonymity and legitimacy, making it harder for websites to distinguish your scraper from genuine user traffic. Evomi offers ethically sourced residential proxies starting at just $0.49/GB, perfect for these kinds of tasks.
Let's translate this into practical code.
What You'll Need
Before diving into the code, ensure you have:
Python 3.7 or a newer version installed.
Access to proxy servers. For robust scraping, consider rotating residential proxies. Evomi provides these, along with mobile and datacenter options, and even offers a completely free trial to test them out.
Got everything? Great, let's proceed.
Installing Aiohttp
Getting aiohttp
is simple using pip, Python's package installer. Open your terminal or command prompt and type:
To verify the installation, you can check the installed package details:
This command should display information about the installed aiohttp
library, including its version.
A Basic Script Using Aiohttp with a Proxy
Now that the setup is complete, let's write a simple Python script. We'll use `aiohttp` to fetch product names from a test e-commerce site, routing the request through an Evomi residential proxy. For this example, we'll target a test scraping site.
Step 1: Import Libraries
First, we need to import the necessary Python libraries: aiohttp
for the web requests and asyncio
to run our asynchronous code.
import aiohttp
import asyncio
Step 2: Configure Your Proxy
Define the proxy server details. We'll use Evomi's residential proxy endpoint format. Remember to replace placeholders with your actual credentials.
# Evomi Residential Proxy Configuration (replace with your details)
# Format: http://username:password@hostname:port
proxy_url = "http://user-xyz:pass123@rp.evomi.com:1000"
# If your proxy doesn't require auth embedded in the URL,
# you might use BasicAuth like this (adjust accordingly):
# proxy_url = "http://rp.evomi.com:1000"
# proxy_auth = aiohttp.BasicAuth('user-xyz', 'pass123')
# For this example, we embed auth in the proxy_url.
Note: Evomi offers different ports for HTTP (1000), HTTPS (1001), and SOCKS5 (1002) for residential proxies. Ensure you use the correct one for your needs.
Step 3: Write the Async Fetching Function
Let's create the asynchronous function that performs the actual web request via the proxy.
# Async function to fetch product names
async def fetch_product_names(session):
target_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
print(f"Attempting to fetch data from: {target_url}")
# Use the proxy configured earlier
try:
async with session.get(target_url, proxy=proxy_url) as response:
# Check if the request was successful
response.raise_for_status() # Raises an exception for bad status codes (4xx or 5xx)
html_content = await response.text()
# Basic extraction: find lines containing product titles (usually in links within description class)
# This is a simplified approach; a real scraper would use libraries like BeautifulSoup
product_lines = [line.strip() for line in html_content.splitlines() if 'class="title"' in line]
# Further clean-up might be needed depending on HTML structure
product_names = [line for line in product_lines if 'href' in line] # Filter for lines likely containing titles
print("Successfully fetched data.")
return product_names
except aiohttp.ClientError as e:
print(f"An error occurred: {e}")
return [] # Return empty list on error
async def fetch_product_names(session)
: Defines an asynchronous function.async
signals it can perform non-blocking operations.target_url
: The web page we want to scrape.session.get(target_url, proxy=proxy_url)
: Makes an HTTP GET request using the provided session, directing it through our configuredproxy_url
.response.raise_for_status()
: A good practice to check if the request succeeded (status code 2xx).await response.text()
: Asynchronously gets the response body as text.await
pauses this function until the text is ready, allowing other tasks to run.Extracting Names: This example uses basic string searching (
'class="title"'
) to find lines likely containing product titles. For robust scraping, libraries like Beautiful Soup or lxml are recommended for parsing HTML.Error Handling: The
try...except
block catches potential connection or HTTP errors.
Step 4: Create the Main Async Function
This function orchestrates the process: it creates the aiohttp
session and calls our fetching function.
# Main async function to manage the process
async def main():
# Create a client session to manage connections
async with aiohttp.ClientSession() as session:
print("Client session started.")
product_titles = await fetch_product_names(session)
if product_titles:
print("\n--- Extracted Product Title Lines ---")
for title_line in product_titles:
# Print the raw line containing the title - further parsing needed for clean titles
print(title_line)
print("------------------------------------")
else:
print("No product titles extracted.")
print("Client session closed.")
async def main()
: The main entry point for our async operations.aiohttp.ClientSession()
: Creates a session object. Using a session is efficient as it can reuse connections and manage cookies.await fetch_product_names(session)
: Calls our fetching function and waits for it to complete.Printing Results: Loops through the returned list and prints the lines identified as potentially containing titles.
Step 5: Run the Async Code
Finally, use asyncio.run()
to execute the main
function.
# Entry point to run the main async function
if __name__ == "__main__":
print("Starting scraper...")
asyncio.run(main())
print("Scraper finished.")
Step 6: Putting It All Together
Here’s the complete script:
import aiohttp
import asyncio
# Evomi Residential Proxy Configuration (replace with your details)
# Format: http://username:password@hostname:port
proxy_url = "http://user-xyz:pass123@rp.evomi.com:1000"
# Async function to fetch product names
async def fetch_product_names(session):
target_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
print(f"Attempting to fetch data from: {target_url} via proxy {proxy_url.split('@')[-1]}") # Hide credentials in log
try:
# Use the proxy configured earlier
async with session.get(target_url, proxy=proxy_url) as response:
# Check if the request was successful
response.raise_for_status() # Raises an exception for bad status codes (4xx or 5xx)
html_content = await response.text()
# Basic extraction: find lines containing product titles (usually in links within description class)
product_lines = [line.strip() for line in html_content.splitlines() if 'class="title"' in line]
product_names = [line for line in product_lines if 'href' in line] # Filter for lines likely containing titles
print(f"Successfully fetched data from {target_url}.")
return product_names
except aiohttp.ClientError as e:
print(f"An error occurred while fetching {target_url}: {e}")
return [] # Return empty list on error
# Main async function to manage the process
async def main():
# Create a client session to manage connections
async with aiohttp.ClientSession() as session:
print("Client session started.")
product_titles = await fetch_product_names(session)
if product_titles:
print("\n--- Extracted Product Title Lines ---")
for title_line in product_titles:
print(title_line)
print("------------------------------------")
else:
print("No product titles extracted.")
print("Client session closed.")
# Entry point to run the main async function
if __name__ == "__main__":
print("Starting scraper...")
asyncio.run(main())
print("Scraper finished.")
If everything runs correctly, the script will connect through the specified Evomi proxy and print the HTML lines containing the product titles from the target page. This basic example demonstrates proxy integration; next, we'll explore more advanced techniques like rotation.
Advanced Aiohttp Proxy Strategies
The simple script works, but for serious scraping, relying on a single proxy IP isn't ideal. Let's enhance our script to handle multiple proxies and rotate them, significantly improving robustness and reducing the likelihood of blocks.
Managing and Rotating Multiple Proxies
We'll modify the code to use a list of proxies and select one randomly for each request. This distributes the load and makes the scraping pattern less predictable. We'll aim to scrape product names from the first few pages of the laptop category on our test site.
Step 1: Import Additional Libraries
We'll need the random
library for selecting proxies and potentially re
(regular expressions) or a parsing library like BeautifulSoup
(recommended, but we'll stick to basic string methods for simplicity here, install with pip install beautifulsoup4
if you want to use it) for better data extraction.
import aiohttp
import asyncio
import random
# import re # If using regex for extraction
# from bs4 import BeautifulSoup # If using BeautifulSoup for parsing
Step 2: Define Your Proxy List
Create a list containing your proxy connection strings. We'll use Evomi's datacenter proxy endpoint format as an example. Datacenter proxies (starting at $0.30/GB with Evomi) can be cost-effective for some tasks, though residential might be needed for stricter sites.
# List of Evomi Datacenter Proxies (replace with your actual proxies)
# Format: http://username:password@hostname:port
proxy_list = [
"http://user-dc1:pass123@dc.evomi.com:2000",
"http://user-dc2:pass456@dc.evomi.com:2000",
"http://user-dc3:pass789@dc.evomi.com:2000",
"http://user-dc4:passabc@dc.evomi.com:2000",
"http://user-dc5:passdef@dc.evomi.com:2000",
]
Reminder: Evomi datacenter proxies use ports 2000 (HTTP), 2001 (HTTPS), and 2002 (SOCKS5).
Step 3: Update the Fetching Function
Modify the function to accept a URL and a specific proxy from the list for each call. We'll also add error handling specific to proxy connections.
# Updated async function to fetch data from a specific URL using a specific proxy
async def fetch_page_data(session, page_url, proxy):
# Log proxy host, hide credentials
proxy_host = "N/A"
if proxy:
try:
# Attempt to extract host safely
proxy_host = proxy.split('@')[-1].split(':')[0]
except IndexError:
proxy_host = "Invalid Format" # Or handle as needed
print(f"Fetching {page_url} using proxy {proxy_host}...")
try:
async with session.get(page_url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=15)) as response:
response.raise_for_status() # Check for HTTP errors
html_content = await response.text()
# --- Extraction Logic ---
# Replace this with more robust parsing (e.g., BeautifulSoup)
product_lines = [line.strip() for line in html_content.splitlines() if 'class="title"' in line]
product_names = [line for line in product_lines if 'href' in line]
# --- End Extraction Logic ---
print(f"Successfully fetched {page_url}")
return product_names
# Handle proxy-specific connection errors
except aiohttp.ClientProxyConnectionError as e:
print(f"Proxy Connection Error for {proxy_host}: {e}")
return None # Indicate failure for this proxy/page
# Handle other potential client errors (timeout, DNS issues, etc.)
except aiohttp.ClientError as e:
print(f"Client Error fetching {page_url} via {proxy_host}: {e}")
return None
# Handle timeouts specifically
except asyncio.TimeoutError:
print(f"Timeout fetching {page_url} via {proxy_host}")
return None
The function now takes
page_url
andproxy
as arguments.random.choice(proxy_list)
will be used in the main loop to pick a proxy.We added specific error handling for
aiohttp.ClientProxyConnectionError
.A timeout (e.g., 15 seconds) is added to prevent hanging on unresponsive proxies/servers.
The extraction logic remains basic; consider using BeautifulSoup for real-world scenarios:
# Example with BeautifulSoup (install first: pip install beautifulsoup4) # from bs4 import BeautifulSoup # soup = BeautifulSoup(html_content, 'html.parser') # titles = [a['title'] for a in soup.select('a.title')] # return titles
Step 4: Update the Main Function
The main function will now manage the loop for multiple pages, select proxies randomly, and gather results.
# Updated main function for rotating proxies and multiple pages
async def main():
base_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
# Let's try to scrape the first 3 pages (assuming pagination exists or modify URLs accordingly)
# NOTE: This site might not have simple pagination; adjust target URLs as needed.
# For this example, we'll just fetch the same page multiple times with different proxies.
num_requests = 5 # Make 5 requests in total
target_urls = [base_url] * num_requests # Re-use the base URL for demo purposes
all_results = {} # Dictionary to store results per proxy
async with aiohttp.ClientSession() as session:
tasks = []
for i in range(num_requests):
selected_proxy = random.choice(proxy_list)
page_url = target_urls[i] # In a real case, this would be page_url_1, page_url_2 etc.
# Create an asyncio task for each request
task = asyncio.create_task(fetch_page_data(session, page_url, selected_proxy))
tasks.append((selected_proxy, task)) # Store proxy with its task
# Wait for all tasks to complete
results = await asyncio.gather(*(task for _, task in tasks))
# Process results
print("\n--- Scraping Results ---")
for i, (selected_proxy, _) in enumerate(tasks):
proxy_host = selected_proxy.split('@')[-1]
if results[i] is not None:
print(f"Proxy {proxy_host}: Successfully fetched {len(results[i])} items.")
if proxy_host not in all_results:
all_results[proxy_host] = []
all_results[proxy_host].extend(results[i]) # Append results for this proxy
else:
print(f"Proxy {proxy_host}: Failed to fetch data.")
# Optional: Print combined results per proxy
print("\n--- Combined Data Per Proxy ---")
for proxy_host, data in all_results.items():
print(f"\nData from Proxy: {proxy_host}")
# Print first few items as example
for item in data[:3]:
print(f" - {item}")
if len(data) > 3:
print(f" ... and {len(data) - 3} more items")
print("\nClient session closed.")
We define the number of requests/pages to fetch.
random.choice(proxy_list)
selects a proxy for each request.asyncio.create_task()
creates tasks for concurrent execution.asyncio.gather()
runs all tasks concurrently and collects their results.The results are processed, showing which proxy fetched what data (or if it failed).
Step 5: Run the Updated Code
Use the standard asyncio entry point:
# Run the main async function
if __name__ == "__main__":
if not proxy_list:
print("Error: Proxy list is empty. Please add proxies.")
else:
print("Starting multi-proxy scraper...")
asyncio.run(main())
print("Scraper finished.")
Step 6: The Complete Rotating Proxy Script
Here is the full code combining these changes:
import aiohttp
import asyncio
import random
# import re # Uncomment if using regex
# from bs4 import BeautifulSoup # Uncomment if using BeautifulSoup
# List of Evomi Datacenter Proxies (replace with your actual proxies)
proxy_list = [
"http://user-dc1:pass123@dc.evomi.com:2000",
"http://user-dc2:pass456@dc.evomi.com:2000",
"http://user-dc3:pass789@dc.evomi.com:2000",
"http://user-dc4:passabc@dc.evomi.com:2000",
"http://user-dc5:passdef@dc.evomi.com:2000",
]
# Updated async function to fetch data from a specific URL using a specific proxy
async def fetch_page_data(session, page_url, proxy):
proxy_host_for_log = proxy.split('@')[-1] if '@' in proxy else proxy # Log proxy host, hide credentials
print(f"Fetching {page_url} using proxy {proxy_host_for_log}...")
try:
# Increased timeout
async with session.get(
page_url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=20)
) as response:
response.raise_for_status() # Check for HTTP errors
html_content = await response.text()
# --- Basic Extraction Logic ---
product_lines = [
line.strip()
for line in html_content.splitlines()
if 'class="title"' in line
]
product_names = [line for line in product_lines if 'href' in line]
# --- End Extraction Logic ---
print(f"Successfully fetched {page_url} via {proxy_host_for_log}")
return product_names
except aiohttp.ClientProxyConnectionError as e:
print(f"Proxy Connection Error for {proxy_host_for_log}: {e}")
return None
except aiohttp.ClientError as e:
print(f"Client Error fetching {page_url} via {proxy_host_for_log}: {e}")
return None
except asyncio.TimeoutError:
print(f"Timeout fetching {page_url} via {proxy_host_for_log}")
return None
# Updated main function for rotating proxies and multiple pages
async def main():
base_url = "http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
num_requests = 5
target_urls = [base_url] * num_requests
all_results = {}
# Use a TCPConnector to limit concurrent connections if needed
# connector = aiohttp.TCPConnector(limit=10) # Limit to 10 concurrent connections per host
# async with aiohttp.ClientSession(connector=connector) as session:
async with aiohttp.ClientSession() as session: # Default connector
tasks = []
# Select proxies for this run
proxies_in_use = random.sample(proxy_list, min(num_requests, len(proxy_list)))
for i in range(num_requests):
# Cycle through the selected proxies if num_requests > len(proxies_in_use)
selected_proxy = proxies_in_use[i % len(proxies_in_use)]
page_url = target_urls[i]
task = asyncio.create_task(
fetch_page_data(session, page_url, selected_proxy)
)
tasks.append((selected_proxy, task))
# Capture exceptions too
results = await asyncio.gather(*(task for _, task in tasks), return_exceptions=True)
print("\n--- Scraping Results ---")
successful_fetches = 0
failed_fetches = 0
for i, (selected_proxy, _) in enumerate(tasks):
proxy_host = selected_proxy.split('@')[-1] if '@' in selected_proxy else selected_proxy
if isinstance(results[i], Exception):
print(f"Proxy {proxy_host}: Task failed with exception: {results[i]}")
failed_fetches += 1
elif results[i] is not None:
print(f"Proxy {proxy_host}: Successfully fetched {len(results[i])} items.")
if proxy_host not in all_results:
all_results[proxy_host] = []
all_results[proxy_host].extend(results[i])
successful_fetches += 1
else:
# This case might happen if fetch_page_data returned None without an exception
print(f"Proxy {proxy_host}: Failed to fetch data (returned None).")
failed_fetches += 1
print(f"\nSummary: {successful_fetches} successful fetches, {failed_fetches} failed fetches.")
# Optional: Print combined results per proxy
print("\n--- Combined Data Per Proxy (Sample) ---")
for proxy_host, data in all_results.items():
print(f"\nData from Proxy: {proxy_host} ({len(data)} total items)")
for item in data[:3]:
print(f" - {item}")
if len(data) > 3:
print(f" ... and {len(data) - 3} more items")
print("\nClient session closed.")
# Run the main async function
if __name__ == "__main__":
if not proxy_list:
print("Error: Proxy list is empty. Please add proxies.")
else:
print(f"Starting multi-proxy scraper with {len(proxy_list)} proxies...")
asyncio.run(main())
print("Scraper finished.")
When you run this script, you'll see output indicating which proxy is being used for each request. The final summary will show the data collected via each proxy IP, demonstrating successful rotation. This approach significantly enhances the resilience of your scraper against IP-based blocks.
Handling Proxy Authentication
Our examples already incorporate proxy authentication directly within the proxy URL string:
http://USERNAME:PASSWORD@HOSTNAME:PORT
aiohttp
automatically parses this format. When you pass a URL like "http://user-dc1:pass123@dc.evomi.com:2000"
to the proxy
parameter in session.get()
, aiohttp
handles the necessary Proxy-Authorization
header for Basic Authentication.
Alternatively, if your proxy provider requires or if you prefer separating credentials, you can use aiohttp.BasicAuth
:
proxy_url_no_auth = "http://dc.evomi.com:2000"
auth = aiohttp.BasicAuth("user-dc1", "pass123")
# ... inside your async function ...
async with session.get(target_url, proxy=proxy_url_no_auth, proxy_auth=auth) as response:
# ... rest of the code
Both methods achieve the same result. Using the embedded format is often more convenient when managing lists of proxies.
Securing Connections with SSL
When scraping sites over HTTPS or handling sensitive data, ensuring your connection is encrypted via SSL/TLS is vital. aiohttp
handles SSL verification by default when connecting to HTTPS URLs.
Our examples used HTTP URLs (http://...
). If you target HTTPS sites (https://...
), aiohttp
will automatically attempt an SSL handshake. By default, it verifies the server's SSL certificate against a trusted set of Certificate Authorities (CAs), usually provided by the certifi
library.
You generally don't need to manually configure SSL unless:
You need to trust a self-signed certificate (common in testing environments).
You want to disable SSL verification (strongly discouraged for production as it opens you to man-in-the-middle attacks).
You need to specify a particular set of CAs.
To customize SSL behavior, you create an ssl.SSLContext
:
import ssl
# Create a default SSL context (recommended starting point)
ssl_context = ssl.create_default_context()
# Example: Load custom CA bundle (if needed)
# ssl_context.load_verify_locations(cafile='/path/to/custom/ca.crt')
# Example: Disable verification (DANGEROUS - for testing only)
# ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
# ssl_context.check_hostname = False
# ssl_context.verify_mode = ssl.CERT_NONE
# Pass the context to the session.get call
async with session.get(https_url, proxy=proxy, ssl=ssl_context) as response:
# ...
For most scraping tasks involving standard HTTPS websites, the default SSL handling in aiohttp
is sufficient and secure.
Best Practices for Aiohttp Proxy Usage
You've now got the technical skills to integrate and rotate proxies with aiohttp
. To maximize effectiveness and minimize disruptions, consider these best practices:
Tips for Staying Under the Radar
Choose Quality Proxies: Not all proxies are equal. Opt for reputable providers like Evomi, known for reliable and ethically sourced residential or mobile proxies. These blend in better with normal user traffic compared to datacenter IPs, especially on stricter websites. Our Swiss base also reflects a commitment to quality and privacy. You can always verify proxy performance using tools like our free Proxy Tester.
Implement Smart Rotation: Don't just use multiple proxies; rotate them intelligently. Avoid hitting the same domain repeatedly with the same IP in a short period. The random rotation shown earlier is a good start. For large-scale scraping, consider session-based rotation (keeping one IP for a user's "session" on a site) or geographic targeting if needed.
Mimic Human Behavior: Automation is fast, humans aren't always. Introduce random delays between requests to avoid predictable, machine-like patterns.
import time # Inside your loop or before making a request sleep_time = random.uniform(1.5, 4.5) # Wait 1.5 to 4.5 seconds print(f"Sleeping for {sleep_time:.2f} seconds...") await asyncio.sleep(sleep_time) # Now make the request...
Manage Headers and Fingerprints: Send realistic User-Agent strings and other HTTP headers that match common browsers. Be aware of browser fingerprinting techniques websites might use. Tools like Evomi's Browser Fingerprint Checker can show what sites see, and our antidetect browser, Evomium (free for customers), is designed to manage these fingerprints effectively.
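At the HTTP level, a minimal sketch of setting browser-like defaults on the session looks like this; the header values and the fetch_with_headers helper are illustrative placeholders, not a guaranteed-safe fingerprint:
import aiohttp
import asyncio

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

async def fetch_with_headers(url, proxy=None):
    # Session-wide defaults; individual requests can still override them
    async with aiohttp.ClientSession(headers=BROWSER_HEADERS) as session:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()

# asyncio.run(fetch_with_headers("http://webscraper.io/test-sites/e-commerce/allinone"))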
Respect robots.txt: While not technically related to anonymity, respecting a site's robots.txt file (which outlines scraping rules) is good practice and can prevent legal or ethical issues (a quick check is sketched after this list).
Use SSL/TLS Correctly: Always use HTTPS where available and ensure SSL verification is enabled unless you have a very specific, understood reason to disable it.
Handle CAPTCHAs Gracefully: If you encounter CAPTCHAs, integrate a solving service. Don't just give up or hammer the site. Check out options in our review of top CAPTCHA solvers.
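For the robots.txt tip above, Python's standard library already includes a parser. A minimal sketch, where the user-agent string is a placeholder and the URLs reuse the test site from this guide:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://webscraper.io/robots.txt")
rp.read()  # Fetch and parse the rules once, e.g. at scraper startup

# Check a specific path before queueing it for scraping
target = "http://webscraper.io/test-sites/e-commerce/allinone"
if rp.can_fetch("MyScraperBot/1.0", target):
    print("Allowed to scrape:", target)
else:
    print("Disallowed by robots.txt:", target)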
Common Problems and Fixes
Even with best practices, you might hit snags. Here are common aiohttp proxy-related errors and how to approach them:
aiohttp.ClientProxyConnectionError: This usually means your script couldn't even reach the proxy server.
Check Proxy Details: Double-check the IP/hostname, port, username, and password. Typos are common!
Verify Proxy Status: Is the proxy online and working? Use a tool like Evomi's Proxy Tester or a simple curl -x http://user:pass@proxy:port http://example.com from your terminal.
Firewall Issues: Ensure no local or network firewall is blocking the connection to the proxy port.
aiohttp.ClientHttpProxyError / Status Code 407 Proxy Authentication Required: The connection to the proxy worked, but authentication failed.
Check Credentials: Verify the username and password again.
Authentication Format: Ensure you're using the correct authentication method (Basic Auth is common, handled by the URL format or aiohttp.BasicAuth). Check your provider's documentation.
IP Authorization: Some providers require you to authorize the IP address *from which* you are connecting to the proxy. Check your Evomi dashboard or provider's settings.
aiohttp.ClientHttpProxyError / Other 4xx/5xx Status Codes from Proxy: The proxy responded, but with an error (e.g., 403 Forbidden, 502 Bad Gateway).
Proxy Restrictions: The proxy itself might be blocked from accessing the target site, or it might have internal issues. Try a different proxy from your pool.
Provider Issue: There might be a temporary problem with the proxy service. Check the provider's status page or contact support.
asyncio.TimeoutError or aiohttp.ServerTimeoutError: The request took too long.
Increase Timeout: The default timeout might be too short for slow proxies or target sites. Increase it via aiohttp.ClientTimeout(total=...) passed to the request or session (see the sketch after this list).
Proxy Performance: The specific proxy might be slow or overloaded. Rotate to a different one.
Target Server Slow: The website you're scraping might be slow to respond.
aiohttp.ClientSSLError: An issue occurred during the SSL handshake with the *target* server (when using HTTPS).
Outdated Certificates: Ensure your system's CA certificates (often managed by `certifi`) are up to date (`pip install --upgrade certifi`).
Server Configuration: The target website might have an invalid or misconfigured SSL certificate. You might need to investigate further or, as a last resort (and if you understand the risks), customize the SSL context to be less strict (see the SSL section above).
Proxy Interference (Less Common): Some proxies (especially transparent ones, not typically used for scraping this way) might interfere with SSL. Ensure you're using appropriate HTTP/S or SOCKS proxies designed for this.
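For the timeout case above, here is a minimal sketch of a more generous, session-wide budget; the 30/10-second values and the fetch helper are arbitrary placeholders:
import aiohttp

# 30-second overall budget per request, 10 seconds to establish the connection
generous_timeout = aiohttp.ClientTimeout(total=30, connect=10)

async def fetch(url, proxy=None):
    # Applying the timeout to the session covers every request it makes;
    # passing timeout=... to session.get() would override it per request.
    async with aiohttp.ClientSession(timeout=generous_timeout) as session:
        async with session.get(url, proxy=proxy) as response:
            return await response.text()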
Wrapping Up
Hopefully, this guide provides a solid foundation for using `aiohttp` with proxies for your web scraping projects. The asynchronous nature of `aiohttp` offers significant performance benefits, while proxies provide the necessary means to scrape responsibly and avoid interruptions. Remember that successful scraping often involves combining the right tools (`aiohttp`, quality proxies like those from Evomi) with smart strategies (rotation, delays, header management).
Python's ecosystem offers many tools for web scraping beyond `aiohttp`. To explore other options, take a look at our overview of the best Python web scraping libraries.

Author
David Foster
Proxy & Network Security Analyst
About Author
David is an expert in network security, web scraping, and proxy technologies, helping businesses optimize data extraction while maintaining privacy and efficiency. With a deep understanding of residential, datacenter, and rotating proxies, he explores how proxies enhance cybersecurity, bypass geo-restrictions, and power large-scale web scraping. David’s insights help businesses and developers choose the right proxy solutions for SEO monitoring, competitive intelligence, and anonymous browsing.