Python cURL: Web Scraping with PycURL and Proxies

David Foster

Last edited on May 4, 2025

Scraping Techniques

Understanding cURL and its Python Counterpart, PycURL

This article dives into using cURL functionalities within Python via the PycURL library. We'll cover what cURL is, how to harness it in Python, and walk through practical examples for web scraping – including GET and POST requests, and handling JSON data. Ready to explore?

What Exactly is cURL?

At its core, cURL is a command-line utility designed for transferring data with servers. It's incredibly versatile, supporting numerous protocols beyond just HTTP, like FTP, SMTP, and more. This makes cURL a dependable choice for a wide array of tasks involving server communication, from simple data retrieval to complex API interactions.

For web scraping enthusiasts, cURL proves particularly useful for scraping static websites and extracting data loaded via XHR requests or APIs.

Static websites are those delivering their entire content in a single HTML file upon request. Think of sites like Wikipedia – each page load provides the complete HTML. You don't need complex JavaScript rendering; you simply fetch the HTML and parse the data you need.

Scraping XHR (XMLHttpRequest) content, on the other hand, is often necessary for modern, dynamic websites. Take a platform like Reddit as an example. The initial HTML load might contain minimal data; much of the content (posts, comments, etc.) is fetched dynamically using JavaScript after the page loads. These background requests are often XHR requests.

You can typically spot these XHR requests using your browser's developer tools, specifically under the "Network" tab. Look for requests labeled Fetch/XHR.

Here’s a conceptual example of what you might see:

Browser developer tools showing network requests

You could leverage cURL (or PycURL) to replicate such a GET request and retrieve, for instance, trending search data directly. Identifying the specific request that loads the desired data can sometimes be tricky, but once found, using cURL to fetch and process it is quite straightforward.
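As a rough illustration, replicating such an XHR GET request with the cURL command line might look like the snippet below. The URL and header values are placeholders; in practice you would copy the real ones from your browser's developer tools (most browsers offer a "Copy as cURL" option on the request):

curl 'https://www.example.com/api/trending.json' -H 'Accept: application/json' -H 'User-Agent: Mozilla/5.0'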

Is There a Native Python cURL Equivalent?

Python doesn't have a built-in command that directly mirrors cURL's command-line usage. However, several libraries facilitate server communication and data retrieval.

PycURL is a popular choice, providing Python bindings for libcurl, the underlying C library that powers cURL. Other libraries like `requests`, `urllib3`, `httpx`, and `wget` also offer HTTP client functionalities.

It's important to note that PycURL has its own Pythonic interface, distinct from the raw cURL command-line syntax. You'll be using Python methods to set options and perform requests, essentially translating cURL concepts into Python code. If you absolutely need to run raw cURL commands from Python, you might look into the `subprocess` module, but PycURL generally offers a more integrated approach.
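If you do go the `subprocess` route, a minimal sketch looks like this (it assumes the `curl` binary is installed and available on your PATH; the rest of this article uses PycURL instead):

import subprocess

# Call the system curl binary in silent mode (-s) and capture its output
result = subprocess.run(
    ["curl", "-s", "https://api.ipify.org?format=json"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)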

Getting Started with PycURL

To use PycURL, you'll first need to install it and then import it into your Python script. The typical workflow involves two main components:

  • The Curl Object: This object represents the cURL session. You use it to set options (like the URL, headers, proxy settings) and execute the request.

  • A Buffer Object: This object (often an `io.BytesIO` instance) acts as a temporary storage location where PycURL writes the response data received from the server.

Let's walk through the process.

Setting Up Your Environment: Installing PycURL

First things first, ensure you have Python installed. You can grab the latest version from the official Python downloads page. Modern Python distributions usually include `pip` (the package installer) and necessary SSL libraries.

Installing PycURL is often as straightforward as a single pip command:

pip install pycurl

However, depending on your operating system and Python setup, you might need a different command. For instance, on macOS, the default `python` command might point to an older system version (like Python 2.7) that lacks `pip`. If you've installed a newer version (e.g., Python 3.11), you'll need to use its specific command or alias.

To install PycURL for a specific Python 3 version, you might use:

python3.11 -m pip install pycurl

Or more generally for Python 3:

python3 -m pip install pycurl

Alternatively, you might use the version-specific `pip` command:

pip3.11 install pycurl

Note: PycURL has system dependencies (like libcurl itself and development headers). If `pip install pycurl` fails, check the PycURL documentation or search for installation guides specific to your OS (e.g., "install pycurl ubuntu", "install pycurl macos", "install pycurl windows").
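For example, on Debian/Ubuntu the build dependencies can typically be installed with something like the following before retrying the pip install (package names vary by distribution and release, so treat this as a starting point rather than a definitive recipe):

sudo apt-get install libcurl4-openssl-dev libssl-dev python3-dev
python3 -m pip install pycurl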

Running Your First PycURL Script

With Python and PycURL installed, you can write your code in any text editor and save it (e.g., as `myscript.py`). Then, execute it from your terminal:

python myscript.py

Remember to use the correct Python alias if you have multiple versions installed (e.g., `python3 myscript.py` or `python3.11 myscript.py`).

Crucially, ensure your terminal's current directory is the one containing your `myscript.py` file. Otherwise, Python won't find the script to execute.

Let's create a basic `myscript.py` file:

import pycurl
from io import BytesIO

# Prepare a buffer to store the response
response_buffer = BytesIO()

# Initialize a Curl object
curl_handle = pycurl.Curl()

# Set the target URL
target_url = 'https://api.ipify.org?format=json' # A simple service to get your IP
curl_handle.setopt(curl_handle.URL, target_url)

# Tell PycURL where to write the response data
curl_handle.setopt(curl_handle.WRITEDATA, response_buffer)

# Execute the request
print(f"Performing request to {target_url}...")
curl_handle.perform()

# Always close the handle
curl_handle.close()

# Retrieve the response data from the buffer
response_body_bytes = response_buffer.getvalue()

# Decode the bytes to a string (assuming UTF-8 encoding) and print
response_body_str = response_body_bytes.decode('utf-8')
print(f"Response Body:\n{response_body_str}")

Running this script should print your current public IP address in JSON format.

Terminal output showing IP address from script execution

Breaking down the code:

  • Import necessary libraries (`pycurl`, `BytesIO`).

  • Create `BytesIO` and `pycurl.Curl` instances.

  • Use `setopt` to configure the URL (`curl_handle.URL`).

  • Use `setopt` again to direct the output (`curl_handle.WRITEDATA`).

  • Execute with `perform()`.

  • Clean up with `close()`.

  • Read the data from the buffer using `getvalue()` and decode it.

Using PycURL with Custom Headers and Proxies

As you've seen, the `setopt` method is key to configuring your cURL requests. This is how you add custom HTTP headers, set timeouts, and crucially for web scraping, configure proxies.

Proxies are essential for any serious web scraping project to avoid getting blocked. Websites often monitor incoming requests, looking for suspicious patterns. They check headers (like the User-Agent) to see if requests look like they're coming from real browsers and scrutinize IP addresses for high request volumes or speeds typical of bots.

Using a reliable proxy service, like Evomi's residential proxies, allows you to route your requests through different IP addresses. This makes it appear as though your requests are originating from various genuine users across different locations, significantly reducing the chance of detection and blocks. Evomi prides itself on ethically sourced proxies and competitive pricing (Residential plans start at just $0.49/GB).

To use a proxy with PycURL, you'll need the proxy server's details: host, port, username, and password. You typically find these in your proxy provider's dashboard after signing up.

Here's how you can integrate proxy settings and a custom User-Agent into the previous example:

import pycurl
from io import BytesIO

# Proxy details (Replace with your actual Evomi credentials)
proxy_host = "rp.evomi.com"  # Example: Evomi residential endpoint
proxy_port = 1000            # Example: Evomi residential HTTP port
proxy_user = "your_username"
proxy_pass = "your_password"

# Prepare buffer and Curl object
response_buffer = BytesIO()
curl_handle = pycurl.Curl()

# --- Set custom headers ---
# Set a realistic User-Agent
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
curl_handle.setopt(pycurl.USERAGENT, user_agent)

# --- Configure the proxy ---
proxy_url = f"http://{proxy_host}:{proxy_port}"
curl_handle.setopt(pycurl.PROXY, proxy_url)

# --- Set proxy authentication ---
proxy_auth = f"{proxy_user}:{proxy_pass}"
curl_handle.setopt(pycurl.PROXYUSERPWD, proxy_auth)

# --- Set target URL ---
target_url = 'https://api.ipify.org?format=json' # Check IP via proxy
curl_handle.setopt(curl_handle.URL, target_url)

# Tell PycURL where to write the response data
curl_handle.setopt(curl_handle.WRITEDATA, response_buffer)

# Execute the request
print(f"Performing request to {target_url} via proxy {proxy_host}...")
curl_handle.perform()

# Always close the handle
curl_handle.close()

# Retrieve the response data
response_body_bytes = response_buffer.getvalue()

# Decode and print
response_body_str = response_body_bytes.decode('utf-8')
print(f"Response Body (via proxy):\n{response_body_str}")

Running this modified script should now show the IP address of the Evomi proxy server you connected through, not your own.

Terminal output showing a different IP address, indicating proxy usage

Notice the new elements:

  • Variables holding proxy credentials.

  • `setopt(pycurl.USERAGENT, ...)`: Sets a browser-like User-Agent string. Without one, your requests either carry no User-Agent or a non-browser default, which makes them easier to flag and block.

  • `setopt(pycurl.PROXY, ...)`: Specifies the proxy server address and port.

  • `setopt(pycurl.PROXYUSERPWD, ...)`: Provides the username and password for proxy authentication.

You can use `setopt` similarly to add any other required HTTP headers for your scraping tasks.
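For instance, here's a small sketch of adding extra request headers and a timeout with `setopt` (the header values are illustrative, and the options would be set on the handle before calling `perform()`):

# Add arbitrary request headers as a list of "Name: value" strings
curl_handle.setopt(pycurl.HTTPHEADER, [
    'Accept: text/html,application/json',
    'Accept-Language: en-US,en;q=0.9',
])

# Give up if the whole transfer takes longer than 30 seconds
curl_handle.setopt(pycurl.TIMEOUT, 30)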

Practical PycURL Examples for Web Scraping

Let's see how PycURL fits into common web scraping scenarios.

Example: PycURL GET Request

GET requests are the most common type, used when you simply request a resource (like a webpage) from a server. It's what your browser does when you type a URL. You can use GET requests with PycURL to scrape static HTML content.

To parse the fetched HTML, you can use Python's built-in HTMLParser or more robust libraries like Beautiful Soup or lxml.

Here's an example fetching a Wikipedia page and extracting its title tag using `HTMLParser`:

import pycurl
from io import BytesIO
from html.parser import HTMLParser


# --- HTML Parser Class ---
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title_tag = False
        self.page_title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title_tag = True

    def handle_data(self, data):
        if self.in_title_tag:
            self.page_title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title_tag = False


# --- PycURL Request ---
response_buffer = BytesIO()
curl_handle = pycurl.Curl()

# Target a specific Wikipedia page
target_url = 'https://en.wikipedia.org/wiki/Web_scraping'
curl_handle.setopt(curl_handle.URL, target_url)
curl_handle.setopt(curl_handle.WRITEDATA, response_buffer)

# Set a user agent (good practice)
user_agent = 'MySimpleScraper/1.0 (https://example.com/scraper-info)'
curl_handle.setopt(pycurl.USERAGENT, user_agent)

print(f"Fetching {target_url}...")
curl_handle.perform()
curl_handle.close()

# --- Process Response ---
html_content_bytes = response_buffer.getvalue()
html_content_str = html_content_bytes.decode('utf-8') # Assuming UTF-8

# Parse the HTML to find the title
parser = TitleParser()
parser.feed(html_content_str)

print(f"Found Title: {parser.page_title.strip()}")

This script fetches the HTML, then feeds it to our simple `TitleParser`, which extracts the content inside the `<title>` tag.

Terminal output showing the extracted title of the Wikipedia page

This is a basic illustration. You could extend the parser to extract links, specific divs, table data, or any other elements needed for your scraping task.
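If you'd rather not write a parser by hand, a brief sketch using Beautiful Soup (a third-party package, installed separately with `pip install beautifulsoup4`) could replace the `TitleParser` class entirely:

from bs4 import BeautifulSoup

# Parse the HTML string fetched by PycURL
soup = BeautifulSoup(html_content_str, "html.parser")

# Extract the page title and the first few link targets
print(soup.title.string)
print([a.get("href") for a in soup.find_all("a", limit=5)])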

Example: PycURL POST Request

Sometimes, retrieving data requires sending information *to* the server, often via a POST request. This is common for submitting forms (like search queries or login credentials).

PycURL handles POST requests using the `POSTFIELDS` option. You'll typically need to URL-encode the data you're sending.

import pycurl
from io import BytesIO
from urllib.parse import urlencode  # For encoding POST data

# --- Prepare Data and Request ---
response_buffer = BytesIO()
curl_handle = pycurl.Curl()

# Target URL that accepts POST requests (httpbin is great for testing)
target_url = 'https://httpbin.org/post'
curl_handle.setopt(curl_handle.URL, target_url)

# Define the data to send
post_data = {'searchTerm': 'PycURL example', 'userID': '12345'}

# Encode the data for the POST request
encoded_fields = urlencode(post_data)

# Set the POSTFIELDS option
curl_handle.setopt(curl_handle.POSTFIELDS, encoded_fields)

# Set WRITEDATA to capture the response
curl_handle.setopt(curl_handle.WRITEDATA, response_buffer)

# --- Execute and Process ---
print(f"Sending POST request to {target_url}...")
curl_handle.perform()
curl_handle.close()

response_body_bytes = response_buffer.getvalue()
response_body_str = response_body_bytes.decode('utf-8')

print(f"POST Response Body:\n{response_body_str}")

The response from `httpbin.org/post` will typically echo back the data you sent in the `form` field of the JSON response.

Terminal output showing the JSON response from httpbin, including the submitted POST data

Key takeaway: Import `urlencode`, define your data as a dictionary, encode it, and pass it to `POSTFIELDS`.
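Some endpoints expect a JSON body rather than form-encoded fields. In that case, one possible variation is to serialize the dictionary with `json.dumps` and set the Content-Type header yourself (again, configured on the handle before calling `perform()`):

import json

# Hypothetical JSON payload; adjust the fields to whatever the endpoint expects
json_payload = json.dumps({'searchTerm': 'PycURL example', 'userID': '12345'})
curl_handle.setopt(pycurl.HTTPHEADER, ['Content-Type: application/json'])
curl_handle.setopt(pycurl.POSTFIELDS, json_payload)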

Example: PycURL for XHR or API Connections

To fetch data loaded via XHR or interact with an API, you'll use either GET or POST requests with PycURL, depending on the specific endpoint's requirements. Many APIs use GET requests for retrieving data and POST for sending data or triggering actions.

Let's revisit the Reddit trending searches example. Using browser developer tools, you might find an API endpoint URL like the one shown earlier (often ending in `.json`). Since it retrieves data, it likely uses a GET request.

Browser developer tools highlighting an API request URL for trending searches

Let's assume the URL is:

https://www.reddit.com/api/trending_searches_v1.json?raw_json=1

We can fetch this using PycURL and parse the resulting JSON:

import pycurl
from io import BytesIO
import json  # To parse the JSON response

# --- Prepare Request ---
response_buffer = BytesIO()
curl_handle = pycurl.Curl()

# The API endpoint URL
api_url = 'https://www.reddit.com/api/trending_searches_v1.json?raw_json=1'
curl_handle.setopt(curl_handle.URL, api_url)

# Set a user agent - often required by APIs
user_agent = 'MyRedditTrendScraper/1.0'
curl_handle.setopt(pycurl.USERAGENT, user_agent)

# Set WRITEDATA
curl_handle.setopt(curl_handle.WRITEDATA, response_buffer)

# --- Execute and Process ---
print(f"Fetching data from {api_url}...")
curl_handle.perform()
curl_handle.close()

response_body_bytes = response_buffer.getvalue()
response_body_str = response_body_bytes.decode('utf-8')

# Load the JSON string into a Python dictionary
try:
    trending_data = json.loads(response_body_str)
    # Now you can work with the data, e.g., print trending searches
    print("Successfully parsed JSON data:")
    # Example: Print keys or specific data points
    # print(trending_data.keys())
    # for search in trending_data.get('trending_searches', []):
    #    print(f"- {search.get('query_string')}")
    print(json.dumps(trending_data, indent=2))  # Pretty print the JSON
except json.JSONDecodeError:
    print("Error: Could not decode JSON response.")
    print(response_body_str)  # Print raw response if JSON fails

This fetches the data from the API endpoint and uses the `json` library to parse it into a workable Python object.
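Before trusting the body, it's also sensible to check the HTTP status code. PycURL exposes it via `getinfo`, which must be called after `perform()` but before `close()`, so in the example above this check would sit right after the `perform()` call:

# Read the HTTP status code from the handle (before close())
status_code = curl_handle.getinfo(pycurl.RESPONSE_CODE)
if status_code != 200:
    print(f"Request returned HTTP status {status_code}")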

Terminal output showing the JSON data received from the Reddit API endpoint

Wrapping Up

We've journeyed through the fundamentals of cURL and explored how to leverage its power within Python using the PycURL library. From basic GET requests for static content to handling POST data and interacting with APIs or XHR endpoints, you should now have a solid foundation for using PycURL in your web scraping projects. Remember the importance of proxies, like those offered by Evomi, for reliable and undisrupted scraping.

Happy scraping!

Author

David Foster

Proxy & Network Security Analyst

About Author

David is an expert in network security, web scraping, and proxy technologies, helping businesses optimize data extraction while maintaining privacy and efficiency. With a deep understanding of residential, datacenter, and rotating proxies, he explores how proxies enhance cybersecurity, bypass geo-restrictions, and power large-scale web scraping. David’s insights help businesses and developers choose the right proxy solutions for SEO monitoring, competitive intelligence, and anonymous browsing.
