Master Python Web Scraping (2025): Tools, Tips & Proxies

David Foster

Last edited on May 28, 2025


Getting Started with Python Web Scraping in 2025

Thinking about diving into web scraping? Python is arguably your best bet. Its straightforward syntax and rich collection of libraries make it a top contender for data extraction tasks in 2025.

This guide will walk you through using Python's popular requests library to fetch web content and Beautiful Soup to parse it. By the end, you'll have a practical example under your belt: scraping post titles from the r/programming subreddit (using its classic interface) to figure out which programming languages are currently buzzing in the community.

So, What Exactly is Web Scraping?

Web scraping is essentially the process of automatically extracting information from websites. Instead of manually copying data, you use automated scripts (often called bots, scrapers, or spiders) to grab the data you need.

Typically, these tools download the raw HTML source code of a web page and then sift through it to find specific pieces of information. Some more sophisticated scrapers might even employ headless browsers (browsers without a graphical interface) to interact with pages more like a human user would, especially for sites heavy on JavaScript.

Be warned, though: web scraping isn't always a walk in the park. It often requires some tinkering to get right, and your carefully crafted scripts can break if the target website changes its structure (HTML/CSS). That's why it's generally better to use an official API (Application Programming Interface) if one is available. But when an API isn't an option, web scraping is a powerful skill for gathering data for things like market analysis, competitor tracking, or academic research.

Why Python Reigns Supreme for Scraping

While many programming languages offer tools for web scraping (you really just need an HTTP client and an HTML parser), Python's ecosystem is particularly well-suited for the job.

You've got libraries like Requests, known for its elegant simplicity in handling HTTP communication, and Beautiful Soup, a fantastic tool for navigating and searching HTML documents. For more complex scenarios, frameworks like Scrapy offer a complete scraping solution, while tools like Playwright allow for browser automation. These libraries are widely used, well-documented, and have strong community support.

Plus, Python's relatively gentle learning curve makes it accessible, even if you don't write code for a living. It's great for quickly putting together scripts and testing ideas.

A Practical Python Web Scraping Example

Let's get hands-on. We'll build a Python script to scrape Reddit. The goal is to collect the post titles from the first 15 pages of r/programming (a few hundred posts) and then analyze them to see which programming languages get mentioned most often.

We'll target the classic Reddit interface (old.reddit.com) because its simpler HTML structure is easier for beginners to work with.

Setting Up Your Environment

First, ensure you have Python installed. You can grab it from the official Python website if needed.

Next, you'll need the requests and Beautiful Soup libraries. Install them using pip, Python's package manager:
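
pip install requests beautifulsoup4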

Finally, create a new file named reddit_scraper.py. This is where we'll write our code.

Fetching the Web Page Content

The core process involves two main steps: getting the HTML and then parsing it.

To download the HTML from the r/programming main page, we'll use the requests library.

import requests

The requests.get() function fetches the page. It's crucial to provide a custom User-Agent header. Many websites, including Reddit, might block or limit requests from default script user agents. A unique User-Agent makes your script look less like a generic bot.

target_url = "https://old.reddit.com/r/programming/"
headers = {'User-agent': 'Python Scraping Bot - Learning Project 1.0'}
response = requests.get(target_url, headers=headers)

The raw HTML content is stored in the .content attribute of the response object.

page_html = response.content

So far, your script should look like this:

import requests

target_url = "https://old.reddit.com/r/programming/"
headers = {'User-agent': 'Python Scraping Bot - Learning Project 1.0'}
response = requests.get(target_url, headers=headers)
page_html = response.content
# We'll add parsing logic next

Now that we have the HTML, we need to extract the post titles using Beautiful Soup.

Parsing HTML with Beautiful Soup

First, import the library at the top of your script:

from bs4 import BeautifulSoup

Then, create a Beautiful Soup object to parse the HTML content we fetched:

soup_parser = BeautifulSoup(page_html, "html.parser")

This soup_parser object allows us to navigate the HTML structure using methods like find() and find_all().

But how do we know *what* to look for? Most web browsers have built-in "Developer Tools". Open r/programming in your browser (preferably in Incognito/Private mode to avoid being logged in), right-click on a post title, and select "Inspect" or "Inspect Element". This will open the developer tools and highlight the HTML code corresponding to that title.

You need to identify HTML tags and attributes (like classes or IDs) that consistently mark the elements you want. For old Reddit titles, they are typically within an anchor tag (<a>) inside a paragraph tag (<p>) that has the class title.
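
For reference, the markup highlighted in the inspector looks roughly like this (simplified; the real elements carry extra classes and attributes):

<p class="title">
    <a class="title" href="https://example.com/some-article">Example post title</a>
</p>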

We can use find_all() to get all paragraph tags with the class "title":

title_paragraphs = soup_parser.find_all(
    "p", class_="title"
)

This gives us the paragraph elements. Each one contains an anchor tag (<a>) whose text is the actual title. We can extract just the text using a list comprehension:

extracted_titles = [
    p_tag.find("a").get_text() 
    for p_tag in title_paragraphs
]

Let's print the results to see:

print(extracted_titles)

Here's the complete script up to this point:

import requests
from bs4 import BeautifulSoup

target_url = "https://old.reddit.com/r/programming/"
headers = {'User-agent': 'Python Scraping Bot - Learning Project 1.0'}

response = requests.get(target_url, headers=headers)
page_html = response.content

soup_parser = BeautifulSoup(page_html, "html.parser")
title_paragraphs = soup_parser.find_all("p", class_="title")

extracted_titles = [p_tag.find("a").get_text() for p_tag in title_paragraphs]

print(extracted_titles)

Running this script will output the titles from the first page of r/programming. To build a more substantial dataset, we need to scrape multiple pages.

Scraping Across Multiple Pages

Let's modify the script to grab titles from the first 15 pages.

The strategy is: scrape the current page, find the link to the "next" page, load that page, scrape it, and repeat.

Using browser inspection again, you can find the "next" button on old Reddit. It's usually within a <span> tag with the class next-button. The actual link is in an anchor tag (<a>) inside that span.

We can find this link like so:

next_button_span = soup_parser.find("span", class_="next-button")
# Check if the button exists before trying to get the link
if next_button_span and next_button_span.find("a"):
    next_page_url = next_button_span.find("a")['href']
else:
    next_page_url = None # No more pages

Now, let's restructure the code for multi-page scraping. We'll need the time library to add delays between requests, which is polite to the server and makes our scraping less aggressive.

import time

We'll initialize a list to store all titles and a variable for the URL of the page we're currently scraping.

all_post_titles = []
current_page_url = "https://old.reddit.com/r/programming/"
scrape_page_count = 15 # How many pages to scrape

We'll use a loop to iterate through the pages. Inside the loop, we fetch, parse, extract titles, add them to our main list, find the next page's URL, and pause.

headers = {'User-agent': 'Python Scraping Bot - Learning Project 1.0'}
for page_num in range(scrape_page_count):
    if not current_page_url:
        print("No more pages found. Stopping.")
        break
    print(f"Scraping page {page_num + 1}: {current_page_url}")
    try:
        response = requests.get(current_page_url, headers=headers)
        response.raise_for_status()  # Check for HTTP errors (like 404, 500)
        page_html = response.content
        soup_parser = BeautifulSoup(page_html, "html.parser")
        title_paragraphs = soup_parser.find_all("p", class_="title")
        page_titles = [p_tag.find("a").get_text() for p_tag in title_paragraphs]
        all_post_titles.extend(page_titles)  # Use extend to add elements from list

        # Find the next page URL
        next_button_span = soup_parser.find("span", class_="next-button")
        if next_button_span and next_button_span.find("a"):
            current_page_url = next_button_span.find("a")['href']
        else:
            current_page_url = None  # Reached the end

        # Be polite and wait before the next request
        time.sleep(4)  # Pause for 4 seconds
    except requests.exceptions.RequestException as e:
        print(f"Error fetching page {current_page_url}: {e}")
        break  # Stop if there's a network/HTTP error
    except Exception as e:
        print(f"An error occurred during parsing: {e}")
        # Decide if you want to break or continue
        break

# After the loop, print the total number of titles collected
print(f"\nFinished scraping. Collected {len(all_post_titles)} titles.")

Here's the consolidated multi-page script:

import requests
from bs4 import BeautifulSoup
import time

all_post_titles = []
current_page_url = "https://old.reddit.com/r/programming/"
scrape_page_count = 15  # How many pages to scrape
headers = {'User-agent': 'Python Scraping Bot - Learning Project 1.0'}

print("Starting scraper...")
for page_num in range(scrape_page_count):
    if not current_page_url:
        print("No more pages found. Stopping.")
        break

    print(f"Scraping page {page_num + 1}: {current_page_url}")
    try:
        response = requests.get(current_page_url, headers=headers)
        response.raise_for_status()  # Raise an exception for bad status codes

        page_html = response.content
        soup_parser = BeautifulSoup(page_html, "html.parser")

        title_paragraphs = soup_parser.find_all("p", class_="title")
        page_titles = [p_tag.find("a").get_text() for p_tag in title_paragraphs]
        all_post_titles.extend(page_titles)

        # Find the next page URL
        next_button_span = soup_parser.find("span", class_="next-button")
        if next_button_span and next_button_span.find("a"):
            current_page_url = next_button_span.find("a")['href']
        else:
            print("Could not find 'next' button. Assuming last page.")
            current_page_url = None

        # Be polite!
        print(f"Pausing for 4 seconds...")
        time.sleep(4)

    except requests.exceptions.RequestException as e:
        print(f"Error fetching page {current_page_url}: {e}")
        break
    except Exception as e:
        print(f"An error occurred: {e}")
        break  # Stop on other unexpected errors

print(f"\nFinished scraping. Collected {len(all_post_titles)} titles.")

# Optional: print all titles
# print(all_post_titles)

Analyzing the Scraped Data: Most Mentioned Languages

Now that we have a list of titles (all_post_titles), let's analyze them. A simple analysis is to count mentions of popular programming languages.

First, define a dictionary to keep track of the counts for languages we care about. Let's use a selection based loosely on popular languages, ensuring they are lowercase for case-insensitive matching.

language_mentions = {
    "python": 0,
    "javascript": 0,
    "java": 0,
    "c#": 0,
    "c++": 0,
    "c": 0,
    "go": 0,
    "rust": 0,
    "php": 0,
    "swift": 0,
    "kotlin": 0,
    "ruby": 0,
    "typescript": 0,
    "html": 0,
    "css": 0,
    "sql": 0
}

Next, we need to process the titles. We'll iterate through each title, convert it to lowercase, split it into words, and then check if any of those words match our target languages.

import re # Import regular expressions for better word splitting

all_words = []
for title in all_post_titles:
    # Split title into words, convert to lowercase, remove basic punctuation
    words_in_title = re.findall(r'\b\w+\b', title.lower())
    all_words.extend(words_in_title)

# Count mentions
for word in all_words:
    if word in language_mentions:
        language_mentions[word] += 1

# Print the results nicely
print("\nProgramming Language Mention Counts:")
# Sort results by count descending
sorted_mentions = sorted(language_mentions.items(), key=lambda item: item[1], reverse=True)
for lang, count in sorted_mentions:
    if count > 0: # Only show languages that were actually mentioned
        print(f"- {lang.capitalize()}: {count}")

Add this analysis code block to the end of your `reddit_scraper.py` file (after the scraping loop finishes). Running the full script will now scrape the pages and then output a sorted list of language mentions based on the titles collected.

*(Note: The exact counts will vary depending on the posts active on r/programming when you run the script. This simple word matching can also catch unrelated words (e.g., "go" in "let's go"), and because the \b\w+\b pattern strips symbols, mentions of "c++" and "c#" end up counted under "c". More careful matching is needed for higher accuracy.)*
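
If you do want the symbol-bearing names counted separately, one option is a quick spot-check with dedicated regular expressions over the raw titles (the patterns below are illustrative rather than exhaustive):

# Spot-check names that the \w+ tokenizer mangles (illustrative patterns)
cpp_pattern = re.compile(r"(?<![\w+])c\+\+(?![\w+])")
csharp_pattern = re.compile(r"(?<![\w#])c#(?![\w#])")

cpp_hits = sum(1 for title in all_post_titles if cpp_pattern.search(title.lower()))
csharp_hits = sum(1 for title in all_post_titles if csharp_pattern.search(title.lower()))
print(f"Titles mentioning C++: {cpp_hits}, C#: {csharp_hits}")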

Level Up Your Scraping: Using Proxies

When you start scraping more seriously or frequently, you'll likely run into limitations. Websites often implement anti-scraping measures.

Repeated, rapid requests from the same IP address are a dead giveaway for automated activity. This can lead to temporary blocks, CAPTCHA challenges, or even permanent IP bans, stopping your scraper in its tracks.

This is where proxy servers come in. A proxy acts as an intermediary: your requests go to the proxy, and the proxy forwards them to the target website. The website sees the proxy's IP address, not yours. By using different proxies, especially rotating ones, you can make your scraping traffic look like it's coming from multiple different users, significantly reducing the chance of being detected and blocked.
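
To make the rotation idea concrete, here is a minimal sketch that cycles through a small pool of proxies and hands a different one to each request (the proxy URLs are placeholders, not real endpoints):

import itertools
import requests

# Placeholder proxy URLs - swap in real endpoints and credentials
proxy_pool = itertools.cycle([
    "http://user:pass@proxy-one.example.com:8000",
    "http://user:pass@proxy-two.example.com:8000",
    "http://user:pass@proxy-three.example.com:8000",
])

def fetch_with_rotation(url, headers):
    proxy_url = next(proxy_pool)  # pick the next proxy in the pool
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy_url, "https": proxy_url},
        timeout=30,
    )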

While free proxies exist, they are often unreliable, slow, or even compromised. For any consistent scraping task, investing in a reputable paid proxy service is highly recommended. The improved reliability and success rate usually outweigh the cost.

Getting Started with Evomi Proxies

At Evomi, we offer various proxy types suitable for web scraping, including Residential, Mobile, Datacenter, and Static ISP proxies, all sourced ethically. We pride ourselves on competitive pricing (e.g., Residential proxies start at $0.49/GB) and quality support, operating under Swiss standards of quality and privacy.

To use Evomi proxies, you'll first need an account. Once registered, you can choose the proxy type that best fits your needs (Residential proxies are often a good choice for mimicking real users) and obtain your access credentials (endpoint address, port, username, password). We even offer a free trial for our Residential, Mobile, and Datacenter proxies so you can test them out.

Integrating Evomi Proxies into Your Python Script

Adding proxies to your `requests` calls is straightforward. You'll define a dictionary specifying the proxy address for HTTP and HTTPS traffic.

Let's assume you chose Evomi's Residential proxies. Your credentials might look something like this (replace placeholders with your actual details):

# Replace with your actual Evomi credentials and endpoint
evomi_proxy_user = "your-evomi-username"
evomi_proxy_pass = "your-evomi-password"
evomi_proxy_endpoint = "rp.evomi.com"  # Residential proxy endpoint
evomi_proxy_port_http = "1000"  # HTTP port for residential
proxy_url_http = f"http://{evomi_proxy_user}:{evomi_proxy_pass}@{evomi_proxy_endpoint}:{evomi_proxy_port_http}"

# For HTTPS, you might use a different port if provided, e.g., 1001
# proxy_url_https = f"http://{evomi_proxy_user}:{evomi_proxy_pass}@{evomi_proxy_endpoint}:1001"

# Or often, the same HTTP proxy URL works for HTTPS requests via CONNECT tunneling:
proxy_url_https = proxy_url_http

PROXIES = {
    "http": proxy_url_http,
    "https": proxy_url_https  # Use the appropriate HTTPS proxy URL/port
}

Now, simply add the `proxies` argument to your `requests.get()` call within the loop:

        response = requests.get(
            current_page_url,
            headers=headers,
            proxies=PROXIES
        )

With this change, your script's traffic will be routed through the specified Evomi proxy server. If you're using rotating residential proxies, each request can potentially go through a different IP address, making your scraping much stealthier.
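
If you want to confirm that traffic is actually going through the proxy, a quick sanity check is to request an IP-echo service such as httpbin.org and compare the address it reports with and without the proxies argument:

check = requests.get("https://httpbin.org/ip", headers=headers, proxies=PROXIES, timeout=30)
print(check.json())  # should show the proxy's IP address, not your own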

Concluding Thoughts

You've now seen how to use Python, with the help of `requests` and `Beautiful Soup`, to scrape data from a website, handle pagination, and even perform some basic analysis. We also covered the importance of using proxies like those from Evomi for more robust and reliable scraping.

Remember that scraping simple HTML like old Reddit is often easier than dealing with modern, dynamic websites that rely heavily on JavaScript to load content. For those, the techniques here might not be sufficient. You'll likely need a tool that can render JavaScript, such as a browser automation library like Playwright or Selenium, possibly paired with a framework like Scrapy for larger crawls.
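
As a taste of what that looks like, here is a minimal sketch using Playwright's synchronous API (you'd need to run pip install playwright and then playwright install first; treat it as a starting point, not a drop-in replacement for the script above):

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # headless Chromium instance
    page = browser.new_page()
    page.goto("https://www.reddit.com/r/programming/")  # the JavaScript-heavy modern interface
    rendered_html = page.content()  # HTML after scripts have run
    browser.close()

soup = BeautifulSoup(rendered_html, "html.parser")
print(soup.title.get_text())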

Web scraping is a versatile skill, but always scrape responsibly: respect website terms of service, avoid overloading servers (use delays!), and prefer official APIs when available.
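
For instance, Python's standard library can check a site's robots.txt before you fetch a URL (a small sketch; robots.txt rules complement, but don't replace, a site's terms of service):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://old.reddit.com/robots.txt")
robots.read()

user_agent = "Python Scraping Bot - Learning Project 1.0"
url_to_check = "https://old.reddit.com/r/programming/"
if robots.can_fetch(user_agent, url_to_check):
    print("robots.txt allows fetching this URL with our user agent.")
else:
    print("robots.txt disallows this URL - skip it.")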


Author

David Foster

Proxy & Network Security Analyst

About Author

David is an expert in network security, web scraping, and proxy technologies, helping businesses optimize data extraction while maintaining privacy and efficiency. With a deep understanding of residential, datacenter, and rotating proxies, he explores how proxies enhance cybersecurity, bypass geo-restrictions, and power large-scale web scraping. David’s insights help businesses and developers choose the right proxy solutions for SEO monitoring, competitive intelligence, and anonymous browsing.
