
Ready to master Python web scraping? It's a powerful way to collect data such as product prices or market trends, but IP blocks and JavaScript-heavy pages can trip you up. Proxying.io's proxies keep your web scraping unblocked and smooth.
This guide walks you through building a simple scraper, from setup to saving data.
Ready? Let's dive in.

What You Will Learn

  • Set up Python for web scraping.
  • Use libraries like Requests, Beautiful Soup, and Pandas.
  • Find HTML elements with Developer Tools. 
  • Save scraped data to a CSV or Excel file.
  • Avoid blocks with Proxying.io proxies.

This web scraping tutorial runs on any operating system, with only trivial adjustments. Let's get started.

Prerequisites

You need Python 3.4 or newer, available from python.org. On Windows, check “Add Python to PATH” during installation for easy pip access. Missed it? Rerun the installer and select “Modify” to add it.
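To confirm Python is installed and reachable, a quick check from Python itself is enough (a trivial sketch, nothing specific to this guide):

import sys
print(sys.version)  # any 3.x version from 3.4 up works for this tutorial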

For demonstration purposes, we’ll scrape Reddit.

Remember: This is a sample website. Replace it with the website you want to scrape.

Python Web Scraping Libraries 

Python’s libraries make web scraping easy. Here’s what we’ll use:

  • Requests: Makes simple HTTP requests to fetch web pages. Excellent for static pages.
  • Beautiful Soup: Parses HTML to extract information.
  • Selenium: Handles JavaScript-heavy websites through browser automation.
  • Pandas: Writes the scraped data to CSV or Excel.

Install them:

pip install requests beautifulsoup4 selenium pandas pyarrow openpyxl

You'll see a confirmation in your terminal once the libraries are installed. Proxies are also key for web scraping: Proxying.io's residential proxies rotate IPs to avoid bans, ensuring smooth data collection.
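As an optional check that everything installed correctly, you can import the libraries and print their versions (a minimal sketch):

import requests, bs4, selenium, pandas
print(requests.__version__, bs4.__version__, selenium.__version__, pandas.__version__)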

Coding Environment

Choose a coding tool. Any IDE or Jupyter Notebook works as an alternative to PyCharm, but PyCharm is beginner-friendly. In your IDE, use New > File to create a file (e.g., scraper.py). This prepares your Python web scraping project.

Browser Setup

Selenium needs a browser such as Firefox or Chrome. To make debugging convenient, start with a visible browser; later, switch to headless mode for speed. Selenium 4.6+ auto-manages WebDrivers, but make sure the driver matches your browser version. Set up Chrome:

from selenium import webdriver
from selenium.webdriver import ChromeOptions
options = ChromeOptions()
driver = webdriver.Chrome(options=options)

This preps your browser for web scraping.
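When you no longer need to watch the browser, the same setup can run headless; a minimal variant, assuming the imports above:

options = ChromeOptions()
options.add_argument('--headless=new')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)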

Targeting a Website

We’ll scrape post titles and upvotes from https://old.reddit.com/r/technology/. The old Reddit layout is less JavaScript-heavy, making web scraping easier. Check https://old.reddit.com/robots.txt to confirm scraping is allowed for public data. Load the URL:

driver.get('https://old.reddit.com/r/technology/')

This sets up your web scraping target.
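If you'd rather check robots.txt from Python instead of in the browser, the standard library's urllib.robotparser can do it (a small optional sketch):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://old.reddit.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://old.reddit.com/r/technology/'))  # True means this path may be crawled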

Remember: This is a sample website. Replace it with the website you want to scrape.

Importing libraries

These are the libraries to import for web scraping in Python:

  • Pandas
  • Beautiful Soup
  • Selenium
  • Requests

Start your script:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions
import requests

These libraries drive your web scraping tool.


Extracting Data

To scrape data from any website, use Developer Tools (F12) to inspect the HTML. Post titles are in <a> tags with class title inside <div class="thing">. Upvotes are in <div class="score unvoted">.
Here is the code:

results = []
upvotes = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for post in soup.find_all(attrs={'class': 'thing'}):
    title = post.find('a', attrs={'class': 'title'})
    score = post.find('div', attrs={'class': 'score unvoted'})
    if title and score:
        results.append(title.text.strip())
        upvotes.append(score.text.strip())

This extracts titles and upvotes for your web scraping project.
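To quickly confirm the selectors matched something, you can print the first few scraped items (a trivial check):

print(results[:3])
print(upvotes[:3])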

Saving data

Save your data to a file using Pandas.

For CSV:

df = pd.DataFrame({'Post Titles': results, 'Upvotes': upvotes})
df.to_csv('reddit_posts.csv', index=False, encoding='utf-8')
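To verify the export, you can read the file straight back with Pandas:

print(pd.read_csv('reddit_posts.csv').head())  # preview the first saved rows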

For Excel:

Pandas can export data to an Excel file, but it requires the openpyxl library (already included in the pip install command above). If you skipped it, install it from your terminal with:

pip install openpyxl

Now, let’s export the data to an Excel file.

df.to_excel('reddit_posts.xlsx', index=False)

Close the browser.

driver.quit()

Proxies

Websites may block scrapers that send too many requests. To stay anonymous while web scraping, use proxies to rotate IPs. Whether you use Proxying.io proxies or another provider, you'll need a few key details:

  • Proxy server address
  • Port
  • Username
  • Password

For proxies with Requests, use this format:

proxies = {
    'http': 'http://USERNAME:PASSWORD@PROXY_HOST:7777',   # substitute your provider's credentials, host, and port
    'https': 'http://USERNAME:PASSWORD@PROXY_HOST:7777'
}
response = requests.get('https://old.reddit.com/r/technology/', proxies=proxies)
soup = BeautifulSoup(response.text, 'html.parser')

If you have other proxies, adapt this format:

http://USER:PASS@proxy.provider.com:port

For Selenium, we offer easy integration as well. These settings keep your web scraping reliable.
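To confirm the proxy is actually in use, you can hit an IP echo service and compare the reported address with your own (httpbin.org is used here purely as an example endpoint, and the credentials and host are placeholders):

import requests

proxies = {
    'http': 'http://USERNAME:PASSWORD@PROXY_HOST:7777',
    'https': 'http://USERNAME:PASSWORD@PROXY_HOST:7777'
}
response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(response.json())  # should show the proxy's IP, not yours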

Full Python Web Scraping Code

Here’s a script to scrape pages, integrating proxies with Selenium.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions
import time

# Set up proxy with Selenium
options = ChromeOptions()
options.add_argument('--headless=new')
# Substitute your provider's credentials and host. Note: Chrome may ignore credentials
# embedded in --proxy-server, so authenticated proxies may need IP allowlisting or a
# proxy-auth browser extension instead.
options.add_argument('--proxy-server=http://USERNAME:PASSWORD@PROXY_HOST:7777')
driver = webdriver.Chrome(options=options)

pages = ['https://old.reddit.com/r/technology/', 'https://old.reddit.com/r/technology/?count=25&after=t3_1f3k4x0']
results = []
upvotes = []

for page in pages:
    print(f'Crawling {page}')
    driver.get(page)
    time.sleep(2)  # Avoid bans
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for post in soup.find_all(attrs={'class': 'thing'}):
        title = post.find('a', attrs={'class': 'title'})
        score = post.find('div', attrs={'class': 'score unvoted'})
        if title and score:
            results.append(title.text.strip())
            upvotes.append(score.text.strip())

driver.quit()

df = pd.DataFrame({'Post Titles': results, 'Upvotes': upvotes})
df.to_csv('reddit_posts.csv', index=False, encoding='utf-8')
print('Data saved!')

When you run it, the script prints each page as it crawls and finishes with “Data saved!”, leaving reddit_posts.csv in your working directory.

Pro Tips

Enhance your web scraping:

  • Handle Errors:
try:
    response = requests.get(page, proxies=proxies)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Error: {e}")

  • Avoid Bans: Add delays between requests (see the randomized-delay sketch after this list):
import time
time.sleep(2)

  • Set User-Agent:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0'}
response = requests.get(page, headers=headers, proxies=proxies)
  • JavaScript Sites: Use Selenium for pages that render their content with JavaScript.
  • Scale Up: Try aiohttp for async web scraping:
import asyncio
import aiohttp

async def fetch(page):
    async with aiohttp.ClientSession() as session:
        async with session.get(page) as response:
            return await response.text()

text = asyncio.run(fetch(page))
  • Stay Ethical: Scrape public data, respect robots.txt.
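As referenced in the Avoid Bans tip above, a randomized pause is slightly harder to fingerprint than a fixed one (a minimal sketch using the standard library):

import random
import time

time.sleep(random.uniform(1.5, 3.5))  # pause a random 1.5-3.5 seconds between requests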

Data Cleaning and Normalization (Optional)

Scraped Reddit data is often messy, so cleaning is essential: remove duplicates, fix errors, and filter noise. Normalization helps by standardizing formats, like converting upvotes to integers or tidying post titles. This makes the data ready for accurate analysis or modeling.

# Deduplicate title/upvote pairs together so the two lists stay aligned
pairs = [(title.strip(), score) for title, score in dict.fromkeys(zip(results, upvotes)) if title]
results = [title for title, _ in pairs]
upvotes = [int(score) if score.isdigit() else 0 for _, score in pairs]  # normalize scores to int

This ensures your web scraping data is analysis-ready.
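With the lists cleaned, a quick Pandas check shows the data is usable, for example sorting posts by score (reusing the DataFrame from the saving step):

df = pd.DataFrame({'Post Titles': results, 'Upvotes': upvotes})
print(df.sort_values('Upvotes', ascending=False).head())  # top posts by upvote count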

Web Scraping Use Cases

  • Market Research: Use Python to scrape r/technology and spot the latest trends before they go mainstream.
  • Sentiment Analysis: Collect user comments to understand public opinion on different products or news.
  • Content Creation: Find hot and trending topics to create content your audience cares about.
  • Data Science: Build your own AI-ready datasets by scraping and organizing real-world Reddit Data.

Proxies Are Essential

Web scraping at scale requires proxies. IP bans and CAPTCHAs can halt your scraper. Proxying.io's rotating residential proxies let you scrape worldwide, including subreddits tied to specific regions. Other providers rotate IPs too, but our Web Scraper API makes proxies, rendering, and CAPTCHAs simple.

Common Mistakes

Avoid these mistakes in web scraping:

  • Skipping error handling.
  • Missing JavaScript content (use Selenium).
  • Rapid requests without delays.
  • Wrong Selectors (check Developer Tools).
  • Scraping without proxies, risking bans.

Troubleshooting

  • Empty results: If you get an empty list, verify your selectors in Developer Tools.
  • Uneven Lists: Use Python's built-in zip() to pair the lists up:
df = pd.DataFrame(list(zip(results, upvotes)), columns=['Post Titles', 'Upvotes'])
  • Selenium Issues: Check that your browser and driver versions are compatible.

Conclusion

Nice work building your Python web scraping tool. You’re ready to grab data like a pro, and Proxying.io’s proxies keep you unblocked for global scraping. Try our Web Scraper API for an easier route, or keep tweaking your script.

Frequently Asked Questions (FAQs)

Is it legal and ethical to scrape a website?

Always review the target website’s terms of service and robots.txt file to confirm scraping is permitted. Use Proxying.io proxies to respect rate limits and avoid disrupting the site, ensuring ethical data collection from public sources.

Which type of proxy should I use for web scraping?

Proxying.io offers residential and datacenter proxies. Residential proxies are ideal for large-scale scraping as they use real IP addresses, reducing the likelihood of detection and bans compared to datacenter proxies.

Can I scrape region-locked or localized content?

Yes, Proxying.io’s rotating residential proxies allow you to use IPs from specific regions, enabling access to region-locked content or localized websites in different languages for comprehensive data collection.

Can I scrape websites that require a login?

You can configure Proxying.io proxies to work with login-based websites by passing cookies or session tokens through your scraper, maintaining anonymity while accessing protected content. Check Proxying.io’s documentation for setup details.
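A minimal sketch of passing a session cookie with Requests, assuming you already have a valid cookie from an authenticated browser session (the cookie name, URL, and proxy details below are placeholders):

import requests

cookies = {'session_id': 'YOUR_SESSION_COOKIE'}  # placeholder; copy the real name/value from your browser
proxies = {
    'http': 'http://USERNAME:PASSWORD@PROXY_HOST:7777',
    'https': 'http://USERNAME:PASSWORD@PROXY_HOST:7777'
}
response = requests.get('https://example.com/protected-page',
                        cookies=cookies, proxies=proxies, timeout=10)
print(response.status_code)  # 200 means the protected page loaded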
