Web scraping involves two major challenges. The first is avoiding request blocking and rate limiting while collecting webpage data. The second is parsing the raw HTML into structured, usable information.
While proxies help address IP blocking, developers still need to manually process large amounts of HTML. This usually requires maintaining CSS selectors, XPath expressions, and parsing logic that can break whenever a website updates its structure.
With the rise of LLMs like Anthropic's Claude, developers now have another option. Instead of manually analyzing HTML and maintaining selectors, Claude can analyze webpage structures and extract data automatically. By combining Claude with Python and Proxying residential proxies, it becomes possible to build a flexible, intelligent scraping workflow that requires significantly less maintenance.
In this guide, you’ll learn how to use Claude for web scraping with Python. We’ll cover:
- Setting up the environment
- Scraping webpages with Proxying proxies
- Parsing HTML with the Claude API
- Scraping a realistic e-commerce page
- Optimizing token usage for large webpages
- Building a reusable scraping workflow
By the end of this tutorial, you will have a working Claude-powered scraping solution capable of extracting structured data from modern websites.
Why Use Claude for Web Scraping?
Traditional web scraping workflows rely heavily on manually written selectors and parsing rules. While tools like BeautifulSoup and lxml make parsing easier, the resulting scrapers are still hard to maintain at scale.
A small HTML structure update can completely break a scraper. Developers then need to inspect the webpage again, update selectors, test extraction logic, and redeploy their scripts.
Claude changes this workflow by using natural language understanding to interpret webpage structures dynamically.
Instead of hardcoding every selector manually, Claude can:
- Analyze raw HTML
- Extract structured data automatically
- Generate CSS selectors
- Adapt to webpage structure changes
- Return validated JSON responses
- Reduce parser maintenance
How Claude Fits Into the Scraping Workflow
A Claude-powered scraping workflow usually looks like this:
- Send requests through residential proxies.
- Retrieve the webpage HTML.
- Clean unnecessary HTML content.
- Send relevant HTML sections to Claude.
- Ask Claude to extract data or generate selectors.
- Use the returned data or selectors to parse the rest of the page.
- Store structured results.
This workflow reduces the amount of manual parser maintenance needed over time.
Setting Up the Environment
Before building the scraper, install Python and the required libraries. You can download Python from the official website.
For this tutorial, we will use the following libraries:
- Anthropic
- Instructor
- Requests
- BeautifulSoup4
- Pydantic
These libraries handle the Claude API communication, structured LLM responses, HTTP requests, HTML parsing, and response validation.
Create a virtual environment and install everything using the following commands.
```bash
python -m venv env
source env/bin/activate
pip install anthropic instructor requests beautifulsoup4 pydantic
```

Each of these libraries plays a specific role in the scraping pipeline. Requests handles network communication, BeautifulSoup processes HTML, and Claude is responsible for understanding and extracting structured information.
Getting API Credentials
To use Claude for scraping, you’ll need an API key from Anthropic. You will also need residential proxy credentials from Proxying.
Residential proxies help distribute requests across multiple IP addresses, which reduces blocking and improves scraping reliability.
Creating the Initial Python Script
Create a file called main.py and import the required libraries.
```python
import requests
import anthropic
import instructor
from bs4 import BeautifulSoup
from pydantic import BaseModel
```
Next, define your API credentials and Claude model.
```python
CLAUDE_API_KEY = "API_KEY"
CLAUDE_MODEL = "claude-3-5-haiku-20241022"
PROXYING_USERNAME = "USERNAME"
PROXYING_PASSWORD = "PASSWORD"
```

For this tutorial, we’re using Claude 3.5 Haiku because it is fast, lightweight, and cost-effective for HTML parsing tasks.
Defining Structured Output Models
One of the most important improvements in this workflow is structured output generation. Instead of receiving raw text from Claude, we enforce a strict data format using Pydantic models. This ensures that every response follows a predictable structure.
```python
class Product(BaseModel):
    title: str
    price: str
    rating: str

class ProductList(BaseModel):
    products: list[Product]
```

This structure is important because it allows Claude to return multiple products in a consistent format. Without it, responses could vary in format, making them harder to process in downstream applications.
Scraping Web Pages Using Proxying
The first step in the scraping pipeline is retrieving HTML from a website. We define a function that takes a URL and returns its HTML content. Inside this function, we configure Proxying residential proxies so that all requests are routed through external IPs instead of our local machine.
This is important because many websites detect and block repeated requests coming from the same IP address. By using residential proxies, each request appears to come from a different real user, which significantly improves the success rate.
We also include browser-like headers such as a user agent and language preferences. These headers make the request appear more natural and reduce the chances of getting blocked.
```python
def get_html(url: str) -> str:
    # Route all traffic through Proxying's residential gateway.
    proxy_url = f"http://{PROXYING_USERNAME}:{PROXYING_PASSWORD}@gate.proxying.io:10000"
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    # Browser-like headers reduce the chance of being blocked.
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()
    return response.text
```

When the function runs, it sends a request through Proxying, receives the full HTML page, verifies that the request was successful, and then returns the raw HTML for further processing.
At this stage, the data is completely unstructured.
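For example, fetching the practice site used in the next section (the public Books to Scrape demo) looks like this:

```python
html = get_html("https://books.toscrape.com/")
print(len(html))  # size of the raw, unstructured HTML
```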
Scraping Books to Scrape Using Claude
To understand how Claude processes HTML, we start with a simple practice website called Books to Scrape. This site is designed for learning scraping techniques and has a clean, consistent HTML structure.
We now define a function that sends this HTML to Claude and asks it to extract structured book data.
```python
def parse_books(html: str) -> ProductList:
    client = instructor.from_anthropic(
        anthropic.Anthropic(api_key=CLAUDE_API_KEY)
    )
    prompt = f"Extract book title, price, and rating from the following HTML: {html}"
    return client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": prompt}]
        }],
        response_model=ProductList,
    )
```

In this function, the HTML is passed directly to Claude along with a clear instruction. Claude then analyzes the structure of the page and identifies repeating patterns that represent books. It ignores unrelated elements such as navigation bars and footer sections and focuses only on meaningful product data.
The result is a structured list of books that matches our predefined schema, which can be directly used in Python without additional parsing.
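Putting both functions together gives a minimal end-to-end run against the Books to Scrape homepage:

```python
html = get_html("https://books.toscrape.com/")
books = parse_books(html)

for book in books.products:
    print(book.title, book.price, book.rating)
```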
Scraping a Realistic E-commerce Page
After understanding the basic workflow, we move to a more realistic example: the WebScraper.io laptop listing page. This page simulates a real e-commerce structure with multiple product cards, prices, and ratings.
https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
The scraping logic remains the same, but the HTML becomes more complex. Instead of manually inspecting nested structures or writing selectors, we simply send the HTML to Claude and let it extract meaningful product information.
```python
def parse_products(html: str) -> ProductList:
    client = instructor.from_anthropic(
        anthropic.Anthropic(api_key=CLAUDE_API_KEY)
    )
    prompt = f"Extract product name, price, and rating from the following HTML: {html}"
    return client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": prompt}]
        }],
        response_model=ProductList,
    )
```

This function works by passing the full HTML of the WebScraper.io page to Claude, along with a clear instruction on what we want to extract. Internally, Claude analyzes the structure of the page, identifies repeating product patterns, and separates meaningful information such as product titles, prices, and ratings from irrelevant layout elements like containers or spacing divs. The important point here is that we are no longer defining how to extract data; instead, we are defining what we want, and Claude handles the extraction logic automatically.
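A quick usage sketch, reusing get_html with the laptop listing URL shown above:

```python
html = get_html("https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops")
laptops = parse_products(html)

for product in laptops.products:
    print(product.title, product.price, product.rating)
```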
Because the page is more realistic, the HTML contains more nesting and structural noise compared to the earlier example.
Cleaning HTML Before Sending to Claude
When dealing with larger web pages, sending raw HTML directly to Claude can be inefficient because it includes unnecessary elements like scripts, styles, and metadata. These elements do not contribute to data extraction and only increase token usage.
To solve this, we clean the HTML before sending it to Claude. This step removes unnecessary components and keeps only meaningful content. As a result, Claude can focus on relevant data, which improves both accuracy and performance.
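A minimal cleaning helper using BeautifulSoup might look like the sketch below; the list of tags to strip is an illustrative choice, not a fixed rule:

```python
def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that never contain extractable product data.
    for tag in soup(["script", "style", "noscript", "svg", "head"]):
        tag.decompose()
    return str(soup)
```

Passing clean_html(html) to parse_products instead of the raw page keeps the prompt focused on visible content and typically reduces token usage considerably.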
Why This Approach Is Powerful
The main advantage of this system is that it removes dependency on fragile scraping logic. Instead of writing and maintaining selectors manually, Claude dynamically interprets page structure and extracts data based on context.
This makes the scraper more flexible and significantly easier to maintain over time. When combined with Proxying, the system also becomes scalable because requests are distributed across multiple IP addresses, reducing the risk of blocking.
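As a final step, the pieces above compose into a small reusable entry point. This is a sketch built from the functions defined earlier (including the hypothetical clean_html helper from the previous section):

```python
def scrape(url: str) -> ProductList:
    html = get_html(url)                      # fetched through Proxying residential proxies
    return parse_products(clean_html(html))   # trim noise, then let Claude extract
```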
Limitations and Practical Considerations
Although this approach is powerful, it is not without limitations. Large HTML pages can increase token usage, which affects cost and performance. Some websites also rely heavily on JavaScript rendering, which may require additional tools like browser automation frameworks.
Additionally, not all HTML structures are clean or consistent, so Claude may occasionally require more precise prompts to extract accurate results.
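In those cases, tightening the prompt usually helps. The wording below is only an illustration of the kind of constraints you might add:

```python
prompt = (
    "Extract every product from the HTML below. "
    "Return the title exactly as displayed, the price including its currency symbol, "
    "and the rating exactly as shown on the page. "
    "Ignore navigation, header, and footer content.\n"
    f"{html}"
)
```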
Despite these limitations, the combination of Claude, Python, and Proxying provides a strong and modern foundation for building scalable scraping systems.
Conclusion
Claude represents a major shift in how web scraping systems are designed. Instead of relying entirely on manual parsing logic, developers can now use AI to interpret and structure HTML automatically.
When combined with Python and Proxying residential proxies, this approach enables the creation of flexible, scalable, and intelligent scraping pipelines that are far easier to maintain than traditional systems.
