Web scraping involves two major challenges. The first is avoiding request blocking and rate limiting while collecting webpage data. The second is parsing the raw HTML into structured, usable information.
While proxies help address IP blocking, developers still need to manually process large amounts of HTML. This usually requires maintaining CSS selectors, XPath expressions, and parsing logic that can break whenever a website updates its structure.
With the rise of LLMs like Anthropic's Claude, developers now have another option. Instead of manually analyzing HTML and maintaining selectors, Claude can analyze webpage structures and extract data automatically. By combining Claude with Python and Proxying residential proxies, it becomes possible to build a flexible, intelligent scraping workflow that requires significantly less maintenance.
In this guide, you’ll learn how to use Claude for web scraping with Python. We’ll cover:
- Setting up the environment
- Scraping webpages with Proxying proxies
- Parsing HTML with the Claude API
- Scraping a realistic e-commerce page
- Optimizing token usage for large webpages
- Building a reusable scraping workflow
By the end of this tutorial, you will have a working Claude-powered scraping solution capable of extracting structured data from modern websites.
Why Use Claude for Web Scraping?
Traditional web scraping workflows rely heavily on manually written selectors and parsing rules. While tools like BeautifulSoup and lxml make parsing easier, the resulting scrapers are still hard to maintain at scale.
A small HTML structure update can completely break a scraper. Developers then need to inspect the webpage again, update selectors, test extraction logic, and redeploy their scripts.
Claude changes this workflow by using natural language understanding to interpret webpage structures dynamically.
Instead of hardcoding every selector manually, Claude can:
- Analyze raw HTML
- Extract structured data automatically
- Generate CSS selectors
- Adapt to webpage structure changes
- Return validated JSON responses
- Reduce parser maintenance
How Claude Fits Into the Scraping Workflow
A Claude-powered scraping workflow usually looks like this:
- Send requests through residential proxies.
- Retrieve the webpage HTML.
- Clean unnecessary HTML content.
- Send relevant HTML sections to Claude.
- Ask Claude to extract data or generate selectors.
- Use the returned data or selectors to parse the rest of the page.
- Store structured results.
This workflow reduces the amount of manual parser maintenance needed over time.
Setting Up the Environment
Before building the scraper, install Python and the required libraries. You can download Python from the official website.
For this tutorial, we will use the following libraries:
- Anthropic
- Instructor
- Requests
- BeautifulSoup4
- Pydantic
These libraries handle the Claude API communication, structured LLM responses, HTTP requests, HTML parsing, and response validation.
Create a virtual environment and install everything using the following commands.
```bash
python -m venv env
source env/bin/activate
pip install anthropic instructor requests beautifulsoup4 pydantic
```

Each of these libraries plays a specific role in the scraping pipeline. Requests handles network communication, BeautifulSoup processes HTML, and Claude is responsible for understanding and extracting structured information.
Getting API Credentials
To use Claude for scraping, you’ll need an API key from Anthropic. You will also need residential proxy credentials from Proxying.
Residential proxies help distribute requests across multiple IP addresses, which reduces blocking and improves scraping reliability.
Creating the Initial Python Script
Create a file called main.py and import the required libraries.
```python
import requests
import anthropic
import instructor
from bs4 import BeautifulSoup
from pydantic import BaseModel
```
Next, define your API credentials and Claude model.
```python
CLAUDE_API_KEY = "API_KEY"
CLAUDE_MODEL = "claude-3-5-haiku-20241022"
PROXYING_USERNAME = "USERNAME"
PROXYING_PASSWORD = "PASSWORD"
```

For this tutorial, we’re using Claude 3.5 Haiku because it is fast, lightweight, and cost-effective for HTML parsing tasks.
Defining Structured Output Models
One of the most important improvements in this workflow is structured output generation. Instead of receiving raw text from Claude, we enforce a strict data format using Pydantic models. This ensures that every response follows a predictable structure.
```python
class Product(BaseModel):
    title: str
    price: str
    rating: str

class ProductList(BaseModel):
    products: list[Product]
```

This structure is important because it allows Claude to return multiple products in a consistent format. Without it, responses could vary in format, making them harder to process in downstream applications.
Scraping Web Pages Using Proxying
The first step in the scraping pipeline is retrieving HTML from a website. We define a function that takes a URL and returns its HTML content. Inside this function, we configure Proxying residential proxies so that all requests are routed through external IPs instead of our local machine.
This is important because many websites detect and block repeated requests coming from the same IP address. By using residential proxies, each request appears to come from a different real user, which significantly improves the success rate.
We also include browser-like headers such as a user agent and language preferences. These headers make the request appear more natural and reduce the chances of getting blocked.
```python
def get_html(url: str) -> str:
    # Route all traffic through Proxying's residential gateway.
    proxy_url = f"http://{PROXYING_USERNAME}:{PROXYING_PASSWORD}@gate.proxying.io:10000"
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    # Browser-like headers reduce the chance of being blocked.
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(url, headers=headers, proxies=proxies)
    response.raise_for_status()
    return response.text
```

When the function runs, it sends a request through Proxying, receives the full HTML page, verifies that the request was successful, and then returns the raw HTML for further processing.
At this stage, the data is completely unstructured.
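For example, fetching the practice site used in the next section (the public Books to Scrape demo) looks like this:

```python
html = get_html("https://books.toscrape.com/")
print(len(html))  # size of the raw, unstructured HTML
```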
Scraping Books to Scrape Using Claude
To understand how Claude processes HTML, we start with a simple practice website called Books to Scrape. This site is designed for learning scraping techniques and has a clean, consistent HTML structure.
We now define a function that sends this HTML to Claude and asks it to extract structured book data.
```python
def parse_books(html: str) -> ProductList:
    client = instructor.from_anthropic(
        anthropic.Anthropic(api_key=CLAUDE_API_KEY)
    )
    prompt = f"Extract book title, price, and rating from the following HTML: {html}"
    return client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": prompt}]
        }],
        response_model=ProductList,
    )
```

In this function, the HTML is passed directly to Claude along with a clear instruction. Claude then analyzes the structure of the page and identifies repeating patterns that represent books. It ignores unrelated elements such as navigation bars and footer sections and focuses only on meaningful product data.
The result is a structured list of books that matches our predefined schema, which can be directly used in Python without additional parsing.
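Putting both functions together gives a minimal end-to-end run against the Books to Scrape homepage:

```python
html = get_html("https://books.toscrape.com/")
books = parse_books(html)

for book in books.products:
    print(book.title, book.price, book.rating)
```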
Scraping a Realistic E-commerce Page
After understanding the basic workflow, we move to a more realistic example: the WebScraper.io laptop listing page. This page simulates a real e-commerce structure with multiple product cards, prices, and ratings.
https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
The scraping logic remains the same, but the HTML becomes more complex. Instead of manually inspecting nested structures or writing selectors, we simply send the HTML to Claude and let it extract meaningful product information.
```python
def parse_products(html: str) -> ProductList:
    client = instructor.from_anthropic(
        anthropic.Anthropic(api_key=CLAUDE_API_KEY)
    )
    prompt = f"Extract product name, price, and rating from the following HTML: {html}"
    return client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": prompt}]
        }],
        response_model=ProductList,
    )
```

This function works by passing the full HTML of the WebScraper.io page to Claude, along with a clear instruction on what we want to extract. Internally, Claude analyzes the structure of the page, identifies repeating product patterns, and separates meaningful information such as product titles, prices, and ratings from irrelevant layout elements like containers or spacing divs. The important point here is that we are no longer defining how to extract data; instead, we are defining what we want, and Claude handles the extraction logic automatically.
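A quick usage sketch, reusing get_html with the laptop listing URL shown above:

```python
html = get_html("https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops")
laptops = parse_products(html)

for product in laptops.products:
    print(product.title, product.price, product.rating)
```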
Because the page is more realistic, the HTML contains more nesting and structural noise compared to the earlier example.
Cleaning HTML Before Sending to Claude
When dealing with larger web pages, sending raw HTML directly to Claude can be inefficient because it includes unnecessary elements like scripts, styles, and metadata. These elements do not contribute to data extraction and only increase token usage.
To solve this, we clean the HTML before sending it to Claude. This step removes unnecessary components and keeps only meaningful content. As a result, Claude can focus on relevant data, which improves both accuracy and performance.
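A minimal cleaning helper using BeautifulSoup might look like the sketch below; the list of tags to strip is an illustrative choice, not a fixed rule:

```python
def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that never contain extractable product data.
    for tag in soup(["script", "style", "noscript", "svg", "head"]):
        tag.decompose()
    return str(soup)
```

Passing clean_html(html) to parse_products instead of the raw page keeps the prompt focused on visible content and typically reduces token usage considerably.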
Why This Approach Is Powerful
The main advantage of this system is that it removes dependency on fragile scraping logic. Instead of writing and maintaining selectors manually, Claude dynamically interprets page structure and extracts data based on context.
This makes the scraper more flexible and significantly easier to maintain over time. When combined with Proxying, the system also becomes scalable because requests are distributed across multiple IP addresses, reducing the risk of blocking.
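As a final step, the pieces above compose into a small reusable entry point. This is a sketch built from the functions defined earlier (including the hypothetical clean_html helper from the previous section):

```python
def scrape(url: str) -> ProductList:
    html = get_html(url)                      # fetched through Proxying residential proxies
    return parse_products(clean_html(html))   # trim noise, then let Claude extract
```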
Limitations and Practical Considerations
Although this approach is powerful, it is not without limitations. Large HTML pages can increase token usage, which affects cost and performance. Some websites also rely heavily on JavaScript rendering, which may require additional tools like browser automation frameworks.
Additionally, not all HTML structures are clean or consistent, so Claude may occasionally require more precise prompts to extract accurate results.
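In those cases, tightening the prompt usually helps. The wording below is only an illustration of the kind of constraints you might add:

```python
prompt = (
    "Extract every product from the HTML below. "
    "Return the title exactly as displayed, the price including its currency symbol, "
    "and the rating exactly as shown on the page. "
    "Ignore navigation, header, and footer content.\n"
    f"{html}"
)
```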
Despite these limitations, the combination of Claude, Python, and Proxying provides a strong and modern foundation for building scalable scraping systems.
Conclusion
Claude represents a major shift in how web scraping systems are designed. Instead of relying entirely on manual parsing logic, developers can now use AI to interpret and structure HTML automatically.
When combined with Python and Proxying residential proxies, this approach enables the creation of flexible, scalable, and intelligent scraping pipelines that are far easier to maintain than traditional systems.
