
Python Parse XML: A Step-by-Step Guide for Scraping and Automation


Whether you are scraping product listings, handling API responses, or parsing sitemap files, XML is one of the most common data formats you’ll run into. While JSON has become the go-to for most APIs, XML still powers a large portion of the web, especially when dealing with enterprise systems, older websites, and RSS feeds.

If you’re building Python-based scraping tools or proxy-powered automation scripts, understanding how to parse XML is essential.

In this guide, we’ll break down multiple ways to parse XML in Python, with working examples, so you can pick the right method for your use case.

What is XML?

XML (eXtensible Markup Language) is a markup language designed to store and transport data. It uses a hierarchical, tag-based structure that resembles HTML, but its purpose is to describe data, not display it.

Here’s a quick example of an XML file:

<products>
  <product>
    <id>1</id>
    <name>Noise Cancelling Headphones</name>
    <price>249.99</price>
  </product>
  <product>
    <id>2</id>
    <name>Wireless Mouse</name>
    <price>29.99</price>
  </product>
</products>

This kind of structure is common in eCommerce sitemaps, product feeds, and even some legacy APIs, all of which are goldmines for web scraping and data extraction.

Python Libraries For Parsing XML

Python offers several built-in and third-party libraries for working with XML.

Here are the three most common:

| Library | Type | Pros | Best For |
| --- | --- | --- | --- |
| xml.etree.ElementTree | Built-in | Lightweight, easy to use | Beginners, simple XML structures |
| xml.dom.minidom | Built-in | Prettier formatting, DOM-style | Pretty-printing, smaller files |
| lxml | Third-party | Fast, supports XPath, robust | Large files, complex queries |

Method 1: Using ElementTree

Importing and Parsing

import xml.etree.ElementTree as ET

tree = ET.parse('products.xml')
root = tree.getroot()

If you already have the XML as a string (e.g., from an HTTP response), use:

root = ET.fromstring(xml_string)

Extracting Data

for product in root.findall('product'):
    name = product.find('name').text
    price = product.find('price').text
    print(f"{name}: ${price}")

Why Use This?

  • It’s built-in and requires no installation.
  • Great for lightweight XML parsing in simple scraping tasks.
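One caveat: the extraction snippet above assumes every <name> and <price> element exists. In real-world feeds, elements are sometimes missing, and calling .text on the None returned by find() raises an AttributeError. A defensive sketch, using findtext() with a default value and a made-up inline feed:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed where the second product is missing its <price>.
xml_string = """
<products>
  <product><name>Headphones</name><price>249.99</price></product>
  <product><name>Mystery Item</name></product>
</products>
"""

root = ET.fromstring(xml_string)
for product in root.findall('product'):
    # findtext() returns the default instead of raising an
    # AttributeError when a child element is missing.
    name = product.findtext('name', default='unknown')
    price = product.findtext('price', default='n/a')
    print(f"{name}: {price}")
```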

Method 2: Using minidom 

Minidom provides a DOM-like interface for working with XML documents.

Example

from xml.dom import minidom
dom = minidom.parse('products.xml')
products = dom.getElementsByTagName('product')
for product in products:
    name = product.getElementsByTagName('name')[0].firstChild.nodeValue
    print("Product:", name)

Prettify XML Output

pretty_xml = dom.toprettyxml()
print(pretty_xml)

Best For:

  • Pretty-printing.
  • Smaller XML files.

Note that minidom is not ideal for performance-heavy tasks.

Method 3: Using lxml

Lxml is a third-party library known for its speed and XPath support.

Installation 

pip install lxml

Parsing and Querying with XPath

from lxml import etree

tree = etree.parse('products.xml')
products = tree.xpath('//product')
for product in products:
    name = product.xpath('name/text()')[0]
    price = product.xpath('price/text()')[0]
    print(f"{name} — ${price}")

Why use lxml?

  • Handles large files efficiently.
  • Ideal for scraping at scale using proxies or automation tools.
  • XPath makes it easier to target complex elements, especially those deeply nested.
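As a rough illustration of the large-file point, lxml's iterparse() can stream elements one at a time instead of loading the whole tree into memory. This sketch uses a small in-memory stand-in for what would normally be a multi-gigabyte feed file:

```python
from io import BytesIO
from lxml import etree

# In-memory stand-in for a large product feed on disk.
xml_bytes = b"""<products>
  <product><name>Headphones</name></product>
  <product><name>Mouse</name></product>
</products>"""

names = []
# iterparse yields each <product> as its closing tag is read,
# so the full document never has to fit in memory at once.
for _, elem in etree.iterparse(BytesIO(xml_bytes), tag='product'):
    names.append(elem.findtext('name'))
    elem.clear()  # free the element once processed

print(names)
```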

Parsing XML from a Proxy API Request

Let’s say you are scraping a proxy-enabled API that returns an XML response. Here’s how you could handle it:

import requests
from lxml import etree
proxy_url = "http://your-proxy-url:port"
response = requests.get("https://example.com/data.xml", proxies={"http": proxy_url, "https": proxy_url})
tree = etree.fromstring(response.content)
items = tree.xpath('//item')
for item in items:
    title = item.findtext('title')
    print("Title:", title)

Proxying.io users often scrape public sitemaps, product feeds, or search engine data that comes in XML format; this approach fits perfectly.

Tips for Working with XML in Python

  • Use XPath in lxml when working with deeply nested structures.
  • Always validate your XML source; malformed XML can crash your script.
  • Convert XML to JSON if your pipeline expects JSON format.
  • Use proxies when scraping rate-limited XML endpoints, such as sitemap.xml or feed.xml.
  • Handle encoding (UTF-8, ISO-8859-1, etc.) to avoid UnicodeDecodeError exceptions.
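For the XML-to-JSON tip above, one minimal approach (shown here with a hypothetical product feed) is to flatten each element's children into a dict before serializing:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical product feed, matching the structure used earlier.
xml_string = """
<products>
  <product><id>1</id><name>Headphones</name><price>249.99</price></product>
  <product><id>2</id><name>Mouse</name><price>29.99</price></product>
</products>
"""

root = ET.fromstring(xml_string)
# Flatten each <product> into a dict of tag -> text.
products = [
    {child.tag: child.text for child in product}
    for product in root.findall('product')
]
print(json.dumps(products, indent=2))
```

This simple flattening only works for shallow structures; deeply nested or attribute-heavy XML needs a recursive converter or a library built for the job.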

Conclusion

Knowing how to parse XML in Python using ElementTree, minidom, or lxml gives you a serious edge when building robust scripts that consume structured data. Whether you’re scraping search engine results, parsing sitemaps, or reading product feeds, being comfortable with XML makes you more adaptable and effective.

And if you want your scripts to scale without interruption, pair your XML scraper with residential or datacenter proxies from Proxying.io to bypass geo-blocks, rate limits, and firewalls.

Frequently Asked Questions (FAQs)

Can I read tag attributes as well as text content?

Yes, most Python XML libraries allow you to read tag attributes (like href, id, or type) in addition to the text content. This is useful when scraping links, metadata, or API identifiers.
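For example, here is how attribute access might look with ElementTree, assuming a hypothetical feed where each product carries an id attribute and a link child with an href:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with attributes on <product> and <link>.
xml_string = """
<products>
  <product id="1">
    <name>Noise Cancelling Headphones</name>
    <link href="https://example.com/p/1"/>
  </product>
</products>
"""

root = ET.fromstring(xml_string)
for product in root.findall('product'):
    # .get() reads a single attribute; .attrib exposes them all as a dict.
    product_id = product.get('id')
    href = product.find('link').get('href')
    print(product_id, href)
```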

What is the difference between fromstring() and parse()?

fromstring() is used when the XML is in string form, such as data returned from a web request. parse() is used when loading XML directly from a file stored on disk.

How do I handle malformed XML?

Use lxml’s recovery mode or sanitize your input before parsing.
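A minimal sketch of lxml's recovery mode, using a deliberately broken snippet (the closing product tag is missing):

```python
from lxml import etree

# Malformed XML: the closing </product> tag is missing.
broken = b"<products><product><name>Mouse</name></products>"

# recover=True tells libxml2 to repair what it can instead of raising.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken, parser)
print(root.findtext('product/name'))
```

Recovery is best-effort: for badly mangled input, validate or sanitize the source first rather than relying on the parser to guess the structure.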
