Chapter 11: XML Parser
1. What does an “XML Parser” actually do?
An XML parser is a piece of software (usually a library) that:
- Reads raw XML text (string or file)
- Checks if it is well-formed (follows XML syntax rules)
- Breaks it into a structured representation that a program can easily use
- (Optionally) validates it against a schema (XSD, DTD…)
Think of the parser as a very strict librarian who:
- Reads your handwritten book (the XML)
- Checks whether the chapters are properly numbered and nested (well-formedness)
- Makes a neat table of contents + index so you can quickly find any information (tree / events)
- (Optionally) checks whether the content follows the official rules of the library (validation)
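The well-formedness step can be illustrated with a small sketch using Python's standard library. The helper name `is_well_formed` is hypothetical, chosen for illustration; it only reports whether the parser accepted the text and does not perform schema validation.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses without a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for problems such as unclosed or mismatched tags
        return False


good = "<order><total>998.00</total></order>"
bad = "<order><total>998.00</order>"  # <total> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document can pass this check and still be invalid against an XSD or DTD; validation is a separate, optional step.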
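2. The Main Families of XML Parsers
Almost every programming language offers two philosophies for parsing XML: build the whole tree in memory (DOM), or stream through the document piece by piece. Streaming comes in two flavors, push (SAX) and pull, so in practice you will meet three parser types: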
| Parser Type | Also called | Memory usage | Speed | Best for | Gives you | Most common names |
|---|---|---|---|---|---|---|
| DOM | Tree-based | High | Slower | When you need to read + modify freely | Complete tree in memory | DOM, DocumentBuilder (Java), xml.etree.ElementTree (Python, tree-based but not the W3C DOM API), DOMParser (JS) |
| SAX | Event-based / Push | Very low | Faster | Huge files, memory-constrained, one-pass reading | Calls your functions on events | SAX (Java), xml.sax (Python), Expat, SAX-like APIs in many languages |
| Pull / Streaming | Pull parser | Low | Fast | Middle ground between DOM and SAX | You control when to read the next token | StAX (Java), XmlReader (C#), XmlPullParser (Android), lxml.etree.iterparse (Python) |
Very important modern reality (2025–2026):
- DOM → still very popular when file is small/medium and you need random access
- Pure SAX → less common in new code (old-school), but still found in many existing systems
- Pull parsers (StAX, XmlReader, iterparse…) → a common choice in modern code that handles large inputs
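Since the SAX style is described above only in words, here is a minimal sketch of what "calls your functions on events" looks like with Python's built-in `xml.sax`. The handler class `ItemNameHandler` and the sample document are made up for illustration.

```python
import xml.sax


class ItemNameHandler(xml.sax.ContentHandler):
    """Collects the text of every <name> element that sits inside an <item>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_item = False
        self._buffer = None  # list of text chunks while inside item/name

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
        elif name == "name" and self._in_item:
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None
        elif name == "item":
            self._in_item = False


xml_bytes = b"""<order>
  <customer><name>Samarth Jain</name></customer>
  <items>
    <item sku="TS-BLK-M"><name>Black T-Shirt Medium</name></item>
  </items>
</order>"""

handler = ItemNameHandler()
xml.sax.parseString(xml_bytes, handler)
print(handler.names)  # ['Black T-Shirt Medium']
```

Note how the handler has to track its own state (`_in_item`) to tell the customer's `<name>` from the item's `<name>`. That bookkeeping is the main reason SAX is harder to work with than a tree.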
3. DOM Parser – The Tree in Memory (Most Intuitive)
How it works:
- Parser reads entire XML
- Builds complete tree in memory
- You get a navigable object tree
Advantages:
- Very easy to understand and use
- You can go up/down/left/right freely
- Modify the tree and write it back
Disadvantages:
- Uses a lot of memory (the in-memory tree is often several times the size of the XML text; 2–10× is a commonly quoted range)
- Slow for very large files
Real example – Python (xml.etree.ElementTree – very popular)
```python
import xml.etree.ElementTree as ET

xml_string = """
<order orderId="ORD-20250710-4521">
  <customer>
    <name>Samarth Jain</name>
    <email>samarth.j@example.com</email>
  </customer>
  <items>
    <item sku="TS-BLK-M">
      <name>Black T-Shirt Medium</name>
      <quantity>2</quantity>
      <price>499.00</price>
    </item>
  </items>
  <total>998.00</total>
</order>
"""

# Parse the XML
root = ET.fromstring(xml_string)

# Navigate freely
order_id = root.get('orderId')                    # ORD-20250710-4521
customer_name = root.find('customer/name').text   # Samarth Jain
items = root.find('items')
first_item_name = items.find('item/name').text    # Black T-Shirt Medium

print(f"Order: {order_id}")
print(f"Customer: {customer_name}")
print(f"First item: {first_item_name}")
```
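The "modify the tree and write it back" advantage can be sketched with the same library. This is an illustrative example on a trimmed-down version of the order document; the recalculation of `<total>` is an assumption about what a real update might do.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<order orderId="ORD-20250710-4521">'
    '<items><item sku="TS-BLK-M">'
    '<quantity>2</quantity><price>499.00</price>'
    '</item></items>'
    '<total>998.00</total>'
    '</order>'
)

# Change a value in place
item = root.find("items/item")
item.find("quantity").text = "3"

# Recompute the total from the updated tree (float is fine for a sketch;
# real billing code would more likely use decimal.Decimal)
new_total = int(item.find("quantity").text) * float(item.find("price").text)
root.find("total").text = f"{new_total:.2f}"

# Serialize the modified tree back to XML text
updated_xml = ET.tostring(root, encoding="unicode")
print(updated_xml)
```

To write to a file instead of a string, `ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)` is the usual standard-library route.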
Java example (very classic DOM way)
```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomOrderExample {
    public static void main(String[] args) throws Exception {
        // Parse the whole file into an in-memory tree
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("order.xml");

        Element root = doc.getDocumentElement();
        String orderId = root.getAttribute("orderId");

        // getElementsByTagName searches all descendants, not just direct children
        NodeList items = root.getElementsByTagName("item");
        Element firstItem = (Element) items.item(0);
        String sku = firstItem.getAttribute("sku");

        System.out.println("Order " + orderId + ", first SKU: " + sku);
    }
}
```
4. Streaming / Pull Parser – Modern & Memory Efficient
How it works:
- Parser gives you one piece at a time (start tag, text, end tag…)
- You decide when to read the next piece
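- You never need to hold the whole document in memory (with iterparse-style APIs this is only true if you clear elements once you are done with them)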
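The "one piece at a time" idea can be seen in miniature with `xml.etree.ElementTree.XMLPullParser` from Python's standard library. This is a sketch: the chunk boundaries below are arbitrary and stand in for data arriving from a file or socket.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "end"))

# Simulate input arriving in pieces; a tag may even be split across chunks
chunks = ['<order orderId="ORD-1"><to', 'tal>998.00</total>', '</order>']

seen = []
for chunk in chunks:
    parser.feed(chunk)
    # We decide when to ask for events; nothing is pushed at us
    for event, elem in parser.read_events():
        seen.append((event, elem.tag))

parser.close()
# Pick up any events that were only finalized by close()
seen.extend((event, elem.tag) for event, elem in parser.read_events())

print(seen)
```

The calling code stays in control of the loop, which is the key difference from SAX, where the parser drives and calls your handler.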
Most popular modern variants:
- Java → StAX (Streaming API for XML) → XMLStreamReader
- C# → XmlReader
- Python → lxml.etree.iterparse or xml.etree.ElementTree.iterparse
- Android → XmlPullParser
Real example – Python lxml.iterparse (very memory efficient)
```python
from lxml import etree

context = etree.iterparse("very_large_orders.xml", events=("start", "end"))

for event, elem in context:
    if event == "end" and elem.tag == "order":
        order_id = elem.get("orderId")
        total = elem.find("total").text
        print(f"Processed order {order_id} - Total: {total}")

        # Very important: clear memory!
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
```
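If lxml is not available, the same pattern works with the standard library's `xml.etree.ElementTree.iterparse`. The sketch below uses a tiny in-memory document (the order IDs are invented) so it can be run as-is; with a real file you would pass the path instead of the `BytesIO` object.

```python
import io
import xml.etree.ElementTree as ET

data = io.BytesIO(b"""<orders>
  <order orderId="A1"><total>998.00</total></order>
  <order orderId="A2"><total>250.50</total></order>
</orders>""")

totals = {}
# Only "end" events are requested: an element is complete when its end tag is seen
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "order":
        totals[elem.get("orderId")] = elem.findtext("total")
        # Drop the element's children, text, and attributes to free memory
        elem.clear()

print(totals)  # {'A1': '998.00', 'A2': '250.50'}
```

The standard-library elements have no `getprevious()` or `getparent()`, so the emptied `<order>` shells stay attached to the root. For very large files, a common workaround is to keep a reference to the root element and call `root.clear()` periodically.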
Java StAX example (very common in enterprise code)
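```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxOrderExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = new FileInputStream("orders.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            // Pull events one at a time; nothing is read until next() is called
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "order".equals(reader.getLocalName())) {
                    String orderId = reader.getAttributeValue(null, "orderId");
                    System.out.println("Order ID: " + orderId);
                }
            }
            reader.close(); // releases parser resources; does not close the stream
        }
    }
}
```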
5. Quick Comparison Table – When to Choose What
| Situation | Best choice | Why? |
|---|---|---|
| Small / medium file (< 5–10 MB) | DOM | Simple, readable, full navigation |
| Very large file (> 50 MB – several GB) | Streaming / Pull / iterparse | Memory usage stays low |
| You only need to extract few fields | Streaming / Pull | Fastest & lowest memory |
| You need to modify the structure | DOM | Easy to change tree and write back |
| You are writing new code in 2025–2026 | Pull parser (StAX, XmlReader, iterparse) | Best balance of speed, memory, control |
| Very old legacy system | DOM or SAX | Still very common in old codebases |
6. Very Common Real-World Use Cases
- Reading configuration → usually DOM (small file)
- Processing e-invoices / EDI → streaming (huge volume)
- Android apps → XmlPullParser (battery + memory important)
- Spring / Java EE → often StAX or DOM
- Python scripts for ETL / data import → lxml.etree.iterparse
- Browser JavaScript → DOMParser (built in); Node.js has no built-in DOMParser, so a third-party XML library is needed there
Quick Summary – What You Should Remember
- DOM → whole tree in memory → easy but memory-hungry
- SAX → event callbacks → old-school, low memory, hard to navigate backward
- Pull / Streaming → you control reading → modern sweet spot
- For new code that handles large XML, pull parsers are a common default
- Always remember to clear elements when using iterparse / streaming
