Last modified: June 06, 2026
XML stands for Extensible Markup Language. It is a text-based format for representing structured information.
Unlike HTML, XML does not give you a fixed list of tags such as <p> or <table>. You define tags that describe the information in your own domain:
<employee id="E1001">
<name>Amina Rahman</name>
<department>Engineering</department>
</employee>
Here, <employee>, <name>, and <department> are not built-in XML instructions. They are names chosen by the document author.
A useful mental model is:
XML stores data as a labeled tree. Elements create the branches, text holds values, attributes add small pieces of information, and optional schemas define which shapes are acceptable.
XML is still important in document formats, SOAP services, configuration files, SVG, Office documents, data feeds, and industries that rely on formal validation.
Start with a realistic but compact example:
<?xml version="1.0" encoding="UTF-8"?>
<employee id="E1001" active="true">
<name>Amina Rahman</name>
<department code="ENG">Engineering</department>
<skills>
<skill level="advanced">Python</skill>
<skill level="intermediate">SQL</skill>
</skills>
</employee>
Read it in plain English:
- The root element is employee.
- The employee has the id E1001 and is active.
- The employee's name is Amina Rahman.
- The department is Engineering, with the code ENG.
flowchart TB
A["employee<br/>id='E1001', active='true'"]
A --> B["name<br/>'Amina Rahman'"]
A --> C["department<br/>code='ENG'<br/>'Engineering'"]
A --> D["skills"]
D --> E["skill<br/>level='advanced'<br/>'Python'"]
D --> F["skill<br/>level='intermediate'<br/>'SQL'"]
That tree structure is central to XML. The same idea appears in XPath queries, DOM parsers, XSLT transformations, and XSD validation.
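To make the tree model concrete, here is a minimal sketch that walks the employee example with Python's standard `xml.etree.ElementTree` and prints one indented line per element. The helper name `describe` is an illustration, not part of any library, and the document is a trusted inline string.

```python
import xml.etree.ElementTree as ET

# Trusted, inline copy of the employee example from the text.
DOCUMENT = """\
<employee id="E1001" active="true">
  <name>Amina Rahman</name>
  <department code="ENG">Engineering</department>
  <skills>
    <skill level="advanced">Python</skill>
    <skill level="intermediate">SQL</skill>
  </skills>
</employee>
"""


def describe(element: ET.Element, depth: int = 0) -> list[str]:
    """Return one indented line per element: tag, attributes, and own text."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag
    if element.attrib:
        line += " " + str(dict(element.attrib))
    if text:
        line += " " + repr(text)
    lines = [line]
    # Recurse into child elements, one level deeper each time.
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
outline = describe(root)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the source document.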
An XML document is built mainly from elements, attributes, and text values.
An element usually has an opening tag, content, and a closing tag:
<department>Engineering</department>
An element with no content can be written with an empty-element tag:
<middleName/>
Nested elements express structure:
<address>
<city>Berlin</city>
<country>Germany</country>
</address>
Attributes appear inside the opening tag:
<employee id="E1001" active="true">
Attributes are often useful for identifiers, classifications, flags, or small metadata-like values. Elements are generally better when a value may repeat, contain nested structure, or be extended later.
For example:
<!-- Convenient for a short identifier -->
<employee id="E1001"/>
<!-- Better for repeating structured information -->
<employee>
<phone type="mobile">+49 30 555 0101</phone>
<phone type="work">+49 30 555 0102</phone>
</employee>
There is no universal rule that all “data” must be elements and all “metadata” must be attributes. Choose the representation that remains clear as the format grows.
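The difference also shows up when reading the data. The sketch below uses Python's standard `xml.etree.ElementTree` on a hypothetical record that combines the two examples above: an attribute yields a single value, while repeated child elements yield a list.

```python
import xml.etree.ElementTree as ET

# Hypothetical record combining both examples from the text (trusted inline string).
record = ET.fromstring(
    '<employee id="E1001">'
    '<phone type="mobile">+49 30 555 0101</phone>'
    '<phone type="work">+49 30 555 0102</phone>'
    "</employee>"
)

# An attribute name can appear only once per element, so get() returns one value.
employee_id = record.get("id")

# Child elements can repeat, so findall() returns every match.
phones = {phone.get("type"): phone.text for phone in record.findall("phone")}

print(employee_id)
print(phones)
```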
Text lives between tags:
<name>Amina Rahman</name>
Some XML documents are mostly data, such as configuration or API messages. Others are documents containing mixed text and markup:
<paragraph>Please review the <emphasis>updated</emphasis> policy.</paragraph>
This ability to mix text and structure is one reason XML remains useful for publishing and document formats.
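Mixed content is worth seeing from the parser's side. As a small sketch with Python's standard `xml.etree.ElementTree`, the sentence above ends up split across the `text` and `tail` attributes of two elements; other tree APIs model this differently.

```python
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    "<paragraph>Please review the <emphasis>updated</emphasis> policy.</paragraph>"
)
emphasis = paragraph.find("emphasis")

# ElementTree stores mixed content in three places:
leading = paragraph.text   # text before the first child element
marked = emphasis.text     # text inside the child element
trailing = emphasis.tail   # text after the child's closing tag

# itertext() yields all text in document order, which rebuilds the sentence.
plain = "".join(paragraph.itertext())
print(plain)
```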
A document is well-formed when it follows XML's basic grammar. A parser cannot safely build the tree unless these rules are satisfied.
| Rule | Correct | Incorrect |
| Exactly one root element | <report><title>Q1</title></report> | <title>Q1</title><total>7</total> |
| Tags are case-sensitive | <name>Amina</name> | <Name>Amina</name> |
| Elements must be properly nested | <a><b/></a> | <a><b></a></b> |
| Attribute values must be quoted | <item id="7"/> | <item id=7/> |
| A closing tag is required unless the element uses an empty-element tag | <status/> | <status> |
| Reserved characters must be handled correctly | <expr>3 &lt; 5 &amp; 7 &gt; 2</expr> | <expr>3 < 5 & 7 > 2</expr> |
In normal character data, < and & must be escaped because they begin markup or an entity reference. The > character is usually allowed as text, although &gt; is often used for readability, and the character sequence ]]> cannot appear in ordinary character data.
| Character | Write it as | Reason |
| < | &lt; | Avoids beginning a new tag |
| > | &gt; | Often clearer; required in some special contexts |
| & | &amp; | Avoids beginning an entity reference |
| " | &quot; | Needed when inside a double-quoted attribute value |
| ' | &apos; | Needed when inside a single-quoted attribute value |
Example:
<rule expression="score &gt;= 80">
Tom &amp; Amina passed because 3 &lt; 5.
</rule>
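When XML is generated by a program, escaping is usually delegated to a library rather than done by hand. The following sketch uses `escape` and `quoteattr` from Python's standard `xml.sax.saxutils` to produce a rule element like the one above, then parses it back to confirm the original strings survive the round trip.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

text = "Tom & Amina passed because 3 < 5."
expression = "score >= 80"

# escape() rewrites &, < and > for use as character data.
escaped_text = escape(text)
# quoteattr() escapes the value and wraps it in quotes for use as an attribute.
quoted_expression = quoteattr(expression)

markup = f"<rule expression={quoted_expression}>{escaped_text}</rule>"
print(markup)

# Parsing the result gives back the original, unescaped strings.
rule = ET.fromstring(markup)
print(rule.get("expression"), "|", rule.text)
```

Building the tree with an XML library and serializing it is another common option, since the serializer applies the same escaping automatically.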
You will often see this first line:
<?xml version="1.0" encoding="UTF-8"?>
It identifies the XML version and the declared character encoding. It is a good convention for exchanged or stored documents, especially where encoding must be explicit. It is not mandatory in every XML 1.0 document; when it is present, it belongs at the beginning of the document.
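The well-formedness rules can be checked mechanically by asking a parser to build the tree. This is a minimal sketch using Python's standard `xml.etree.ElementTree` on trusted inline strings; the helper `is_well_formed` and the sample labels are illustrative names. It also shows that a document without an XML declaration still parses.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser can build a tree from xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Samples taken from the rule table, plus two declaration cases.
samples = {
    "one root": "<report><title>Q1</title></report>",
    "two roots": "<title>Q1</title><total>7</total>",
    "case mismatch": "<Name>Amina</name>",
    "bad nesting": "<a><b></a></b>",
    "unquoted attribute": "<item id=7/>",
    "unescaped ampersand": "<expr>3 &lt; 5 & 7 > 2</expr>",
    "no declaration": "<status/>",
    "with declaration": '<?xml version="1.0"?><status/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```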
A well-formed XML file obeys XML syntax. A valid XML file also follows a separate set of domain rules, usually described by a schema such as XSD or a DTD.
For example, this is well-formed XML:
<employee>
<favouritePlanet>Mars</favouritePlanet>
</employee>
But it may be invalid for an employee system that requires an id, a name, and a department.
flowchart TD
A["XML document"] --> B{"Well-formed?<br/>Correct XML syntax"}
B -- "No" --> C["Parser error:<br/>cannot build a reliable tree"]
B -- "Yes" --> D{"Valid?<br/>Matches schema rules"}
D -- "No" --> E["Validation error:<br/>tree is XML, but wrong shape or data"]
D -- "Yes" --> F["Accept for application use"]
A short way to remember the distinction:
| Term | Question being asked |
| Well-formed | “Is this legal XML syntax?” |
| Valid | “Does this legal XML conform to the structure and data rules my application expects?” |
Different systems may reuse the same element names for different meanings. For example, an HR vocabulary and an office-location vocabulary could both define a <department> element.
A namespace qualifies names using a URI:
<?xml version="1.0" encoding="UTF-8"?>
<employee xmlns:hr="https://example.com/hr"
xmlns:site="https://example.com/site"
id="E1001">
<name>Amina Rahman</name>
<hr:department code="ENG">Engineering</hr:department>
<site:department floor="4">Berlin Product Hub</site:department>
</employee>
The prefixes hr and site are convenient labels. The namespace URIs identify the vocabularies. They act as names; a parser does not need to visit those web addresses.
flowchart LR
A["department"] --> B["Collision:<br/>which meaning?"]
C["hr:department"] --> D["https://example.com/hr<br/>Engineering team"]
E["site:department"] --> F["https://example.com/site<br/>Building area"]
A prefixed namespace is declared like this:
xmlns:hr="https://example.com/hr"
A default namespace is declared like this:
xmlns="https://example.com/employees"
With a default namespace, unprefixed element names in its scope belong to that namespace:
<employee xmlns="https://example.com/employees">
<name>Amina Rahman</name>
</employee>
A common surprise is that an unprefixed attribute does not automatically join the default namespace. In the example above, employee and name are in the default namespace, while an attribute such as id="E1001" remains unqualified unless it has its own prefix.
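This behaviour is easy to observe. In the sketch below, Python's standard `xml.etree.ElementTree` reports element names in its `{namespace-uri}local-name` notation, while the unprefixed `id` attribute keeps a plain key. The `id` attribute is added to the example for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<employee xmlns="https://example.com/employees" id="E1001">'
    "<name>Amina Rahman</name>"
    "</employee>"
)

# Elements in the default namespace carry the URI in their tag.
print(root.tag)
print(root[0].tag)

# The unprefixed attribute has no namespace part in its key.
print(list(root.attrib))
```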
XPath is a language for selecting nodes or calculating values from an XML tree.
Using this document:
<employees>
<employee id="E1001">
<name>Amina Rahman</name>
<department>Engineering</department>
<skills>
<skill level="advanced">Python</skill>
<skill level="intermediate">SQL</skill>
</skills>
</employee>
<employee id="E1002">
<name>Jonas Weber</name>
<department>Finance</department>
<skills>
<skill level="advanced">Excel</skill>
</skills>
</employee>
</employees>
Some useful XPath expressions are:
| XPath | Meaning |
| /employees/employee | All employee children of the root employees element |
| /employees/employee/name | Each employee's name element |
| //skill | Every skill element anywhere in the document |
| //skill[@level='advanced'] | Skills whose level attribute is advanced |
| /employees/employee[@id='E1002']/name/text() | Text of the name for employee E1002 |
flowchart TD
R["employees"] --> E1["employee @id='E1001'"]
R --> E2["employee @id='E1002'"]
E1 --> N1["name: Amina Rahman"]
E1 --> S1["skill @level='advanced': Python"]
E2 --> N2["name: Jonas Weber"]
E2 --> S2["skill @level='advanced': Excel"]
Q["XPath: //skill[@level='advanced']"] -. selects .-> S1
Q -. selects .-> S2
Different libraries support different subsets or versions of XPath. Full XPath engines provide much more than simple path selection, including functions, comparisons, and expressions over values.
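As one illustration of a limited engine, Python's standard `xml.etree.ElementTree` supports only a subset of XPath. The sketch below adapts the expressions from the table to that subset: paths are written relative to the root element, and `findtext` stands in for `text()`, which ElementTree paths do not support.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""\
<employees>
  <employee id="E1001">
    <name>Amina Rahman</name>
    <department>Engineering</department>
    <skills>
      <skill level="advanced">Python</skill>
      <skill level="intermediate">SQL</skill>
    </skills>
  </employee>
  <employee id="E1002">
    <name>Jonas Weber</name>
    <department>Finance</department>
    <skills>
      <skill level="advanced">Excel</skill>
    </skills>
  </employee>
</employees>
""")

# root is already the <employees> element, so paths are relative to it.
employee_ids = [e.get("id") for e in root.findall("employee")]
names = [n.text for n in root.findall("employee/name")]
all_skills = [s.text for s in root.findall(".//skill")]
advanced = [s.text for s in root.findall(".//skill[@level='advanced']")]

# ElementTree paths select elements, not text nodes, so findtext() is used here.
jonas = root.findtext("employee[@id='E1002']/name")

print(employee_ids, names, all_skills, advanced, jonas, sep="\n")
```

A full XPath engine, available in some third-party libraries, would accept the table's expressions unchanged.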
When namespaces are used, queries must identify the namespace, not merely the visible prefix written in a source document.
For example:
<employees xmlns="https://example.com/employees">
<employee id="E1001">
<name>Amina Rahman</name>
</employee>
</employees>
A namespace-aware Python query can assign its own convenient query prefix:
namespaces = {"e": "https://example.com/employees"}
names = root.findall("e:employee/e:name", namespaces)
The prefix e in Python does not have to match a prefix in the original XML. The namespace URI is the important identity.
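A self-contained version of that query, as a sketch with Python's standard `xml.etree.ElementTree` and a trusted inline document, shows both the failure mode and the fix. The second prefix `x` is arbitrary and only there to show that the prefix itself carries no meaning.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<employees xmlns="https://example.com/employees">'
    '<employee id="E1001"><name>Amina Rahman</name></employee>'
    "</employees>"
)

namespaces = {"e": "https://example.com/employees"}

# Without a namespace mapping, the path looks for unqualified names and finds nothing.
unqualified = root.findall("employee/name")

# With the mapping, "e:name" expands to the qualified name.
qualified = root.findall("e:employee/e:name", namespaces)

# Any prefix works as long as it maps to the same URI.
renamed = root.findall("x:employee/x:name", {"x": namespaces["e"]})

print(len(unqualified), [n.text for n in qualified], [n.text for n in renamed])
```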
The following example reads XML, selects data from the tree, and prints a small report. It uses defusedxml, which is a safer drop-in choice when XML may come from outside your trust boundary.
employees.xml
<?xml version="1.0" encoding="UTF-8"?>
<employees>
<employee id="E1001">
<name>Amina Rahman</name>
<department>Engineering</department>
<skills>
<skill level="advanced">Python</skill>
<skill level="intermediate">SQL</skill>
</skills>
</employee>
<employee id="E1002">
<name>Jonas Weber</name>
<department>Finance</department>
<skills>
<skill level="advanced">Excel</skill>
</skills>
</employee>
</employees>
python -m pip install defusedxml
read_employees.py
from defusedxml import ElementTree as ET
from xml.etree.ElementTree import Element


def load_employees(filename: str) -> Element:
    # Parse the file with the hardened defusedxml parser and return the root element.
    tree = ET.parse(filename)
    return tree.getroot()


def print_summary(root: Element) -> None:
    employees = root.findall("employee")
    print(f"Employees: {len(employees)}")

    for employee in employees:
        # Fall back to placeholder values when an attribute or child is missing.
        employee_id = employee.get("id", "unknown")
        name = employee.findtext("name", default="(no name)")
        department = employee.findtext("department", default="(no department)")

        # Collect the text of every skill marked level="advanced".
        advanced = [
            skill.text
            for skill in employee.findall("./skills/skill[@level='advanced']")
            if skill.text
        ]
        advanced_text = ", ".join(advanced) if advanced else "none"

        print(f"{employee_id}: {name} — {department} — advanced: {advanced_text}")


if __name__ == "__main__":
    root = load_employees("employees.xml")
    print_summary(root)
python read_employees.py
Expected output
Employees: 2
E1001: Amina Rahman — Engineering — advanced: Python
E1002: Jonas Weber — Finance — advanced: Excel
What the code is doing:
sequenceDiagram
participant File as employees.xml
participant Parser as defusedxml parser
participant Tree as XML element tree
participant Code as Python code
File->>Parser: Read XML characters
Parser->>Tree: Build elements, attributes, text
Code->>Tree: findall("employee")
Code->>Tree: Find name, department, advanced skills
Tree-->>Code: Matched values
Code-->>Code: Print report
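Schemas let you state rules such as which elements and attributes are required, the order in which child elements appear, and which datatypes their values must have.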
A Document Type Definition is an older, compact schema language. It can describe element structure and attributes, but its datatype model and namespace support are limited.
A tiny DTD example:
<!ELEMENT employee (name, department)>
<!ATTLIST employee id ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT department (#PCDATA)>
XML Schema Definition, commonly called XSD, is itself XML and supports typed values, namespaces, restrictions, and reusable complex structures.
Here is a small schema for a single employee:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="department" type="xs:string"/>
<xs:element name="startYear" type="xs:positiveInteger"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
</xs:schema>
Valid instance:
<employee id="E1001">
<name>Amina Rahman</name>
<department>Engineering</department>
<startYear>2022</startYear>
</employee>
Invalid instance:
<employee>
<department>Engineering</department>
<name>Amina Rahman</name>
<startYear>not-a-year</startYear>
</employee>
The invalid example has no required id, places the elements in the wrong sequence, and supplies text where the schema expects a positive integer.
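Python's standard library has no XSD validator, so the sketch below does not perform schema validation. It only mirrors the three rules just listed with hand-written checks, to show what kind of failure each rule catches. The function `check_employee` and the constant names are hypothetical, and the integer check is simpler than the full `xs:positiveInteger` lexical rules.

```python
import xml.etree.ElementTree as ET

# Child order required by the xs:sequence in the schema above.
EXPECTED_CHILDREN = ["name", "department", "startYear"]

VALID = (
    '<employee id="E1001"><name>Amina Rahman</name>'
    "<department>Engineering</department><startYear>2022</startYear></employee>"
)
INVALID = (
    "<employee><department>Engineering</department>"
    "<name>Amina Rahman</name><startYear>not-a-year</startYear></employee>"
)


def check_employee(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means all three checks passed."""
    employee = ET.fromstring(xml_text)
    problems = []

    # Rule 1: the id attribute is required.
    if employee.get("id") is None:
        problems.append("missing required attribute: id")

    # Rule 2: children must appear in the declared sequence.
    children = [child.tag for child in employee]
    if children != EXPECTED_CHILDREN:
        problems.append(f"expected children {EXPECTED_CHILDREN}, found {children}")

    # Rule 3 (simplified): startYear must be an integer greater than zero.
    start_year = (employee.findtext("startYear") or "").strip()
    if not (start_year.isascii() and start_year.isdigit() and int(start_year) > 0):
        problems.append(f"startYear is not a positive integer: {start_year!r}")

    return problems


print(check_employee(VALID))
for problem in check_employee(INVALID):
    print(problem)
```

In practice a schema-aware library would apply the XSD directly, which keeps the rules in one place instead of duplicating them in code.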
| Feature | DTD | XSD |
| Written in XML syntax | No | Yes |
| Built-in datatype range | Limited | Rich, including numbers and dates |
| Namespace-aware schema design | Limited | Yes |
| Good fit for simple legacy document rules | Often | Possible |
| Good fit for typed enterprise interchange | Limited | Usually stronger |
Validation is separate from safe parsing. A schema can reject an incorrectly shaped document, but parser security controls are still needed when input is untrusted.
XML can be processed in more than one way. The right approach depends mainly on document size and whether your program needs random access to the entire tree.
flowchart LR
A["XML input"] --> B["Tree-based parser<br/>DOM / ElementTree style"]
A --> C["Event-based parser<br/>SAX"]
A --> D["Pull / streaming parser<br/>StAX or iterparse style"]
B --> E["Whole or substantial tree<br/>easy navigation and changes"]
C --> F["Callbacks while reading<br/>low retained memory"]
D --> G["Application asks for events<br/>stream-friendly control"]
| Approach | Main idea | Strengths | Trade-off |
| DOM / tree model | Load the document as nodes in a tree | Easy navigation, edits, repeated queries | Memory grows with document/tree size |
| SAX | Parser pushes events such as “start element” and “text” | Good for very large sequential processing | Callback-based code can be harder to organize |
| StAX / pull parsing | Application requests the next parsing event | Streaming with more application control | Not as convenient for arbitrary backward navigation |
| Python iterparse style | Iterate as elements finish parsing | Useful for large files and extraction tasks | Code must discard processed subtrees carefully |
Do not assume a streaming API automatically uses “constant memory” in every program. Memory stays low only if the application avoids retaining the whole parsed result.
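A minimal sketch of the iterparse style with Python's standard `xml.etree.ElementTree` follows. The in-memory `BytesIO` stands in for a large file, and the input is assumed to be trusted. Each employee is handled as soon as its end tag is seen, then cleared so its children and text are not retained.

```python
import io
import xml.etree.ElementTree as ET

# Small in-memory stand-in for a large file.
source = io.BytesIO(
    b"<employees>"
    b'<employee id="E1001"><name>Amina Rahman</name></employee>'
    b'<employee id="E1002"><name>Jonas Weber</name></employee>'
    b"</employees>"
)

names = []
for event, element in ET.iterparse(source, events=("end",)):
    if element.tag == "employee":
        names.append(element.findtext("name"))
        # Drop the element's children, attributes, and text once handled.
        element.clear()

print(names)
```

Note that `clear()` empties each element but leaves the emptied element attached to its parent, so a very long document still accumulates small empty nodes unless the code also removes them from the root.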
XML has features that can be dangerous when enabled on attacker-controlled input. The best-known example is XML External Entity (XXE) processing.
A malicious document may attempt to define an entity that points to a local file or network address:
<?xml version="1.0"?>
<!DOCTYPE data [
<!ENTITY secret SYSTEM "file:///sensitive/local/file">
]>
<data>&secret;</data>
If a weakly configured parser resolves that external entity, it may disclose local data or make requests from the server's network position.
Another risk is excessive entity expansion, sometimes called a Billion Laughs attack, in which a small input expands dramatically and consumes resources.
flowchart TD
A["Untrusted XML input"] --> B{"Does parser allow<br/>DTD/external entity processing?"}
B -- "Yes" --> C["Risk: local-file reads,<br/>server-side requests,<br/>resource exhaustion"]
B -- "No / securely configured" --> D["Parse document content<br/>without external expansion"]
D --> E["Then validate and apply<br/>application rules"]
Use a hardened parser such as defusedxml for untrusted XML parsing.
Safe parsing and schema validation solve different problems:
| Concern | Mitigation |
| Parser retrieves external resources | Disable external entities / DTD resolution or use a hardened parser |
| Input expands into excessive data | Disable risky expansions and apply resource limits |
| Missing or incorrectly typed fields | Validate against XSD or application rules |
| Unexpected business values | Apply domain validation in application code |
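To illustrate the first mitigation, the sketch below uses Python's standard `xml.parsers.expat` to refuse any document that contains a DOCTYPE declaration, which is where entity definitions live. The names `DoctypeForbidden`, `reject_doctype`, and `is_rejected` are hypothetical. This is a teaching example of the idea, not a replacement for a maintained hardened parser.

```python
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration."""


def reject_doctype(xml_text: str) -> None:
    """Scan xml_text and raise DoctypeForbidden at the first DOCTYPE."""
    parser = expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called when the parser reaches a DOCTYPE declaration.
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)


# The external-entity example from the text.
MALICIOUS = """<?xml version="1.0"?>
<!DOCTYPE data [
  <!ENTITY secret SYSTEM "file:///sensitive/local/file">
]>
<data>&secret;</data>
"""


def is_rejected(xml_text: str) -> bool:
    """Return True if reject_doctype() refuses the document."""
    try:
        reject_doctype(xml_text)
    except DoctypeForbidden:
        return True
    return False


print(is_rejected(MALICIOUS))
print(is_rejected("<data>ok</data>"))
```

Rejecting every DOCTYPE is stricter than some applications need, but it removes both external entities and entity expansion from consideration in one step.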
No format is automatically better in every situation.
| Question | XML | JSON | Protobuf binary |
| Human-readable text format | Yes | Yes | No |
| Handles rich document-style mixed content well | Yes | Less naturally | Not its main purpose |
| Built-in namespace ecosystem | Yes | No standard equivalent | Schema packages instead |
| Mature schema validation standards | DTD, XSD, others | JSON Schema ecosystem | .proto schema |
| Compact over the wire | Often verbose | Usually less verbose than XML | Usually compact |
| Convenient for browser/API payloads | Used, but less common for new simple APIs | Very common | Requires generated/runtime support |
| Common strength | Documents, validation-heavy interchange, legacy/enterprise integration | Straightforward web data interchange | Typed compact service messages |
Choose XML when document structure, namespaces, transformation, or formal schema validation are important. Choose JSON when simple human-readable web interchange is the main need. Choose Protobuf when compact typed binary messages and schema-controlled service communication matter.
Choose descriptive names: readers understand <invoiceTotal> more readily than <value3>.
mindmap
  root((XML))
    Document structure
      One root element
      Elements
      Attributes
      Text
    Correctness
      Well-formed syntax
      Valid against schema
    Tools
      XPath queries
      XSD validation
      DOM and streaming parsers
    Interoperability
      Namespaces
      UTF-8
      Shared vocabularies
    Safety
      Untrusted input
      Block XXE
      Limit expansion
The most important ideas are: