Last modified: March 23, 2026


XML Format

XML, or Extensible Markup Language, is a W3C-standardized markup language designed to encode documents in a format that is both human-readable and machine-readable. Unlike HTML, which has a fixed set of tags, XML lets you define your own vocabulary of elements to represent arbitrary data structures. It remains widely used in enterprise web services, configuration files, and document interchange where strict schema validation matters.

Basics of XML

  1. Elements are the building blocks of every XML document; they can nest other elements, hold text content, or carry attributes.
  2. Attributes provide supplementary metadata about an element, written as name-value pairs inside the opening tag.
  3. Text Content is the actual data between an opening and closing tag, which the parser delivers to the application.
  4. Comments follow the <!-- ... --> syntax and are ignored by parsers but useful for documenting intent inside the file.
  5. Processing Instructions such as <?xml-stylesheet ... ?> give directives to the application consuming the document.

XML Syntax Rules

  1. XML Declaration – a document should begin with <?xml version="1.0" encoding="UTF-8"?> to declare the version and character encoding.
  2. Single Root – every well-formed document must have exactly one root element that wraps all other elements.
  3. Case Sensitivity – tag names are case-sensitive, so <Name> and <name> are treated as different elements.
  4. Proper Nesting – elements must close in the reverse order they were opened, ensuring a balanced tree.
  5. Quoted Attributes – attribute values must always be enclosed in single or double quotation marks.
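  6. Special Characters – < and & must always be written as entity references (&lt;, &amp;) in text and attribute values; > is conventionally escaped as &gt;, and quotation marks inside attribute values as &quot; or &apos;.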
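The rules above can be checked mechanically. The following is a minimal sketch using Python's standard library; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of any specification. It relies on the fact that a conforming parser must reject any document that breaks a well-formedness rule.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<note><to>Ann</to></note>"
bad_nesting = "<note><to>Ann</note></to>"   # closes in the wrong order
two_roots = "<a/><b/>"                      # more than one root element
unquoted = "<note id=1/>"                   # attribute value not quoted
case_mismatch = "<Name>Ann</name>"          # tag names differ by case

# Reserved characters must be escaped before being placed in a document.
raw = "R&D < 5"
escaped = escape(raw)                       # replaces &, < and >
round_trip = ET.fromstring(f"<t>{escaped}</t>").text

for label, doc in [("good", good), ("bad_nesting", bad_nesting),
                   ("two_roots", two_roots), ("unquoted", unquoted),
                   ("case_mismatch", case_mismatch)]:
    print(label, is_well_formed(doc))
print(escaped, "->", round_trip)
```

Only the first sample parses; the parser turns the escaped text back into the original string when it reads the element content.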

Document Tree Structure

Every XML document forms a logical tree rooted at the single root element. Understanding this tree is key to querying and transforming XML.

               [document]
                    |
               <employee>
         /          |            \
<name>          <address>             <skills>
   |         /      |      \           /     \
"John"  <street> <city> <zipCode>  <skill> <skill>
           |        |       |         |       |
       "1234.."  "Any.." "12345"   "Java"    "C#"

Example

Below is a complete XML document demonstrating nested elements, attributes, and namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<employee id="E1001" xmlns:hr="http://example.com/hr">
  <name>John Doe</name>
  <age>30</age>
  <hr:department code="ENG">Engineering</hr:department>
  <address>
    <street>1234 Main St</street>
    <city>Anytown</city>
    <state>CA</state>
    <zipCode>12345</zipCode>
  </address>
  <skills>
    <skill level="senior">Java</skill>
    <skill level="mid">C#</skill>
    <skill level="junior">Python</skill>
  </skills>
</employee>
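As a rough illustration of how an application might consume this document, the sketch below reads it with Python's xml.etree.ElementTree. The variable names are assumptions chosen for the example; the XML is the document shown above.

```python
import xml.etree.ElementTree as ET

EMPLOYEE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<employee id="E1001" xmlns:hr="http://example.com/hr">
  <name>John Doe</name>
  <age>30</age>
  <hr:department code="ENG">Engineering</hr:department>
  <address>
    <street>1234 Main St</street>
    <city>Anytown</city>
    <state>CA</state>
    <zipCode>12345</zipCode>
  </address>
  <skills>
    <skill level="senior">Java</skill>
    <skill level="mid">C#</skill>
    <skill level="junior">Python</skill>
  </skills>
</employee>"""

root = ET.fromstring(EMPLOYEE_XML)

# Attributes are read from the element that carries them.
employee_id = root.get("id")

# Text content is read from child elements; XML itself is untyped,
# so the application converts values such as the age.
name = root.findtext("name")
age = int(root.findtext("age"))
city = root.findtext("address/city")

# A namespaced element is looked up through a prefix-to-URI mapping.
NS = {"hr": "http://example.com/hr"}
department = root.find("hr:department", NS)

# Repeated elements come back in document order.
skills = [(skill.text, skill.get("level")) for skill in root.iter("skill")]

print(employee_id, name, age, city)
print(department.get("code"), department.text)
print(skills)
```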

XML Namespaces

When multiple XML vocabularies are combined in one document, name collisions can occur. Namespaces solve this by qualifying element and attribute names with a URI.

<root xmlns:app="http://example.com/app"
      xmlns:db="http://example.com/db">
  <app:config>
    <db:connection host="localhost" port="5432"/>
  </app:config>
</root>
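A short sketch of how a namespace-aware parser sees this document, again assuming Python's xml.etree.ElementTree. It is meant to show that the URI, not the prefix, identifies an element; the prefix maps NS and OTHER_PREFIXES are illustrative names.

```python
import xml.etree.ElementTree as ET

CONFIG_XML = """<root xmlns:app="http://example.com/app"
      xmlns:db="http://example.com/db">
  <app:config>
    <db:connection host="localhost" port="5432"/>
  </app:config>
</root>"""

root = ET.fromstring(CONFIG_XML)

# The lookup prefixes are bound to URIs by the caller.
NS = {"app": "http://example.com/app", "db": "http://example.com/db"}
connection = root.find("app:config/db:connection", NS)

# Internally the element name is the pair {namespace URI}local-name.
print(connection.tag)

# Unprefixed attributes belong to no namespace and are read by plain name.
host = connection.get("host")
port = int(connection.get("port"))

# Different prefixes bound to the same URIs select the same element.
OTHER_PREFIXES = {"a": "http://example.com/app", "d": "http://example.com/db"}
same_element = root.find("a:config/d:connection", OTHER_PREFIXES)

# An unqualified name does not match a namespaced element.
unqualified = root.find("config")
```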

DTD vs XSD (Schema Validation)

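Validation ensures an XML document conforms to a predefined structure. Two main schema technologies exist, DTD (Document Type Definition) and XSD (XML Schema Definition); a validating parser checks well-formedness first and validity second: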

XML Document
       |
       v
+---------------+      +---------------------+
|  Well-formed? |--No->|  Reject (parse err) |
+-------+-------+      +---------------------+
        | Yes
        v
+---------------+      +---------------------+
| Valid against |--No->| Reject (validation  |
| DTD or XSD?   |      |        error)       |
+-------+-------+      +---------------------+
        | Yes
        v
  Accept Document

Feature           | DTD                        | XSD
------------------+----------------------------+------------------------------
Syntax            | Non-XML, compact grammar   | XML-based
Data Types        | Limited (CDATA, ID, IDREF) | Rich (int, date, regex, etc.)
Namespace Support | None                       | Full namespace awareness
Extensibility     | Low                        | High (inheritance, groups)
Industry Adoption | Legacy systems             | Modern enterprise standards

XPath Basics

XPath is a query language for selecting nodes from an XML document tree. It is used heavily in XSLT, XQuery, and many programming APIs.

XPath: /employee/address/city

  <employee>         <-- step 1
       |
    <address>        <-- step 2
       |
     <city>          <-- step 3
       |
    "Anytown"        <-- result
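The same lookup can be sketched in Python. Note the assumption: xml.etree.ElementTree implements only a subset of XPath and evaluates paths relative to the element the method is called on, so the leading /employee step is implied by starting from the root. The trimmed sample document is illustrative.

```python
import xml.etree.ElementTree as ET

EMPLOYEE_XML = """<employee id="E1001">
  <address>
    <street>1234 Main St</street>
    <city>Anytown</city>
  </address>
  <skills>
    <skill level="senior">Java</skill>
    <skill level="mid">C#</skill>
    <skill level="junior">Python</skill>
  </skills>
</employee>"""

root = ET.fromstring(EMPLOYEE_XML)   # root is the <employee> element

# Equivalent of /employee/address/city, written relative to the root.
city = root.findtext("./address/city")

# A predicate filters the selected nodes by attribute value.
senior = [s.text for s in root.findall("./skills/skill[@level='senior']")]

# The descendant axis (//) matches at any depth below the root.
all_skills = [s.text for s in root.findall(".//skill")]

print(city, senior, all_skills)
```

Full XPath 1.0 features such as functions and most axes need a dedicated XPath engine rather than this subset.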

Parsing XML: DOM vs SAX

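Two dominant strategies exist for reading XML, DOM and SAX, each with distinct memory and performance trade-offs. StAX, a pull-based streaming API, is a third option and is included in the comparison table for reference.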

           XML Document
                 |
       +---------+---------+
       |                   |
       v                   v
  DOM Parser           SAX Parser
  (tree in RAM)        (event stream)
       |                   |
       v                   v
  Full node tree      Callbacks fired
  available for       for each start-tag,
  random access       end-tag, and text

Aspect         | DOM                        | SAX                   | StAX
---------------+----------------------------+-----------------------+--------------------
Memory         | High (full tree in RAM)    | Low (event-driven)    | Low (pull-based)
Access Pattern | Random                     | Sequential only       | Sequential only
Ease of Use    | Simple navigation API      | Callback-heavy        | Iterator-style
Write Support  | Yes (modify and serialize) | Read-only             | Read and write
Best For       | Small docs, editing        | Large docs, filtering | Streaming pipelines
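To make the trade-offs concrete, the sketch below extracts the same data three ways with Python's standard library. xml.dom.minidom stands in for DOM and xml.sax for SAX. StAX itself is a Java API, so ElementTree.iterparse is used here only as a loosely analogous pull-style loop; that pairing is an assumption of this example. The sample document and the SkillHandler class are illustrative.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET
from xml.dom import minidom

XML = b"<skills><skill>Java</skill><skill>C#</skill><skill>Python</skill></skills>"

# DOM: the whole tree is built in memory, then navigated freely.
dom = minidom.parseString(XML)
dom_skills = [node.firstChild.data for node in dom.getElementsByTagName("skill")]


# SAX: the parser pushes events into callbacks; the handler keeps its own state.
class SkillHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.skills = []
        self._buffer = None          # collects text while inside <skill>

    def startElement(self, name, attrs):
        if name == "skill":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "skill":
            self.skills.append("".join(self._buffer))
            self._buffer = None


handler = SkillHandler()
xml.sax.parseString(XML, handler)

# Pull style: the application asks for the next event in an ordinary loop.
pulled = []
for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == "skill":
        pulled.append(elem.text)
        elem.clear()                 # release content that is no longer needed

print(dom_skills, handler.skills, pulled)
```

All three produce the same list; they differ in how much of the document is held in memory and in who drives the loop.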

Security Considerations (XXE)

XML parsers can be exploited if they process untrusted input without proper safeguards. The most critical vulnerability is XML External Entity (XXE) injection.

Attacker crafts XML with external entity:

  <?xml version="1.0"?>
  <!DOCTYPE foo [
    <!ENTITY xxe SYSTEM "file:///etc/passwd">
  ]>
  <data>&xxe;</data>

  Parser Flow:
  1. Parser reads DTD   -->  sees ENTITY declaration
  2. Resolves "file://" -->  reads /etc/passwd from disk
  3. Substitutes &xxe;  -->  sensitive data in <data> output

Benefits of XML

  1. Self-descriptive tags make the document readable without external documentation, unlike binary formats.
  2. Platform-independent text encoding ensures data can be exchanged between any operating system or programming language.
  3. Unicode support allows representation of virtually any human language or symbol set.
  4. Extensible vocabularies let each domain define its own element names without modifying the core specification.
  5. Mature tooling across every major language provides robust parsers, validators, and transformation engines.

Common Uses of XML

Best Practices for XML

  1. Meaningful Names – choose element and attribute names that clearly describe their content, improving long-term maintainability.
  2. Shallow Nesting – keep the tree depth reasonable to avoid unnecessary complexity and simplify XPath queries.
  3. Schema Validation – always validate documents against an XSD to catch structural errors before they reach application logic.
  4. Namespace Discipline – declare namespaces at the root element and use consistent prefixes to prevent collision across vocabularies.
  5. Secure Parsing – disable external entities and DTD processing on every parser that handles untrusted input.
  6. Encoding Declaration – explicitly set encoding="UTF-8" in the XML declaration to avoid character misinterpretation.
  7. Prefer Elements Over Attributes – use attributes for metadata like IDs and elements for data to keep the document intuitive.
  8. Version Your Schemas – include a version attribute or a versioned namespace so consumers can tell which schema revision a document follows and handle changes gracefully.