Introduction to the XML format
And why not to confuse it with HTML
By Martin Helm in Data Science ML Tools
September 11, 2021
Recently I have been going over several very common file formats, such as XML, JSON and YAML. Although I have used them before, I never took the time to really look into their complete syntax. In this small series of posts I will dig into how exactly each of the three formats is structured, what they are used for and, most importantly, how they differ from each other. So let’s get started with XML.
XML stands for eXtensible Markup Language and was introduced in 1998. XML files have the extension .xml and the format is intended for exchanging data between computers and over the internet. As the name already suggests, it gives the developer a lot of freedom in naming the individual tags. A sample XML document might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<pokedex>
  <pokemon>
    <name>Bulbasaur</name>
    <type>Grass</type>
    <type>Poison</type>
  </pokemon>
  <pokemon>
    <name>Charmander</name>
    <type>Fire</type>
  </pokemon>
  <pokemon>
    <name>Squirtle</name>
    <type>Water</type>
  </pokemon>
</pokedex>
As you can see, XML is a very simple format that one can intuitively grasp. There are only a handful of components that we need to consider:
- Elements
- Tags
- Comments
- Attributes
- Syntax
- XML declaration
- Namespaces
- Differences to HTML
Elements
Elements are the basic building blocks of an XML document. They contain the actual data: in our case, <name>Bulbasaur</name> is an element whose content is the text “Bulbasaur”. The content can be anything that can be expressed with Unicode characters (with some exceptions for control characters). Typically UTF-8 is used as the encoding, but UTF-16 is also supported if you need it. Since everything is stored as text, this covers all typical types, such as strings, integers, and floats, although it is up to the reading application to convert the text back into the intended type. Even binary data can be placed inside an element, but then it needs to be encoded as text first, for example using base64 encoding.
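As a minimal sketch of the base64 approach, the following Python snippet stores a few raw bytes in an element and reads them back. The element name sprite and the byte values are made up for illustration; only the standard library is used.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical binary payload (for example a few bytes of an image)
payload = bytes([0, 255, 16, 128])

# Encode the bytes as base64 text so they can live inside an element
elem = ET.Element("sprite")  # "sprite" is an illustrative name
elem.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")

# Reading it back: parse the XML, then decode the text content
decoded = base64.b64decode(ET.fromstring(xml_text).text)
print(decoded == payload)
```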
Of course, elements can contain other elements, which are then called the child elements. This gives rise to the nested, hierarchical, tree-like structure of XML files.
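To make the tree structure more tangible, here is a small sketch that parses a shortened version of the sample document with Python's built-in xml.etree.ElementTree module and walks from the root down to the child elements. The variable names are my own choice.

```python
import xml.etree.ElementTree as ET

doc = """<pokedex>
  <pokemon>
    <name>Bulbasaur</name>
    <type>Grass</type>
    <type>Poison</type>
  </pokemon>
  <pokemon>
    <name>Charmander</name>
    <type>Fire</type>
  </pokemon>
</pokedex>"""

root = ET.fromstring(doc)

# Iterating over the root yields its children, the <pokemon> elements;
# <name> and <type> are in turn children of each <pokemon>
pokedex = {
    pokemon.findtext("name"): [t.text for t in pokemon.findall("type")]
    for pokemon in root
}

print(root.tag)
print(pokedex)
```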
Tags
To mark up elements, they are surrounded by tags. Tags are written in angle brackets (<>) and they always come in pairs:
- a start-tag to begin an element:
<elementname>
- an end-tag at the end of the element, denoted by the leading forward slash:
</elementname>
The only exception to this are empty-element tags, also called self-closing tags, where the forward slash comes at the end: <elementname/>. This is a shorthand for, and equivalent to, <elementname></elementname>.
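A quick way to convince yourself that both spellings mean the same thing is to parse them and compare the results. This is only a sketch using Python's standard library parser; other parsers should behave the same way, but I have only checked this one.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<pokemon/>")
long_form = ET.fromstring("<pokemon></pokemon>")

# Both describe an element without content: same tag, same (missing) text,
# and no child elements
same = (
    short_form.tag == long_form.tag
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)
print(same)
```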
You as the developer are almost completely free to choose whatever name you want for your tags. The name should be as self-describing as possible, to make it easier to infer its meaning when reading the document. Letters, digits, hyphens, underscores and periods are allowed, whereas whitespace is not. A name must start with a letter or an underscore, and names starting with the string xml (in any capitalization) are reserved, since this prefix is used for the XML declaration, among other things. Element names are case-sensitive, so <root> and <Root> denote different elements!
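The case sensitivity and the ban on whitespace are easy to try out. The helper parses below is a hypothetical name of mine; it simply reports whether Python's standard library parser accepts a snippet.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses("<root></root>"))      # start and end tag match
print(parses("<root></Root>"))      # names differ in case, so the end tag does not match
print(parses("<my tag></my tag>"))  # whitespace is not allowed inside a name
```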
Comments
Comments can appear almost anywhere in the document, but not before the XML declaration (the <?xml version="1.0"?> part) and not inside a tag. They begin with <!-- and end with -->. Since double hyphens are used to delimit comments, the comment text itself must not contain --.
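Here is a small sketch of how a parser treats comments, again with Python's xml.etree.ElementTree. Note that skipping comments is the default behaviour of this particular parser; the sample strings are made up.

```python
import xml.etree.ElementTree as ET

with_comment = "<pokedex><!-- starter Pokemon --><name>Bulbasaur</name></pokedex>"
root = ET.fromstring(with_comment)

# By default the parser skips comments, so <name> is the only child
child_tags = [child.tag for child in root]

# A double hyphen inside the comment text is rejected
try:
    ET.fromstring("<pokedex><!-- grass -- poison --></pokedex>")
    double_hyphen_ok = True
except ET.ParseError:
    double_hyphen_ok = False

print(child_tags, double_hyphen_ok)
```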
Attributes
Each element can have additional attributes, which are defined inside its start tag as key-value pairs. The key is always written as is, whereas the value needs to be wrapped in single or double quotes.
These attributes add additional information to the element. The main difference to the element content itself is that attributes are mainly used to store metadata, although several XML-based formats, such as SVG, store most of their information in attributes.
One element can have multiple attributes, which are not delimited by commas but simply separated by whitespace. However, an element is not allowed to have multiple attributes with the same name. Therefore the following is not possible:
<pokemon evolution="Charmeleon" evolution="Charizard">Charmander</pokemon>
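If you feed this line to a parser, it refuses to read it. A short sketch with Python's standard library:

```python
import xml.etree.ElementTree as ET

duplicate = '<pokemon evolution="Charmeleon" evolution="Charizard">Charmander</pokemon>'

try:
    ET.fromstring(duplicate)
    error = None
except ET.ParseError as exc:
    error = exc  # the parser complains about the repeated attribute

print(error)
```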
Different elements can share the same attribute though:
<pokedex>
  <pokemon evolution1="Ivysaur" evolution2="Venusaur">Bulbasaur</pokemon>
  <pokemon evolution1="Charmeleon" evolution2="Charizard">Charmander</pokemon>
</pokedex>
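Reading attributes back is straightforward. The following sketch assumes the document above and uses Python's xml.etree.ElementTree; the dictionary layout is just one possible way to collect the values.

```python
import xml.etree.ElementTree as ET

doc = """<pokedex>
  <pokemon evolution1="Ivysaur" evolution2="Venusaur">Bulbasaur</pokemon>
  <pokemon evolution1="Charmeleon" evolution2="Charizard">Charmander</pokemon>
</pokedex>"""

root = ET.fromstring(doc)

# .get() reads an attribute value by name; .text is the element content
evolutions = {p.text: (p.get("evolution1"), p.get("evolution2")) for p in root}
print(evolutions)
```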
In case you want to store several values in a single attribute, basically a list of values, you need to define your own format to do so. You could for example separate the values using semicolons, but it is up to you to define this convention, ideally document it, and also write the code that decodes it afterwards. For example:
<pokemon weaknesses="Water;Ground;Rock">Charmander</pokemon>
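Decoding such a home-made list is then your own job. A minimal sketch in Python, assuming the semicolon convention from the example:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<pokemon weaknesses="Water;Ground;Rock">Charmander</pokemon>')

# The parser only sees one string; splitting it is our own convention
weaknesses = elem.get("weaknesses", "").split(";")
print(weaknesses)
```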
In such a case, think about whether you could alternatively use elements instead of an attribute to store the information, since you can have multiple elements with the same name inside the parent element:
<pokemon>
  <name>Charmander</name>
  <weakness>Water</weakness>
  <weakness>Ground</weakness>
  <weakness>Rock</weakness>
</pokemon>
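With repeated elements, no custom decoding is needed, because the parser already hands you a list. A short sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

doc = """<pokemon>
  <name>Charmander</name>
  <weakness>Water</weakness>
  <weakness>Ground</weakness>
  <weakness>Rock</weakness>
</pokemon>"""

elem = ET.fromstring(doc)

# findall() returns every direct child with the given name, in document order
weaknesses = [w.text for w in elem.findall("weakness")]
print(weaknesses)
```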
Syntax
An XML document is called “well-formed” if it adheres to all the following rules:
There exists exactly one root element. The root element is the outermost element; its name does not need to be root. In our example the root element is <pokedex>.
All elements must have a start and an end tag (i.e. <pokemon></pokemon>). The only exception are empty-element tags written as <elementname/>.
The tags need to be closed in the same order as they are opened. In other words, one cannot close the outer tag before closing the inner tag. A bad example that does not follow this would be the following:
<pokemon>
<name>Charmander</pokemon>
</name>
In this example, <pokemon> is the outer tag and <name> is the inner tag, but the end tag </pokemon> appears before the end tag </name>.
Note that there are no rules about indentation. It is good practice to include it, to make the document more human-readable, but you are free to choose how much whitespace to use.
In case you are unsure whether your document adheres to all XML rules, you can validate it using one of many online tools, for example https://www.xmlvalidation.com/.
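If you prefer a local check over an online tool, any XML parser will do, since it has to reject documents that are not well-formed. The function is_well_formed below is a hypothetical helper built on Python's xml.etree.ElementTree. Note that it only checks the well-formedness rules listed above, not whether the document follows a particular schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as exc:
        return False, str(exc)

good = "<pokedex><pokemon><name>Charmander</name></pokemon></pokedex>"
wrong_nesting = "<pokemon><name>Charmander</pokemon></name>"
two_roots = "<pokemon/><pokemon/>"

for sample in (good, wrong_nesting, two_roots):
    print(is_well_formed(sample))
```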
XML declaration
This is an optional line that describes the document itself. Because of its <? ... ?> syntax it is sometimes referred to as a processing instruction, although strictly speaking the XML specification treats it as a separate construct. It tells the parser which XML version and which character encoding you use. If you add it to your document, which is good practice, it needs to be the very first thing in the file, with nothing before it, not even whitespace or a comment. It looks similar to a regular tag, but the name xml is enclosed in question marks and it does not need a corresponding closing tag. For example it could look like this:
<?xml version="1.0" encoding="UTF-8"?>
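The sketch below shows both directions in Python: reading a document that starts with a declaration, and letting the library write one. The xml_declaration argument of ET.tostring is available from Python 3.8 on; the tiny pokedex document is only a placeholder.

```python
import xml.etree.ElementTree as ET

with_declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n<pokedex/>'
root = ET.fromstring(with_declaration)

# Anything before the declaration, even a single newline, is an error
try:
    ET.fromstring(b"\n" + with_declaration)
    leading_newline_ok = True
except ET.ParseError:
    leading_newline_ok = False

# Writing: ask the serializer to put a declaration in front of the document
serialized = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(root.tag, leading_newline_ok)
print(serialized)
```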
Namespaces
So far we have been dealing with only a single XML document. But what if you want to combine data from several XML documents that were developed by different people? Since XML offers so much freedom in naming the tags, it can easily happen that two different XML documents use the same element name with different meanings. For example, consider the following two XML documents that could describe a store inventory:
XML1
<item>
  <name>Chair</name>
  <number>20</number>
  <id>1</id>
</item>
XML2
<item>
  <description>Sofa</description>
  <stock>213</stock>
  <number>12</number>
</item>
Both documents use the number tag, but with different meanings. In the first document, number means the number of items in stock, whereas in the second document it is the internal id of the item. How can we now combine these documents? The solution is to use namespaces.
Namespaces define which tags belong together and which come from another document, or namespace. First we need to declare the namespace, which is done with an xmlns attribute that binds a namespace identifier (the prefix) to a Uniform Resource Identifier (URI):
xmlns:identifier="Uniform Resource Identifier"
The namespace can be declared in any tag in the document and is then valid for this element and all its child elements. But you can also declare all namespaces in the root element, which makes it easier to look them up. To identify which tag belongs to which namespace, simply add the namespace identifier in front of the element name, followed by a colon.
The URI can be any identifier you want, similar to an attribute value. In practice it is often the address of a website where the developer describes the meaning of each element name. Note that the URI is not looked up by the parser; it only serves as a unique name!
Our combined document could then look like this:
<table xmlns:a="http://www.superstore.org/someinfo" xmlns:b="http://www.beststore.com/moreinfo">
  <a:item>
    <a:name>Chair</a:name>
    <a:number>20</a:number>
    <a:id>1</a:id>
  </a:item>
  <b:item>
    <b:description>Sofa</b:description>
    <b:stock>213</b:stock>
    <b:number>12</b:number>
  </b:item>
</table>
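When reading such a document, you tell the parser which namespace you mean. The sketch below uses Python's xml.etree.ElementTree, where lookups take a dictionary that maps prefixes to URIs. The prefixes in that dictionary are my own choice and do not have to match the ones in the document; only the URIs matter.

```python
import xml.etree.ElementTree as ET

doc = """<table xmlns:a="http://www.superstore.org/someinfo"
       xmlns:b="http://www.beststore.com/moreinfo">
  <a:item>
    <a:name>Chair</a:name>
    <a:number>20</a:number>
    <a:id>1</a:id>
  </a:item>
  <b:item>
    <b:description>Sofa</b:description>
    <b:stock>213</b:stock>
    <b:number>12</b:number>
  </b:item>
</table>"""

root = ET.fromstring(doc)

# Prefix-to-URI mapping used only for the lookups below
ns = {
    "a": "http://www.superstore.org/someinfo",
    "b": "http://www.beststore.com/moreinfo",
}

stock_count = root.findtext("a:item/a:number", namespaces=ns)  # items in stock
internal_id = root.findtext("b:item/b:number", namespaces=ns)  # internal id

# Internally, ElementTree stores each tag as "{URI}localname"
first_tag = root[0].tag

print(stock_count, internal_id, first_tag)
```

The two number elements no longer collide, because each one is identified by its namespace URI together with its local name.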
Differences to HTML
Finally, let’s review some differences between XML and HTML:
XML | HTML |
---|---|
Designed to carry data | Designed to display data |
Emphasis on what type of data it is | Emphasis on how data looks |
Tags can be defined by user | Tags are predefined |
Summary
As we have seen, XML is a very simple but versatile markup language. It is used extensively, and several common formats build on the general XML syntax, for example the popular graphics format SVG. Nonetheless, its requirement for end tags makes it verbose, so a lot of space is taken up by the tags alone and not by the content itself. In my next post, I will explore the JSON format, which is more concise and nowadays very widely used. Stay tuned until then!