XML, JSON, YAML Key:Value Notation
1 Executive Summary of Python Methods for Data Serialization
1.1 XML
from xml.dom.minidom import parse, parseString

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)  # parse an open file

####################

import xml.etree.ElementTree as ET

tree = ET.parse('acme-employees.xml')  # from a file: returns an ElementTree
root = tree.getroot()
# from a string: fromstring() returns the root Element directly
root2 = ET.fromstring(acme_employees_long_string_of_xml)
1.2 JSON
1.3 YAML
2 API Data formats
The data serialization languages covered here are YAML, JSON and XML.
data format | most common use |
---|---|
XML | transformation with XSL; applying XML schemas |
JSON | communication between server and web page; configuration files |
YAML | configuration files |
2.1 DSL have common characteristics
Data serialization languages have the following characteristics:
- format syntax
- concept of an object
- key-value notation
- array or list
- importance of whitespace
- case sensitivity
3 Common Data Formats (Key Value Notation)
To standardize on data formats, several standards have emerged. They all have key and value pairs. Values can be objects, lists, strings, numbers, booleans. The keys … The common ones are:
3.1 "Key" : "Value"
They all have this key-value pair concept.
For XML the "key" is actually the tag, which has to have a closing tag: <tag> value </tag> or <key>value</key>
For JSON it is "Key":Value, where whitespace is not significant, but curly braces are, as are commas.
For YAML it is Key: Value, with no commas, but whitespace indentation defines the structure. Your IDE will handle that for you, including emacs. You can also have your IDE highlight those indentations as an option.
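As a rough illustration of the same key-value pairs in two of these notations, the sketch below serializes one small Python dictionary to JSON and to XML using only the standard library. The record contents and the wrapping tag name interface are made up for the example; YAML is left out here because it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, used only for illustration.
record = {"name": "GigabitEthernet2", "description": "WAN firewall"}

# JSON: "key": value pairs inside curly braces, separated by commas.
as_json = json.dumps(record)

# XML: each key becomes a tag that wraps its value.
root = ET.Element("interface")
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

The dictionary is the neutral in-memory structure; only the surface notation differs between the two outputs.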
4 XML <key> value </key>
eXtensible Markup Language.
4.1 Characteristics of XML
- whitespace is mostly ignored
- whitespace is, however, significant in certain cases.
- key value pairs can be nested
- .xml file ending
- NO predefined tags, i.e. they are all user defined
- opening tags: <mytag>
- closing tags: </mytag> (must match opening tags)
- comments are <!-- this is a comment and it ends with a -->
- start with an XML prologue, which starts and ends with <? and ?> respectively. For example <?xml version="1.0" encoding="UTF-8"?>
- special characters (i.e. if you want to send a < as part of the data) are sent as an entity or as the character's numeric encoding
  - < is &lt;
  - > is &gt;
XML also provides a good method of storing data that is universal because it is computer language neutral. It is:
- more complicated but also more powerful than json, because tags are not predefined. Any customized tags can be defined (compare with HTML tags <h1>, <h2> etc.).
- less secure due to its complexity
- XLT XML Language Tags can be predefined that specify a
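To illustrate the special-character rule from the list above, here is a minimal sketch using the standard library's xml.sax.saxutils helpers. The sample string is made up; the point is only that <, > and & inside data are replaced by entities before being placed in a document, and restored afterwards.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical data that contains characters with special meaning in XML.
raw = "if a < b & b > c"

escaped = escape(raw)      # safe to place between <tag> and </tag>
restored = unescape(escaped)

print(escaped)
print(restored)
```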
4.2 XML structure
- Every XML document has a root element containing one or more child elements
- Path is a way of addressing a particular element inside a tree
- namespaces provide you with name isolation for potentially duplicate element names, i.e. there could be many places where "address" will appear. Namespaces let you deal with this ambiguity/overlap, even when an XML document is built from several YANG models, as they often are.
4.2.1 Document node / declaration <?xml version="1.0" encoding="UTF-8"?>
4.2.2 elements
Each XML element must have a start-tag and end-tag, eg <int> ... </int>
An empty tag still needs closing, either as <ipv6></ipv6> or in the self-closing form <ipv6/>
4.2.3 data
<mytag>xyz</mytag>
4.2.4 xml attributes
XML lets you embed attributes within tags to convey additional information; in the declaration above, the attributes are the XML version number and character encoding. For example, instead of including the ip address and netmask as elements, we might have included them as attributes (attribute values must be quoted):
<address>
  <ip>172.17.18.19</ip>
  <netmask>255.255.255.0</netmask>
</address>
<!-- vs as attributes -->
<address ip="172.17.18.19" netmask="255.255.255.0" />
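From Python the two layouts are read differently: element content comes from .text, attribute values from .get(). A minimal sketch with xml.etree.ElementTree, assuming the attribute values are quoted as XML requires:

```python
import xml.etree.ElementTree as ET

# The address expressed with child elements.
as_elements = ET.fromstring(
    "<address><ip>172.17.18.19</ip><netmask>255.255.255.0</netmask></address>"
)

# The same address expressed with attributes (values must be quoted).
as_attributes = ET.fromstring(
    '<address ip="172.17.18.19" netmask="255.255.255.0" />'
)

ip_from_element = as_elements.find("ip").text    # text of the <ip> child
ip_from_attribute = as_attributes.get("ip")      # value of the ip attribute

print(ip_from_element, ip_from_attribute)
```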
4.2.5 xml namespaces
If your XML messages need to reference a specific published standard namespace, you must specify that namespace, using the xmlns attribute, in the message.
The NETCONF namespaces are documented by the IETF. So, for example, to specify that this rpc message conforms to the NETCONF 1.0 standard, you would say
<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0"> stuff in the middle </rpc>
You can use XML namespaces, xmlns, to allow you to use the same tag and have it refer to different objects. It is like a prefix to our tags.
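A short sketch of how namespaces show up when parsing with xml.etree.ElementTree. The document is a trimmed version of a YANG-style interface listing; the prefixes if and ip in the ns dictionary are arbitrary names chosen for this example and only need to map to the right namespace URIs.

```python
import xml.etree.ElementTree as ET

doc = """
<interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
  <interface>
    <name>GigabitEthernet1</name>
    <ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
      <address><ip>198.18.133.212</ip></address>
    </ipv4>
  </interface>
</interfaces>
"""

# Local prefixes (our own choice) mapped to the namespace URIs in the document.
ns = {
    "if": "urn:ietf:params:xml:ns:yang:ietf-interfaces",
    "ip": "urn:ietf:params:xml:ns:yang:ietf-ip",
}

root = ET.fromstring(doc)
name = root.find("if:interface/if:name", ns).text
addr = root.find("if:interface/ip:ipv4/ip:address/ip:ip", ns).text

print(root.tag)   # the tag carries its namespace in {braces}
print(name, addr)
```

Searching for a bare "interface" without the namespace would find nothing, which is exactly the name isolation that namespaces are meant to provide.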
4.2.6 processing instructions
4.2.7 comments
<!-- ….. -->
4.3 Sample XML
<?xml version="1.0" encoding="UTF-8"?>
<!-- hey dude, this comment was left by Zintis -->
<!-- just to show what comments look like in xml -->
<!-- ########################################### -->
<Interface xmlns="ietf-interfaces">
  <name>Gigabitethernet2</name>
  <description>WAN firewall</description>
  <enabled></enabled>
  <ipv4>
    <address>
      <ip>172.17.18.19</ip>
      <netmask>255.255.255.0</netmask>
    </address>
    <address>
      <ip>172.16.16.1</ip>
      <netmask>255.255.255.0</netmask>
    </address>
  </ipv4>
</Interface>
4.4 Sample XML from a YANG model
<interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
  <!-- comment left by zintis -->
  <interface>
    <name>GigabitEthernet1</name>
    <type xmlns:ianaift="urn:ietf:params:xml:ns:yang:iana-if-type">ianaift:ethernetCsmacd</type>
    <enabled>true</enabled>
    <ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
      <address>
        <ip>198.18.133.212</ip>
        <netmask>255.255.192.0</netmask>
      </address>
    </ipv4>
    <ipv6 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip"/>
  </interface>
  ...
</interfaces>
4.5 External Resources:
5 Parsing xml with Python
There are several libraries that can parse XML.
5.1 xml (minidom and etree.ElementTree)
The xml package is built into Python. Two of its submodules are covered here: xml.dom.minidom and xml.etree.ElementTree.
5.1.1 minidom
pythonlibrary.org has good examples and a tutorial on minidom.
https://docs.python.org/3/library/xml.dom.minidom.html has the official description of minidom. It states that xml.dom.minidom is a minimal implementation of the Document Object Model interface, with an API similar to that in other languages. It is intended to be simpler than the full DOM and also significantly smaller. Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
DOM applications start by parsing some XML into a DOM. This is done with the parse and parseString functions:
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml')  # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)  # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
parse() can take either a filename (a string) or an open file object.
Here is a full example from the same source: minidom.py
Using minidom to "pretty print" some xml:
import xml.dom.minidom

# xmlString is assumed to already hold the XML text
dom = xml.dom.minidom.parseString(xmlString)
pretty_xml = dom.toprettyxml()  # do not rebind the name xml, it is the module
print(pretty_xml)
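Once a DOM has been built, its nodes can be walked. The following is a small sketch, not taken from the minidom.py example, that parses the short string used earlier and pulls out the element and text nodes with standard minidom calls.

```python
from xml.dom.minidom import parseString

dom = parseString("<myxml>Some data<empty/> some more data</myxml>")

root = dom.documentElement                   # the <myxml> element
empties = dom.getElementsByTagName("empty")  # all <empty/> elements

# Collect the text nodes that sit directly under the root element.
text_parts = [
    node.data for node in root.childNodes if node.nodeType == node.TEXT_NODE
]

print(root.tagName, len(empties), text_parts)
```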
5.1.2 xml.etree.ElementTree
The official documentation is docs.python.org-xml.etree.ElementTree
The docs state that "The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data." Note that the older xml.etree.cElementTree module was deprecated in Python 3.3 and removed in Python 3.9, in case you come across it in some online forum.
5.1.3 Element & ElementTree
This library has two classes: the ElementTree class represents the whole xml document as a tree, while Element represents a single node in this tree. So interactions with the whole document use ElementTree, while single elements and their sub-elements use Element.
5.1.4 Parsing xml into ElementTree
import xml.etree.ElementTree as ET

tree = ET.parse('acme-employees.xml')  # from a file: returns an ElementTree
root = tree.getroot()
# from a string: fromstring() returns the root Element directly
root2 = ET.fromstring(acme_employees_long_string_of_xml)
Lots more good detail in the docs on docs.python.org-xml-etree.elementtree
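To make the ElementTree versus Element distinction concrete, here is a self-contained sketch. The employee data is invented, and io.StringIO stands in for a real file such as acme-employees.xml.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample data standing in for a real XML file.
xml_text = (
    "<employees>"
    "<employee id='1'><name>Wile E. Coyote</name></employee>"
    "<employee id='2'><name>Road Runner</name></employee>"
    "</employees>"
)

tree = ET.parse(io.StringIO(xml_text))  # ElementTree: the whole document
root = tree.getroot()                   # Element: the top node of the tree
root2 = ET.fromstring(xml_text)         # Element directly, no ElementTree

# An Element behaves like a list of its children.
names = [employee.find("name").text for employee in root]
ids = [employee.get("id") for employee in root2]

print(type(tree).__name__, type(root).__name__, names, ids)
```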
5.1.5 XPath
More sophisticated searches in xml formatted data can be done with XPath
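ElementTree itself supports a limited subset of XPath through find() and findall(). A minimal sketch with made-up employee data, selecting by attribute value:

```python
import xml.etree.ElementTree as ET

# Invented sample data for the search.
root = ET.fromstring("""
<employees>
  <employee dept="sales"><name>Alice</name></employee>
  <employee dept="eng"><name>Bob</name></employee>
  <employee dept="eng"><name>Carol</name></employee>
</employees>
""")

# XPath-style predicate: only employees whose dept attribute is 'eng'.
eng_names = [e.text for e in root.findall("./employee[@dept='eng']/name")]

# iter() walks the whole tree regardless of depth.
all_names = [e.text for e in root.iter("name")]

print(eng_names, all_names)
```

For full XPath 1.0 expressions a package such as lxml is needed; the standard library only covers the simpler path and predicate forms.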
5.2 xml vulnerabilities
Also be aware that the xml.etree.ElementTree module is not secure against maliciously constructed data. See the xml vulnerabilities link for details.
For example, this code, known as the 'billion laughs bomb', will expand out to 10^9 (one billion) "lol"s
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ELEMENT lolz (#PCDATA)>
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
  <!ENTITY lol4 "&lol3; ;&lol3;">
  <!ENTITY lol5 "&lol4; ;&lol4;">
  <!ENTITY lol6 "&lol5; you get the idea. Don't do this yourself ;&lol5;">
  <!ENTITY lol7 "&lol6; ;&lol6;">
  <!ENTITY lol8 "&lol7; ;&lol7;">
  <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
<!-- this triggers the "bomb" and is simply 10 lol8s, which in turn is 10 lol7s ... to the last "lol" -->
Any format that allows references to its own definitions is vulnerable, so external sources of data should be scrubbed clean to prevent heavily nested entities.
Here is a YAML version:
a: &a ["lol","lol","lol","lol","lol","lol","lol","lol","lol"]
b: &b [*a,*a,*a,*a,*a,*a,*a,*a,*a]
c: &c [*b,*b,*b,*b,*b,*b,*b,*b,*b]
d: &d [*c,*c,*c,*c,*c,*c,*c,*c,*c]
e: &e [*d,*d,*d,*d,*d,*d,*d,*d,*d]
f: &f [*e,*e,*e,*e,*e,*e,*e,*e,*e]
g: &g [*f,*f,*f,*f,*f,*f,*f,*f,*f]
h: &h [*g,*g,*g,*g,*g,*g,*g,*g,*g]
i: &i [*h,*h,*h,*h,*h,*h,*h,*h,*h]
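One crude way to scrub XML from an external source is to refuse any document that declares its own entities before it ever reaches the parser. The sketch below is only that: the function name parse_untrusted and the refuse-everything policy are assumptions made for illustration, and a simple text check like this is not a complete defence against every XML attack.

```python
import re
import xml.etree.ElementTree as ET

# Matches the start of any entity declaration inside a DOCTYPE.
_ENTITY_DECL = re.compile(r"<!ENTITY", re.IGNORECASE)


def parse_untrusted(xml_text):
    """Hypothetical helper: parse XML only if it declares no entities."""
    if _ENTITY_DECL.search(xml_text):
        raise ValueError("entity declarations are not accepted")
    return ET.fromstring(xml_text)


print(parse_untrusted("<lolz>harmless</lolz>").text)

try:
    parse_untrusted('<!DOCTYPE lolz [<!ENTITY lol "lol">]><lolz>&lol;</lolz>')
except ValueError as err:
    print("rejected:", err)
```

Legitimate documents that rely on custom entities would also be rejected, which is usually an acceptable trade for data arriving from outside.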
5.2.1 python code example using xml.etree.ElementTree
It is best to see it in python mode, so here is xml.etree.ElementTree-eg.py
Another example, from geeksforgeeks.org, is in this file: xml-etree-ElementTree.py
5.3 xmltodict
This is an additional module that needs to be pip installed. The project page, https://pypi.org/project/xmltodict/, describes it as: "xmltodict is a Python module that makes working with XML feel like you are working with JSON (http://docs.python.org/library/json.html), as in this "spec" (http://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html)"
5.4 lxml
This is a 3rd party package from codespeak. It uses the ElementTree API.
The lxml package has XPath and XSLT support.
Here is some sample python code using lxml, from pythonlibrary.org
from lxml import etree


def parseXML(xmlFile):
    """
    Parse the xml
    """
    # read as bytes so that an XML encoding declaration is handled by lxml
    with open(xmlFile, "rb") as fobj:
        xml = fobj.read()

    root = etree.fromstring(xml)

    # iterate over the elements directly; getchildren() is deprecated
    for appt in root:
        for elem in appt:
            if not elem.text:
                text = "None"
            else:
                text = elem.text
            print(elem.tag + " => " + text)


if __name__ == "__main__":
    parseXML("example.xml")
6 JSON "Key":Value,
Characteristics of JSON (JavaScript Object Notation) are:
- JSON contains either an array of values or an object
- {} curly braces define an object of {"key": value, "key": value} pairs; commas separate the entries in an object
- [] square brackets define arrays, or lists; commas separate the entries in a list
- an object is an unordered set of name : value pairs
- all keys are strings, so "string1": (need the colon :)
- wrap strings in double quotation marks
- whitespace is ignored: {"key1":"Cleese"} = { "key1" : "Cleese" }
- dictionaries can have sub dictionaries, i.e. objects can contain other objects
- values can be:
  - strings
  - integers
  - floats
  - null
  - booleans
  - lists
  - even other json objects
- NO comments SUPPORTED
- .json file ending
- is JSON hierarchical? More accurate to say JSON can represent hierarchical data structures. Lists are not hierarchical, but an object can have sub-objects, and those are hierarchical.
- JSON is parsed in python with json.loads() for loading from a string, and json.load() for loading data from a file or file-like object.
- NO round brackets. EVER!
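The characteristics above map directly onto Python types when the text is parsed. A minimal sketch, with an invented JSON string, and io.StringIO standing in for an open file:

```python
import io
import json

# Invented sample covering each JSON value type.
text = (
    '{"name": "Cleese", "active": true, "middle": null, '
    '"scores": [1, 2.5], "address": {"city": "London"}}'
)

data = json.loads(text)               # from a string
same = json.load(io.StringIO(text))   # from a file-like object

print(type(data).__name__)            # JSON object becomes a dict
print(data["active"], data["middle"]) # true/null become True/None
print(data["scores"], data["address"]["city"])
```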
{
  "ietf-interfaces: interfaces": {
    "name": "GigabitEthernet2",
    "description": "WAN firewall",
    "enabled": true,
    "ietf-ip:ipv4": {
      "address": [
        {
          "ip": "172.16.17.18",
          "netmask": "255.255.255.0"
        },
        {
          "ip": "172.16.16.1",
          "netmask": "255.255.255.0"
        }
      ]
    }
  }
}
Pay attention to the commas. The last element of a list has NO comma!
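Python's json module enforces the comma rule strictly, which makes it a quick way to check a file. A small sketch with two made-up strings, one valid and one with a trailing comma:

```python
import json

good = '["a", "b", "c"]'
bad = '["a", "b", "c",]'   # trailing comma after the last element

parsed = json.loads(good)
print(parsed)

rejected = False
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    rejected = True
    # err.lineno and err.colno point at where the parser gave up
    print("rejected at line", err.lineno, "column", err.colno)
```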
Another example straight from json.org :
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": ["GML", "XML"]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}
Notice that lists have square brackets and elements of a list are separated by commas. The elements can be items, lists, or dictionaries. These are all valid json shapes, with item1, value and so on standing in for actual values:
[item1, item2, item3]   # a list of 3 elements
[[a1, a2, a3], item2, item3]   # a list of 3 elements, the first of which is itself a list
[item1, item2, {"key": value, "key": value, "key": value}]   # a list of 3 elements
[item1, {"key": value, "key": value, "key": {"k1": v1, "k2": v2}}]
7 JSON in python
Here is a straightforward sample read (json.load) and write (json.dump) of json. code-examples/json.py
#!/usr/bin/env python
''' read json into a dictionary, then dump dictionary to json file'''
import json
from pprint import pprint

with open('/Users/zintis/eg/code-examples/mydevices.json', 'r') as readdevices:
    ingest_dict = json.load(readdevices)

print("My ingest_dict is a dictionary, with type ", type(ingest_dict))
pprint(ingest_dict)

with open('/Users/zintis/eg/code-examples/sample.json', 'w') as dumpback2json:
    json.dump(ingest_dict, dumpback2json)

print(" compare mydevices.json to sample.json for peace of mind that json ")
print(" library is reading and writing json properly")
- json.load(filehandle) to read
- json.dump(jsonobject, filehandle) to write
A real-life example of an ip table read in from a router. The file is [file:///Users/zintis/eg/code-examples/show-ip-route-response.json]
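After the json is parsed into a python dictionary, say X, then 172.16.17.18 could be retrieved by walking down through the keys and the list index. Assuming X holds the interface sample shown earlier, with the address list under "ietf-ip:ipv4":
IPaddr = X["ietf-interfaces: interfaces"]["ietf-ip:ipv4"]["address"][0]["ip"]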
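A runnable version of that lookup, as a sketch. The JSON text is an assumed, cleaned-up form of the interface sample with the addresses placed under "ietf-ip:ipv4" and "address"; a real router response may nest things differently.

```python
import json

# Assumed layout of the interface data, for illustration only.
text = """
{
  "ietf-interfaces: interfaces": {
    "name": "GigabitEthernet2",
    "ietf-ip:ipv4": {
      "address": [
        {"ip": "172.16.17.18", "netmask": "255.255.255.0"},
        {"ip": "172.16.16.1", "netmask": "255.255.255.0"}
      ]
    }
  }
}
"""

X = json.loads(text)

# Dictionary keys select objects, an integer index selects a list entry.
ipv4 = X["ietf-interfaces: interfaces"]["ietf-ip:ipv4"]
IPaddr = ipv4["address"][0]["ip"]
all_ips = [entry["ip"] for entry in ipv4["address"]]

print(IPaddr)
print(all_ips)
```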
8 YAML Key:Value
YAML Ain't Markup Language (originally Yet Another Markup Language). YAML is an indentation-based data serialization language which aims to be both easy to read and easy to write.
Many projects use it because of its readability, simplicity and good support for many programming languages.
YAML is so readable, the home page of yaml.org is itself a yaml file.
YAML (as of version 1.2) is designed as a superset of JSON, which means that any valid JSON file is also a valid YAML file. This works because YAML also has a "flow" style that uses the same curly braces, square brackets and commas as JSON, and inside those brackets indentation is not significant. So a YAML parser should understand a valid JSON file, but not the other way around.
Characteristics of YAML are:
- minimalist
- .yml (or .yaml) file type
- start with ---
- end with ... # rarely followed actually
- whitespace is very important
- dashes are used for lists
- colons : separate key: value pairs
- strings are rarely quoted
- follow indentation
- NO , commas in the usual block style (only the JSON-like flow style uses them)
- # this is a yaml comment for the remainder of the line
- booleans are true or false; null can be written as null or left blank
- newer YAML versions i.e. 1.2 can force a string interpretation with !!str
- quotation marks not obligatory
---
addresses:
  - ip: 10.0.0.1
    netmask: 255.255.255.0
  - ip: 192.168.200.200
    netmask: 255.255.255.0
  - ip: 172.17.17.17
    netmask: 255.255.255.0
someboolean: true
numbers:
  - 56
  - 57
  - 58
...
8.0.1 Sequence of Scalars
- Paul Cook
- Steve Jones
- John Simon Ritchie (a.k.a. Sid Vicious)
- John Lydon (a.k.a. Johnny Rotten)
8.0.2 : for key value pairs
A mapping is written as key: value. The mapping key does NOT have to be enclosed in double quotation marks.
Mark McGwire: {hr: 65, avg: 0.278}
Sammy Sosa: { hr: 63, avg: 0.288 }
8.0.3 mapping scalars to sequences
Sex Pistols:
- Paul Cook
- Steve Jones
- Johnny Rotten
- Sid Vicious
The Pogues:
- Shane MacGowan
- Jem Finer
- Andrew Ranken
- Phillip Chevron
- David Coulter
- Cait O'Riordan
The whitespaces are important in indentation. Indentation has meaning.
A great source to see YAML detail is the YAML documentation in ansible.com
8.1 YAML file layout
A file starts with three dashes. These dashes indicate the start of a new YAML document. YAML supports multiple documents, and compliant parsers will recognize each set of dashes as the beginning of a new one.
In the next example, we see the construct that makes up most typical YAML documents: a key-value pair. company is a key that points to a string value:
company: "acme explosives ltd."
YAML supports more than just string values. The file starts with five key-value pairs: four scalars of different data types and one list, each of which is shown in the following example:
---
company: "acme explosives ltd."
pi: 3.1415926535
happy: true
iterations: 6
users:
  - superuser
  - operator
  - admin
  - enduser
ietf-interfaces:
  - interface1:
      name: GigabitEthernet2
      description: WAN firewall
      enabled: true
      ietf-ip:ipv4:
        address:
          ip: 172.16.17.18
          netmask: 255.255.255.0
  - interface2:
      name: GigabitEthernet6
      description: load balancer
      enabled: true
      ietf-ip:ipv4:
        address:
          ip: 172.16.21.1
          netmask: 255.255.255.0
  - interface3:
      name: GigabitEthernet48
      description: core routers
      enabled: true
      ietf-ip:ipv4:
        address:
          ip: 10.10.1.254
          netmask: 255.255.255.0
  - interface4:
      addresses:
        - ip: 172.16.17.18
          netmask: 255.255.255.0
        - ip: 172.16.16.1
          netmask: 255.255.255.0
        - ip: 192.168.17.1
          netmask: 255.255.255.12
- company is a string. "acme explosives ltd."
- pi is a floating-point number. 3.1415926535
- happy is a boolean. true
- iterations is an integer. 6
- users is a list (array). users has four elements inside it, each denoted by an opening dash.
- ietf-interfaces is a collection of nested dictionaries
You can enclose strings in single or double-quotes or no quotes at all. YAML recognizes unquoted numerals as integers or floating point.
I have indented the users elements with two spaces. Indentation is how YAML denotes nesting. The number of spaces can vary from file to file, but they must be consistent within a file. Tabs are NOT allowed.
ietf-interfaces has four more elements inside it, each of them indented.
We can view ietf-interfaces as a container that holds four other embedded dictionaries. YAML supports nesting of key-values, and mixing types.
8.2 YAML Structures (two types)
There are only two types of structures used in YAML, Lists and Maps.
8.2.1 Lists
YAML lists are literally a sequence of objects. For example:
args:
  - sleep
  - "1000"
  - message
  - "Bring back Firefly!"
actors:
  - John Cleese
  - Michael Palin
  - Terry Jones
  - Eric Idle
  - Terry Gilliam
  - Graham Chapman
As you can see here, you can have virtually any number of items in a list, which is defined as items that start with a dash (-) and are indented from the parent. Compare the above with the equivalent JSON file:
{ "args": ["sleep", "1000", "message", "Bring back Firefly!"], "actors": ["John Cleese", "Michael Palin", "Terry Jones", "Eric Idle", "Terry Gilliam", "Graham Chapman"] }
And of course, members of the list can also be maps:
---
apiVersion: v1
kind: Pod
metadata:
  name: rss-site
  labels:
    app: web
spec:
  containers:
    - name: front-end
      image: nginx
      ports:
        - containerPort: 80
        - sec_containerPort: 443
    - name: rss-reader
      image: nickchase/rss-php-nginx:v1
      ports:
        - containerPort: 88
        - sec_containerPort: 443
So as you can see here, we have a list of 2 container “objects”, each of which consists of a name scalar, an image scalar, and a ports list. Each of the two ports list items is a single key: value pair.
The "value" of the containers: key is a list with 2 elements, each of which has three key:value pairs, the last one being a list with 2 elements.
ZP: this seems a bit awkward to me. I would like to see the above yaml converted to json and back again by some utility. –write myself a python script that does just that.
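One way such a round trip could look, as a minimal sketch. It assumes the third-party PyYAML package is installed (pip install pyyaml) and uses a shortened copy of the containers list rather than the full Pod definition.

```python
import json

import yaml  # PyYAML, a third-party package

# Shortened copy of the containers list, for illustration.
yaml_text = """\
containers:
  - name: front-end
    image: nginx
    ports:
      - containerPort: 80
      - sec_containerPort: 443
"""

data = yaml.safe_load(yaml_text)        # YAML text -> Python dict/list
as_json = json.dumps(data, indent=2)    # Python -> JSON text
back_to_yaml = yaml.safe_dump(          # JSON text -> Python -> YAML text
    json.loads(as_json), sort_keys=False
)

print(as_json)
print(back_to_yaml)
```

Printing the JSON makes the nesting explicit: each dash becomes an entry in a [ ] list, and each ports entry shows up as a one-key { } object.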
8.2.2 Maps (think of them as dictionary entries, "mapping" a key to a value)
Maps let you associate name-value pairs, which of course is convenient when you’re trying to set up configuration information. For example, you might have a config file that starts like this:
---
apiVersion: v1
kind: Pod
The first line is a separator, and is optional unless you’re trying to define multiple structures in a single file. From there, as you can see, we have two values, v1 and Pod, mapped to two keys, apiVersion and kind.
This kind of thing is pretty simple, of course, and you can think of it in terms of its JSON equivalent:
{ "apiVersion": "v1", "kind": "Pod" }
Notice that in our YAML version, the quotation marks are optional; the processor can tell that you’re looking at a string based on the formatting.
You can also specify more complicated structures by creating a key that maps to another map, rather than a string, as in:
---
apiVersion: v1
kind: Pod
metadata:
  name: rss-site
  labels:
    app: web
In this case, we have a key, metadata, that has as its value a map with 2 more keys, name and labels. The labels key itself has a map as its value. You can nest these as far as you want to.
---
ietf-interfaces:
  interface:
    name: GigabitEthernet2
    description: WAN firewall
    enabled: true
    ietf-ip:ipv4:
      address:
        ip: 172.16.17.18
        netmask: 255.255.255.0
addresses:
  - ip: 172.16.17.18
    netmask: 255.255.255.0
  - ip: 172.16.16.1
    netmask: 255.255.255.0
  - ip: 192.168.17.1
    netmask: 255.255.255.12
8.3 Long strings
YAML supports a "folding" syntax, i.e. line breaks inside a long string are replaced by spaces when the file is read.
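Folding is requested with the > indicator after the key. The sketch below, which assumes PyYAML is installed, loads a folded string and, for contrast, a literal string written with |, where the line breaks are kept. The key names and text are invented.

```python
import yaml  # PyYAML, a third-party package

doc = """\
folded: >
  this long string
  spans several lines
literal: |
  line one
  line two
"""

data = yaml.safe_load(doc)

print(repr(data["folded"]))    # line breaks folded into spaces
print(repr(data["literal"]))   # line breaks preserved
```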
9 Parsing and Streaming
Over a network, data arrives as a stream of characters, so it needs to be parsed at the receiving end to turn it back into a proper structure.
The opposite of that is serialization: turning structured data into a stream for transmission across a network.
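The two directions can be sketched with the standard json module. The device dictionary is made up, and slicing the bytes into 8-byte pieces merely imitates data arriving over a network in chunks.

```python
import json

# Invented structured data on the sending side.
device = {"hostname": "edge-router-1", "interfaces": ["Gi1", "Gi2"], "enabled": True}

# Serialization: structure -> flat stream of bytes.
wire_bytes = json.dumps(device).encode("utf-8")

# Imitate the stream arriving in small chunks, then reassemble it.
chunks = [wire_bytes[i:i + 8] for i in range(0, len(wire_bytes), 8)]
received = b"".join(chunks)

# Parsing: flat stream of bytes -> structure again.
restored = json.loads(received.decode("utf-8"))

print(len(chunks), "chunks received")
print(restored == device)
```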
9.1 python serialization functions
Several options exist, for example the third-party dicttoxml package:
- from dicttoxml import dicttoxml
- xmlstring = dicttoxml(mypythdictionary)
9.2 python parsing back to structured dict or list or something
import untangle  # xml parser library

myresponsepython = untangle.parse(myresponse)
Here is a small example of parsing data to json, and yaml
import json
import yaml
from pprint import pprint

with open('myfile.json', 'r') as json_file:
    ourjson = json.load(json_file)

pprint(ourjson)
print(ourjson['expires_in'])
print("The access token from JSON is: %s" % ourjson['access_token'])
print("\n\n---")  # to add the three --- that start a yaml file
print(yaml.dump(ourjson))  # yaml.dump takes the parsed Python object and outputs yaml
Where the file myfile.json looks like this:
{ "access_token":"ZDI3MGEyYzQtNmFlNS00NDNhLWFlNzAtZGVjNjE0MGU1OGZmZWNmZDEwN2ItYTU3", "expires_in":1209600, "refresh_token":"MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTIzNDU2Nzg5MDEyMzQ1Njc4OTEyMzQ1Njc4", "refreshtokenexpires_in":7776000 }
So basically we used:
- x = json.load(jsonfile)
- x = yaml.dump(ourjson)
- x = yaml.safe_load(yamlfile)
Note: the difference between yaml.load and yaml.safe_load:
yaml.safe_load(sys.stdin) is the same as yaml.load(sys.stdin, Loader=yaml.SafeLoader)
And yaml.full_load(sys.stdin) is the same as yaml.load(sys.stdin, Loader=yaml.FullLoader)
import yaml

with open('config.yaml') as f:
    try:
        config = yaml.load(f, Loader=yaml.FullLoader)
        print(config)
    except yaml.YAMLError as e:
        print(e)
is the same as:
import yaml

with open('config.yaml') as f:
    try:
        config = yaml.full_load(f)
        print(config)
    except yaml.YAMLError as e:
        print(e)
PyYAML offers the SafeLoader. From the docs: "SafeLoader(stream) supports only standard YAML tags and thus it does not construct class instances and probably safe to use with documents received from an untrusted source. The functions safe_load and safe_load_all use SafeLoader to parse a stream."
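A small sketch of what that means in practice, assuming PyYAML is installed. The untrusted string uses a Python-specific tag that asks the loader to call a function; safe_load refuses it with an error instead of acting on it, while plain data loads normally.

```python
import yaml  # PyYAML, a third-party package

plain = "name: rss-site\nreplicas: 2\n"
# A Python-specific tag that a fully permissive loader would act on.
untrusted = "!!python/object/apply:os.system ['echo hello']"

config = yaml.safe_load(plain)
print(config)

rejected = False
try:
    yaml.safe_load(untrusted)
except yaml.YAMLError as err:
    rejected = True
    print("safe_load refused the document:", type(err).__name__)
```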