BitWorking

Converting XML Nodes into a Dictionary in Python using SAX

Recently I published a Python script for processing XML that turned a set of XML nodes into a dictionary, where the keys were the element names and all the namespaces were 'fixed'. In this case "fixed" means that a given prefix always maps to the same namespace; for example, 'dc' always maps to the Dublin Core namespace. That code worked, but it had a drawback: it used Python's DOM processing facilities, which meant that the whole document had to be parsed and loaded into memory before processing. That is slower and more memory-intensive than processing the document with a SAX parser, which handles it as a stream of events. So I switched the code to use SAX and came out with a handy enhancement to the interface.

In the old interface you supplied the list of nodes to be converted. For example:

from xml.dom import minidom

sampleText = """
<item>
  <title>MetaData</title>
  <link>http://bitworking.org/news/8</link>
  <description>&lt;h1>This is a header&lt;/h1></description>
</item>"""
# Parse the whole document into a DOM tree, then hand over the children of <item>.
item = minidom.parseString(sampleText).firstChild
result = convertNodesToDict(item.childNodes)

This produces the dictionary:

{
    u'link': u'http://bitworking.org/news/8', 
    u'description': u'<h1>This is a header</h1>',
    u'title': u'MetaData'
}

In the new interface you supply the XML document and the name of the parent element of all the elements you want extracted and put into a dictionary. The SAX interface made it easy to implement. Now you would call:

import StringIO

sampleText = """
<item>
  <title>MetaData</title>
  <link>http://bitworking.org/news/8</link>
  <description>&lt;h1>This is a header&lt;/h1></description>
</item>"""
# Pass a file-like object plus the name of the parent element to extract from.
result = convertNodesToDict(StringIO.StringIO(sampleText), 'item')
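To give a feel for why SAX makes this interface easy, here is a minimal sketch of the idea, not the actual XmlToDictBySAX.py code. It is written for Python 3 and ignores namespaces entirely; the names ChildrenToDictHandler and children_to_dict are made up for this illustration. The handler waits until it sees the named parent element, then records the text of each direct child as the events stream past.

```python
import io
import xml.sax


class ChildrenToDictHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collect the text of each direct child of a parent."""

    def __init__(self, parent_name):
        super().__init__()
        self.parent_name = parent_name
        self.result = {}
        self.depth = 0        # 0 = outside the parent, 1 = in it, 2 = in a child
        self.key = None       # name of the child element being collected
        self.text = []

    def startElement(self, name, attrs):
        if self.depth == 0:
            if name == self.parent_name:
                self.depth = 1
            return
        self.depth += 1
        if self.depth == 2:
            self.key = name
            self.text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.key is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.depth == 0:
            return
        if self.depth == 2:
            self.result[self.key] = "".join(self.text)
            self.key = None
        self.depth -= 1


def children_to_dict(stream, parent_name):
    """Hypothetical stand-in for the real function; parses a byte stream."""
    handler = ChildrenToDictHandler(parent_name)
    xml.sax.parse(stream, handler)
    return handler.result


sample = b"""<item>
  <title>MetaData</title>
  <link>http://bitworking.org/news/8</link>
  <description>&lt;h1>This is a header&lt;/h1></description>
</item>"""
result = children_to_dict(io.BytesIO(sample), "item")
```

Because the handler only starts collecting once it reaches the named parent, it never needs the whole tree in memory and does not care how deeply the parent is nested.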

The nice side benefit of this implementation is that your target item can be wrapped in any number of envelopes or parent elements and you never need to care. For example, the following two documents return exactly the same dictionary.

<item  xmlns:bc="http://purl.org/dc/elements/1.1/">
  <title>MetaData</title>
  <bc:date>2003-01-12T00:18:05-05:00</bc:date>
  <link>http://bitworking.org/news/8</link>
  <description>&lt;h1>This is a header&lt;/h1></description>
</item>

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
  </soap:Header>
  <soap:Body>
    <item xmlns:bc="http://purl.org/dc/elements/1.1/">
      <title>MetaData</title>
      <bc:date>2003-01-12T00:18:05-05:00</bc:date>
      <link>http://bitworking.org/news/8</link>
      <description>&lt;h1>This is a header&lt;/h1></description>
    </item>
  </soap:Body>
</soap:Envelope>

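Both get processed by 'convertNodesToDict' to produce the dictionary below. Note that the 'bc' prefix used in the documents comes out as 'dc', because it is bound to the Dublin Core namespace: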

{
    u'link': u'http://bitworking.org/news/8', 
    u'dc:date': u'2003-01-12T00:18:05-05:00', 
    u'description': u'<h1>This is a header</h1>',
    u'title': u'MetaData'
}
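The prefix fixing can be sketched with the namespace-aware side of SAX. This is an illustration under my own assumptions rather than the code in XmlToDictBySAX.py: it targets Python 3, the FIXED_PREFIXES table, the FixedPrefixHandler class and the to_dict function are hypothetical names, and the parent is matched on its local name only. With namespace processing turned on, the parser reports each element as a (namespace URI, local name) pair, so whatever prefix the document happened to use no longer matters.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

# Hypothetical table: namespace URI -> the prefix we always want in the keys.
FIXED_PREFIXES = {"http://purl.org/dc/elements/1.1/": "dc"}


class FixedPrefixHandler(xml.sax.ContentHandler):
    """Hypothetical handler: like a plain child collector, but namespace-aware."""

    def __init__(self, parent_name):
        super().__init__()
        self.parent_name = parent_name
        self.result = {}
        self.depth = 0        # 0 = outside the parent, 2 = in a direct child
        self.key = None
        self.text = []

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if self.depth == 0:
            if local == self.parent_name:
                self.depth = 1
            return
        self.depth += 1
        if self.depth == 2:
            prefix = FIXED_PREFIXES.get(uri)
            self.key = prefix + ":" + local if prefix else local
            self.text = []

    def characters(self, content):
        if self.key is not None:
            self.text.append(content)

    def endElementNS(self, name, qname):
        if self.depth == 0:
            return
        if self.depth == 2:
            self.result[self.key] = "".join(self.text)
            self.key = None
        self.depth -= 1


def to_dict(stream, parent_name):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)   # report (uri, localname) pairs
    handler = FixedPrefixHandler(parent_name)
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.result


bare = b"""<item xmlns:bc="http://purl.org/dc/elements/1.1/">
  <title>MetaData</title>
  <bc:date>2003-01-12T00:18:05-05:00</bc:date>
</item>"""
wrapped = (b'<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
           b"<soap:Body>" + bare + b"</soap:Body></soap:Envelope>")

from_bare = to_dict(io.BytesIO(bare), "item")
from_wrapped = to_dict(io.BytesIO(wrapped), "item")
```

Both calls should yield the same dictionary, with the date stored under 'dc:date' even though the documents spelled the prefix 'bc'.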

Isn't that neat? Bet you can't guess where this will come in handy.

And as a final note, XmlToDictBySAX.py is not only faster but also less code than the old DOM-based version.

For reference, the Perl module XML::Simple also does exactly what you describe; it was originally created for reading XML configuration files. A neat trick for working with namespaces easily is to use James Clark's namespace notation ('{http://namespace}element-name') for the dictionary keys, along with this handy little namespace helper class:

class Namespace:
    def __init__(self, uri):
        self.uri = uri
    def __getattr__(self, name):
        return '{' + self.uri + '}' + name

Then, in application code:

DC = Namespace('http://purl.org/dc/elements/1.1/')
print item[DC.date]

Posted by Ken MacLeod on 2003-03-22

I had a chance to update XmlToDictBySAX.py to show what I was getting at. Check it out at XmlToDictBySAXNS.py

Posted by Ken MacLeod on 2003-04-01

I fat-fingered that link somehow, http://bitsko.slc.ut.us/~ken/XmlToDictBySAXNS.py.txt

Posted by Ken MacLeod on 2003-04-01

2003-03-16