Recently I published a Python script that turned a set of XML nodes into a dictionary, where the keys are the element names and all the namespaces are 'fixed'. Here 'fixed' means that 'dc' always maps to the Dublin Core namespace. That code worked, but it had the drawback of using Python's DOM processing facilities, which meant the whole document had to be parsed and loaded into memory before processing. That is slower and more memory intensive than handling the same document with a SAX parser. So I switched the code over to SAX and came out with a handy enhancement to the interface.
In the old interface you supplied the list of nodes to be converted. For example:
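To make the DOM versus SAX difference concrete, here is a minimal sketch that pulls the child element names out of a small document both ways. It is written for Python 3 (unlike the Python 2 snippets in this post), and the names DOC and NameRecorder are made up for the illustration; none of this is taken from the published script.

```python
import xml.sax
from xml.dom import minidom

DOC = b"<item><title>MetaData</title><link>http://bitworking.org/news/8</link></item>"

# DOM: the whole tree is built in memory before any of it can be inspected.
tree = minidom.parseString(DOC)
dom_names = [node.tagName
             for node in tree.documentElement.childNodes
             if node.nodeType == node.ELEMENT_NODE]


# SAX: the parser calls back as each tag is read; nothing is kept
# unless the handler chooses to keep it.
class NameRecorder(xml.sax.ContentHandler):  # hypothetical helper
    def __init__(self):
        super().__init__()
        self.names = []
        self.depth = 0

    def startElement(self, name, attrs):
        if self.depth == 1:  # direct child of the root element
            self.names.append(name)
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1


recorder = NameRecorder()
xml.sax.parseString(DOC, recorder)
sax_names = recorder.names

print(dom_names)
print(sax_names)
```

Both approaches see the same elements; the difference is that the SAX handler only ever holds the short list of names, not the document tree.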
from xml.dom import minidom

sampleText = """
<item>
<title>MetaData</title>
<link>http://bitworking.org/news/8</link>
<description><h1>This is a header</h1></description>
</item>"""
item = minidom.parseString(sampleText).firstChild
dict = convertNodesToDict(item.childNodes)
This produces the dictionary:
{
u'link': u'http://bitworking.org/news/8',
u'description': u'<h1>This is a header</h1>',
u'title': u'MetaData'
}
In the new interface you supply the XML document and the name of the parent element whose children you want extracted into a dictionary. The SAX interface made this easy to implement. Now you would call:
import StringIO

sampleText = """
<item>
<title>MetaData</title>
<link>http://bitworking.org/news/8</link>
<description><h1>This is a header</h1></description>
</item>"""
dict = convertNodesToDict(StringIO.StringIO(sampleText), 'item')
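For a rough idea of how a SAX handler can support this kind of call, here is a minimal sketch. It is not the code in XmlToDictBySAX.py; the class ChildrenToDict and the function convert_nodes_to_dict are hypothetical names, it ignores namespaces entirely, and it is written for Python 3, so it reads from a bytes stream and the strings print without the u prefix.

```python
import io
import xml.sax
from xml.sax.saxutils import quoteattr


class ChildrenToDict(xml.sax.ContentHandler):  # hypothetical handler
    """Collect the children of the first `parent_name` element into a dict."""

    def __init__(self, parent_name):
        super().__init__()
        self.parent_name = parent_name
        self.result = {}
        self.depth = 0      # 0: outside the parent, 1: in the parent, 2+: in a child
        self.key = None
        self.buffer = []
        self.done = False   # set once the parent element has been closed

    def startElement(self, name, attrs):
        if self.done:
            return
        if self.depth == 0:
            if name == self.parent_name:
                self.depth = 1
            return
        if self.depth == 1:
            # A direct child of the parent: its name becomes the key.
            self.key, self.buffer = name, []
        else:
            # Markup nested inside a child is kept as literal text (not re-escaped).
            attr_text = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
            self.buffer.append("<%s%s>" % (name, attr_text))
        self.depth += 1

    def characters(self, content):
        if self.depth >= 2:
            self.buffer.append(content)

    def endElement(self, name):
        if self.done or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True
        elif self.depth == 1:
            self.result[self.key] = "".join(self.buffer)
        else:
            self.buffer.append("</%s>" % name)


def convert_nodes_to_dict(stream, parent_name):
    handler = ChildrenToDict(parent_name)
    xml.sax.parse(stream, handler)
    return handler.result


SAMPLE = """
<item>
<title>MetaData</title>
<link>http://bitworking.org/news/8</link>
<description><h1>This is a header</h1></description>
</item>"""

print(convert_nodes_to_dict(io.BytesIO(SAMPLE.encode("utf-8")), "item"))
```

Because the handler only starts collecting once it sees the named parent, anything that comes before or after that element in the stream is simply skipped.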
A nice side benefit of this implementation is that your target item can be wrapped in a couple of envelopes or parent elements and you never need to care. For example, the following two documents return exactly the same dictionary:
<item xmlns:bc="http://purl.org/dc/elements/1.1/">
<title>MetaData</title>
<bc:date>2003-01-12T00:18:05-05:00</bc:date>
<link>http://bitworking.org/news/8</link>
<description><h1>This is a header</h1></description>
</item>

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Header>
</soap:Header>
<soap:Body>
<item xmlns:bc="http://purl.org/dc/elements/1.1/">
<title>MetaData</title>
<bc:date>2003-01-12T00:18:05-05:00</bc:date>
<link>http://bitworking.org/news/8</link>
<description><h1>This is a header</h1></description>
</item>
</soap:Body>
</soap:Envelope>
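Both get processed by 'convertNodesToDict' to produce the dictionary below. Note that the 'bc' prefix, which both documents bind to the Dublin Core namespace, comes out as the fixed prefix 'dc':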
{
u'link': u'http://bitworking.org/news/8',
u'dc:date': u'2003-01-12T00:18:05-05:00',
u'description': u'<h1>This is a header</h1>',
u'title': u'MetaData'
}
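One way to get this prefix 'fixing' with SAX is to turn on namespace processing, so the handler sees each element's namespace URI instead of whatever prefix the document happened to use, and then look the URI up in a table of fixed prefixes. The sketch below is an assumption about how that could be done, not the code from XmlToDictBySAX.py: FIXED_PREFIXES, FixedPrefixHandler and convert are hypothetical names, only the Dublin Core entry in the table comes from this post, and the two test documents are trimmed versions of the ones above with the description left out so the handler does not have to deal with nested markup. It is written for Python 3.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Assumed lookup table; only the Dublin Core mapping is taken from the post.
FIXED_PREFIXES = {"http://purl.org/dc/elements/1.1/": "dc"}


class FixedPrefixHandler(ContentHandler):  # hypothetical handler
    def __init__(self, parent_name):
        super().__init__()
        self.parent_name = parent_name
        self.result = {}
        self.depth = 0      # 0: outside the parent, 1: in the parent, 2+: in a child
        self.key = None
        self.text = []
        self.done = False

    def startElementNS(self, name, qname, attrs):
        uri, local = name   # uri is None for elements in no namespace
        if self.depth == 0:
            if not self.done and uri is None and local == self.parent_name:
                self.depth = 1
            return
        if self.depth == 1:
            # Key on the fixed prefix for the URI, not the document's own prefix.
            # Elements in unknown namespaces fall back to the bare local name.
            prefix = FIXED_PREFIXES.get(uri)
            self.key = "%s:%s" % (prefix, local) if prefix else local
            self.text = []
        self.depth += 1

    def characters(self, content):
        if self.depth >= 2:
            self.text.append(content)

    def endElementNS(self, name, qname):
        if self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 1:
            self.result[self.key] = "".join(self.text)
        elif self.depth == 0:
            self.done = True


def convert(xml_text, parent_name):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # deliver (uri, localname) pairs
    handler = FixedPrefixHandler(parent_name)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.result


BARE = """<item xmlns:bc="http://purl.org/dc/elements/1.1/">
<title>MetaData</title>
<bc:date>2003-01-12T00:18:05-05:00</bc:date>
</item>"""

WRAPPED = ('<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">\n'
           "<soap:Header></soap:Header>\n"
           "<soap:Body>\n" + BARE + "\n</soap:Body>\n"
           "</soap:Envelope>")

print(convert(BARE, "item"))
print(convert(WRAPPED, "item"))
```

The envelope elements never match the parent name, so they pass through without touching the result, and the 'bc' prefix never reaches the dictionary because the key is built from the namespace URI.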
Isn't that neat? Bet you can't guess where this will come in handy.
As a final note, XmlToDictBySAX.py is not only faster than the old DOM-based version, it is also less code.
Posted by Ken MacLeod on 2003-04-01
Posted by Ken MacLeod on 2003-03-22