fedaykin / xmltodict

Python module that makes working with XML feel like you are working with JSON

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

xmltodict

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

Build Status

>>> doc = xmltodict.parse("""
... <mydocument has="an attribute">
...   <and>
...     <many>elements</many>
...     <many>more elements</many>
...   </and>
...   <plus a="complex">
...     element as well
...   </plus>
... </mydocument>
... """)
>>>
>>> doc['mydocument']['@has']
u'an attribute'
>>> doc['mydocument']['and']['many']
[u'elements', u'more elements']
>>> doc['mydocument']['plus']['@a']
u'complex'
>>> doc['mydocument']['plus']['#text']
u'element as well'

It's very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

>>> def handle_artist(_, artist):
...     print artist['name']
>>> 
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...

It can also be used from the command line to pipe objects to a script like this:

import sys, marshal
while True:
    _, article = marshal.load(sys.stdin)
    print article['title']
$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...

You can also convert in the other direction, using the unparse() method:

>>> mydict = {
...     'page': {
...         'title': 'King Crimson',
...         'ns': 0,
...         'revision': {
...             'id': 547909091,
...         }
...     }
... }
>>> print unparse(mydict)
<?xml version="1.0" encoding="utf-8"?>
<page><ns>0</ns><revision><id>547909091</id></revision><title>King Crimson</title></page>

Ok, how do I get it?

You just need to

$ pip install xmltodict

There is an official Fedora package for xmltodict. If you are on Fedora or RHEL, you can do:

$ sudo yum install python-xmltodict

Donate

If you love xmltodict, consider supporting the author on Gittip.

About

Python module that makes working with XML feel like you are working with JSON

License:MIT License