dabeaz / python-cookbook

Code samples from the "Python Cookbook, 3rd Edition", published by O'Reilly & Associates, May, 2013.

Huge XML files

bradwood opened this issue

Maybe it's me, but this code just doesn't seem to work for me:

https://github.com/dabeaz/python-cookbook/blob/master/src/6/incremental_parsing_of_huge_xml_files/example.py

My tag_stack == path_parts check never fires, so the generator yields nothing.

Is there a typo somewhere in the example?

Offending XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="xmltv.co.uk" source-info-name="xmltv.co.uk">
  <channel id="f3932e75f691561adbe3b609369e487b">
    <display-name>BBC One Lon</display-name>
    <icon src="/images/channels/f3932e75f691561adbe3b609369e487b.png"/>
  </channel>
  <channel id="a3c70f4c25110a9ca84f7c604023ee6c">
    <display-name>Dave</display-name>
    <icon src="/images/channels/a3c70f4c25110a9ca84f7c604023ee6c.png"/>
  </channel>
  <programme start="20181005060000 +0100" stop="20181005091500 +0100" channel="f3932e75f691561adbe3b609369e487b">
    <title lang="en">Break</title>
    <desc lang="en">The latest news, sport, business and weather from the BBC's Breakfast team. Also in HD. [S] Including regional news at 25 and 55 minutes past each hour.</desc>
  </programme>
  <programme start="20181005130000 +0100" stop="20181005133000 +0100" channel="f3932e75f691561adbe3b609369e487b">
    <title lang="en">BBC News at One</title>
    <desc lang="en">The latest national and international news stories from the BBC News team, followed by weather. Also in HD. [S]</desc>
  </programme>
...

Brad

Here is my code, which is identical except for some extra logging calls:

import logging
from xml.etree.ElementTree import iterparse

def xml_parse_and_remove(filename, path):
    LOGGER = logging.getLogger(__name__)

    path_parts = path.split('/')
    doc = iterparse(filename, ('start', 'end'))
    next(doc)  # skip the root element
    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            LOGGER.debug('event == start')
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            LOGGER.debug('event == end')
            LOGGER.debug(f'tag stack = {tag_stack}')
            if tag_stack == path_parts:
                LOGGER.debug(elem)
                yield elem
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass

And here is the relevant snippet of log:

[2018-10-10 16:06:19,292] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,292] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:tag stack = ['channel', 'display-name']
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:tag stack = ['channel', 'icon']
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:tag stack = ['channel']
[2018-10-10 16:06:19,294] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,294] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,295] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,296] DEBUG:pyskyq.utils:tag stack = ['channel', 'display-name']
[2018-10-10 16:06:19,297] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,298] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,298] DEBUG:pyskyq.utils:tag stack = ['channel', 'icon']
[2018-10-10 16:06:19,300] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:tag stack = ['channel']
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,303] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,303] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:tag stack = ['programme']
[2018-10-10 16:06:19,305] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,306] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,306] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,306] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,307] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,307] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,308] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,308] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,309] DEBUG:pyskyq.utils:tag stack = ['programme']
[2018-10-10 16:06:19,336] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,337] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,338] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,339] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,339] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,339] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,340] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,341] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,341] DEBUG:pyskyq.utils:tag stack = ['programme']
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,343] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,343] DEBUG:pyskyq.utils:tag stack = ['programme']
...

A ['channel', 'channel'] tag stack is never logged, so the yield is never reached. Any thoughts?

Thanks again!

B

Sorry for the stream of consciousness here, but I was invoking my call with path='channel/channel', the same way the book does it with row/row. I assumed the two parts were the opening and closing tags at the same nesting level, rather than nested levels of tags. The book's pothole file actually has a <row> nested inside another <row>, so each part of the path names one level of nesting below the root.

Feeling a bit stupid now...
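
To illustrate how the path is matched, here is a minimal sketch that records the tag stack at every 'end' event, with the root skipped as in the recipe. The SAMPLE document and the tag_paths helper are hypothetical and only mirror the shape of the XMLTV file above.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical, trimmed-down sample shaped like the XMLTV file above.
SAMPLE = b"""<tv>
  <channel id="a"><display-name>BBC One Lon</display-name></channel>
  <channel id="b"><display-name>Dave</display-name></channel>
</tv>"""

def tag_paths(source):
    """Return the '/'-joined tag stack seen at each 'end' event."""
    doc = iterparse(source, ('start', 'end'))
    next(doc)  # skip the root's 'start' event, as the recipe does
    stack, paths = [], []
    for event, elem in doc:
        if event == 'start':
            stack.append(elem.tag)
        else:
            paths.append('/'.join(stack))
            if stack:  # the root's 'end' event arrives with an empty stack
                stack.pop()
    return paths

paths = tag_paths(io.BytesIO(SAMPLE))
print(paths)
```

With this input the stack only ever reads 'channel/display-name', 'channel', or empty (at the root's 'end' event), so a path of 'channel/channel' can never match, while 'channel' can.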

When I invoke it with path='channel' it yields fine, unsurprisingly. However, now I get:

>                   elem_stack[-2].remove(elem)
E                   IndexError: list index out of range
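
A likely cause, offered as an assumption rather than a confirmed diagnosis: the root element is skipped and never pushed onto elem_stack, so for a one-part path the stack holds only the matched element and elem_stack[-2] does not exist. One possible workaround, sketched below, is to keep the root on elem_stack so that top-level matches have a parent to be removed from. The name parse_and_remove and the SAMPLE data are hypothetical, and this is not the book's code.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample shaped like the XMLTV file above.
SAMPLE = b"""<tv>
  <channel id="a"><display-name>BBC One Lon</display-name></channel>
  <channel id="b"><display-name>Dave</display-name></channel>
</tv>"""

def parse_and_remove(source, path):
    path_parts = path.split('/')
    doc = iterparse(source, ('start', 'end'))
    _, root = next(doc)      # keep the root instead of discarding it
    tag_stack = []           # tags below the root, matched against path_parts
    elem_stack = [root]      # root stays here as the parent of top-level matches
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)  # parent always exists now
            if tag_stack:    # empty only at the root's own 'end' event
                tag_stack.pop()
                elem_stack.pop()

ids = [e.get('id') for e in parse_and_remove(io.BytesIO(SAMPLE), 'channel')]
print(ids)  # ['a', 'b']
```

Nested paths such as 'channel/display-name' should behave as before, since the parent of the matched element is still the second-to-last entry on elem_stack.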

...