Huge XML files
bradwood opened this issue · comments
Maybe it's me, but this code just doesn't seem to work for me: my tag_stack == path_parts check never fires, so the generator yields nothing.
Is there a typo somewhere here?
Offending XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="xmltv.co.uk" source-info-name="xmltv.co.uk">
  <channel id="f3932e75f691561adbe3b609369e487b">
    <display-name>BBC One Lon</display-name>
    <icon src="/images/channels/f3932e75f691561adbe3b609369e487b.png"/>
  </channel>
  <channel id="a3c70f4c25110a9ca84f7c604023ee6c">
    <display-name>Dave</display-name>
    <icon src="/images/channels/a3c70f4c25110a9ca84f7c604023ee6c.png"/>
  </channel>
  <programme start="20181005060000 +0100" stop="20181005091500 +0100" channel="f3932e75f691561adbe3b609369e487b">
    <title lang="en">Break</title>
    <desc lang="en">The latest news, sport, business and weather from the BBC's Breakfast team. Also in HD. [S] Including regional news at 25 and 55 minutes past each hour.</desc>
  </programme>
  <programme start="20181005130000 +0100" stop="20181005133000 +0100" channel="f3932e75f691561adbe3b609369e487b">
    <title lang="en">BBC News at One</title>
    <desc lang="en">The latest national and international news stories from the BBC News team, followed by weather. Also in HD. [S]</desc>
  </programme>
  ...
Brad
Here is my code, which is identical, but for some extra logging calls:
import logging
from xml.etree.ElementTree import iterparse

def xml_parse_and_remove(filename, path):
    LOGGER = logging.getLogger(__name__)
    path_parts = path.split('/')
    doc = iterparse(filename, ('start', 'end'))
    next(doc)  # skip the root element
    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            LOGGER.debug('event == start')
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            LOGGER.debug('event == end')
            LOGGER.debug(f'tag stack = {tag_stack}')
            if tag_stack == path_parts:
                LOGGER.debug(elem)
                yield elem
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass
And here is the relevant snippet of log:
[2018-10-10 16:06:19,292] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,292] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:tag stack = ['channel', 'display-name']
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:tag stack = ['channel', 'icon']
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,293] DEBUG:pyskyq.utils:tag stack = ['channel']
[2018-10-10 16:06:19,294] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,294] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,295] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,296] DEBUG:pyskyq.utils:tag stack = ['channel', 'display-name']
[2018-10-10 16:06:19,297] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,298] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,298] DEBUG:pyskyq.utils:tag stack = ['channel', 'icon']
[2018-10-10 16:06:19,300] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:tag stack = ['channel']
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,302] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,303] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,303] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,304] DEBUG:pyskyq.utils:tag stack = ['programme']
[2018-10-10 16:06:19,305] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,306] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,306] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,306] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,307] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,307] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,308] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,308] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,309] DEBUG:pyskyq.utils:tag stack = ['programme']
[2018-10-10 16:06:19,336] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,337] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,338] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,339] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,339] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,339] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,340] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,341] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,341] DEBUG:pyskyq.utils:tag stack = ['programme']
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:tag stack = ['programme', 'title']
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == start
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,342] DEBUG:pyskyq.utils:tag stack = ['programme', 'desc']
[2018-10-10 16:06:19,343] DEBUG:pyskyq.utils:event == end
[2018-10-10 16:06:19,343] DEBUG:pyskyq.utils:tag stack = ['programme']
...
There is never a ['channel', 'channel'] logged, so the yield is never reached. Any thoughts?
thanks again!
B
Sorry for the stream of consciousness here, but I was invoking my call with path='channel/channel', the way the book does it with row/row. I assumed these were opening and closing tags at the same nesting level, rather than nested levels of tags, as the book shows with the pothole file, which has a <row> nested in another <row>.
Feeling a bit stupid now...
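For anyone else who trips over the path argument: it describes nesting below the root element, one tag per level. The sketch below is not from the book; tag_paths is a hypothetical helper name, and the two inline documents are made-up stand-ins for the XMLTV file and the book's pothole file. It prints the root-relative path seen at each 'end' event, which is exactly what the generator compares against path_parts.

```python
import io
from xml.etree.ElementTree import iterparse

def tag_paths(source):
    """Yield the '/'-joined tag stack, relative to the root, at each 'end' event.

    Hypothetical helper for illustration only.
    """
    doc = iterparse(source, ('start', 'end'))
    next(doc)  # consume the root's 'start' event, so the root tag never enters the stack
    stack = []
    for event, elem in doc:
        if event == 'start':
            stack.append(elem.tag)
        elif stack:  # an 'end' event; the stack is empty only for the root's own 'end'
            yield '/'.join(stack)
            stack.pop()

# Flat layout, like the XMLTV file: <channel> sits directly under the root.
flat = b'<tv><channel><display-name>Dave</display-name></channel></tv>'
print(list(tag_paths(io.BytesIO(flat))))

# Nested layout, like the pothole file: a <row> inside another <row>.
nested = b'<response><row><row><x/></row></row></response>'
print(list(tag_paths(io.BytesIO(nested))))
```

With the flat layout the only path ending in channel is plain 'channel', so 'channel/channel' can never match; 'row/row' only matches because the pothole data really does nest one row inside another.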
When I invoke it with path='channel' it yields fine, unsurprisingly. However, now I get:
> elem_stack[-2].remove(elem)
E IndexError: list index out of range
...
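A likely cause of that IndexError: next(doc) throws away the root's 'start' event, so the root element is never pushed onto elem_stack. With a one-component path such as 'channel', elem_stack holds only the matched element when the comparison succeeds, and elem_stack[-2] has nothing to index. One possible workaround, sketched below under that assumption, is to keep the root from the skipped event and seed elem_stack with it. The name parse_and_remove_from_root is made up for this example, and this is a variation on the recipe, not the book's code.

```python
import io
from xml.etree.ElementTree import iterparse

def parse_and_remove_from_root(source, path):
    """Variant of the recipe that also works for matches directly under the root.

    Hypothetical name; a sketch, not the book's implementation.
    """
    path_parts = path.split('/')
    doc = iterparse(source, ('start', 'end'))
    _, root = next(doc)      # keep the root element instead of discarding it
    tag_stack = []           # still relative to the root, so paths are unchanged
    elem_stack = [root]      # the root is the parent of every depth-1 match
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)  # detach from parent to free memory
            if tag_stack:    # empty only on the root's own 'end' event
                tag_stack.pop()
                elem_stack.pop()

# Made-up miniature of the XMLTV document from this thread.
xml = (b'<tv><channel id="a"><display-name>A</display-name></channel>'
       b'<channel id="b"/><programme channel="a"/></tv>')
for chan in parse_and_remove_from_root(io.BytesIO(xml), 'channel'):
    print(chan.get('id'))
```

Because tag_stack stays root-relative, nested paths like 'row/row' should behave as before; only the parent lookup changes.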