extract_text does not work on lxml XHTML element
keturn opened this issue
Kevin Turner commented
I guess the docs do explicitly state lxml.html.HtmlElement, but the lxml docs say:
Note that XHTML is best parsed as XML, parsing it with the HTML parser can lead to unexpected results.
so I had been using lxml in XML-mode, and it failed with the not-so-obvious error:
…/python3.7/site-packages/html_text/html_text.py in parse_html(html)
47 XXX: mostly copy-pasted from parsel.selector.create_root_node
48 """
---> 49 body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
50 parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
51 root = lxml.etree.fromstring(body, parser=parser)
AttributeError: 'lxml.etree._Element' object has no attribute 'strip'
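The traceback shows that the first statement of parse_html treats its argument as a string. Below is a minimal sketch of that failure mode, not the library's code: `clean_markup` copies only the statement from the traceback, `as_markup` is a hypothetical helper, and the standard-library `xml.etree.ElementTree` stands in for `lxml.etree` so the snippet runs without third-party packages.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

XHTML = ('<html xmlns="http://www.w3.org/1999/xhtml"><head/><body>'
         '<p>Hello, World!</p>'
         '</body></html>')


def clean_markup(html):
    # Copy of the statement flagged in the traceback; it assumes `html` is a str.
    return html.strip().replace('\x00', '').encode('utf8') or b'<html/>'


def as_markup(html_or_tree):
    """Hypothetical helper: turn an already-parsed tree back into a str."""
    if isinstance(html_or_tree, str):
        return html_or_tree
    return ET.tostring(html_or_tree, encoding='unicode')


tree = ET.fromstring(XHTML)  # an element object, not a string

error = None
try:
    clean_markup(tree)
except AttributeError as exc:
    # Element objects have no .strip(), which is what the report ran into.
    error = str(exc)

# Serializing first gives clean_markup the str it expects.
body = clean_markup(as_markup(tree))
```

With lxml itself, the analogous serialization call would be `lxml.etree.tostring(tree, encoding='unicode')`. Whether re-parsing the serialized XHTML gives the desired text in html-text is not checked here.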
Test case:
from lxml import etree
from html_text import extract_text


def test_extract_text_from_xml_tree():
    # Parse the XHTML in XML mode, as the lxml docs recommend.
    xhtml = (u'<html xmlns="http://www.w3.org/1999/xhtml"><head/><body>'
             u'<p>Hello, World!</p>'
             u'</body></html>')
    text = u'Hello, World!'
    assert extract_text(etree.fromstring(xhtml, parser=etree.XMLParser()),
                        guess_punct_space=False, guess_layout=False) == text
Konstantin Lopuhin commented
@keturn right, good catch - this is something we should fix. In the meantime, you can try calling html_text.etree_to_text directly; that won't fail in parse_html (but may fail later, as I didn't check it). EDIT: as I see, you already tried that in #25.
Also, I didn't experience issues with parsing XHTML with the HTML parser, at least as far as html-text is concerned.