TeamHG-Memex / html-text

Extract text from HTML

extract_text fails with misleading error message when given bytes instead of unicode [py3]

keturn opened this issue

The error is reported as "a bytes-like object is required, not 'str'", which is misleading: the caller's mistake was precisely that they passed a bytes object.

Honestly not sure what the pythonic way to deal with this is.

  • Explicit assert isinstance type checking?
  • Type annotations, hoping the user runs in an environment that type checks before they hit this exception?

html_text.extract_text(b'<html><body><p>Hello,   World!</p></body></html>')
…/python3.7/site-packages/html_text/html_text.py in parse_html(html)
     47     XXX: mostly copy-pasted from parsel.selector.create_root_node
     48     """
---> 49     body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
     50     parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
     51     root = lxml.etree.fromstring(body, parser=parser)

TypeError: a bytes-like object is required, not 'str'

I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.
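
A minimal sketch of that idea, assuming a hypothetical helper named `to_body` that stands in for the first line of `parse_html`. It is not the library's code, and it simply trusts that any bytes input is already UTF-8 encoded:

```python
def to_body(html):
    """Hypothetical stand-in for the first line of parse_html.

    Returns UTF-8 bytes suitable for passing to the lxml parser.
    """
    if isinstance(html, bytes):
        # Already bytes: skip the encode step, and use bytes arguments
        # for replace() so it does not raise TypeError.
        # Assumption: the bytes are UTF-8 encoded.
        body = html.strip().replace(b'\x00', b'')
    else:
        body = html.strip().replace('\x00', '').encode('utf8')
    # Same fallback as the original line for empty input.
    return body or b'<html/>'


print(to_body(b'<p>Hello</p>'))
print(to_body('<p>Hello</p>'))
```

Both calls yield the same bytes, so the rest of `parse_html` would not need to know which type the caller passed.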

Yeah, it's the .replace method of the bytes object that raises this error, and that is confusing for the user. For html-text, an explicit type check in extract_text seems like a good usability improvement to me, but it should raise TypeError rather than use an assert.
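
A rough sketch of both points, using only the standard library. The first `try` reproduces the underlying failure without html-text; `check_html_type` is a hypothetical guard (the name and the message wording are assumptions, not the library's API) showing what a clearer error could look like:

```python
def check_html_type(html):
    """Hypothetical guard that could run at the top of extract_text."""
    if not isinstance(html, str):
        # Raise TypeError rather than assert: asserts are stripped
        # when Python runs with -O.
        raise TypeError(
            "html must be str, got {}".format(type(html).__name__))
    return html


# bytes.replace() called with str arguments produces the confusing message.
try:
    b'<p>Hello</p>'.replace('\x00', '')
except TypeError as exc:
    original_message = str(exc)

# The explicit check names the type the caller actually passed.
try:
    check_html_type(b'<p>Hello</p>')
except TypeError as exc:
    clearer_message = str(exc)

print(original_message)
print(clearer_message)
```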

> I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.

That's also possible, but note that the HTML must then be UTF-8 encoded, so if it's a raw response body in a different encoding, it would not work correctly. Accepting only strings makes sure we don't have this problem, and the time spent re-encoding seems small compared to text extraction time. But maybe it's fine to support bytes if the error on non-UTF-8 HTML is not too obscure.
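
One way the "not too obscure" error might look, sketched with a hypothetical helper `body_from_bytes` (the name, exception type, and message are assumptions). Note that a strict decode only catches byte sequences that are invalid UTF-8; input in another encoding that happens to be valid UTF-8, such as pure ASCII, passes through unchanged:

```python
def body_from_bytes(raw):
    """Hypothetical helper: accept bytes only if they are valid UTF-8."""
    try:
        # Strict decode is used purely as a validity check.
        raw.decode('utf8')
    except UnicodeDecodeError as exc:
        raise ValueError(
            "html given as bytes must be UTF-8 encoded; "
            "decode it to str with the correct encoding first") from exc
    return raw.strip().replace(b'\x00', b'') or b'<html/>'


utf8_page = '<p>caf\u00e9</p>'.encode('utf8')
latin1_page = '<p>caf\u00e9</p>'.encode('latin-1')

print(body_from_bytes(utf8_page))
try:
    body_from_bytes(latin1_page)
except ValueError as exc:
    print(exc)
```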