extract_text fails with misleading error message when given bytes instead of unicode [py3]

Question

keturn opened this issue 4 years ago

The error is shown as "a bytes-like object is required, not 'str'", but this is misleading, because the caller's actual mistake was passing a bytes object.

Honestly, I'm not sure what the Pythonic way to deal with this is.

Explicit assert isinstance type checking?
Type annotations, and hope the user is running in an environment that will type-check before they hit this exception?

html_text.extract_text(b'<html><body><p>Hello,   World!</p></body></html>')

…/python3.7/site-packages/html_text/html_text.py in parse_html(html)
     47     XXX: mostly copy-pasted from parsel.selector.create_root_node
     48     """
---> 49     body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
     50     parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
     51     root = lxml.etree.fromstring(body, parser=parser)

TypeError: a bytes-like object is required, not 'str'

I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.

Konstantin Lopuhin · Answer 1 · Mon Feb 10 2020 15:38:45 GMT+0800 (China Standard Time)

Yeah, it's the .replace method of the bytestring that raises this error, and it is confusing for the user. For html-text, an explicit type check in extract_text seems like a good usability improvement to me, but it should raise TypeError instead of using an assert.
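
For illustration, here is a minimal sketch of what such a check might look like. The helper name ensure_text and the wording of its message are assumptions made up for this example, not part of html_text. The first try block reproduces the confusing message using only the bytes method from the traceback, so it runs without html_text installed.

```python
def ensure_text(html):
    """Hypothetical helper (not part of html_text): reject non-str input early."""
    if not isinstance(html, str):
        raise TypeError(
            "expected html to be str, got {}; decode bytes first, "
            "e.g. html.decode('utf8')".format(type(html).__name__)
        )
    return html


# Reproduce the confusing message: bytes.replace() is called with str arguments.
try:
    b"<p>Hello</p>".strip().replace("\x00", "")
except TypeError as exc:
    original_message = str(exc)

# With the explicit check, the message names the caller's actual mistake.
try:
    ensure_text(b"<p>Hello</p>")
except TypeError as exc:
    clearer_message = str(exc)

print(original_message)
print(clearer_message)
```

Raising TypeError rather than asserting keeps the check active under python -O, which strips assert statements.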

Konstantin Lopuhin · Answer 2 · Mon Feb 10 2020 15:42:22 GMT+0800 (China Standard Time)

keturn wrote: "I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes."

That's also possible, but note that the input must be UTF-8-encoded HTML, so a raw response body in a different encoding would not be handled correctly. Accepting only strings rules out that error, and the time spent re-encoding seems small compared to text extraction time. But maybe it's fine to support bytes if the error on non-UTF-8 HTML is not too obscure.
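
If bytes were supported, one way to keep the non-UTF-8 error from being obscure might be to validate the input up front. The following is only a sketch under that assumption: to_utf8_body is a hypothetical name, and the str branch mirrors the line shown in the traceback above rather than any released html_text code.

```python
def to_utf8_body(html):
    """Hypothetical helper: return UTF-8 bytes suitable for the lxml parser."""
    if isinstance(html, bytes):
        try:
            # Decode only to validate; the original bytes are reused below.
            html.decode("utf8")
        except UnicodeDecodeError as exc:
            raise ValueError(
                "bytes input must be UTF-8 encoded HTML; decode it with the "
                "correct encoding and pass str instead"
            ) from exc
        return html.strip().replace(b"\x00", b"") or b"<html/>"
    if isinstance(html, str):
        # Same transformation as the existing line in parse_html.
        return html.strip().replace("\x00", "").encode("utf8") or b"<html/>"
    raise TypeError(
        "expected html to be str or bytes, got {}".format(type(html).__name__)
    )


body_from_str = to_utf8_body("  <p>Hello</p>  ")
body_from_bytes = to_utf8_body(b"<p>Hello</p>")
```

The validation pass costs one extra decode of the input, which fits the observation above that re-encoding time is small compared to extraction. It cannot catch input in another encoding that happens to also be valid UTF-8.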