TeamHG-Memex / html-text

Extract text from HTML

Handle non-breaking spaces and other special unicode characters

lopuhin opened this issue · comments

See discussion in #2 (comment)

not sure if this is the same issue, but I'm getting:

ERROR:scrapy.core.scraper:Spider error processing <GET http://www.magnetoinvestigators.com/contact-us> (referer: http://www.magnetoinvestigators.com)
Traceback (most recent call last):
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 77, in cleaned_selector
    tree = _cleaned_html_tree(html)
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 33, in _cleaned_html_tree
    tree = lxml.html.fromstring(html.encode('utf8'), parser=parser)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9e' in position 785: surrogates not allowed

Apparently this is a new strictness introduced by Python 3.
Possibly using the surrogateescape flag in encode could help...?
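A small stdlib-only sketch of the failure and the available error handlers (the sample string is invented to mirror the '\udb9e' from the traceback). One caveat worth noting: on *encoding*, surrogateescape only maps the low surrogates U+DC80..U+DCFF back to bytes, so it cannot handle a high surrogate like '\udb9e'; surrogatepass or replace would be needed for that:

```python
# Invented sample mirroring the lone surrogate from the traceback.
bad = "position 785: \udb9e"

# Strict encoding fails, exactly as in the traceback above.
try:
    bad.encode("utf8")
    raised = False
except UnicodeEncodeError:
    raised = True

# surrogatepass accepts any lone surrogate and round-trips it losslessly:
data = bad.encode("utf8", errors="surrogatepass")
assert data.decode("utf8", errors="surrogatepass") == bad

# errors="replace" substitutes '?', giving bytes any parser can accept:
clean = bad.encode("utf8", errors="replace")
```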

Thanks for the report, @codinguncut! For now you can work around this issue by parsing the document yourself and passing an lxml.html.HtmlElement into html_text.extract_text.

The issue is that Scrapy used the Content-Type header to get the encoding ('utf-7'), while the site in fact seems to return UTF-8. Scrapy then decodes the body with errors='replace' (w3lib_replace, to be precise; see https://github.com/scrapy/w3lib/blob/34435d085c6adb14c94cd0188c23f6dc7d4da0f7/w3lib/encoding.py#L174), and for some reason this produces an output which can't be encoded back to UTF-8.

I think the right place to fix this is probably w3lib. html-text can provide extra robustness by using surrogateescape, but it would be better to obtain a proper unicode body before passing it to html_text.
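Until that lands, a caller can normalize the body themselves before handing it over. A minimal stdlib sketch (the helper name is invented, not part of html-text or w3lib):

```python
def sanitize_body(html: str) -> str:
    """Strip lone surrogates so the text can be safely encoded to UTF-8.

    Illustrative helper only; round-trips through UTF-8 with
    errors="replace", so any unencodable character (e.g. a lone
    surrogate left over from a lenient decode) becomes '?'.
    """
    return html.encode("utf-8", errors="replace").decode("utf-8")

print(sanitize_body("a\udb9eb"))  # prints "a?b"
```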

FTR, response.css / response.xpath also don't work for this website.