Invalid HTML from extract_quotations.

Question

nixypanda opened this issue 7 years ago

Hi, I was testing talon with some inputs, and the following input:
<div dir="ltr">ha ha ha <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px"> lol</span></div><div class="gmail_extra"><br><div class="gmail_quote">2017-09-04 18:08 GMT+05:30 Sherub Thakur <span dir="ltr"><<a href="mailto:sherub.thakur@kayako.com" target="_blank">sherub.thakur@kayako.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px"> lol</span><br></div></blockquote></div><br></div>

does not produce the output I expected:
<html><head></head><body><div dir="ltr">ha ha ha <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px">Â lol</span></div><div class="gmail_extra"><br><br></div></body></html>

but instead produces this output:
<html><head></head><body><div dir="ltr">ha ha haÂ <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">Ñ�Ğ¹ Ñ‡ÑƒĞ²Ğ°Ğº, ĞºĞ°Ğº Ñ‚Ñ‹ </span>ğŸ˜�<span style="font-size:12.8px">Â lol</span></div><div class="gmail_extra"><br><br></div></body></html>

The non-ASCII characters (the Cyrillic text and the emoji) come back garbled, which looks wrong. Is there something I am doing wrong here?

This is how I am using it.
ohtml = quotations.extract_from_html(html).encode('utf-8')
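
For anyone trying to make sense of the garbled output, here is a minimal sketch of one way such text can arise: UTF-8 bytes get decoded with a single-byte code page. The helper name garble_utf8_as is hypothetical, and cp1254 is only a guess that happens to match the characters shown above. Nothing in this thread confirms that this is what talon or lxml does internally.

```python
# Assumption: the garbling comes from UTF-8 bytes being decoded with a
# single-byte code page. cp1254 is a guess that matches the output shown
# in this issue; it is not confirmed anywhere in the thread.


def garble_utf8_as(text, wrong_encoding="cp1254"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    raw = text.encode("utf-8")
    # errors="replace" turns bytes the codec cannot map into U+FFFD
    return raw.decode(wrong_encoding, errors="replace")


sample = "эй чувак 😁"
print(garble_utf8_as(sample))
```

If the printed text resembles the broken output above, the problem is probably a wrong encoding guess somewhere during parsing rather than the final .encode('utf-8') call.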

nixypanda · Answer 1 · Thu Dec 07 2017 10:31:56 GMT+0800 (China Standard Time)

#156 seems to have resolved the issue. I am unsure whether there is a setting in lxml's html5parser that achieves the same effect.