Invalid HTML from extract_quotations.
nixypanda opened this issue
Hi, I was testing talon with some inputs and found that the following input:
<div dir="ltr">ha ha ha <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px"> lol</span></div><div class="gmail_extra"><br><div class="gmail_quote">2017-09-04 18:08 GMT+05:30 Sherub Thakur <span dir="ltr"><<a href="mailto:sherub.thakur@kayako.com" target="_blank">sherub.thakur@kayako.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px"> lol</span><br></div></blockquote></div><br></div>
does not produce the output I expected:
<html><head></head><body><div dir="ltr">ha ha ha <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px"> lol</span></div><div class="gmail_extra"><br><br></div></body></html>
but instead produces this output:
<html><head></head><body><div dir="ltr">ha ha ha <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">�й чувак, как ты </span>�<span style="font-size:12.8px"> lol</span></div><div class="gmail_extra"><br><br></div></body></html>
This looks wrong: the first Cyrillic letter (э) and the emoji (😁) have each been replaced by a replacement character (�). Am I doing something wrong here?
This is how I am using it:
from talon import quotations
ohtml = quotations.extract_from_html(html).encode('utf-8')
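
To make the difference between the expected and actual output easier to report, here is a minimal sketch that lists which characters were turned into U+FFFD. It uses only the Python 3 standard library and does not call talon; the helper name `find_replaced` and the shortened sample strings (the visible text of the first div, with tags stripped) are my own assumptions for illustration, not part of talon.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def find_replaced(expected, actual):
    """Return (index, original_char, unicode_name) for every position where
    `actual` holds U+FFFD but `expected` holds a different character.

    Hypothetical helper: assumes both strings line up position by position,
    i.e. each lost character became exactly one U+FFFD.
    """
    hits = []
    for index, (exp_char, act_char) in enumerate(zip(expected, actual)):
        if act_char == REPLACEMENT and exp_char != REPLACEMENT:
            hits.append((index, exp_char, unicodedata.name(exp_char, "UNKNOWN")))
    return hits


# Visible text of the first <div>, as expected and as actually returned.
expected = "ha ha ha эй чувак, как ты 😁 lol"
actual = "ha ha ha \ufffdй чувак, как ты \ufffd lol"

for index, char, name in find_replaced(expected, actual):
    print(index, repr(char), name)
```

Running this on the two outputs above should flag only the leading Cyrillic letter and the emoji; the rest of the text survives intact.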