Can't extract content from Huffington Post (?)
jice-lavocat opened this issue
Jean-Christophe Lavocat commented
Hi all,
I ran into trouble today when I tried to extract the content from some Huffington Post articles.
Here are two URLs that don't return a 'cleaned_text' after extraction:
- http://www.huffingtonpost.com/ryan-scott/the-best-volunteer-progra_b_7566478.html?utm_hp_ref=small-business&ir=Small+Business
- http://www.huffingtonpost.com/steve-mariotti/how-to-get-from-here-to-t_b_7556636.html
I'll try to track down where this comes from, but I have no time at the moment.
Jean-Christophe Lavocat commented
I took a look at #224 (Goose fails on NYTimes) and applied the patch, but it doesn't fix the problem for HuffPost.
matt commented
Hi, I'm not sure if you're still having this problem, but I was able to resolve it with the following fix:
from goose import Goose
from cookielib import CookieJar
import urllib2
import zlib
...
other imports and code
...
url = "http://www.huffingtonpost.com/steve-mariotti/how-to-get-from-here-to-t_b_7556636.html"
g = Goose({'enable_image_fetching': False})  # optionally disable image fetching
yummy = CookieJar()
cookieSesh = urllib2.build_opener(urllib2.HTTPCookieProcessor(yummy))
raw_html = cookieSesh.open(url).read()  # HuffPo and several other sites return gzip-compressed bodies
raw_html = zlib.decompress(raw_html, 16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS tells zlib to expect a gzip header
b = g.extract(url=url, raw_html=raw_html)
print "{0} - {1}".format(b.title, b.cleaned_text)