grangier / python-goose

Html Content / Article Extractor, web scraping lib in Python


Can't extract content from Huffington Post (?)

jice-lavocat opened this issue · comments

Hi all,

I got into trouble today when I tried to extract the content from some Huffington Post articles.

Here are two URLs for which extraction returns no 'cleaned_text':

I'll try to work out where the problem comes from, but I have no time at the moment.

I took a look at #224 (Goose fails on NYTimes) and applied the patch, but it doesn't fix the Huffington Post case yet.
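
For reference, a minimal reproduction sketch of the failure (the original failing URLs were not included here, so the HuffPost URL from the fix below is used as a stand-in) shows extract() coming back with an empty cleaned_text:

        from goose import Goose

        url = "http://www.huffingtonpost.com/steve-mariotti/how-to-get-from-here-to-t_b_7556636.html"
        g = Goose()
        article = g.extract(url=url)
        print repr(article.cleaned_text)  # comes back empty -- no cleaned_text is extracted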

commented

Hi, I'm not sure if you're still having this problem, but I was able to resolve it using the following fix:

        from goose import Goose
        from cookielib import CookieJar
        import urllib2
        import zlib
        # ... other imports and code ...
        url = "http://www.huffingtonpost.com/steve-mariotti/how-to-get-from-here-to-t_b_7556636.html"
        g = Goose({'enable_image_fetching': False})  # Optionally disable image fetching
        # Fetch the page with an opener that keeps cookies across requests.
        yummy = CookieJar()
        cookieSesh = urllib2.build_opener(urllib2.HTTPCookieProcessor(yummy))
        raw_html = cookieSesh.open(url).read()  # HuffPo and several other sites return gzip-compressed bodies.
        raw_html = zlib.decompress(raw_html, 16 + zlib.MAX_WBITS)  # Decompress it (16 + MAX_WBITS selects gzip framing).
        # Hand the already-fetched, decompressed HTML to Goose instead of letting it fetch the URL itself.
        b = g.extract(url=url, raw_html=raw_html)
        print "{0} - {1}".format(b.title, b.cleaned_text)
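
An alternative to the manual cookie and zlib handling is to fetch the page with the requests library, which transparently decodes gzip responses before the HTML is handed to Goose. A sketch under that assumption (requests installed; not part of the fix above):

        import requests
        from goose import Goose

        url = "http://www.huffingtonpost.com/steve-mariotti/how-to-get-from-here-to-t_b_7556636.html"
        raw_html = requests.get(url).text  # requests decompresses gzip/deflate bodies automatically
        g = Goose({'enable_image_fetching': False})
        b = g.extract(url=url, raw_html=raw_html)
        print "{0} - {1}".format(b.title, b.cleaned_text)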