icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Fails on groups with adult content

RuinCakeLie opened this issue · comments

I'm trying to scrape a group (https://groups.google.com/forum/#!topic/3dprintertipstricksreviews/) that has the adult content flag turned on. Unfortunately, even when using cookies, all the escaped_fragment requests only return:

Adult Content Warning

The Group you selected has been identified by its owner as containing adult content.
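
For the record, the failure can be reproduced outside the crawler (which itself is a Bash/curl script). Below is a minimal Python sketch, assuming a `cookies.txt` exported from a logged-in browser session in Netscape format, the same format curl's `-b` option reads:

```python
# Minimal reproduction sketch (not the crawler's own code).
# Assumes "cookies.txt" was exported from a logged-in browser session
# in Netscape format -- the same format curl's -b option reads.
import http.cookiejar

import requests

GROUP = "3dprintertipstricksreviews"
URL = f"https://groups.google.com/forum/?_escaped_fragment_=forum/{GROUP}"

jar = http.cookiejar.MozillaCookieJar("cookies.txt")
jar.load(ignore_discard=True, ignore_expires=True)

resp = requests.get(URL, cookies=jar, timeout=30)
if "Adult Content Warning" in resp.text:
    print("Blocked: Google served the adult-content interstitial despite cookies.")
else:
    print(f"OK: received {len(resp.text)} bytes of topic listing.")
```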

Interesting. I will take a look. Thanks for reporting this.

Google returns empty content when _escaped_fragment_ is specified, e.g.

https://groups.google.com/forum/?_escaped_fragment_=forum/3dprintertipstricksreviews

This seems to go against the standard. We need a different way to retrieve data from Google. This is a real challenge!
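
For reference, the empty response is easy to confirm; a few lines of Python (or an equivalent `curl -s` call) will do:

```python
# Confirm the empty-body behaviour described above.
import requests

url = "https://groups.google.com/forum/?_escaped_fragment_=forum/3dprintertipstricksreviews"
resp = requests.get(url, timeout=30)
# At the time of this issue: HTTP 200 with a (near-)empty body.
print(resp.status_code, len(resp.text))
```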

Google hides most email headers from the raw message. A raw message isn't actually raw ;)

See also https://groups.google.com/forum/message/raw?msg=3dprintertipstricksreviews/LDFZVHeC8Uk/2D1YhGqGDQAJ

Date: Sun, 20 Mar 2016 06:28:20 -0700 (PDT)
From: Rich Webb <ml...@rawebb.net>
To: 
    "3D Printer Tips, Tricks and Reviews" <3dprintertips...@googlegroups.com>
Message-Id: <d7e58e48-c160-436e-8bdf-10d86a0dc170@googlegroups.com>
Subject: Direction-dependent extrusion volume / track width?
MIME-Version: 1.0
Content-Type: multipart/mixed; 
    boundary="----=_Part_4198_351838098.1458480500604"

------=_Part_4198_351838098.1458480500604
Content-Type: multipart/alternative; 
    boundary="----=_Part_4199_1172380407.1458480500604"

------=_Part_4199_1172380407.1458480500604
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
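
To see exactly which headers survive, one can feed the raw download to a standard MIME parser. Here is a short Python sketch; the raw URL is the one linked above, everything else is illustrative:

```python
# List the headers Google keeps on a "raw" message. In the sample above,
# only Date, From, To, Message-Id, Subject and the MIME headers remain;
# transport headers (Received, Return-Path, DKIM-Signature, ...) are gone.
import email.parser

import requests

RAW_URL = ("https://groups.google.com/forum/message/raw"
           "?msg=3dprintertipstricksreviews/LDFZVHeC8Uk/2D1YhGqGDQAJ")

resp = requests.get(RAW_URL, timeout=30)
msg = email.parser.Parser().parsestr(resp.text, headersonly=True)
for name, value in msg.items():
    print(f"{name}: {value}")
```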

It's impossible to use the traditional method to fetch data from this group. We need to use a higher-level tool like phantomjs.

Well, after days of experimenting with the scrolling method, I've finally found a way to automate the process. There are two other challenges, but they're definitely solvable.

Stay tuned!
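
The thread doesn't show the actual implementation, but the scrolling method boils down to something like the following Selenium sketch. Everything here is an assumption except the group URL; in particular, Google Groups may scroll an inner container rather than the document body, so the scroll target may need adjusting:

```python
# Illustrative sketch of the scrolling method, not the crawler's real code.
import time

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://groups.google.com/forum/#!forum/3dprintertipstricksreviews")
time.sleep(5)  # let the initial topic list render

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude fixed wait: one reason the approach is slow and flaky
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new topics loaded; assume we've reached the end
    last_height = new_height

html = driver.page_source  # full topic list, ready to be parsed offline
driver.quit()
```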

I have some initial work on this issue, but (1) it's slow and (2) it's non-deterministic. Maybe I am not good at Selenium.

I'm hoping someone can help. I can raise a small fund to support you.

Thanks a lot