icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Deleted message: "500 Internal Server Error"

tschwinge opened this issue:

During the "wget" stage, I noticed that a "500 Internal Server Error" makes wget exit with code 8:

--2019-09-17 22:44:27--  https://groups.google.com/forum/message/raw?msg=polly-dev/ExZHA5VptKQ/B0aFnxg2uDEJ
Resolving groups.google.com (groups.google.com)... 2a00:1450:400c:c09::64, 64.233.166.102, 64.233.166.138, ...
Connecting to groups.google.com (groups.google.com)|2a00:1450:400c:c09::64|:443... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2019-09-17 22:44:27 ERROR 500: Internal Server Error.

This leaves behind an invalid zero-size file:

$ wc ./polly-dev/mbox/m.ExZHA5VptKQ.B0aFnxg2uDEJ
0 0 0 ./polly-dev/mbox/m.ExZHA5VptKQ.B0aFnxg2uDEJ

That's https://groups.google.com/d/msg/polly-dev/ExZHA5VptKQ/B0aFnxg2uDEJ, and that one states "Diese Nachricht wurde gelöscht" ("This message was deleted").

So, a (permanent) "500 Internal Server Error" is Google Groups' way to signal that a message has been deleted?
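
If that is the case, the download step could at least tolerate the error instead of leaving an empty file behind. The following is a minimal sketch, not the crawler's actual code: `fetch_raw_message` is a hypothetical wrapper name, and it assumes the message is fetched with a plain `wget -O`. wget documents exit code 8 as "Server issued an error response", so this cannot tell a deleted message apart from any other server error.

```sh
# Hypothetical wrapper (not part of google-group-crawler).
# $1: raw message URL, $2: output file
fetch_raw_message() {
  fetch_url=$1
  fetch_out=$2
  fetch_status=0
  wget -q -O "$fetch_out" "$fetch_url" || fetch_status=$?

  # Exit code 8 means the server answered with an error (for example 500).
  # If nothing was written, drop the zero-size file and carry on.
  if [ "$fetch_status" -eq 8 ] && [ ! -s "$fetch_out" ]; then
    rm -f "$fetch_out"
    echo "skipped (server error, possibly deleted): $fetch_url" >&2
    return 0
  fi
  return "$fetch_status"
}
```

Any other non-zero exit code is passed through unchanged, so network failures still stop the run.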

Can we detect this during the "crawler" stage, and avoid trying to download such messages?

Looking at wget -O - 'https://groups.google.com/forum/?_escaped_fragment_=topic/polly-dev/ExZHA5VptKQ' | less, we see:

[...]
<br>Thanks again,
<br>Tobi
<br>
<br></div></div></div></td></tr> <tr><td class="subject"><a href="https://groups.google.com/d/msg/polly-dev/ExZHA5VptKQ/B0aFnxg2uDEJ" title=""></a></td>
<td class="author"><span>unk...@googlegroups.com</span></td>
<td class="lastPostDate">09.02.14 19:56</td>
<td class="snippet"><i>&lt;Diese Nachricht wurde gelöscht.&gt;</i></td></tr> <tr><td class="subject"><a href="https://groups.google.com/d/msg/polly-dev/ExZHA5VptKQ/KffArfskd8YJ" title="Re: PR17159 - &quot;Can not handle PHI node outside!&quot;">Re: PR17159 - &quot;Can not handle PHI node outside!&quot;</a></td>
<td class="author"><span>Tobias</span></td>
<td class="lastPostDate">09.02.14 21:01</td>
[...]

Unfortunately, the link for the deleted message (polly-dev/ExZHA5VptKQ/B0aFnxg2uDEJ) is not on the same line as the "Diese Nachricht wurde gelöscht" note. So we'd need some kind of state machine to parse that, instead of the current simple grep '^https://'.
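
Such a state machine could be sketched in awk roughly as follows. This is an illustration under assumptions, not a tested patch: `list_deleted_messages` is a hypothetical name, the page layout is assumed to match the excerpt above (the snippet cell starts a line and the next message's subject link may follow on the same line), and a deleted message is assumed to be a snippet cell that consists only of italic text in escaped angle brackets, whatever the language of the note.

```sh
# Hypothetical helper: reads a topic page on stdin and prints
# the links of messages whose snippet looks like a deletion note.
list_deleted_messages() {
  awk '
    # State: last_url holds the link of the most recent subject cell.
    # A snippet cell belongs to that link, so check it first.
    /^<td class="snippet"><i>&lt;[^<]*&gt;<\/i><\/td>/ {
      if (last_url != "") print last_url
    }
    # A subject cell (possibly later on the same line) starts the next message.
    match($0, /<td class="subject"><a href="[^"]*"/) {
      last_url = substr($0, RSTART, RLENGTH)
      sub(/^.*href="/, "", last_url)
      sub(/"$/, "", last_url)
    }
  '
}
```

It would be fed the same page as above, for example `wget -O - 'https://groups.google.com/forum/?_escaped_fragment_=topic/polly-dev/ExZHA5VptKQ' | list_deleted_messages`. A legitimate message whose whole snippet happens to be italic text in angle brackets would be a false positive.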

Or, is the title="" sufficient to detect that a message has been deleted? At least for this "polly-dev" group, this indeed only appears with the deleted message; every other message has a title corresponding to the message subject.

Would you accept a patch to change the grepping in that way, or is that deemed too unreliable?


By the way, specifying wget --header='Accept-Language: en' does not seem sufficient to consistently get that note returned as "This message has been deleted" instead of "Diese Nachricht wurde gelöscht" (in my case). Strange?

> Unfortunately, the link for the deleted message (polly-dev/ExZHA5VptKQ/B0aFnxg2uDEJ) is not on the same line as the "Diese Nachricht wurde gelöscht" note. So we'd need some kind of state machine to parse that, instead of the current simple grep '^https://'.
>
> Or, is the title="" sufficient to detect that a message has been deleted? At least for this "polly-dev" group, this indeed only appears with the deleted message; every other message has a title corresponding to the message subject.

I was looking at the wrong files. There are legitimate messages with title="", for example 7gf3qJET-SEJ in https://groups.google.com/forum/?_escaped_fragment_=topic/polly-dev/3RrIOur9vLM.

I'll work on a different solution.

> Can we detect this during the "crawler" stage, and avoid trying to download such messages?

You can use a hook, as in my example at https://github.com/icy/google-group-crawler#the-hook, but that requires some small effort.

You can also scan the output directory for empty files and remove them. This is often easier.
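
A minimal sketch of that cleanup, assuming the messages live under a directory such as ./polly-dev/mbox as in the report above. `remove_empty_messages` is a hypothetical helper name, not something the crawler provides.

```sh
# Hypothetical helper: list and delete zero-size files below a directory.
# $1: the mbox directory of a group, e.g. ./polly-dev/mbox
remove_empty_messages() {
  # -size 0 matches only empty regular files; -print shows what is removed.
  find "$1" -type f -size 0 -print -exec rm -f {} +
}
```

For example, `remove_empty_messages ./polly-dev/mbox` prints each empty file it removes and leaves all other files untouched.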

I'm closing this ticket as there is a work-around with the hook. Feel free to re-open the ticket and/or submit a PR if you think it's good to have the deletion feature enabled by default. Thanks a lot.