icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Problems with exported email address.

steinwaywhw opened this issue

Hi @icy, thanks for this awesome tool.

I noticed that the messages it exports (from a public group ats-session-types) only contain emails like this: <stei...@gmail.com>, not full email addresses.

I first tried the same request in Chrome, no full email.
I then tried the request with a suffix &authuser=0 in Chrome and it then shows the full email address.
I then tried the request with a suffix &authuser=0 in Chrome without logging in, no full email address.

I then thought it might be a cookie problem, so I exported my cookies to use with wget, plus the &authuser=0 suffix. It still didn't work. I tried curl with cookies too, and that didn't work either.

Do you have any experience with such things?

Attached is the URL I requested.

https://groups.google.com/forum/message/raw?msg=ats-session-types/1qFgIUe0rww/1U_5LsjTAwAJ&authuser=0

Hi @icy, I used https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en to export my cookies from Chrome, and specified the wget options as instructed. I tried different user agent strings to match my Chrome. I also tried --load-cookies blabla as shown in the man page, and --load-cookies=blabla as shown in --help, and I even tried curl with the -b option. None of them worked. I have no idea what's going wrong. :(
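One way to check what an exported cookie file would actually send is to parse it by hand. The following is a minimal Ruby sketch, not part of the crawler: the names `cookie_header` and `fetch_with_cookies` are hypothetical, and it assumes the export is in the tab-separated Netscape cookies.txt format that wget and curl read. It ignores cookie paths, the secure flag, and expiry, and it does not by itself explain why the requests in this thread failed.

```ruby
require "net/http"
require "uri"

# Build a Cookie header value from the text of a Netscape-format cookies.txt,
# keeping only cookies whose domain covers the given host.
# Simplification (assumption): path, secure flag and expiry are ignored.
def cookie_header(cookies_txt, host)
  pairs = cookies_txt.each_line.map do |line|
    # curl marks HttpOnly cookies with a "#HttpOnly_" prefix; keep those.
    line = line.chomp.sub(/\A#HttpOnly_/, "")
    next if line.empty? || line.start_with?("#")

    # Fields: domain, subdomain flag, path, secure, expiry, name, value.
    domain, _subdomains, _path, _secure, _expires, name, value = line.split("\t", 7)
    next if name.nil? || value.nil?

    bare = domain.sub(/\A\./, "")
    next unless host == bare || host.end_with?(".#{bare}")

    "#{name}=#{value}"
  end
  pairs.compact.join("; ")
end

# Fetch a URL with the matching cookies attached.
# Note: this performs a real network request.
def fetch_with_cookies(url, cookies_txt)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request["Cookie"] = cookie_header(cookies_txt, uri.host)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
end
```

Printing `cookie_header(File.read("cookies.txt"), "groups.google.com")` shows whether any cookies for the Google domain survived the export. An empty string would suggest a problem with the file rather than with wget or curl.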

Hi @steinwaywhw,

I looked into it and have some bad news: Google will not expose the original email address in any case unless you're a member and a manager of the group. Google changed this behavior after my latest release of the script.

I'm so sorry but this is a Google problem. I will update the README to avoid any future confusion.

Thanks,

Hi @icy, sorry, I forgot to mention that I am actually the owner of that Google group. It works in my browser (with a valid cookie, I guess), but not from your wget scripts, even though the requests come from the same IP (my own machine). I think either something is wrong with the cookie, or they have some way to distinguish a real browser from other clients like curl and wget.

But anyway, I exported my user list and ran a separate Ruby script to clean up the "..." in the email addresses. It worked well enough for me. The actual script is at https://glot.io/snippets/efyndm7qs5 and I hope it helps. The "load_users" and "match_user" methods do the actual work. You can have a look.
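The idea of matching truncated addresses against an exported member list can be sketched roughly as follows. This is not the script linked above: only the method names `load_users` and `match_user` come from the thread, and the bodies are assumptions. In particular, it assumes the member list has one full address per line and that the truncated form keeps a prefix of the local part followed by `...@` and the unchanged domain.

```ruby
# Hypothetical sketch: restore addresses truncated to the form
# "stei...@gmail.com" by looking them up in an exported member list.

# Read one full email address per line (assumed file format).
def load_users(path)
  File.readlines(path, chomp: true).map(&:strip).reject(&:empty?)
end

# Return the single full address that fits the truncated one, or nil
# when the input is not truncated, nothing fits, or the match is ambiguous.
def match_user(truncated, users)
  prefix, domain = truncated.split("...@", 2)
  return nil if domain.nil?

  candidates = users.select do |user|
    local, user_domain = user.split("@", 2)
    next false if user_domain.nil?

    user_domain.casecmp?(domain) && local.downcase.start_with?(prefix.downcase)
  end
  candidates.size == 1 ? candidates.first : nil
end
```

Returning nil for ambiguous matches is a deliberate choice in this sketch: two members sharing a prefix and a domain cannot be told apart from the truncated form alone, so those cases are better left for manual review than guessed.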

Hi @steinwaywhw,

Thanks a lot. The script is very useful and I will add it to the contrib/ directory.

A minor note: the cookie doesn't affect messages that have already been downloaded, so you need to clean up your local directory after you set up the cookie data.

There may be something wrong with the cookie handling, and I should have written a test mechanism for it. I will add that soon.

Thanks again for your patience and the script.

Hi @icy, I'm very thankful for your GG export script, which is used extensively for Google Group to Discourse migrations.

I do however need to point out that the externally contributed code in https://github.com/icy/google-group-crawler/blob/master/contrib/fix_dot_in_users_emails.rb is almost entirely code taken from the Discourse open source forum project (link). You might need to amend the author and attribution details accordingly, change the licensing details, or consider removing the code from your repo.

Thanks again for icy/google-group-crawler and hope you don't mind me pointing out the above.

Hmm, right. Seems appropriate to include @eviltrout and @riking in the attribution note. As for the license, I think it'd suffice to include the GPL v2 license inline, just for that specific file? And maybe include a note in the README.md about this exception.

All of that being said, I don't think there's any reason why any of the Discourse import scripts have to be GPL v2 licensed, so I'll look into the possibility of changing these to the MIT license so this type of code sharing will be easier in the future.

There is also, I have just noticed, a fair bit of my googlegroups.rb importer in there, also GPLv2, also unattributed.

Hi @pacharanero and @erlend-sh,

Thanks a ton for your feedback. It was definitely my mistake not to provide enough information about the script. I'm going to fix that and will keep you posted.

Have a great day.

Sorry for my belated response. I will update the script information today. Thanks for your patience.