icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

text formatting/encoding issue turning = signs to =3D

spacewaffle opened this issue

I got a number of warnings like the ones below during my scrape. I could be wrong, but it looks like '=' signs are being replaced with their ASCII hex codes ('=3D') in the strings.

WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;margin-bottom:.0001pt;tex='
WARNING: Could not parse (and so ignoring) 'Register for our November 29 internship event at <a href=3D"http://w='
WARNING: Could not parse (and so ignoring) '<p class=3D"MsoNormal" style=3D"margin-bottom:0in;margin-bottom:.0001pt;tex='
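
For context, the '=3D' sequences and the lines ending in a bare '=' match what quoted-printable encoding produces: '=' is escaped as '=3D', and a trailing '=' marks a soft line break. A minimal Ruby sketch, assuming the message bodies really are quoted-printable; the sample string and the helper name decode_quoted_printable are made up for illustration and are not taken from the archive or the scraper:

```ruby
# Hypothetical sample modeled on the warnings above (not real archive data).
# The "=" right before the newline is a quoted-printable soft line break.
encoded = "<p class=3D\"MsoNormal\" style=3D\"margin-bottom:0in;tex=\nt-align:left\">Hello</p>"

# String#unpack with the "M" directive decodes quoted-printable text:
# "=3D" becomes "=", and a soft line break ("=" followed by a newline) is dropped.
def decode_quoted_printable(text)
  text.unpack("M").first
end

decoded = decode_quoted_printable(encoded)
puts decoded
```

If this assumption holds, the '=3D' text is an undecoded transfer encoding rather than corruption introduced during the scrape.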

@spacewaffle Would you mind sharing your group name (in case it's public)? If not, I'd like to see more logs and details (e.g., at which step do you get this warning?).

Thanks

Unfortunately the group isn't public, but I'm trying to import the group's data into Discourse. Here's some more logging:

/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/utilities.rb:239:in `to_crlf': Interrupt
/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:1998:in `raw_source='
/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:2121:in `init_with_string'
/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/message.rb:129:in `initialize'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:51:in `new'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:51:in `new'
/.rvm/gems/ruby-2.0.0-p648/gems/mail-2.6.4/lib/mail/mail.rb:188:in `read_from_string'
/sites/discourse/script/import_scripts/mbox.rb:86:in `block in all_messages'
/sites/discourse/script/import_scripts/mbox.rb:68:in `each'
/sites/discourse/script/import_scripts/mbox.rb:68:in `each_with_index'
/sites/discourse/script/import_scripts/mbox.rb:68:in `all_messages'
/sites/discourse/script/import_scripts/mbox.rb:190:in `create_email_indices'
/sites/discourse/script/import_scripts/mbox.rb:23:in `execute'
from googlegroups.rb:49:in `execute'
/sites/discourse/script/import_scripts/base.rb:45:in `perform'
from googlegroups.rb:92:in ''

This is part of a project to migrate Google Groups to Discourse. I'm following the instructions here:
https://meta.discourse.org/t/migration-of-google-groups-to-discourse/48012

I'm not the author of mbox.rb. Where did you get that script? The problem is probably in how that script handles encoding.

I added the link at the bottom of my previous post, but here it is again for you to look at. Your scraper is a dependency of the import, so I figured the problem might be in the scraper rather than the script.
https://meta.discourse.org/t/migration-of-google-groups-to-discourse/48012

Link to the script in the repo:
https://github.com/pacharanero/discourse/blob/master/script/import_scripts/mbox.rb

Thank you @spacewaffle. How about adding an encoding directive (magic comment) at the top of mbox.rb?

Current version

require 'sqlite3'
require File.expand_path(File.dirname(__FILE__) + "/base.rb")

New version / instruction

#!/usr/bin/env ruby
# encoding: utf-8

require 'sqlite3'
require File.expand_path(File.dirname(__FILE__) + "/base.rb")
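
To check whether the directive has any effect on a given setup, the encodings in play can be printed. A minimal sketch; the helper name encoding_report is made up for illustration. As far as I know, the magic comment only sets the encoding of string literals in that one source file, while data read from the mbox files follows Encoding.default_external or whatever encoding is passed when the file is opened.

```ruby
# encoding: utf-8

# Collect the encodings in effect for this file and for the running process.
def encoding_report
  {
    source: __ENCODING__,                         # source encoding of this file
    default_external: Encoding.default_external,  # default for data read from files
    literal: "abc".encoding                       # string literals take the source encoding
  }
end

encoding_report.each { |name, enc| puts "#{name}: #{enc}" }
```

If source and default_external differ, strings read from disk and literals in the script may end up with different encodings.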

I believe I fixed this issue by upgrading our Ruby version to 2.3, which uses UTF-8 by default.

Great to hear that. Thanks for your feedback.