activewarehouse / activewarehouse-etl

Extract-Transform-Load library from ActiveWarehouse

`sub!': invalid byte sequence in UTF-8

epinault opened this issue · comments

I am using Ruby 1.9.3, and with some of my files I get the following error:

`sub!': invalid byte sequence in UTF-8

One way to fix it is to force the options on line 38 of csv_parser.rb to use encoding: "ISO8859-1". Here is the backtrace:

from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1855:in `block in shift'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1849:in `loop'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1849:in `shift'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1791:in `each'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1208:in `block in foreach'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1354:in `open'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1207:in `foreach'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/parser/csv_parser.rb:38:in `block in each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/parser/csv_parser.rb:30:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/parser/csv_parser.rb:30:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/control/source/file_source.rb:45:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:333:in `each_with_index'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:333:in `block in process_control'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:327:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:327:in `process_control'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:275:in `process'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:272:in `process'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:55:in `process'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:82:in `block in execute'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:80:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:80:in `execute'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:90:in `<top (required)>'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:251:in `require'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:251:in `block in require'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:236:in `load_dependency'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:251:in `require'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/bin/etl:28:in `<top (required)>'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/bin/etl:19:in `load'

Never mind for now... I forced MySQL to export some of the CSV files as UTF-8, and that fixed the issue.

My understanding is that your file is encoded in ISO-8859-1 and that you work by default with UTF-8, is that right?

Based on the Ruby 1.9 CSV documentation, you will have to provide an :encoding option to tell the parser that the source is in ISO-8859-1, or modify Encoding::default_external (but that is a global setting affecting all your reads).
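
For reference, here is a minimal sketch of the :encoding option using Ruby's standard CSV library directly, outside of activewarehouse-etl. The sample data and the temp file are made up for illustration; the "ISO-8859-1:UTF-8" value asks Ruby to read the file as ISO-8859-1 and transcode each field to UTF-8.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample data: "café" encoded as ISO-8859-1 (0xE9 is "é").
latin1_bytes = "name,city\ncaf\xE9,Paris\n".b

file = Tempfile.new(["latin1", ".csv"])
file.binmode
file.write(latin1_bytes)
file.close

rows = []
# Read as ISO-8859-1, transcode to UTF-8 while parsing.
CSV.foreach(file.path, encoding: "ISO-8859-1:UTF-8") do |row|
  rows << row
end

puts rows.inspect
puts rows[1][0].encoding
```

Without the :encoding option, the same file would be read as UTF-8 (assuming that is the default external encoding), and the stray 0xE9 byte is what triggers the "invalid byte sequence" error.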

You should be able to pass the :encoding option without having to hack the source code (the options are propagated from the DSL to the line you pointed to, if I'm right).

Alternatively you may want to preprocess the file if you prefer (I tend to do that in a first pass).
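
A preprocessing pass could look roughly like this. It is only a sketch: transcode_to_utf8 is a hypothetical helper name, the file names are invented, and it assumes you already know the source encoding (ISO-8859-1 here) and that the file fits in memory.

```ruby
require "tmpdir"

# Hypothetical helper: rewrite a file from a known source encoding to UTF-8.
def transcode_to_utf8(src_path, dest_path, source_encoding = "ISO-8859-1")
  # Read the raw bytes, then tag them with the encoding they really use.
  text = File.read(src_path, mode: "rb").force_encoding(source_encoding)
  # Convert to UTF-8 and write the bytes out unchanged.
  File.write(dest_path, text.encode("UTF-8"), mode: "wb")
  dest_path
end

Dir.mktmpdir do |dir|
  src  = File.join(dir, "export_latin1.csv")  # made-up file names
  dest = File.join(dir, "export_utf8.csv")
  File.binwrite(src, "name\ncaf\xE9\n".b)      # "café" in ISO-8859-1

  transcode_to_utf8(src, dest)

  converted = File.read(dest, encoding: "UTF-8")
  puts converted.valid_encoding?
end
```

The ETL source can then point at the converted file and keep the default UTF-8 settings.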

Can you check if passing the :encoding option works for you? If it works, we'll close this issue and open a documentation issue instead, since this will certainly become a FAQ.

I missed your comment while writing mine! Ok - I'll close this one (but it probably needs some documentation here).

Yes! Adding this to the docs would help for sure :) That would be nice to know for the future :)