activewarehouse / activewarehouse-etl

Extract-Transform-Load library from ActiveWarehouse

ruby 1.9.3 mysqlstream - invalid byte sequence in UTF-8

sgrgic opened this issue

Hi,

We got this error after switching to ruby-1.9.3-p125. The error is in line 55 of mysql_streamer.rb:
line = line.gsub("\n","")
It's weird because with the same data I can't reproduce the problem in my local environment, yet in production
we get the exception.
From the exception log:
/lib/etl/control/source/mysql_streamer.rb:56:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
I'm trying to catch the exception in production and get the string that is causing it, but so far no luck.
Maybe you have dealt with this problem already.

Regards,
Sinisa.

Ok, in case someone looks into this: I just located the row which raises the exception. It contains this word:
lógica
When I replace ó with o, the ctl job finishes without an exception. Now I need to figure out how mysqlstream should
read this kind of data.

Sinisa.
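
For anyone hitting the same error, here is a minimal sketch of what is probably going on, assuming the row arrives as Latin-1 bytes while Ruby tags the string as UTF-8. The helper name `strip_newlines_safely` and the ISO-8859-1 fallback are hypothetical and not part of mysql_streamer.rb:

```ruby
# Hypothetical helper (not in activewarehouse-etl): strip newlines from a line
# whose bytes may not be valid UTF-8.
# Assumption: invalid lines are really ISO-8859-1 (Latin-1) data.
def strip_newlines_safely(line, fallback_encoding = "ISO-8859-1")
  unless line.valid_encoding?
    # Re-tag the raw bytes with the assumed encoding, then transcode to UTF-8.
    line = line.dup.force_encoding(fallback_encoding).encode("UTF-8")
  end
  line.gsub("\n", "")
end

# 0xF3 is "ó" in Latin-1, but on its own it is not a valid UTF-8 sequence.
raw = "l\xF3gica\n".force_encoding("UTF-8")
raw_is_valid = raw.valid_encoding?
clean = strip_newlines_safely(raw)
```

This only treats the symptom on the Ruby side. If the guess about the encoding is wrong, the transcoded text will be wrong too, so getting the mysql client to emit UTF-8 is the safer route.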

One solution for this problem: in /lib/etl/control/source/mysql_streamer.rb, line 53:
mysql_command = """mysql --quick -h #{host} -u #{username} -e \"#{@query.gsub("\n","")}\" -D #{database} --password=#{password} -B"""
replace with:
query_utf8 = "SET CHARACTER SET 'utf8'; " + @query.gsub("\n","")
mysql_command = """mysql --quick -h #{host} -u #{username} -e \"#{query_utf8}\" -D #{database} --password=#{password} -B"""
I hope there are no side effects; so far I don't see any, but let's double-check with you.

Thanks,
Sinisa.

The same thing using a command-line option:

mysql_command = """mysql --quick -h #{host} -u #{username} -e \"#{@query.gsub("\n","")}\" -D #{database} --password=#{password} -B --default-character-set=utf8"""

Sinisa.

@sgrgic sorry - I missed your earlier comments! It is probably a MySQL setup thing, and your fix is probably what should be done. You could check by adding data with accents in MySQL and seeing what comes out.

Historically, MySQL defaulted to LATIN1 rather than UTF-8, and it's fairly common to see data that is actually UTF-8 stored as what MySQL believes to be LATIN1 (first issue), or simply to have the client set up for LATIN1 (second issue).
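
To illustrate the first issue, here is a small Ruby sketch of what happens when UTF-8 bytes are read as Latin-1. It uses Ruby's ISO-8859-1 as a stand-in for MySQL's latin1, which is only an approximation, and the variable names are purely illustrative:

```ruby
# "lógica" as UTF-8: the "ó" is stored as the two bytes 0xC3 0xB3.
original = "l\u00F3gica"

# Misread those bytes as Latin-1 and transcode them to UTF-8:
# each byte becomes its own character, giving "lÃ³gica".
misread = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# The damage is reversible as long as no bytes were lost on the way:
# go back to the Latin-1 bytes and re-tag them as UTF-8.
repaired = misread.encode("ISO-8859-1").force_encoding("UTF-8")
```

Seeing "Ã³" where "ó" should be is the usual sign of this kind of double encoding.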

Can you (out of curiosity) paste the output of the queries described there? http://stackoverflow.com/a/1049776/20302


Sure, here is output:
mysql> show variables like "character_set_database";
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | utf8  |
+------------------------+-------+
1 row in set (0.00 sec)

mysql> show variables like "collation_database";
+--------------------+-----------------+
| Variable_name      | Value           |
+--------------------+-----------------+
| collation_database | utf8_general_ci |
+--------------------+-----------------+
1 row in set (0.00 sec)

@sgrgic looks like you're safe then :) If you want to go further, you could check your my.cnf as advised here: http://stackoverflow.com/a/3513812/20302 (my bet is that some non-utf8 default may show up there).

It's perfectly ok to just pass the default-character-set like you did on the command line IMO.

@sgrgic can I close this one if you're ok with it?

I also opened #91 to track adding a clean way to pass extra args here.

I will definitely merge a pull-request for that if you want to tackle this! (otherwise I'll do it myself, but not right now though).
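
One possible shape for passing extra args, as a rough sketch only. The method name `build_mysql_args`, the `config` hash and the `extra_args` parameter are all hypothetical and do not reflect the current mysql_streamer.rb or whatever #91 ends up with:

```ruby
# Hypothetical sketch: build the mysql invocation as an argument array so that
# callers can append extra client flags.
# The default extra flag is the one discussed in this thread.
def build_mysql_args(config, query, extra_args = ["--default-character-set=utf8"])
  [
    "mysql", "--quick",
    "-h", config[:host],
    "-u", config[:username],
    "-e", query.gsub("\n", ""),       # same newline stripping as the original code
    "-D", config[:database],
    "--password=#{config[:password]}",
    "-B"
  ] + extra_args
end

args = build_mysql_args(
  { :host => "localhost", :username => "etl", :password => "secret", :database => "dw" },
  "SELECT 1\n"
)
```

An array like this could be handed to `IO.popen(args)` directly, which also avoids having to quote the query for the shell.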

Sure, we can close this. So far this looks good in our ctl jobs. If I notice something wrong, I will let you know.
Yeah, please add this fix when you find some time. We have our own branch of aw-etl where we add some stuff.
We can discuss this on Google Groups sometime and maybe merge those changes.

Thanks,
Sinisa.

Ok! And sure, please drop a line on the google group so we can discuss what could be merged back. Closing!