grosser / pru

Pipeable Ruby - forget about grep / sed / awk / wc ... use pure, readable Ruby!

encoding issues

markburns opened this issue · comments

If I use a file with shift-jis encoding as input to pru I get an encoding error.

invalid byte sequence in UTF-8 (ArgumentError)

Could we have a way to specify encoding?

Something like --encoding XYZ ?
If you give me a file with this encoding, I'll see what I can do.

Sounds good, thanks. I wasn't sure how to attach a file, or whether I can.
To download the example, run:
curl http://mark.dycept.com/euc_jp_example.txt > euc_jp_example.txt
When encoded correctly you should see:

刖 [げつ] /(n) (arch) (obsc) (See 剕) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/

Otherwise you might see:
??? [????] /(n) (arch) (obsc) (See ???) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/

curl http://mark.dycept.com/euc_jp_example.txt > euc_jp_example.txt

cat euc_jp_example.txt | pru '/obsc/'


/Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:22:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:22:in `block in map'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:15:in `each_line'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:15:in `map'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/bin/pru:72:in `<top (required)>'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/bin/pru:19:in `load'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/bin/pru:19:in `<main>'
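
The trace above can be reproduced without pru. The following is a minimal sketch, assuming the input lines end up labeled as UTF-8 while actually holding EUC-JP bytes; the sample string and variable names are made up for illustration and are not taken from pru's source.

```ruby
# "\u3052\u3064" is the reading "げつ" from the example line.
utf8 = "\u3052\u3064 (obsc)"

# The same characters as they would be stored in an EUC-JP file.
euc_jp = utf8.encode("EUC-JP")

# Assumption: the reader labels the bytes as UTF-8 without converting them.
mislabeled = euc_jp.dup.force_encoding("UTF-8")

puts mislabeled.valid_encoding?

# Matching a regexp against a string with a broken encoding raises.
error = nil
begin
  matched = mislabeled =~ /obsc/
rescue ArgumentError => e
  error = e
end

puts error.message if error
```

The bytes are unchanged by `force_encoding`; only the label differs, which is why the match fails even though the pattern itself is plain ASCII.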

Sorry, that is not Shift-JIS encoding; my mistake, it's EUC-JP. I'll update it to reflect that.

OK done.

Actually, I wonder if it's a problem with the shell and/or the cat command.

cat euc_jp_example.txt | grep "げつ"   #matches nothing

cat euc_jp_example.txt | grep "obsc"
??? [????] /(n) (arch) (obsc) (See ???) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/
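
One possible explanation for the grep result, sketched in Ruby: if the terminal sends the pattern as UTF-8 (an assumption about the shell setup, not something stated in this thread), its bytes differ from the EUC-JP bytes in the file, so a byte-wise search cannot match. The ASCII pattern "obsc" is encoded identically in both, which is why it still matches.

```ruby
# "げつ" as the shell would pass it to grep, assuming a UTF-8 terminal.
pattern = "\u3052\u3064"

# The same characters as stored in the EUC-JP file.
in_file = pattern.encode("EUC-JP")

utf8_bytes = pattern.bytes
euc_bytes = in_file.bytes

puts utf8_bytes.map { |b| format("%02X", b) }.join(" ")
puts euc_bytes.map { |b| format("%02X", b) }.join(" ")

# Compare raw bytes, the way a byte-oriented tool sees them.
haystack = in_file.dup.force_encoding("BINARY")
needle = pattern.dup.force_encoding("BINARY")
found = haystack.include?(needle)
puts found
```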

I am not sure there is a sane way to support this, since it would mean detecting both the input encoding and the command encoding and then magically matching them, or something like that... :(
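
If detection is off the table, the explicit route requested earlier in the thread could look roughly like this. It is only a sketch: `to_utf8` is a hypothetical helper, pru has no such function or `--encoding` option as far as this thread shows, and the caller is assumed to know the source encoding.

```ruby
# Hypothetical helper: relabel raw bytes with the encoding named by the
# user, then transcode to UTF-8 so patterns written in UTF-8 can match.
def to_utf8(line, source_encoding)
  line.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Stand-in for one line of raw bytes read from STDIN.
raw = "\u3052\u3064 (obsc)\n".encode("EUC-JP").force_encoding("BINARY")

line = to_utf8(raw, "EUC-JP")

puts line.valid_encoding?
puts !(line =~ /obsc/).nil?
puts line.include?("\u3052\u3064")
```

With the wrong encoding name, `encode` may raise `Encoding::InvalidByteSequenceError` or silently produce the wrong characters, so this moves the burden to the user instead of removing it.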