encoding issues
markburns opened this issue · comments
If I use a file with shift-jis encoding as input to pru I get an encoding error.
invalid byte sequence in UTF-8 (ArgumentError)
Could we have a way to specify encoding?
Something like --encoding XYZ?
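To make the request concrete, here is a minimal Ruby sketch of what such an option might do internally. The `--encoding` flag and the `utf8_lines` helper are hypothetical and not part of pru. The sketch only assumes that the caller knows the input encoding and wants each line transcoded to UTF-8 before any user code touches it.

```ruby
require "stringio"

# Hypothetical helper (not part of pru): read lines from +io+, interpret the
# raw bytes as +source_encoding+, and return them transcoded to UTF-8.
def utf8_lines(io, source_encoding)
  io.each_line.map do |line|
    line.force_encoding(source_encoding).encode("UTF-8")
  end
end

# Usage: simulate EUC-JP input arriving on a pipe.
input = StringIO.new("\u3052\u3064 (obsc)\n".encode("EUC-JP"))
lines = utf8_lines(input, "EUC-JP")
puts lines.grep(/obsc/)
```

Once the lines are valid UTF-8, regexp matches such as `/obsc/` can run without the encoding check tripping.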
If you give me a file with this encoding, I'll see what I can do.
Sounds good, thanks. OK, I wasn't sure how to attach a file, or whether I can.
Do:
curl http://mark.dycept.com/euc_jp_example.txt > euc_jp_example.txt
When encoded correctly you should see:
刖 [げつ] /(n) (arch) (obsc) (See 剕) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/
Otherwise you might see:
??? [????] /(n) (arch) (obsc) (See ???) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/
curl http://mark.dycept.com/euc_jp_example.txt > euc_jp_example.txt
cat euc_jp_example.txt | pru '/obsc/'
/Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:22:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:22:in `block in map'
from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:15:in `each_line'
from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:15:in `map'
from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/bin/pru:72:in `<top (required)>'
from /Users/mark/.rvm/gems/ruby-1.9.2-p290/bin/pru:19:in `load'
from /Users/mark/.rvm/gems/ruby-1.9.2-p290/bin/pru:19:in `<main>'
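The trace points at a regexp match (`=~`) in pru.rb. The standalone Ruby sketch below, independent of pru, should trigger the same class of error. It uses EUC-JP bytes tagged as UTF-8, which is an assumption about what happens when the input is read under a UTF-8 default external encoding. The `match_error` helper is made up for illustration.

```ruby
# Illustrative helper (not from pru): try a regexp match and return the
# exception message if Ruby rejects the string, or nil if the match ran.
def match_error(line, pattern)
  line =~ pattern
  nil
rescue ArgumentError => e
  e.message
end

# EUC-JP bytes mislabelled as UTF-8.
mislabelled = "\u3052\u3064 (obsc)".encode("EUC-JP").force_encoding("UTF-8")

puts mislabelled.valid_encoding?        # the bytes are not valid UTF-8
puts match_error(mislabelled, /obsc/).inspect
```

Note that the pattern itself is plain ASCII; the failure comes from the string's bytes being checked against its declared encoding before matching.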
Sorry, that is not shift-jis encoding; my mistake, it's euc-jp. I'll update it to reflect that.
OK done.
Actually, I wonder if it's a problem with the shell and/or the cat command.
cat euc_jp_example.txt | grep "げつ" #matches nothing
cat euc_jp_example.txt | grep "obsc"
??? [????] /(n) (arch) (obsc) (See ???) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/
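One way to see why the `grep "げつ"` above finds nothing: assuming the shell sends the pattern as UTF-8, its bytes simply do not occur in the EUC-JP file, while the ASCII word "obsc" has the same bytes in both encodings. A Ruby sketch of that byte-level comparison follows. The `bytes_include?` helper is hypothetical, and treating the search as purely byte-wise is a simplification of what grep does.

```ruby
# Hypothetical helper: byte-wise substring test that ignores encodings
# entirely (a simplified model of a byte-oriented search).
def bytes_include?(haystack, needle)
  haystack.b.include?(needle.b)
end

needle_utf8 = "\u3052\u3064"                  # "げつ" as typed in a UTF-8 shell
needle_euc  = needle_utf8.encode("EUC-JP")    # same characters, EUC-JP bytes
file_line   = "[\u3052\u3064] /(n) (obsc)/".encode("EUC-JP")

puts bytes_include?(file_line, needle_utf8)   # UTF-8 pattern bytes vs. the EUC-JP line
puts bytes_include?(file_line, needle_euc)    # transcoded pattern vs. the EUC-JP line
puts bytes_include?(file_line, "obsc")        # ASCII pattern vs. the EUC-JP line
```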
I am not sure there is a sane way to support this, since it would mean detecting the input encoding and the command encoding and magically matching them, or something like that... :(
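For what it's worth, a crude form of detection can be sketched in Ruby by trying a list of candidate encodings and keeping the first one under which the bytes are valid. This is a heuristic, not real detection: several encodings can accept the same bytes, the candidate list is an assumption, and `guess_encoding` and `to_utf8` are made-up names, not anything in pru.

```ruby
# Hypothetical heuristic: return the first candidate encoding under which
# +bytes+ form a valid string, or nil if none fits. Order matters, because
# more than one encoding may accept the same bytes.
def guess_encoding(bytes, candidates = %w[UTF-8 EUC-JP Shift_JIS])
  candidates.find do |name|
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

# Transcode to UTF-8 using the guess; return the input unchanged when
# no candidate fits.
def to_utf8(bytes)
  name = guess_encoding(bytes)
  name ? bytes.dup.force_encoding(name).encode("UTF-8") : bytes
end

sample = "\u3052\u3064 (obsc)".encode("EUC-JP").b
puts guess_encoding(sample)
puts(to_utf8(sample) =~ /obsc/ ? "matched" : "no match")
```

An explicit option remains the safer design, since a wrong guess would silently produce garbled text instead of an error.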