encoding issues

markburns opened this issue 13 years ago · comments

Mark Burns commented 13 years ago

If I use a file with shift-jis encoding as input to pru I get an encoding error.

invalid byte sequence in UTF-8 (ArgumentError)

Could we have a way to specify encoding?

Michael Grosser · Answer 1 · Sun Sep 18 2011 13:04:05 GMT+0800 (China Standard Time)

Something like --encoding XYZ ?
If you give me a file with this encoding, I'll see what I can do.

Mark Burns · Answer 2 · Sun Sep 18 2011 15:18:15 GMT+0800 (China Standard Time)

Sounds good, thanks. I wasn't sure how to attach a file, or whether I can.
Do:
curl http://mark.dycept.com/euc_jp_example.txt > euc_jp_example.txt
When encoded correctly you should see:

刖 [げつ] /(n) (arch) (obsc) (See 剕) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/

Otherwise you might see:
??? [????] /(n) (arch) (obsc) (See ???) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/

curl http://mark.dycept.com/euc_jp_example.txt > euc_jp_example.txt

cat euc_jp_example.txt | pru '/obsc/'


/Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:22:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:22:in `block in map'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:15:in `each_line'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/lib/pru.rb:15:in `map'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/gems/pru-0.1.6/bin/pru:72:in `<top (required)>'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/bin/pru:19:in `load'
    from /Users/mark/.rvm/gems/ruby-1.9.2-p290/bin/pru:19:in `<main>'
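
For anyone who wants to reproduce this without downloading the file, here is a minimal sketch. It assumes the input bytes are EUC-JP and that Ruby labels them as UTF-8 (the usual default external encoding); `match_error` is a hypothetical helper for illustration, not part of pru.

```ruby
# "げつ" written with \u escapes, transcoded to EUC-JP, then mislabeled as
# UTF-8, which is how bytes read from the file end up being labeled.
line = "\u3052\u3064".encode("EUC-JP").force_encoding("UTF-8")

# Hypothetical helper: run the same kind of regex match that pru.rb:22
# performs and return the error message, or nil if the match did not raise.
def match_error(line)
  line =~ /obsc/
  nil
rescue ArgumentError => e
  e.message
end

puts line.valid_encoding?   # the mislabeled bytes are not valid UTF-8
puts match_error(line)
```

The match raises before the pattern is even compared, because the string itself is not valid in the encoding it claims to have.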

Mark Burns · Answer 3 · Sun Sep 18 2011 15:27:01 GMT+0800 (China Standard Time)

Sorry, that is not Shift-JIS encoding; my mistake, it's EUC-JP. I'll update it to reflect that.

OK done.

Mark Burns · Answer 4 · Sun Sep 18 2011 15:35:50 GMT+0800 (China Standard Time)

Actually, I wonder if it's a problem with the shell and/or the cat command.

cat euc_jp_example.txt | grep "げつ"   #matches nothing

cat euc_jp_example.txt | grep "obsc"
??? [????] /(n) (arch) (obsc) (See ???) cutting off the leg at the knee (form of punishment in ancient China)/EntL2542160/
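
A likely explanation, assuming the terminal sends the pattern as UTF-8: the bytes grep searches for are simply not the bytes in the file. A small sketch comparing the two encodings of "げつ" (`hex` is a hypothetical helper for illustration):

```ruby
# The same two characters, encoded the way the shell would pass them
# (assumed UTF-8) and the way the file stores them (EUC-JP).
utf8_bytes = "\u3052\u3064".encode("UTF-8").bytes
euc_bytes  = "\u3052\u3064".encode("EUC-JP").bytes

# Hypothetical helper: format a byte array as space-separated hex.
def hex(bytes)
  bytes.map { |b| format("%02X", b) }.join(" ")
end

puts hex(utf8_bytes)  # E3 81 92 E3 81 A4
puts hex(euc_bytes)   # A4 B2 A4 C4
```

The ASCII pattern "obsc" matches because ASCII bytes are identical in both encodings.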

Michael Grosser · Answer 5 · Sun Sep 18 2011 19:01:39 GMT+0800 (China Standard Time)

I am not sure if there is a sane way to support this, since it would mean detecting the encoding of both the input and the command and then somehow matching them... :(
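
One way to sidestep detection would be the explicit option floated earlier in the thread: let the user name the input encoding and transcode every line to UTF-8 before matching. The sketch below is only an illustration under that assumption; `utf8_lines` is a hypothetical helper, and `--encoding` is a proposal from this thread, not an existing pru option.

```ruby
require "stringio"

# Hypothetical helper, not part of pru: read each line from io, interpret
# its bytes as the encoding named by the caller, and return UTF-8 strings.
def utf8_lines(io, encoding_name)
  source = Encoding.find(encoding_name) # raises ArgumentError for unknown names
  io.each_line.map do |line|
    # encode(dst, src) treats the bytes as src regardless of the current label
    line.encode(Encoding::UTF_8, source)
  end
end

# Simulate EUC-JP input with an in-memory stream instead of a real file.
raw = "[\u3052\u3064] (obsc)\n".encode("EUC-JP").b
lines = utf8_lines(StringIO.new(raw), "EUC-JP")
puts lines.grep(/obsc/)
```

With real input this would be called as `utf8_lines($stdin, "EUC-JP")`. Since the command typed in the shell is most likely UTF-8 already, transcoding only the input may be enough to make both sides agree, though that is an assumption about the user's terminal.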