Data is double-encoded
mjpieters opened this issue · comments
The data is double-encoded UTF-8. For example, line 5 of the first file contains, in part (non-ASCII bytes represented by \xhh escape sequences for readability):
#StandForOurAnthem\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8
Those are UTF-8 sequences that encode what are themselves UTF-8 bytes; decoding them once gives us:
#StandForOurAnthem\xf0\x9f\x87\xba\xf0\x9f\x87\xb8
which in turn can be decoded as UTF-8 to the text #StandForOurAnthem🇺🇸.
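A short Python sketch of how this double encoding arises (the Latin-1 intermediate step is an assumption about how the corruption happened; any 8-bit byte-transparent decode would produce the same bytes):

```python
text = "#StandForOurAnthem\U0001F1FA\U0001F1F8"  # #StandForOurAnthem🇺🇸

# First (correct) encoding to UTF-8 bytes.
once = text.encode("utf-8")

# The corruption: the UTF-8 bytes are treated as text (here via Latin-1,
# which maps bytes 0x00-0xFF straight to code points) and encoded again.
twice = once.decode("latin-1").encode("utf-8")

# The last 16 bytes match the double-encoded flag emoji seen in the file:
print(twice[-16:].hex(" "))
# -> c3 b0 c2 9f c2 87 c2 ba c3 b0 c2 9f c2 87 c2 b8
```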
This double-encoding makes the files needlessly bigger and harder to work with.
A workaround is to use iconv:
for file in IRAhandle_tweets_*.csv; do
    echo -n "Converting $file... "
    # Quote "$file" so names with spaces or glob characters survive.
    iconv -f utf8 -t latin1 "$file" > "$file.corrected" &&
        mv -f "$file.corrected" "$file"
    echo "Done"
done
This decodes the data once, then writes the result out as Latin-1 (which maps Unicode code points to bytes one-to-one), giving us single-encoded UTF-8 data again.
This shaves off about 10% of the total byte count, dropping the data from 731MB to 656MB.
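For reference, the same recovery the iconv loop performs can be sketched in Python, operating on the sample bytes from line 5 rather than the actual files:

```python
# Double-encoded sample bytes as found in the file.
double = (b"#StandForOurAnthem"
          b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba"
          b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8")

# Decode once as UTF-8, then write back out as Latin-1: every code point
# after one decode is <= U+00FF, so each maps back to a single byte.
single = double.decode("utf-8").encode("latin-1")

# The result is ordinary single-encoded UTF-8.
print(single.decode("utf-8"))  # -> #StandForOurAnthem🇺🇸
```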
Thank you for your suggestion @mjpieters. We have updated the data to remove the double encoding using the script you suggested.
Cool work; it seems there is more to do, though (if we can recover this): #20