Data is double-encoded
mjpieters opened this issue · comments
The data is double-encoded UTF-8. For example, line 5 of the first file contains, in part (non-ASCII bytes represented by \xhh escape sequences for readability):
#StandForOurAnthem\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8
Those are UTF-8 sequences that encode what are themselves UTF-8 bytes; decoding them once gives us:
#StandForOurAnthem\xf0\x9f\x87\xba\xf0\x9f\x87\xb8
which in turn can be decoded as UTF-8 to the text #StandForOurAnthem🇺🇸.
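A short Python sketch of how this double encoding arises (the Latin-1 intermediate step is an assumption about how the corruption happened; any 8-bit byte-transparent decode would produce the same bytes):

```python
text = "#StandForOurAnthem\U0001F1FA\U0001F1F8"  # #StandForOurAnthem🇺🇸

# First (correct) encoding to UTF-8 bytes.
once = text.encode("utf-8")

# The corruption: the UTF-8 bytes are treated as text (here via Latin-1,
# which maps bytes 0x00-0xFF straight to code points) and encoded again.
twice = once.decode("latin-1").encode("utf-8")

# The last 16 bytes match the double-encoded flag emoji seen in the file:
print(twice[-16:].hex(" "))
# -> c3 b0 c2 9f c2 87 c2 ba c3 b0 c2 9f c2 87 c2 b8
```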
This double-encoding makes the files needlessly bigger and harder to work with.
A workaround is to use iconv:
for file in IRAhandle_tweets_*.csv; do
    echo -n "Converting $file... "
    # Quote "$file" so names with spaces or glob characters survive.
    iconv -f utf8 -t latin1 "$file" > "$file.corrected" &&
        mv -f "$file.corrected" "$file"
    echo "Done"
done
This decodes the data once, then writes the result out as Latin-1 (which maps Unicode code points to bytes one-to-one), giving us single-encoded UTF-8 data again.
This shaves off about 10% of the total byte count, dropping the data from 731MB to 656MB.
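For reference, the same recovery the iconv loop performs can be sketched in Python, operating on the sample bytes from line 5 rather than the actual files:

```python
# Double-encoded sample bytes as found in the file.
double = (b"#StandForOurAnthem"
          b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba"
          b"\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8")

# Decode once as UTF-8, then write back out as Latin-1: every code point
# after one decode is <= U+00FF, so each maps back to a single byte.
single = double.decode("utf-8").encode("latin-1")

# The result is ordinary single-encoded UTF-8.
print(single.decode("utf-8"))  # -> #StandForOurAnthem🇺🇸
```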
Thank you for your suggestion @mjpieters. We have updated the data to remove the double encoding using the script you suggested.
Cool work; it seems there is more to do, though (if we can recover this): #20