fivethirtyeight / russian-troll-tweets

Data is double-encoded

mjpieters opened this issue

The data is double-encoded as UTF-8. For example, line 5 of the first file contains, in part (non-ASCII bytes are shown as \xhh escape sequences for readability):

#StandForOurAnthem\xc3\xb0\xc2\x9f\xc2\x87\xc2\xba\xc3\xb0\xc2\x9f\xc2\x87\xc2\xb8

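Each of those two-byte sequences is the UTF-8 encoding of a single byte of the intended UTF-8 data. Decoding them once as UTF-8 and taking each resulting code point as a byte gives us: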

#StandForOurAnthem\xf0\x9f\x87\xba\xf0\x9f\x87\xb8

which in turn can be decoded as UTF-8 to the text #StandForOurAnthem🇺🇸.
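The two decoding steps above can be reproduced on the example bytes alone. This is a minimal sketch, assuming an iconv that accepts the charset names UTF-8 and ISO-8859-1; the variable names are made up for illustration, and the bytes are written as octal escapes only because not every shell's printf understands \xhh.

```shell
# Double-encoded bytes from the example above, as octal escapes
# (\303\260 is \xc3\xb0, \302\237 is \xc2\x9f, and so on).
double=$(printf '#StandForOurAnthem\303\260\302\237\302\207\302\272\303\260\302\237\302\207\302\270')

# Decode once as UTF-8 and write each code point out as a single byte.
fixed=$(printf '%s' "$double" | iconv -f UTF-8 -t ISO-8859-1)

# Byte counts before and after (tr strips the padding some wc versions add).
double_len=$(printf '%s' "$double" | wc -c | tr -d ' ')
fixed_len=$(printf '%s' "$fixed" | wc -c | tr -d ' ')

printf '%s\n' "$fixed"
printf 'double-encoded: %s bytes, single-encoded: %s bytes\n' "$double_len" "$fixed_len"
# The last line prints: double-encoded: 34 bytes, single-encoded: 26 bytes
```

On a UTF-8 terminal the first line of output should display as the hashtag followed by the flag emoji.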

This double-encoding makes the files needlessly large and harder to work with.

A workaround is to use iconv:

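for file in IRAhandle_tweets_*.csv; do
  echo -n "Converting $file... "
  # Decode the outer UTF-8 layer; each resulting code point becomes one byte.
  # Replace the original and report success only if the conversion worked.
  iconv -f utf8 -t latin1 "$file" > "$file.corrected" &&
  mv -f "$file.corrected" "$file" &&
  echo "Done"
done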

This decodes the data once as UTF-8, then writes the result out as Latin-1, which maps Unicode code points U+0000 through U+00FF to bytes one-to-one. The output is single-encoded UTF-8 again.
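Because the conversion replaces files in place, it may be worth checking that the result really is valid UTF-8 before overwriting anything. The following is a sketch of one way to do that, not the script used on the repository: fix_double_encoding is a hypothetical helper name, and it assumes iconv exits with a non-zero status on invalid or unconvertible input.

```shell
# Hypothetical helper: undo one layer of double encoding in a file, but keep
# the original unless the converted result is itself valid UTF-8.
fix_double_encoding() {
  file=$1
  tmp="$file.corrected"
  # Step 1: decode the outer UTF-8 layer to raw bytes.
  # Step 2: confirm those bytes decode as UTF-8 (output is discarded).
  if iconv -f UTF-8 -t ISO-8859-1 "$file" > "$tmp" &&
     iconv -f UTF-8 -t UTF-8 "$tmp" > /dev/null; then
    mv -f "$tmp" "$file"
  else
    # Conversion or validation failed: leave the original untouched.
    rm -f "$tmp"
    return 1
  fi
}
```

It could be called from the same loop as above, for example `fix_double_encoding "$file" || echo "Skipped $file"`. A file that was already single-encoded should fail one of the two steps and be left as it was, which makes the loop safer to run twice. A pure-ASCII file passes both steps and is rewritten unchanged.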

This shaves off about 10% of the total byte count, dropping it from 731MB to 656MB.

Thank you for your suggestion @mjpieters. We have updated the data to remove the double encoding using the script you suggested.

Cool work, seems there is more to do though (if we can recover this) #20