fivethirtyeight / russian-troll-tweets

Irregularities matching authors in the dataset with the 2017/2018 Congressional lists

bet4a opened this issue

commented

I’ve made a public Google Sheet that attempts to match up account information from this dataset with the November 2017 and June 2018 lists published by the House Intelligence Committee. Perhaps others will find it helpful for a number of reasons…

  • Every author has multiple tweets. But some info (should) remain constant for a particular author across all their tweets: external_author_id, account_type, account_category and the new_june_2018 flag. This spreadsheet provides a summary of all the authors and their associated properties.
  • The November 2017 list contains user_ids, i.e. external_author_id. This can be used to resolve some of the floating point external_author_ids raised in issue #4. (It can’t be used for accounts that were added in the 2018 list because that PDF doesn’t list user_ids.)
  • The dataset uses all-caps for each author; the lists from Congress retain account names’ original capitalization. It’s often trivial, but the distinction can be semantically meaningful. For example, we can see that the CURTISBIGMAN account from the dataset actually had a Twitter handle of “CurtisBigMan”, not CurtisBigman or CurtIsBigMan.
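As a rough illustration of the first and third points, here is a minimal Python sketch, not the code behind the spreadsheet. It assumes each tweet is a dict keyed by the dataset's column names (for example, as produced by `csv.DictReader`). The helper names `summarize_authors` and `original_case` are hypothetical, and the sample rows are made up.

```python
from collections import defaultdict

# Columns the issue expects to stay constant for a given author.
CONSTANT_COLUMNS = ("external_author_id", "account_type",
                    "account_category", "new_june_2018")


def summarize_authors(rows):
    """Collect, per author, every distinct value seen in each constant column."""
    summary = defaultdict(lambda: {col: set() for col in CONSTANT_COLUMNS})
    for row in rows:
        for col in CONSTANT_COLUMNS:
            summary[row["author"]][col].add(row[col])
    return dict(summary)


def original_case(author, congressional_handles):
    """Look up an all-caps dataset author in a list of original-case handles."""
    by_upper = {handle.upper(): handle for handle in congressional_handles}
    return by_upper.get(author.upper())


# Made-up rows for illustration only; the values are not from the dataset.
rows = [
    {"author": "EXAMPLEUSER", "external_author_id": "111", "account_type": "a",
     "account_category": "x", "new_june_2018": "0"},
    {"author": "EXAMPLEUSER", "external_author_id": "111", "account_type": "a",
     "account_category": "x", "new_june_2018": "0"},
]
summary = summarize_authors(rows)
# An author is consistent when every constant column has exactly one value.
consistent = all(len(values) == 1 for values in summary["EXAMPLEUSER"].values())
print(consistent)
print(original_case("CURTISBIGMAN", ["CurtisBigMan"]))
```

Any author whose summary holds more than one value in a column is worth a closer look.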

However, there are two problems that I ran across:

  1. There are 17 authors in the dataset who each have two different external_author_ids. Example: in IRAhandle_tweets_1.csv, there are some rows with author 4MYSQUAD and external_author_id 4036537452. Other rows have the same author 4MYSQUAD, but a different external_author_id 3312143142. (FWIW, the Nov 2017 PDF shows 4MySquad having a user_id of 4036537452.)

    For some authors, this appears to be tied to floating point external_author_ids being rounded, such as KRISTINADRUCKER, who has some tweets with external_author_id 7.15893000000e+17 and others with 7.16000000000e+17. But for other authors, such as the 4MYSQUAD example above, the discrepancy doesn’t appear to be a rounding issue.

    Is there a legit reason these accounts are associated with two different external_author_ids, or is this a mistake?

  2. There are 5 authors in the dataset that are not listed in either of the PDFs from Congress:

    • BABCHENKOVA_EVA: the 2018 PDF does list the “babchenkova” handle, which also appears in the dataset, but neither PDF lists an account name of BABCHENKOVA_EVA.
    • JENNATRAVELLER: no similar account name is in either PDF.
    • KARUCZ_00: the 2018 PDF does list the “Karucz” handle, which also appears in the dataset, but neither PDF lists an account name of KARUCZ_00.
    • TERRAFORMA: the 2017 and 2018 PDFs do list the “taraformation” handle, which also appears in the dataset, but neither PDF lists an account name of TERRAFORMA.
    • TOURETTESN: no similar account name is in either PDF.

    It’s not clear why tweets from these five accounts are included in the dataset.
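For the first problem above (authors with two different external_author_ids), one way to separate likely rounding artifacts from genuinely different IDs is sketched below. This is only a heuristic under stated assumptions: IDs are read as strings, and two IDs are treated as "possibly the same, rounded" when they agree after rounding both to the number of significant digits of the coarser one. The function names are hypothetical, and the test can give false positives for IDs that merely share leading digits.

```python
from collections import defaultdict
from decimal import Decimal


def sig_digits(value):
    """Number of significant digits once trailing zeros are dropped."""
    return len(value.normalize().as_tuple().digits)


def round_sig(value, n):
    """Round a Decimal to n significant digits."""
    quantum = Decimal(1).scaleb(value.adjusted() - n + 1)
    return value.quantize(quantum)


def could_be_rounding(id_a, id_b):
    """True if the two IDs agree at the precision of the coarser one."""
    a, b = Decimal(id_a), Decimal(id_b)
    n = min(sig_digits(a), sig_digits(b))
    return round_sig(a, n) == round_sig(b, n)


def authors_with_multiple_ids(rows):
    """Map each author with more than one external_author_id to its IDs."""
    ids = defaultdict(set)
    for row in rows:
        ids[row["author"]].add(row["external_author_id"])
    return {author: sorted(v) for author, v in ids.items() if len(v) > 1}


# Rows built from the two examples quoted in this issue.
rows = [
    {"author": "4MYSQUAD", "external_author_id": "4036537452"},
    {"author": "4MYSQUAD", "external_author_id": "3312143142"},
    {"author": "KRISTINADRUCKER", "external_author_id": "7.15893000000e+17"},
    {"author": "KRISTINADRUCKER", "external_author_id": "7.16000000000e+17"},
]
for author, (first, second) in authors_with_multiple_ids(rows).items():
    print(author, "rounding?", could_be_rounding(first, second))
```

With these rows, KRISTINADRUCKER is flagged as a possible rounding case and 4MYSQUAD is not, which matches the distinction drawn above.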

I’ve attached a screenshot below which shows all tweet authors affected by either issue. For a copy-pastable version, go to the Google Sheet, then click the Data menu → Filter views… → author problems.

screenshot 2018-08-02 at 09 43 30

This is super helpful! Thanks for all the care you've taken here. A major reason we wanted to post all these data is that we knew more eyes would find errors we missed. Nice job. A couple of responses.

  1. Multiplicity in external_author_id. We think this multiplicity comes from one of two sources (plus one more, even more complicated, way; see below):

1.1) The first is the rounding problem you mentioned above. At some point in the sequence Twitter -> Social Studio -> CSV -> Stata -> CSV, some very large integers were converted into scientific notation and truncated. If this happened for some tweets and not for others, you could get two external_author_ids for the same account. I need to go back and see if I can reconstruct some of these, and will as soon as I can.

1.2) Some accounts came out of Social Studio with multiple external IDs under the same handle, either consecutively (like 4MYSQUAD) or interlaced (like MeggieONeil). For some of those, the follower counts, update counts, and behaviors indicated that these were, in fact, the same account, and we kept both in. For some, there were dramatic changes in stats and/or behavior, and we presumed that we simply had two accounts that shared the same handle, and we tried to include the one with the account number indicated in the Congressional release. When in doubt, we included both to allow users to decide, but care should be taken with these.
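The rounding path described in 1.1 can be illustrated with a small Python sketch. It does not reproduce the actual pipeline. It assumes an ID passes through a 64-bit float and is written out in scientific notation with 11 decimal places, the format of the values quoted in this thread. The function name is hypothetical, and the large ID is the one printed for todaycleveland in the Nov 2017 PDF.

```python
def through_float_text(user_id):
    """Simulate storing an integer ID as a float and writing it as '%.11e' text."""
    as_text = "%.11e" % float(user_id)   # e.g. '8.90944315396e+17'
    return int(float(as_text))           # what a later reader would recover


small_id = 4036537452            # 10 digits: fits within 12 significant digits
large_id = 890944315396149250    # 18 digits: more than the format can carry

print(through_float_text(small_id) == small_id)   # small IDs survive
print(through_float_text(large_id))               # low-order digits are lost
print(int(float(large_id)) == large_id)           # even the float alone is inexact
```

Under these assumptions, IDs of about 12 digits or fewer come back unchanged, while 18-digit IDs lose their low-order digits. That fits the observation that only some authors are affected.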

  2. Accounts not on the list. Three sorts of responses.

2.1 As you suspected, we simply overlooked BABCHENKOVA_EVA and KARUCZ_00 when cleaning our data. Our method of gathering included a keyword search using the handle, and those were accidentally swept up. They should be discarded.

2.2 JennaTraveller is a super-interesting case that we went back and forth about including. Jenn_Abrams is one of the best known of the trolls. We believe that JennaTraveller was the first handle of the Jenn_Abrams account. It shares the external_author_id, and the updates and follower counts transition smoothly as the handle changes. In a few cases (which we don't really understand), we were able to trace accounts back to old handles in this way.

Tourettesn is the same situation: the opening handle for PigeonToday. (Interestingly, it also used the handle Politweecs, which shows up independently on the list, but with a different external_author_id.) This is yet another way you could get two external_author_ids for the same handle.

We had actually meant to strip these early aliases out before posting the data, but now that it's out there, I have a few more I can post early next week.

2.3 Taraforma is a mixture of 1.2 and 2.2. The account "Taraformation" has three external_author_ids. There were two very early ones that we judged to be sufficiently different from the indicated account (external_author_id=1534083420) that we did not include them in our data. One of those accounts had an alias, Taraforma, that we failed to remove when we removed that version of Taraformation. It is not a troll and should be removed.

commented

Thanks so much for the detailed response! It definitely clears up the confusion I had about these unusual cases.

In case you find it useful, I should also mention that I’ve added some additional columns to my author summary Google Sheet. For each author, it now shows:

  • Number of tweets sent
  • The earliest and latest tweet published_date
  • Average tweets per day (derived from the above)
  • Most common language classification of author’s tweets
  • Number/percent of author’s tweets classified as their primary language
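A minimal sketch of how columns like these could be derived, assuming each tweet is a dict with `author`, `publish_date` and `language` keys, and assuming dates are formatted like `10/01/2017 19:58`. The column names, the date format and the sample rows are all assumptions for illustration. This is not the method used to build the sheet.

```python
from collections import Counter, defaultdict
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %H:%M"  # assumed format; adjust to the actual CSV


def author_stats(rows, date_column="publish_date"):
    """Per-author tweet count, date range, daily rate and primary language."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["author"]].append(row)

    stats = {}
    for author, tweets in grouped.items():
        dates = [datetime.strptime(t[date_column], DATE_FORMAT) for t in tweets]
        first, last = min(dates), max(dates)
        # Treat a span shorter than one day as one day to avoid dividing by zero.
        days = max((last - first).days, 1)
        language, count = Counter(t["language"] for t in tweets).most_common(1)[0]
        stats[author] = {
            "tweets": len(tweets),
            "first": first,
            "last": last,
            "tweets_per_day": len(tweets) / days,
            "primary_language": language,
            "primary_language_count": count,
            "primary_language_pct": 100.0 * count / len(tweets),
        }
    return stats


# Made-up rows for illustration only.
rows = [
    {"author": "EXAMPLEUSER", "publish_date": "01/01/2017 00:00", "language": "English"},
    {"author": "EXAMPLEUSER", "publish_date": "01/03/2017 00:00", "language": "English"},
    {"author": "EXAMPLEUSER", "publish_date": "01/05/2017 00:00", "language": "Russian"},
]
stats = author_stats(rows)
print(stats["EXAMPLEUSER"]["tweets_per_day"])
```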

It’s not really anything revolutionary in and of itself. But it may be useful as a jumping-off point for further data exploration (e.g., what are the general characteristics of the most active accounts?, etc.)

screenshot 2018-08-03 at 19 21 19

I can confirm that JennaTraveller was Jenn_Abrams's original handle. When her material was still up, you could see cases where users had replied to her and the original handle was shown. Got a screenshot somewhere, I think.

Was never as certain on the Politweecs and PigeonToday thing. There was massive overlap in the handles, but I couldn't tell if it was because one was constantly retweeting the other and then getting replies. At some point, Politweecs definitely became its own thing, and was operational a good bit longer than PigeonToday.

commented

In a related discrepancy, there are a handful of authors whose new_june_2018 column appears to be incorrect.

The following authors have new_june_2018 = 0 in the dataset. Their handles are listed in the June 2018 report but not in the November 2017 report—so it seems their new_june_2018 values should be 1 rather than 0:

  • AHMADYUSUFF03
  • ANKIDINOVAKIRA
  • BORIS_KOLOV
  • GLEBUSHKAGLEB
  • MASHAEMYASHEVA
  • MUZAAMURA
  • NATASHAPAVLOVAA
  • NEVNOVRU
  • NOVOSTIIZHEVSK
  • TONYWILLDO
  • TURANGALA
  • VIZYDYREGUJI
  • ZILILINYM

And these two authors have new_june_2018 = 1 in the dataset, even though their handles are in the November 2017 list:

  • MONEYFORM
  • VIKA_BERE
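The check behind the two lists above can be sketched as follows. It assumes the rule is "new_june_2018 should be 1 exactly when a handle is in the June 2018 list but not in the November 2017 list", and that handles are compared case-insensitively. The function names are hypothetical, and the tiny handle lists are placeholders, not the real lists.

```python
def expected_new_june_2018(author, nov_2017_handles, june_2018_handles):
    """1 if the author appears only in the June 2018 list, else 0."""
    nov = {handle.upper() for handle in nov_2017_handles}
    june = {handle.upper() for handle in june_2018_handles}
    key = author.upper()
    return 1 if key in june and key not in nov else 0


def flag_mismatches(dataset_flags, nov_2017_handles, june_2018_handles):
    """Authors whose dataset flag differs from the list-derived expectation."""
    mismatches = {}
    for author, flag in dataset_flags.items():
        expected = expected_new_june_2018(author, nov_2017_handles, june_2018_handles)
        if flag != expected:
            mismatches[author] = {"dataset": flag, "expected": expected}
    return mismatches


# Placeholder inputs: two authors from this comment plus one made-up control.
nov_2017 = ["MONEYFORM"]
june_2018 = ["TONYWILLDO", "EXAMPLE_OK"]
dataset_flags = {"TONYWILLDO": 0, "MONEYFORM": 1, "EXAMPLE_OK": 1}

mismatches = flag_mismatches(dataset_flags, nov_2017, june_2018)
print(mismatches)
```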
commented

Another related discrepancy: there are 5 authors that appear in the Nov 2017 House Intel list PDF, but whose external_author_id in the dataset does not match the user id from the HPSCI list. (Ignoring differences that can be attributed to rounding errors.)

Phew, that was a mouthful. Sorry if that didn’t make sense. Hopefully the data itself is a bit more self-explanatory:

| external_author_id (from dataset) | author (from dataset) | user id (from Nov 2017 PDF) | handle (from Nov 2017 PDF) |
| --- | --- | --- | --- |
| 914069096 | BLK_VOICE | 4217244274 | Blk_Voice |
| 2861691030 | GLOED_UP | 3312143142 | gloed_up |
| 4235669232 | KHALEDBAKRI7 | 4590743632 | khaledbakri7 |
| 2912754262 | POLITWEECS | 2570631118 | Politweecs |
| 2753146444 | TODAYCLEVELAND | 890944315396149250 | todaycleveland |

Digging a bit further, for two of these cases, it seems the external_author_id is associated with two different authors in the dataset. The other three are more alarming—on real-world Twitter, these external_author_ids correspond to the Twitter user IDs of accounts that aren’t suspended!

  • 914069096
    • In the dataset, this is the external_author_id for BLK_VOICE.
    • On Twitter, this is the user ID for the active (i.e., non-suspended) account @​VocabExceeded
  • 2861691030
    • In the dataset, this is the external_author_id for GLOED_UP.
    • On Twitter, this is the user ID for the active account @​Iccy_t
  • 4235669232
    • In the dataset, this is the external_author_id for KHALEDBAKRI7.
    • On Twitter, this is the user ID for the active account @​Fuckman73 (😑)
  • 2912754262
    • In the dataset, this is the external_author_id for POLITWEECS.
    • In the dataset, this is also the external_author_id for author PIGEONTODAY. (See @mortdtroll’s comment above.)
  • 2753146444
    • In the dataset, this is the external_author_id for TODAYCLEVELAND.
    • In the dataset, this is also the external_author_id for author ONLINECLEVELAND.
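Cases like the last two, where one external_author_id is attached to more than one author, can be surfaced with a short grouping pass. This is a minimal sketch assuming rows are dicts with `author` and `external_author_id` keys. The function name is hypothetical, and the sample rows contain only the pairs named in this comment.

```python
from collections import defaultdict


def ids_shared_by_authors(rows):
    """Map each external_author_id used by more than one author to those authors."""
    authors = defaultdict(set)
    for row in rows:
        authors[row["external_author_id"]].add(row["author"])
    return {ext_id: sorted(names) for ext_id, names in authors.items()
            if len(names) > 1}


rows = [
    {"author": "BLK_VOICE", "external_author_id": "914069096"},
    {"author": "POLITWEECS", "external_author_id": "2912754262"},
    {"author": "PIGEONTODAY", "external_author_id": "2912754262"},
    {"author": "TODAYCLEVELAND", "external_author_id": "2753146444"},
    {"author": "ONLINECLEVELAND", "external_author_id": "2753146444"},
]
shared = ids_shared_by_authors(rows)
print(shared)
```

IDs that do not appear in the result, such as the one for BLK_VOICE here, belong to a single author in the dataset. Those would need to be checked against Twitter itself, as was done above.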