Irregularities matching authors in the dataset with the 2017/2018 Congressional lists
bet4a opened this issue · comments
I’ve made a public Google Sheet that attempts to match up account information from this dataset with the November 2017 and June 2018 lists published by the House Intelligence Committee. Perhaps others will find it helpful for a number of reasons…
- Every `author` has multiple tweets. But some info (should) remain constant for a particular author across all their tweets: `external_author_id`, `account_type`, `account_category` and the `new_june_2018` flag. This spreadsheet provides a summary of all the `author`s and their associated properties.
- The November 2017 list contains user_ids, i.e. `external_author_id`. This can be used to resolve some of the floating point `external_author_id`s raised in issue #4. (It can’t be used for accounts that were added in the 2018 list because that PDF doesn’t list user_ids.)
- The dataset uses all-caps for each `author`; the lists from Congress retain account names’ original capitalization. It’s often trivial, but the distinction can be semantically meaningful. For example, we can see that the CURTISBIGMAN account from the dataset actually had a Twitter handle of “CurtisBigMan”, not CurtisBigman or CurtIsBigMan.
However, there are two problems that I ran across:
- There are 17 `author`s in the dataset who each have two different `external_author_id`s. Example: in IRAhandle_tweets_1.csv, there are some rows with `author` 4MYSQUAD and `external_author_id` 4036537452. Other rows have the same `author` 4MYSQUAD, but a different `external_author_id` 3312143142. (FWIW, the Nov 2017 PDF shows 4MySquad having a user_id of 4036537452.)

  For some authors this seems to be tied to floating point `external_author_id`s being rounded, such as KRISTINADRUCKER, who has some tweets with `external_author_id` 7.15893000000e+17 and others 7.16000000000e+17. But for other authors, such as the 4MYSQUAD example above, the discrepancy doesn’t appear to be a rounding issue.

  Is there a legit reason these accounts are associated with two different `external_author_id`s, or is this a mistake?
- There are 5 `author`s in the dataset that are not listed in either of the PDFs from Congress:
  - BABCHENKOVA_EVA: the 2018 PDF does list the “babchenkova” handle, which also appears in the dataset, but neither PDF lists an account name of BABCHENKOVA_EVA.
  - JENNATRAVELLER: no similar account name is in either PDF.
  - KARUCZ_00: the 2018 PDF does list the “Karucz” handle, which also appears in the dataset, but neither PDF lists an account name of KARUCZ_00.
  - TERRAFORMA: the 2017 and 2018 PDFs do list the “taraformation” handle, which also appears in the dataset, but neither PDF lists an account name of TERRAFORMA.
  - TOURETTESN: no similar account name is in either PDF.
It’s not clear why tweets from these five accounts are included in the dataset.
I’ve attached a screenshot below which shows all tweet authors affected by either issue. For a copy-pastable version, go to the Google Sheet, then click the Data menu → Filter views… → author problems.
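For anyone who would rather script these two checks than use the spreadsheet, here is a minimal Python sketch. It assumes the dataset rows have been read into dicts with `author` and `external_author_id` keys (for example via `csv.DictReader`) and that the handles from the two PDFs have been typed into a plain set. The function names and the sample rows are made up for illustration; only the 4MYSQUAD IDs are the ones quoted above.

```python
from collections import defaultdict


def ids_per_author(rows):
    """Map each author to the set of distinct external_author_id strings seen."""
    seen = defaultdict(set)
    for row in rows:
        # Keep the ID as a string so large integers are not silently rounded.
        seen[row["author"]].add(row["external_author_id"].strip())
    return seen


def authors_with_multiple_ids(rows):
    """Return {author: sorted ids} for authors that have more than one ID."""
    return {a: sorted(ids) for a, ids in ids_per_author(rows).items() if len(ids) > 1}


def authors_missing_from_lists(rows, congressional_handles):
    """Return dataset authors with no case-insensitive match in the PDF handles."""
    known = {h.upper() for h in congressional_handles}
    return sorted({row["author"] for row in rows} - known)


# Illustrative rows only; the real input would be the IRAhandle_tweets_*.csv files.
sample_rows = [
    {"author": "4MYSQUAD", "external_author_id": "4036537452"},
    {"author": "4MYSQUAD", "external_author_id": "3312143142"},
    {"author": "CURTISBIGMAN", "external_author_id": "1"},  # placeholder ID
    {"author": "JENNATRAVELLER", "external_author_id": "2"},  # placeholder ID
]
sample_handles = {"4MySquad", "CurtisBigMan"}

print(authors_with_multiple_ids(sample_rows))
print(authors_missing_from_lists(sample_rows, sample_handles))
```

Comparing on upper-cased handles is what lets the original capitalization from the Congressional lists be matched against the all-caps `author` values.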
This is super helpful! Thanks for all the care you've taken here. A major reason we wanted to post all these data is because we knew more eyes would find errors we missed. Nice job. A couple of responses.
- Multiplicity in external_author_id. We think this multiplicity comes from one of two sources (plus one more even more complicated way... see below):
1.1) The first is the rounding problem you mentioned above. Somewhere in the sequence from Twitter -> Social Studio -> CSV -> Stata -> CSV, some very large integers got converted into scientific notation and truncated. If this happened for some tweets and not for others, you could get two account_ids. I need to go back and see if I can reconstruct some of these and will as soon as I can.
1.2) Some accounts came out of Social Studio with multiple external IDs under the same handle, either consecutively (like 4MYSQUAD) or interlaced (like MeggieONeil). For some of those, the follower counts, update counts, and behaviors indicated that these were, in fact, the same account, and we kept both in. For some, there were dramatic changes in stats and/or behavior, and we presumed that we simply had two accounts that shared the same handle, and we tried to include the one with the account number indicated in the Congressional release. When in doubt, we included both to allow users to decide, but care should be taken with these.
- Accounts not on the list. Three sorts of responses.
2.1 As you suspected, we simply overlooked BABCHENKOVA_EVA and KARUCZ_00 when cleaning our data. Our method of gathering included a keyword search using the handle, and those were accidentally swept up. They should be discarded.
2.2 JennaTraveller is a super-interesting case that we went back and forth about including. Jenn_Abrams is one of the best known of the trolls. We believe that JennaTraveller was the first handle of the Jenn_Abrams account. It shares the external_author_id, and the updates and follower counts transition smoothly as the handle changes. In a few cases (which we don't really understand), we were able to trace accounts back to old handles in this way.
Tourettesn is the same situation: it was the opening handle for PigeonToday. (Interestingly, it also used the handle Politweecs, which shows up independently on the list, but with a different external_author_id.) This is yet another way you could get two external_author_ids for the same handle.
We had actually meant to strip these early aliases out before posting the data, but now that it's out there, I have a few more I can post early next week.
2.3 Taraforma is a mixture of 1.2 and 2.2. The account "Taraformation" has three external_author_ids. There were two very early ones that we judged to be sufficiently different from the indicated account (external_author_id=1534083420) that we did not include them in our data. One of those accounts had an alias, Taraforma, that we failed to remove when we removed that version of Taraformation. It is not a troll and should be removed.
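To illustrate the rounding problem described in 1.1, here is a small Python sketch of how an 18-digit ID can turn into two different values once it passes through a floating-point column. The ID below is invented (it is not a real account's ID), and the number of significant digits kept at each step is an assumption chosen to mimic the KRISTINADRUCKER values; the actual export settings in the pipeline may well have differed.

```python
# Hypothetical 18-digit user ID, invented for illustration.
original_id = 715893412345678901

# A 64-bit float has a 53-bit mantissa, so integers above 2**53 cannot all be
# represented exactly; the round trip through float changes the value.
as_float = float(original_id)
round_trip = int(as_float)

# Writing the float with few significant digits (assumed export behaviour)
# discards even more of the ID.
six_digits = f"{as_float:.5e}"
three_digits = f"{as_float:.2e}"

print(original_id > 2**53, round_trip == original_id)
print(six_digits, three_digits)
```

If some tweets went through one formatting step and others through another, the same account would show up under two different-looking IDs, which is the pattern described in 1.1.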
Thanks so much for the detailed response! It definitely clears up the confusion I had about these unusual cases.
In case you find it useful, I should also mention that I’ve added some additional columns to my author summary Google Sheet. For each author, it now shows:
- Number of tweets sent
- The earliest and latest tweet `published_date`
- Average tweets per day (derived from the above)
- Most common `language` classification of author’s tweets
- Number/percent of author’s tweets classified as their primary language
It’s not really anything revolutionary in and of itself. But it may be useful as a jumping-off point for further data exploration (e.g., what are the general characteristics of the most active accounts?, etc.)
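For reference, a rough Python sketch of how per-author columns like these could be computed. This is not the spreadsheet's actual formula: the `published_date` layout (`%m/%d/%Y %H:%M`), the way the daily average is derived, and the `summarize_author` name are all assumptions, and the rows are invented.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout for published_date; adjust if the CSV differs.
DATE_FORMAT = "%m/%d/%Y %H:%M"


def summarize_author(rows):
    """Summarize one author's tweets (rows: dicts with published_date, language)."""
    dates = [datetime.strptime(r["published_date"], DATE_FORMAT) for r in rows]
    first, last = min(dates), max(dates)
    # Count at least one day so single-day accounts do not divide by zero.
    active_days = max((last - first).days, 1)
    language, language_count = Counter(r["language"] for r in rows).most_common(1)[0]
    return {
        "tweets": len(rows),
        "first": first,
        "last": last,
        "tweets_per_day": len(rows) / active_days,
        "primary_language": language,
        "primary_language_share": language_count / len(rows),
    }


# Invented rows for a single hypothetical author.
example = [
    {"published_date": "10/1/2017 19:58", "language": "English"},
    {"published_date": "10/3/2017 08:15", "language": "English"},
    {"published_date": "10/5/2017 19:58", "language": "Russian"},
    {"published_date": "10/5/2017 21:00", "language": "English"},
]
summary = summarize_author(example)
print(summary)
```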
I can confirm that JennaTraveller was Jenn_Abrams's original handle. When her material was still up, you could see cases where users had replied to her and the original handle was shown. Got a screenshot somewhere, I think.
Was never as certain on the Politweecs and PigeonToday thing. There was massive overlap in the handles, but I couldn't tell if it was because one was constantly retweeting the other and then getting replies. At some point, Politweecs definitely became its own thing, and was operational a good bit longer than PigeonToday.
In a related discrepancy, there are a handful of authors whose `new_june_2018` column appears to be incorrect.
The following authors have `new_june_2018` = 0 in the dataset. Their handles are listed in the June 2018 report but not in the November 2017 report, so it seems their `new_june_2018` values should be 1 rather than 0:
- AHMADYUSUFF03
- ANKIDINOVAKIRA
- BORIS_KOLOV
- GLEBUSHKAGLEB
- MASHAEMYASHEVA
- MUZAAMURA
- NATASHAPAVLOVAA
- NEVNOVRU
- NOVOSTIIZHEVSK
- TONYWILLDO
- TURANGALA
- VIZYDYREGUJI
- ZILILINYM
And these two authors have `new_june_2018` = 1 in the dataset, even though their handles are in the November 2017 list:
- MONEYFORM
- VIKA_BERE
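The rule I applied can be sketched in Python as follows. It assumes the flag should be 0 for any handle in the November 2017 list and 1 for a handle that only appears in the June 2018 list; the function names are hypothetical, and the toy lists below are stand-ins rather than the real PDFs (EXAMPLE_HANDLE is invented, and the capitalization shown may differ from the PDFs).

```python
def expected_new_june_2018(handle, nov_2017_handles, june_2018_handles):
    """Return 0 if the handle is in the Nov 2017 list, 1 if it is only in the
    June 2018 list, and None if it is in neither (both sets upper-cased)."""
    key = handle.upper()
    if key in nov_2017_handles:
        return 0
    if key in june_2018_handles:
        return 1
    return None


def flag_mismatches(author_flags, nov_2017_handles, june_2018_handles):
    """List (author, flag_in_dataset, expected_flag) where the two disagree."""
    nov = {h.upper() for h in nov_2017_handles}
    june = {h.upper() for h in june_2018_handles}
    mismatches = []
    for author, flag in sorted(author_flags.items()):
        expected = expected_new_june_2018(author, nov, june)
        if expected is not None and expected != flag:
            mismatches.append((author, flag, expected))
    return mismatches


# Toy stand-ins for the two lists and for the per-author flags in the dataset.
nov_list = {"Example_Handle", "MONEYFORM"}
june_list = {"TONYWILLDO"}
flags = {"EXAMPLE_HANDLE": 0, "MONEYFORM": 1, "TONYWILLDO": 0}

mismatches = flag_mismatches(flags, nov_list, june_list)
print(mismatches)
```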
Another related discrepancy: there are 5 authors that appear in the Nov 2017 House Intel list PDF, but whose `external_author_id` in the dataset does not match the `user id` from the HPSCI list. (Ignoring differences that can be attributed to rounding errors.)
Phew, that was a mouthful. Sorry if that didn’t make sense. Hopefully the data itself is a bit more self-explanatory:
external_author_id (from dataset) | author (from dataset) | user id (from Nov 2017 PDF) | handle (from Nov 2017 PDF) |
---|---|---|---|
914069096 | BLK_VOICE | 4217244274 | Blk_Voice |
2861691030 | GLOED_UP | 3312143142 | gloed_up |
4235669232 | KHALEDBAKRI7 | 4590743632 | khaledbakri7 |
2912754262 | POLITWEECS | 2570631118 | Politweecs |
2753146444 | TODAYCLEVELAND | 890944315396149250 | todaycleveland |
Digging a bit further, for two of these cases, it seems the `external_author_id` is associated with two different `author`s in the dataset. The other three are more alarming: on real-world Twitter, these `external_author_id`s correspond to the Twitter user IDs of accounts that aren’t suspended!
- 914069096
  - In the dataset, this is the `external_author_id` for BLK_VOICE.
  - On Twitter, this is the user ID for the active (i.e., non-suspended) account @VocabExceeded
- 2861691030
  - In the dataset, this is the `external_author_id` for GLOED_UP.
  - On Twitter, this is the user ID for the active account @Iccy_t
- 4235669232
  - In the dataset, this is the `external_author_id` for KHALEDBAKRI7.
  - On Twitter, this is the user ID for the active account @Fuckman73 (😑)
- 2912754262
  - In the dataset, this is the `external_author_id` for POLITWEECS.
  - In the dataset, this is also the `external_author_id` for `author` PIGEONTODAY. (See @mortdtroll’s comment above.)
- 2753146444
  - In the dataset, this is the `external_author_id` for TODAYCLEVELAND.
  - In the dataset, this is also the `external_author_id` for `author` ONLINECLEVELAND.
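The shared-ID cases can be found by grouping in the opposite direction, from `external_author_id` to `author`. A minimal Python sketch, assuming the (ID, author) pairs have already been extracted from the CSVs as strings; `authors_per_id` is a made-up helper name, and the pairs are the ones listed above.

```python
from collections import defaultdict


def authors_per_id(pairs):
    """Group author names by external_author_id, keeping only shared IDs."""
    grouped = defaultdict(set)
    for external_author_id, author in pairs:
        grouped[external_author_id].add(author)
    return {i: sorted(a) for i, a in grouped.items() if len(a) > 1}


# (external_author_id, author) pairs from the cases described above,
# plus one unshared pair for contrast.
pairs = [
    ("2912754262", "POLITWEECS"),
    ("2912754262", "PIGEONTODAY"),
    ("2753146444", "TODAYCLEVELAND"),
    ("2753146444", "ONLINECLEVELAND"),
    ("914069096", "BLK_VOICE"),
]
shared = authors_per_id(pairs)
print(shared)
```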