jonbruner / twitter-analysis

The original dataset for my 2013 article on Twitter's network patterns

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Twitter User Sample

In 2013, I wrote an article called "Tweets Loud and Quiet" in which I argued that, although Twitter is sometimes characterized as a powerfully egalitarian, democratizing force, it actually resembles a traditional broadcast medium with a handful of very influential celebrities on one side and many millions of people with just a couple of followers on the other.

In order to write that article, I scraped metadata for about 400,000 Twitter accounts. Since then, it's been cited in a handful of scholarly papers, and I've gotten regular requests from researchers for access to my original dataset. I'm pleased to share my full dataset here.

This dataset was created between September 17, 2013, and October 19, 2013. Since I built it, Twitter has changed the way it assigns IDs, and a repeat of this project is now impossible (see below for why). Although this data is no longer a contemporary snapshot of Twitter usage patterns, it's still interesting as a way to understand social dynamics.

Background and collection

Until roughly 2015, Twitter gave every account a unique ID from a reasonably compact namespace. When I compiled this data in 2013, existing accounts had unique ID numbers that fell between roughly 0 and 1.9 billion.

I generated this dataset by polling Twitter's API for random account IDs within that range. It turns out that about 63% of integers between 0 and 1.9 billion were associated with accounts; Twitter's ID-assignment process skipped some numbers. Nevertheless, Twitter's ID assignment was strictly monotonic: given two accounts, the one with the higher ID number was always opened later than the other.

One other quirk: Twitter appears to have changed its account ID assignment algorithm around July 2012; before then, it assigned IDs at a density of 86%, and after then at 49%, skipping a large block of potential ID numbers in between.

Around 2015, Twitter switched from its quasi-sequential ID assignment scheme to one that resembles UUID. Now that Twitter accounts are assigned a 64-bit ID generated in part by hashing a timestamp, the namespace you'd need to search in order to repeat this project is 264 = 1.8 x 1019, or roughly as many possible IDs as there are insects on earth. I try to avoid writing things like this, but here I'm sure enough that it'd exceed some fundamental constraints of modern computing: it can't be done.

Fields

In this repo you'll find both a CSV file and an R dataset. They are identical.

Field Remarks
account_id Unique account ID
handle Twitter handle, e.g., @JonBruner
name Full account name, e.g., "Jon Bruner"
description Long account description, e.g., "Person who loves revelatory datasets that speak to eternal social characteristics."
url URL associated with account description
language ISO language code selected by user
location Free-text location provided by user
account_created_at Date and time of account creation
account_created_at_interpolated See below
crawled_at Date and time when account was scraped for inclusion in this dataset
missing 0: account exists
1: account never existed with this ID
2: account closed (by user, or by Twitter for abuse)
protected 0: not protected
1: protected
followers_count Number of Twitter accounts following this account
following_count Number of Twitter accounts that this account follows
statuses_count Number of public posts created by this account
listed_count Number of lists on which this account appears
last_post_id Unique ID of this account's most recent post, as of crawled_at
last_post_text Text of this account's most recent post
last_post_lat Latitude, if available, for this account's most recent post
last_post_lon Longitude, if available, for this account's most recent post
last_post_place_id Twitter place ID, if available, for this account's most recent post
last_post_created_at Date and time of this account's most recent post
time_since_last_post Convenience variable: the difference in hours between crawled_at and last_post_created_at

Interpolated account creation time

For any account_id that I polled, there could be one of three results:

  1. An account associated with the account_id exists
  2. An account associated with the account_id existed at one time, but was closed, either voluntarily by the owner or (more commonly, I imagine) by Twitter for abuse
  3. The account_id has never been assigned to an account

In case 1, I was able to retrieve an account_created_at value. In case 3, the creation datetime is moot. In case 2, it might be interesting to know when the now-closed account was created, but Twitter's API doesn't return a creation datetime. So, I interpolated a creation datetime for each missing account on the assumption that Twitter's ID assignment scheme was roughly linear over small intervals. So, you'll see something like this:

account_id account_created_at account_created_at_interpolated
60453 2006-12-12 13:25:33 2006-12-12 13:25:33
63729 - 2006-12-13 11:39:21
72373 2006-12-15 22:18:44 2006-12-15 22:18:44

Legal

Twitter's developer terms now prohibit the use of its API "to monitor or measure the...usage statistics or results of Twitter Services," including for the compilation of "aggregate Twitter user metrics such as total number of active users, accounts, user engagements, or account engagements." However, they did not include this prohibition when I built the user dataset in 2013.

You may use this dataset for any purpose. If you do so, kindly cite Jon Bruner as the author of the data.

About

The original dataset for my 2013 article on Twitter's network patterns

License:Mozilla Public License 2.0