ramorel / COVID-19-TweetIDs

The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020.


The repository contains an ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. We used Twitter’s search API to gather historical Tweets from the preceding 7 days, so the first Tweets in our dataset date back to January 22, 2020. We used Twitter’s streaming API to follow specified accounts and to collect, in real time, Tweets that mention specific keywords. To comply with Twitter’s Terms of Service, we are publicly releasing only the Tweet IDs of the collected Tweets. The data is released for non-commercial research use.

The paper associated with this repository is: #COVID-19: The First Public Coronavirus Twitter Dataset

Data Organization

The Tweet-IDs are organized as follows:

  • Tweet-ID files are stored in folders named for the year and month of collection (YEAR-MONTH).
  • Each Tweet-ID file contains a collection of Tweet IDs. All file names follow the same structure: the prefix “coronavirus-tweet-id-” followed by YEAR-MONTH-DATE-HOUR.
  • Twitter returns Tweets in UTC, so all Tweet-ID folder and file names are in UTC as well.
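
As a minimal sketch of the naming scheme above, the following Python function builds the relative path of the file that should cover a given moment. The zero-padded date fields, the ".txt" extension, and the function name tweet_id_file_path are assumptions made for illustration; check them against the actual files in the repository.

```python
from datetime import datetime, timedelta, timezone

# Prefix taken from the naming scheme described above.
PREFIX = "coronavirus-tweet-id-"


def tweet_id_file_path(moment: datetime, extension: str = ".txt") -> str:
    """Return the assumed relative path of the Tweet-ID file covering `moment`.

    Zero-padding and the default extension are assumptions, not confirmed
    by the README.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Folder and file names are in UTC, so convert before formatting.
    utc = moment.astimezone(timezone.utc)
    folder = utc.strftime("%Y-%m")                 # YEAR-MONTH
    stamp = utc.strftime("%Y-%m-%d-%H")            # YEAR-MONTH-DATE-HOUR
    return f"{folder}/{PREFIX}{stamp}{extension}"


# A local time late on January 31 (UTC-8) falls in the February folder in UTC.
pacific = timezone(timedelta(hours=-8))
print(tweet_id_file_path(datetime(2020, 1, 31, 20, 0, tzinfo=pacific)))
# prints 2020-02/coronavirus-tweet-id-2020-02-01-04.txt
```

Converting to UTC before formatting matters at month boundaries, where a local timestamp can map to a different folder than expected.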

Notes About the Data

A few notes about this data:

  • We are still processing the more than 50 million Tweets that we have collected. We will incrementally release past Tweet IDs as their files finish processing, and newer Tweet IDs as the data becomes available to us.
  • There may be a few hours of missing data due to technical difficulties. We have done our best to recover as many Tweets as possible from those time frames by using Twitter’s search API.
  • We will keep a running summary of basic statistics as we upload data in each new release.
  • The files keywords.txt and accounts.txt contain the current keywords and accounts, respectively, that we tracked in our data collection. Each keyword and account is followed by the date we began tracking it.
  • Consider using tools such as the Hydrator to rehydrate the Tweet IDs.
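
Before rehydrating, it can help to load the IDs and group them into batches. The sketch below assumes each Tweet-ID file holds one numeric ID per line (not confirmed above) and uses a batch size of 100 as an illustrative default; confirm the limit of whichever lookup tool or API you use. The function names and the sample IDs are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield Tweet IDs from an iterable of lines, skipping blanks and non-numeric lines.

    IDs are kept as strings so they are never altered by numeric conversion.
    Assumes one ID per line.
    """
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isascii() and tweet_id.isdigit():
            yield tweet_id


def batched(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Group IDs into lists of at most `size` items (the default is an assumption)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Made-up sample lines standing in for an open Tweet-ID file.
sample_lines = ["1220000000000000001\n", "\n", "1220000000000000002\n", "1220000000000000003\n"]
for batch in batched(read_tweet_ids(sample_lines), size=2):
    print(batch)
```

In practice, pass an open file object (for example, the result of open(path) on a downloaded Tweet-ID file) in place of sample_lines; each batch can then be handed to the rehydration tool of your choice.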

Data Usage Agreement

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020

Statistics Summary (v1.0)

Number of Tweets: 8,919,411

Language Breakdown

Language      ISO    No. Tweets    % of total Tweets
English       en     5,508,304     61.76%
Spanish       es     1,167,172     13.09%
French        fr       388,481      4.36%
Thai          th       352,902      3.96%
Italian       it       219,572      2.46%
(undefined)   und      208,908      2.34%
Indonesian    in       201,821      2.26%
Portuguese    pt       169,599      1.90%
Japanese      ja       145,985      1.64%
Turkish       tr       134,173      1.50%
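
The percentage column can be reproduced from the counts and the total number of Tweets reported above. The short check below uses three rows from the table; the helper name share_of_total is hypothetical.

```python
# Total and per-language counts copied from the v1.0 statistics above.
TOTAL_TWEETS = 8_919_411
LANGUAGE_COUNTS = {"en": 5_508_304, "es": 1_167_172, "fr": 388_481}


def share_of_total(count: int, total: int = TOTAL_TWEETS) -> float:
    """Return `count` as a percentage of `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


for code, count in LANGUAGE_COUNTS.items():
    print(f"{code}: {share_of_total(count):.2f}%")
# prints:
# en: 61.76%
# es: 13.09%
# fr: 4.36%
```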


If you have technical questions about the data collection, please contact Emily Chen at echen920[at]usc[dot]edu.

If you have any further questions about this dataset please contact Dr. Emilio Ferrara at emiliofe[at]usc[dot]edu.


