tweet_download

This program downloads tweets from the Twitter API v2 using an Academic developer account. All returned information is stored. Note that the API is still evolving, so the code here may stop working in the future.

There are four folders to store downloaded tweets:

raw_tweets: raw data from Twitter, split into the five parts of each response: data, media, places, users, and polls. Each part is stored as an individual .CSV file. Strings in these .CSV files are cleaned by removing newlines and tabs and by replacing commas with semicolons (, becomes ;). For the meaning of each field, please refer to the Twitter API documentation.
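The cleaning rule above can be sketched in a few lines. This is a minimal illustration, not the repository's actual code: `clean_text` is a hypothetical helper name, and turning newlines and tabs into a single space (rather than deleting them outright) is an assumption made so that adjacent words do not fuse.

```python
def clean_text(value):
    """Make a string safe for a plain comma-separated file (hypothetical helper)."""
    if not isinstance(value, str):
        return value  # leave numbers, None, etc. untouched
    # Assumption: newlines and tabs become a space instead of vanishing.
    for ch in ("\r\n", "\n", "\r", "\t"):
        value = value.replace(ch, " ")
    # Commas become semicolons, as described above.
    return value.replace(",", ";")


print(clean_text("Smoke over Ottawa,\nvisibility low\t#wildfire"))
# Smoke over Ottawa; visibility low #wildfire
```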

File name pattern: oldest posted time, latest posted time, tweets count, part.
line_tweets: The five parts are merged together, so each row in these files contains all columns returned by the API. The result of each request is a single file. There are 100 lines (with context_annotation) or 500 lines (without context_annotation) per file. A context_annotation is like a topic assigned to a tweet.
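As a rough sketch of what merging the parts means, the snippet below joins each tweet in `data` with its author from `includes.users` via `author_id`, which is how API v2 responses link the two. The function name `merge_users` and the `user_` column prefix are assumptions for illustration; the repository's own merge logic may differ. Media, places, and polls can be attached in the same way through `attachments.media_keys`, `geo.place_id`, and `attachments.poll_ids`.

```python
def merge_users(response):
    """Attach each tweet's author fields to the tweet row (illustrative only)."""
    includes = response.get("includes", {})
    users = {u["id"]: u for u in includes.get("users", [])}
    rows = []
    for tweet in response.get("data", []):
        row = dict(tweet)
        author = users.get(tweet.get("author_id"), {})
        # Assumption: prefix user columns so they cannot collide with tweet columns.
        for key, val in author.items():
            row["user_" + key] = val
        rows.append(row)
    return rows


sample = {
    "data": [{"id": "1", "text": "hello", "author_id": "10"}],
    "includes": {"users": [{"id": "10", "username": "alice"}]},
}
print(merge_users(sample))
```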

About context_annotation:

Context annotations: Derived from the analysis of a Tweet’s text and will include a domain and entity pairing which can be used to discover Tweets on topics that may have been previously difficult to surface. At present, we’re using a list of 50+ domains to categorize Tweets.
chunks_tweets: Merged line_tweets files, which are easier to manage on Hadoop systems. Each merged .CSV file holds about 50,000 to 500,000 tweets.
cluster_csvs: Converted chunks_tweets files. Some fields are expanded for our Hadoop systems.

How to use

Prerequisite:

Developer account. Put your keys in tweet_api_keys.txt (place this file in the same folder as download_tweet_APIv2.py).

Consumer API Key: copy_your_key_here
Consumer API Secret Key: copy_your_key_here
Bearer Token: copy_your_key_here
Access Token: copy_your_key_here
Access Token Secret: copy_your_key_here
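For orientation, here is one way a file in this "Name: value" layout could be read. This is a sketch under assumptions: `parse_keys` is a hypothetical helper, and the script's real loading code is not shown in this README. Sending the bearer token in an `Authorization: Bearer ...` header is the standard scheme for API v2 app-only requests.

```python
def parse_keys(text):
    """Parse 'Name: value' lines into a dict (hypothetical helper)."""
    keys = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)
        keys[name.strip()] = value.strip()
    return keys


example = """Consumer API Key: copy_your_key_here
Bearer Token: copy_your_key_here"""

keys = parse_keys(example)
# The bearer token goes into the Authorization header of each request.
headers = {"Authorization": "Bearer " + keys["Bearer Token"]}
print(headers)
```

In practice the text would come from the file, for example `open("tweet_api_keys.txt").read()`.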

Packages: tqdm, requests, pandas, vaderSentiment, emoji

Use download_tweet_APIv2.py to download tweets.

Edit the parameters passed to the function execute_download, such as saved_path, query, start_time, end_time, and chunk_size.

To build a query, carefully read Building a query or How to write search queries, and understand the meaning of a space (= AND), OR, "", and (). Then check the downloaded tweets to verify your query. Here is an example: (vaccin OR vaccination OR vaccine OR vaccinate) place_country:AU. It collects tweets posted in Australia that contain any of the words vaccin, vaccination, vaccine, or vaccinate. The parentheses matter: the API combines AND before OR, so without them place_country:AU would apply only to the last keyword.

Other operators and combinations: has:geo, from:cnn OR from:FoxNews, (wildfire OR earthquake). Note that "OR" must be capitalized.
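Because the API evaluates AND (a space) before OR, it is safer to wrap an OR group in parentheses before adding filters such as place_country. The helper below is a small sketch of that habit; `build_query` is a hypothetical name and is not part of download_tweet_APIv2.py.

```python
def build_query(keywords, *filters):
    """Join keywords with OR inside parentheses, then AND the filters on."""
    group = "(" + " OR ".join(keywords) + ")"
    # A space between clauses means AND in the search syntax.
    return " ".join([group, *filters])


print(build_query(["vaccin", "vaccination", "vaccine", "vaccinate"], "place_country:AU"))
# (vaccin OR vaccination OR vaccine OR vaccinate) place_country:AU

print(build_query(["wildfire", "earthquake"], "has:geo"))
# (wildfire OR earthquake) has:geo
```

The resulting string can be passed as the query parameter of execute_download.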

Pay attention to the tweet count reported at the beginning of the run. If it is not what you expect, stop the program and refine the query.

A usage example from download_tweet_APIv2.py:

if __name__ == '__main__':

    query = '(Canada wildfire) OR (wildfire smoke) OR (canada smoke) OR (canada air)'
    saved_path = r'H:\Research\Canada_widefire'  # raw string, so backslashes are not read as escape sequences
    start_time = "2023-05-01T01:00:00Z"
    end_time = "2023-06-13T01:00:00Z"
    execute_download(query=query,
                     saved_path=saved_path,
                     start_time=start_time,
                     end_time=end_time,
                     )

Rate limit

According to the official documentation, 300 requests are allowed per 15 minutes, with up to 500 tweets per request (or 100 tweets with context_annotation). The theoretical maximum is therefore 600,000 tweets per hour. However, because of the heavy conversion work, the program originally managed only about 100 requests per 15 minutes, which is 200,000 tweets per hour.

Updated: after adding sub-processes to merge responses and to convert the tweet chunks, the download speed is now about 500,000 tweets per hour.
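The figures above follow from simple arithmetic, shown here as a small check. The function name `tweets_per_hour` is only illustrative.

```python
def tweets_per_hour(requests_per_window, tweets_per_request, window_minutes=15):
    """Throughput implied by a per-window request budget."""
    windows_per_hour = 60 // window_minutes  # four 15-minute windows per hour
    return requests_per_window * windows_per_hour * tweets_per_request


print(tweets_per_hour(300, 500))  # 600000: the limit without context_annotation
print(tweets_per_hour(300, 100))  # 120000: the limit with context_annotation
print(tweets_per_hour(100, 500))  # 200000: about 100 requests per window
```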

To do

Adaptive sleeping when the rate limit is hit.
Code refining.
Result merging and compressing.
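One possible shape for the adaptive sleeping item, offered only as a sketch and not as existing code: the API reports the remaining request budget and the reset time (in epoch seconds) in the `x-rate-limit-remaining` and `x-rate-limit-reset` response headers. The helper name `seconds_to_wait` and the one-second slack are assumptions.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how long to sleep before the next request (0 if none is needed)."""
    now = time.time() if now is None else now
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("x-rate-limit-reset", now))
    # Assumption: one second of slack so we do not wake just before the reset.
    return max(0.0, reset_at - now + 1)


print(seconds_to_wait({"x-rate-limit-remaining": "0",
                       "x-rate-limit-reset": "1060"}, now=1000))
```

With the requests package, this could be used as `time.sleep(seconds_to_wait(response.headers))` after each call.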
