deepakuniyaliit / Covid19IRLTDataset

We have included 12 Indian languages in our dataset, IRLCov19 which also includes location information with a subset of tweets depending on the availability of information. We have also analyzed the dataset to compute the local or regional influencers or leaders on the basis of various influencing measures such as followers, retweet count, favorite count, and a number of mentions.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

COVID-19 Indian Regional Languages Tweets Dataset

The repository contains a collection of Indian regional languages tweets IDs related to novel coronavirus COVID-19. The dataset contains Tweets ids starting from February 2020 to July 2020. The Twitter search API was used to gather real-time tweets that contained specific keywords in the 12 languages. The languagues for which dataset is provided are Hindi (hi), Tamil (ta), Urdu (ur), Marathi (mr), Telugu (te), Gujarati (gu), Kannada (kn), Bengali (bn), Oriya (or), Malayalam (ml), Punjabi (pa), Sindhi (sd) in the decreasing order of number of tweets. To comply with Twitter’s Terms of Service, only ids of the tweets are provided in this dataset which can be used for non-commercial research purpose only.

Data Organization

  • As of Mar 26, 2021 we have tweets starting from February 01, 2020 to July 31, 2020.
  • Tweet-ID files are stored month wise inside directories of Indian Regional languages which are named as ISO 2 Alphabets Code of languages.
  • The Tweet-ID files contain the tweets ids where all files have similar structure iso2_Month_Date_Year.txt.

Dataset Collection

  • All the tweets irrespective of any particular language were collected from February 2020 to July 2020 based on some keywords and hashtags which are provided in the file keywords.txt.
  • For retrieving, the full object of the tweet consider the following tools Hydrator and twarc.

Licensing

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0).By using this dataset , you agree to the terms of the LICENSE, and to all Twitter’s Terms of Service, and cite our papers:

Contact

If you have any feedback, suggestions or queries, please do reach out to deepakuniyal(AT)geu(dot)ac(dot)in

About

We have included 12 Indian languages in our dataset, IRLCov19 which also includes location information with a subset of tweets depending on the availability of information. We have also analyzed the dataset to compute the local or regional influencers or leaders on the basis of various influencing measures such as followers, retweet count, favorite count, and a number of mentions.


Languages

Language:Shell 100.0%