Please follow the instructions below.
Install all required packages at once:
pip install -r requirements.txt
Alternatively, install each package individually:
- pandas:
pip install pandas
- twint:
pip install twint
- sentence-transformers:
pip install sentence-transformers
- tweepy:
pip install tweepy
- langdetect:
pip install langdetect
- emoji:
pip install emoji
- ftfy:
pip install ftfy
Assume that we are interested in identifying misinformation about COVID-19 and antibiotics.
- Run crawl.py to download public tweet replies that satisfy a query comprising context-specific keywords.
  - You should change the query, start date, and end date, depending on your interest.
  - This script returns {query}_{start date}_{end date}.csv in the /dat folder.
  - As Twitter often blocks repeated requests, the twint library offers an option, Resume, to resume a search from the last scroll ID. crawl.py saves resume.txt for this purpose. If the script stops, wait a few minutes and run it again. Do not delete resume.txt until you get the intended data.
  - If you want to run the script for a new query, delete the existing resume.txt.
  - We preset the query for collecting tweet replies posted on March 1, 2020.
- Run preprocess.py, which returns two files, replies.csv and non_replies.csv, in the ./res folder. These files consist of four columns: id, user_id, created_at, and tweet. Mentions, emojis, and URLs in body texts are removed.
  - As Microsoft Excel does not fully recognize tweet and user IDs, we intentionally prepend "_" to IDs and created_at.
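The cleaning described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in preprocess.py: the function names `clean_tweet` and `excel_safe` are hypothetical, and the emoji pattern is an assumed approximation that covers only common pictograph blocks (the repository itself lists the emoji package as a dependency).

```python
import re

# Assumed patterns; the actual rules in preprocess.py may differ.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")
# Rough emoji coverage (assumption): common pictograph blocks only.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F]"
)


def clean_tweet(text):
    """Remove URLs, mentions, and emojis, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def excel_safe(value):
    """Prefix "_" so that Excel keeps long IDs and timestamps as text."""
    return "_" + str(value)


cleaned = clean_tweet("@who Antibiotics cure it! https://t.co/abc \U0001F637")
safe_id = excel_safe(1234076122381418496)
```

URLs are removed before mentions so that an "@" inside a link does not leave a partial URL behind.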
- Run calculate_similarity.py, which returns the file ./res/replies_with_sims.csv. A new column "sims" is added to the data of replies.csv. This column stores the cosine similarity between the representation vector of each reply and the vector of the official advice that we set as a reference of accurate information. Representation vectors are computed with the Sentence-BERT model (Reimers & Gurevych, 2019). You should change the official advice in calculate_similarity.py.
  - We set the official advice related to COVID-19 and antibiotics from the WHO (visit https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public/myth-busters).
  - Warning messages may be shown, such as "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:467: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)])". If you encounter such warning messages, just ignore them.
  - You can observe that replies of high similarity have a context similar to the official advice defined in calculate_similarity.py.
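To make the "sims" column concrete, here is a minimal sketch of cosine similarity in plain Python. The toy three-dimensional vectors stand in for Sentence-BERT embeddings, which are much longer in practice. The names `cosine_similarity`, `advice_vec`, and `reply_vecs` are hypothetical, and returning 0.0 for a zero vector is a convention chosen for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for Sentence-BERT embeddings (assumption).
advice_vec = [1.0, 2.0, 0.0]
reply_vecs = {
    "reply_a": [2.0, 4.0, 0.0],  # same direction as the advice
    "reply_b": [0.0, 0.0, 3.0],  # orthogonal to the advice
}
sims = {name: cosine_similarity(vec, advice_vec) for name, vec in reply_vecs.items()}
```

A reply pointing in the same direction as the advice vector scores close to 1, while an orthogonal one scores 0, regardless of vector length.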
- Save your Twitter API credentials in config.py.
  - Consumer API key as ckey in ./config.py
  - Consumer API secret key as csec in ./config.py
  - Access token as akey in ./config.py
  - Access token secret as asec in ./config.py
  - config.py should have four lines as follows.
    - ckey=''
    - csec=''
    - akey=''
    - asec=''
- Run collect_parents.py, which saves JSON files for the parents of selected replies. As the Twitter API has a rate limit on looking up a specific tweet ID, collecting the parents of all replies takes a long time. For this reason, we recommend collecting parents for only a subset of replies of high similarity. The size of the subset can be set with -n.
  - For example, if you run python collect_parents.py -n 10, only the parents of the top 10 replies in terms of similarity will be obtained.
  - JSON files will be stored in the folder ./parents.
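The subset selection controlled by -n can be sketched as below. This is an illustrative standalone example rather than the code in collect_parents.py: `parse_args` and `top_replies` are hypothetical names, the default of 10 is an assumption, and the rows are made-up values in the shape of replies_with_sims.csv.

```python
import argparse
import heapq


def parse_args(argv=None):
    """Parse the -n option (the default of 10 is an assumption)."""
    parser = argparse.ArgumentParser(description="Select the most similar replies.")
    parser.add_argument("-n", type=int, default=10, help="number of top replies to keep")
    return parser.parse_args(argv)


def top_replies(rows, n):
    """Return the n rows with the largest 'sims' value, highest first."""
    return heapq.nlargest(n, rows, key=lambda row: float(row["sims"]))


# Made-up rows; values read from a CSV file arrive as strings.
rows = [
    {"id": "_1", "sims": "0.12"},
    {"id": "_2", "sims": "0.87"},
    {"id": "_3", "sims": "0.93"},
    {"id": "_4", "sims": "0.40"},
]
args = parse_args(["-n", "2"])
selected = top_replies(rows, args.n)
```

Limiting the lookup to the top n replies keeps the number of rate-limited API requests small.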
- Run merge.py to concatenate tweet replies and their parents into a dataframe saved as ./res/merged.csv. Now, it is time to examine whether misinformation about COVID-19 exists in the parents of replies that have a context similar to accurate information.
  - Self-replies are excluded, as we expect volunteer fact checkers to correct other users' posts containing misinformation.
  - To obtain better results, parents related to COVID-19 and antibiotics should be examined. Searching for parents that contain context-specific keywords may help to reduce the search space.
  - An example tweet containing misinformation about COVID-19 and antibiotics (ID = 1234076122381418496) claimed that antibiotics work against COVID-19 because the new coronavirus is just a flu virus.
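The merge with self-reply exclusion can be sketched as follows. This is a simplified illustration with plain dictionaries, not the code in merge.py: the function name and the field names (`parent_id`, `reply_id`, `parent_tweet`, and so on) are hypothetical, and the records are made up.

```python
def merge_replies_with_parents(replies, parents):
    """Pair each reply with its parent tweet, skipping self-replies."""
    parents_by_id = {parent["id"]: parent for parent in parents}
    merged = []
    for reply in replies:
        parent = parents_by_id.get(reply["parent_id"])
        if parent is None:
            continue  # parent was not collected
        if parent["user_id"] == reply["user_id"]:
            continue  # self-reply: same author wrote parent and reply
        merged.append({
            "reply_id": reply["id"],
            "reply_tweet": reply["tweet"],
            "parent_id": parent["id"],
            "parent_tweet": parent["tweet"],
        })
    return merged


# Made-up records for illustration only.
replies = [
    {"id": "r1", "user_id": "u1", "parent_id": "p1", "tweet": "Antibiotics do not work on viruses."},
    {"id": "r2", "user_id": "u2", "parent_id": "p2", "tweet": "Adding to my earlier post."},
    {"id": "r3", "user_id": "u3", "parent_id": "p9", "tweet": "See the WHO advice."},
]
parents = [
    {"id": "p1", "user_id": "u9", "tweet": "Antibiotics cure it."},
    {"id": "p2", "user_id": "u2", "tweet": "My earlier post."},
]
merged = merge_replies_with_parents(replies, parents)
```

Here only the first reply survives: the second is a self-reply and the third has no collected parent.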
Reimers, N., & Gurevych, I. (2019, November). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3973-3983).