cabrau / FakeWhatsApp.Br

An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.

FakeWhatsApp.Br

An annotated corpus of anonymized WhatsApp messages from public groups, written in Brazilian Portuguese (PT-BR), for automatic detection of textual misinformation and malicious users. For details on how the corpus was built and on the experiments run with it, see our paper published at the ICEIS 2021 conference:

Cabral, Lucas, et al. "Fakewhastapp. br: NLP and machine learning techniques for misinformation detection in brazilian portuguese whatsapp messages." Proceedings of the 23rd International Conference on Enterprise Information Systems, ICEIS. 2021.

If you use our corpus, please cite the corresponding paper. For further discussion and experiments, see my master's thesis (in Portuguese): https://repositorio.ufc.br/handle/riufc/63379

Data

The data collected during the 2018 Brazilian presidential elections is located at:

  • data/2018/fakeWhatsApp.BR_2018.csv

The data is stored in a CSV file, where each line is a message sent in a public group. The dictionary of variables is the following:

  • id: unique ID of a user
  • date: day of the year that the message was sent
  • ddi: international dialing code (country calling code)
  • country: country assigned to the ddi
  • country_iso3: ISO3 code of country
  • ddd: Brazilian regional telephone area code
  • state: Brazilian state
  • midia: boolean variable indicating if the message is a media file (1) or not (0)
  • url: boolean variable indicating whether the message contains a URL (1) or not (0)
  • characters: number of characters in the message's text
  • words: number of words in the message's text
  • viral: boolean variable indicating whether a message with exactly the same text and more than 5 words appears in the corpus (1) or not (0). The viral messages were the ones manually labelled.
  • shares: number of times that a message with exactly the same text appears in the corpus
  • text: textual content of message
  • misinformation: manually assigned label indicating whether the message contains misinformation (1) or not (0). The value -1 means that the message was not labelled.
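
As a rough illustration of how the variables above might be used, the sketch below reads rows with Python's standard csv module and separates labelled from unlabelled messages. It assumes the file has a header row with the column names from the dictionary, that it is comma-delimited, and that 0 marks a labelled message without misinformation. The helper names (read_messages, label_of, labelled_only, label_counts) and the sample rows are made up for this example and are not part of the repository.

```python
import csv
import io
from collections import Counter

# Path as given in the README; adjust it to your local checkout.
CSV_PATH = "data/2018/fakeWhatsApp.BR_2018.csv"

# Assumed meaning of the values in the `misinformation` column.
LABEL_NAMES = {1: "misinformation", 0: "not misinformation", -1: "unlabelled"}


def read_messages(handle):
    """Return one dict per message from an open CSV file handle."""
    # Assumption: the first row is a header with the column names.
    return list(csv.DictReader(handle))


def label_of(row):
    """Parse the label as an int, tolerating values such as '1' or '1.0'."""
    return int(float(row["misinformation"]))


def labelled_only(rows):
    """Keep only the messages that carry a manual label (0 or 1)."""
    return [row for row in rows if label_of(row) in (0, 1)]


def label_counts(rows):
    """Count messages per label name."""
    return Counter(LABEL_NAMES.get(label_of(row), "unknown") for row in rows)


# Made-up sample with only a subset of the columns, so the sketch runs
# without the real file.
SAMPLE = """id,viral,shares,text,misinformation
u1,1,2,exemplo de mensagem viral com mais de cinco palavras,1
u2,1,2,exemplo de mensagem viral com mais de cinco palavras,1
u3,0,1,bom dia,-1
u4,1,3,outra mensagem longa que foi verificada como verdadeira,0
"""

rows = read_messages(io.StringIO(SAMPLE))
labelled = labelled_only(rows)
counts = label_counts(rows)
print(len(rows), len(labelled), dict(counts))
```

To run the same helpers on the real corpus, open the file first, for example with open(CSV_PATH, newline="", encoding="utf-8"), and pass the handle to read_messages. The UTF-8 encoding is also an assumption.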

Notebooks

  • 1 - parser.ipynb
    This notebook parses the data collected in WhatsApp groups, converting from free text format to structured data in a CSV table.

  • 2 - labeling and anonymization.ipynb
    This notebook propagates the labels manually annotated on the viral messages to the entire corpus and removes personal data, such as phone numbers, from the text.

  • 3 - exploratory analysis.ipynb
    Exploration and visualization of the data set.

  • 4 - compare corpora.ipynb
    Comparison with a Twitter fake news corpus to demonstrate the need for a corpus of WhatsApp texts.

  • 5 - misinformation detection ml.ipynb
    Experiments with classical machine learning models to classify textual misinformation.

  • 6 - deep learning char level cnn.ipynb
    Experiments with a character level convolutional neural network to classify textual misinformation.

  • 7 - user features.ipynb
    Exploiting user features to detect misinformation.

  • 8 - user classification.ipynb
    Experiments on classifying users as superspreaders.

  • 9 - automatic dataset expansion.ipynb
    Experiments with automatic expansion of the dataset using cosine similarity.

  • 10 - user credibility.ipynb
    Modeling user credibility to improve misinformation detection.
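
To give a feel for the idea behind notebook 9, here is a minimal sketch of label expansion with cosine similarity. It is not the notebook's code. The bag-of-words representation, the 0.8 threshold, the function names, and the sample texts are all assumptions chosen for illustration. An unlabelled text receives the label of its most similar labelled text only when the similarity reaches the threshold.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_labels(labelled, unlabelled, threshold=0.8):
    """Propose labels for unlabelled texts that closely match a labelled one.

    labelled: list of (text, label) pairs; unlabelled: list of texts.
    Returns a list of (text, proposed_label, similarity) tuples.
    """
    vectors = [(bag_of_words(text), label) for text, label in labelled]
    proposals = []
    for text in unlabelled:
        vec = bag_of_words(text)
        best_label, best_sim = None, 0.0
        for other, label in vectors:
            sim = cosine_similarity(vec, other)
            if sim > best_sim:
                best_label, best_sim = label, sim
        # Hypothetical cut-off: only near-duplicates inherit a label.
        if best_sim >= threshold:
            proposals.append((text, best_label, best_sim))
    return proposals


# Made-up example texts, not taken from the corpus.
labelled = [
    ("urgente compartilhe antes que apaguem este video", 1),
    ("reuniao do grupo amanha as dez horas", 0),
]
unlabelled = [
    "URGENTE compartilhe antes que apaguem este video agora",
    "bom dia a todos",
]

proposals = expand_labels(labelled, unlabelled)
for text, label, sim in proposals:
    print(label, round(sim, 3), text)
```

In practice the proposals would still need manual review, since a high lexical overlap does not guarantee that two messages make the same claim.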

License: GNU General Public License v3.0


Languages

Jupyter Notebook 65.1%, HTML 34.8%, Python 0.1%