FreSaDa: The French Satire Data Set

The FreSaDa data set contains regular and satirical samples of text collected from the French news domain.

1. License Agreement

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial — You may not use the material for commercial purposes.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

2. Citation

[1] Radu Tudor Ionescu, Adrian Gabriel Chifu. FreSaDa: A French Satire Data Set for Cross-Domain Satire Detection. In: The International Joint Conference on Neural Network, IJCNN 2021 (2021). (link to article)

BibTeX citation:

@inproceedings{IonescuChifu2021IJCNN,
author = {Ionescu, Radu-Tudor and Chifu, Adrian-Gabriel},
title = {FreSaDa: A French Satire Data Set for Cross-Domain Satire Detection},
year = {2021},
booktitle = {The International Joint Conference on Neural Network, IJCNN 2021},
series = {IJCNN2021}
}

3. Description

General Information

FreSaDa, the French Satire Data Set, is composed of 11,570 articles from the newsdomain. The news articles are of two types: satirical and regular. Two possible tasks may be considered on FreSaDa:

Cross-domain binary classification of full news articles into regular versus satirical examples
Cross-domain binary classification of headlines into regular versus satirical examples

The data set is divided into three subsets:

training (8,716 samples)
testing (2,854 samples)

Each sample contains the news article's title and text, as well as the corresponding label.

Data Organization

The data set is divided in two folders, train and test, corresponding to the two subsets for training and testing. In each folder there is a subfolder entitled texts, containing the texts of the news articles. Each folder also contains a file called summary.tsv:

The summary.tsv files contain one sample per row. The format of each row is the following:

text_id_1   title_1   label_1
text_id_2   title_2   label_2
...
text_id_n   title_n   label_n

where:

text_id is the filename for the file containing the article's text (from the texts folder)
title is the article's title
label is the label for the sample's corresponding class

The labels are associated as follows:

1 => Satiric News
-1 => Regular News

If the experiments require a validation subset, the test subset may be divided into two equal parts:

the samples with odd rank (1st, 3rd, 5th, ...) for validation (1,427 samples)
the samples with even rank (2nd, 4th, 6th, ...) for testing (1,427 samples)

4. Website

The FreSaDa data set is available at: https://github.com/adrianchifu/FreSaDa

adrianchifu / FreSaDa