Predicting Newsworthiness

This repository contains the data needed to replicate the findings of our article "From Crowd Ratings to Predictive Models of Newsworthiness to Support Science Journalism", published in the Proceedings of the ACM on Human-Computer Interaction and presented at CSCW 2022.

`train.json`

Crowdsourced dataset of ratings for the news values of different arXiv articles (n=500). Used to train the Extra Trees model. Please refer to Section 5.1 of our paper for details about model training with this data. Contains the following fields:

arxiv_id: Unique identifiers for arXiv articles, sourced from arXiv API.
arxiv_url: URLs for arXiv articles, sourced from arXiv API.
title: Titles for arXiv articles, sourced from arXiv API.
summary: Abstracts for arXiv articles, sourced from arXiv API.
published: Date of publication for arXiv articles, sourced from arXiv API.
authors: Authors for arXiv articles, sourced from arXiv API.
arxiv_primary_category: Author-provided primary category for arXiv articles, sourced from arXiv API.
readability: Readability score for the article's summary field, assigned by De-Jargonizer and scaled to the range 0-1.
actuality: Score for actuality news value, assigned by MTurk crowdworkers, range 1-5.
controversy: Score for controversy news value, assigned by MTurk crowdworkers, range 1-5.
relevance_magnitude: Score for relevance_magnitude news value, assigned by MTurk crowdworkers, range 1-5.
relevance_valence: Score for relevance_valence news value, assigned by MTurk crowdworkers, range 1-5.
newsworthiness_crowd_sum: Average of the four news values (actuality, controversy, relevance_magnitude, relevance_valence), range 1-5. Binarized at the value of 3 for training the newsworthiness classification model.
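
As a rough illustration of how newsworthiness_crowd_sum relates to the four news values, the sketch below averages them for a single record and binarizes the result at 3. The helper names (`crowd_score`, `binarize`) and the example ratings are hypothetical, and treating a score of exactly 3 as the positive class is an assumption; Section 5.1 of the paper is the authority on the actual procedure.

```python
from statistics import mean

# Field names of the four news values, as listed above.
NEWS_VALUES = ("actuality", "controversy", "relevance_magnitude", "relevance_valence")


def crowd_score(record):
    """Average the four news-value ratings of one article record."""
    return mean(record[name] for name in NEWS_VALUES)


def binarize(score, threshold=3.0):
    """Map a 1-5 score to a 0/1 label.

    ASSUMPTION: scores at or above the threshold are labeled 1; the
    handling of a score of exactly 3 is not specified in this README.
    """
    return int(score >= threshold)


# Hypothetical record with made-up ratings, for illustration only.
example = {
    "actuality": 4,
    "controversy": 2,
    "relevance_magnitude": 3.5,
    "relevance_valence": 3.5,
}

score = crowd_score(example)
label = binarize(score)
print(score, label)
```

The same two helpers could be mapped over the records loaded from train.json once the layout of that file (for example, a list of records versus a column-oriented object) has been checked.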

`validate.json`

Crowdsourced dataset of ratings for the news values of different arXiv articles (n=55). Also contains expert evaluations of newsworthiness for this data. Used to evaluate the Extra Trees model. Please refer to Section 5.2 of our paper for details on findings.

In addition to the fields found in `train.json`, this data also contains the following:

nw_expert1: Score for newsworthiness assigned by expert 1, range 1-5.
nw_expert2: Score for newsworthiness assigned by expert 2, range 1-5.
newsworthiness_expert: Average of both experts' ratings for newsworthiness, range 1-5.
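
A minimal sketch of how the expert fields fit together: it recomputes the average of the two expert ratings and checks it against the stored newsworthiness_expert value. The function names, the tolerance, and the example record are hypothetical and not taken from the dataset.

```python
import math


def expert_average(record):
    """Average the two expert newsworthiness ratings of one record."""
    return (record["nw_expert1"] + record["nw_expert2"]) / 2


def is_consistent(record, tol=1e-6):
    """Check that the stored expert average matches the recomputed one.

    ASSUMPTION: the stored value is the unrounded mean; loosen `tol`
    if the file stores rounded values.
    """
    return math.isclose(
        expert_average(record), record["newsworthiness_expert"], abs_tol=tol
    )


# Hypothetical record with made-up ratings, for illustration only.
example = {"nw_expert1": 4, "nw_expert2": 3, "newsworthiness_expert": 3.5}

print(expert_average(example), is_consistent(example))
```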
