ML and NLP paper data

This repository contains the data crawled and processed for the post series on ML and NLP publications.

The project was created by Marek Rei (@MarekRei). The country annotation was contributed by Jonas Pfeiffer (@PfeiffJo) and Andrew Caines (@cainesap).

Conference proceedings

The papers directory contains JSON files for each of the crawled conferences. Open one of the files to see the available metadata.
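
The metadata fields are not listed here, so one quick way to see them is to print the top-level structure of each file. The snippet below is a minimal sketch, not part of the repository: the function name summarize_json_files is hypothetical, and it assumes that the files end in .json, sit directly inside the papers directory, and each hold a single JSON document.

```python
import json
from pathlib import Path


def summarize_json_files(directory="papers"):
    """Return {file name: (top-level type, field names found)} for each JSON file.

    Assumption: every *.json file directly inside `directory` is one JSON document.
    """
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, dict):
            # Top-level object: report up to ten of its keys.
            summary[path.name] = ("dict", sorted(data)[:10])
        elif isinstance(data, list):
            # Top-level array: report the keys of the first record, if it is an object.
            first = data[0] if data else None
            keys = sorted(first) if isinstance(first, dict) else []
            summary[path.name] = ("list", keys)
        else:
            summary[path.name] = (type(data).__name__, [])
    return summary


if __name__ == "__main__":
    for name, (kind, keys) in summarize_json_files().items():
        print(f"{name}: top-level {kind}, fields: {keys}")
```

Run from the repository root, this prints one line per file, which is usually enough to decide which fields to read in a longer script.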

annotated_orgs.tsv contains the following columns in tab-separated format:

id
org_name - the name of the organization, as crawled
paper_count - the number of papers that matched that name, after initial processing
is_org - manually annotated field, indicating whether this is an actual organization or crawling noise
canonical_org_name - a canonical name for this organization, to match together different versions
country - manually annotated country name for each organization
example1 - an example paper where this organization was crawled from
example2 - another example
example3 - another example
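
As an illustration of how these columns fit together, the sketch below reads the file and sums paper_count over all name variants that share a canonical_org_name. It is written under stated assumptions rather than taken from the project: the function names load_orgs and papers_per_canonical_org are hypothetical, the file is assumed to start with a header row (pass has_header=False if it does not), and rows without a canonical name fall back to the crawled org_name.

```python
import csv
from collections import defaultdict

# Column order as listed in this README.
COLUMNS = [
    "id", "org_name", "paper_count", "is_org", "canonical_org_name",
    "country", "example1", "example2", "example3",
]


def load_orgs(path, has_header=True):
    """Read the TSV into a list of dicts keyed by the column names above.

    Assumption: the first line is a header row unless has_header=False.
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the header row
        for fields in reader:
            # Pad short rows so every column name is present.
            fields = fields + [""] * (len(COLUMNS) - len(fields))
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


def papers_per_canonical_org(rows):
    """Sum paper_count over all crawled names mapped to the same canonical name."""
    totals = defaultdict(int)
    for row in rows:
        name = row["canonical_org_name"] or row["org_name"]
        count = row["paper_count"]
        if count.isdigit():  # ignore rows with a missing or non-numeric count
            totals[name] += int(count)
    return dict(totals)


if __name__ == "__main__":
    totals = papers_per_canonical_org(load_orgs("annotated_orgs.tsv"))
    for name, count in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{count}\t{name}")
```

The sketch does not filter on is_org, because the values used in that column are not documented here; check the file before excluding crawling noise.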

This dataset is made available under the CC BY-NC 4.0 license.