marekrei / ml_nlp_paper_data

Dataset of ML and NLP papers

ML and NLP paper data

This repository contains the data crawled and processed for the post series on ML and NLP publications.

The project was created by Marek Rei (@MarekRei). The country annotation was contributed by Jonas Pfeiffer (@PfeiffJo) and Andrew Caines (@cainesap).

Conference proceedings

The papers directory contains JSON files for each of the crawled conferences. Open any of them to see the available metadata.
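Since the metadata fields are not listed here, one way to explore them is to load each file and print its top-level shape. The following is a minimal sketch, not part of the repository: the helper names `describe` and `summarize_directory` are hypothetical, and it assumes that each file in papers is a single JSON document (not JSON Lines) with a `.json` extension.

```python
import json
from pathlib import Path


def describe(data):
    """Return a short description of a parsed JSON value's top-level shape."""
    if isinstance(data, dict):
        # JSON object: report how many keys it has and show up to ten of them.
        return {"type": "object", "size": len(data), "keys": sorted(data)[:10]}
    if isinstance(data, list):
        # JSON array: if the first element is an object, show its keys.
        first = data[0] if data else None
        keys = sorted(first) if isinstance(first, dict) else []
        return {"type": "array", "size": len(data), "keys": keys[:10]}
    # Any other value (string, number, null, ...).
    return {"type": type(data).__name__, "size": 1, "keys": []}


def summarize_directory(directory="papers"):
    """Describe every *.json file in the given directory (assumed layout)."""
    summaries = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            summaries[path.name] = describe(json.load(handle))
    return summaries


if __name__ == "__main__":
    for name, summary in summarize_directory().items():
        print(name, summary)
```

Run from the repository root, this prints one line per file, which should be enough to see which fields are available before writing any conference-specific code.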

Country annotation

annotated_orgs.tsv contains the following columns in tab-separated format:

  • id
  • org_name - the name of the organization, as crawled
  • paper_count - the number of papers that matched that name, after initial processing
  • is_org - manually annotated field, indicating whether this is an actual organization or crawling noise
  • canonical_org_name - a canonical name for this organization, to match together different versions
  • country - manually annotated country name for each organization
  • example1 - an example paper from which this organization was crawled
  • example2 - another example
  • example3 - another example
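As an illustration of how the columns above could be used, the sketch below sums paper_count per country. It is a hedged example rather than the project's own processing code: the function name `papers_per_country` is hypothetical, the presence of a header row is an assumption (pass `has_header=False` otherwise), and the sample rows, including the `1`/`0` values in the is_org column, are made up for demonstration.

```python
import csv
import io
from collections import defaultdict

# Column order as listed in the README.
COLUMNS = ["id", "org_name", "paper_count", "is_org", "canonical_org_name",
           "country", "example1", "example2", "example3"]


def papers_per_country(handle, has_header=True):
    """Sum paper_count per country from a tab-separated file-like object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: the first row holds column names
    totals = defaultdict(int)
    for row in reader:
        record = dict(zip(COLUMNS, row))
        country = (record.get("country") or "").strip()
        count = (record.get("paper_count") or "").strip()
        # Skip rows without a country (e.g. crawling noise) or a numeric count.
        if not country or not count.isdigit():
            continue
        totals[country] += int(count)
    return dict(totals)


# Made-up rows that only mimic the documented column layout.
SAMPLE = (
    "id\torg_name\tpaper_count\tis_org\tcanonical_org_name\tcountry\texample1\texample2\texample3\n"
    "1\tExample University\t12\t1\tExample University\tExampleland\t\t\t\n"
    "2\tExample Univ.\t3\t1\tExample University\tExampleland\t\t\t\n"
    "3\tSample Institute\t5\t1\tSample Institute\tSamplestan\t\t\t\n"
    "4\tpage 7 of 9\t2\t0\t\t\t\t\t\n"
)

if __name__ == "__main__":
    print(papers_per_country(io.StringIO(SAMPLE)))
```

To run it on the real file, open it with `open("annotated_orgs.tsv", encoding="utf-8", newline="")` and pass the handle in place of the `io.StringIO` object. Grouping by canonical_org_name instead of country follows the same pattern.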

License

This dataset is made available under the CC BY-NC 4.0 license.
