Dataset recognition resources

Original resources

Resources and their install paths

Dataseer corpus (dataseer/), biomedicine domain, focusing on the identification of data sentences, with annotations of implicit/explicit data mentions, data types and data acquisition devices (but no annotation of explicit dataset names), non-public
https://github.com/xjaeh/ner_dataset_recognition (ner_dataset_recognition/), IR/ML/NLP domain, only explicitly named and reused datasets
https://www.kaggle.com/datasets/panhuitong/dmdd-corpus (dmdd/) is close to the previous one (Heddes et al., 2021), same IR/ML/NLP domain, only explicitly named and reused datasets, 450 manually annotated articles, but false negatives were not manually corrected
oddpub dataset https://osf.io/yv5rx/ (oddpub-dataset/), biomedicine domain, only article screening (no annotation), only datasets with open access statements, only explicit datasets
transparency-indicators dataset https://osf.io/e58ws/ (transparency-indicators-dataset/), biomedicine domain, only article screening (no annotation)
Coleridge corpus (coleridge/), partial (only a very small subset of named "datasets" is considered), no explicit annotation, no valid definition of a dataset (e.g. research initiative names are considered as "datasets")
SciREX, a dataset of 438 annotated arXiv documents, ML domain only, with identification of named datasets (the label is "Material"), see https://github.com/allenai/SciREX (reported inter-annotator agreement on 5 documents is an average Cohen's κ of 95%). One drawback is that the text is pre-tokenized, which is destructive: the original delimiters are lost and the original text cannot be reconstructed.
EneRex (https://github.com/DiscoveryAnalyticsCenter/EneRex) has data sentences and dataset/software annotations (Brat format) for 147 full-text files, but covers only the arXiv computer science domain and only named datasets/software.
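
The pre-tokenization drawback noted for SciREX above can be illustrated with a minimal Python sketch. It assumes a plain whitespace tokenizer as a stand-in (the actual SciREX tokenization is not described here) and uses made-up example sentences. Two different source strings yield the same token list, so character offsets into the original text cannot be recovered from the tokens alone.

```python
# Illustration only: a whitespace tokenizer stands in for whatever
# tokenization a pre-tokenized corpus used (an assumption).
def tokenize(text):
    """Split on any whitespace run, discarding the delimiters."""
    return text.split()

# Two hypothetical source strings that differ only in their delimiters.
original_a = "We evaluate on  SQuAD\nand MNIST."
original_b = "We evaluate on SQuAD and MNIST."

tokens_a = tokenize(original_a)
tokens_b = tokenize(original_b)

# Identical tokens: the difference between the two sources is lost.
same_tokens = tokens_a == tokens_b

# Re-joining with single spaces reproduces only one of the two originals,
# so offsets computed on the rebuilt text may not match the source.
rebuilt = " ".join(tokens_a)
offset_in_source = original_a.find("SQuAD")
offset_in_rebuilt = rebuilt.find("SQuAD")
```

This is why a common format based on character span offsets over the untouched text is preferable to token indices when combining corpora.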

Assemble resources

Survive in the python dependency marshlands:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install dependencies:

pip3 install -r requirements.txt

Assemble resources in the same JSON format:

python3 assemble.py --output combined/

This will create, under combined/, one JSON file per original corpus, all in the same JSON format using span offsets.
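
Span offsets make it cheap to sanity-check the assembled files. The following Python sketch assumes a hypothetical layout in which each document carries a `text` string and an `annotations` list whose entries have `start`, `end` and `text` fields. These field names and the end-exclusive character offsets are illustrative assumptions, not the documented output of assemble.py, and should be adapted to the actual files under combined/.

```python
import json

def find_bad_spans(document):
    """Return annotations whose offsets do not select their stated text.

    Field names ("text", "annotations", "start", "end") are assumed
    for illustration; adjust them to the real JSON layout.
    """
    text = document["text"]
    bad = []
    for ann in document.get("annotations", []):
        start, end = ann["start"], ann["end"]
        # Offsets are treated as character positions, end-exclusive (assumption).
        if not (0 <= start < end <= len(text)) or text[start:end] != ann["text"]:
            bad.append(ann)
    return bad

# Made-up example document in the assumed layout.
example = json.loads("""
{
  "text": "We fine-tune on the SQuAD dataset.",
  "annotations": [
    {"start": 20, "end": 25, "text": "SQuAD"},
    {"start": 0, "end": 5, "text": "SQuAD"}
  ]
}
""")

bad_spans = find_bad_spans(example)
```

In practice, each file under combined/ would be loaded with `json.load` and the check run per document. Any annotation returned points to an offset that drifted during conversion.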

Recycled and upcycled resources

Sentences from https://github.com/xjaeh/ner_dataset_recognition have been reviewed and re-annotated to follow common dataset annotation principles: the annotations now also cover new datasets (not just reused ones) and are made at the level of individual datasets (avoiding a single annotation for a conjunction of several datasets). They can be used to train public models for dataset name recognition.
Sentences from dataseer: labeling of data sentence information. Other annotations cover implicit data (which should be complete) and data acquisition devices (incomplete). Non-public: they can be used for evaluation, but not for training public models (and of course cannot be shared).

kermitt2 / dataset_recognition_resources
