michaelfaerber / datarec

Dataset Search

Synopsis

In this repository, we provide the source code of our dataset search engine.

A large and growing number of datasets is available on the web. Dataset search engines, such as Google Dataset Search and the Zenodo search, have been provided to search for these datasets. Such dataset search engines mainly use faceted search or keyword search. However, existing keyword or faceted search is not suitable for very specific and comprehensive queries (e.g., a given research problem description). In addition, these systems rely on the datasets' metadata and are thus dependent on the availability and quality of the provided metadata.

We propose a new approach for dataset search that relies on a text classification model to predict relevant datasets for a user's input. The user input is a text describing the research question the user is investigating. A trained classifier predicts all relevant datasets indexed in a given repository based on the entered text. The set of predicted datasets is ranked by its relevance to the user's problem description.
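
A minimal sketch of this predict-and-rank step, assuming a scikit-learn-style classifier with predict_proba and an already fitted text vectorizer (all names below are illustrative, not the actual API of this repository):

```python
import numpy as np

def recommend_datasets(problem_description, vectorizer, classifier, dataset_names, top_k=10):
    """Rank all indexed datasets by their predicted relevance to the input text."""
    features = vectorizer.transform([problem_description])
    scores = classifier.predict_proba(features)[0]   # one relevance score per dataset label
    ranked = np.argsort(scores)[::-1][:top_k]        # most relevant datasets first
    return [(dataset_names[i], float(scores[i])) for i in ranked]
```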

Demo

See http://data-hunter.io and the associated repository.

Architecture

Schema

As the figure above shows, the actual dataset search engine is based on a text classification model that is trained in a previous step. To this end, a database of training and evaluation data consisting of scientific problem descriptions (paper abstracts or citation contexts) and corresponding datasets is created. The texts and labels are preprocessed to enhance the classification quality. Subsequently, several text classification models are trained and evaluated on this data. By comparing the evaluation results, the best model is selected and then utilized in the search engine. The search engine itself takes a scientific problem description, applies the same preprocessing steps, and uses the selected, pretrained classifier to predict a list of datasets that are relevant to the given problem description. These datasets are then recommended.
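
The offline part of this pipeline can be summarized with the following sketch; it uses two scikit-learn classifiers as stand-ins for the larger set of candidate models and is not the exact code of the evaluation scripts in this repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def select_best_model(texts, labels):
    """Vectorize preprocessed texts, train candidate classifiers, and keep the best one."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vectorizer.fit_transform(texts)
    X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.2, random_state=42)

    candidates = {"logistic_regression": LogisticRegression(max_iter=1000),
                  "linear_svm": LinearSVC()}
    best_name, best_model, best_f1 = None, None, -1.0
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        macro_f1 = f1_score(y_val, model.predict(X_val), average="macro")
        if macro_f1 > best_f1:                       # keep the best-performing classifier
            best_name, best_model, best_f1 = name, model, macro_f1
    return vectorizer, best_name, best_model
```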

Structure of this project

In this repository, we provide the Python files for training and evaluating the text classification models we examined.

The files which perform the actual training and evaluation are collected under the classification-model folder.

The fine-tuning of the BERT model and the fastText classifier are handled separately in correspondingly named files. The basic classification models, i.e., classification based on tfidf similarity and classification based on BM25 values, are trained and evaluated in the basic_classification.py file.
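
As an illustration of these similarity-based baselines, the sketch below ranks datasets by the tfidf cosine similarity between a new problem description and a text "profile" per dataset; it conveys the general idea only and does not reproduce the exact logic of basic_classification.py:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_tfidf_similarity(query_text, dataset_profiles):
    """dataset_profiles: dict mapping a dataset id to the concatenated texts referencing it."""
    dataset_ids = list(dataset_profiles)
    vectorizer = TfidfVectorizer(stop_words="english")
    profile_matrix = vectorizer.fit_transform(dataset_profiles[d] for d in dataset_ids)
    query_vector = vectorizer.transform([query_text])
    similarities = cosine_similarity(query_vector, profile_matrix)[0]
    return sorted(zip(dataset_ids, similarities), key=lambda pair: pair[1], reverse=True)
```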

The following models can be trained on different text representations:

  • Linear SVM
  • Random Forest
  • Logistic Regression
  • Gaussian Naive Bayes
  • CNN
  • LSTM
  • Simple RNN
  • CNN-LSTM
  • Bidirectional-LSTM

All of the above-mentioned models are trained on five different text representations in the correspondingly named files (see the sketch after this list):

  • tfidf_evaluation.py for tfidf values
  • doc2vec_evaluation.py for doc2vec embeddings
  • fasttext_evaluation.py for fastText embeddings
  • scibert_evaluation.py for SCIBERT embeddings
  • transformerxl_evaluation.py for Transformer-XL embeddings
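
To make the combination of text representation and classifier concrete, here is a sketch that feeds doc2vec embeddings (via gensim) into a linear SVM; the hyperparameters are illustrative and not the values used in doc2vec_evaluation.py:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC

def train_svm_on_doc2vec(train_texts, train_labels, vector_size=300):
    """Learn doc2vec embeddings for the training texts and fit a linear SVM on them."""
    tagged = [TaggedDocument(words=text.lower().split(), tags=[str(i)])
              for i, text in enumerate(train_texts)]
    d2v = Doc2Vec(tagged, vector_size=vector_size, epochs=20, min_count=2)
    X_train = [d2v.infer_vector(text.lower().split()) for text in train_texts]
    classifier = LinearSVC()
    classifier.fit(X_train, train_labels)
    return d2v, classifier
```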

The helpers folder defines functions for text preprocessing, embedding computation, evaluation metric calculation, creation of confusion matrices, and calculation of tfidf similarity and BM25 values. These functions are used when training and evaluating the text classifiers; therefore, the preprocessing.py and evaluation.py files (and, for the basic models, also the similarity_metrics.py file) need to be imported.
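
To give an idea of the kind of functions such evaluation helpers provide, the following self-contained example computes precision@k for a ranked list of predicted datasets; it does not reproduce the exact functions defined in the helpers folder:

```python
def precision_at_k(ranked_predictions, relevant_datasets, k=5):
    """Fraction of the top-k predicted datasets that are actually relevant."""
    top_k = ranked_predictions[:k]
    hits = sum(1 for dataset_id in top_k if dataset_id in relevant_datasets)
    return hits / k

# Example: two of the top three predictions are relevant -> precision@3 ≈ 0.67
print(precision_at_k(["ds1", "ds7", "ds3"], {"ds1", "ds3"}, k=3))
```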

Apart from that, further investigations regarding sampling strategies, the validation method, and the time component were conducted. The files for these experiments can be found in the additional_investigation folder.

Finally, the data that was used can be found in the data folder.

Database

The database used for training and testing the classification models can be found in the data folder. In total, the database contains 1,691 datasets with rich metadata from the DSKG. Moreover, we use more than 88,000 paper abstracts and more than 265,000 citation contexts referencing those datasets. These form the training and testing data, which is stored in the files Abstracts_New_Database and Citation_New_Database.
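
A minimal loading sketch, assuming the two files are stored in a tabular format readable by pandas (the file extension and column layout are assumptions, not taken from this repository):

```python
import pandas as pd

# The ".csv" extension is an assumption; adjust to the actual file format in the data folder.
abstracts = pd.read_csv("data/Abstracts_New_Database.csv")
citations = pd.read_csv("data/Citation_New_Database.csv")
print(len(abstracts), "paper abstracts and", len(citations), "citation contexts loaded")
```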

The most frequently referenced datasets in those collections are:

Contact

Michael Färber and Ann-Kathrin Leisinger

Feel free to contact us.

How to Cite

Please cite our paper (published at CIKM'21) as follows:

@inproceedings{Faerber2021CIKM,
  author    = {Michael F{\"{a}}rber and
               Ann-Kathrin Leisinger},
  title     = "{Recommending Datasets Based on Scientific Problem Descriptions}",
  booktitle = "{Proceedings of the 30th ACM International Conference on Information and Knowledge Management}",
  location  = "{Virtual Event}",
  year      = {2021}
}
