tdhd / dssg2017

March 2017 http://dssg-berlin.org hack for DKG (https://www.krebsgesellschaft.de)

Data Science for Social Good 2017

March 2017 DSSG hack for the Deutsche Krebsgesellschaft (DKG).

We aim to build a multi-label model that predicts the labels for a given RIS article from features such as:

  • abstract
  • title
  • authors
  • ...
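As a rough illustration of how such features could be pulled out of an RIS file, here is a minimal sketch using only the standard library. The function names and the tag-to-feature mapping (`TI`/`T1` for title, `AB` for abstract, `AU` for authors, `KW` for labels) are assumptions made for this example and are not taken from the project code; real exports may use other tags.

```python
from collections import defaultdict

# Assumed mapping from RIS tags to feature names; adjust to the actual export.
TAG_TO_FEATURE = {
    "TI": "title",
    "T1": "title",
    "AB": "abstract",
    "AU": "authors",
    "KW": "labels",
}


def _finish(fields):
    """Turn the collected tag values of one record into a feature dict."""
    return {
        "title": " ".join(fields["title"]),
        "abstract": " ".join(fields["abstract"]),
        "authors": list(fields["authors"]),
        "labels": list(fields["labels"]),
    }


def parse_ris(text):
    """Split RIS text into one feature dict per article."""
    articles = []
    current = defaultdict(list)
    for line in text.splitlines():
        # An RIS line looks like "TI  - Some title":
        # two-character tag, two spaces, a hyphen, then the value.
        if len(line) < 5 or line[2:5] != "  -":
            continue
        tag, value = line[:2], line[5:].strip()
        if tag == "ER":  # end of record
            articles.append(_finish(current))
            current = defaultdict(list)
        elif tag in TAG_TO_FEATURE:
            current[TAG_TO_FEATURE[tag]].append(value)
    if current:  # record without a closing ER line
        articles.append(_finish(current))
    return articles
```

Each returned dict keeps the title and abstract as strings and the authors and labels as lists, which is a convenient shape for a text vectorizer and a multi-label target encoder.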

Webservice

We've built a Django webservice that allows the DKG to interact with our model via RIS file uploads.

The service has the following features at the moment:

  • upload training RIS file - triggers model selection on given data.
  • upload test RIS file - produces keyword predictions for the given articles.

A user can give feedback on each prediction: positive feedback adds a missing label, negative feedback removes a predicted label.
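The effect of such feedback on a single article's labels can be sketched as a small set operation. The function name `apply_feedback` and its parameters are hypothetical and only illustrate the idea; the service itself may store feedback differently.

```python
def apply_feedback(predicted, added=(), removed=()):
    """Return the corrected label set for one article.

    predicted: labels proposed by the model
    added:     labels the user added (positive feedback)
    removed:   predicted labels the user rejected (negative feedback)
    """
    # Removal is applied last, so a label that is both added and removed is dropped.
    return (set(predicted) | set(added)) - set(removed)
```

For example, `apply_feedback({"therapy", "screening"}, added=["prevention"], removed=["screening"])` yields the set containing `"therapy"` and `"prevention"`.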

Active learning

To improve the multi-label model, the service accepts user feedback on its predictions.

We've implemented different strategies of prioritization for this active learning setting. See this article for a survey of active learning.
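One common prioritization strategy in active learning is uncertainty sampling: ask the user about the articles whose predictions the model is least sure of. The sketch below is an assumption about what such a strategy could look like, not a description of the strategies implemented in this repository; the function names and the input shape (article id mapped to per-label probabilities) are made up for the example.

```python
def uncertainty(probabilities):
    """Score by the least certain label: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return max((1.0 - 2.0 * abs(p - 0.5) for p in probabilities), default=0.0)


def prioritize(predictions):
    """Return article ids ordered from most to least uncertain.

    predictions: dict mapping article id -> dict mapping label -> probability
    """
    return sorted(
        predictions,
        key=lambda article_id: uncertainty(predictions[article_id].values()),
        reverse=True,
    )
```

With this scoring, an article with a label probability close to 0.5 is shown to the user before an article whose labels are all predicted with probabilities near 0 or 1.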

Files from the hack weekend

The module cleaning_classification_labels implements a cleaning pipeline for the classification labels; apply it before training to correct inconsistencies in the labels.
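To give an idea of what label cleaning can involve, here is a minimal sketch. The steps shown (lowercasing, collapsing whitespace, dropping empty and duplicate labels) are assumptions chosen for illustration; they are not taken from cleaning_classification_labels, and `clean_labels` is a hypothetical name.

```python
def clean_labels(labels):
    """Normalize a list of label strings, keeping first-seen order."""
    seen = set()
    cleaned = []
    for label in labels:
        # Lowercase and collapse runs of whitespace into single spaces.
        normalized = " ".join(label.lower().split())
        # Skip empty labels and labels already seen after normalization.
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

Normalizing labels this way reduces the number of near-duplicate classes the multi-label model has to learn.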

There is also a notebook, features.ipynb, which we added at the beginning of the hack. It looks at different attributes of the features and runs an initial classification on the useful labels.

Languages

Jupyter Notebook 94.1%, Python 4.9%, HTML 1.0%, Shell 0.0%