tandembank / data-science.dataset-labeller

Web-based tool for labelling datasets

Dataset labeller

This is a web-based tool we developed at Tandem to label datasets quickly. It is built with Python, Django and React, and runs through Docker Compose.

[Figure: Creating a new dataset and starting to label]

Installation and Running

The easiest way to get this application running is via Docker Compose. Once you have Docker Compose installed and working, run the following commands to fetch the code and build the image.

git clone https://github.com/tandembank/data-science.dataset-labeller.git
cd data-science.dataset-labeller
cp docker-compose.example.yml docker-compose.yml
docker-compose build

You should now have the Docker image built. To run it, along with its database server, run this from the same location:

docker-compose up

After a few seconds you should be able to access it via http://localhost:8080/ in your browser.
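If you want to script the "after a few seconds" wait, for example before opening the browser automatically, a small polling helper can do it. This is a minimal sketch using only the Python standard library; the function name `wait_until_up` and its default timings are assumptions for illustration and are not part of this project.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url, timeout=60.0, interval=1.0):
    """Poll `url` until it answers over HTTP or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True  # any successful response means the server is up
        except urllib.error.HTTPError:
            return True  # an HTTP error status still proves the server answers
        except OSError:
            time.sleep(interval)  # refused or timed out: wait and try again
    return False
```

Calling `wait_until_up("http://localhost:8080/")` after `docker-compose up` should return `True` once the application starts answering, and `False` if nothing responds within the timeout.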

Usage

This is the process for labelling a new dataset:

  1. Upload a CSV file containing rows that you want to label.
  2. Give it a name.
  3. Select the columns that should be displayed to a person labelling.
  4. Define the possible category labels and keyboard shortcuts to make things faster.
  5. Decide how many people need to label each row (datapoint) – this is useful if you want to get a consensus.
  6. Save dataset.
  7. Get your team to log in and label it.
  8. View job progress on the dashboard.
  9. Download the labelled dataset as a CSV – it'll have an extra column with the labels.
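To picture what step 9 produces, the sketch below appends a label column to CSV text. It only illustrates the shape of the output under stated assumptions: the column name `label`, the helper `append_labels` and the sample rows are all made up here, and the tool's actual export code may differ.

```python
import csv
import io


def append_labels(csv_text, labels, column="label"):
    """Return CSV text with one extra column holding each row's label.

    `labels` maps a zero-based data-row index to its label; rows without
    an entry get an empty cell. The column name is a hypothetical default.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)  # first line is treated as the header row
    writer.writerow(header + [column])
    for index, row in enumerate(reader):
        writer.writerow(row + [labels.get(index, "")])
    return out.getvalue()


# Made-up sample data, purely for illustration.
sample = "id,text\n1,Card declined at checkout\n2,Love the new app\n"
labelled = append_labels(sample, {0: "complaint", 1: "praise"})
print(labelled)
```

The original columns are left untouched, so the exported file can be joined back to the source data without any pre-processing.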

Features

  • Import and export data in the format that you're comfortable with – no need to pre-process data, just select the columns to display for labelling.
  • Each user has their own account so you can see who labelled what.
  • Labellers can access the tool remotely or within a corporate network using just their web browser.
  • Slick and quick user interface while you're labelling – the next few datapoints are already loaded in your browser so they're ready to show as soon as you've labelled the current one.
  • Multiple users can be labelling at once as we use locks to avoid collisions.
  • If some datapoints are tricky to label, or your team is going at break-neck speed, you can choose to get a consensus from an odd number of users, say 3 or 5.
  • Database included and configured in the Docker Compose file.
  • Cell content such as JSON lists gets displayed nicely formatted. We aim to extend this to identify other formats and image URLs.
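The consensus feature above can be pictured as a majority vote over the labels that several users gave the same datapoint. This README does not say how the tool resolves a consensus, so the rule below (strict majority, otherwise no result) and the name `consensus_label` are assumptions for illustration only.

```python
from collections import Counter


def consensus_label(votes):
    """Return the label chosen by a strict majority of labellers, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of all votes so that ties never pick a winner.
    return label if count > len(votes) / 2 else None


print(consensus_label(["spam", "ham", "spam"]))
```

With two categories, an odd number of labellers always yields a strict majority, which is one reason to pick 3 or 5. With more categories a datapoint can still end up without a majority, and this sketch returns `None` in that case.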

About

License: MIT License


Languages

JavaScript 42.8%, Python 40.8%, CSS 12.0%, HTML 2.8%, Dockerfile 1.6%