Project - Auto Labeller

Introduction

Background

Businesses often need to label text data for identification, sorting or strategic purposes. Traditionally, they rely on word matching (building a label dictionary from scratch) or manual labour to label their stored data. However, both approaches tend to be resource-heavy and cumbersome to implement.

This project aims to simplify this process. With this [semi] auto labelling tool, users pick from a list of recommended words to form their label dictionary, and the model expands it into an enriched dictionary. The model then uses the enriched dictionary to label the input text dataset.

Possible applications

Data	Use Case
Email messages from suppliers or customers	Better archive and store email messages on the local file system; group suppliers or customers to better understand collaboration partners
Service tickets for customer complaints	Group customer complaints to identify problematic areas; group service approaches to identify the best ones
Customer feedback for products or services	Identify potential new product categories; group feedback by label to assess the performance of each product label

Getting Started

Environment Setup

You will need the following installed on your system.

Install Python 3
Install pip3 for Windows or Linux
Install Python virtualenv
Install git

Step-by-step setup

Clone the project to your local machine and cd into the project folder

git clone [repository]

Create a python virtual environment within project folder

virtualenv -p python3 env

Activate your virtual environment

# Linux
source env/bin/activate

# Windows
env\Scripts\activate

Install python dependencies

pip3 install -r requirements.txt

# or
pip install -r requirements.txt

Run Jupyter Notebook

jupyter notebook

Run the following code within Jupyter to download the nltk data

import nltk
nltk.download('all')

Notebook Instruction

Walk through the demo in bricks_demo_auto_label.ipynb to gain an intuition of the steps required to operate this auto labelling tool.
Walk through the sample notebook in bricks_auto_label.ipynb. This notebook allows you to experiment with the labelling tool and evaluate its usefulness for your company.

Technical Documentation

Folder Structure

bricks_demo_auto_label.ipynb - demo code using the original example for users to get an intuition for the applications of this auto labelling tool.
bricks_auto_label.ipynb - base code that lets users experiment with the auto labeller

Labels Dictionary

It is important to identify your desired keys and labels.

Manually mix and match keywords to create the dictionary labels.csv, with your desired categories and a list of keywords for each category.
The notebook reads data/labels.csv to proceed with the semi-supervised labelling.
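As a rough illustration of the labels.csv layout described above (one column per category, keywords listed down each column), the sketch below writes and re-reads such a file using only the Python standard library. The category names, keywords and helper function names are hypothetical, and the notebook may parse the file differently.

```python
import csv
import io
from itertools import zip_longest

# Hypothetical categories and keywords, chosen only for illustration.
label_dict = {
    "finance": ["bank", "stock", "market"],
    "sports": ["football", "match"],
    "politics": ["election", "minister", "vote", "policy"],
}


def write_labels_csv(labels, handle):
    """Write one column per category; shorter columns are padded with ''."""
    writer = csv.writer(handle)
    writer.writerow(labels.keys())
    for row in zip_longest(*labels.values(), fillvalue=""):
        writer.writerow(row)


def read_labels_csv(handle):
    """Read the column layout back into {category: [keywords]}."""
    reader = csv.reader(handle)
    header = next(reader)
    labels = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            if cell.strip():  # skip the padding cells
                labels[name].append(cell.strip())
    return labels


# An in-memory buffer stands in for data/labels.csv in this sketch.
buffer = io.StringIO()
write_labels_csv(label_dict, buffer)
buffer.seek(0)
round_trip = read_labels_csv(buffer)
```

To produce a real file, pass `open("data/labels.csv", "w", newline="")` to `write_labels_csv` instead of the in-memory buffer.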

How to use this Model

The primary function of this model takes an input dataset (news.csv) and labels it using labels.csv. It outputs labelled.csv and, if ground truth is available, score.csv.

inputs:
- news.csv - dataset containing the strings to be labelled
  - can contain as many rows as needed (fewer than 10k rows recommended)
  - contains the text to be labelled
- labels.csv - labels for the identified classifications (e.g. finance, sports, politics)
  - contains one column per category
  - contains the keys for each category in its column
outputs:
- labelled.csv - labelled dataset, with labels drawn from the input labels in labels.csv
  - contains the same number of rows as news.csv
- score.csv - model performance for the labels. Only available if you have ground truth for the input.
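To make the input/output contract above concrete, here is a minimal sketch that labels each row by plain keyword counting and writes one output row per input row. This is a simplified stand-in, not the project's model: it skips the dictionary enrichment step entirely. The column names `text` and `label`, the sample rows, the `unlabelled` fallback and the use of accuracy as the score are all assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical stand-ins for news.csv and the labels dictionary.
news_csv = io.StringIO(
    "text\n"
    "The central bank raised rates as the stock market fell\n"
    "The football match ended in a draw\n"
    "Voters head to the polls for the election\n"
    "A quiet day with nothing to report\n"
)
labels = {
    "finance": ["bank", "stock", "market"],
    "sports": ["football", "match"],
    "politics": ["election", "vote"],
}


def label_text(text, labels, default="unlabelled"):
    """Return the category whose keywords match the most tokens in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {
        category: sum(tokens.count(word) for word in words)
        for category, words in labels.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default


rows = list(csv.DictReader(news_csv))
labelled = [
    {"text": row["text"], "label": label_text(row["text"], labels)}
    for row in rows
]

# Write the labelled rows; an in-memory buffer stands in for labelled.csv.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(labelled)

# Hypothetical ground truth; the metrics actually stored in score.csv
# are not described in this README, so plain accuracy is an assumption.
truth = ["finance", "sports", "politics", "sports"]
accuracy = sum(p["label"] == t for p, t in zip(labelled, truth)) / len(truth)
```

Exact-token matching like this misses word variants (for example, "Voters" does not match "vote"), which is the kind of gap an enriched dictionary is meant to close.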

Code Tested

Code tested with python 3.5.5 running on Azure Data Science Virtual Machine (Ubuntu 16.04)

Credits

Author

Lin Laiyi, Senior AI Apprentice at AI Singapore, NUS MSBA 2017/2018

LinkedIn: https://www.linkedin.com/in/laiyilin/

Portfolio of selected analytics project: https://drive.google.com/file/d/1fVntFEvj6us_6ERzRmbU85EOeZymFxEm/view

Edited by Jway Jin Jun in Aug 2019, AI Engineer at AI Singapore.

Find the original presentation slide [here](https://docs.google.com/presentation/u/1/d/1hQED4ZZqzcwgq6-jgtw3MbRWRPN6CRTOs_zbVQQu_YU/edit#slide=id.p)

Additional Notes

This project was edited for the Bricks project to demonstrate and enable AI technologies.

Find the original project [here](https://github.com/lylin17/auto_label)

Software License

Dataset sources

news - github, medium
toxic comments - kaggle
movies - kaggle
