SeanLee97/clfzoo

/ clfzoo /

Eng / CN

clfzoo is a toolkit for text classification. We have implemented some baseline models, such as TextCNN, TextRNN, RCNN, Transformer, HAN, DPCNN. And We have designed a unified and friendly API to train / predict / test the models. Looking forward to your code contributions and suggestions.

Requiements

python3+
numpy
sklearn
tensorflow>=1.6.0

Installation

git clone https://github.com/SeanLee97/clfzoo.git
cd clfzoo

Overview

project
│    README.md
│
└─── docs
│
└─── clfzoo    # models
│   │  base.py       # base model template
│   │  config.py     # default configure
│   │  dataloader.py
│   │  instance.py   # data instance
│   │  vocab.py      # vocabulary
│   │  libs          # layers and functions
│   │  dpcnn         # implement dpcnn model
│   │   │  __init__.py  # model apis
│   │   │  model.py     # model
│   │  ...           # implement other models
└───examples
    │   ...

Data Prepare

Each line is a document. The line format is "label \t sentence". The default word tokenizer is split by blank space, so words in sentence should split by blank space.

for english sample

greeting    how are you.

for chinese sample

打招呼  你 最近 过得 怎样 啊 ？

Usage

train

# import model api
import clfzoo.textcnn as clf  

# import model config
from clfzoo.config import ConfigTextCNN

"""define model config

You can assign value to hy-params defined on base model config (here is ConfigTextCNN)
"""

class Config(ConfigTextCNN):
    def __init__(self):
        # it is required to implement super() function
        super(Config, self).__init__()

    # it is required to provide dataset
    train_file = '/path/to/train'
    dev_file = '/path/to/test'
    
    # ... other hy-params

# `training` is flag to indicate train mode.
clf.model(Config(), training=True)

# start to train
clf.train()

The train log will output to log.txt, the model weights and checkpoint summaries will output to models folder.

predict

Predit the labels and probability scores.

import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNN

class Config(ConfigTextCNN):
    def __init__(self):
        super(Config, self).__init__()
    
    # the same hy-params as train

# inject config to model
clf.model(Config())

"""
Input: a list
    each item in list is a sentence string split by blank space (for chinese sentence you should prepare your input data first)
"""
datas = ['how are u ?', 'what is the weather ?', ...]

"""
Return: a list
    [('label 1', 'score 1'), ('label 2', 'score 2'), ...]
"""
preds = clf.predict(datas)

test

Predit the labels and probability scores and get result metrics. In order to calculate metrics you should provide ground-truth label.

import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNN

class Config(ConfigTextCNN):
    def __init__(self):
        super(Config, self).__init__()
    
    # the same hy-params as train

# inject config to model
clf.model(Config())

"""
Input: a list
    each item in list is a sentence string split by blank space (for chinese sentence you should prepare your input data first)
"""
datas = ['how are u ?', 'what is the weather ?', ...]
labels = ['greeting', 'weather', ...]

"""
Return: a tuple
    - predicts: a list
        [('label 1', 'score 1'), ('label 2', 'score 2'), ...]
    - metrics: a dict
        {'recall': '', 'precision': '', 'f1': , 'accuracy': ''}
"""
preds, metrics = clf.test(datas, labels)

Benchmark Results

here we use smp2017-ECDT dataset as an example, which is a multi-label (31 labels)、short-text and chinese dataset.

We train all models in 20 epochs, and calculate metrics by sklearn metrics functions. As we all know fasttext is a strong baseline in text-classification, so here we give the result on fasttext

Models	Precision	Recall	F1
fasttext	0.81	0.81	0.81
TextCNN	0.83	0.84	0.83
TextRNN	0.84	0.83	0.82
RCNN	0.86	0.85	0.85
DPCNN	0.87	0.85	0.85
Transformer	0.74	0.67	0.68
HAN	TODO	TODO	TODO

Attention! It seems that Transformer and HAN can`t perform well now, We will fix bugs and update their result later.

Contributors

sean lee
- a single coder
- seanlee97@github.io
x.m. li
- a undergraduate student from Shanxi University
- holahack@github
...

Refrence

Some code modules from

Papers

TextCNN: Convolutional Neural Networks for Sentence Classification
DPCNN: Deep Pyramid Convolutional Neural Networks for Text Categorization
Transformer: Attention Is All You Need
HAN: Hierarchical Attention Networks for Document Classification

Contact Us

Any questions please mailto xmlee97#gmail.com

SeanLee97 / clfzoo