Text classification benchmarks with scikit-learn and spaCy v3.
This project assumes that there is a data.csv file in the assets folder that contains data for a basic text classification task. This file needs to have one column named "text" and another one named "label". It can also contain an optional column called "set", whose values are either "train" or "valid". If this column is missing, the project makes a random 50% split.
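As an illustration of the expected file layout and the fallback split, here is a minimal sketch using only the Python standard library. The function name `assign_sets`, the fixed seed, and the sample rows are assumptions made for this example; the project's own prepare step may split the data differently (for instance per row rather than by exact halves).

```python
import csv
import io
import random


def assign_sets(rows, seed=0):
    """Add a "set" value to every row when the column is missing."""
    if rows and "set" in rows[0]:
        return rows  # the file already defines its own split
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    # The first half of the shuffled indices becomes the training set.
    train_ids = set(indices[: len(rows) // 2])
    return [
        {**row, "set": "train" if i in train_ids else "valid"}
        for i, row in enumerate(rows)
    ]


# Hypothetical sample standing in for assets/data.csv
sample = io.StringIO(
    "text,label\n"
    "book a flight to paris,travel\n"
    "what is my balance,banking\n"
    "cancel my reservation,travel\n"
    "transfer money to savings,banking\n"
)
rows = assign_sets(list(csv.DictReader(sample)))
```

Each resulting row keeps its "text" and "label" values and gains a "set" value of either "train" or "valid".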
This data will be modelled in four different ways:
- a spaCy bag-of-words model
- a spaCy CNN model
- a scikit-learn CountVectorizer logistic regression classifier
- a scikit-learn TF-IDF logistic regression classifier
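To make the two scikit-learn variants in the list above concrete, the sketch below builds one pipeline per vectorizer and scores both on a held-out set. The variable names, the toy data, and `max_iter=1000` are assumptions for illustration; the project's actual training script and hyperparameters may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in the project this would come from assets/data.csv.
train_texts = [
    "book a flight to paris",
    "cancel my hotel reservation",
    "what is my account balance",
    "transfer money to savings",
]
train_labels = ["travel", "travel", "banking", "banking"]
valid_texts = ["book a hotel", "check my balance"]
valid_labels = ["travel", "banking"]

# One pipeline per vectorizer, both ending in logistic regression.
pipelines = {
    "count": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
    "tfidf": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, pipe in pipelines.items():
    pipe.fit(train_texts, train_labels)
    # Pipeline.score reports mean accuracy for classifiers.
    scores[name] = pipe.score(valid_texts, valid_labels)
```

Keeping the vectorizer and the classifier in a single pipeline means the fitted vocabulary travels with the model when it is saved or evaluated.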
To set up this project, you only need to clone the spaCy project.
python -m spacy project clone --repo https://github.com/koaning/classycookie --branch main classification <folder-name>
The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.
The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.
Command | Description |
---|---|
prepare | Prepare the .csv for the next steps |
convert | Convert the data to spaCy's binary format |
train-spacy | Train the spaCy models |
train-sklearn | Train the scikit-learn models |
evaluate | Evaluate the models and export metrics |
The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.
Workflow | Steps |
---|---|
all | prepare → convert → train-spacy → train-sklearn → evaluate |
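The rule that commands are only re-run when their inputs have changed can be pictured as a content-hash check. The following is a rough sketch of that general idea, not spaCy's implementation: the function `run_if_changed`, the JSON lock file layout, and the file names are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path):
    """Return a SHA-256 digest of the file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_if_changed(name, inputs, lock_path, action):
    """Run `action` only when the input hashes differ from the recorded ones."""
    lock_path = Path(lock_path)
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(p): file_hash(p) for p in inputs}
    if lock.get(name) == current:
        return False  # inputs unchanged, so the command is skipped
    action()
    lock[name] = current
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    lock_file = Path(tmp) / "demo.lock"  # hypothetical lock file name
    data.write_text("text,label\nhello,greeting\n")
    calls = []

    def step():
        calls.append("prepare")

    first = run_if_changed("prepare", [data], lock_file, step)
    second = run_if_changed("prepare", [data], lock_file, step)
    data.write_text("text,label\nhello,greeting\nbye,farewell\n")
    third = run_if_changed("prepare", [data], lock_file, step)
```

In this sketch the second call is skipped because the file is unchanged, while the third call runs again after the file is edited.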
The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.
File | Source | Description |
---|---|---|
assets/data.csv | Local | Original .csv file. |
This project comes with a copy of the clinc dataset that you can replace at will.
The final result of the project will be a summary table like this one:
Model | Train Score | Valid Score | Time Taken |
---|---|---|---|
sklearn-trained/pipe_tfid… | 0.9262 | 0.8582 | 0.15783333778381348 |
sklearn-trained/pipe_cv.j… | 0.9866 | 0.8955 | 0.14619827270507812 |
training/cnn/model-best | 0.9867 | 0.8723 | 3.962278127670288 |
training/bow/model-best | 0.9867 | 0.8723 | 4.037751197814941 |
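One row of such a summary could be produced along the following lines. This is a sketch under assumptions the source does not confirm: that the scores are plain accuracy and that the time is wall-clock seconds spent predicting. The `evaluate` helper and the keyword-based `predict` function are hypothetical stand-ins for a trained model.

```python
import time


def evaluate(predict, texts, labels):
    """Return accuracy and elapsed prediction time for one model."""
    start = time.perf_counter()
    predictions = [predict(text) for text in texts]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"score": round(correct / len(labels), 4), "time_taken": elapsed}


def predict(text):
    # Hypothetical stand-in for a trained classifier.
    return "travel" if "flight" in text else "banking"


texts = ["book a flight", "check my balance", "cancel my trip", "pay my bill"]
labels = ["travel", "banking", "travel", "banking"]
result = evaluate(predict, texts, labels)
```

Running the same helper on both the train and valid rows of the data would give the two score columns for a single model.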