Xrenya/DocClassificationApp

Russian document classification application

Installation

pip install -r requirements.txt

⚠️ The transformers library has an issue with truncation when a pipeline is used. If the model fails with an error because the input exceeds the maximum number of tokens (> 512), you have to change this manually in the library's file (specifically, in self.vectorizer) so that the input is truncated: truncate=True, model_max_length=512
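As a possible alternative to editing the library file, the tokenizer arguments can often be passed at call time. The sketch below assumes a text-classification pipeline that forwards the keyword arguments truncation and max_length to its tokenizer. The helper names classify and build_pipeline, and the model_name argument, are hypothetical and not part of this repository.

```python
# Tokenizer arguments forwarded on every call; 512 is the token limit quoted above.
TOKENIZER_KWARGS = {"truncation": True, "max_length": 512}


def classify(pipe, text):
    """Call a pipeline-like object with truncation enabled."""
    return pipe(text, **TOKENIZER_KWARGS)


def build_pipeline(model_name):
    """Create a text-classification pipeline (model_name is a placeholder)."""
    # Imported lazily so that classify() can be used and tested
    # without loading transformers or downloading a model.
    from transformers import pipeline

    return pipeline("text-classification", model=model_name)
```

With a real checkpoint this would be used as classify(build_pipeline(name), long_text). If the pipeline in app.py is built differently, the same two arguments may need to go to the tokenizer call instead.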

Usage:

python app.py

To share the project, launch the demo with share=True: demo.launch(share=True)
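A minimal sketch of where this flag goes, assuming the demo object is a Gradio Interface (the call demo.launch(share=True) matches the Gradio API, but the actual layout of app.py is not shown here). The function predict_stub is a hypothetical stand-in for the real classifier.

```python
def predict_stub(text):
    """Hypothetical placeholder for the real model; returns a fixed label."""
    return "placeholder-label"


def build_demo():
    """Build a small Gradio interface around the stub predictor."""
    # Imported lazily so the stub can be tested without Gradio installed.
    import gradio as gr

    return gr.Interface(fn=predict_stub, inputs="text", outputs="label")


def main():
    demo = build_demo()
    # share=True asks Gradio to create a temporary public link
    # in addition to the local address.
    demo.launch(share=True)
```

Calling main() starts the server. Nothing is launched when the module is only imported.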

Reports:

RuBert model: report_bert.ipynb
BoW model: report_v3.ipynb

Application UI:

Simple user interface

UI for prediction

Example UI explainability:

Final UI

Progress:

Developed:

Supported file extensions:
- pdf
- rtf
- doc
- docx
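The extension list above can be checked before a file reaches the model. This is only a sketch: the names SUPPORTED_EXTENSIONS and is_supported are hypothetical, and the application may validate uploads differently.

```python
from pathlib import Path

# Extensions taken from the list above.
SUPPORTED_EXTENSIONS = {".pdf", ".rtf", ".doc", ".docx"}


def is_supported(filename):
    """Return True if the file name ends with a supported extension."""
    # Lower-casing makes the check case-insensitive (e.g. "scan.PDF").
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, is_supported("contract.PDF") is accepted, while is_supported("notes.txt") is rejected.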
Inference:
- Prediction label
- Model explainability: word weights/attention
UI:
- Allow users to upload files
- Visualize predicted label
- Visualize the model's explanation of its prediction
Model analysis using eli5:
- Identified the keywords the model uses to classify documents. Only classes '1' and '2' have the bias term as a top feature, which should probably be tackled at the next stage.
SHAP word highlighting based on BERT output:

Model metrics on the test dataset (20%):

Metric	Value
accuracy_score	0.9583
precision_score	0.9583
f1_score	0.9583
recall_score	0.9583
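All four metrics above share the same value. One possible explanation, which the source does not confirm, is micro averaging: in single-label multiclass classification every wrong prediction counts as one false positive and one false negative, so micro-averaged precision, recall and F1 all equal accuracy. The sketch below shows this on toy labels that are not taken from the project's data.

```python
from collections import Counter


def micro_scores(y_true, y_pred):
    """Accuracy and micro-averaged precision, recall, F1 for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    precision = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    recall = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = tp_sum / len(y_true) if y_true else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: six documents, one misclassified.
toy_true = ["1", "2", "3", "1", "2", "3"]
toy_pred = ["1", "2", "3", "1", "3", "3"]
scores = micro_scores(toy_true, toy_pred)
```

With other averaging modes (macro or weighted) the four numbers would generally differ, so identical values are only a hint, not proof, of micro averaging.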

Hugging Face Space: https://huggingface.co/spaces/Xrenya/TreatyClassifier
