Xrenya / DocClassificationApp

Russian document classification application

Home Page: https://huggingface.co/spaces/Xrenya/TreatyClassifier

Installation

pip install -r requirements.txt

⚠️ The transformers library has an issue with truncation when using a pipeline. If the model fails with an error because the input exceeds the maximum number of tokens (> 512), you have to change the library's file manually (specifically, in self.vectorizer) to enable truncation: truncate=True, model_max_length=512
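As a possible alternative to editing the library file, recent versions of the transformers text-classification pipeline forward tokenizer keyword arguments given at call time. The sketch below assumes that behaviour and has not been checked against this project's pinned version; `classify_truncated` and its `classifier` argument are hypothetical names, not part of this repository.

```python
def classify_truncated(classifier, text, max_length=512):
    """Call a pipeline-like object and ask its tokenizer to truncate long inputs.

    `classifier` is assumed to be a Hugging Face text-classification pipeline,
    or any callable that accepts the same keyword arguments.
    """
    # `truncation` and `max_length` are standard tokenizer arguments;
    # whether the pipeline forwards them depends on the installed version.
    return classifier(text, truncation=True, max_length=max_length)


# Hypothetical usage (requires `transformers` and a model checkpoint):
# from transformers import pipeline
# clf = pipeline("text-classification", model="<your-model>")
# print(classify_truncated(clf, long_document_text))
```

If the installed version rejects these keyword arguments, the manual change described above remains the fallback.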

Usage:

python app.py

To share the project, launch the demo with: demo.launch(share=True)
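The call demo.launch(share=True) matches the Gradio launch API, so the sketch below assumes app.py builds a Gradio interface named `demo`. The `predict` stub, `build_demo`, and `main` are hypothetical placeholders and are not taken from the repository.

```python
def predict(text: str) -> dict:
    """Placeholder classifier returning a fixed label distribution.

    The real application would run its model here; this stub is hypothetical.
    """
    return {"1": 0.7, "2": 0.3}


def build_demo():
    # Imported lazily so this module can be imported without Gradio installed.
    import gradio as gr

    # A text box as input and a label widget (class -> confidence) as output.
    return gr.Interface(fn=predict, inputs="text", outputs="label")


def main():
    demo = build_demo()
    # share=True asks Gradio to create a temporary public link
    # in addition to the local server.
    demo.launch(share=True)
```

Calling `main()` starts the server. With share=False (the Gradio default) the app is reachable only on the local machine.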

Reports:

  1. RuBert model: report_bert.ipynb
  2. BoW model: report_v3.ipynb

Application UI:

Simple user interface

image

UI for prediction

image

Example of UI explainability:

image

Final UI

image

Progress:

Developed:

  1. Supported file extensions:
    • pdf
    • rtf
    • doc
    • docx
  2. Inference:
    • Prediction label
    • Model explainability: word weights/attention
  3. UI:
    • Upload user files
    • Visualize the predicted label
    • Visualize the model's explanation of its prediction
  4. Model analysis using eli5:
    • Identified the keywords the model uses to classify documents. Only classes '1' and '2' have the bias term as a top feature, which should probably be tackled in the next stage.
  5. SHAP word highlighting based on BERT output:
    image
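The supported extensions listed above could be checked before a file reaches the model. This is a minimal sketch, not the repository's actual upload handling; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names.

```python
from pathlib import Path

# Extensions taken from the "Supported file extensions" list above.
SUPPORTED_EXTENSIONS = {".pdf", ".rtf", ".doc", ".docx"}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is one the app accepts."""
    # Lower-case the suffix so "CONTRACT.PDF" is treated like "contract.pdf".
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

Rejecting unsupported uploads early lets the UI show a clear message instead of failing inside the text-extraction step.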

Model metrics on the test dataset (20%):

Metric            Score
accuracy_score    0.9583
precision_score   0.9583
f1_score          0.9583
recall_score      0.9583
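All four scores are identical, which is what micro-averaging produces for single-label multiclass classification. Whether the reports actually use micro-averaging is an assumption. The sketch below uses made-up toy labels, not the project's data, to show why the values coincide; `micro_metrics` is a hypothetical helper.

```python
def micro_metrics(y_true, y_pred):
    """Compute accuracy and micro-averaged precision, recall, and F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true positives, false positives, and false negatives over all classes.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: six documents, one misclassified.
toy_true = ["1", "2", "3", "1", "2", "3"]
toy_pred = ["1", "2", "3", "1", "3", "3"]
scores = micro_metrics(toy_true, toy_pred)
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled precision and recall both equal accuracy. Macro or weighted averaging would generally give different values per metric.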


Languages

HTML 82.8%, Jupyter Notebook 16.5%, Python 0.7%