Hyprnx / Text-Classification

Final Project for Deep Learning course at NEU

Product category classification

This project contains the source code for the graduation project "Multi-stages Deep Learning based E-commerce Product Categorization with advanced text processing techniques".

Developed by To Duc Anh, a student of DSEB61, MFE, a member of NEU's DSLab, and a Data Management Associate at Techcombank Vietnam.

The project offers a solution to the product category classification problem. The dataset was crawled from Shopee.

This project features a deep learning model built with PyTorch. The embeddings were computed with Vietnamese sBERT.
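As a minimal, hypothetical sketch of the embedding step (assuming the sentence-transformers library and the keepitreal/vietnamese-sbert checkpoint; the exact model used is documented in the notebooks, not here):

    # Hypothetical embedding sketch; the model name is an assumption, not necessarily the repo's checkpoint.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("keepitreal/vietnamese-sbert")
    titles = ["Áo thun nam cotton", "Son môi lì Hàn Quốc"]
    embeddings = model.encode(titles, batch_size=32, show_progress_bar=True)
    print(embeddings.shape)  # (2, embedding_dim)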

The training process can be found in the notebook directory.

The project is now LIVE and accessible via this link.

About Dataset:

The dataset contains roughly 1,000,000 products labeled with the following categories (a hypothetical label-encoding sketch follows the list):

  • Electronics
  • Cosmetics
  • Fashion
  • Mom & Baby
  • Others
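For illustration only, the categories above could be mapped to integer labels before training; the actual encoding used in the notebooks may differ:

    # Hypothetical label encoding for the five categories listed above.
    CATEGORY_TO_ID = {
        "Electronics": 0,
        "Cosmetics": 1,
        "Fashion": 2,
        "Mom & Baby": 3,
        "Others": 4,
    }
    ID_TO_CATEGORY = {v: k for k, v in CATEGORY_TO_ID.items()}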

About the model and steps:

Access the introduction page here for information on how the model is built and how each step works.

How to run the project on local machine:

This project was developed and deployed with Python 3.10, so you should have Python 3.10 installed to run it on your local machine. The web app is built with Streamlit, a Python library for building web apps.

Getting Started on your local machine:

  1. Clone the project to your local machine:
     git clone https://github.com/Hyprnx/Text-Classification
  2. Set up and activate a virtual environment:
     Setup:
     python -m venv <envname>
     Activate:
       • On Mac:
         source <envname>/bin/activate
       • On Windows:
         <envname>\Scripts\activate
  3. Install the dependencies:
     pip install -r requirement.txt
  4. Run the project:
     streamlit run streamlit_app.py
  5. Open the link provided by Streamlit in your browser.

The app runs on port 8501 by default (e.g. http://localhost:8501). If you want to change the port, see the Streamlit documentation here.
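For example, Streamlit's standard --server.port flag selects a different port at launch:

    streamlit run streamlit_app.py --server.port 8502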

The project should look like this: (screenshot)

  6. Enjoy the project!

Others

The project also includes a model (and an ONNX version of it) that can be used for inference. It is located here: model.
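As a rough sketch of how the exported ONNX classifier could be loaded for inference (the file name, input name, and embedding dimension below are assumptions; inspect the exported model, e.g. with session.get_inputs(), for the actual values):

    # Hypothetical ONNX inference sketch; path and shapes are assumptions.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("model/classifier.onnx", providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    embedding = np.random.rand(1, 768).astype(np.float32)  # stand-in for an sBERT embedding
    logits = session.run(None, {input_name: embedding})[0]
    print(logits.argmax(axis=1))  # predicted category id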

Experimental:

ONNX and GPU acceleration:

Inference time could be accelerated with the help of ONNX Runtime. In the provided model, the embedding step is fairly slow because the embedding module is taken directly from transformers without any optimization. There are two reasons. First, the sentence-embedding model we use has a problem with its vocabulary size, so the ONNX conversion cannot be completed; we may retrain it with a smaller vocabulary later on to make it work. Second, the embedding process runs entirely on CPU, which is not efficient for parallel computing; providing a GPU would definitely speed it up. Embedding ~1M sentences takes around 10 minutes on an NVIDIA P100 GPU, kindly provided by Google on Kaggle.
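A minimal sketch of running the embedding step on a GPU, assuming the sentence-transformers API and a CUDA-capable device (the model name is again an assumption):

    # Hypothetical GPU-accelerated embedding sketch.
    import torch
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer("keepitreal/vietnamese-sbert", device=device)  # model name is an assumption

    product_titles = ["Tai nghe bluetooth", "Sữa rửa mặt"]  # replace with the ~1M crawled titles
    # Larger batches make better use of the GPU; tune batch_size to fit memory.
    embeddings = model.encode(product_titles, batch_size=256, show_progress_bar=True)

    # For ONNX models, the CUDAExecutionProvider from the onnxruntime-gpu package plays the same role:
    # ort.InferenceSession("model/classifier.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])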

DataFrame Library:

The project currently uses Pandas (pre-2.0 releases) and its parallelized variant Modin to speed up data processing. These libraries are still not the fastest available. We could switch to Polars, a DataFrame manipulation framework written in Rust that offers blazingly fast, memory-efficient data processing. However, since Polars is still in early development and lacks some features we need, we cannot use it for this project yet; we may try it in the future.
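For illustration, the same simple cleaning step in Pandas (optionally parallelized with Modin via import modin.pandas as pd) and in Polars might look like this; the column names are hypothetical:

    # Hypothetical comparison of the same cleaning step in Pandas and Polars.
    import pandas as pd
    import polars as pl

    data = {"title": ["Áo thun nam", None], "category": ["Fashion", "Others"]}

    # Pandas: the library currently used in the project.
    pdf = pd.DataFrame(data).dropna(subset=["title"])
    pdf["title"] = pdf["title"].str.lower()

    # Polars: expression-based and multi-threaded by default.
    pldf = pl.DataFrame(data).drop_nulls(subset=["title"]).with_columns(
        pl.col("title").str.to_lowercase()
    )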

Deployment:

We use the free tier of Streamlit Cloud, which is limited to 1 GB of resources. This limitation is why you don't see the PhoBERT classifier on the web app: that model is too large to be deployed on Streamlit Cloud.

We could rent a cloud-based service from AWS, GCP, or Azure to deploy the model, but, you know, we are students; there is no financial benefit in doing that for demonstration purposes.

About

Final Project for Deep Learning course at NEU

License: Apache License 2.0


Languages

Jupyter Notebook 88.8%, Python 11.2%