Machine Learning Operations Exam Project: Group 27

This GitHub repository contains the main components of our exam project. The group consists of the following students from the Technical University of Denmark:

Jonas Poulsen (s194243)
Andreas Hornemann Nielsen (s194236)
Christian Vestergaard Djurhuus (s194244)
Xiang Bai (s213120)

Using the repository

A brief guide on how to install and use the repository.

Using the repository locally and running the Dockerfiles

Clone the repository
Navigate into the cloned repository
Install the dependencies: pip install -r requirements.txt
Install the project as an editable package: pip install -e .

Using Google Cloud Platform (GCP) together with GitHub Actions

First off, we recommend that you fork the repository into your own GitHub account. By doing this, you will be able to use GitHub Actions later on to automatically push Dockerfiles to your GCP project. After having forked the repository, complete the following steps:

GCP Setup
Setting up GitHub Secrets

Project description

The purpose of this project is to become acquainted with the production machine learning (ML) life cycle (design, model development and operations), with a particular focus on the operations stage. Thus, the primary goal of the project is to learn how to manage a production ML life cycle through good practices and the tools presented in the course “Machine Learning Operations - 02476”.

The model used for this project is BERT (Bidirectional Encoder Representations from Transformers), published by Google AI Language and available through the Transformers library built by Hugging Face.

The motivation for using BERT is that, despite its simplicity, it is a very powerful tool that has reached state-of-the-art results on several NLP tasks. Furthermore, it supports PyTorch, which is in line with what is used in the course. Therefore, BERT fits the goal of the project well: to use machine learning operations tools, not to design cool AI models.

The main task of our model is to perform binary sentiment classification on text from the IMDb dataset. For training, we chose a fine-tuning approach. This is a sensible trade-off, since it leaves more time for the tools that the course provides. The model, which has already been pre-trained on a large unlabelled dataset, is fine-tuned on a labelled sub-dataset, which makes training faster than training from scratch. The Transformers library provides a Trainer API, which allows for easy logging, gradient accumulation, mixed precision and evaluation during training.
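As a rough illustration of how the Trainer API ties these options together, the sketch below collects a few fine-tuning settings and passes them on to `TrainingArguments`. The class `FineTuneConfig`, the function `build_trainer`, the checkpoint name `bert-base-uncased`, the output directory and all numeric values are assumptions made for illustration; they are not taken from this repository.

```python
from dataclasses import asdict, dataclass


@dataclass
class FineTuneConfig:
    """Hypothetical settings; the values are illustrative, not the project's."""

    per_device_train_batch_size: int = 8
    gradient_accumulation_steps: int = 4  # accumulate gradients over 4 batches per update
    fp16: bool = False  # set to True on a CUDA GPU to enable mixed precision
    logging_steps: int = 50  # log training metrics every 50 optimizer steps
    num_train_epochs: float = 1.0

    def effective_batch_size(self, num_devices: int = 1) -> int:
        """Number of examples that contribute to each optimizer update."""
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )


def build_trainer(config, train_dataset, eval_dataset, output_dir="models/bert-imdb"):
    """Create a Trainer for binary sentiment classification (sketch)."""
    # Imported lazily so the config can be inspected without transformers installed.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    # Assumed checkpoint; two output labels for negative/positive sentiment.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    args = TrainingArguments(output_dir=output_dir, **asdict(config))
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

With these example values, each optimizer update is based on 8 x 4 = 32 examples per device, which is how gradient accumulation emulates a larger batch size on limited memory. Calling `build_trainer(...).train()` would then start fine-tuning, given tokenized datasets.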

The dataset used for the sentiment project is the IMDb dataset from Hugging Face: https://huggingface.co/datasets/imdb. It consists of 100,000 plain-text movie reviews. Of these, 50,000 are labelled as either “neg” or “pos”, while the remaining 50,000 are unlabelled. Initially, the 50,000 labelled data points will be used for training and testing. The unlabelled set may be used for further pretraining if we see fit.
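The split between labelled and unlabelled reviews can be sketched as follows. The helper names `split_labelled` and `load_imdb` are hypothetical, and the sketch assumes that each record is a dictionary with a `text` and a `label` field, that “neg” and “pos” are encoded as the integers 0 and 1, and that any other label value marks an unlabelled review.

```python
# Assumed integer encoding of the "neg"/"pos" labels.
LABEL_NAMES = {0: "neg", 1: "pos"}


def split_labelled(records):
    """Separate labelled reviews from unlabelled ones.

    A record counts as labelled when its "label" is a key of LABEL_NAMES.
    """
    labelled, unlabelled = [], []
    for record in records:
        if record["label"] in LABEL_NAMES:
            labelled.append(record)
        else:
            unlabelled.append(record)
    return labelled, unlabelled


def load_imdb():
    """Download the IMDb dataset from the Hugging Face Hub (sketch)."""
    # Imported lazily so the helpers above work without the datasets package.
    from datasets import load_dataset

    return load_dataset("imdb")


if __name__ == "__main__":
    # Tiny hand-written sample instead of the real data.
    sample = [
        {"text": "A wonderful film.", "label": 1},
        {"text": "Two hours I will never get back.", "label": 0},
        {"text": "Saw it on a Tuesday.", "label": -1},
    ]
    labelled, unlabelled = split_labelled(sample)
    print(len(labelled), "labelled,", len(unlabelled), "unlabelled")
```

Only the labelled part would feed the supervised fine-tuning, while the unlabelled part is kept aside for the optional further pretraining mentioned above.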

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── utils          <- util files such as deployment etc.  
│   │
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Project checklist

Please note that all the lists are exhaustive, meaning that I do not expect you to have completed every point on the checklist for the exam.

Week 1

Week 2

Week 3

Deployed your model locally using TorchServe (TorchServe is not yet compatible with Hugging Face Transformers, so we used a different approach)
Checked how robust your model is towards data drifting
Deployed your model using GCP
Monitored the system of your deployed model
Monitored the performance of your deployed model

Additional

Revisit your initial project description. Did the project turn out as you wanted?
Make sure all group members have an understanding of all parts of the project
Create a presentation explaining your project
Uploaded all your code to GitHub
(extra) Implemented pre-commit hooks for your project repository
(extra) Used Optuna to run hyperparameter optimization on your model

About

The following repository is the output of the exam project in the DTU course Machine Learning Operations 02476
