saran9991 / llm-data-annotation

Use Large Language Models such as OpenAI's GPT-3.5 for data annotation and model enhancement. This framework combines human expertise with LLMs, employs Iterative Active Learning for continuous improvement, and integrates CleanLab (Confident Learning) to ensure high-quality datasets and better model performance.

Large Language Models for Efficient Data Annotation and Model Fine-Tuning with Iterative Active Learning

This framework combines human expertise with the efficiency of Large Language Models (LLMs) such as OpenAI's GPT-3.5 to simplify dataset annotation and model improvement. The iterative approach continuously improves data quality and, consequently, the performance of models fine-tuned on that data. This not only saves time but also enables the creation of customized language models that benefit from both human judgment and LLM-assisted precision.

Architecture

[Architecture diagram]
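
At a high level, one annotation round works roughly as in the sketch below. This is an illustrative outline only: llm_annotate and human_correct are hypothetical stand-ins for the GPT-3.5 call and the Streamlit review step, and the 0.7 confidence threshold is an arbitrary example value.

    import pandas as pd

    def llm_annotate(text: str) -> tuple[str, float]:
        """Hypothetical stand-in for the GPT-3.5 call: returns (label, confidence)."""
        return ("positive", 0.55)

    def human_correct(text: str, suggested: str) -> str:
        """Hypothetical stand-in for the Streamlit review step."""
        return suggested

    df = pd.DataFrame({"text": ["great product", "terrible support"]})

    # 1. The LLM proposes a label and a confidence score for every row.
    df["label"], df["confidence"] = zip(*df["text"].map(llm_annotate))

    # 2. Only low-confidence rows are routed to a human annotator.
    mask = df["confidence"] < 0.7
    df.loc[mask, "label"] = df.loc[mask].apply(
        lambda row: human_correct(row["text"], row["label"]), axis=1
    )

    # 3. The corrected dataset is then checked with CleanLab, versioned, and
    #    used to fine-tune BERT; the cycle repeats as new data arrives.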

Features

  1. Dataset Uploading and Annotation

    • Upload CSV datasets.
    • Leverage GPT-3.5 to automatically annotate datasets (a minimal annotation sketch follows this list).
    • Preview the annotations, with low-confidence rows highlighted.
  2. Manual Annotation Corrections

    • Display the annotated dataset for user-based corrections.
    • Users can update labels for specific rows.
  3. CleanLab: Confident Learning Approach

    • Utilizes confident learning to identify and rectify label issues.
    • Automatically displays rows with potential label errors for user-based corrections.
  4. Data Versioning and Saving

    • Merge user corrections with the annotated dataset.
    • Data versioning saves a unique dataset version for every update.
  5. Model Training

    • Train a BERT model on the cleaned dataset.
    • Track and reproduce model versions seamlessly using MLflow.
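
As a rough illustration of the annotation step (feature 1), the snippet below shows how a single row could be labeled with GPT-3.5 through the openai Python client (v1+ API). The prompt, column names, and label set are illustrative assumptions, not the repository's actual implementation.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def annotate_text(text: str, labels: list[str]) -> str:
        """Ask GPT-3.5 to pick one label for a single row of the dataset."""
        prompt = (
            f"Classify the following text into one of {labels}. "
            f"Respond with the label only.\n\nText: {text}"
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    # Example: label every row of an uploaded CSV (column names are assumptions).
    # import pandas as pd
    # df = pd.read_csv("dataset.csv")
    # df["label"] = df["text"].apply(lambda t: annotate_text(t, ["positive", "negative"]))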

Setup

Prerequisites

  1. Install the required packages:
    pip install -r requirements.txt

Running the Tool

  1. Start the FastAPI backend:

    uvicorn app:app --reload
  2. Run the Streamlit app:

    streamlit run frontend.py
  3. Launch the MLflow UI to view runs, metrics, and registered models (a minimal logging sketch follows this list):

    mlflow ui
  4. Access the provided links in your web browser:

    • For the main application, open the Streamlit link.
    • For MLflow's tracking interface, navigate to http://127.0.0.1:5000 (the default address).
  5. Follow the on-screen prompts to upload, annotate, correct, and train on your dataset.
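
As a reference for how runs end up in the MLflow UI, the sketch below logs a fine-tuning run. The experiment name, parameters, and metric value are placeholders; the repository's actual training script may log different values and use a specific model flavor (e.g. mlflow.pytorch).

    import mlflow

    # Placeholder experiment/run names and values for illustration only.
    mlflow.set_experiment("llm-data-annotation")

    with mlflow.start_run(run_name="bert-finetune"):
        mlflow.log_param("base_model", "bert-base-uncased")
        mlflow.log_param("epochs", 3)
        mlflow.log_param("dataset_version", "v2")

        # ... fine-tune the BERT model on the cleaned dataset here ...

        mlflow.log_metric("eval_accuracy", 0.91)  # placeholder value
        # mlflow.pytorch.log_model(model, artifact_path="model",
        #                          registered_model_name="annotation-bert")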

Why Confident Learning?

Confident learning has emerged as an effective technique in supervised learning and weak supervision. It characterizes label noise, finds label errors, and supports learning efficiently with noisy labels. By pruning noisy data and ranking examples so that training focuses on confidently labeled ones, it helps produce a clean, reliable dataset and improves overall model performance.
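
A minimal sketch of this step with the cleanlab 2.x API is shown below; the labels and predicted probabilities are toy values, and the repository's integration may differ in detail.

    import numpy as np
    from cleanlab.filter import find_label_issues

    # `labels` are the current (possibly noisy) integer labels; `pred_probs` are
    # out-of-sample predicted class probabilities from a trained classifier.
    labels = np.array([0, 1, 1, 0, 1])
    pred_probs = np.array([
        [0.9, 0.1],
        [0.2, 0.8],
        [0.7, 0.3],  # labeled 1, but the model is fairly confident it is class 0
        [0.8, 0.2],
        [0.1, 0.9],
    ])

    issue_indices = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
    print(issue_indices)  # indices of rows most likely to be mislabeled, worst first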

License

This project is open-sourced under the MIT License.

