leomaurodesenv / dvc-luigi-nlp

This is a learning repository about DVC (Data Version Control) and Luigi pipelines.

NLP Pipeline using DVC and Luigi

This is a study project that builds an NLP pipeline with DVC and Luigi. The pipeline consists of several tasks that process text data, including preprocessing, feature extraction, and model training. Each step is defined as a Luigi task, which makes dependencies easy to track and lets independent tasks run in parallel. DVC manages data versioning so that runs stay reproducible. The resulting model can be used for text classification or other NLP tasks.

Note: This project contains a top-50 solution for the Kaggle competition Sentiment Analysis on Movie Reviews.
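
To make the idea of tasks with tracked dependencies concrete, here is a minimal plain-Python sketch of the contract a Luigi task follows: a task names what it requires, declares an output, and counts as complete once that output exists. This is a toy model, not Luigi itself and not this repository's code. The task names `ExtractRawData` and `Preprocessing` are borrowed from the log shown under Running, while the file names, the toy data, and the `build` helper are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


class Task:
    """Toy stand-in for a Luigi task: complete once its output file exists."""

    def __init__(self, work_dir):
        self.work_dir = Path(work_dir)

    def requires(self):
        return []  # upstream tasks, none by default

    def output(self):
        raise NotImplementedError

    def complete(self):
        return self.output().exists()

    def run(self):
        raise NotImplementedError


class ExtractRawData(Task):
    def output(self):
        return self.work_dir / "raw.txt"  # hypothetical file name

    def run(self):
        # Hypothetical toy reviews standing in for the real dataset.
        self.output().write_text("A Great Movie\nA DULL plot\n")


class Preprocessing(Task):
    def requires(self):
        return [ExtractRawData(self.work_dir)]

    def output(self):
        return self.work_dir / "clean.txt"  # hypothetical file name

    def run(self):
        raw = self.requires()[0].output().read_text()
        self.output().write_text(raw.lower())


def build(task, log):
    """Run upstream tasks first; skip any task whose output already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


def demo():
    work_dir = tempfile.mkdtemp(prefix="pipeline_sketch_")
    first, second = [], []
    build(Preprocessing(work_dir), first)   # runs both tasks, upstream first
    build(Preprocessing(work_dir), second)  # outputs exist, nothing reruns
    return first, second, (Path(work_dir) / "clean.txt").read_text()
```

Calling `demo()` runs the chain twice. The second pass does no work because every output is already on disk, which is the same property that lets an interrupted pipeline resume where it stopped.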


Code

Download or clone this repository.

Data

  1. Set up your Kaggle API credentials so you can download the data.
  2. Then run the code with Luigi!
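
Before downloading, it can help to check that the Kaggle credentials are where the client expects them. The sketch below assumes the classic API token setup: either the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file with `username` and `key` fields in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `find_kaggle_credentials` is hypothetical and is not part of the Kaggle package or of this repository.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(env=None, home=None):
    """Return 'env', the path of a usable kaggle.json, or None if nothing is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Option 2: a kaggle.json token file in the config directory.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    token = config_dir / "kaggle.json"
    if token.is_file():
        data = json.loads(token.read_text())
        if data.get("username") and data.get("key"):
            return str(token)
    return None
```

If `find_kaggle_credentials()` returns `None`, the download command under Running will most likely fail with an authentication error, so create the token from your Kaggle account settings first.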

Running

## Create a Python environment
$ python -m venv .venv
$ source .venv/bin/activate

## Install requirements
$ pip install -r src/requirements.txt
## Install pre-commit [optional for development]
$ pre-commit install

## Download the dataset
$ kaggle competitions download -c sentiment-analysis-on-movie-reviews -p data

## Running
$ cd source && python -m luigi --module model Predict --local-scheduler
## Output:
# DEBUG: Checking if Predict() is complete
# INFO: Informed scheduler that task   Predict__99914b932b   has status   PENDING
# INFO: Informed scheduler that task   TrainModel__99914b932b   has status   PENDING
# INFO: Informed scheduler that task   Preprocessing__99914b932b   has status   PENDING
# [...]
# INFO: Done scheduling tasks
# INFO: Running Worker with 1 processes
# DEBUG: Asking scheduler for work...
# DEBUG: Pending tasks: 4
# INFO: [pid 13975] Worker Worker(salt=677210727, workers=1, host=CL-PE08WLYF, username=leonardo-moraes, pid=13975) running   ExtractRawData()
# INFO: [pid 13975] Worker Worker(salt=677210727, workers=1, host=CL-PE08WLYF, username=leonardo-moraes, pid=13975) done      ExtractRawData()
# DEBUG: 1 running tasks, waiting for next task to finish
# INFO: Informed scheduler that task   ExtractRawData__99914b932b   has status   DONE
# DEBUG: Asking scheduler for work...
# DEBUG: Pending tasks: 3
# INFO: [pid 13975] Worker Worker(salt=677210727, workers=1, host=CL-PE08WLYF, username=leonardo-moraes, pid=13975) running   Preprocessing()
# INFO: [pid 13975] Worker Worker(salt=677210727, workers=1, host=CL-PE08WLYF, username=leonardo-moraes, pid=13975) done      Preprocessing()
# DEBUG: 1 running tasks, waiting for next task to finish
# INFO: Informed scheduler that task   Preprocessing__99914b932b   has status   DONE
# DEBUG: Asking scheduler for work...
# [...]
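
The identifiers in the log, such as `Predict__99914b932b`, combine the task name with a short hash of the task's parameters. The sketch below reproduces that scheme as I understand it from Luigi: the parameters are serialised as compact JSON, hashed with MD5, and the first ten hex characters are appended to the name. The function name `luigi_style_task_id` is hypothetical, and the handling of parameterised tasks (a short summary of up to three parameter values) is an assumption. Only the parameterless case is confirmed by the log above.

```python
import hashlib
import json
import re


def luigi_style_task_id(task_family, params):
    """Build an ID shaped like the ones in the log: <family>_<summary>_<hash>."""
    # Compact, key-sorted JSON so the same parameters always hash the same way.
    param_str = json.dumps(params, separators=(",", ":"), sort_keys=True)
    param_hash = hashlib.md5(param_str.encode("utf-8")).hexdigest()

    # Assumed detail: a readable summary of the first few parameter values.
    summary = "_".join(str(params[key])[:16] for key in sorted(params)[:3])
    summary = re.sub(r"[^A-Za-z0-9_]", "_", summary)

    return f"{task_family}_{summary}_{param_hash[:10]}"


# Tasks without parameters hash the empty JSON object "{}".
print(luigi_style_task_id("Predict", {}))  # Predict__99914b932b
```

Every task in this pipeline shows the same suffix because none of them takes parameters, so each one hashes the same empty parameter set. The double underscore is the empty summary between the name and the hash.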

About

License: MIT License


Languages

Language: Python 100.0%