leomaurodesenv / kedro-sklearn-nlp

This is a learning repository about Kedro, NLP and Pipelines

Kedro using Sklearn and NLP

This repository contains learning code for designing a solution with Kedro. Kedro is an open-source Python framework for creating maintainable and modular data science code as pipelines. In this project, we design a solution for the Detection of Disaster Tweets competition using Natural Language Processing techniques.

Note: This project contains the best performing solution I've gotten in this competition so far.


Solution Architecture

Every pipeline writes its outputs to files, such as model.pickle and data.csv; you can find them in the data folder.

  • Preprocessing - Clean and transform the text into vectors.
  • Training - Train many models, using k-fold cross-validation and grid search.
  • Selection - Select the best model according to a specific metric.
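
The three stages above can be sketched with scikit-learn. This is a minimal illustration, not the repository's actual code: the toy corpus, the TF-IDF vectorizer, the two candidate models, their parameter grids, and the `f1` metric are all assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the tweets (hypothetical data, not from the dataset).
texts = [
    "forest fire spreading near the town",
    "earthquake destroyed several buildings",
    "flood waters rising evacuate now",
    "massive explosion reported downtown",
    "hurricane warning issued for the coast",
    "wildfire smoke covers the valley",
    "i love this sunny weekend",
    "new pizza place opened downtown",
    "watching a movie with friends tonight",
    "my cat sleeps all day long",
    "great concert last night",
    "coffee and a good book today",
]
labels = [1] * 6 + [0] * 6  # 1 = disaster, 0 = not a disaster


def search(estimator, param_grid, texts, labels, scoring="f1"):
    """Preprocess and train: vectorize the text, then grid-search with k-fold CV."""
    pipe = Pipeline([("vectorizer", TfidfVectorizer()), ("model", estimator)])
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    grid = GridSearchCV(pipe, param_grid, scoring=scoring, cv=cv)
    grid.fit(texts, labels)
    return grid


# Candidate models and grids (assumed for illustration).
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"model__C": [0.1, 1.0, 10.0]},
    ),
    "naive_bayes": (MultinomialNB(), {"model__alpha": [0.1, 1.0]}),
}

# Training: one grid search per candidate.
results = {
    name: search(estimator, grid, texts, labels)
    for name, (estimator, grid) in candidates.items()
}

# Selection: keep the candidate with the best cross-validated score.
best_name = max(results, key=lambda name: results[name].best_score_)
best_model = results[best_name].best_estimator_
print(best_name, round(results[best_name].best_score_, 3))
```

Keeping the vectorizer inside the `Pipeline` means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out fold.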


Code

Download or clone this repository.

Data

  1. Download the dataset from Natural Language Processing with Disaster Tweets
  2. Extract all the files into the ./kedro-sklearn/data/01_raw/ folder
  3. Now you can run the code using Kedro!

Running

## Create a Python environment
$ python -m venv .venv
$ source .venv/bin/activate

## Access the Kedro pipelines folder
$ cd kedro-sklearn

## Install requirements
$ pip install -r src/requirements.txt

## Running
$ kedro run
$ kedro run --runner=ParallelRunner # or, run in parallel
## Output:
# 2023-03-28 16:42:31,283 - kedro.framework.session.session - INFO - Kedro project kedro-sklearn
# 2023-03-28 16:42:33,769 - kedro.io.data_catalog - INFO - Loading data from 'train' (CSVDataSet)...
# 2023-03-28 16:42:33,815 - kedro.pipeline.node - INFO - Running node: preprocess_train_node: preprocess_train([train]) -> [train_vectorizer,train_X]
# 2023-03-28 16:42:34,180 - kedro_sklearn.pipelines.preprocessing.nodes - INFO - ## Train preprocessing
# 2023-03-28 16:42:34,180 - kedro_sklearn.pipelines.preprocessing.nodes - INFO - corpus size: 7613
# [...]

## Visualizing pipelines
$ kedro viz
# Open browser: http://127.0.0.1:4141/

License: MIT License
