kedro-starters-sklearn

This repository provides two starter templates for Kedro 0.18.14: `sklearn-iris` and `sklearn-mlflow-iris`.

Pipeline visualized by Kedro-viz

`sklearn-iris` template

The Iris dataset is included and used by default.

Modification: the species label is encoded as an integer (setosa as 0, versicolor as 1), and the virginica samples were removed.
Split: for each remaining species, the first 25 samples were placed in train.csv and the last 25 samples in test.csv.
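
The modification and split described above can be sketched in plain Python. This is only an illustration of the rule, not the script used to produce the bundled CSV files; the `(features, species)` row layout, the `encode_and_split` name, and the synthetic rows are assumptions made for the example.

```python
from collections import defaultdict

# Label encoding taken from the description above; virginica has no label.
SPECIES_TO_LABEL = {"setosa": 0, "versicolor": 1}


def encode_and_split(rows, n_train=25, n_test=25):
    """Return (train, test) lists of (features, label) pairs."""
    by_species = defaultdict(list)
    for features, species in rows:
        # Rows whose species has no label (virginica) are dropped.
        if species in SPECIES_TO_LABEL:
            by_species[species].append((features, SPECIES_TO_LABEL[species]))

    train, test = [], []
    for samples in by_species.values():
        train.extend(samples[:n_train])  # first 25 samples of the species
        test.extend(samples[-n_test:])   # last 25 samples of the species
    return train, test


# Synthetic stand-in for the Iris table: 50 rows per species, in order.
rows = [
    ([float(i)], name)
    for name in ("setosa", "versicolor", "virginica")
    for i in range(50)
]
train, test = encode_and_split(rows)
print(len(train), len(test))
```

With 50 rows per species, each of the two resulting sets holds 25 setosa and 25 versicolor samples.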

Install dependencies.

```
pip install 'kedro==0.18.14' pandas scikit-learn
```

Generate your Kedro starter project from the `sklearn-iris` directory.
```
kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-iris
```
As explained in Kedro's documentation, enter project_name, repo_name, and python_package when prompted.

Note: For your Python package name, choose a unique name and avoid generic names such as "test" or "sklearn" that are already used by other packages. You can list the importable packages by running `python -c "help('modules')"`.
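
As a quicker alternative to scanning the full module list, a candidate name can be checked directly. The sketch below is a minimal example using only the standard library; the helper name `is_safe_package_name` is hypothetical, and the check covers only the environment it runs in, not Kedro's own naming rules.

```python
import importlib.util
import keyword


def is_safe_package_name(name):
    """Return True if `name` is a valid identifier that no importable module uses."""
    if not name.isidentifier() or keyword.iskeyword(name):
        return False
    # find_spec returns None when no top-level module with this name is importable.
    return importlib.util.find_spec(name) is None


# "json" is part of the standard library, so it is already taken.
print(is_safe_package_name("json"))
```

Run the check in the same Python environment in which the project will be generated, since the set of importable packages differs between environments.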
Change the current directory to the generated project directory.
```
cd /path/to/project/directory
```
Run the project.
```
kedro run
```

Download the Kaggle Titanic dataset.
Replace `train.csv` and `test.csv` in the `/path/to/project/directory/data/01_raw` directory with the downloaded files.
Modify `/path/to/project/directory/conf/base/parameters.yml` to set parameters appropriate for the dataset (they are commented out by default).
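
The file replacement step can also be scripted. The following is a minimal sketch, assuming the downloaded files are already named `train.csv` and `test.csv`; the function name `replace_raw_data` and both directory arguments are placeholders for illustration.

```python
import shutil
from pathlib import Path


def replace_raw_data(download_dir, project_dir, filenames=("train.csv", "test.csv")):
    """Copy the downloaded CSV files over the ones in <project_dir>/data/01_raw."""
    raw_dir = Path(project_dir) / "data" / "01_raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in filenames:
        source = Path(download_dir) / name
        if not source.is_file():
            raise FileNotFoundError(f"Expected {source} to exist")
        # copy2 overwrites an existing destination file and keeps file metadata.
        copied.append(Path(shutil.copy2(source, raw_dir / name)))
    return copied
```

For example, `replace_raw_data("/path/to/downloads", "/path/to/project/directory")` would overwrite both files in `data/01_raw`; the parameters file still has to be edited by hand.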

The `sklearn-mlflow-iris` template integrates MLflow with Kedro using PipelineX. Even without writing MLflow code, you can:

- configure MLflow Tracking.
- log inputs and outputs of Python functions set up as Kedro nodes as parameters (e.g. features used to train the model) and metrics (e.g. F1 score).
- log execution time for each Kedro node and for DataSet loading/saving as metrics.
- log artifacts (e.g. models, an execution time Gantt chart visualized by Plotly, the parameters.yml file).
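
To make the execution-time logging idea concrete, here is a conceptual sketch in plain Python. It is not how PipelineX implements its hooks; the decorator `log_execution_time`, the `logged_metrics` dictionary, and the `train_model` node are hypothetical stand-ins. With MLflow installed, a call to `mlflow.log_metric(key, value)` would take the place of the dictionary write.

```python
import time
from functools import wraps

# Stand-in for an MLflow run: metric name -> value.
logged_metrics = {}


def log_execution_time(func):
    """Wrap a node function and record its wall-clock time as a metric."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the node raises, so slow failures are visible too.
            logged_metrics[f"time_{func.__name__}"] = time.perf_counter() - start

    return wrapper


@log_execution_time
def train_model(features):
    # Hypothetical node function; a real one would fit a scikit-learn model.
    return sum(features) / len(features)


result = train_model([1.0, 2.0, 3.0])
print(result, logged_metrics)
```

The template removes the need for this kind of wrapper, since the timing is collected around each node and dataset operation without changes to the node functions.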

In this template, MLflow logging is configured in Python code at `src/<python_package>/mlflow/mlflow_config.py`.

See here for details.

Install dependencies.

```
pip install 'kedro==0.18.14' pandas scikit-learn mlflow 'pipelinex>=0.7.7' plotly
```

Generate your Kedro starter project from the `sklearn-mlflow-iris` directory.

```
kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-mlflow-iris
```

To access the MLflow web UI, launch the MLflow server.

```
mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri sqlite:///mlruns/sqlite.db --default-artifact-root ./mlruns
```

Logged metrics shown in MLflow's UI

Gantt chart for execution time, generated using Plotly, shown in MLflow's UI