edvardlindelof / powerful-sklearn

Sample project showing how to flexibly and efficiently train machine learning models with scikit-learn.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Powerful scikit-learn workflow

Sample project showing how to flexibly and efficiently train machine learning models with scikit-learn.

Why is this useful?

Because during investigative prototyping of machine learning models, delegating appropriately to the framework can save hundreds of development hours and reduce the amount of code to maintain by thousands of lines.

Training configuration

The code in train.py implements a multivariate classification experiment and illustrates how to apply feature-specific preprocessing,

ColumnTransformer(transformers=[
    ('scaler', StandardScaler(), ['Insulin', 'BloodPressure']),
    ('passthrough', 'passthrough', ['Glucose'])
])

how to use a custom scoring metric,

def roc_auc_without_uncertain_samples(y_true, y_score):
    uncertain = (
        (y_score > np.quantile(y_score, 0.45)) &
        (y_score < np.quantile(y_score, 0.55))
    )
    return roc_auc_score(y_true[~uncertain], y_score[~uncertain])

and how to optimize multiple models and corresponding hyperparameters in a single execution.

[
    (
        RandomForestClassifier(random_state=123),
        {'bootstrap': [True, False], 'max_depth': [3, 20]}
    ),
    (
        SVC(probability=True, random_state=123),
        {'kernel': ['rbf', 'linear'], 'C': [0.1, 10]}
    )
]

Workflow

Get dataset

curl https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv \
    | sed s/Outcome/y/g > data.csv

This downloads the public diabetes dataset used to illustrate the workflow.

Train model

python train.py

This runs model training and saves a GridSearchCV object in a file named search.joblib which contains detailed information about the execution.

Inspect results

report.py shows how to inspect the object stored in search.joblib to acquire rich detail about the training execution, namely the winning model instance,

search.best_estimator_
Pipeline(steps=[
    (
        'columntransformer',
        ColumnTransformer(transformers=[
            ('scaler', StandardScaler(), ['Insulin', 'BloodPressure']),
            ('passthrough', 'passthrough', ['Glucose'])
        ])
    ),
    (
        'classifier',
        RandomForestClassifier(max_depth=3, random_state=123)
    )
])

the hyperparameter optimization results,

df = pd.DataFrame(search.cv_results_)
model2str = lambda model: type(model).__name__
df['param_classifier'] = df['param_classifier'].apply(model2str)
df.filter(regex='param_|mean_test_selection-criterion')
param_classifier param _classifier __bootstrap param _classifier __max_depth param _classifier __C param _classifier __kernel mean_test _selection-criterion
0 RandomForestClassifier 1 3 nan nan 0.786414
1 RandomForestClassifier 1 20 nan nan 0.743816
2 RandomForestClassifier 0 3 nan nan 0.777103
3 RandomForestClassifier 0 20 nan nan 0.696485
4 SVC nan nan 0.1 rbf 0.804755
5 SVC nan nan 0.1 linear 0.798838
6 SVC nan nan 10 rbf 0.779612
7 SVC nan nan 10 linear 0.799157

and what cross validation folds were used.

[test_idx for train_idx, test_idx in search.cv]
[array([  0,   3,   6, ... ]), array([  1,  14,  15, ... ]), array([  2,   4,   5, ... ])]

Generate predictions

predict.py shows how to generate predictions, for illustrative purposes using the training data.

On custom and third-party modules

For projects that require models or preprocessing modules that are implemented in-house or as part of other frameworks, it is often a viable strategy to wrap them inside scikit-learn objects.

Dependency versions used for development

  • Python 3.8.8
  • scikit-learn 0.24.2
  • pandas 1.2.3

About

Sample project showing how to flexibly and efficiently train machine learning models with scikit-learn.

License:MIT License


Languages

Language:Python 100.0%