log_classification_metrics calculates wrong set of labels when np.nan is present in columns
FelipeAdachi opened this issue
Description
The following code:

```python
import numpy as np
import pandas as pd
import whylogs as why

df = pd.DataFrame({
    "label_output": [0.0, 0.0, 1.0, 1.0, np.nan] * 10000,
    "pred_output": [0.0, 1.0, 1.0, 0.0, np.nan] * 10000,
    "pd_output": [0.2, 0.3, 0.4, 0.5, np.nan] * 10000,
})

why.log_classification_metrics(
    df,
    target_column="label_output",
    prediction_column="pred_output",
    score_column="pd_output",
    log_full_data=True,
)
```
raises the following error:

```
ValueError: The initialized confusion matrix has 20002 labels and the resulting confusion matrix will be larger than is supported by whylogs current representation of the model metric for a confusion matrix of this size, selectively log the most important labels or configure the threshold of {MODEL_METRICS_MAX_LABELS} higher by setting MODEL_METRICS_MAX_LABELS.
```
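As a workaround until this is fixed, the NaN rows can be dropped before logging (a sketch, assuming rows with NaN in any of the three columns carry no information for the classification metric):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "label_output": [0.0, 0.0, 1.0, 1.0, np.nan] * 10000,
    "pred_output": [0.0, 1.0, 1.0, 0.0, np.nan] * 10000,
    "pd_output": [0.2, 0.3, 0.4, 0.5, np.nan] * 10000,
})

# Drop any row where one of the metric columns is NaN; only the two
# real labels 0.0 and 1.0 remain in the target/prediction columns.
clean = df.dropna(subset=["label_output", "pred_output", "pd_output"])

# Logging the cleaned frame instead of df avoids the oversized
# confusion matrix:
# why.log_classification_metrics(
#     clean,
#     target_column="label_output",
#     prediction_column="pred_output",
#     score_column="pd_output",
#     log_full_data=True,
# )
```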
This happens because the labels are computed in model_performance_metrics.py as
`labels = sorted(list(set(targets + predictions)))`,
and the NaN values are included multiple times in the resulting list of labels: NaN compares unequal to everything, including itself, so distinct NaN float objects are never deduplicated by the set.
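A minimal illustration of the deduplication failure, in plain Python and independent of whylogs (the `finite_labels` helper is illustrative, not part of the library):

```python
import math

# Each NaN parsed from a separate row is a distinct float object.
a = float("nan")
b = float("nan")

# Set membership checks `x is y or x == y`; a NaN fails both checks
# against a *different* NaN object, so both copies survive.
labels = {0.0, 1.0, a, b}
print(len(labels))  # prints 4, not 3

# One way to build a clean label set is to filter NaN explicitly
# before sorting (illustrative helper):
def finite_labels(values):
    return sorted(v for v in set(values) if not math.isnan(v))

print(finite_labels([0.0, 1.0, a, b]))  # [0.0, 1.0]
```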
- I have reviewed the Guidelines for Contributing and the Code of Conduct.