log_classification_metrics calculates wrong set of labels when np.nan is present in columns
FelipeAdachi opened this issue
Description
The following code:

```python
import numpy as np
import pandas as pd
import whylogs as why

df = pd.DataFrame({
    "label_output": [0.0, 0.0, 1.0, 1.0, np.nan] * 10000,
    "pred_output": [0.0, 1.0, 1.0, 0.0, np.nan] * 10000,
    "pd_output": [0.2, 0.3, 0.4, 0.5, np.nan] * 10000,
})

why.log_classification_metrics(
    df,
    target_column="label_output",
    prediction_column="pred_output",
    score_column="pd_output",
    log_full_data=True,
)
```
raises the following error:

```
ValueError: The initialized confusion matrix has 20002 labels and the resulting confusion matrix will be larger than is supported by whylogs current representation of the model metric for a confusion matrix of this size, selectively log the most important labels or configure the threshold of {MODEL_METRICS_MAX_LABELS} higher by setting MODEL_METRICS_MAX_LABELS.
```
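As a workaround until this is fixed, the NaN rows can be dropped before logging (a sketch, assuming rows with NaN in any of the three columns carry no information for the classification metric):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "label_output": [0.0, 0.0, 1.0, 1.0, np.nan] * 10000,
    "pred_output": [0.0, 1.0, 1.0, 0.0, np.nan] * 10000,
    "pd_output": [0.2, 0.3, 0.4, 0.5, np.nan] * 10000,
})

# Drop any row where one of the metric columns is NaN; only the two
# real labels 0.0 and 1.0 remain in the target/prediction columns.
clean = df.dropna(subset=["label_output", "pred_output", "pd_output"])

# Logging the cleaned frame instead of df avoids the oversized
# confusion matrix:
# why.log_classification_metrics(
#     clean,
#     target_column="label_output",
#     prediction_column="pred_output",
#     score_column="pd_output",
#     log_full_data=True,
# )
```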
This happens because the labels are computed in model_performance_metrics.py as
`labels = sorted(list(set(targets + predictions)))`,
and the NaN values are included multiple times in the resulting list of labels: NaN compares unequal to everything, including itself, so distinct NaN float objects are never deduplicated by the set.
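A minimal illustration of the deduplication failure, in plain Python and independent of whylogs (the `finite_labels` helper is illustrative, not part of the library):

```python
import math

# Each NaN parsed from a separate row is a distinct float object.
a = float("nan")
b = float("nan")

# Set membership checks `x is y or x == y`; a NaN fails both checks
# against a *different* NaN object, so both copies survive.
labels = {0.0, 1.0, a, b}
print(len(labels))  # prints 4, not 3

# One way to build a clean label set is to filter NaN explicitly
# before sorting (illustrative helper):
def finite_labels(values):
    return sorted(v for v in set(values) if not math.isnan(v))

print(finite_labels([0.0, 1.0, a, b]))  # [0.0, 1.0]
```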
- I have reviewed the Guidelines for Contributing and the Code of Conduct.