Speed up Dashboard joblib/yaml export
Rov7 opened this issue · comments
Hey, I've been trying to create an easy-to-understand ML dashboard, and luckily I found your library. The problem appears when I try to deploy this dashboard to a web application in a production environment with a large amount of data.
I tried to follow all your recommendations to reduce memory usage, but it's still slow.
In the part where I export the joblib/yaml, the function takes more than 2 hours to execute for 1–2 million rows. Do you know any way to speed up the process, or some other way to solve this problem?
Thanks a lot!
This is the function:

```python
for i, j, k in zip(dfs, models, tar):
    blob = bucket.blob(j)
    search = pickle.loads(blob.download_as_bytes())
    # random row subsample: keep roughly percen_rows of the data rows
    # (lambda parameter renamed so it doesn't shadow the loop variable i)
    df = pd.read_csv('gs://sas-dd-udd-t1-ml-dtf-vertex/' + i,
                     header=0,
                     skiprows=lambda row: row > 0 and random.random() > percen_rows)
    df.set_index(df['huyt'], inplace=True)
    model = search.best_estimator_
    y = df[k]
    df = df.reindex(columns=model.get_booster().feature_names, fill_value=-999)
    X = df
    explainer = ClassifierExplainer(model, X, y, precision='float32')
    db = ExplainerDashboard(explainer,
                            whatif=False,
                            contributions=False,
                            decision_trees=False,
                            shap_dependence=False,
                            shap_interaction=False,
                            no_permutations=True)
    db.to_yaml(f"explainer_{k}.yaml", explainerfile=f"explainer_{k}.joblib",
               dump_explainer=True)
    blob = bucket_joblibs.blob(path + f"explainer_{k}.joblib")
    blob1 = bucket_joblibs.blob(path + f"explainer_{k}.yaml")
    blob.upload_from_filename(f"explainer_{k}.joblib")
    blob1.upload_from_filename(f"explainer_{k}.yaml")
return None
```
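The `skiprows` lambda above does random row subsampling: each data row is kept with probability `percen_rows`. A self-contained sketch of the same idea using only the standard library (the CSV content and column names here are made up):

```python
import csv
import io
import random

def sample_rows(csv_text, keep_fraction, seed=0):
    """Keep the header plus roughly keep_fraction of the data rows,
    mirroring the skiprows=lambda trick used with pd.read_csv."""
    rng = random.Random(seed)  # seeded for reproducibility
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    kept = [row for row in data if rng.random() <= keep_fraction]
    return [header] + kept

# toy data: 1000 data rows, keep ~10% of them
csv_text = "huyt,feature\n" + "\n".join(f"id{n},{n}" for n in range(1000))
sampled = sample_rows(csv_text, 0.1)
print(sampled[0], len(sampled) - 1)
```

With a fixed seed the sample is reproducible, which helps when comparing dashboard build times across runs.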
Do you know if your bottleneck is, by any chance, the SHAP values calculation? If so, I recommend calculating the SHAP values directly with the SHAP library and setting `approximate=True`; at least, that's what I did for cases with many rows, like this:
```python
import shap

shap_explainer = shap.Explainer(predictor, X_transformed)
shap_values = shap_explainer.shap_values(X_transformed,
                                         check_additivity=False,
                                         approximate=True)
base_values = shap_explainer.expected_value

if model_type == REGRESSOR:
    explainer = RegressionExplainer(model=predictor, X=X_transformed,
                                    n_jobs=-1, index_name="Block ID",
                                    precision="float32", target="Log(DEPVAR)")
    explainer.set_shap_values(base_values, shap_values)
```
This cut my SHAP value calculation for a 500k-row dataset from days to a couple of seconds.
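Before switching approaches, it's worth confirming that the SHAP computation really is the slow step. A minimal standard-library timing helper can be wrapped around each stage of the loop (the label and the summed range below are just placeholders):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block of code took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# usage sketch: wrap each stage (CSV load, explainer build, dump, upload)
with timed("placeholder work"):
    total = sum(range(1_000_000))
```

Timing the CSV load, the `ClassifierExplainer` construction, `to_yaml`, and the upload separately makes it obvious where the 2 hours are going.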
You could also pass `shap_kwargs=dict(approximate=True)` to the `RegressionExplainer` for the same effect. Will add this to the README!
Thank you both for your answers. That sped up the process dramatically; I'll add it to my script.
I'll close the topic, again, thank you very much!