oegedijk / explainerdashboard

Quickly build Explainable AI dashboards that show the inner workings of so-called "blackbox" machine learning models.

Home Page: http://explainerdashboard.readthedocs.io

Speed up Dashboard joblib/yaml export

Rov7 opened this issue · comments

commented

Hey, I've been trying to build an easy-to-understand ML dashboard, and luckily I found your library. The problem appeared when I tried to deploy the dashboard to a web application in a production environment with a large amount of data.

I tried to follow all your recommendations for reducing memory usage, but it's still slow.

In the step where I export the joblib/yaml, the function takes more than 2 hours to run for 1-2 million rows. Do you know any way to speed up the process, or some other way to solve this problem?

Thanks a lot!

This is the function:

import pickle
import random

import pandas as pd
from explainerdashboard import ClassifierExplainer, ExplainerDashboard

# dfs, models, tar, bucket, bucket_joblibs, path and percen_rows are defined elsewhere
for i, j, k in zip(dfs, models, tar):
    # load the pickled search object (e.g. a fitted GridSearchCV) from GCS
    blob = bucket.blob(j)
    search = pickle.loads(blob.download_as_bytes())

    # read a random sample of roughly percen_rows of the csv rows
    df = pd.read_csv('gs://sas-dd-udd-t1-ml-dtf-vertex/' + i,
                     header=0,
                     skiprows=lambda n: n > 0 and random.random() > percen_rows)
    df.set_index('huyt', inplace=True)

    model = search.best_estimator_

    # target column and feature matrix, reordered to match the booster's feature names
    y = df[k]
    X = df.reindex(columns=model.get_booster().feature_names, fill_value=-999)

    explainer = ClassifierExplainer(model, X, y, precision='float32')
    db = ExplainerDashboard(explainer,
                            whatif=False,
                            contributions=False,
                            decision_trees=False,
                            shap_dependence=False,
                            shap_interaction=False,
                            no_permutations=True)

    # this is the slow step: dump_explainer=True computes and stores all explainer properties
    db.to_yaml(f"explainer_{k}.yaml",
               explainerfile=f"explainer_{k}.joblib",
               dump_explainer=True)

    # upload the exported joblib and yaml files back to GCS
    blob = bucket_joblibs.blob(path + f"explainer_{k}.joblib")
    blob1 = bucket_joblibs.blob(path + f"explainer_{k}.yaml")
    blob.upload_from_filename(f"explainer_{k}.joblib")
    blob1.upload_from_filename(f"explainer_{k}.yaml")

Do you know if your bottleneck is by any chance the SHAP values calculation? If so, I recommend calculating the SHAP values directly with the SHAP library and setting approximate to True; at least that's what I did for cases with many rows, like this:

import shap
from explainerdashboard import RegressionExplainer

# compute approximate Tree SHAP values once, outside explainerdashboard
shap_explainer = shap.Explainer(predictor, X_transformed)
shap_values = shap_explainer.shap_values(X_transformed, check_additivity=False, approximate=True)
base_values = shap_explainer.expected_value

if model_type == REGRESSOR:
    explainer = RegressionExplainer(model=predictor, X=X_transformed, n_jobs=-1,
                                    index_name="Block ID",
                                    precision="float32", target="Log(DEPVAR)")
    # hand the precomputed shap values to the explainer so it skips its own calculation
    explainer.set_shap_values(base_values, shap_values)

This cut my SHAP value calculation on a 500k-row dataset from days to a couple of seconds.
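If you want to sanity-check how much accuracy the approximation costs, you can compare approximate and exact values on a small sample first. A quick sketch, assuming a fitted tree-based model `model` and a feature DataFrame `X` (hypothetical names, not from the code above):

import numpy as np
import shap

sample = X.sample(1000, random_state=0)      # small sample keeps the exact run cheap
tree_explainer = shap.TreeExplainer(model)

exact = tree_explainer.shap_values(sample)   # exact Tree SHAP
approx = tree_explainer.shap_values(sample, approximate=True,
                                    check_additivity=False)

# overall mean absolute difference gives a feel for the approximation error
print(np.abs(np.asarray(exact) - np.asarray(approx)).mean())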

You could also pass shap_kwargs=dict(approximate=True) to the RegressionExplainer for the same effect. Will add this to the README!
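For the classifier setup from this issue, that would look roughly like the sketch below (assuming ClassifierExplainer accepts the same shap_kwargs argument, and reusing the model, X and y built in the loop above):

from explainerdashboard import ClassifierExplainer, ExplainerDashboard

# shap_kwargs is forwarded to the underlying SHAP call, so the approximate
# Tree SHAP algorithm is used when the explainer computes its shap values
explainer = ClassifierExplainer(model, X, y,
                                precision='float32',
                                shap_kwargs=dict(approximate=True))

db = ExplainerDashboard(explainer, whatif=False, contributions=False,
                        decision_trees=False, shap_interaction=False)
db.to_yaml("explainer.yaml", explainerfile="explainer.joblib",
           dump_explainer=True)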

commented

Thank you both for your answers. That sped up the process enormously; I'll add that to my script.

I'll close the issue. Again, thank you very much!