Speed up Dashboard joblib/yaml export
Rov7 opened this issue · comments
Hey, I've been trying to create an easy-to-understand ML dashboard, and luckily I found your library. The problem appears when I try to deploy this dashboard to a web application in a production environment with a large amount of data.
I tried to follow all your recommendations to reduce memory usage, but it's still slow.
In the part where I export the joblib/yaml, the function takes more than 2 hours to execute for 1–2 million rows. Do you know any way to speed up the process, or some other way to solve this problem?
Thanks a lot!
This is the function:

```python
for i, j, k in zip(dfs, models, tar):
    blob = bucket.blob(j)
    search = pickle.loads(blob.download_as_bytes())
    # random row subsample: keep roughly percen_rows of the data rows
    # (lambda parameter renamed so it doesn't shadow the loop variable i)
    df = pd.read_csv('gs://sas-dd-udd-t1-ml-dtf-vertex/' + i,
                     header=0,
                     skiprows=lambda row: row > 0 and random.random() > percen_rows)
    df.set_index(df['huyt'], inplace=True)
    model = search.best_estimator_
    y = df[k]
    df = df.reindex(columns=model.get_booster().feature_names, fill_value=-999)
    X = df
    explainer = ClassifierExplainer(model, X, y, precision='float32')
    db = ExplainerDashboard(explainer,
                            whatif=False,
                            contributions=False,
                            decision_trees=False,
                            shap_dependence=False,
                            shap_interaction=False,
                            no_permutations=True)
    db.to_yaml(f"explainer_{k}.yaml", explainerfile=f"explainer_{k}.joblib",
               dump_explainer=True)
    blob = bucket_joblibs.blob(path + f"explainer_{k}.joblib")
    blob1 = bucket_joblibs.blob(path + f"explainer_{k}.yaml")
    blob.upload_from_filename(f"explainer_{k}.joblib")
    blob1.upload_from_filename(f"explainer_{k}.yaml")
return None
```
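The `skiprows` lambda above does random row subsampling: each data row is kept with probability `percen_rows`. A self-contained sketch of the same idea using only the standard library (the CSV content and column names here are made up):

```python
import csv
import io
import random

def sample_rows(csv_text, keep_fraction, seed=0):
    """Keep the header plus roughly keep_fraction of the data rows,
    mirroring the skiprows=lambda trick used with pd.read_csv."""
    rng = random.Random(seed)  # seeded for reproducibility
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    kept = [row for row in data if rng.random() <= keep_fraction]
    return [header] + kept

# toy data: 1000 data rows, keep ~10% of them
csv_text = "huyt,feature\n" + "\n".join(f"id{n},{n}" for n in range(1000))
sampled = sample_rows(csv_text, 0.1)
print(sampled[0], len(sampled) - 1)
```

With a fixed seed the sample is reproducible, which helps when comparing dashboard build times across runs.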
Do you know if your bottleneck is, by any chance, the SHAP values calculation? If so, I recommend calculating the SHAP values directly with the SHAP library and setting `approximate=True`; at least, that's what I did for cases with many rows, like this:
```python
import shap

shap_explainer = shap.Explainer(predictor, X_transformed)
shap_values = shap_explainer.shap_values(X_transformed,
                                         check_additivity=False,
                                         approximate=True)
base_values = shap_explainer.expected_value

if model_type == REGRESSOR:
    explainer = RegressionExplainer(model=predictor, X=X_transformed,
                                    n_jobs=-1, index_name="Block ID",
                                    precision="float32", target="Log(DEPVAR)")
    explainer.set_shap_values(base_values, shap_values)
```
This cut my SHAP value calculation for a 500k-row dataset from days to a couple of seconds.
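Before switching approaches, it's worth confirming that the SHAP computation really is the slow step. A minimal standard-library timing helper can be wrapped around each stage of the loop (the label and the summed range below are just placeholders):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block of code took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# usage sketch: wrap each stage (CSV load, explainer build, dump, upload)
with timed("placeholder work"):
    total = sum(range(1_000_000))
```

Timing the CSV load, the `ClassifierExplainer` construction, `to_yaml`, and the upload separately makes it obvious where the 2 hours are going.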
You could also pass `shap_kwargs=dict(approximate=True)` to the `RegressionExplainer` for the same effect. Will add this to the README!
Thank you both for your answers. That sped up the process dramatically; I'll add it to my script.
I'll close the topic, again, thank you very much!