h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Home Page: http://h2o.ai


Fix as_data_frame so it does not use CSV as an intermediate format

wendycwong opened this issue

Per discussion with @tomasfryda, I am going to do the following to as_data_frame:

  1. Add a multi_thread flag that defaults to False.
  2. If multi_thread is set to True, export the frame to a Parquet file and read that file back into a pandas DataFrame.

Here is an extract from my conversation with Tomas:

I think the approach of making our work easier by using other libraries was wrong (since it brought more issues), and we should have done something like exporting parquet/pyarrow/feather from the backend and then loading it in pandas. It would be much faster and less error-prone than writing CSV and reading CSV (and it would likely require less RAM because the format is binary).

I briefly looked at this. However, the following error occurs, and I believe it is caused by this line:

exportFile = tempfile.NamedTemporaryFile(suffix=".h2oframe2Convert.csv", delete=False)

The file is created immediately with zero bytes, so the path already exists when the export runs, and the following check raises the error:

if (!H2O.getPM().isEmptyDirectoryAllNodes(path)) {
  throw new H2OIllegalArgumentException(path, "exportFrame", "Cannot use path " + path +
          " to store part files! The target needs to be either an existing empty directory or not exist yet.");
}
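The clash can be reproduced with the Python standard library alone, without an H2O cluster. The sketch below shows that NamedTemporaryFile puts a zero-byte file on disk as soon as it is called, and then shows one possible way around it: create an empty temporary directory and build a path inside it that does not exist yet. The directory prefix and the frame.parquet file name are made up for illustration, and whether the backend accepts such a path is only inferred from the error message above, not verified. It also assumes the client and the H2O node share a filesystem.

```python
import os
import tempfile

# NamedTemporaryFile creates the file on disk right away, with zero bytes.
tmp = tempfile.NamedTemporaryFile(suffix=".h2oframe2Convert.csv", delete=False)
tmp.close()
file_exists = os.path.exists(tmp.name)   # the path is already taken
file_size = os.path.getsize(tmp.name)    # nothing has been written to it
os.remove(tmp.name)

# Possible alternative (assumption): reserve an empty directory instead and
# hand the exporter either the directory itself or a path inside it that
# has not been created yet.
export_dir = tempfile.mkdtemp(prefix="h2oframe2Convert_")    # hypothetical prefix
dir_is_empty = os.listdir(export_dir) == []
export_path = os.path.join(export_dir, "frame.parquet")      # hypothetical name
path_exists = os.path.exists(export_path)

# Clean up the directory created for this demonstration.
os.rmdir(export_dir)
```

Either export_dir (an existing empty directory) or export_path (a target that does not exist yet) matches the wording of the error message, so one of them may be a better argument for the export call than the pre-created temporary file.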

More time needs to be invested here.