Fix as_data_frame and not use csv as a medium

Question

wendycwong opened this issue 2 months ago · comments

Per discussion with @tomasfryda, I am going to do the following to as_data_frame:

add a multi_thread flag that defaults to False;
if multi_thread is set to True, export the frame to a Parquet file and read it back into a pandas frame.
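
The plan above could be sketched roughly as follows. This is only an illustration, not the H2O implementation: `frame_to_pandas`, `export_frame`, `read_csv`, and `read_parquet` are hypothetical names, and the export and read steps are passed in as callables so the sketch does not assume any particular backend call. In practice the readers would likely be `pandas.read_csv` and `pandas.read_parquet`.

```python
import os
import tempfile


def frame_to_pandas(export_frame, read_csv, read_parquet, multi_thread=False):
    """Hypothetical sketch of the proposed as_data_frame flow.

    export_frame(path, fmt): assumed callable that asks the backend to write
        the frame to `path` in the given format ("csv" or "parquet").
    read_csv / read_parquet: assumed callables that load `path` eagerly,
        e.g. pandas.read_csv and pandas.read_parquet.
    """
    # The flag only selects the intermediate format and the matching reader.
    fmt = "parquet" if multi_thread else "csv"
    reader = read_parquet if multi_thread else read_csv

    # A fresh temporary directory guarantees the target path does not exist yet.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "h2oframe2Convert." + fmt)
        export_frame(path, fmt)
        # The reader must finish loading before the directory is cleaned up.
        return reader(path)
```

Keeping the default at False means existing callers would keep the CSV path unless they opt in.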

Here is an extract from my conversation with Tomas:

I think the approach of making our work easier by using other libraries was wrong (since it brought more issues), and we should have done something like exporting parquet/pyarrow/feather from the backend and then loading it in pandas. It would be much faster and less error-prone than writing CSV and reading CSV (and it would likely require less RAM because it is a binary format).

wendycwong · Answer 1 · Wed May 29 2024 04:21:51 GMT+0800 (China Standard Time)

I briefly looked at this. However, the following error occurs, and I believe it is caused by this line:

exportFile = tempfile.NamedTemporaryFile(suffix=".h2oframe2Convert.csv", delete=False)

The file is created immediately with zero bytes, so the path already exists, which trips this check:

if (!H2O.getPM().isEmptyDirectoryAllNodes(path)) {
  throw new H2OIllegalArgumentException(path, "exportFrame", "Cannot use path " + path +
          " to store part files! The target needs to be either an existing empty directory or not exist yet.");
}

More time needs to be invested here.
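
The zero-byte file described above can be reproduced with the standard library alone. The second half shows one possible way to get a target that "does not exist yet", as the error message asks for. It is a sketch under assumptions, not a tested fix: the `.parquet` file name is made up for illustration, and whether the backend accepts such a path has not been verified here.

```python
import os
import tempfile

# NamedTemporaryFile creates the file on disk right away, with zero bytes.
export_file = tempfile.NamedTemporaryFile(suffix=".h2oframe2Convert.csv", delete=False)
export_file.close()
file_exists = os.path.exists(export_file.name)
file_size = os.path.getsize(export_file.name)
os.remove(export_file.name)  # clean up the leftover file

# Possible alternative: reserve a temporary directory and build a path inside
# it. The directory exists, but the path itself has not been created.
with tempfile.TemporaryDirectory() as tmp_dir:
    fresh_path = os.path.join(tmp_dir, "h2oframe2Convert.parquet")  # hypothetical name
    fresh_exists = os.path.exists(fresh_path)

print(file_exists, file_size, fresh_exists)
```

With the first approach the path exists as a regular file, so it is neither an empty directory nor a missing path. With the second, the path is missing until the export writes to it.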