queryverse / ParquetFiles.jl

FileIO.jl integration for Parquet files


Reading Parquet to DataFrame is slow

tclements opened this issue

Reading a parquet file into a DataFrame is ~170× slower than using CSV.read with the same data. I'm not sure I can help improve performance, but this is limiting my use of ParquetFiles.jl.

MWE:

(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
  [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.6.2
  [a93c6f00] DataFrames v0.21.2
  [626c502c] Parquet v0.4.0
  [46a55296] ParquetFiles v0.2.0

using ParquetFiles, BenchmarkTools, CSV, DataFrames
CSV.read("data.csv")
DataFrame(load("data.parquet"))

Loading times for ParquetFiles:

@benchmark DataFrame(load("data.parquet"))
BenchmarkTools.Trial: 
  memory estimate:  45.66 MiB
  allocs estimate:  961290
  --------------
  minimum time:     287.492 ms (0.00% GC)
  median time:      290.843 ms (0.00% GC)
  mean time:        296.344 ms (1.64% GC)
  maximum time:     326.041 ms (8.46% GC)
  --------------
  samples:          17
  evals/sample:     1

Loading times for CSV:

@benchmark CSV.read("data.csv")
BenchmarkTools.Trial: 
  memory estimate:  758.14 KiB
  allocs estimate:  2299
  --------------
  minimum time:     1.690 ms (0.00% GC)
  median time:      1.735 ms (0.00% GC)
  mean time:        1.772 ms (1.43% GC)
  maximum time:     14.096 ms (63.93% GC)
  --------------
  samples:          2817
  evals/sample:     1

As compared to pandas:

import pandas as pd
%timeit pd.read_parquet("data.parquet")                                                                                                                                          
# 3.61 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.read_csv("data.csv")                                                                                                                                                  
# 4.73 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Data are included in the attached zip file:
data.zip

I think one of the reasons is that ParquetFiles.jl doesn't implement the Tables.columns interface, so DataFrame(...) takes the fallback path of appending row by row.
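
For illustration, here is a minimal sketch of what a Tables.columns implementation for a columnar source could look like. The ColumnSource type and its fields are hypothetical, not ParquetFiles.jl internals; the point is only that a source declaring column access hands whole vectors to the sink instead of being iterated row by row:

using Tables, DataFrames

# Hypothetical columnar source standing in for a parquet reader's
# materialized column chunks (illustrative only, not ParquetFiles.jl code).
struct ColumnSource
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

Tables.istable(::Type{ColumnSource}) = true
Tables.columnaccess(::Type{ColumnSource}) = true
# Return the columns as a NamedTuple of vectors; sinks like DataFrame
# can adopt these directly, with no per-row work.
Tables.columns(t::ColumnSource) = NamedTuple{Tuple(t.names)}(Tuple(t.columns))

src = ColumnSource([:a, :b], [collect(1:10^6), rand(10^6)])
df = DataFrame(src)  # built from whole columns, not row appends

With column access declared, the DataFrame constructor never touches a row iterator, which is why implementing Tables.columns should remove most of the per-row allocation overhead visible in the benchmark above.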