Optimize creating DataFrame/struct column from a list of tuples
wenleix opened this issue
Wenlei Xie commented
Motivating example (the actual dataset has two struct columns, with 13 and 26 fields respectively):
```python
import torcharrow as ta
import torcharrow.dtypes as dt

dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field("dense_features", dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)])),
    ]
)
df = ta.DataFrame(
    [
        (1, (0, 1)),
        (0, (10, 11)),
        # ~100 rows
    ],
    dtype=dtype,
)
```
The current implementation first creates an empty DataFrame and then appends each tuple:
https://github.com/facebookresearch/torcharrow/blob/f4fcfde9dde488276d11b04aa2a57df2835fee0b/torcharrow/scope.py#L292-L296
There are two inefficiencies with it:
- For each tuple, it is first converted into a dict (https://github.com/facebookresearch/torcharrow/blob/f4fcfde9dde488276d11b04aa2a57df2835fee0b/torcharrow/velox_rt/dataframe_cpu.py#L228-L232)
- Velox vectors are immutable, so each append creates a new vector of size `n + 1`, causing `O(n^2)` behavior.
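To make the second point concrete, here is a pure-Python sketch (not torcharrow code) of the append-per-row pattern: with an immutable vector, every append copies all existing elements into a new vector, so building n rows costs 1 + 2 + ... + n, i.e. `O(n^2)` element copies.

```python
def append_immutable(vec, row):
    # Returns a new vector of size len(vec) + 1; the old vector is unchanged.
    return vec + (row,)

copies = 0
vec = ()
rows = [(1, (0, 1)), (0, (10, 11)), (1, (20, 21))]
for row in rows:
    copies += len(vec) + 1  # elements written into the new vector
    vec = append_immutable(vec, row)

print(vec)     # all three rows present
print(copies)  # 1 + 2 + 3 = 6 element copies for 3 rows
```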
One idea is to first build a list per column, and then call the Column constructor when creating the DataFrame/struct column. We still need to "transpose" the data in Python, but at least we avoid the expensive `O(n^2)` behavior.
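A minimal sketch of the transpose step, assuming the example data above (the per-column `ta.Column(data, dtype)` calls at the end are an assumption about the constructor API, not the actual implementation):

```python
rows = [(1, (0, 1)), (0, (10, 11))]

# One pass over the data: zip(*rows) turns the list of row tuples
# into per-column tuples, which we materialize as lists. This is O(n)
# rather than O(n^2) immutable-vector appends.
labels, dense_features = map(list, zip(*rows))

print(labels)          # [1, 0]
print(dense_features)  # [(0, 1), (10, 11)]

# Then, hypothetically, one constructor call per column:
#   ta.Column(labels, dt.int8)
#   ta.Column(dense_features, dt.Struct([...]))
```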