Optimize creating DataFrame/struct column from a list of tuples
wenleix opened this issue
Wenlei Xie commented
Motivating example (the actual dataset has two struct columns, with 13 and 26 fields respectively):
```python
import torcharrow as ta
import torcharrow.dtypes as dt

dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field("dense_features", dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)])),
    ]
)
df = ta.DataFrame(
    [
        (1, (0, 1)),
        (0, (10, 11)),
        # ~100 rows
    ],
    dtype=dtype,
)
```
The current implementation first creates an empty DataFrame and then appends each tuple:
https://github.com/facebookresearch/torcharrow/blob/f4fcfde9dde488276d11b04aa2a57df2835fee0b/torcharrow/scope.py#L292-L296
There are two inefficiencies with it:
- For each tuple, it is first converted into a dict (https://github.com/facebookresearch/torcharrow/blob/f4fcfde9dde488276d11b04aa2a57df2835fee0b/torcharrow/velox_rt/dataframe_cpu.py#L228-L232)
- Velox vectors are immutable, so each append creates a new vector of size `n + 1`, causing `O(n^2)` behavior.
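To make the second point concrete, here is a pure-Python sketch (not torcharrow code) of the append-per-row pattern: with an immutable vector, every append copies all existing elements into a new vector, so building n rows costs 1 + 2 + ... + n, i.e. `O(n^2)` element copies.

```python
def append_immutable(vec, row):
    # Returns a new vector of size len(vec) + 1; the old vector is unchanged.
    return vec + (row,)

copies = 0
vec = ()
rows = [(1, (0, 1)), (0, (10, 11)), (1, (20, 21))]
for row in rows:
    copies += len(vec) + 1  # elements written into the new vector
    vec = append_immutable(vec, row)

print(vec)     # all three rows present
print(copies)  # 1 + 2 + 3 = 6 element copies for 3 rows
```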
One idea is to first build a list per column, and then call the Column constructor when creating the DataFrame/struct column. We still need to "transpose" the data in Python, but at least we avoid the expensive `O(n^2)` behavior.
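A minimal sketch of the transpose step, assuming the example data above (the per-column `ta.Column(data, dtype)` calls at the end are an assumption about the constructor API, not the actual implementation):

```python
rows = [(1, (0, 1)), (0, (10, 11))]

# One pass over the data: zip(*rows) turns the list of row tuples
# into per-column tuples, which we materialize as lists. This is O(n)
# rather than O(n^2) immutable-vector appends.
labels, dense_features = map(list, zip(*rows))

print(labels)          # [1, 0]
print(dense_features)  # [(0, 1), (10, 11)]

# Then, hypothetically, one constructor call per column:
#   ta.Column(labels, dt.int8)
#   ta.Column(dense_features, dt.Struct([...]))
```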