pytorch / torcharrow

High performance model preprocessing library on PyTorch

Home Page:https://pytorch.org/torcharrow/beta/index.html

Efficient column construction from tuple

wenleix opened this issue · comments

Column construction from a list is optimized with native C++ code (for scalar types), e.g.

import torcharrow as ta
a = ta.Column([1, 2, 3])

This optimization is not done for tuples, so construction from a tuple still has O(n^2) behavior:

import torcharrow as ta
a = ta.Column((1, 2, 3))

Both Pandas and PyArrow support construction from a tuple, so this is a feature we do want to keep:

>>> import pandas as pd
>>> a = pd.Series((1, 2, 3))
>>> a
0    1
1    2
2    3
dtype: int64

This is actually quite useful, since users sometimes build per-column data from a list of tuples using zip, e.g.

>>> a = [("a", 1), ("b", 2), ("c", 3)]
>>> list(zip(*a))
[('a', 'b', 'c'), (1, 2, 3)]
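
Until tuples get the fast path, one user-side workaround is to turn each zipped column into a list before constructing anything from it. A minimal sketch in plain Python; `rows_to_columns` is a hypothetical helper name for illustration, not part of torcharrow:

```python
from typing import Any, List, Sequence, Tuple


def rows_to_columns(rows: Sequence[Tuple[Any, ...]]) -> List[List[Any]]:
    """Transpose row tuples into per-column lists (hypothetical helper)."""
    # zip(*rows) yields one tuple per column; list() turns each into a list,
    # which is the input type the optimized construction path accepts.
    return [list(col) for col in zip(*rows)]


rows = [("a", 1), ("b", 2), ("c", 3)]
names, values = rows_to_columns(rows)
print(names)   # ['a', 'b', 'c']
print(values)  # [1, 2, 3]
```

Each resulting list could then be passed to `ta.Column(...)` the same way as in the list example above.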

I guess the easiest way would be to convert the tuple to a list in Python. I'm not sure how the performance compares with handling the tuple directly in C++.
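
A rough sketch of what the Python-side conversion could look like, with a quick way to measure its cost. `to_list_if_tuple` is a hypothetical name, not an existing torcharrow function, and where it would be called inside the constructor is an assumption. `list(some_tuple)` is a shallow O(n) copy, so the open question is only how that extra copy compares with reading the tuple natively:

```python
import timeit
from typing import Any


def to_list_if_tuple(data: Any) -> Any:
    """Return a list copy of a tuple; pass any other input through unchanged."""
    # Hypothetical normalization step that would run before the data is
    # handed to the native list-based construction path.
    if isinstance(data, tuple):
        return list(data)
    return data


# Measure only the conversion overhead, not column construction itself.
data = tuple(range(1_000_000))
seconds = timeit.timeit(lambda: to_list_if_tuple(data), number=10) / 10
print(f"tuple -> list for {len(data):,} items: {seconds * 1e3:.2f} ms per call")
```

The timing only covers the conversion itself; comparing it against native tuple handling would need a benchmark against an actual C++ implementation.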

pybind11 exposes a py::tuple type on the C++ side, so this should probably be trivial for us to support in the same way we do for lists. I'll investigate.