HazyResearch / meerkat

Creative interactive views of any dataset.

[BUG] Indexing into DataPanel changes custom column type

dhatcher8 opened this issue

Bug Description
When indexing into a DataPanel to get a subset of rows, a column with a complex custom column type is changed to a ListColumn in the resulting subset DataPanel.

To Reproduce
May be difficult to reproduce as it's only occurring for one custom column type that we have.

  1. Create a complex custom column type (ours subclasses mk.CellColumn, and each cell is a time series with categorical values)
  2. Create a DataPanel instance (dp) that has the above column and some data inside of it
  3. Index into the DataPanel (dp_subset = dp[0:1])
  4. The column type for that specific column in dp_subset has changed to a ListColumn

System Information

  • OS: MacOS

Thanks for the issue. Depending on the implementation of cell.get, this might be expected behavior.

Consider this example:

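import meerkat as mk
import torchaudio

class TimeSeriesCell(mk.AbstractCell):
    """Cell that stores only a file path and loads the waveform on demand."""

    def __init__(self, path):
        self.path = path

    def get(self):
        # torchaudio.load returns (waveform, sample_rate); keep only the waveform tensor
        return torchaudio.load(self.path)[0]

cell = TimeSeriesCell(path="/Users/sabrieyuboglu/data/datasets/yesno/waves_yesno/0_0_0_0_1_1_1_1.wav")

# Ten rows that all share the same cell object
dp = mk.DataPanel({
    "index": range(10),
    "cell": [cell] * 10
})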

When you index a cell column with dp[:5], you are using a "materializing" index: Meerkat calls cell.get() on each cell in the selection. In this example, get loads the time series from disk and returns it as a torch tensor. Meerkat then infers the type of the new column from the returned values (here, a torch TensorColumn). In your case, I imagine get returns a Python object that Meerkat doesn't have a specialized column for, so it falls back to a ListColumn.
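To make the fallback concrete, here is a toy sketch of type inference after materialization. It is not Meerkat's implementation: `infer_column_kind`, the labels it returns, and `CategoricalSeries` are hypothetical names used only for illustration. The point is just that recognized value types map to a specialized column, while anything else falls back to a plain list.

```python
from numbers import Number


def infer_column_kind(values):
    """Toy stand-in for column type inference (hypothetical, not Meerkat's code)."""
    values = list(values)
    # All real numbers (excluding bools): pretend a specialized numeric column exists.
    if values and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    ):
        return "numeric"
    # All strings: pretend a specialized text column exists.
    if values and all(isinstance(v, str) for v in values):
        return "text"
    # Anything unrecognized falls back to a generic list column.
    return "list"


class CategoricalSeries:
    """A custom Python object that the toy inference does not recognize."""

    def __init__(self, states):
        self.states = states


print(infer_column_kind([1, 2.5, 3]))
print(infer_column_kind([CategoricalSeries(["on", "off"])]))
```

If get on a custom cell returns an object like `CategoricalSeries`, the analogous fallback would explain why the subset ends up as a ListColumn.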

Now, if you'd like to index the DataPanel without materializing the cells (i.e. keep the CellColumn a CellColumn), you can do this with a "lazy" index: dp.lz[:5]
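The difference between the two kinds of index can be sketched with a small pure-Python model. This is an illustration under stated assumptions, not Meerkat's code: `ToyCell`, `ToyCellColumn`, and `_LazyIndexer` are hypothetical classes, and only slice indexing is handled.

```python
class ToyCell:
    """Hypothetical cell that counts how often its payload is loaded."""

    def __init__(self, payload):
        self.payload = payload
        self.loads = 0

    def get(self):
        self.loads += 1
        return self.payload


class _LazyIndexer:
    """Helper returned by `.lz`: slices the column without calling get()."""

    def __init__(self, column):
        self._column = column

    def __getitem__(self, index):
        # Keep the cells as cells, so the result is still a cell column.
        return ToyCellColumn(self._column.cells[index])


class ToyCellColumn:
    """Hypothetical column of cells (slice indexing only)."""

    def __init__(self, cells):
        self.cells = list(cells)

    @property
    def lz(self):
        return _LazyIndexer(self)

    def __getitem__(self, index):
        # Materializing index: call get() on every selected cell.
        return [cell.get() for cell in self.cells[index]]


column = ToyCellColumn(ToyCell(i) for i in range(10))

lazy_subset = column.lz[:5]   # still a ToyCellColumn, nothing loaded yet
materialized = column[:5]     # plain list of loaded payloads

print(type(lazy_subset).__name__, type(materialized).__name__)
```

In this model the lazy path preserves the column type because it never looks at what get() returns, which is the behavior described for dp.lz[:5].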

Let us know if this makes sense in your context. If not, you can share the CellColumn implementation and we can dive deeper.

Awesome, for our current situation lazy indexing should fulfill our needs. Thank you for the insight!