databricks / koalas

Koalas: pandas API on Apache Spark

Error when filtering a Series using a condition from a DataFrame

lamesjaidler opened this issue · comments

I want to filter a Koalas Series based on a condition from a related Koalas DataFrame:

import databricks.koalas as ks

X = ks.DataFrame({
    'A': [1, 2, 3, 4, 5]
})
y = ks.Series([1, 0, 1, 0, 1])
y[X['A'] > 3]

However, running the last line gives the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/hw/y20855_146x8gbvbpqmtpdp00000gq/T/ipykernel_27136/1423666562.py in <module>
      3 })
      4 y = ks.Series([1,0,1,0,1])
----> 5 y[X['A']>3]

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/series.py in __getitem__(self, key)
   6134                 # with ints, searches based on index values when the value is int.
   6135                 return self.iloc[key]
-> 6136             return self.loc[key]
   6137         except SparkPandasIndexingError:
   6138             raise KeyError(

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/indexing.py in __getitem__(self, key)
    419 
    420                 kdf[temp_col] = key
--> 421                 return type(self)(kdf[self._kdf_or_kser.name])[kdf[temp_col]]
    422 
    423             cond, limit, remaining_index = self._select_rows(key)

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/frame.py in __getitem__(self, key)
  11705 
  11706         if key is None:
> 11707             raise KeyError("none key")
  11708         elif isinstance(key, Series):
  11709             return self.loc[key.astype(bool)]

KeyError: 'none key'

This syntax works as expected with pandas Series/DataFrames:

import pandas as pd

X = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})
y = pd.Series([1, 0, 1, 0, 1])
y[X['A'] > 3]

Gives:

3    0
4    1
dtype: int64

Note that I have the following option set:

ks.set_option('compute.ops_on_diff_frames', True)

Is this a bug, or does it need to be done in a different way?

Unfortunately, the exact use case above is not supported.

However, there is a workaround: give the original Series (y) the same name as the conditioning Series.

So in the example above:

>>> X = ks.DataFrame({
...     'A': [1,2,3,4,5]
... })
>>> y = ks.Series([1,0,1,0,1], name='A')
>>> y[X['A']>3]
3    0
4    1
Name: A, dtype: int64
>>> 

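Another way to sidestep the cross-frame operation entirely is to keep both columns in a single DataFrame, so no alignment between different frames is needed. This is a minimal sketch, not from the original thread; the column name 'y' is just for illustration, and the output shown is what pandas semantics would give:

>>> import databricks.koalas as ks
>>> kdf = ks.DataFrame({
...     'A': [1, 2, 3, 4, 5],
...     'y': [1, 0, 1, 0, 1],
... })
>>> # The boolean mask and the target column live in the same frame,
>>> # so compute.ops_on_diff_frames is not required for this step.
>>> kdf.loc[kdf['A'] > 3, 'y']
3    0
4    1
Name: y, dtype: int64
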
We will improve this in the pandas API on Spark; the work is tracked under https://issues.apache.org/jira/browse/SPARK-36394.
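
For reference, Koalas has since been upstreamed into Spark as pyspark.pandas (Spark 3.2+). Assuming the fix tracked above has landed in your Spark version, the equivalent code there would look like this sketch:

>>> import pyspark.pandas as ps
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> X = ps.DataFrame({
...     'A': [1, 2, 3, 4, 5]
... })
>>> y = ps.Series([1, 0, 1, 0, 1])
>>> y[X['A'] > 3]  # should match the pandas result once SPARK-36394 is resolved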

Thanks for letting us know!

FYI @ueshin @HyukjinKwon @itholic