Error when filtering a Series using a condition from a DataFrame
lamesjaidler opened this issue
I want to filter a Koalas Series based on a condition from a related Koalas DataFrame:
X = ks.DataFrame({
    'A': [1,2,3,4,5]
})
y = ks.Series([1,0,1,0,1])
y[X['A']>3]
However, running the last line gives the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/var/folders/hw/y20855_146x8gbvbpqmtpdp00000gq/T/ipykernel_27136/1423666562.py in <module>
3 })
4 y = ks.Series([1,0,1,0,1])
----> 5 y[X['A']>3]
~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/series.py in __getitem__(self, key)
6134 # with ints, searches based on index values when the value is int.
6135 return self.iloc[key]
-> 6136 return self.loc[key]
6137 except SparkPandasIndexingError:
6138 raise KeyError(
~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/indexing.py in __getitem__(self, key)
419
420 kdf[temp_col] = key
--> 421 return type(self)(kdf[self._kdf_or_kser.name])[kdf[temp_col]]
422
423 cond, limit, remaining_index = self._select_rows(key)
~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/frame.py in __getitem__(self, key)
11705
11706 if key is None:
> 11707 raise KeyError("none key")
11708 elif isinstance(key, Series):
11709 return self.loc[key.astype(bool)]
KeyError: 'none key'
This syntax works as expected with Pandas Series/DataFrames:
X = pd.DataFrame({
    'A': [1,2,3,4,5]
})
y = pd.Series([1,0,1,0,1])
y[X['A']>3]
Gives:
3 0
4 1
dtype: int64
Note that I have the following option set:
ks.set_option('compute.ops_on_diff_frames', True)
This seems like a bug? Or does it need to be carried out in a different way?
Unfortunately, the exact use case above is not supported.
However, there is a workaround: give the original series (y) the same name as the conditioning series.
So in the above example,
>>> X = ks.DataFrame({
... 'A': [1,2,3,4,5]
... })
>>> y = ks.Series([1,0,1,0,1], name='A')
>>> y[X['A']>3]
3 0
4 1
Name: A, dtype: int64
>>>
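If y already exists without a name, the same workaround can probably be applied by renaming it before indexing. Below is a minimal sketch of that idea; the helper name filter_by_condition is hypothetical and not part of Koalas or pandas. It is demonstrated with pandas so that it runs without a Spark session. With Koalas, the assumption is that Series.rename behaves like passing name= to the constructor and that compute.ops_on_diff_frames is enabled, as in the original report.

```python
import pandas as pd  # stand-in so the sketch runs without Spark


def filter_by_condition(series, condition):
    """Hypothetical helper wrapping the workaround above.

    Renames `series` to match the name of the boolean `condition`
    series, then uses the condition as a mask. Note that the result
    carries the condition's name, not the original name of `series`.
    """
    renamed = series.rename(condition.name)
    return renamed[condition]


X = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
y = pd.Series([1, 0, 1, 0, 1])  # unnamed, as in the original report

result = filter_by_condition(y, X['A'] > 3)
print(result)
```

The intent is that the same helper could be called with a ks.Series and a condition built from a ks.DataFrame, but that path is an assumption and has not been verified here.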
We will improve this in the pandas API on Spark; the work is tracked at https://issues.apache.org/jira/browse/SPARK-36394.
Thanks for letting us know!