Out of Synchronization operations with shift
oozut opened this issue · comments
The order in which the cluster evaluates operations appears to be incorrect when a shift operation is involved.
import databricks.koalas as ks
df = ks.read_csv("/databricks-datasets/COVID/coronavirusdataset/Case.csv")
display(df)
df['prev_city'] = df['city'].shift()
df['next_city'] = df['city'].shift(-1)
display(df)
only_group_false = df.loc[df['group'] == 'false']
display(only_group_false)
On the first display(df), the city column is displayed as expected, without any nulls.
On the second display, 'prev_city' has a null value in the first row and 'next_city' has a null value in the last row. This is the correct behaviour that I expected.
However, in the sub-dataframe only_group_false, 'prev_city' also has a null value in its first row and 'next_city' has one in its last row. It seems that the shift operation is applied after the selection is performed.
I ran the same example with the original pandas library and got the behaviour I expect, with no null values at the top and bottom.
df = ks.read_csv("/databricks-datasets/COVID/coronavirusdataset/Case.csv").to_pandas()
display(df)
df['prev_city'] = df['city'].shift()
df['next_city'] = df['city'].shift(-1)
display(df)
only_group_false = df.loc[df['group'] == False]  # compare against a bool here, not the string 'false'
display(only_group_false)
I suppose the shift operation is not fixed in the compute order and is instead deferred until the last operation.
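To make the two orderings concrete, here is a minimal sketch in plain pandas. The toy data is hypothetical (it is not taken from Case.csv), and the variable names are only for illustration. It shows what the result looks like when shift runs before the selection (what I expect) versus after it (what the Koalas output resembles).

```python
import pandas as pd

# Hypothetical toy data standing in for Case.csv
df = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E"],
    "group": [True, False, False, False, True],
})

# Expected order: shift over the full frame, then select rows
shifted = df.assign(
    prev_city=df["city"].shift(),
    next_city=df["city"].shift(-1),
)
shift_then_filter = shifted.loc[~shifted["group"]]

# Order that matches the reported symptom: select rows, then shift
filtered = df.loc[~df["group"]]
filter_then_shift = filtered.assign(
    prev_city=filtered["city"].shift(),
    next_city=filtered["city"].shift(-1),
)

print(shift_then_filter)   # neighbours come from the full frame, no nulls
print(filter_then_shift)   # null in the first prev_city and the last next_city
```

In the first result the neighbours come from the unfiltered frame, so the selected rows keep real values. In the second, shift only sees the selected rows, which produces the nulls at the top and bottom.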
Cluster is:
Cluster Mode: Standard
Databricks Runtime Version: 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12)
Worker Type: Standard_D12_v2
No init script.
@oozut Thanks for the report! Sounds like it's a bug. I'll be working on it soon.