Out of Synchronization operations with shift

Question

oozut opened this issue 3 years ago

I have noticed that the cluster does not apply the shift operation at the right point in the sequence of operations.

import databricks.koalas as ks
df = ks.read_csv("/databricks-datasets/COVID/coronavirusdataset/Case.csv")
display(df)
df['prev_city'] = df['city'].shift()
df['next_city'] = df['city'].shift(-1)
display(df)
only_group_false = df.loc[df['group'] == 'false']
display(only_group_false)

On the first display(df), the city is displayed as expected, without any nulls. On the second display(df), 'prev_city' has a null value in the first row and 'next_city' has a null value in the last row. This is the behaviour I expected.

However, in the sub-dataframe only_group_false, 'prev_city' is also null in the first row and 'next_city' is also null in the last row. It seems that the shift operation is applied after the selection is performed, rather than before it.

I ran the same example with the original pandas library and got the behaviour I expect: no null values at the top and bottom of the filtered result.

df = ks.read_csv("/databricks-datasets/COVID/coronavirusdataset/Case.csv").to_pandas()
display(df)
df['prev_city'] = df['city'].shift()
df['next_city'] = df['city'].shift(-1)
display(df)
only_group_false = df.loc[df['group'] == False] ## change to bool
display(only_group_false)

I suppose the shift operation is not fixed at its position in the compute order and is instead passed down to the last operation.
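To make the suspected difference in evaluation order concrete, here is a minimal pandas-only sketch. It does not use Koalas or the Case.csv file; the small frame and its values are made up for illustration. It compares shifting before filtering (what pandas does) with filtering before shifting (which matches the nulls seen in the Koalas output).

```python
import pandas as pd

# Hypothetical stand-in for Case.csv; the values are made up for illustration.
df = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F"],
    "group": [True, False, False, True, False, True],
})

# Order 1 (pandas behaviour): shift on the full frame, then filter.
shifted = df.copy()
shifted["prev_city"] = shifted["city"].shift()
shifted["next_city"] = shifted["city"].shift(-1)
shift_then_filter = shifted.loc[shifted["group"] == False]

# Order 2 (what the Koalas result resembles): filter first, then shift.
filter_then_shift = df.loc[df["group"] == False].copy()
filter_then_shift["prev_city"] = filter_then_shift["city"].shift()
filter_then_shift["next_city"] = filter_then_shift["city"].shift(-1)

print(shift_then_filter)   # neighbours come from the full frame, no nulls
print(filter_then_shift)   # null in first prev_city and last next_city
```

In the first result the neighbouring cities are taken from the unfiltered frame, so the filtered rows have no nulls. In the second, the first 'prev_city' and the last 'next_city' are null, which is the pattern described above for only_group_false.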

Cluster is:
Cluster Mode: Standard
Databricks Runtime Version: 7.6 ML (includes Apache Spark 3.0.1, Scala 2.12)
Worker Type: Standard_D12_v2

No init script.

Takuya UESHIN · Answer 1 · Sat Mar 06 2021 04:20:54 GMT+0800 (China Standard Time)

@oozut Thanks for the report! Sounds like it's a bug. I'll be working on it soon.