possible bug in sparklyr logical planner (selections pushed down too far)
smacke opened this issue · comments
Reporting an Issue with sparklyr
Hi there, I'm seeing what looks like a bug in the sparklyr logical planner on `tbl_spark` objects. The tl;dr is that, in certain cases, selections appear to be pushed too far down in the logical plan (i.e., past `group_by` and `mutate` operations that can introduce new columns). The example below, which works fine for vanilla R data frames, fails for `tbl_spark` data frames:
# Install dplyr and sparklyr if not already installed
# install.packages(c("dplyr", "sparklyr"))
library(dplyr)
library(sparklyr)
# Create a data frame with some sample data
df <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "English", "English", "Math"),
  Score = c(90, 85, 88, 92, 78)
)
sc <- spark_connect(master="local")
spark_df <- sparklyr::copy_to(sc, df, "spark_df", overwrite=TRUE)
print(spark_df)
# Group the data frame by the "Name" column
# ------------------ NOTE: Switch between `df` and `spark_df` below to contrast execution behavior for native and spark dataframes
grouped_df <- spark_df %>%
  group_by(Name)
# Use mutate to add a new column "AvgScore" (won't actually contain the average score within each group; just a fake column for testing)
result_df <- grouped_df %>%
  mutate(AvgScore = 1)
# Print the original and result data frames
# print("Original Data Frame:")
# print(df)
# print("Grouped Data Frame:")
# print(grouped_df)
# print("Result Data Frame:")
# print(result_df)
# Bug: the following fails with spark_df but works with the vanilla R data frame df
arranged <- result_df %>% ungroup() %>% arrange(AvgScore, Name)
print(arranged)
print(arranged %>% select(Name))
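One way to check the pushdown hypothesis is to look at the SQL that dplyr generates for the lazy table, before and after the final `select()`. The sketch below is only a diagnostic aid and has not been confirmed against this bug. It rebuilds the same pipeline and calls `dplyr::show_query()` on both versions. The `tryCatch()` wrappers are an assumption on my part, since it is not clear whether the failure happens while the SQL is generated or when it is executed.

```r
library(dplyr)
library(sparklyr)

# Same sample data and local connection as in the reproduction above
df <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "English", "English", "Math"),
  Score = c(90, 85, 88, 92, 78)
)
sc <- spark_connect(master = "local")
spark_df <- copy_to(sc, df, "spark_df", overwrite = TRUE)

# Same pipeline as above, written as a single chain
arranged <- spark_df %>%
  group_by(Name) %>%
  mutate(AvgScore = 1) %>%
  ungroup() %>%
  arrange(AvgScore, Name)

# show_query() prints the generated SQL without collecting any rows.
# Errors are caught and reported so that both queries are always attempted.
tryCatch(
  show_query(arranged),
  error = function(e) message("show_query(arranged) failed: ", conditionMessage(e))
)
tryCatch(
  show_query(arranged %>% select(Name)),
  error = function(e) message("show_query(select(Name)) failed: ", conditionMessage(e))
)

spark_disconnect(sc)
```

If the second query still orders by `AvgScore` while the subquery it reads from no longer produces that column, that would be consistent with the selection having been pushed below the `mutate`.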