sparklyr / sparklyr

R interface for Apache Spark

Home Page: https://spark.rstudio.com/

possible bug in sparklyr logical planner (selections pushed down too far)

smacke opened this issue

Reporting an Issue with sparklyr

Hi there, I'm seeing what looks like a bug in the sparklyr logical planner for tbl_spark objects. The tl;dr is that, in certain cases, selections appear to be pushed too far down the logical plan (i.e., past group_by and mutate operations that can introduce new columns). The example below works fine for vanilla R data frames but fails for tbl_spark data frames:

# Load dplyr and sparklyr (install them first if needed)
# install.packages(c("dplyr", "sparklyr"))
library(dplyr)
library(sparklyr)

# Create a data frame with some sample data
df <- data.frame(
  Name = c("Alice", "Bob", "Alice", "Bob", "Charlie"),
  Subject = c("Math", "Math", "English", "English", "Math"),
  Score = c(90, 85, 88, 92, 78)
)

sc <- spark_connect(master = "local")

spark_df <- sparklyr::copy_to(sc, df, "spark_df", overwrite = TRUE)

print(spark_df)

# Group the data frame by the "Name" column
# NOTE: Switch between `df` and `spark_df` below to contrast execution
# behavior for native and Spark data frames
grouped_df <- spark_df %>%
  group_by(Name)

# Use mutate to add a new column "AvgScore" (it won't actually contain the
# average score within each group; it's just a fake column for testing)
result_df <- grouped_df %>%
  mutate(AvgScore = 1)

# Print the original and result data frames
# print("Original Data Frame:")
# print(df)

# print("Grouped Data Frame:")
# print(grouped_df)

# print("Result Data Frame:")
# print(result_df)

# bug in sparklyr, but not vanilla R dataframes
arranged <- result_df %>% ungroup() %>% arrange(AvgScore, Name)
print(arranged)

print(arranged %>% select(Name))
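
To see where the `select()` ends up relative to the `mutate()` and `arrange()`, it may help to look at the SQL generated for the pipeline. The snippet below is a minimal sketch of that check, not a confirmed diagnosis: it assumes a local Spark installation (as in the example above), uses a hypothetical table name `spark_df_min`, and assumes that `show_query()` can render the query even though executing it fails.

```r
library(dplyr)
library(sparklyr)

# Assumption: a local Spark installation is available, as in the example above.
sc <- spark_connect(master = "local")

# "spark_df_min" is a hypothetical table name used only for this sketch.
spark_df_min <- copy_to(
  sc,
  data.frame(Name = c("Alice", "Bob", "Alice"), Score = c(90, 85, 88)),
  "spark_df_min",
  overwrite = TRUE
)

# Same shape of pipeline as in the example above: the final select() drops
# AvgScore, which arrange() still needs.
query <- spark_df_min %>%
  group_by(Name) %>%
  mutate(AvgScore = 1) %>%
  ungroup() %>%
  arrange(AvgScore, Name) %>%
  select(Name)

# Print the generated SQL instead of collecting the result.
show_query(query)

spark_disconnect(sc)
```

If the ORDER BY clause in the printed SQL refers to `AvgScore` at a level where only `Name` has been selected, that would be consistent with the selection being pushed down past the mutate.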