pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

Home Page: https://docs.pola.rs

`.last()` can't be used on LazyGroupBy

jmakov opened this issue

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import datetime

import polars

# Seven timestamps, 30 minutes apart, from 00:00 to 03:00
lf = polars.LazyFrame({
    "time": polars.datetime_range(
        start=datetime.datetime(2021, 12, 16),
        end=datetime.datetime(2021, 12, 16, 3),
        interval="30m",
        eager=True),
    "n": range(7),
    "m": range(7)})
lf.group_by_dynamic("time", every="1h", closed="right").last().collect()

Log output

No response

Issue description

.last() can't be used on the LazyGroupBy returned by group_by_dynamic; collecting the query above raises an error.

Expected behavior

According to the docs, it should work.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-6.6.32-1-MANJARO-x86_64-with-glibc2.39
Python:               3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.6.0
gevent:               <not installed>
hvplot:               0.10.0
matplotlib:           3.7.3
nest_asyncio:         1.6.0
numpy:                1.25.2
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              14.0.2
pydantic:             1.10.16
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                2.1.2.post300
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Just to expand a bit: this is specific to group_by_dynamic, and the problem is that index_column gets duplicated.

(lf.group_by_dynamic("time", every="1h", closed="right")
   .last()
   .collect()
)
# DuplicateError: column with name 'time' has more than one occurrences

In this case pl.all() includes the index_column, which differs from how .group_by() and by= behave.

(lf.group_by_dynamic("time", every="1h", closed="right")
   .agg(pl.all().last())
   .collect()
)
# DuplicateError: column with name 'time' has more than one occurrences

In this case index_column needs to be excluded.

(lf.group_by_dynamic("time", every="1h", closed="right")
   .agg(pl.exclude("time").last())
   .collect()
)

# shape: (4, 3)
# ┌─────────────────────┬─────┬─────┐
# │ time                ┆ n   ┆ m   │
# │ ---                 ┆ --- ┆ --- │
# │ datetime[μs]        ┆ i64 ┆ i64 │
# ╞═════════════════════╪═════╪═════╡
# │ 2021-12-15 23:00:00 ┆ 0   ┆ 0   │
# │ 2021-12-16 00:00:00 ┆ 2   ┆ 2   │
# │ 2021-12-16 01:00:00 ┆ 4   ┆ 4   │
# │ 2021-12-16 02:00:00 ┆ 6   ┆ 6   │
# └─────────────────────┴─────┴─────┘

(I'm not entirely sure if these GroupBy.foo() shorthand methods are supposed to be allowed for group_by_dynamic)

Thanks for that. The type is polars.lazyframe.group_by.LazyGroupBy, so I assumed it should work (according to the docs). It would help if, e.g., your example were part of the docs, to make clear where the DuplicateError comes from.