rapidsai / cudf

cuDF - GPU DataFrame Library

Home Page: https://docs.rapids.ai/api/cudf/stable/

[BUG] Cannot find maxima of a categorical series

rjzamora opened this issue

Describe the bug
While debugging some strange dask-expr + cudf sorting behavior, I realized that I cannot call ser.min() when ser is a categorical Series. This is a problem for the sort_values logic used in dask.

Steps/Code to reproduce bug

import cudf

ser = cudf.Series(range(10), dtype="category")
ser.min()
# Same problem with `ser.cat.as_ordered().min()`
...
TypeError: Cannot interpret 'CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ordered=False, categories_dtype=int64)' as a data type

Expected behavior
I'd expect min/max to work (assuming the dtype is "ordered").

Hmmm. Can you try:

diff --git a/python/cudf/cudf/core/column/categorical.py b/python/cudf/cudf/core/column/categorical.py
index e3e7303504..fc996e6b6a 100644
--- a/python/cudf/cudf/core/column/categorical.py
+++ b/python/cudf/cudf/core/column/categorical.py
@@ -515,6 +515,10 @@ class CategoricalColumn(column.ColumnBase):
     dtype: cudf.core.dtypes.CategoricalDtype
     _codes: Optional[NumericalColumn]
     _children: Tuple[NumericalColumn]
+    _VALID_REDUCTIONS = {
+        "max",
+        "min",
+    }
     _VALID_BINARY_OPERATIONS = {
         "__eq__",
         "__ne__",
@@ -699,6 +703,27 @@ class CategoricalColumn(column.ColumnBase):
             ),
         )
 
+    def _reduce(
+        self,
+        op: str,
+        skipna: Optional[bool] = None,
+        min_count: int = 0,
+        *args,
+        **kwargs,
+    ) -> ScalarLike:
+        # Only valid reductions are min and max
+        if not self.ordered:
+            raise TypeError(
+                f"Categorical is not ordered for operation {op}; "
+                "you can use .as_ordered() to change the Categorical "
+                "to an ordered one."
+            )
+        return self._decode(
+            self.codes._reduce(
+                op=op, skipna=skipna, min_count=min_count, *args, **kwargs
+            )
+        )
+
     def _binaryop(self, other: ColumnBinaryOperand, op: str) -> ColumnBase:
         other = self._wrap_binop_normalization(other)
         # TODO: This is currently just here to make mypy happy, but eventually
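
The idea behind the patch can be sketched without a GPU: reduce over the integer codes, then map the winning code back to its category. The following is a minimal pure-Python illustration, not cudf's implementation. `ToyCategorical`, its `reduce` method, and the use of `-1` as the missing-value code are all hypothetical names and assumptions made for this sketch.

```python
class ToyCategorical:
    """Hypothetical stand-in for a categorical column stored as integer codes."""

    def __init__(self, values, categories, ordered=False):
        self.categories = list(categories)
        self.ordered = ordered
        lookup = {cat: i for i, cat in enumerate(self.categories)}
        # Assumption of this sketch: -1 marks a value that is not a category.
        self.codes = [lookup.get(v, -1) for v in values]

    def reduce(self, op):
        # Only min and max are meaningful, and only for ordered categories.
        if op not in ("min", "max"):
            raise TypeError(f"unsupported reduction: {op}")
        if not self.ordered:
            raise TypeError(f"Categorical is not ordered for operation {op}")
        valid = [code for code in self.codes if code >= 0]
        if not valid:
            return None
        # Reduce over the codes, then translate the code back to a category.
        code = min(valid) if op == "min" else max(valid)
        return self.categories[code]


sizes = ToyCategorical(
    ["large", "small", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)
print(sizes.reduce("min"))  # small
print(sizes.reduce("max"))  # large
```

Note that the result follows the category order rather than string order: a lexicographic minimum of the same values would be "large".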

Aside, it is mind-boggling to me that unordered_cat.sort_values() is allowed, but unordered_cat.min() is not. How can you sort if you can't produce a minimum element?

@rjzamora any chance you tried out @wence- 's snippet above?

> Aside, it is mind-boggling to me that unordered_cat.sort_values() is allowed, but unordered_cat.min() is not. How can you sort if you can't produce a minimum element?

A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.

> any chance you tried out @wence- 's snippet above?

Not yet - Thanks for the suggestion @wence-! I will test this out soon.

> A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.

I did some digging: it's a combination of matching R's factor API and implementation details leaking into semantics. See the discussions in pandas-dev/pandas#9611, pandas-dev/pandas#9622, and pandas-dev/pandas#12785.

Effectively, once sort-based groupby is a promise, allowing unordered categoricals as groupby keys backs you into a corner: the groups have to come out in some order, so unordered categoricals end up sortable. Of course, the right answer is to say "no can do", but ...
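
To make the "backed into a corner" point concrete, here is a toy pure-Python sketch of a sorted groupby keyed on categories. It is not how pandas or cudf implement groupby. The function name `sorted_groupby_sum` and its signature are hypothetical. The assumption it illustrates is that the only readily available order for the groups is the position of each category in the categories list, so a "sorted" result implies an ordering even when none was declared.

```python
from collections import defaultdict


def sorted_groupby_sum(keys, values, categories):
    """Sum values per key and return the groups in category (code) order."""
    code_of = {cat: i for i, cat in enumerate(categories)}
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[code_of[key]] += value
    # Emitting groups in code order is a sort, whether or not the
    # categories were ever declared to be ordered.
    return [(categories[code], totals[code]) for code in sorted(totals)]


result = sorted_groupby_sum(
    keys=["b", "a", "b", "c"],
    values=[1, 2, 3, 4],
    categories=["c", "a", "b"],  # storage order, not a declared semantic order
)
print(result)  # [('c', 4), ('a', 2), ('b', 4)]
```

Under this scheme the first group always exists and sits at the smallest code, which is why permitting the sort while rejecting min looks inconsistent.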

Added a test to the patch suggested by @wence- and submitted #15701