pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Home Page: https://pandas.pydata.org


ENH: Dedicated method for creating conditional columns

erfannariman opened this issue · comments

Creating conditional columns is an operation you perform a lot when wrangling data, yet there is currently no dedicated, explicit way of creating conditional columns in pandas. This can be seen on StackOverflow: that question is my top hit when I google "pandas conditional column" (this will obviously vary per person).

On that StackOverflow link we see the result of not being explicit: at least 6 different methods are mentioned, along with a lot of anti-patterns and methods we do not want users to use because of inefficiency (apply with if/else, or map with a lambda, for example).

This ticket is meant as a starting point for discussing whether we should have a more explicit way of creating conditional columns in pandas. It also matters because we need to look ahead to the switch to EAs (extension arrays): the most-used methods (and the top accepted answer on SO) are NumPy functions and return NumPy arrays.


What do we have now?

For single conditional columns:

  • np.where
  • .loc with assignment
  • Series.where
  • Series.mask, see comment

For multi-conditional columns:

  • np.select
  • complex list comprehensions
  • .apply with if else (not an option imo)

I don't have a suggestion yet, but I know I really like the clean code you can create with np.select and also with SQL CASE WHEN.
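
For reference, a minimal example of the NumPy-based status quo (made-up data, shown only to illustrate the two functions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# single condition: np.where(condition, value_if_true, value_if_false)
df['c'] = np.where(df['a'] < 2, 0, df['b'])

# multiple conditions: np.select evaluates the conditions in order and
# uses the first one that matches; 'default' covers everything else
conditions = [df['a'] < 2, df['b'] < 6]
choices = [0, 1]
df['d'] = np.select(conditions, choices, default=-1)

Note that both return NumPy arrays, which is the extension-array concern raised above.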

Other languages:

SQL has CASE WHEN:

CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END;

R (dplyr) has case_when:

df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                            a == 0 | a == 1 | a == 4 | a == 3 |  c == 4 ~ 3,
                            TRUE ~ NA_real_))

Related tickets / questions:

  • #30357 (using apply with if else)
  • #34543 (conditional join)

One to add to the list is Series.mask, e.g.

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# single condition: set 'b' to 0 where 'a' < 2
mask = df['a'] < 2
df['b'] = df['b'].mask(mask, 0)

# compound condition built from multiple columns
mask = (df['a'] + df['b'] > 2) & (df['a'] + df['b'] < 8)
df['b'] = df['b'].mask(mask, 0)

One option, using pd.Series.mask under the hood, is to use a comma as a separator, where the condition comes before the result. I feel it is easier to understand what is going on, and it makes the code a bit cleaner (my opinion). It also allows pandas data types to be retained, since under the hood the implementation uses Series.mask, not np.select. Using @rhshadrach's examples above:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.assign(c=case_when(df.a < 2, 0, df.b))
# multiple conditions:
df.assign(d=case_when(df.a < 2, 0,   # first condition, result
                      df.b < 2, 1,   # second condition, result
                      df.a))         # default

The idea is inspired by if_else in pydatatable and fcase in R's data.table; a chainable method is implemented in pyjanitor (currently in dev) as case_when.


This looks like a clean method for the API. It would mean a general function, e.g. pd.case_when, which makes sense to me; we already have mask and where on DataFrame and Series.

Instead of a trailing positional default, I think a dedicated default argument for the default value would be clearer.

datar supports if_else and case_when (like dplyr's):

>>> import pandas as pd
>>> from datar.all import f, mutate, case_when, if_else
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df >> mutate(c=if_else(f.a < 2, 0, f.b))
        a       b         c
  <int64> <int64> <float64>
0       1       4       0.0
1       2       5       5.0
2       3       6       6.0
>>> df >> mutate(d=case_when(f.a < 2, 0, f.b < 2, 1, True, f.a))
        a       b         d
  <int64> <int64> <float64>
0       1       4       0.0
1       2       5       2.0
2       3       6       3.0

would be averse to adding case_when as an api, but would require a champion and substantial testing.

would be averse to adding case_when as an api, but would require a champion and substantial testing.

I agree on the latter; I started with an attempt to implement this and quickly ran into quite a bit of complexity. But given the original post and problem statement, wouldn't you agree a "data wrangling" library needs a dedicated method for creating conditional columns? It is an action you perform quite often when transforming data.

That was a typo; it should read "wouldn't be averse". That said, this is a substantial effort and would have to be designed properly.

@erfannariman what complexity did you encounter?

@erfannariman what complexity did you encounter?

The first implementation of creating a conditional column was relatively easy, but then came the design choices around NumPy dtypes vs. pandas extension dtypes, plus passing all the tests, the edge cases, and the mypy type checks.

And np.select is still the fastest method for multi-conditional cases as far as I tested, so using it under the hood made sense to me.

But to be fair, the attempt was relatively short compared to what is needed to implement this.

If you do not mind, @erfannariman, could you kindly share the edge cases, as well as the dtype issues, so I can test on my end?

Any decision made here on what method to proceed with?

I am still a bit biased towards my suggestion in (issue: #46285, PR: #46286) because it's based on lambdas. Any major counter-arguments?

Any major counter-arguments?

If I'm understanding the code correctly, those lambdas will be called row-by-row. In that case, they would be highly inefficient when compared to vectorized operations.

If I'm understanding the code correctly, those lambdas will be called row-by-row. In that case, they would be highly inefficient when compared to vectorized operations.

@rhshadrach

Isn't it the same row-by-row assignment when we do:

df.assign(
    column_1 = lambda x: ...,
    column_2 = lambda x: ...,
    column_3 = lambda x: ...,
)

The columns above get assigned one by one, but in the assignment of the column itself, we have vectorisation. I believe this is similar to what was proposed in (issue: #46285, PR: #46286).
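
A concrete version of that pattern (columns are made up, purely to illustrate that each callable is evaluated once against the whole frame rather than row by row):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# each callable receives the whole (intermediate) DataFrame and returns an
# entire column in one vectorized operation; assignments happen left to
# right, so later callables can see the columns created before them
df = df.assign(
    total=lambda x: x.a + x.b,
    flagged=lambda x: x.total > 7,
)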

If you agree, I can re-open the PR and we can drive this feature forward together.

Thanks.

@erfannariman just checking if you are still working on this. If you are not working on it, I can take over and raise a PR

@ELHoussineT I see, I missed that you're allowing the values to be Series as well as callable. It seems to me this API would not be able to handle the example in #39154 (comment), is that correct?

@rhshadrach @ELHoussineT, one option would be to create a standalone if_else function that can be used within assign or on its own. That way assign stays the same, with no changes, and can be adapted to whatever assignment operation is required. Have a look at an example in #39154 (comment) for a possible implementation.

@rhshadrach

It seems to me this API would not be able to handle the example in #39154 (comment), is that correct?

I believe the only missing feature is the "default" value. I can tweak this API to support that and add some tests.

If you agree, I will open a PR.

@ELHoussineT - can you show what #39154 (comment) would look like with your suggested signature? My understanding is that the keys of the dictionary are to become the values in the result. I don't see how you can use a Series in a straight-forward manner as the values in the result.

@rhshadrach

Alright, let's start by explicitly describing the requirements for the functionality we want to build. Please feel free to suggest changes.

Requirements

The functionality should satisfy the following:

  1. Can assign multiple values to a column based on multiple conditions
  2. Vectorization is maintained
  3. The conditions can be lambdas/functions or Series. If they are lambdas/functions, the DataFrame is passed to them
  4. Ability to provide a "default" value (the value to use when all conditions fail). If no default is defined, use the current value if one exists, otherwise None
  5. All conditions are evaluated, and values are assigned based on the first match. This is an important point to agree on: in the example below, items that satisfy both condition 1 and condition 3 will get value 1. (A concrete np.select comparison follows this list.)
"value 1" = lambda x: <conditions 1>,
"value 2" = lambda x: <conditions 2>,
"value 3" = lambda x: <conditions 3>

Options

Now that we have agreed on the requirements at a high level, here are (IMHO) our options:

Option 1:

What has been suggested by @samukweku in their comment.

Example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,   # first condition, result   
                           lambda x: <some conditions>, 1,   # second condition, result 
                           df.a))                            # default (optional)

Option 2:

We follow what Spark did with their when(...).otherwise(...) (docs) functionality.

Example:

df.assign(d = pd.when(lambda x: <some conditions>, 0) # first condition, result   
                .when(lambda x: <some conditions>, 1) # second condition, result 
                .otherwise(df.a))            	      # default (optional)

Please note that if we choose this option, upon implementation we can rename .otherwise(df.a) to .default(df.a) for convenience.
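
For reference, the Spark API being mirrored looks roughly like this in PySpark (shown only for comparison; the data and column names are made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 3, 7), (5, 6, 8)], ["a", "c", "s"])

sdf = sdf.withColumn(
    "d",
    F.when(F.col("a") < 3, 0)        # first condition, result
     .when(F.col("c") % 3 == 0, 1)   # second condition, result
     .otherwise(F.col("s")),         # default
)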

My opinion

I would go with option 2, as IMHO it is more intuitive than option 1. I would also be happy to hear what you think.

Let me know your thoughts so we can choose and implement a solution.

How do these differ from Series.mask?

(
    df.a
    .mask(lambda x: <some conditions>, 0) # first condition, result   
    .mask(lambda x: <some conditions>, 1) # second condition, result
)

Also, I don't see how we can implement otherwise. pyspark can do this because of lazy evaluation, but pandas is eager.

Edit: Ah, option 1 allows the default. Am I to assume that condition 2 will only be utilized where condition 1 was false? If this is the case, is it still evaluated where condition 1 is true? This would have implications if you're doing some op that may raise on rows where condition 1 is true.

I can speak about option 1. It uses the same idea as np.select: all conditions are evaluated and values are assigned based on the first match, using Series.mask. So if condition 1 is true and condition 2 is also true, condition 1 takes precedence. If all conditions fail, the default kicks in. Basically a for loop, just like in np.select, but using Series.mask. Hopefully I explained it well enough. An implementation is here and an example from StackOverflow.
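
A minimal sketch of that idea, assuming alternating condition/value arguments with a trailing default (the name case_when and the exact signature are illustrative here, not the linked implementation verbatim):

import pandas as pd

def case_when(*args):
    # alternating condition, value pairs, followed by a default
    *pairs, default = args
    conditions, values = pairs[::2], pairs[1::2]
    result = default.copy() if isinstance(default, pd.Series) else pd.Series(
        default, index=conditions[0].index
    )
    # apply the pairs in reverse with Series.mask so that the first
    # matching condition takes precedence, mirroring np.select
    for cond, value in reversed(list(zip(conditions, values))):
        result = result.mask(cond, value)
    return result

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.assign(d=case_when(df.a < 2, 0, df.b < 6, 1, df.a))
#    a  b  d
# 0  1  4  0
# 1  2  5  1
# 2  3  6  3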

@rhshadrach

How do these differ from Series.mask?

(
    df.a
    .mask(lambda x: <some conditions>, 0) # first condition, result   
    .mask(lambda x: <some conditions>, 1) # second condition, result
)

I can see at least one major difference. Please correct me if I understood incorrectly, but in the method you mentioned you cannot reference other columns, since the lambda only receives the Series being masked.

For example, you cannot do the following:

df.assign(d = pd.when(lambda x: (x.a < 3) & (x.r.isna()), 0)             # first condition, result   
                .when(lambda x: (x.c % 3 == 0) & (x.t == "some str"), 1) # second condition, result 
                .otherwise(df.s))            	                         # default (optional)

If this is the case, is it still evaluated where condition 1 is true? This would have implications if you're doing some op that may raise on rows where condition 1 is true.

Good point, you are right. But as @samukweku mentioned, this is not handled in np.select, and I don't believe it is handled in PySpark either. Maybe it is too much of a corner case and should be handled by the method/lambda written by the user. What do you think?

@samukweku

where all conditions are evaluated, and values assigned based on first match

Good point, I missed that np.select works that way. I think pandas is closer to NumPy than it is to PySpark. Hence, I adjusted the requirements to "values assigned based on first match" instead of "values assigned based on last match".

Given that I don't think Option 2 is viable (happy to be wrong here if there is a working implementation), Option 1 as linked to here looks good to me. However, I think we should drop the column_name argument and have the method return a Series.

Another question about the API is what to do with default. np.select uses 0, which makes sense with NumPy being much more heavily focused on numeric dtypes. However, 0 doesn't seem like an appropriate default to me for pandas. A few other options:

  • Use lib.no_default and insert the appropriate null value for the dtype
  • Use lib.no_default to detect and raise when the user doesn't pass a value
  • Use lib.no_default to detect when the user doesn't pass a value, and only raise when default needs to be used.

These all seem like viable options to me, it's not clear which is to be preferred. Any thoughts?

Yes, returning a pd.Series makes sense; the user can do whatever they want with it.

I'd just set the default to None; however, the first option of using lib.no_default and inserting the appropriate null value seems more plausible. I don't know the full details of lib.no_default though.

Nice.

@samukweku Will you pick this one or shall I give it a shot? Thanks.

@ELHoussineT I'll give it a shot. Thanks. @erfannariman was already working on it, and I do not know the steps involved in taking over. @rhshadrach @erfannariman OK if I make a PR based on our conversations?

@samukweku - most certainly!

@samukweku Thanks! Let me know if you need any help.

@ELHoussineT sure, I'll start on it.

@ELHoussineT please kindly take over this PR. Thanks.

I've been busy with work, but I would still like to give this one more serious try if it's okay with the others.

@erfannariman Sure thing; if you decide to stop working on it, please ping me so I can start.

@erfannariman any progress so far? I've got some time for this now ☺️

This is my first contribution to pandas core, so I would kindly ask for your guidance and patience to make it happen; it would make me happy to successfully contribute to pandas core.

Can you please confirm your agreement with the initial approach in #50343?

Usage example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,            # first condition, result   
                           (other_df.z > 4) | (other_df.x.isna()), 1, # second condition, result 
                           "some_value"))                             # default (optional)

Approach advantages:

  1. .assign(...) logic is not touched. (Changing the core .assign(...) was discouraged by @jreback in my previous attempt here)
  2. Conditions can be passed as a Series or a callable (see "Usage example" in this comment)
  3. Maintains np.select's behaviour of assigning values based on the first match
  4. I tried to introduce minimal logic and instead rely on np.select

Open questions:

  1. How to expose the function so it is accessible by simply calling pd.case_when(...)? I tried adding it here but it's not effective yet.
  2. For the first iteration of this functionality, I would suggest keeping the default as simply None for now. Any disagreements?

Next steps:

  • If you agree to the approach in the mentioned PR, I will proceed with what is left to do in the PR, which is mainly (1) docs and (2) tests


  1. How to expose the function so it is accessible by simply calling pd.case_when(...)? I tried adding it here but it's not effective yet.

You also need to modify pandas.__init__, in addition to the modification linked above.
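
Roughly, the extra step looks like this (the module path below is just a placeholder; in practice the import would follow however the PR organizes it, e.g. through pandas.core.api like other top-level functions):

# in pandas/__init__.py (sketch; the actual import path depends on the PR)
from pandas.core.case_when import case_when  # hypothetical module location

# and add "case_when" to the __all__ list so it becomes part of the
# public, top-level pd.case_when API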

  2. For the first iteration of this functionality, I would suggest keeping the default as simply None for now. Any disagreements?

I think you're saying the default value of the default argument in the function's signature should be None? In such a case, you would not be able to tell if the user passed None (so explicitly wants None) or if they are leaving the default argument unspecified. For this reason, I recommend using lib.no_default.
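
A small sketch of that sentinel pattern (the signature is illustrative only; lib.no_default is the internal sentinel pandas already uses for this purpose):

from pandas._libs import lib

def case_when(*args, default=lib.no_default):
    # sketch: the sentinel lets us distinguish "no default supplied"
    # from an explicit default=None
    if default is lib.no_default:
        # user did not pass a default: insert the dtype-appropriate
        # null value, or raise, depending on the option chosen above
        ...
    elif default is None:
        # user explicitly asked for None
        ...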

@rhshadrach Thanks for your input

I've added case_when to the API and implemented lib.no_default.

Docs and tests are remaining, will add those and move the PR from draft to review.

I took a shot at docs and tests, can you please take a final look? @rhshadrach

#50343

Thanks for picking this up. Do you mind keeping the discussion in your PR, so there's no need to comment here for every update? The core devs will respond when they have time. This way we keep this issue clean. @ELHoussineT