pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Home Page: https://pandas.pydata.org


ENH: Dedicated method for creating conditional columns

erfannariman opened this issue · comments

Creating conditional columns is an operation you perform a lot when wrangling data, yet there is currently no dedicated, explicit way of creating conditional columns in pandas. This can be seen on StackOverflow: that question is my top hit when I google "pandas conditional column" (this will obviously vary per person).

On that StackOverflow link we see the result of not being explicit: at least 6 different methods are mentioned, along with a lot of anti-patterns and methods we do not want users to use because of inefficiency (apply with if/else, or map with a lambda, for example).

This ticket is meant as a starting point for discussing whether we should have a more explicit way of creating conditional columns in pandas. It also matters because we need to look ahead to the switch to EAs (extension arrays): the most-used methods (and the top accepted answer on SO) are NumPy functions and return NumPy arrays.


What do we have now?

For single conditional columns:

  • np.where
  • .loc with assignment
  • Series.where
  • Series.mask, see comment

For multi-conditional columns:

  • np.select
  • complex list comprehensions
  • .apply with if else (not an option imo)

I don't have a suggestion yet, but I know I really like the clean code you can create with np.select and also with SQL CASE WHEN.
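
For reference, a minimal example of the NumPy-based status quo (made-up data, shown only to illustrate the two functions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# single condition: np.where(condition, value_if_true, value_if_false)
df['c'] = np.where(df['a'] < 2, 0, df['b'])

# multiple conditions: np.select evaluates the conditions in order and
# uses the first one that matches; 'default' covers everything else
conditions = [df['a'] < 2, df['b'] < 6]
choices = [0, 1]
df['d'] = np.select(conditions, choices, default=-1)

Note that both return NumPy arrays, which is the extension-array concern raised above.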

Other languages:

SQL has CASE WHEN:

CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END;

R (dplyr) has case_when:

df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                            a == 0 | a == 1 | a == 4 | a == 3 |  c == 4 ~ 3,
                            TRUE ~ NA_real_))

Related tickets / questions:

  • #30357 (using apply with if else)
  • #34543 (conditional join)

One to add to the list is Series.mask, e.g.

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# single condition: set 'b' to 0 where 'a' < 2
mask = df['a'] < 2
df['b'] = df['b'].mask(mask, 0)

# compound condition built from multiple columns
mask = (df['a'] + df['b'] > 2) & (df['a'] + df['b'] < 8)
df['b'] = df['b'].mask(mask, 0)

One option, using pd.Series.mask under the hood, is to use a comma as a separator, where the condition comes before the result. I feel it is easier to understand what is going on, and it makes the code a bit cleaner (my opinion). It also allows pandas data types to be retained, since under the hood the implementation uses Series.mask, not np.select. Using @rhshadrach's examples above:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.assign(c=case_when(df.a < 2, 0, df.b))
# multiple conditions:
df.assign(d=case_when(df.a < 2, 0,   # first condition, result
                      df.b < 2, 1,   # second condition, result
                      df.a))         # default

The idea is inspired by if_else in pydatatable and fcase in R's data.table; a chainable method is implemented in pyjanitor (currently in dev) as case_when.


This looks like a clean method for the API. It would mean a general function, e.g. pd.case_when, which makes sense to me; we already have mask and where on DataFrame and Series.

Instead of a trailing positional default, I think a dedicated default argument for the default value would be clearer.

datar supports if_else and case_when (like dplyr's):

>>> import pandas as pd
>>> from datar.all import f, mutate, case_when, if_else
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df >> mutate(c=if_else(f.a < 2, 0, f.b))
        a       b         c
  <int64> <int64> <float64>
0       1       4       0.0
1       2       5       5.0
2       3       6       6.0
>>> df >> mutate(d=case_when(f.a < 2, 0, f.b < 2, 1, True, f.a))
        a       b         d
  <int64> <int64> <float64>
0       1       4       0.0
1       2       5       2.0
2       3       6       3.0

would be averse to adding case_when as an api, but would require a champion and substantial testing.

would be averse to adding case_when as an api, but would require a champion and substantial testing.

I agree on the latter; I started with an attempt to implement this and quickly ran into quite a bit of complexity. But given the original post and problem statement, wouldn't you agree a "data wrangling" library needs a dedicated method for creating conditional columns? It is an action you perform quite often when transforming data.

That was a typo; it should read "wouldn't be averse". That said, this is a substantial effort and would have to be designed properly.

@erfannariman what complexity did you encounter?

@erfannariman what complexity did you encounter?

The first implementation of creating a conditional column was relatively easy, but then came the design choices around NumPy dtypes vs. pandas extension dtypes, plus passing all the tests, the edge cases, and the mypy type checks.

And np.select is still the fastest method for multi-conditional cases as far as I tested, so using it under the hood made sense to me.

But to be fair, the attempt was relatively short compared to what is needed to implement this.

If you do not mind, @erfannariman, could you kindly share the edge cases, as well as the dtype issues, so I can test on my end?

Any decision made here on what method to proceed with?

I am still a bit biased towards my suggestion in (issue: #46285, PR: #46286) because it's based on lambdas. Any major counter-arguments?

Any major counter-arguments?

If I'm understanding the code correctly, those lambdas will be called row-by-row. In that case, they would be highly inefficient when compared to vectorized operations.

If I'm understanding the code correctly, those lambdas will be called row-by-row. In that case, they would be highly inefficient when compared to vectorized operations.

@rhshadrach

Isn't it the same row-by-row assignment when we do:

df.assign(
    column_1 = lambda x: ...,
    column_2 = lambda x: ...,
    column_3 = lambda x: ...,
)

The columns above get assigned one by one, but in the assignment of the column itself, we have vectorisation. I believe this is similar to what was proposed in (issue: #46285, PR: #46286).
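
A concrete version of that pattern (columns are made up, purely to illustrate that each callable is evaluated once against the whole frame rather than row by row):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# each callable receives the whole (intermediate) DataFrame and returns an
# entire column in one vectorized operation; assignments happen left to
# right, so later callables can see the columns created before them
df = df.assign(
    total=lambda x: x.a + x.b,
    flagged=lambda x: x.total > 7,
)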

If you agree, I can re-open the PR and we can drive this feature forward together.

Thanks.

@erfannariman just checking if you are still working on this. If you are not working on it, I can take over and raise a PR

@ELHoussineT I see, I missed that you're allowing the values to be Series as well as callable. It seems to me this API would not be able to handle the example in #39154 (comment), is that correct?

@rhshadrach @ELHoussineT, one option would be to create a standalone if_else function that can be used within assign or on its own. That way assign stays the same, with no changes, and can be adapted to whatever assignment operation is required. Have a look at an example in #39154 (comment) for a possible implementation.

@rhshadrach

It seems to me this API would not be able to handle the example in #39154 (comment), is that correct?

I believe the only missing feature is the "default" value. I can tweak this API to support that and add some tests.

If you agree, I will open a PR.

@ELHoussineT - can you show what #39154 (comment) would look like with your suggested signature? My understanding is that the keys of the dictionary are to become the values in the result. I don't see how you can use a Series in a straight-forward manner as the values in the result.

@rhshadrach

Alright, let's start by explicitly describing the requirements for the functionality we want to build. Please feel free to suggest changes.

Requirements

The functionality should satisfy the following:

  1. Can assign multiple values to a column based on multiple conditions
  2. Vectorization is maintained
  3. The conditions can be lambdas/functions or Series. If they are lambdas/functions, the DataFrame is passed to them
  4. Ability to provide a "default" value (the value to use when all conditions fail). If no default is defined, use the current value if one exists, otherwise None
  5. All conditions are evaluated, and values are assigned based on the first match. This is an important point to agree on: in the example below, items that satisfy both condition 1 and condition 3 will get value 1. (A concrete np.select comparison follows this list.)
"value 1" = lambda x: <conditions 1>,
"value 2" = lambda x: <conditions 2>,
"value 3" = lambda x: <conditions 3>

Options

Now that we have agreed on the requirements at a high level, here are (IMHO) our options:

Option 1:

What has been suggested by @samukweku in their comment.

Example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,   # first condition, result   
                           lambda x: <some conditions>, 1,   # second condition, result 
                           df.a))                            # default (optional)

Option 2:

We follow what Spark did with their when(...).otherwise(...) (docs) functionality.

Example:

df.assign(d = pd.when(lambda x: <some conditions>, 0) # first condition, result   
                .when(lambda x: <some conditions>, 1) # second condition, result 
                .otherwise(df.a))            	      # default (optional)

Please note that if we choose this option, upon implementation we can rename .otherwise(df.a) to .default(df.a) for convenience.
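
For reference, the Spark API being mirrored looks roughly like this in PySpark (shown only for comparison; the data and column names are made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 3, 7), (5, 6, 8)], ["a", "c", "s"])

sdf = sdf.withColumn(
    "d",
    F.when(F.col("a") < 3, 0)        # first condition, result
     .when(F.col("c") % 3 == 0, 1)   # second condition, result
     .otherwise(F.col("s")),         # default
)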

My opinion

I would go with option 2, as IMHO it is more intuitive than option 1. I would also be happy to hear what you think.

Let me know your thoughts so we can choose and implement a solution.

How do these differ from Series.mask?

(
    df.a
    .mask(lambda x: <some conditions>, 0) # first condition, result   
    .mask(lambda x: <some conditions>, 1) # second condition, result
)

Also, I don't see how we can implement otherwise. pyspark can do this because of lazy evaluation, but pandas is eager.

Edit: Ah, option 1 allows the default. Am I to assume that condition 2 will only be utilized where condition 1 was false? If this is the case, is it still evaluated where condition 1 is true? This would have implications if you're doing some op that may raise on rows where condition 1 is true.

I can speak about option 1. It uses the same idea as np.select: all conditions are evaluated and values are assigned based on the first match, using Series.mask. So if condition 1 is true and condition 2 is also true, condition 1 takes precedence. If all conditions fail, the default kicks in. Basically a for loop, just like in np.select, but using Series.mask. Hopefully I explained it well enough. An implementation is here and an example from StackOverflow.
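
A minimal sketch of that idea, assuming alternating condition/value arguments with a trailing default (the name case_when and the exact signature are illustrative here, not the linked implementation verbatim):

import pandas as pd

def case_when(*args):
    # alternating condition, value pairs, followed by a default
    *pairs, default = args
    conditions, values = pairs[::2], pairs[1::2]
    result = default.copy() if isinstance(default, pd.Series) else pd.Series(
        default, index=conditions[0].index
    )
    # apply the pairs in reverse with Series.mask so that the first
    # matching condition takes precedence, mirroring np.select
    for cond, value in reversed(list(zip(conditions, values))):
        result = result.mask(cond, value)
    return result

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.assign(d=case_when(df.a < 2, 0, df.b < 6, 1, df.a))
#    a  b  d
# 0  1  4  0
# 1  2  5  1
# 2  3  6  3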

@rhshadrach

How do these differ from Series.mask?

(
    df.a
    .mask(lambda x: <some conditions>, 0) # first condition, result   
    .mask(lambda x: <some conditions>, 1) # second condition, result
)

I can see at least one major difference. Please correct me if I understood incorrectly, but in the method you mentioned you cannot reference other columns, since the lambda only receives the Series being masked.

For example, you cannot do the following:

df.assign(d = pd.when(lambda x: (x.a < 3) & (x.r.isna()), 0)             # first condition, result   
                .when(lambda x: (x.c % 3 == 0) & (x.t == "some str"), 1) # second condition, result 
                .otherwise(df.s))            	                         # default (optional)

If this is the case, is it still evaluated where condition 1 is true? This would have implications if you're doing some op that may raise on rows where condition 1 is true.

Good point, you are right. But as @samukweku mentioned, this is not handled in np.select, and I don't believe it is handled in PySpark either. Maybe it is too much of a corner case and should be handled by the method/lambda written by the user. What do you think?

@samukweku

where all conditions are evaluated, and values assigned based on first match

Good point, I missed that np.select works that way. I think pandas is closer to NumPy than it is to PySpark. Hence, I adjusted the requirements to "values assigned based on first match" instead of "values assigned based on last match".

Given that I don't think Option 2 is viable (happy to be wrong here if there is a working implementation), Option 1 as linked to here looks good to me. However, I think we should drop the column_name argument and have the method return a Series.

Another question about the API is what to do with default. np.select uses 0, which makes sense with NumPy being much more heavily focused on numeric dtypes. However, 0 doesn't seem like an appropriate default to me for pandas. A few other options:

  • Use lib.no_default and insert the appropriate null value for the dtype
  • Use lib.no_default to detect and raise when the user doesn't pass a value
  • Use lib.no_default to detect when the user doesn't pass a value, and only raise when default needs to be used.

These all seem like viable options to me, it's not clear which is to be preferred. Any thoughts?

Yes, returning a pd.Series makes sense; the user can do whatever they want with it.

I'd just set the default to None; however, the first option of using lib.no_default and inserting the appropriate null value seems more plausible. I don't know the full details of lib.no_default though.

Nice.

@samukweku Will you pick this one or shall I give it a shot? Thanks.

@ELHoussineT I'll give it a shot. Thanks. @erfannariman was already working on it, and I do not know the steps involved in taking over. @rhshadrach @erfannariman OK if I make a PR based on our conversations?

@samukweku - most certainly!

@samukweku Thanks! Let me know if you need any help.

@ELHoussineT sure, I'll start on it.

@ELHoussineT please kindly take over this PR. Thanks.

I've been busy with work, but I would still like to give this one more serious try if it's okay with the others.

@erfannariman Sure thing; if you decide to stop working on it, please ping me so I can start.

@erfannariman any progress so far? I've got some time for this now ☺️

This is my first contribution to pandas core, so I would kindly ask for your guidance and patience to make it happen; it would make me happy to successfully contribute to pandas core.

Can you please confirm your agreement with the initial approach in #50343?

Usage example:

df.assign(d = pd.case_when(lambda x: <some conditions>, 0,            # first condition, result   
                           (other_df.z > 4) | (other_df.x.isna()), 1, # second condition, result 
                           "some_value"))                             # default (optional)

Approach advantages:

  1. .assign(...) logic is not touched. (Changing the core .assign(...) was discouraged by @jreback in my previous attempt here)
  2. Conditions can be passed as a Series or a callable (see "Usage example" in this comment)
  3. Maintains np.select's behaviour of assigning values based on the first match
  4. I tried to introduce minimal logic and instead rely on np.select

Open questions:

  1. How to expose the function so it is accessible by simply calling pd.case_when(...)? I tried adding it here but it's not effective yet.
  2. For the first iteration of this functionality, I would suggest keeping the default as simply None for now. Any disagreements?

Next steps:

  • If you agree to the approach in the mentioned PR, I will proceed with what is left to do in the PR, which is mainly (1) docs and (2) tests


  1. How to expose the function so it is accessible by simply calling pd.case_when(...)? I tried adding it here but it's not effective yet.

You also need to modify pandas.__init__, in addition to the modification linked above.
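
Roughly, the extra step looks like this (the module path below is just a placeholder; in practice the import would follow however the PR organizes it, e.g. through pandas.core.api like other top-level functions):

# in pandas/__init__.py (sketch; the actual import path depends on the PR)
from pandas.core.case_when import case_when  # hypothetical module location

# and add "case_when" to the __all__ list so it becomes part of the
# public, top-level pd.case_when API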

  2. For the first iteration of this functionality, I would suggest keeping the default as simply None for now. Any disagreements?

I think you're saying the default value of the default argument in the function's signature should be None? In such a case, you would not be able to tell if the user passed None (so explicitly wants None) or if they are leaving the default argument unspecified. For this reason, I recommend using lib.no_default.
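
A small sketch of that sentinel pattern (the signature is illustrative only; lib.no_default is the internal sentinel pandas already uses for this purpose):

from pandas._libs import lib

def case_when(*args, default=lib.no_default):
    # sketch: the sentinel lets us distinguish "no default supplied"
    # from an explicit default=None
    if default is lib.no_default:
        # user did not pass a default: insert the dtype-appropriate
        # null value, or raise, depending on the option chosen above
        ...
    elif default is None:
        # user explicitly asked for None
        ...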

@rhshadrach Thanks for your input

I've added case_when to the API and implemented lib.no_default.

Docs and tests are remaining, will add those and move the PR from draft to review.

I took a shot at docs and tests, can you please take a final look? @rhshadrach

#50343

Thanks for picking this up. Do you mind keeping the discussion in your PR, so there's no need to comment here for every update? The core devs will respond when they have time. This way we keep this issue clean. @ELHoussineT