How would you like feedback?
jcmkk3 opened this issue · comments
Hi Max,
I'm really excited to see this new project of yours. You've designed a lot of great APIs and the dataframe space in python could use some more exploration.
I see that you've listed some prompts for feedback in the README. I would love to provide feedback and ideas. It always feels a bit awkward to me to use the Issues section of GitHub for brainstorming discussions, but I'm happy to follow your lead on whatever you'd prefer.
Would you like feedback here, or do you think it makes sense to turn on the Discussions section of the repo and add it there?
🙈 I wasn't expecting anyone to find this yet!
(That you did, though, is another positive indicator that there's space for a "Trello-like" data manipulation tool alongside "Jira-like" pandas)
But, I'll accept any and all feedback! (In this thread, here, is totally fine)
I'm buttoning up the docstrings, readme, examples, and the rest of the unit tests today (+ this weekend), but the API for 1.0 is 98% done!
Overall, I really like where this library is headed. It feels like a pretty pythonic take on the grammar of data manipulation. I'm still trying to feel out where your line is between simple-in-implementation and simple-in-use. Below are some comments from my initial exploration of the library.
I felt a bit confused about whether redframes was taking an iterative approach or an array/vectorized approach. For instance, the three primary places that accept a function as an argument take them in different forms.
- `mutate`: accepts a function that takes a row of the data frame and returns a single value
- `filter`: accepts a function that takes a data frame and returns a boolean sequence that is used as a mask
- `summarize`: accepts a function that takes a column and returns a single value
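To make the three shapes concrete, here is a plain-Python sketch (a dict-of-lists stands in for a data frame; these are illustrative analogues of the signatures described above, not the actual redframes API):

```python
# Illustrative only -- plain-Python analogues of the three callable shapes.
frame = {"foo": [1, 2, 3], "bar": [4, 5, 6]}

# mutate-style: takes a row, returns a single value
mutate_fn = lambda row: row["foo"] * 2

# filter-style: takes the whole frame, returns a boolean mask
filter_fn = lambda df: [x > 1 for x in df["foo"]]

# summarize-style: takes a column, returns a single value
summarize_fn = lambda col: sum(col)

rows = [dict(zip(frame, vals)) for vals in zip(*frame.values())]
doubled = [mutate_fn(r) for r in rows]  # applied row by row
mask = filter_fn(frame)                 # applied to the frame at once
total = summarize_fn(frame["foo"])      # applied to one column
```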
If the approach was to be iterative, I would expect `filter` to be passed a row and return a boolean. `summarize` might be an exceptional case where it could be expected to behave differently. For maximum consistency, it would probably look like `reduce`, but I'm not sure that is actually desirable. At least not without some pre-defined reducer objects or something that could be used as a shortcut for the common aggregations.
If the approach was to be array/vectorized, I would expect `mutate` to operate on a whole column. That would mean passing it the whole data frame like the `filter` operation does. One benefit to this option is that it would be possible to mix an aggregation into a calculation like `{"pct_count": lambda row: row["count"] / sum(row["count"])}`. I could understand if this was undesirable depending on your vision of simplicity.
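For reference, this mix of an aggregate and elementwise math is exactly what a vectorized callable allows in pandas `assign` (a sketch, assuming pandas is available):

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3]})

# The lambda receives the whole frame, so it can mix an aggregate
# (sum) into an elementwise calculation in one expression.
out = df.assign(pct_count=lambda d: d["count"] / d["count"].sum())
```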
take: `take` is typically the opposite of `drop`, but in this library, those two verbs work on different dimensions of the data frame (rows vs columns). I feel like `select` and `drop` can work well together, but I think that it would be better to use a different verb for row selection to reduce confusion. See `itertools.takewhile` and `itertools.dropwhile` as examples of these names being used as opposites in Python.
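The `itertools` precedent, for reference:

```python
from itertools import dropwhile, takewhile

nums = [1, 2, 3, 10, 4]

# take elements while the predicate holds...
kept = list(takewhile(lambda x: x < 5, nums))   # [1, 2, 3]
# ...drop elements while the same predicate holds
rest = list(dropwhile(lambda x: x < 5, nums))   # [10, 4]
```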
accumulate: It was surprising to me that this was hardcoded to only do a cumulative sum. I feel like this method should either be renamed or better aligned with the capabilities and (potentially) function signature of `itertools.accumulate`.
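For comparison, `itertools.accumulate` defaults to a running sum but accepts any binary function:

```python
import operator
from itertools import accumulate

values = [1, 2, 3, 4]

sums = list(accumulate(values))                    # default: running sum
products = list(accumulate(values, operator.mul))  # running product
running_max = list(accumulate([5, 2, 8], max))     # running maximum
```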
wrap/unwrap: I wonder if implementing the dataframe interchange protocol would remove the need for these functions? That would also make it possible to interoperate with other data frame libraries in addition to pandas.
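For context, a sketch of the consumer side of the interchange protocol in pandas (assuming pandas >= 1.5; any library whose objects expose `__dataframe__` could be consumed the same way):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# from_dataframe accepts any object that implements __dataframe__,
# regardless of which library produced it.
round_tripped = pd.api.interchange.from_dataframe(df)
```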
rename: Since `mutate` and `summarize` take dictionaries that use the keys as the column name after calculation, I believe that `rename` should follow the same pattern. So `{"new_name": "old_name"}`.
group: It might be worth considering whether `group` is better as a parameter on the functions that could be affected by it. Hadley has stated before that he wishes he had gone that direction in dplyr. It also appears that they will be adding those parameters in a future dplyr release.
summarize: arquero and d3 both use `rollup` as their verb for this action. It has somehow really resonated with me. It gets rid of the American/British English spelling difference and is also shorter, which better matches most other verbs. I think that `summarize` is a good verb as well, I just like `rollup` better.
count: Counting is something that I do very frequently and I appreciate dplyr's `count` or pandas' `value_counts`. It felt especially clunky to count with `summarize` because I had to pass `rf.stat.count` a column reference, even though I just wanted to count the whole data frame. That could be made better if the aggregate functions were passed the whole data frame and were in charge of selecting the appropriate columns themselves, however. Something like `summarize({"count": rf.Count()})`. I think that would be a good change, but a data-frame-level `count` verb would also be worth having.
transform: I don't actually like the verb name, but if `mutate` can't take (and expand) an aggregate, then it feels like there needs to be some more explicit way to do it. dataflow-api calls this a joinaggregate, for instance.
Really appreciate your kind words and thoughtful feedback!
At a high level "simple-in-use" should trump "simple-in-implementation". That said, because I've got pandas on the backend (for now) some things just aren't possible (without a massive lift).
But, you're right, if this worked it would be sick (just don't know how to make it work):
```python
df.mutate({"pct_count": lambda row: row["count"] / sum(row["count"])})
```
Concerning the signatures for `mutate`, `filter`, `summarize`... I spent a lot of time thinking about and prototyping these out before doing the rest of the verbs. It would be awesome if each just accepted a function, but it's just not possible to get all the signatures to match. Consider the following:
```python
### MUTATE

# Current:
df.mutate({"foo2": lambda row: row["foo"] * 2})

# Option B: (But how to assign this to "foo2"?)
df.mutate(lambda row: row["foo"] * 2)

# Option C: (But how to accept multiple columns in this form?)
df.mutate("foo2", lambda x: x * 2, "foo")

# Option D: (But kwargs seem mysterious??)
df.mutate(
    foo2=lambda row: row["foo"] * 2,
    foo3=lambda row: row["foo"] / 3
)

# Option E: (But this isn't chainable)
df["foo2"] = df["foo"] * 2

### FILTER

# Current:
df.filter(lambda row: (row["foo"] == 2) | (row["bar"] < 2) & (roo))

# Option B: (But how to represent or?)
df.filter({"foo": lambda x: x < 5, "bar": lambda x: x == 10})

### SUMMARIZE

# Current:
df.summarize({
    "foo_mean": ("foo", rf.stat.mean),
    "foo_sum": ("foo", rf.stat.sum)
})

# Option B: (But how to assign this to a column? How to name it? How to do multiple summaries?)
df.summarize(lambda d: sum(d["foo"]))

# Option C: (But where does "foo" even come from? sum("foo") isn't real python!)
df.summarize(foo_mean=sum("foo"))
```
Verb-by-verb responses:
take: `select` and `drop` are already established verbs in pandas, so I just kept these the same. Obviously, I couldn't use "head" and "tail" because they ain't verbs (lol), so I just had to pick some word... well, I didn't just pick a word, I actually stole it from dask (I know it's not perfect, but it grows on you)
accumulate: I had originally planned for this verb to accommodate min/max/mean/sum but scrapped it (for now) because I've never actually used `df["col"].cummin()` before
wrap/unwrap: Thanks for the tip on the dataframe interchange protocol... I'll definitely adopt this in 1.1+
rename: I really struggled with `{"old_name": "new_name"}` versus `{"new_name": "old_name"}` but went for the former because of the established precedent set by pandas...
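The pandas precedent, for reference (`rename` maps old name to new name):

```python
import pandas as pd

df = pd.DataFrame({"old_name": [1, 2]})

# pandas maps {old: new} -- the opposite direction of the keys
# in mutate/summarize dictionaries
renamed = df.rename(columns={"old_name": "new_name"})
```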
group: I actually considered putting "by" arguments in `take`, `summarize`, `rank`, `accumulate`, etc. Super interesting to read about Hadley's regrets. I'll have to take them seriously and think some more about this.
summarize: `rollup` is a great verb (far better than aggregate)! I might actually alias to this...
count: there is `df.dimensions["rows"]`? But maybe I should consider a dplyr-like "tally" verb...
transform: If there's enough demand for a transform verb I'll add! But I just can't really think up legit ML/vis workflow use cases right now.
`filter` is already using a signature that would match up with these. It just doesn't assign names to the function, but that makes sense for a filter.
Playing around with assign this morning (great tip!)
It works:
But breaks something like this:
Honestly, I'm hesitant about adopting vectorized operations, because they don't appear in regular/analog Python...
I always wanted mutate to feel like this (like map):
Curious if you see a way around the break?
Your ideas about summarize/rollup are certainly interesting. I actually quite like:
```python
{
    "max_bill_length_mm": lambda df: rf.stat.max(df["bill_length_mm"])
}
```
But I think bare functions like `len` might introduce confusion...
And, right now, I'm pretty opposed to something like `Mean("body_mass_g")` as it can't stand alone as "valid Python"...
Just have to figure out how to make your idea of creating an entirely new DataFrame work with "group/by"~
A potential issue with:
```python
{
    "max_bill_length_mm": lambda df: rf.stat.max(df["bill_length_mm"])
}
```
If we expose the entire "DataFrame", what's preventing someone from doing something crazy like this:
```python
{
    "sillyness": lambda df: np.min(df["foo"]) + np.min(df["bar"])
}
```
This is how I would do percentage total mutate as it currently stands:
```python
import redframes as rf

df = rf.DataFrame({"foo": [1, 2, 3]})

(
    df.mutate({
        "total": lambda _: sum(df["foo"]),
        "percent": lambda row: row["foo"] / row["total"]
    })
)
```

| foo | total | percent |
|---|---|---|
| 1 | 6 | 0.166667 |
| 2 | 6 | 0.333333 |
| 3 | 6 | 0.5 |
Certainly not perfect, but not horrific, either...
First, I want to state that I don't think that you need to make any of these changes and I'm mostly just brainstorming ways that I could see it being more flexible and consistent. I won't have any hurt feelings if you decide that any of this doesn't fit in with your vision.
But breaks something like this:
I think that this is okay behavior in my opinion. Like many SQL dialects and `assign` in pandas, you cannot refer to columns created in the same "query" as the current expression. Someone could chain `mutate` if they needed to do that. I'm not sure if pandas does it under the hood, but I assume that this restriction is beneficial to other backends like modin and dask so that they can run each expression in parallel.
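Chaining looks like this with pandas `assign`, for example (a sketch; each call in the chain can see the columns created by the previous one):

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3]})

out = (
    df
    .assign(total=lambda d: d["foo"].sum())           # first "query"
    .assign(percent=lambda d: d["foo"] / d["total"])  # can now see "total"
)
```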
Honestly, I'm hesitant about adopting vectorized operations, because they don't appear in regular/analog Python...
I always wanted mutate to feel like this (like map):
I am a fan of functional programming so I do like the idea of map-like behavior. For a library like this, however, vectorized operations are going to allow users to scale the furthest. They are also already a core part of the python data ecosystem. The reality is that a fluent interface to manipulate data in a pipeline is already beyond typical idiomatic python. Map/reduce like behavior is technically part of python, but is also not considered idiomatic. The idiomatic way would be a bunch of comprehensions that are assigned to intermediate variables. I think that what you are already working toward is the best known approach for this sort of API.
But I think bare functions like len might introduce confusion...
And, right now, I'm pretty opposed to something like Mean("body_mass_g") as it can't stand alone as "valid Python"...
I should have been clear that I just put `len` in there as an example of what could work, but I would hope that there would actually just be a wrapper around it so that it was `count` instead.
I can understand the objection to `Mean("body_mass_g")`, but hear me out for one more moment. This might be too magic for you, but you could have a function/class that was able to accommodate both usages.
This is a quick and dirty way to implement it, but you should be able to create a wrapper function that could be used as a decorator on the aggregate function that you'd want to behave that way.
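A hypothetical sketch of such a wrapper (the names and behavior here are my guess at what's being described, not a real redframes API): called with actual values it aggregates immediately; called with a column *name* it returns a deferred aggregate that a `summarize`-style verb could resolve against the frame later.

```python
def aggregate(func):
    """Wrap an aggregate so it works both directly and deferred."""
    def wrapper(arg):
        if isinstance(arg, str):
            # Called with a column name: defer until a frame is supplied.
            return lambda frame: func(frame[arg])
        # Called with actual values: aggregate immediately.
        return func(arg)
    return wrapper

@aggregate
def mean(values):
    return sum(values) / len(values)

direct = mean([2, 4, 6])                 # immediate aggregation
deferred = mean("foo")                   # deferred aggregate
resolved = deferred({"foo": [1, 2, 3]})  # resolved against a frame later
```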
If we expose the entire "DataFrame", what's preventing someone from doing something crazy like this:
Again, I don't actually see this as a problem. It is silly, but to me the contract of summarize is that you give it a dataframe and it returns a single value. It would be important to ensure that the dataframe being passed is a view so that the aggregate functions can't modify the underlying dataframe, but that is already the case for `mutate`/`assign` and `filter`. One example of where this flexibility could actually be a benefit is calculating the "mean_intensity" in this beer dataset. FYI, you can do this no problem in dplyr.
This is how I would do percentage total mutate as it currently stands:
```python
import redframes as rf

df = rf.DataFrame({"foo": [1, 2, 3]})

(
    df.mutate({
        "total": lambda _: sum(df["foo"]),
        "percent": lambda row: row["foo"] / row["total"]
    })
)
```
The problem with this approach is that most of the time that I would actually want this, I would want to know the percent of total over groups. Currently, that would mean that I would have to assign an intermediate value after counting the groups instead of being able to perform all operations in a chain.
In pandas, I could do this to get the same results:
```python
(
    penguins
    .groupby("species", as_index=False)
    .agg(**{"count": ("species", len)})
    .assign(**{"pct_count": lambda df: df["count"] / sum(df["count"])})
)
```
Haven't forgotten about you!
Just published 1.1 with `rollup` and `__dataframe__` support.
I'm still thinking about/working on a compromise for `mutate`, stay tuned, and thanks for all of your feedback and interest so far!
Just closing this issue because there's nothing to resolve. I'm happy to keep discussing/brainstorming, though. I like the latest changes and look forward to seeing how this library evolves.
I did a little POC experiment brainstorming additional dataframe APIs. This one would be a little bit like scikit-learn's pipelines or a more powerful version of the pandas pipe. It does help with being extendable since it doesn't require attaching methods to the main data object. It also plays pretty nicely with black formatting and doesn't require wrapping additional parenthesis for long chains. Mostly, just fun thinking of API ideas. https://github.com/jcmkk3/random-experiments/blob/main/tablular_pipelines.ipynb
We've been discussing this library over at PRQL — we really like it!
OTOH something that stood out to me was the pandas comparison — pandas does allow for a fluent approach to method chaining — but the comparison shows lots of assigning to `df`. Given that folks who see this will often be experienced at pandas, showing the best possible pandas comparison would demonstrate a depth of understanding, as well as being more useful.