maxhumber / redframes

General Purpose Data Manipulation Library

How would you like feedback?

jcmkk3 opened this issue · comments

Hi Max,

I'm really excited to see this new project of yours. You've designed a lot of great APIs and the dataframe space in python could use some more exploration.

I see that you've listed some prompts for feedback in the README. I would love to provide feedback and ideas. It always feels a bit awkward to me to use the Issues section of GitHub for brainstorming discussions, but I'm happy to follow your lead on whatever you'd prefer.

Would you like feedback here, or do you think it makes sense to turn on the Discussions section of the repo and add it there?

🙈 I wasn't expecting anyone to find this yet!

(That you did, though, is another positive indicator that tells me there's space for a "Trello-like" data manipulation tool alongside "Jira-like" pandas)

But, I'll accept any and all feedback! (In this thread, here, is totally fine)

I'm buttoning up the docstrings, readme, examples, and the rest of the unit tests today (+ this weekend), but the API for 1.0 is 98% done!

Overall, I really like where this library is headed. It feels like a pretty pythonic take on the grammar of data manipulation. I'm still trying to feel out where your line is between simple-in-implementation and simple-in-use. Below are some comments from my initial exploration of the library.

I felt a bit confused about whether redframes was taking an iterative approach or an array/vectorized approach. For instance, the three primary places that accept a function as an argument take them in different forms.

  1. mutate: Accepts a function that takes a row of the data frame and returns a single value
  2. filter: Accepts a function that takes a data frame and returns a boolean sequence that is used as a mask
  3. summarize: Accepts a function that takes a column and returns a single value

If the approach was to be iterative, I would expect filter to be passed a row and return a boolean. summarize might be an exceptional case where it could be expected to behave differently. For maximum consistency, it would probably look like reduce, but I'm not sure that is actually desirable. At least not without some pre-defined reducer objects or something that could be used as a shortcut for the common aggregations.

If the approach was to be array/vectorized, I would expect mutate to operate on a whole column. That would mean passing it the whole data frame like the filter operation does. One benefit to this option is that it would be possible to mix an aggregation into a calculation like {"pct_count": lambda row: row["count"] / sum(row["count"])}. I could understand if this was undesirable depending on your vision of simplicity.
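To make the vectorized option concrete, here is a hedged sketch over a plain dict-of-columns (illustrative only, not redframes' actual API): each function receives all columns, so an aggregate like `sum` can be mixed into a row-wise calculation.

```python
# Hypothetical vectorized mutate: each function gets the whole frame
# (here just a dict of columns) and returns a full column.
def mutate(columns: dict, over: dict) -> dict:
    out = dict(columns)
    for name, func in over.items():
        out[name] = func(out)
    return out

df = {"count": [1, 2, 3]}
df = mutate(df, {"pct_count": lambda d: [x / sum(d["count"]) for x in d["count"]]})
# df["pct_count"] → [1/6, 1/3, 0.5]
```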

take: take is typically the opposite of drop, but in this library, those two verbs work on different dimensions of the data frame (rows vs columns). I feel like select and drop can work well together, but I think that it would be better to use a different verb for row selection to reduce confusion. See itertools.takewhile and itertools.dropwhile as examples of these names being used as opposites in python.

accumulate: It was surprising to me that this was hardcoded to only do a cumulative sum. I feel like this method should either be renamed or better aligned with the capabilities and (potentially) function signature of itertools.accumulate.
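For reference, itertools.accumulate defaults to a running sum but accepts any binary function, which is the flexibility the comment above alludes to:

```python
from itertools import accumulate
import operator

values = [1, 2, 3, 4]

# Default behavior is a cumulative sum...
print(list(accumulate(values)))                # [1, 3, 6, 10]

# ...but any binary function works, e.g. running product or running max
print(list(accumulate(values, operator.mul)))  # [1, 2, 6, 24]
print(list(accumulate(values, max)))           # [1, 2, 3, 4]
```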

wrap/unwrap: I wonder if implementing the dataframe interchange protocol would remove the need for these functions? That would also make it possible to interoperate with other data frame libraries in addition to pandas.

rename: Since mutate and summarize take dictionaries that use the keys as the column name after calculation, I believe that rename should follow the same pattern. So {"new_name": "old_name"}

group: It might be worth considering whether group is better as a parameter on the functions that could be affected by it. Hadley has said before that he wishes he had gone that direction in dplyr. It also appears that those parameters will be added in a future dplyr release.

summarize: arquero and d3 both use rollup as their verb for this action. It has somehow really resonated with me. It gets rid of the American/British English spelling difference and is also shorter, which better matches most other verbs. I think summarize is a good verb as well; I just like rollup better.

count: Counting is something that I do very frequently and I appreciate dplyr's count or pandas' value_counts. It felt especially clunky to count with summarize because I had to pass rf.stat.count a column reference, even though I just want to count the whole data frame. That could be made better if the aggregate functions were passed the whole data frame and were in charge of selecting the appropriate columns themselves. Something like summarize({"count": rf.Count()}). I think that would be a good change, but I'd also like a data frame-level count verb.

transform: I don't actually like the verb name, but if mutate can't take (and expand) an aggregate, then it feels like there needs to be some more explicit way to do it. dataflow-api calls this a joinaggregate, for instance.

Really appreciate your kind words and thoughtful feedback!

At a high level "simple-in-use" should trump "simple-in-implementation". That said, because I've got pandas on the backend (for now) some things just aren't possible (without a massive lift).

But, you're right, if this worked it would be sick (just don't know how to make it work):

df.mutate({"pct_count": lambda row: row["count"] / sum(row["count"])})

Concerning the signatures for mutate, filter, summarize... I spent a lot of time thinking about and prototyping these out before doing the rest of the verbs. It would be awesome if each just accepted a function, but it's just not possible to get all the signatures to match. Consider the following:

### MUTATE

# Current:
df.mutate({"foo2": lambda row: row["foo"] * 2})

# Option B: (But how to assign this to "foo2"?)
df.mutate(lambda row: row["foo"] * 2)

# Option C: (But how to accept multiple columns in this form?)
df.mutate("foo2", lambda x: x * 2, "foo") 

# Option D: (But kwargs seem mysterious??)
df.mutate(
    foo2=lambda row: row["foo"] * 2, 
    foo3=lambda row: row["foo"] / 3
) 

# Option E: (But this isn't chainable)
df["foo2"] = df["foo"] * 2

### FILTER

# Current:
df.filter(lambda row: (row["foo"] == 2) | (row["bar"] < 2))

# Option B: (But how to represent or?)
df.filter({"foo": lambda x: x < 5, "bar": lambda x: x == 10})

### SUMMARIZE

# Current:
df.summarize({
    "foo_mean": ("foo", rf.stat.mean),
    "foo_sum": ("foo", rf.stat.sum)
})

# Option B: (But how to assign this to a column? How to name it? How to do multiple summaries?)
df.summarize(lambda d: sum(d["foo"]))

# Option C: (But where does "foo" even come from? sum("foo") isn't real python!)
df.summarize(foo_mean=sum("foo"))

Verb-by-verb responses:

take: select and drop are already established verbs in pandas, so I just kept these the same. Obviously, I couldn't use "head" and "tail" because they ain't verbs (lol), so I just had to pick some word... well, I didn't just pick a word, I actually stole it from dask (I know it's not perfect, but it grows on you)

accumulate: I had originally planned for this verb to accommodate min/max/mean/sum but scrapped it (for now) because I've never actually used df["col"].cummin() before

wrap/unwrap: Thanks for the tip on the dataframe interchange protocol... I'll definitely adopt this in 1.1+

rename: I really struggled with {"old_name": "new_name"} versus {"new_name": "old_name"} but went for the former because of the established precedent set by pandas...

group: I actually considered putting "by" arguments in take, summarize, rank, accumulate, etc. Super interesting to read about Hadley's regrets. I'll have to take them seriously and think some more about this.

summarize: rollup is a great verb (far better than aggregate)! I might actually alias to this...

count: there is df.dimensions["rows"]? But maybe I should consider a dplyr like "tally" verb...

transform: If there's enough demand for a transform verb I'll add! But I just can't really think up legit ML/vis workflow use cases right now.

This is generally how I imagine that mutate would work. It would likely use pandas' underlying assign method instead of apply, just without having to unpack the dictionary ourselves, since your function signatures only take dictionaries (I like that choice).
[screenshot]
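Something like this minimal shim is what I have in mind (my sketch, names illustrative, not the actual redframes source): assign already accepts callables that receive the frame and return a column, so the {name: func} dict can be splatted straight through.

```python
import pandas as pd

# Hedged sketch: mutate as a thin wrapper over DataFrame.assign.
# Each func receives the (intermediate) frame and returns a full column.
def mutate(df: pd.DataFrame, over: dict) -> pd.DataFrame:
    return df.assign(**over)

df = pd.DataFrame({"foo": [1, 2, 3]})
out = mutate(df, {"foo2": lambda d: d["foo"] * 2})
```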

Here is a very simplistic implementation of how I would imagine summarize/rollup to work. I would think that redframes would want to provide some convenience class/functions like Mean for most of the common aggregates.

[screenshot]
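Roughly like this (a hypothetical sketch, not the redframes implementation): each aggregate function receives the whole frame and returns a scalar, and the dict keys become the output column names.

```python
import pandas as pd

# Hypothetical rollup: func(df) -> scalar; keys name the result columns.
def summarize(df: pd.DataFrame, over: dict) -> pd.DataFrame:
    return pd.DataFrame({name: [func(df)] for name, func in over.items()})

df = pd.DataFrame({"foo": [1, 2, 3]})
out = summarize(df, {
    "foo_mean": lambda d: d["foo"].mean(),
    "foo_sum": lambda d: d["foo"].sum(),
})
```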

filter is already using a signature that would match up with these. It just doesn't assign names to the function, but that makes sense for a filter.

Playing around with assign this morning (great tip!)

It works:

[screenshot]

But breaks something like this:

[screenshot]

Honestly, I'm hesitant about adopting vectorized operations, because they don't appear in regular/analog Python...

I always wanted mutate to feel like this (like map):

[screenshot]

Curious if you see a way around the break?

Your ideas about summarize/rollup are certainly interesting. I actually quite like:

{
    "max_bill_length_mm": lambda df: rf.stat.max(df["bill_length_mm"])
}

But I think bare functions like len might introduce confusion...

And, right now, I'm pretty opposed to something like Mean("body_mass_g") as it can't stand alone as "valid Python"...

Just have to figure out how to make your idea of creating an entirely new DataFrame work with "group/by"~
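One hedged way I could imagine frame-level aggregates working with group/by (a hypothetical helper, not redframes code): apply each function to every group's sub-frame.

```python
import pandas as pd

# Hypothetical group-aware rollup: without "by", aggregate the whole frame;
# with "by", apply each func to every group's sub-frame.
def rollup(df, over, by=None):
    if by is None:
        return pd.DataFrame({k: [f(df)] for k, f in over.items()})
    groups = df.groupby(by)
    return pd.DataFrame({k: groups.apply(f) for k, f in over.items()}).reset_index()

df = pd.DataFrame({"species": ["a", "a", "b"], "x": [1, 2, 3]})
out = rollup(df, {"x_max": lambda d: d["x"].max()}, by="species")
```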

A potential issue with:

{
    "max_bill_length_mm": lambda df: rf.stat.max(df["bill_length_mm"])
}

If we expose the entire "DataFrame", what's preventing someone from doing something crazy like this:

{
    "sillyness": lambda df: np.min(df["foo"]) + np.min(df["bar"])
}

This is how I would do percentage total mutate as it currently stands:

import redframes as rf

df = rf.DataFrame({"foo": [1, 2, 3]})

(
    df.mutate({
        "total": lambda _: sum(df["foo"]),
        "percent": lambda row: row["foo"] / row["total"]
    })
)
foo  total   percent
  1      6  0.166667
  2      6  0.333333
  3      6  0.5

Certainly not perfect, but not horrific, either...

First, I want to state that I don't think that you need to make any of these changes and I'm mostly just brainstorming ways that I could see it being more flexible and consistent. I won't have any hurt feelings if you decide that any of this doesn't fit in with your vision.

But breaks something like this:

I think this is okay behavior. Like many SQL dialects and assign in pandas, you cannot refer to columns created in the same "query" as the current expression. Someone could chain mutate if they needed to do that. I'm not sure if pandas does it under the hood, but I assume this restriction is beneficial to other backends like modin and dask so that they can run each expression in parallel.
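In pandas terms (shown for illustration), chaining two assign calls lets the second expression see the column the first one created:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3]})
out = (
    df.assign(total=lambda d: d["foo"].sum())        # scalar broadcasts to every row
      .assign(percent=lambda d: d["foo"] / d["total"])  # "total" visible here
)
```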

Honestly, I'm hesitant about adopting vectorized operations, because they don't appear in regular/analog Python...

I always wanted mutate to feel like this (like map):

I am a fan of functional programming so I do like the idea of map-like behavior. For a library like this, however, vectorized operations are going to allow users to scale the furthest. They are also already a core part of the python data ecosystem. The reality is that a fluent interface to manipulate data in a pipeline is already beyond typical idiomatic python. Map/reduce like behavior is technically part of python, but is also not considered idiomatic. The idiomatic way would be a bunch of comprehensions that are assigned to intermediate variables. I think that what you are already working toward is the best known approach for this sort of API.

But I think bare functions like len might introduce confusion...

And, right now, I'm pretty opposed to something like Mean("body_mass_g") as it can't stand alone as "valid Python"...

I should have been clear that I just put len in there as an example of what could work, but I would hope that there would actually just be a wrapper around it so that it was count instead.

I can understand the objection to Mean("body_mass_g"), but hear me out for one more moment. This might be too magic for you, but you could have a function/class that is able to accommodate both usages.
[screenshot]

This is a quick and dirty way to implement it, but you should be able to create a wrapper function that could be used as a decorator on any aggregate function you'd want to behave that way.
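As a quick illustration of the "dual-use" idea (entirely hypothetical, not redframes API): the aggregate can be bound to a column name up front, or called bare on a sequence of values.

```python
# Hypothetical dual-use aggregate: Mean("col") binds to a column and later
# receives the whole frame; Mean() is called directly on a column of values.
class Mean:
    def __init__(self, column=None):
        self.column = column

    def __call__(self, data):
        # When bound, pull the named column out of the frame-like mapping first
        values = data[self.column] if self.column is not None else data
        return sum(values) / len(values)

frame = {"body_mass_g": [10.0, 20.0, 30.0]}
Mean("body_mass_g")(frame)    # bound form: receives the whole frame
Mean()([10.0, 20.0, 30.0])    # bare form: receives a column directly
```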

If we expose the entire "DataFrame", what's preventing someone from doing something crazy like this:

Again, I don't actually see this as a problem. It is silly, but to me the contract of summarize is that you give it a dataframe and it returns a single value. It would be important to ensure that the dataframe being passed is a view so that the aggregate functions can't modify the underlying dataframe, but that is already the case for mutate/assign and filter. One example of where this flexibility could actually be a benefit is calculating the "mean_intensity" in this beer dataset. FYI, you can do this no problem in dplyr.
[screenshot]

This is how I would do percentage total mutate as it currently stands:

import redframes as rf

df = rf.DataFrame({"foo": [1, 2, 3]})

(
    df.mutate({
        "total": lambda _: sum(df["foo"]),
        "percent": lambda row: row["foo"] / row["total"]
    })
)

The problem with this approach is that most of the time that I would actually want this, I would want to know the percent of total over groups. Currently, that would mean that I would have to assign an intermediate value after counting the groups instead of being able to perform all operations in a chain.
[screenshot]

In pandas, I could do this to get the same results:

(
    penguins
    .groupby("species", as_index=False)
    .agg(**{"count": ("species", len)})
    .assign(**{"pct_count": lambda df: df["count"] / sum(df["count"])})
)

Haven't forgotten about you!

Just published 1.1 with rollup and __dataframe__ support.

I'm still thinking about/working on a compromise for mutate, stay tuned, and thanks for all of your feedback and interest so far!

Just closing this issue because there's nothing to resolve. I'm happy to keep discussing/brainstorming, though. I like the latest changes and look forward to seeing how this library evolves.

I did a little POC experiment brainstorming additional dataframe APIs. This one would be a little bit like scikit-learn's pipelines or a more powerful version of the pandas pipe. It does help with being extendable since it doesn't require attaching methods to the main data object. It also plays pretty nicely with black formatting and doesn't require wrapping additional parenthesis for long chains. Mostly, just fun thinking of API ideas. https://github.com/jcmkk3/random-experiments/blob/main/tablular_pipelines.ipynb
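A toy version of that pipeline idea might look like this (my own sketch, not the linked notebook): each verb is a function that returns a DataFrame-to-DataFrame callable, so new verbs don't need to be attached as methods on the data object.

```python
import pandas as pd

# Hypothetical pipeline-style API: verbs are higher-order functions, and the
# pipeline just threads the frame through each step in order.
def mutate(over):
    return lambda df: df.assign(**over)

def take(n):
    return lambda df: df.head(n)

def pipeline(df, *steps):
    for step in steps:
        df = step(df)
    return df

df = pd.DataFrame({"foo": [1, 2, 3]})
out = pipeline(
    df,
    mutate({"foo2": lambda d: d["foo"] * 2}),
    take(2),
)
```

This also sidesteps the parenthesis-wrapping that long method chains need, since the steps are just arguments to one call.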

We've been discussing this library over at PRQL — we really like it!

OTOH, something that stood out to me was the pandas comparison: pandas does allow for a fluent, method-chaining approach, but the comparison shows lots of assigning to df. Given that folks who see this will often be experienced with pandas, showing the best possible pandas comparison would demonstrate a depth of understanding, as well as being more useful.