plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals


Collect success stories!

emeryberger opened this issue

Inspired by this issue on a different project, I'd love to hear stories from people who have successfully used Scalene. Did you use it to fix a performance problem, excessive memory consumption, or a leak? (Or something else?) What kind of performance problem? How did Scalene help? Your stories will help guide the development of new features, and also brighten my day!

Hi @emeryberger. I've just used Scalene in my private three-morning Higher Performance Python tutorial, which I ran with a hedge fund; I run this course both privately and as a "public" course for open enrolment several times a year. I added Scalene to the most recent iteration, plan to use it more in forthcoming sessions, and am also likely to make use of it in upcoming public talks. The students last week were very interested in using Scalene further. They use Pandas and NumPy extensively.

For the most recent session we used it to identify how Pandas uses more memory than expected in 2 simple situations:

  • doing a groupby on a large dataframe, taking a mean, then retrieving just one desired group's result (compared to querying for the desired group first and taking the mean of only that group, this second approach is faster and uses less RAM; see the sketch after this list) - this demonstrates "always try to do the least necessary work, both for saving RAM and for saving CPU time"
  • building a large dataframe using concat on component dataframes and realising that concat defaults to copy=True, so you double your RAM usage. This behaviour is documented as "correct" but often comes as a surprise to students (and Pandas has a lot of these surprising situations): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
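For concreteness, here is a minimal sketch of both points (synthetic data, not the actual course material; exact numbers vary by machine):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"group": rng.integers(0, 1_000, size=5_000_000),
                   "value": rng.random(5_000_000)})

# Approach 1: aggregate all 1,000 groups, then keep just one result.
mean_all = df.groupby("group")["value"].mean().loc[42]

# Approach 2: filter down to the one group first, then aggregate only it.
mean_one = df.loc[df["group"] == 42, "value"].mean()

assert np.isclose(mean_all, mean_one)  # same answer, smaller footprint

# The concat point: pd.concat copies its inputs by default, so peak RAM is
# roughly double the combined size of the parts being concatenated.
doubled = pd.concat([df, df], ignore_index=True)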

From the first point I've filed a bug report for Pandas, as I'm sure the RAM used is excessive (pandas-dev/pandas#37139). The report uses memory_profiler, as that's been used on Pandas before, though I originally tracked the issue with both memory_profiler and Scalene. Once I'm more confident with Scalene I'd be happy to use it in bug reports. Just thinking out loud - if you reran the code I presented in that bug report using Scalene and added a follow-up comment as verification (perhaps with any observations Scalene makes possible), I suspect it would introduce more Pandas users to Scalene.

As a suggestion - if you collected some examples of how Scalene can track poor memory usage in Pandas and Scikit-Learn, I suspect you'd have some popular blog posts to share! Scikit-Learn, for example, will in a few places copy data when handing it off to native code if the dtype is the wrong size; I don't think memory_profiler helps much with that diagnosis, but Scalene should make it clear.
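A hypothetical illustration of that kind of silent copy (the estimator and array sizes are invented for the example; scikit-learn's exact conversion rules vary by estimator):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Integer features: roughly 80 MB that scikit-learn converts to float64
# internally, silently allocating a second full-size array.
X = np.random.randint(0, 100, size=(1_000_000, 10))
y = np.random.randint(0, 2, size=1_000_000)

# The hidden conversion happens inside fit(); a line-granular memory
# profiler like Scalene can attribute that allocation to this call.
model = LogisticRegression().fit(X, y)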

Many thanks for the tool!

I used Scalene a couple of weeks ago in the context of a machine learning homework (classification using gradient descent). I had to implement the following subgradient (square brackets denote Iverson brackets), where y_n is either -1 or 1, lambda_1, lambda_2 and b are scalars, and x_n and w are vectors:

$$g_{w_i} = \sum_{n=1}^{N} -\left[ y_n \left( x_n \cdot w + b \right) < 1 \right] y_n \, x_{n,i} + \lambda_1 \left( 1 - 2\left[ w_i < 0 \right] \right) + 2 \lambda_2 w_i$$

Being unfamiliar with Python and NumPy, I initially wrote the following code (NSFW for NumPy enthusiasts).

# subgrad starts as a zero vector of length n_features;
# X is (n_samples, n_features), y is (n_samples,).
for i in range(n_features):
    for n in range(n_samples):
        # Hinge-loss term, recomputing the dot product for every (i, n) pair
        subgrad[i] += (- y[n] * X[n][i]) if y[n] * (np.dot(X[n], w) + b) < 1 else 0
    # L1 and L2 regularization terms
    subgrad[i] += self.lambda1 * (-1 if w[i] < 0 else 1) + 2 * self.lambda2 * w[i]

Obviously, this was really slow, with about 80 iterations per minute, when the goal was 10,000. I ran the program through Scalene, which output the following.

Scalene Output 1

The column headers are missing, but you can see that 98% of the time is spent in Python, and not in native code, which is the problem. With my (clearly lacking) NumPy skills, I improved it to the following, removing one of the loops.

for n in range(n_samples):
    # Accumulate the hinge-loss term for all features at once
    if y[n] * (np.dot(X[n], w) + b) < 1:
        subgrad += (- y[n] * X[n])
# w / np.abs(w) is the sign of w (the L1 subgradient term)
subgrad += self.lambda1 * (w / np.abs(w)) + 2 * self.lambda2 * w

This was much better, and allowed me to reach the desired goal of 10,000 iterations per minute. It also reduced the time spent in Python, as shown in the Scalene output below.

Scalene Output 2

I later learned this can be improved further, removing all loops!
Here is a possible improved version that was suggested to me (I could never have come up with this), along with its corresponding Scalene output.

# Boolean mask of the samples that violate the margin
index = y * (np.dot(X, w) + b) < 1
yi = y[index]
# Sum the per-sample contributions in one vectorized operation
# (the lambda1/lambda2 regularization terms are added as before)
subgrad = np.sum(- yi.reshape(yi.size, 1) * X[index,:], axis = 0)

Scalene Output 3

As you can see in this final version, very little time is spent in Python. Thank you Scalene!

I used scalene to optimize a long-running script in https://github.com/ConsenSys/code_merklization_of_traces. It was a great help for someone who only sporadically works with Python, much less with unfamiliar native libraries!
And I must say it was extremely welcome after the shocking realization of how poor the Python profiling landscape of cProfile/snakeviz is...

I used Scalene to profile my library, Rich, which can print tables in the terminal. A user reported that very large tables (10,000 rows) were slow. There were no algorithmic improvements I could see from reading the code, which led me to consider profiling.

Running Scalene on a script that prints a large table highlighted two lines that were taking far more time than I would have expected. The first was an isinstance call that uses a runtime-checkable protocol. The second was copying a dataclass. Neither operation is particularly slow on its own; they were just repeated once per cell in the table (i.e., 80,000 calls).

Optimizing those was fairly trivial and resulted in a 45% improvement in speed. I'd say that was a success.
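For anyone curious why the first of those lines is costly, here's a hypothetical micro-benchmark (not Rich's actual code): isinstance against a @runtime_checkable Protocol has to probe the object's attributes at call time, so it is much slower than a check against a concrete class.

from typing import Protocol, runtime_checkable
import timeit

@runtime_checkable
class SupportsRender(Protocol):
    def render(self) -> str: ...

class Cell:
    def render(self) -> str:
        return "cell"

cell = Cell()

# The protocol check typically runs an order of magnitude slower.
t_protocol = timeit.timeit(lambda: isinstance(cell, SupportsRender), number=100_000)
t_concrete = timeit.timeit(lambda: isinstance(cell, Cell), number=100_000)
print(f"protocol: {t_protocol:.3f}s  concrete: {t_concrete:.3f}s")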

Hey @emeryberger, hope you're doing well. Just used Scalene today to profile some pandas code. Chained indexing into multi-level column names apparently caused a bunch of copying, which Scalene easily found for me!

@donald-pinckney nice! Can you share the code / the fix? Thanks!

@donald-pinckney agreed a code example (or even just more detail) about what you were doing would be helpful to recreate useful examples.

import pandas as pd
import numpy as np
import timeit

# Code to setup example df, to recreate the approximate df structure of my code
column_names_example = list(range(10000))
index = pd.MultiIndex.from_tuples([("left", c) for c in column_names_example] + [("right", c) for c in column_names_example])
df = pd.DataFrame(np.random.rand(1000, 20000), columns=index)

# We also define a function which takes as input two columns from the dataframe
# It does whatever logic and returns a bool
def keep_column(left_col, right_col):
    # Shouldn't really matter what this is, but I had some indexing operations.
    # So here is some random indexing operations.
    # Note that this only reads from the columns, no writing is done
    return left_col[left_col.first_valid_index()] > right_col[right_col.last_valid_index()]

# Ok, finally we have the performance bug:
timeit.timeit(lambda: [c for c in column_names_example if keep_column(df["left"][c], df["right"][c])], number=10) / 10
# > gives average 14.68 seconds on my machine

After using Scalene I quickly found the bad line of code, and my simple fix was to lift the indexing out of the loop:

df_l = df["left"]
df_r = df["right"]
timeit.timeit(lambda: [c for c in column_names_example if keep_column(df_l[c], df_r[c])], number=10) / 10
# > gives average 0.81 seconds on my machine

Obviously lifting a constant out of the loop can help, but I doubt that alone would make such a large difference. Probably there is some underlying copying going on that only occurs with the double indexing? So I googled a bit and found this in the documentation:

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation dfmi_with_one['second'] selects the series indexed by 'second'. This is indicated by the variable dfmi_with_one because pandas sees these operations as separate events. e.g. separate calls to __getitem__, so it has to treat them as linear operations, they happen one after another.

Contrast this to df.loc[:,('one','second')] which passes a nested tuple of (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore this order of operations can be significantly faster, and allows one to index both axes if so desired.

So there is some documentation that doing this double-indexing on hierarchical columns is bad, but that documentation doesn't really help explain why lifting it out of the loop matters so much.

But if we take the suggestion of the documentation, we get another potential fix:

timeit.timeit(lambda: [c for c in column_names_example if keep_column(df.loc[:, ("left", c)], df.loc[:, ("right", c)])], number=10) / 10
# > gives average 3.59 seconds on my machine

This is an improvement over the original code, but it's still ~4x slower than my first fix.

Probably there is an easy and efficient way to do this without involving Python loops, just using numpy / pandas, but for a pandas beginner like me, the performance issues that popped up here were pretty subtle. Scalene was great at helping me find the slow line of code!
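For what it's worth, here is a hedged sketch of one fully vectorized variant. It assumes, as in the synthetic example above, that the columns contain no NaNs, so the first/last valid index is simply the first/last row; with missing data the logic would need adjusting.

# Compare the first row of every "left" column against the last row of the
# matching "right" column in two whole-frame operations, with no Python loop.
mask = df["left"].iloc[0] > df["right"].iloc[-1]
kept_columns = mask.index[mask].tolist()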

Thanks for that lib!

I figured out with Scalene that loading the same schema.json file from disk N (= 10,000+) times when instantiating a class, and spending 85% of my code's time in that json.load() call, maybe wasn't a great idea 🤣
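A hypothetical sketch of the obvious fix (the names here are made up): cache the parsed schema so the disk read and json.load() happen once rather than once per instance.

import json
from functools import lru_cache

@lru_cache(maxsize=None)
def load_schema(path="schema.json"):
    # Reads and parses only on the first call; later calls hit the cache.
    with open(path) as f:
        return json.load(f)

class Record:
    def __init__(self):
        self.schema = load_schema()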

We've started using scalene over at Semantic Scholar (www.semanticscholar.org) as part of our tool suite for operationalizing machine learning models. Recently we found that one of our models was cost-prohibitive, putting an entire product direction in jeopardy. We generated a set of test data and ran our models with Scalene mounted; the HTML output pinpointed our squeakiest wheels and helped us validate that our changes were having an impact. The process was iterative, precise, and repeatable. In the end, we were able to reduce costs by a staggering 92%.

With these models, there is also always the question of whether things would be more cost effective running inference services on GPUs rather than CPU. Scalene allowed us to quickly ascertain what fraction of our runtime would benefit from the hardware acceleration, and what CPU-bound code we'd need to pare down to achieve our goals.

Thanks for this tool :)

I used scalene to help investigate odd performance of StringIO.write(): "The first 99,999 writes are 'free': or, why lazy StringIO.write() may sprint into a memmove wall."
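In a similar spirit, a rough illustrative micro-benchmark (not the post's methodology) that flags individual StringIO.write() calls whose cost suddenly spikes:

import io
import time

buf = io.StringIO()
chunk = "x" * 100
for n in range(1, 200_001):
    t0 = time.perf_counter_ns()
    buf.write(chunk)
    elapsed = time.perf_counter_ns() - t0
    if elapsed > 100_000:  # flag writes slower than 0.1 ms
        print(f"write #{n}: {elapsed} ns")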

The new GUI reporting interface is slick.

Used scalene for a hobby/learning project; I was reading a CSV file into a dict every time I called a particular function. Fixing that took my app from 9 minutes to 2 minutes.
Great job!

Used scalene to drop the runtime of a scientific tool we were using from 16 hours to 8 minutes with certain inputs 😄

malonge/RagTag#178