plasma-umass / coz

Coz: Causal Profiling

Why the random speed-up choice?

amedeedaboville opened this issue

The coz paper states:

COZ’s profiler thread begins an experiment by selecting a line to virtually speed up, and a randomly-chosen percent speedup. Both parameters must be selected randomly; any systematic method of exploring lines or speedups could lead to systematic bias in profile results.
One might assume that COZ could exclude lines or virtual speedup amounts that have not shown a performance effect early in previous experiments, but prioritizing experiments based on past results would prevent COZ from identifying an important line if its performance only matters after some warmup period.

I'd like to understand your thinking on this. I only see two reasons to do this:

  1. Functions need to warm up. Doesn't coz already wait a little bit before starting experiments?

  2. Path dependence. Maybe speeding up a line a lot forces something else to get JIT-ed, or you fill up a queue and the program changes algorithms, and future results depend on past ones.

In this case, would a permutation of range(0,100,5) for the speedups also work?

For collecting data for a graph, I think randomly sampling from the range makes it take a lot more samples to collect all 21 data points.
I think this is the "coupon collector's problem", and setting aside the 50% chance of speedup=0, the other 20 increments should take 20*(1 + 1/2 + 1/3 + ... + 1/20) ≈ 72 samples to fill the range on average.

So combined with the fact that 50% of the time the speedup is 0, I think this means you need to run an experiment 144 times on average instead of 21 to collect a full graph?
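The arithmetic above can be checked with a short sketch. It assumes, as described in this issue, that half of all experiments use speedup 0 and the other half are spread uniformly over the 20 nonzero values; `expected_draws` is a hypothetical helper name, not part of coz. It counts only the draws needed to see every nonzero value, since the zero value will almost certainly have appeared by then.

```python
from fractions import Fraction


def expected_draws(n_nonzero: int = 20, p_zero: Fraction = Fraction(1, 2)) -> Fraction:
    """Expected number of experiments until every nonzero speedup has been seen.

    Assumption: each experiment picks speedup 0 with probability p_zero,
    otherwise one of n_nonzero values uniformly at random.
    """
    # Coupon collector: n * H_n draws if every draw were a nonzero value.
    harmonic = sum(Fraction(1, k) for k in range(1, n_nonzero + 1))
    # Only a (1 - p_zero) fraction of draws are nonzero, so scale up accordingly.
    return n_nonzero * harmonic / (1 - p_zero)


if __name__ == "__main__":
    print(float(expected_draws(p_zero=Fraction(0))))  # nonzero draws only
    print(float(expected_draws()))                    # including the zero-speedup draws
```

Under these assumptions the two values round to 72 and 144, matching the estimates in the post.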

Let me know if there's no way around this, or if it doesn't matter because, e.g., it already takes a number of experiment samples to get good information.

I've mostly been running coz with --fixed-line, but sometimes the way it doesn't get data for a particular point in the plot makes me want to run it with seq 0 5 100 | while read i; do coz run --fixed-speedup $i --fixed-line ...; done.

Should I just tweak my program to make more samples happen anyway?

Or is the random speedup only needed when a random line is picked, and would having coz use permute(range(0,100,5)) work fine when --fixed-line is selected?

Regards,
Amédée

At first glance this sounds like it would be okay, but you're right in suspecting that we need multiple experiments for a given line and speedup to get reliable data. That doesn't preclude cycling through a random permutation of speedups instead of choosing randomly though.
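A minimal sketch of what "cycling through a random permutation" could look like, for illustration only. This is not how coz selects speedups; `permuted_speedups` is a hypothetical name. It also gives speedup 0 the same weight as every other value, so the extra zero-speedup experiments described in the issue would have to be interleaved separately. The range ends at 101 so that 100 is included, giving 21 values.

```python
import random
from itertools import islice
from typing import Iterator, Optional


def permuted_speedups(step: int = 5, rng: Optional[random.Random] = None) -> Iterator[int]:
    """Yield speedups forever, reshuffling 0, step, ..., 100 before each pass."""
    rng = rng or random.Random()
    values = list(range(0, 101, step))
    while True:
        # A fresh shuffle per pass keeps the order unpredictable, while every
        # value is still visited exactly once per pass.
        rng.shuffle(values)
        yield from values


if __name__ == "__main__":
    gen = permuted_speedups(rng=random.Random(0))
    first_pass = list(islice(gen, 21))
    print(sorted(first_pass) == list(range(0, 101, 5)))
```

With this scheme each pass of 21 experiments covers every speedup once, so repeated passes give the multiple experiments per speedup mentioned above without the coupon-collector overhead.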

I question whether randomly selecting speedup amounts is the culprit here. Coz chooses lines at random, but that selection is from the distribution of where the program spends time. As a result, Coz tends to collect a ton of data for hot code, and almost nothing else. It would be useful to know if reducing this bias toward hot code (somewhat, not entirely) would resolve the issue, or if there's still a problem getting good coverage on speedup amounts.
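One way to picture "reducing this bias toward hot code somewhat, not entirely" is to flatten the sampling weights with an exponent. The sketch below is purely illustrative and is not coz's implementation: the `alpha` parameter, the function names, and the sample counts are all assumptions made up for this example.

```python
import random
from typing import Dict, Optional


def selection_probabilities(sample_counts: Dict[str, int], alpha: float = 1.0) -> Dict[str, float]:
    """Probability of picking each line, with weights proportional to count ** alpha.

    alpha = 1.0 samples in proportion to observed time (biased toward hot code);
    alpha = 0.0 samples all lines uniformly; values in between flatten the bias.
    Assumes at least one line has a positive weight.
    """
    weights = {line: count ** alpha for line, count in sample_counts.items()}
    total = sum(weights.values())
    return {line: w / total for line, w in weights.items()}


def pick_line(sample_counts: Dict[str, int], alpha: float = 1.0,
              rng: Optional[random.Random] = None) -> str:
    """Draw one line according to the flattened distribution."""
    rng = rng or random.Random()
    probs = selection_probabilities(sample_counts, alpha)
    lines = list(probs)
    return rng.choices(lines, weights=[probs[l] for l in lines], k=1)[0]


if __name__ == "__main__":
    counts = {"hot.c:10": 9900, "cold.c:20": 100}  # hypothetical sample counts
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, selection_probabilities(counts, alpha))
```

In this made-up example, the cold line is picked about 1% of the time at `alpha = 1.0` and roughly 9% of the time at `alpha = 0.5`, so hot code still dominates but colder lines get noticeably more experiments.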