lweasel / piquant

A pipeline to assess the quantification of transcripts.

Home Page: http://piquant.readthedocs.org/en/latest/


Feature Request: Record & report usage statistics for (pre-)quantification steps

rob-p opened this issue

I'm really liking piquant! It's making the synthetic benchmarking for our new manuscript so much easier than it would otherwise be. The wealth of statistics and plots it produces already allows me to explore almost all of the variables I care about. However, one feature that would be incredibly useful is the ability to record and report the resource usage of the quantification and pre-quantification steps. For example, recording the memory usage and runtime required by each tool, and providing some sort of simple visualization of this, would be very slick.

Obviously, since I'm focusing on methods that are AFAP (as fast as possible --- subject to being highly-accurate), this is a somewhat selfish feature request ;P. However, I do think it would be generally useful to have, as it also lets us explore how e.g. the memory and time resources required by a method vary with the size of the input dataset.

Hi Rob - I'm really pleased that you're finding the tool useful!

I don't think this request should be too hard to implement. What I'm thinking of doing is using the GNU time command (unless there is a more appropriate tool?) to record the time taken and maximum memory usage of each command executed during quantification into a CSV file. Those numbers could then be assembled and plotted across parameter sets in much the same way as any of the other statistics, e.g. you could see the time taken or memory used by Salmon vs. Sailfish plotted as a function of sequencing depth or read length etc. Does that sound like the kind of visualisation output you are looking for?
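(As a rough illustration of the idea, here is a minimal Python sketch of per-command usage recording. This is not piquant's actual implementation, which wraps commands with /usr/bin/time; it approximates the same measurements with the standard resource module, and the function name, command and CSV path are all made up for the example.)

```python
import csv
import resource
import subprocess
import time

def run_and_record(cmd, csv_path):
    """Run a command, appending wall/user/sys time (s) and max RSS to a CSV.

    ru_maxrss is in kilobytes on Linux (bytes on macOS), and for
    RUSAGE_CHILDREN covers the largest child seen so far.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.time()
    subprocess.run(cmd, check=True)
    wall = time.time() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([
            " ".join(cmd),
            round(wall, 3),
            round(after.ru_utime - before.ru_utime, 3),  # user mode time
            round(after.ru_stime - before.ru_stime, 3),  # kernel mode time
            after.ru_maxrss,
        ])

# Illustrative stand-in for a real quantification command:
run_and_record(["sleep", "0.2"], "usage.csv")
```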

(For pre-quantification, given that this is only executed once per quantifier for a particular set of input transcripts, I would probably just print the time and memory usage to stdout from the run_quantification.sh script?).

Ideally, for quantification methods that involve multiple steps, it would be nice to split the time shown in the graphs into the amounts taken by, e.g., mapping and then quantification itself, but I can see that being slightly more fiddly to deal with, so I would probably just plot the total time taken in the first instance.

Hi Owen,

Yes, this sounds like exactly what I was envisioning :). I think the general approach sounds good, though there are some subtleties (e.g. producing plots for user / sys / wall time so we can see not just how fast methods are, but how efficiently they use the available processing power).

For pre-quantification, I think it makes sense to do something simple (like what you suggest), though it would probably be best to redirect those timing / mem-usage results to a well-defined location and file so that they can be collected non-interactively.

Finally, I agree that for quant methods with multiple steps, it would be ideal to be able to look at e.g. alignment time vs. quant time etc. However, this seems like something that could be iterated upon after a first pass that quantifies total time. Anyway, let me know if there's any way I can be of help on this.

Hi Rob,

Unfortunately I haven't had time to extensively test it yet, but I wanted to get the code for a first pass at this in place.

By default, recording of time and memory usage is now on (though it can be turned off by specifying "--nousage" to the "prepare_quant_dirs" and "analyse_runs" commands), and it assumes that the GNU time command is available at /usr/bin/time. Four quantities are recorded:

  • total real time taken by (pre)quantification commands (log base 10 seconds)
  • total user mode time taken by (pre)quantification commands (log base 10 seconds)
  • total kernel mode time taken by (pre)quantification commands (log base 10 seconds)
  • maximum resident memory for any command during (pre)quantification (in gigabytes)

(though the particular quantities that are recorded can be changed by fiddling about with resource_usage.py).
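(For illustration, converting raw GNU time measurements into the four quantities listed above might look something like the sketch below. The function name and dictionary keys are assumptions made up for this example, not piquant's actual code; see resource_usage.py for the real definitions. GNU time's format specifiers %e/%U/%S report real/user/sys time in seconds and %M reports maximum resident set size in kilobytes.)

```python
import math

def to_recorded_quantities(real_s, user_s, sys_s, max_rss_kb):
    """Map raw per-run totals to the recorded quantities:
    log10 seconds for the three times, gigabytes for memory."""
    return {
        "real-time": math.log10(real_s),
        "user-time": math.log10(user_s),
        "sys-time": math.log10(sys_s),
        "max-memory-gb": max_rss_kb / (1024 ** 2),  # KB -> GB
    }
```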

When the "analyse_runs" command is run, what you should get (in the same output directory as all the other stats) is:

  • a CSV file "overall_quant_usage.csv" recording time and memory usage for quantification commands for each run
  • a CSV file "overall_prequant_usage.csv" recording time and memory usage for prequantification commands for each quantifier
  • a sub-directory "resource_usage_graphs". This has the same structure of sub-directories below it as, e.g., "overall_transcript_stats_graphs", except at the bottom level the graphs are obviously of the time and memory quantities for particular sets of quantification runs.

It's not yet making any plot for the prequantification quantities, though this would be easy enough to do if it were desirable (e.g. a bar plot of real/user/sys time for each quantifier?).

Of course, let me know if you hit any issues with this, or if there are any improvements I can make!

Hi Owen,

Thank you so much for prioritizing this feature request! It sounds awesome and I can't wait to try it out. It also lands just in time. With the semester coming to an end, I'm finally getting a chance to turn my focus to the Salmon manuscript (which I'm considering writing in the open, if I can convince my co-author that it's a good idea), and this will automate my resource collection!

I'll be sure to test it out soon and let you know how it works. I think your idea for a bar plot for pre-quant resource usage is similar to what I had in mind. If you don't get to it in the next day or so, I'd be happy to take a swing and submit a pull-request if I'm successful ;P. Thanks again; I can't wait to try this out!

Best,
Rob

Hopefully not treading on toes, but I had a few spare hours so I've had a stab at adding a couple of graphs for the prequantification resource usage data. The "resource_usage_graphs" directory should now contain two extra graphs at the top level - "prequant_time_usage.pdf" is a bar plot comparing the real, user and kernel mode time taken by prequantification for each quantification method, and "prequant_memory_usage.pdf" is a bar plot comparing the maximum resident memory.
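(A grouped bar plot of this sort takes only a few lines of matplotlib; this is just a sketch of the general shape of such a graph, not piquant's plotting code, and the quantifier names and timings below are made-up placeholders.)

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical prequantification timings in seconds -- illustrative only.
quantifiers = ["Salmon", "Sailfish", "eXpress"]
times = {"real": [120.0, 150.0, 900.0],
         "user": [300.0, 280.0, 850.0],
         "sys": [40.0, 35.0, 60.0]}

width = 0.25
fig, ax = plt.subplots()
for i, (label, values) in enumerate(times.items()):
    # Offset each time category's bars around the quantifier's tick.
    positions = [x + (i - 1) * width for x in range(len(quantifiers))]
    ax.bar(positions, values, width, label=label)
ax.set_xticks(range(len(quantifiers)))
ax.set_xticklabels(quantifiers)
ax.set_ylabel("Time (s)")
ax.set_title("Prequantification time usage")
ax.legend()
fig.savefig("prequant_time_usage.pdf")
```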

Awesome! No toe-stepping at all (I've been busy with other things in Salmon). This feature sounds great; I'll try it out immediately.