willtownes / quminorm

Turning the quasi-UMI method into a bioconductor package

Regarding a timer for the quminorm function

jtheorell opened this issue · comments

Hi!
I'm using the quminorm function again in its latest form, and it seems the timer (or rather, the continuous progress report) that I added earlier is gone, probably removed when everything was parallelized. With my data the new version is very slow (I had to cancel it after 5 minutes for ~2000 cells), and in any case I find it hard to deal with processes that run under the hood for extended periods without any progress report. I therefore have two suggestions (unless you want to revert to the previous non-parallelized version, which takes 2 minutes on the same dataset):

  • Would it be possible to give an estimate of how long the process will take? That could be printed before entering the parallelized phase.
  • Would it be worth dividing the data into chunks of, say, 500 cells, and processing them linearly? If so, you could print a progress report between chunks.
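The chunked approach in the second suggestion could be sketched roughly as below. This is only an illustration of the idea, not quminorm's actual API: `process_in_chunks` and `FUN` are hypothetical names, with `FUN` standing in for whatever per-cell transformation is applied to each block of columns.

```r
# Sketch: process columns (cells) in chunks and report progress between
# chunks. `FUN` is a placeholder for the real per-chunk transformation.
process_in_chunks <- function(mat, FUN, chunk_size = 500) {
  n <- ncol(mat)
  starts <- seq(1, n, by = chunk_size)
  res <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    cols <- starts[i]:min(starts[i] + chunk_size - 1, n)
    res[[i]] <- FUN(mat[, cols, drop = FALSE])
    message("Processed cells ", max(cols), " of ", n)
  }
  do.call(cbind, res)
}
```

Each chunk could also be handed to a parallel backend internally, so the progress messages between chunks survive even in the parallel case.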

Best
J

Hi Jakob, nice to hear from you again. It's surprising that it is so slow. Could you describe your data matrix in more detail: is it a dense matrix or a sparse Matrix (and if so, which class: dgCMatrix, etc.)? What fraction of the entries are zero, and how many genes are there? Also, it shouldn't be doing parallel processing unless you explicitly set mc.cores to a value larger than 1 (the default). My apologies for removing your status bar; that was not intentional. I will try to figure out a way to put it back in, at least for the serial processing scenario.
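For reference, the diagnostics asked for above can be pulled with standard Matrix-package idioms; the `rsparsematrix` call here just fabricates a stand-in with roughly the dimensions and sparsity mentioned later in the thread:

```r
library(Matrix)  # sparse matrix classes such as dgCMatrix

# Stand-in for the real data: ~58565 genes x 2000 cells, ~94% zeros
m <- rsparsematrix(58565, 2000, density = 0.06)

class(m)                              # e.g. "dgCMatrix"
dim(m)                                # genes x cells
1 - nnzero(m) / prod(dim(m))          # fraction of zero entries
```

Using `nnzero()` avoids materializing a dense logical matrix, which `m == 0` would effectively do on a mostly-zero sparse matrix.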

Hi!
Thank you for your replies!
It should be a sparse matrix, as I use the CPM slot in a SingleCellExperiment, which on creation converts dense matrices to sparse ones. There are 58565 transcripts, and the fraction of zero entries is 94%. I did not set mc.cores to anything but one, so that should not be it! Interestingly, I observed the same phenomenon when I started working with this code: even converting from a for loop to an lapply slowed things down considerably, and it got even worse when I tried to run things with bplapply. That might have been due to the use of dense matrices, though, which the for loop seems to have no problem with.
What I had implemented before was to print "Column x of y processed" for each round, which becomes slightly tedious to look at, but it shows very clearly what is going on and is simple to implement.
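The per-column message described above is indeed trivial in a serial loop; a minimal sketch, where `transform_col` is a hypothetical stand-in for the actual per-cell computation:

```r
transform_col <- function(x) x  # identity placeholder for the real work

n <- 10
mat <- matrix(rpois(5 * n, lambda = 2), ncol = n)

for (j in seq_len(n)) {
  mat[, j] <- transform_col(mat[, j])
  message("Column ", j, " of ", n, " processed")
}
```

Since `message()` writes to stderr, the progress lines stay visible even when stdout is captured or redirected.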