Understanding time spent in pandoc step

lcolladotor opened this issue 10 years ago · comments

Leonardo Collado-Torres commented 10 years ago

Hello,

I'm not sure whether to ask this here or at https://github.com/rstudio/rmarkdown. Anyhow, using rmarkdown and knitrBootstrap I generated an HTML file that includes among other things around 110 plots and a datatable output with 2000 rows made with rCharts.

When generating the report, I basically take the Sys.time() just before creating the report, and then again at the end of the Rmd file, to then calculate the difference. This is how I know how much time was spent running the R code inside the report. Now, because I was running this in a cluster, I get an email with information as shown below:

 Start Time       = 04/23/2014 17:05:33
 End Time         = 04/23/2014 22:02:23
 User Time        = 01:11:27
 System Time      = 00:02:30
 Wallclock Time   = 04:56:50
 CPU              = 01:13:58

From this, I can see that loading the data needed for the report, running the code in the Rmd file, and then converting the md output to HTML with pandoc took nearly 5 hours. Yet, the time running the R code took ~52 min and was completed at 2014-04-23 18:20:11 EDT. So, around 23 minutes were spent loading the data and 3 hrs 40 min on the pandoc step.
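The breakdown above can be re-derived from the reported timestamps. The following is only a sketch in base R: the variable names are made up for illustration, the 52 minutes is the approximate figure quoted above, and the time zone is set to UTC because only the differences matter.

```r
# Timestamps copied from the cluster email and the report above.
# tz = "UTC" is an arbitrary choice: only the differences are used.
job_start <- as.POSIXct("2014-04-23 17:05:33", tz = "UTC")
job_end   <- as.POSIXct("2014-04-23 22:02:23", tz = "UTC")
r_done    <- as.POSIXct("2014-04-23 18:20:11", tz = "UTC")
r_minutes <- 52  # approximate time spent running the R code

# The R code started r_minutes before it finished.
r_start <- r_done - r_minutes * 60

# Minutes spent loading data (job start until the R code started).
load_min <- as.numeric(difftime(r_start, job_start, units = "mins"))

# Minutes spent after the R code finished (the pandoc step).
pandoc_min <- as.numeric(difftime(job_end, r_done, units = "mins"))

round(c(load = load_min, r_code = r_minutes, pandoc = pandoc_min), 1)
```

This gives about 22.6 minutes for loading and about 222.2 minutes (3 h 42 min) for the pandoc step, and the three parts add up to the reported wallclock time of 04:56:50.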

I was surprised by this, as I would normally expect running the R code to be the slowest part. The resulting HTML output is ~46 MB. I know that the recipient hard disk on the cluster network was under heavy I/O load due to someone else's jobs, so maybe that increased the pandoc time, but I would still not expect it to take ~4x as long as the R code step.

Any clues? Is there anything I could do to speed things up? Well, maybe generating fewer plots, but that would go against the purpose of the report I'm generating.

I'll try re-running it later when the I/O load is lower on the disk, just to rule this out.

Thanks!

Details

The plots were generated using the dev="CairoPNG" knitr option.

Here is the R session info:

## R version 3.1.0 Patched (2014-04-23 r65467)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] rCharts_0.4.2                           
##  [2] derfinder_0.0.57                        
##  [3] mgcv_1.7-29                             
##  [4] nlme_3.1-117                            
##  [5] RColorBrewer_1.0-5                      
##  [6] xtable_1.7-3                            
##  [7] data.table_1.9.2                        
##  [8] gridExtra_0.9.1                         
##  [9] ggplot2_0.9.3.1                         
## [10] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
## [11] GenomicFeatures_1.16.0                  
## [12] AnnotationDbi_1.26.0                    
## [13] Biobase_2.24.0                          
## [14] GenomicRanges_1.16.2                    
## [15] GenomeInfoDb_1.0.2                      
## [16] IRanges_1.22.3                          
## [17] BiocGenerics_0.10.0                     
## [18] biovizBase_1.12.0                       
## [19] derfinderReport_0.0.16                  
## 
## loaded via a namespace (and not attached):
##  [1] BatchJobs_1.2            BBmisc_1.5              
##  [3] bibtex_0.3-6             BiocParallel_0.6.0      
##  [5] biomaRt_2.20.0           Biostrings_2.32.0       
##  [7] bitops_1.0-6             brew_1.0-6              
##  [9] BSgenome_1.32.0          bumphunter_1.4.0        
## [11] Cairo_1.5-5              cluster_1.15.2          
## [13] codetools_0.2-8          colorspace_1.2-4        
## [15] DBI_0.2-7                dichromat_2.0-0         
## [17] digest_0.6.4             doRNG_1.6               
## [19] evaluate_0.5.3           fail_1.2                
## [21] foreach_1.4.2            formatR_0.10            
## [23] Formula_1.1-1            GenomicAlignments_1.0.0 
## [25] ggbio_1.12.0             gtable_0.1.2            
## [27] Hmisc_3.14-4             httr_0.3                
## [29] iterators_1.0.7          knitcitations_0.5-0     
## [31] knitr_1.5.31             knitrBootstrap_1.0.0    
## [33] labeling_0.2             lattice_0.20-29         
## [35] latticeExtra_0.6-26      locfit_1.5-9.1          
## [37] markdown_0.6.5           MASS_7.3-31             
## [39] Matrix_1.1-3             matrixStats_0.8.14      
## [41] munsell_0.4.2            pkgmaker_0.20           
## [43] plyr_1.8.1               proto_0.3-10            
## [45] qvalue_1.38.0            Rcpp_0.11.1             
## [47] RCurl_1.95-4.1           registry_0.2            
## [49] reshape2_1.2.2           RJSONIO_1.0-3           
## [51] rmarkdown_0.1.84         R.methodsS3_1.6.1       
## [53] rngtools_1.2.4           Rsamtools_1.16.0        
## [55] RSQLite_0.11.4           rtracklayer_1.24.0      
## [57] scales_0.2.4             sendmailR_1.1-2         
## [59] splines_3.1.0            stats4_3.1.0            
## [61] stringr_0.6.2            survival_2.37-7         
## [63] tcltk_3.1.0              tools_3.1.0             
## [65] VariantAnnotation_1.10.0 whisker_0.3-2           
## [67] XML_3.98-1.1             XVector_0.4.0           
## [69] yaml_2.1.11              zlibbioc_1.10.0

Pandoc info

$ pandoc --version
pandoc 1.12.3.1
Compiled with texmath 0.6.6, highlighting-kate 0.5.6.
Syntax highlighting is supported for the following languages:
    actionscript, ada, apache, asn1, asp, awk, bash, bibtex, boo, c, changelog,
    clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css, curry, d,
    diff, djangotemplate, doxygen, doxygenlua, dtd, eiffel, email, erlang,
    fortran, fsharp, gnuassembler, go, haskell, haxe, html, ini, java, javadoc,
    javascript, json, jsp, julia, latex, lex, literatecurry, literatehaskell,
    lua, makefile, mandoc, markdown, matlab, maxima, metafont, mips, modelines,
    modula2, modula3, monobasic, nasm, noweb, objectivec, objectivecpp, ocaml,
    octave, pascal, perl, php, pike, postscript, prolog, python, r,
    relaxngcompact, restructuredtext, rhtml, roff, ruby, rust, scala, scheme,
    sci, sed, sgml, sql, sqlmysql, sqlpostgresql, tcl, texinfo, verilog, vhdl,
    xml, xorg, xslt, xul, yacc, yaml
Default user data directory: /home/bst/student/lcollado/.pandoc
Copyright (C) 2006-2013 John MacFarlane
Web:  http://johnmacfarlane.net/pandoc
This is free software; see the source for copying conditions.  There is no
warranty, not even for merchantability or fitness for a particular purpose.
$ pandoc-citeproc --version
pandoc-citeproc 0.3.0.1

The actual pandoc call

/jhpce/shared/jhpce/core/JHPCE_tools/1.0/bin/pandoc basicExploration.utf8.md --to html --from markdown-hard_line_breaks+superscript+tex_math_dollars+raw_html+markdown_in_html_blocks-implicit_figures --output basicExploration.html --filter /jhpce/shared/jhpce/core/JHPCE_tools/1.0/bin/pandoc-citeproc -H /scratch/temp/660049.1.shared.q/RtmpgjQsCg/knitr_bootstrap_full.html

Call to derfinderReport::generateReport(). This is a package I'm making at https://github.com/lcolladotor/derfinderReport

## generateReport(prefix = "run2-v0.0.46", browse = FALSE, makeBestClusters = FALSE, 
##     nBestClusters = 20, fullCov = fullCov, device = "CairoPNG")

jjallaire · Answer 1 · Thu Apr 24 2014 18:31:47 GMT+0800 (China Standard Time)

We just discovered that for very large output files pandoc was spending a lot of time processing citations. As a result, we just made a change to only call the citation filter if a bibliography field is in the YAML. Assuming you don't have a bibliography, this might make a big difference in performance in your scenario. Try installing the latest rmarkdown from GitHub and see if that makes things better.
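To illustrate the idea (this is not the actual rmarkdown code), here is a minimal base R sketch that adds the citation filter to the pandoc arguments only when the YAML front matter has a `bibliography` field. The helper names `has_bibliography()` and `pandoc_args()` are hypothetical, and the front matter detection is simplified: it assumes the block opens with `---` on the first line and closes with another `---`.

```r
# Hypothetical helper: does the YAML front matter declare a bibliography?
has_bibliography <- function(lines) {
  delims <- grep("^---\\s*$", lines)
  # Front matter must open on line 1 and be closed by a second '---'.
  if (length(delims) < 2 || delims[1] != 1) return(FALSE)
  front <- lines[seq_len(delims[2] - 1)[-1]]  # lines between the delimiters
  any(grepl("^bibliography\\s*:", front))
}

# Hypothetical helper: build pandoc arguments, adding the citation
# filter only when it is actually needed.
pandoc_args <- function(lines, citeproc = "pandoc-citeproc") {
  args <- c("--to", "html")
  if (has_bibliography(lines)) args <- c(args, "--filter", citeproc)
  args
}

with_bib    <- c("---", "title: Report", "bibliography: refs.bib", "---", "Text")
without_bib <- c("---", "title: Report", "---", "Text")

pandoc_args(with_bib)     # includes "--filter" "pandoc-citeproc"
pandoc_args(without_bib)  # only "--to" "html"
```

A real implementation would parse the YAML properly rather than matching lines with a regular expression.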

Jim Hester · Answer 2 · Thu Apr 24 2014 20:13:26 GMT+0800 (China Standard Time)

Leonardo,

Let me know if jj's rmarkdown change fixed the issue and I will close this issue.

Also thank you for using knitrBootstrap for such a large project, let me know if there are any other issues you run into.

Leonardo Collado-Torres · Answer 3 · Sun Apr 27 2014 02:13:06 GMT+0800 (China Standard Time)

Hello,

I wanted to check if it wasn't the cluster being slow before re-running with the new rmarkdown version. The results are below.

Re-run with same versions

## Start job (begins loading data)
Thu Apr 24 16:52:12 EDT 2014

## Time reported in the report as to when it finished running the R code
2014-04-24 17:14:25 EDT

## Time elapsed in running report R code
## 18.37 mins

## End job
Thu Apr 24 17:36:41 EDT 2014

So the pandoc step took approximately 22 minutes to run. Overall, this is a huge difference from the time I mentioned earlier. So it does look like the cluster was, for some reason, very slow the first time.

I'm particularly puzzled by the difference in time spent running the R code for the report... hmm. Well, my only clue is that the cluster I/O was so slow that the main time difference comes from writing the supporting files before running pandoc.
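One way to check a hunch like this would be to time each stage separately instead of only the whole report. The sketch below uses only base R; `time_step()` is a hypothetical helper and the `Sys.sleep()` calls are stand-ins for the real stages (loading data, running the R code, writing supporting files).

```r
# Hypothetical helper: run an expression, report how long it took,
# and return the elapsed seconds invisibly.
time_step <- function(label, expr) {
  start <- Sys.time()
  force(expr)  # arguments are lazy, so the expression is evaluated here
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  message(sprintf("%s: %.2f secs", label, elapsed))
  invisible(elapsed)
}

# Stand-ins for the real stages; replace Sys.sleep() with actual work.
timings <- c(
  load   = time_step("load data", Sys.sleep(0.1)),
  r_code = time_step("run R code", Sys.sleep(0.2)),
  write  = time_step("write supporting files", Sys.sleep(0.1))
)
```

Comparing these per-stage numbers between a slow run and a fast run would show whether the extra time went to computation or to writing files on the network disk.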

Pandoc call

/jhpce/shared/jhpce/core/JHPCE_tools/1.0/bin/pandoc basicExploration.utf8.md --to html --from markdown-hard_line_breaks+superscript+tex_math_dollars+raw_html+markdown_in_html_blocks-implicit_figures --output basicExploration.html --filter /jhpce/shared/jhpce/core/JHPCE_tools/1.0/bin/pandoc-citeproc -H /scratch/temp/701228.1.shared.q/Rtmpwy0Ggy/knitr_bootstrap_full.html

Package versions

knitr 1.5.31
rmarkdown 0.1.84

Cluster report details

 Start Time       = 04/24/2014 16:52:12
 End Time         = 04/24/2014 17:36:41
 User Time        = 00:28:49
 System Time      = 00:00:31
 Wallclock Time   = 00:44:29
 CPU              = 00:29:20

Re-run with new `rmarkdown`

## Start job (begins loading data)
Sat Apr 26 13:10:25 EDT 2014

## Time reported in the report as to when it finished running the R code
2014-04-26 13:31:54 EDT

## Time elapsed in running report R code
## 18.11 mins

## End job
Sat Apr 26 13:53:48 EDT 2014

So around 22 minutes in the pandoc step.

This time around, running the R code is basically the same as the re-run with the older versions. That's good. I can also see that the pandoc call changed. However, it took basically the same time to run the pandoc step. Note that there is no bibliography in the YAML as I use the knitcitations package.

So, it seems to me that the changes in rmarkdown didn't really impact this scenario and that, overall, it was the cluster I/O speed that greatly affected the results. That is, the network disk where the data is hosted and the report is generated was under heavy load from other users, and that led to the original run time of nearly 5 hours.

Thus, if the cluster I/O speed is very slow, it might be best to copy files to the $TMPDIR in the cluster node. But well, this scenario will be rare.
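A rough sketch of that workaround in base R follows. The function `stage_and_run()` and its arguments are hypothetical, and it assumes the scheduler exports `TMPDIR` pointing at node-local scratch space (falling back to R's session temporary directory otherwise). The `run` argument stands in for whatever actually renders the report.

```r
# Hypothetical helper: copy the input to node-local scratch, do the work
# there, then copy the result back next to the original input.
stage_and_run <- function(input, run, out_name) {
  scratch <- Sys.getenv("TMPDIR")
  if (!nzchar(scratch)) scratch <- tempdir()  # fallback when TMPDIR is unset

  work <- tempfile("report_", tmpdir = scratch)
  dir.create(work, recursive = TRUE)
  on.exit(unlink(work, recursive = TRUE), add = TRUE)  # clean up scratch

  local_in <- file.path(work, basename(input))
  stopifnot(file.copy(input, local_in))

  local_out <- file.path(work, out_name)
  run(local_in, local_out)  # all heavy I/O happens on the local disk

  final <- file.path(dirname(input), out_name)
  stopifnot(file.copy(local_out, final, overwrite = TRUE))
  final
}

# Demo with a stand-in for the rendering step.
demo_dir <- tempfile("network_disk_")
dir.create(demo_dir)
demo_in <- file.path(demo_dir, "report.md")
writeLines("hello report", demo_in)

fake_render <- function(src, dest) writeLines(toupper(readLines(src)), dest)
result <- stage_and_run(demo_in, fake_render, "report.html")
```

Only the final copy back touches the shared disk, so a slow network file system affects two file copies rather than every intermediate read and write.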

Pandoc call

/jhpce/shared/jhpce/core/JHPCE_tools/1.0/bin/pandoc basicExploration.utf8.md --to html --from markdown-hard_line_breaks+superscript+tex_math_dollars+raw_html+markdown_in_html_blocks-implicit_figures --output basicExploration.html -H /scratch/temp/708427.1.shared.q/RtmpkPRS4C/knitr_bootstrap_full.html

Package versions

knitr 1.5.32
rmarkdown 0.1.90

Cluster report details

 Start Time       = 04/26/2014 13:10:25
 End Time         = 04/26/2014 13:53:48
 User Time        = 00:27:34
 System Time      = 00:00:29
 Wallclock Time   = 00:43:23
 CPU              = 00:28:03

Thanks for the help!
