Genrich: detecting sites of genomic enrichment
Table of Contents
- I/O files and options
- Filtering options
- ATAC-seq mode
- Peak-calling parameters
Genrich is a peak-caller for genomic enrichment assays (e.g. ChIP-seq, ATAC-seq). It analyzes alignment files generated following the assay and produces a file detailing peaks of significant enrichment.
sample.bam(alignment file, sorted by queryname)
Genrich(compiled as described below)
To produce a file listing regions of genomic enrichment:
$ ./Genrich -t sample.bam -o sample.narrowPeak -v
The software can be downloaded from GitHub.
To compile, run
make in the folder in which the software was downloaded. The executable
Genrich should be produced.
Usage: ./Genrich -t <file> -o <file> [optional arguments] Required arguments: -t <file> Input SAM/BAM file(s) for experimental sample(s) -o <file> Output peak file (in ENCODE narrowPeak format) Optional I/O arguments: -c <file> Input SAM/BAM file(s) for control sample(s) -f <file> Output bedgraph-ish file for p/q values -k <file> Output bedgraph-ish file for pileups and p-values -b <file> Output BED file for reads/fragments/intervals -R <file> Output file for PCR duplicates (only with -r) Filtering options: -r Remove PCR duplicates -e <arg> Comma-separated list of chromosomes to exclude -E <file> Input BED file(s) of genomic regions to exclude -m <int> Minimum MAPQ to keep an alignment (def. 0) -s <float> Keep sec alns with AS >= bestAS - <float> (def. 0) -y Keep unpaired alignments (def. false) -w <int> Keep unpaired alns, lengths changed to <int> -x Keep unpaired alns, lengths changed to paired avg Options for ATAC-seq: -j Use ATAC-seq mode (def. false) -d <int> Expand cut sites to <int> bp (def. 100) -D Skip Tn5 adjustments of cut sites (def. false) Options for peak-calling: -p <float> Maximum p-value (def. 0.01) -q <float> Maximum q-value (FDR-adjusted p-value; def. 1) -a <float> Minimum AUC for a peak (def. 200.0) -l <int> Minimum length of a peak (def. 0) -g <int> Maximum distance between signif. sites (def. 100) Other options: -X Skip peak-calling -P Call peaks directly from a log file (-f) -z Option to gzip-compress output(s) -v Option to print status updates/counts to stderr
Here is an overview of the method used by Genrich to identify peaks (Fig. 1):
- Parse alignments for the experimental sample and create an experimental "pileup" by counting the DNA fragments that cover each position of the genome (additional information about alignment parsing can be found here).
- Create a control pileup using the control sample (if available) and background level (additional information about control/background pileup calculation can be found here).
- Calculate p-values for each genomic position, as described here.
- (Optional) Convert p-values to q-values, as described here.
- Calculate the "area under the curve" (AUC) for all regions reaching statistical significance (e.g., q < 0.05 ⇒ -log(q) > 1.301).
- Combine nearby regions and call peaks whose total AUC is above a threshold (details of peak-calling parameters can be found here).
Genrich analyzes paired-end reads aligned to a reference genome. It correctly infers full fragments as spanning between the 5' ends of two properly paired alignments. By default, it does not consider unpaired alignments (including those from single-end reads), although there are three options for keeping such alignments, as described here.
An alternative analysis mode for ATAC-seq is also provided by Genrich, as described here.
Genrich analyzes reads/fragments that map to multiple locations in the genome by adding a fractional count to each location. This allows for peak detection in regions of the genome that are otherwise inaccessible to the assay. The input SAM/BAM file(s) must list secondary alignments for multimapping reads/fragments, such as those reported by the short read aligner Bowtie2 in
-k <int> mode or
-a mode. Additional information about the processing of secondary alignments by Genrich can be found in the description of the
Genrich provides an option for removing PCR duplicates (
-r). In this process, it analyzes reads/fragments based on their alignments, in three separate groups (proper pairs, discordant pairs, and singletons), and removes those identified as duplicates from further analysis. One novel feature is that this evaluation takes into account reads/fragments with multiple alignments. Additional information on the duplicate identification procedure can be found here.
Genrich computes the genome length as the sum of the lengths of the chromosomes (reference sequences) in the header of the SAM/BAM file. The length is reduced if the user specifies chromosomes (
-e) or genomic regions (
-E) to be excluded, as described below. The calculated length is used for computing a background pileup value and q-values.
The background pileup value is calculated by dividing the total sequence information (sum of read/fragment/interval lengths) in the experimental sample by the calculated genome length. The net control pileup value at a particular genomic position is the maximum of the background pileup value and the pileup of the control sample at that position (if a control sample is specified). Note that control pileups are scaled to match the experimental, based on the total sequence information in each.
The p-value for each base of the genome is calculated assuming a null model with a log-normal distribution. The control/background pileup value is used as the parameter μ, and the σ parameter is 1.2×μ if μ ≤ 7, and 10×log10(μ) if μ > 7. These formulas, as well as the choice of the log-normal distribution, were determined from a comprehensive review of control samples from various ChIP-seq experiments (human, mouse, and worm) downloaded from SRA.
- Because the log-normal is a continuous probability distribution, fractional experimental pileup values can be considered. Such values need to be evaluated by Genrich due to reads/fragments with multiple alignments.
- Earlier versions of Genrich (prior to v0.5) used the exponential distribution for the null model. Although this distribution has some convenient properties (continuous, one-parameter, -log10(p) ∝ x / β), the study described above showed that the exponential distribution was a good fit to the control pileup distribution only occasionally. However, this was still better than the Poisson distribution, which is frequently used as a null model in genomics software but was a good fit to the control pileup distribution never.
Genrich calls peaks for multiple replicates collectively. First, it analyzes the replicates separately, with p-values calculated for each. At each genomic position, the multiple replicates' p-values are then combined by Fisher's method. The combined p-values are converted to q-values, and peaks are called. This obviates the need for IDR (you're welcome!).
-t <file> Input SAM/BAM file(s) for experimental sample(s)
- Genrich analyzes alignment files in SAM/BAM format. SAM files must have a header.
- The SAM/BAM files must be sorted by queryname (via
samtools sort -n).
- SAM/BAM files for multiple replicates can be specified, comma-separated (or space-separated, in quotes).
- Multiple SAM/BAM files for a single replicate should be combined in advance via
- Genrich reads from
- This file need not be specified when peak-calling directly from a log file previously produced by Genrich (
-o <file> Output peak file (in ENCODE narrowPeak format)
- As indicated, the output file is in ENCODE narrowPeak format. Here are details of the fields:
|1. chrom||Name of the chromosome|
|2. chromStart||Starting position of the peak (0-based)|
|3. chromEnd||Ending position of the peak (not inclusive)|
|5. score||Average AUC (total AUC / bp) × 1000, rounded to the nearest int (max. 1000)|
|7. signalValue||Total area under the curve (AUC)|
|8. pValue||Summit -log10(p-value)|
|9. qValue||Summit -log10(q-value), or
|10. peak||Summit position (0-based offset from chromStart): the midpoint of the peak interval with the highest significance (the longest interval in case of ties)|
- Here is the portion of the output file corresponding to the peaks called in Figure 1:
chr1 894446 894988 peak_10 402 . 217.824936 4.344683 1.946031 317 chr1 895834 896167 peak_11 343 . 114.331093 4.344683 1.946031 90
- Genrich writes to
- This file need not be specified when peak-calling is skipped (
-c <file> Input SAM/BAM file(s) for control sample(s)
- Alignment files for control samples (e.g. input DNA) can be specified, although this is not strictly required.
- SAM/BAM files for multiple replicates can be listed, comma-separated (or space-separated, in quotes) and in the same order as the experimental files. Missing control files should be indicated with
-f <file> Output bedgraph-ish file for p/q values
- With a single replicate, this log file lists experimental/control pileup values, p- and q-values, and significance (
*) for each interval. Here is the portion of the log file corresponding to the beginning of
chr1 894435 894436 33.000000 2.477916 3.183460 1.208321 chr1 894436 894442 34.000000 2.477916 3.231466 1.241843 chr1 894442 894446 35.000000 2.477916 3.278469 1.274561 chr1 894446 894447 36.000000 2.477916 3.324516 1.306471 * chr1 894447 894450 39.000000 2.477916 3.457329 1.398035 * chr1 894450 894451 40.000000 2.477916 3.499948 1.427253 * chr1 894451 894460 41.000000 2.477916 3.541798 1.455938 *
- With multiple replicates, this log file lists p-values of each replicate, combined p-value, q-value, and significance for each interval.
- Note that this file (as well as the
-kfile, below) is called "bedgraph-ish" because it contains multiple
dataValuefields, which isn't strictly allowed in the bedGraph format. However, a simple application of
awkcan produce the desired bedgraph files for visualization purposes (see this awk reference for a guide to printing specific fields of input records).
- When peak-calling is skipped (
-X), the significance column is not produced.
- This file functions as the input when peak-calling directly from a log file (
-k <file> Output bedgraph-ish file for pileups and p-values
- For each replicate, sequentially, this file lists a header line (
# experimental file: <name>; control file: <name>), followed by experimental/control pileups and a p-value for each interval. This is the way to examine pileup values with multiple replicates, since the
-flog file does not supply them in that case.
-b <file> Output BED file for reads/fragments/intervals
- This is an unsorted BED file of the reads/fragments/intervals analyzed. The 4th column gives the read name, number of valid alignments, 'E'xperimental or 'C'ontrol, and sample number (0-based), e.g.
-R <file> Output file for PCR duplicates (only with -r)
- This log file lists the header of each read/fragment classified as a PCR duplicate, followed by the alignment, the header of the read/fragment it matched, and the alignment type. For example:
SRR5427886.5958 chr4:185201876-185201975 SRR5427886.4688 paired SRR5427886.1826 chr12:34372610,+;chr1:91852878,- SRR5427886.2040 discordant SRR5427886.10866 chr14:53438632,+ SRR5427886.4746 single
- The duplicates from multiple input files are separated by a comment line listing the next filename, such as
# experimental file #0: SRR5427886.bam.
- This file can be used to filter the original SAM/BAM file, using a simple script such as
getReads.py, for example.
-r Remove PCR duplicates
- With this option, all reads/fragments classified as PCR duplicates are removed from further analysis. Reads/fragments are evaluated based on their alignments, in three separate groups:
- Proper pairs: A properly paired alignment is classified as a duplicate if the reference name (chromosome) and the 5' positions of the two reads match those of another properly paired alignment.
- Discordant pairs (where both R1 and R2 reads align, but not in a proper pair): A discordant alignment is classified as a duplicate if the reference name, 5' position, and strand (orientation) of both alignments match those of another discordant alignment.
- Singletons (where either R1 or R2 aligns): A singleton alignment is classified as a duplicate if the reference name, 5' position, and strand (orientation) match either end of a properly paired alignment, either end of a discordant pair, or another singleton alignment.
- Within each of the three groups, Genrich analyzes reads/fragments in order based on the total sums of the quality scores (sums of both R1's and R2's quality scores with paired alignments). In case of ties, reads are processed in the order they appear in the SAM/BAM.
- Discordant and singleton duplicates are evaluated only if unpaired alignments are to be kept.
- There is no consideration of read order (R1 vs. R2) by Genrich. That is, if the 5' coordinates of R1 and R2 of one paired alignment match the coordinates of R2 and R1 of another paired alignment, respectively, it is classified as a duplicate. The same applies to discordant and singleton duplicates.
- For reads/fragments with multiple alignments, all secondary alignments within the
-sthreshold are considered. If any of a read's/fragment's alignments is evaluated as a duplicate, then the whole read/fragment is classified as a duplicate, and all of its alignments are discarded. Note that no maximum of 10 alignments per read/fragment is imposed at this stage, and all alignments to skipped chromosomes (
-e) or genomic regions (
-E) are still evaluated.
- For discordant alignments, all combinations of valid R1 and R2 alignments are analyzed.
-e <arg> Comma-separated list of chromosomes to exclude
- All alignments to the given list of chromosomes (reference sequences) are excluded from peak-calling. The alignments' lengths do not factor into the total sequence information calculation, nor to the average fragment length calculation (
-x), and the alignments are not printed to the
- For reads/fragments with multiple alignments, the scores of alignments to
-echromosomes are considered for comparison purposes.
- The lengths of the
-echromosomes are subtracted from the total genome length calculated by the program.
-E <file> Input BED file(s) of genomic regions to exclude
- All alignments, or portions of alignments, that lie within the given genomic regions are excluded from peak-calling, and no peak may extend into or around an excluded region. The alignments' lengths (within an excluded region) do not factor into the total sequence information calculation. However, the full fragment lengths are counted for the average fragment length calculation (
-x), and the full fragments are listed in the
- In the output log files (
-k), excluded regions have experimental/control pileup values of
0.0and p-/q-values of
- Multiple BED files can be specified, comma-separated (or space-separated, in quotes). Overlapping BED intervals are merged appropriately.
- The regions' lengths are subtracted from the total genome length calculated by the program.
- Genomic regions to which reads typically do not align uniquely can be specified, but this is not recommended. Instead, one should consider taking advantage of Genrich's ability to analyze multimapping reads.
- The accessory script
findNs.pyproduces a BED file of 'N' homopolymers in a fasta file (e.g. a reference genome). Its output should be given to Genrich via
-m <int> Minimum MAPQ to keep an alignment (def. 0)
- All alignments with
MAPQless than the given value are eliminated. This is equivalent to filtering with
samtools view -q <int>.
- This option should not be used if the SAM/BAM lists multiple alignments for some reads/fragments. Instead, filtering should be accomplished via
-s <float>, below.
-s <float> Keep sec alns with AS >= bestAS - <float> (def. 0)
- Genrich considers all secondary alignments of multimapping reads, but, by default, it keeps only the alignments whose scores are equal to the best score for the read/fragment. Setting a value such as
-s 20causes Genrich also to keep secondary alignments whose scores are within 20 of the best.
- The SAM/BAM should have alignment scores under the extra field
AS. If not, all alignments are considered equivalent.
- Each of the
Nalignments for a read/fragment is counted as
1/Nfor the pileup.
- To avoid excessive memory usage and the imprecision inherent in floating-point values, a maximum of 10 alignments per read/fragment are analyzed by Genrich. Reads/fragments with more than 10 alignments within the
-sthreshold are subsampled based on the best alignment scores; in the case of ties, alignments appearing first in the SAM/BAM are favored.
- The alignment score for a fragment (pair of reads) is equal to the sum of the reads' individual scores.
- Properly paired alignments take precedence over unpaired alignments, regardless of the alignment scores.
- As stated above, Bowtie2 reports secondary alignments in
-k <int>mode or
-amode. The short read aligner BWA does not report secondary alignments.
By default, Genrich analyzes only properly paired alignments and infers the full fragments as spanning between the 5' ends of the two alignments (Fig. 2). It does not analyze unpaired alignments unless one of three options is selected:
-y Keep unpaired alignments (def. false) -w <int> Keep unpaired alns, lengths changed to <int> -x Keep unpaired alns, lengths changed to paired avg
-y: unpaired alignments are kept, just as they appear in the SAM/BAM
-w <int>: unpaired alignments are kept, with their lengths changed to the given value (from their 5' ends)
-x: unpaired alignments are kept, with their lengths changed to the average length of fragments inferred from properly paired alignments (excluding those aligning to skipped chromosomes [
example.bamhas both properly paired alignments (top left) and unpaired alignments (top right). By default, Genrich infers the full fragments from the paired alignments and discards the unpaired alignments. Unpaired alignments can be kept via
-w <int>, or
-x, as described above.
ATAC-seq is a method for assessing genomic regions of open chromatin. Since only the ends of the DNA fragments indicate where the Tn5 transposase enzyme was able to insert into the chromatin, it may not be optimal to interpret alignments as shown above (Fig. 2). Genrich has an alternative analysis mode for ATAC-seq in which it creates intervals centered on transposase cut sites (Fig. 3).
-j Use ATAC-seq mode (def. false) -d <int> Expand cut sites to <int> bp (def. 100) -D Skip Tn5 adjustments of cut sites (def. false)
-yis set, adjusted forward by 5bp). The lengths of the intervals can be changed from the default of
Unpaired alignments can be analyzed with
-y, though only one interval, centered on the read's 5' end, is inferred. Both
-w <int> and
-x are equivalent to
-y in ATAC-seq mode.
By default, Genrich centers the intervals at the ends of the reads/fragments, adjusted forward by 5bp to account for the Tn5 transposase occupancy. That is, for the 5' ends of fragments (or for reads aligning in a normal orientation), the position is increased by +5, and for the 3' ends of fragments (or for reads aligning in a reverse-complement orientation), the position is adjusted by -5. To avoid this position adjustment (e.g. for DNase-seq), one can use
For full fragments, when the two cut site intervals overlap, they are merged into a single interval. To get a BED file of cut sites, one can run
-d 1 -b <file>.
The remainder of the peak-calling process (calculating pileups and significance values) is identical to the default analysis mode. Note that the interval lengths (not the fragment lengths) are used to sum the total sequence information for the calculation of control/background pileup values.
-p <float> Maximum p-value (def. 0.01) -q <float> Maximum q-value (FDR-adjusted p-value; def. 1)
- These parameters establish the statistical threshold below which a base is considered significantly enriched in the experimental sample(s) vs. the control/background. The significance value is automatically converted to a -log10 scale by Genrich.
- If a
-qthreshold is specified, the
-pthreshold is ignored.
- If a
-qthreshold is not specified, q-values are not calculated (reported as
-a <float> Minimum AUC for a peak (def. 200.0)
- The peak-calling method requires that, for a peak to be called, the total significance of the region must exceed a minimum value. The total significance is calculated as the sum of the -log(p) values above the threshold set by
-pover the length of the region (i.e. the Area Under the -log(p) Curve).
- If a
-qthreshold is specified, the Area Under the -log(q) Curve is calculated (see Fig. 1).
-l <int> Minimum length of a peak (def. 0)
- With this option, any potential peak whose length is below the specified value is discarded, regardless of its significance. The default of 0 means that no peaks are eliminated on this basis.
-g <int> Maximum distance between signif. sites (def. 100)
- This parameter sets the maximum distance between sites that achieve significance in order for them to be linked together into the same potential peak.
-X Skip peak-calling
- This is a convenience option for those who are unsure of the peak-calling parameters but do not want to run the full analysis multiple times. Genrich interprets the alignment files (including identifying PCR duplicates) and produces intermediate log files, but does not perform the peak-calling step. The bedgraph-ish log file listing statistics (
-f) can subsequently be used to call peaks directly (see
- With this option, the requirement to specify an output peak file (
-o) is suspended, and no such file is produced.
-P Call peaks directly from a log file (-f)
- This is a convenience option to call peaks from a bedgraph-ish log file (
-f) produced by a previous Genrich run. One can modify the peak-calling parameters and explore the results without needing to run the full analysis (and requiring much less time and memory).
- Any excluded chromosomes (
-e) and genomic regions (
-E) that were previously specified in the Genrich run that produced the log file are honored. Additional chromosomes and regions can be specified via the same command-line options, although the results may be slightly different than running the full analysis, due to differences in the calculations of genome length and sequence information. Note that there is no verification that an excluded region (
-E) does not extend past the end of a chromosome, unlike when the full analysis is run.
- With this option, the requirement to specify input SAM/BAM file(s) (
-t) is suspended, but the input log file (
-f) must be specified.
- Optional output files (
-R) cannot be produced with
-z Option to gzip-compress output(s)
- When selected, all output files are gzip-compressed.
-S Option to skip sort order check
- When selected, Genrich does not verify that the sort order of the input SAM/BAM file is queryname, even though the file is still parsed under that assumption. This is a convenience option, to be used only by those who understand the consequences.
-v/--verbose Option to print status updates/counts to stderr -h/--help Print the usage message and exit -V/--version Print the version and exit
A sequencing run was downloaded from SRA. Its reads were adapter-trimmed by NGmerge and aligned to the human genome (hg19) by Bowtie2 with
-k 20. The SAM alignments from bowtie2 were piped through SAMtools to sort them by queryname and convert the file to the BAM format. The resulting alignment file
SRR5427886.bam was analyzed by Genrich with the options to remove PCR duplicates (
-r) and to keep unpaired alignments, extended to the average fragment length (
-x). All alignments to chrM and chrY were discarded (
-e chrM,chrY), as well as alignments to two sets of excluded intervals: regions of 'N' homopolymers in the hg19 genome (produced by
findNs.py) and high mappability regions (
-E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz). There was no control sample. Peaks were called using a maximum q-value of 0.05 (
-q 0.05) and a minimum AUC of 20.0 (
$ ./Genrich -t SRR5427886.bam -o SRR5427886.narrowPeak \ -f SRR5427886.log -r -x -q 0.05 -a 20.0 -v \ -e chrM,chrY -E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz Processing experimental file #0: SRR5427886.bam BAM records analyzed: 146277509 Unmapped: 2877550 To skipped refs: 2560688 (chrM,chrY) Paired alignments: 137462998 secondary alns: 67724920 Unpaired alignments: 3376273 secondary alns: 2232056 PCR duplicates -- Paired aln sets: 35574220 duplicates: 4660723 (13.1%) Discordant aln sets: 64679 duplicates: 7736 (12.0%) Singleton aln sets: 1036778 duplicates: 475201 (45.8%) Fragments analyzed: 31286234 Full fragments: 30616993 (avg. length: 226.4bp) Half fragments: 669241 (from unpaired alns, extended to 226bp) - control file #0 not provided - Background pileup value: 2.477916 Peak-calling parameters: Genome length: 2826865605bp Significance threshold: -log(q) > 1.301 Min. AUC: 20.000 Max. gap between sites: 100bp Peaks identified: 35114 (27918264bp)
The total time to execute the above command (analyzing a BAM of 146.3 million alignments and calling peaks) was 10.5min. It required 17.1GB of memory.
Once a log file (
SRR5427886.log) is produced, calling peaks with an alternative set of parameters is accomplished most easily with the
-P option. For example, with
-p 0.01 -a 200:
$ ./Genrich -P -f SRR5427886.log -o SRR5427886_p01_a200.narrowPeak -p 0.01 -a 200 -v Peak-calling from log file: SRR5427886.log Peak-calling parameters: Genome length: 2826865605bp Significance threshold: -log(p) > 2.000 Min. AUC: 200.000 Max. gap between sites: 100bp Peaks identified: 47329 (62194983bp)
This required just 36sec and less than 0.1MB of memory.
However, if one wanted the alignments to be interpreted differently, such as ATAC-seq mode (
-j), the original command (with
-t SRR5427886.bam) would need to be rerun.
In verbose mode, Genrich may print one or more warnings to
Read N prevented from extending below 0 on <chrom>: This may occur due to extending unpaired alignments (
-x) or in ATAC-seq mode (
Read N prevented from extending past <int> on <chrom>: This also may occur due to extending unpaired alignments (
-x) or in ATAC-seq mode (
-j). A maximum of 128 warning messages of these types (
Read N prevented...) are printed per SAM/BAM.
Large scaling may mask true signal: This is printed if the scaling factor for the control pileup is greater than 5.
BED interval ignored - located off end of reference: An excluded BED interval (
-E) whose start coordinate is past the end of the reference sequence is ignored. One should ensure that the genome version that produced the BED intervals matches that of the SAM/BAM.
BED interval extends past end of ref. - edited to <loc>: An excluded BED interval (
-E) whose end coordinate is past the end of the reference sequence is adjusted as indicated. Again, one should ensure that the genome version that produced the BED intervals matches that of the SAM/BAM.
No paired alignments to calculate avg frag length -- Printing unpaired alignments "as is": When there are no properly paired alignments and the
-xaverage extension option is selected, the unpaired alignments are printed as they appear in the SAM/BAM.
Read N, alignment at <loc> skipped due to overflow: The maximum difference in pileup values from one genomic position to the next is +32767, and additional read alignments are skipped due to this limitation. Removing PCR duplicates (
-r) may help reduce this issue.
Read N, alignment at <loc> skipped due to underflow: The minimum difference in pileup values from one genomic position to the next is -32768, and additional read alignments are skipped due to this limitation. Removing PCR duplicates (
-r) may help reduce this issue.
Read N has more than 128 alignments: If a read has more than 128 alignments in the SAM/BAM, only the first 128 are considered. As described above, the best 10 alignments (at most) are ultimately analyzed by Genrich.
All q-values are 1: When no p-values are smaller than the reciprocal of the calculated genome length, all q-values are calculated to be 1. If peaks are expected to be called, one should consider using a p-value threshold instead (
Skipping chromosome N -- Reads aligning to it were used...: This occurs when peak-calling directly from a log file (
-P), and the log file indicates that reads/fragments aligned to a chromosome to be skipped (
-e). When Genrich created the log file, the chromosome was not skipped. Therefore, the reads/fragments that aligned to it were used in the background pileup calculation, and its length was included in the genome length calculation. Those calculations affected the statistics present in the log file, so the called peaks may be slightly different than those called when running the full analysis.
Skipping given BED regions -- Reads aligning to them were used...: This occurs when peak-calling directly from a log file (
-P), and the log file indicates that reads/fragments aligned to one or more genomic regions to be skipped (
-E). When Genrich created the log file, the regions were not skipped. Therefore, the reads/fragments that aligned to them were used in the background pileup calculation, and their lengths were included in the genome length calculation. Those calculations affected the statistics present in the log file, so the called peaks may be slightly different than those called when running the full analysis.
"orphan" alns: An "orphan" alignment is one that the SAM/BAM indicates is properly paired, but its pair could not be found. This could be due to a poorly formatted SAM/BAM, or possibly (but unlikely) a bug in alignment parsing by Genrich. This warning appears in the accounting of alignments analyzed.
Genrich runs very quickly but uses a considerable amount of memory. For starters, it requires 3 bytes for every base-pair of the reference genome, i.e. ~9GB for a human sample. The number of input files has little effect on memory, but certain analysis options (especially the option to remove PCR duplicates) can greatly increase the memory usage, particularly with large SAM/BAM input files. See above for an example.
Genrich is not multi-threaded.
Copyright © 2018 John M. Gaspar (email@example.com)
Any question, concern, or bug report about the program should be posted as an Issue on GitHub. Before posting, please check previous issues (both Open and Closed) to see if your issue has been addressed already. Also, please follow these good GitHub practices.