1. Introduction Bisulfite sequencing (BS-seq) has been extensively used for DNA methylome study. Ideally, BS-seq experiment should be able to directly and exactly identify the methylation state of a DNA fragment from the original genome. However, current BS-seq protocols still possess several intrinsic biases, which will impact methylation level estimation, such as overhang end-repair, 5� bisulfite conversion failure, sequencing into the adaptor and 3� low sequencing quality. Since BS-seq experiments are widely used and the resulting data will continue to grow exponentially in the near future, there is a strong need for a dedicated QC tool to evaluate and remove potential technical biases in BS-seq experiments. Here, we developed BSeQC package. It can comprehensively evaluate the quality of BS-seq experiments and automatically trim nucleotides with potential technical biases that may result in inaccurate methylation estimation. In addition, BSeQC also support removing duplicate reads and keeping one copy of the overlapping segment in paired-end sequencing. 2. Installation Prerequisite: . Python . Scipy . Numpy . Matplotlib (We recommend to install this python package. If not, we only can produce the Mbias table, not the Mbias plot) 2.1 Install BSeQC in defaut location 1. tar zxf BSeQC-VERSION.tar.gz 2. cd BSeQC-VERSION 3. python setup.py install #Skip step4 if '/usr/local/lib/python2.7/site-packages' is already included in your PYTHONPATH. 4. export PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH #Skip step5 if '/usr/local/bin' is already included in your PATH. 5. export PATH=/user/local/bin:$PATH #To make permanent change to your PYTHONPATH or PATH variables, copy the commands (step4 and step5) into your #'/home/user/.bashrc' or '/home/user/.bash_profile'. 2.2 Install BSeQC in user specified location 1. tar zxf BSeQC-VERSION.tar.gz 2. cd BSeQC-VERSION #You need to change '/home/user/' accordingly 3. python setup.py install --prefix=/home/user/ #setup PYTHONPATH, so that BSeQC knows where to import modules 4. export PYTHONPATH=/home/user/lib/pythonX.Y/site-packages:$PYTHONPATH #setup PATH, so that system knows where to find executable files. 5. export PATH=/home/user/bin:$PATH #To make permanent change to your PYTHONPATH or PATH variables, copy the commands (step4 and step5) into your #'/home/user/.bashrc' or '/home/user/.bash_profile'. 3. Input WGBS mapping files: . SAM files . BAM files 4. Usage Information mbias: BSeQC main executable function -s or --sam The SAM file for quality analysis; Multiple SAM file should be separated by the ','. (Mandatory) -r or --ref The reference genome fasta file. (Mandatory) -t or --samtools The path of samtools. (Optional) -n or --name The name for the output plot and table. default = 'NA' (Optional) -l or --len If the original mapping reads have been trimmed with adapter or other reasons, the original read length for the sam file should be set. Multiple length can also be separated by ','. If the read length of two mates in paired-end is different, please separated by '-'. (Optional) -p or --pvalue How many stds will be set for the trimming cutoff. Default = 2. (Optional) --drift How many drifts(%mC) will be set for the trimming cutoff. Default = 2. (Optional) -f or --trim_file User can determine the trimming bp by the trim file. (Optional) -a or --auto Automatically trim the biased bp. If not you can use the Mcall biases plot to manually decide how many bps to trim and make a trimming file. Default = True. (Optional) -o or --remove_overlap Keep only one copy of the overlapping segment of two read mates in paired-end seq. Default = True. (Optional) --filter_dup Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = True. (Optional) --p_poisson Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = 1e-5. -g or --gsize Effective genome size for calculate max coverage. It can be 1.0e7 or 10000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8). It is restricted by --filter_dup. (Optional) --not_mapping Whether keep the not-unique mapping reads in the filter SAM file. Default = True. (Optional) nonuniform: use the �Bis-SNP strategy� as an alternative method for trimming 5� bisulfite conversion failure -s or --sam The SAM file for quality analysis; Multiple SAM file should be separated by the ','. (Mandatory) -r or --ref The reference genome fasta file. (Mandatory) -t or --samtools The path of samtools. (Optional) -n or --name The name for the output plot and table. default = 'NA' (Optional) -o or --remove_overlap Keep only one copy of the overlapping segment of two read mates in paired-end seq. Default = True. (Optional) --filter_dup Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = True. (Optional) --p_poisson Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = 1e-5. -g or --gsize Effective genome size for calculate max coverage. It can be 1.0e7 or 10000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8). It is restricted by --filter_dup. (Optional) --not_mapping Whether keep the not-unique mapping reads in the filter SAM file. Default = True. (Optional) Strand symbol and information: Read type Strand Symbol Strand information Single-end ++ Watson strand Single-end -+ Crick strand Paired-end ++ Forward strand of Watson of reference (mate1) Paired-end +- Reverse strand of Watson of reference (mate2) Paired-end -+ Forward strand of Crick of reference (mate1) Paired-end -- Reverse strand of Crick of reference (mate2) 5. Example mbias: BSeQC main executable function ##paired-end (testdata: mNPC_chr1) #just focus on the mbias trimming bseqc mbias -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r mm9.fa -n mNPC_Paired_BSseq_chr1 -o --filter_dup #filter duplicate reads, remove one copy of the overlapping segment, and not keep not_unique mapping reads bseqc mbias -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r mm9.fa -n mNPC_Paired_BSseq_chr1_roverlap_dup -g 1.9e8 --not_mapping #use trimming file bseqc mbias -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r mm9.fa -n mNPC_Paired_BSseq_chr1_roverlap_dup -g 1.9e8 --not_mapping -f mNPC_Paired_BSseq_chr1_trim_file.txt ##single-end #just focus on on the mbias trimming (testdata: H1_chr1) #for replicate1: set the samtools path; set the original sequence length, because some bps in the 3' end of the reads with low quality or adapter have been trimmed during mapping bseqc mbias -s methylC-seq_H1_r1_noaq_chr1_43.bam,methylC-seq_H1_r1_noaq_chr1_52.bam,methylC-seq_H1_r1_noaq_chr1_53.bam,methylC-seq_H1_r1_noaq_chr1_76.bam,methylC-seq_H1_r1_noaq_chr1_87.bam,methylC-seq_H1_r1_noaq_chr1_88.bam -r hg19.fa -l 43,52,53,76,87,88 -n H1_Single_BSseq_chr1_replicate1 --filter_dup #for replicate2 bseqc mbias -s methylC-seq_H1_r2_noaq_chr1.bam -r hg19.fa -n H1_Single_BSseq_chr1_replicate2 -l 87 --filter_dup nonuniform: use the �Bis-SNP strategy� as an alternative method for trimming 5� bisulfite conversion failure ##testdata: mNPC_chr1 bseqc nonuniform -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r /mnt/Storage/data/Bowtie/mm9.fa -n Bis-SNP_strategy --filter_dup -o --not_mapping ##RRBS paired-end bseqc rrbs -s SRR726536_chr1.sam -r mm9.fa -n rrbs_p ##RRBS single-end bseqc rrbs -s SRR788619.sam -r mm9.fa -n rrbs_s_rep1 6. Output mbias: BSeQC main executable function name + 'CG_Mbias_plot.pdf': the CpG Mbias plot for each read length in each stand name + 'nonCG_Mbias_plot.pdf': the non-CpG Mbias plot for each read length in each stand name + 'Dup_dis.pdf': the duplicate reads distribution (when --filter_dup be set true) name_CG_strand_readlength.csv(in directory:name+'Mbias_table): the CpG Mbias table for each read length in each stand name_nonCG_strand_readlength.csv(in directory:name+'Mbias_table): the non-CpG Mbias table for each read length in each stand name_CG_trim_file.txt(in directory:name+Mbias_table): the trimming decision for each read length in each stand from the CpG Mbias plot name_nonCG_trim_file.txt(in directory:name+Mbias_table): the trimming decision for each read length in each stand from the non-CpG Mbias plot name_final_trim_file.txt: the most stringent trimming decision made by either CpG or non-CpG cytosines M-bias plots name_BSeQC_mbias_filter_report.txt: the detail trimming and filtering report nonuniform: use the �Bis-SNP strategy� as an alternative method for trimming 5� bisulfite conversion failure name_trimmed_nucleotides_dis.pdf: the distribution of the number of trimmed nucleotides based on the �Bis-SNP strategy� name_BSeQC_mbias_filter_report.txt: the detail trimming and filtering report 7. NOTE README file Because different DNA strands and read lengths can have distinct biases, BSeQC will trim them differently. If users are concerned about the different coverage on two strands, we can use the parameter (-f TRIM_FILE) to let Watson(+) and Crick(-) strands be trimmed with same nucleotides in the 5’ and 3’. The format of the trim_file(split by tab): strand read_length trim_5'_pos trim_3'_pos ++ 100 2 98 ++ 88 2 78 -+ 100 2 98 -+ 87 2 78