FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states

Home Page:http://felixkrueger.github.io/Bismark/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Bismark Alignment Inquiry

KeemAntonio opened this issue · comments

Hi Felix,

I have a couple of questions regarding Bismark. I'm sorry if they are stupid questions; I am fairly new to WGBS data and Bismark analysis (or to bioinformatics in general).

We are researching non-model fish animals. Recently, their genomes have become available in the NCBI database, and their assembly level is Scaffold (Additionally, it is still unannotated). Concurrently, there is another project in our laboratory to sequence our target organisms. I, in particular, have been tasked with analyzing WGBS data from our target animal. I already have the raw WGBS reads (Library type: Accel Methyl-Seq DNA library), and I have already trimmed them using Trim Galore (--paired --fastqc). While we are waiting for the genome of our target samples from our own laboratory (we had a few hiccups with our genome sequencing, which is why our WGBS reads came first), can I use a different genome (that is already annotated) from another species to align my WGBS reads?

I tried aligning our methylome reads to the genome of D. rerio using the default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), which resulted in a mapping efficiency of 0.1%. I also tried aligning my WGBS reads to the unannotated genome of our target species (the one I mentioned earlier that has been deposited to the NCBI database) using also default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), and I only achieved 31% mapping efficiency. I am reading from your GitHub that I should use --score-min L,0,-0.6, which I will be trying. My main question is, is it alright to use the unannotated genome as my reference for Bismark alignment (I will also try aligning our methylome reads to the genome that was produced by our laboratory once it is finished)? And does Bismark only align methylome reads from the same species, or will my methylome reads be able to align to the genomes of closely related species?

Thank you

Hi Antonio,

Welcome to the world of methylation alignments! As a first comment, Accel Swift typically requires some specific trimming, namely

--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10

You can of course try to align to a close species at first to get some experience, and then perform it once more you've got a good reference sequence available. The issue that you have experienced already is that the more distantly related the species are, the lower the mapping efficiency will be. The default mapping parameters for Bismark are deliberately very strict, so increasing it to --score-min L,0,-0.6 will allow 3 times as many errors as the default mode.

As another issue, you should really re-do the trimming first, this should also greatly enhance the mapping efficiency (while reducing errors).

And finally, as soon as you have a reference sequence to align to, you can use Bismark, as it doesn't do anything with the annotations themselves. Once the annotation is at hand, you can rather use it as an additional layer for downstream analysis.

Hi Felix!

Thank you for your prompt response! we have re-done trimming using <--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10> in TrimGalore! we have seen significant improvement as you can see in per base sequence
new trim
we have also seen a significant increase on the mapping efficiency of one of our samples on one of our target organism (I am analyzing two different species of non-model fish) with a jump from ~30% to 51% using the newly trimmed reads (bismark Parameter: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)). we have also tried aligning using --score-min L,0,-0.4 and got an increase from our old trimmed sequences (57.7%) comparing it to new trimmed sequences using the suggested parameter (66.9%). Lastly, we also aligned using --score-min L,0,-0.6 and got an increase from 72.9% to 77.0%. Thank you for this!

I just have a few more questions. since I am working on another species of fish we tried doing the same steps/parameters for trimming (--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10) but when we aligned it to its assembled genome that we got from NCBI database (note that we are still in the midst of our own genome assembly) we only got a 1.2% mapping efficiency using this parameter (-q --score-min L,0,-0.6 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)) here is the final report:

Final Alignment report

Sequence pairs analysed in total: 94422002
Number of paired-end alignments with a unique best hit: 1156101
Mapping efficiency: 1.2%
Sequence pairs with no alignments under any condition: 93197502
Sequence pairs did not map uniquely: 68399
Sequence pairs which were discarded because genomic sequence could not be extracted: 4698

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 573176 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 578227 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 56000878

Total methylated C's in CpG context: 3779225
Total methylated C's in CHG context: 236041
Total methylated C's in CHH context: 537007
Total methylated C's in Unknown context: 15090

Total unmethylated C's in CpG context: 3543149
Total unmethylated C's in CHG context: 11928857
Total unmethylated C's in CHH context: 35976599
Total unmethylated C's in Unknown context: 292947

C methylated in CpG context: 51.6%
C methylated in CHG context: 1.9%
C methylated in CHH context: 1.5%
C methylated in Unknown context (CN or CHN): 4.9%

I have a feeling that the available genome that I am using in this species is highly fragmented

the assembly statistics is:

Genome size | 709.4 Mb
Total ungapped length | 577.7 Mb
Number of scaffolds | 191,173
Scaffold N50 | 11.4 Mb
Scaffold L50 | 22
Number of contigs | 281,720
Contig N50 | 2.6 kb
Contig L50 | 67,375
GC percent | 44.5
Genome coverage | 65.0x
Assembly level | Scaffold

do you have any recommendations on how to proceed on this problem? Thank you so much!

Hmm, a 1% mapping efficiency is very low indeed. This typically happens either when you have the wrong library type (can you attache the R1/R2 FastQC base composition plots?), or when you align to a completely wrong genome (e.g. human/mouse), or possibly if there were some grave technical errors, e.g. in R2?

Hi Felix! Thank you for responding. I'm attaching the fastqc report here for one pair of my samples:

R1:
per_sequence_gc_content
per_sequence_quality
per_tile_quality
sequence_length_distribution
adapter_content
duplication_levels
per_base_n_content
per_base_quality
per_base_sequence_content

R2:
adapter_content
duplication_levels
per_base_n_content
per_base_quality
per_base_sequence_content
per_sequence_gc_content
per_sequence_quality
per_tile_quality
sequence_length_distribution

I am also trying to see if my sequences are contaminated just to make sure that my samples are really the species that I collected. Do you have any recommendations on what software that I can use for this task? I'm reading that Fastq screen can screen contaminants and can be used for bisulfite sequences.

Regarding your last suggestion/comment, how do we assess if there is a grave technical error in our sequences?

Thank you!

Thanks for all these plots (they are from the 5' clipped samples, correct?). Overall they all look fine, there are no signs of technical or unexpected issues in the reads (e.g. in the quality, or tile graphs, or the N-count).

My next best guess would be that the sample are simply not from the organism you think they are from. If you wanted you could send me small subset of reads (e.g. 200,000, untrimmed, fastq.gz compressed) via email, and I could take a quick look?

Hi Felix! sorry for the late reply. Our lab had a field work for 3 weeks field work. May I ask what software would you suggest for me to be able to get a small subset of my sequences for the second fish species (fish2)? may I also ask for your email? is this email correct: fkrueger@altoslabs.com?

Currently we were done with the alignment of the first fish species ( Fish1; the one that I described that the mapping efficiency jumped from 30% to 50% using the default parameters of Bismark (bismark Parameter: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)). Here is the result for report generation:

newplot (6)
newplot (7)
Bismark M-bias Read 1
Bismark M-bias Read 2
newplot (3)
newplot (4)
newplot (5)

I just have another question for Fish1:

during alignment, bismark is saying that "Chromosomal sequence could not be extracted for..." as shown in the figure below. is this to be expected since we are using genome at the scaffold level?
image

Yes, these warnings show up more frequently if you have short chromosomes or scaffolds. Typically the number is negligibly small, in your case ~0.05%. Just ignore.

To interact with reads you can just use gzip/gunzip, e.g. like so:

gunzip -c inputfile.fastq.gz | head -800000 | gzip - > output_200K.fastq.gz

Thanks for the sample files. The sample seems to have been processed using the Accel Swift kit, which introduces some hefty biases, especially at the start of Read 2:

Screenshot 2024-03-25 at 13 53 49

As a solution to this you just need to clip your reads from all ends, like so:

trim_galore --clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10

I am convinced after that the reads will align just fine. Let me know how you get on!