ngmlr was used to map the raw reads to the reference genome, and sniffles was used to call SVs; see the tutorial "SV calling with Sniffles".
- calculate the frequency difference of SVs between populations
python SV_Vcf_FilterBy_SampleFreq.py sheep 0.4 Tibet_Hu.ONT.anno.vcf
- 0.4: threshold for the proportion of the specified samples in which the SV is missing
- annotate the selected SVs
python SV_GeneAnno_2023.py sheep_sv_info.txt
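The frequency comparison between populations described above could look roughly like the following Python sketch. It is not the code of SV_Vcf_FilterBy_SampleFreq.py: the function names, the sample grouping and the reading of 0.4 as a maximum missing rate per group are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def carrier_stats(genotypes: List[str]) -> Tuple[Optional[float], float]:
    """Return (carrier frequency among called samples, missing rate)."""
    if not genotypes:
        return None, 1.0
    called = [g for g in genotypes if "." not in g]  # "./." counts as missing
    missing_rate = 1.0 - len(called) / len(genotypes)
    if not called:
        return None, missing_rate
    carriers = sum(
        1 for g in called
        if any(allele != "0" for allele in g.replace("|", "/").split("/"))
    )
    return carriers / len(called), missing_rate


def sv_frequencies(
    vcf_lines: Iterable[str],
    groups: Dict[str, List[str]],
    max_missing: float = 0.4,  # assumed meaning of the 0.4 argument
) -> List[Tuple[str, int, str, Dict[str, float]]]:
    """Per-group SV carrier frequency; SVs with too many missing calls are skipped."""
    samples: List[str] = []
    results = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        # GT is taken as the first FORMAT key of every sample column.
        gts = {name: value.split(":")[0] for name, value in zip(samples, fields[9:])}
        freqs: Dict[str, float] = {}
        for group, members in groups.items():
            freq, missing = carrier_stats([gts[m] for m in members])
            if freq is None or missing > max_missing:
                break
            freqs[group] = freq
        else:  # no group failed the missing-rate check
            results.append((fields[0], int(fields[1]), fields[2], freqs))
    return results
```

The frequency difference of one SV is then simply the difference between two entries of the returned dictionary, for example `freqs["Tibetan"] - freqs["Hu"]` for hypothetical group names.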
Population-scale short-read SVs were used to verify the frequencies of the SVs identified from ONT data.
- identify genomic rearrangements from whole-genome alignments
nucmer --maxmatch -c 100 -b 500 -l 50 refgenome qrygenome # Whole genome alignment.
delta-filter -m -i 90 -l 100 out.delta > out.filtered.delta # Remove small and lower quality alignments
show-coords -THrd out.filtered.delta > out.filtered.coords # Convert alignment information to a .TSV format as required by SyRI
syri -c out.filtered.coords -d out.filtered.delta -r refgenome -q qrygenome
- filter for inversions longer than 1 Mb
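A minimal sketch of this length filter is given here in Python. It assumes the tab-separated syri.out layout in which columns 1 to 3 hold the reference chromosome, start and end, column 9 the unique ID and column 11 the annotation type; the function names are hypothetical and the in-house filtering may differ.

```python
import csv
from typing import Iterable, Iterator, List, Tuple


def read_syri(path: str) -> Iterator[List[str]]:
    """Yield the tab-separated rows of a SyRI output file."""
    with open(path, newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def large_inversions(
    rows: Iterable[List[str]], min_len: int = 1_000_000
) -> List[Tuple[str, int, int, int, str]]:
    """Keep rows annotated as INV whose reference span exceeds min_len."""
    hits = []
    for row in rows:
        # Column 11 (index 10) is assumed to hold the annotation type.
        if len(row) < 11 or row[10] != "INV":
            continue
        start, end = int(row[1]), int(row[2])
        length = abs(end - start) + 1
        if length > min_len:
            hits.append((row[0], start, end, length, row[8]))
    return hits
```

Calling `large_inversions(read_syri("syri.out"))` would list the candidate inversions, assuming SyRI wrote its default output file name into the working directory.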
- verify by contig
- de novo genome assembly of Tibetan sheep at the contig level, saved as Tibet4_Flye_contig.fasta
flye --nano-raw Tibetan.fq.gz --out-dir Flye_assembly --threads 60
- The contig-level genome of Tibetan sheep was aligned to the Hu sheep reference genome
nucmer -p Tibetan-Hu Hu.fa Tibet4_Flye_contig.fasta
delta-filter -l 10000 -q Tibetan-Hu.delta > Tibetan-Hu.filter.delta
show-coords -q Tibetan-Hu.filter.delta -T -o Tibetan-Hu.txt
- visualizing
- verify inversions in the population by localPCA
bcftools view CM029825.1.vcf.gz -O b -o CM029825.1.bcf.gz # convert VCF to BCF for chromosome CM029825.1
Rscript localpca.R CM029825.1 # perform localPCA
Rscript plot.R # visualize
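The idea behind local PCA can be illustrated with a short Python sketch. It is a simplified stand-in for localpca.R, not a port of it: only the first principal component of each window is used, and two windows are compared through the Frobenius distance between the rank-one matrices built from their unit-length PC1 vectors. Windows inside an inversion are expected to sit far from the rest of the chromosome under such a distance. Function names and the toy genotype coding (0, 1, 2) are assumptions.

```python
import math
import random
from typing import List


def window_pc1(window: List[List[float]]) -> List[float]:
    """Leading eigenvector of the sample-by-sample covariance of one window.

    `window` holds one row per variant and one column per sample.
    """
    n_samples = len(window[0])
    centered = []
    for site in window:
        mean = sum(site) / len(site)
        centered.append([g - mean for g in site])
    cov = [
        [sum(site[i] * site[j] for site in centered) for j in range(n_samples)]
        for i in range(n_samples)
    ]
    # Power iteration from a fixed pseudo-random start vector.
    rng = random.Random(0)
    vec = [rng.random() + 0.1 for _ in range(n_samples)]
    for _ in range(200):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # monomorphic window, nothing to decompose
            break
        vec = [x / norm for x in nxt]
    return vec


def window_distance(v1: List[float], v2: List[float]) -> float:
    """Frobenius distance between v1*v1^T and v2*v2^T for unit vectors.

    The sign of an eigenvector is arbitrary, so the squared dot product is used.
    """
    dot = sum(a * b for a, b in zip(v1, v2))
    return math.sqrt(max(0.0, 2.0 * (1.0 - dot * dot)))
```

The pairwise distances between all windows of a chromosome can then be passed to multidimensional scaling, and windows that separate from the bulk are candidates for the inverted region.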
- The HiC-Pro pipeline was used to map the raw Hi-C reads to the reference genome to obtain the raw contact matrix, which was then normalized by ICE (iterative correction and eigenvector decomposition).
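To make the ICE step concrete, a minimal pure-Python sketch of iterative correction is shown here. It is an illustration of the principle on a small dense matrix, not the implementation used by HiC-Pro; the function name, iteration limit and tolerance are assumptions.

```python
from typing import List, Tuple


def ice_normalize(
    matrix: List[List[float]], max_iter: int = 200, tol: float = 1e-6
) -> Tuple[List[List[float]], List[float]]:
    """Scale a symmetric contact matrix until all non-empty bins have equal coverage.

    Returns the corrected matrix and the accumulated bias of every bin.
    """
    n = len(matrix)
    work = [row[:] for row in matrix]
    bias = [1.0] * n
    for _ in range(max_iter):
        sums = [sum(row) for row in work]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Coverage of each bin relative to the mean; empty bins stay untouched.
        corr = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                work[i][j] /= corr[i] * corr[j]
        bias = [b * c for b, c in zip(bias, corr)]
        if max(abs(c - 1.0) for c in corr) < tol:
            break
    return work, bias
```

Multiplying a corrected entry by the two bin biases recovers the raw count, which is why the biases can be kept alongside the matrix and reused later.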
- A/B compartment
The A/B compartments were identified at a 150 kb resolution by PCA using the matrix2compartment function in cworld-dekker. The A compartment has positive PC1 values and is characterized by high gene density and GC content; the B compartment has negative PC1 values and the opposite characteristics.
# convert matrix from HiC-Pro software output to dense format matrix of each chromosome (for example with 150 kb resolution)
python sparseToDense.py -b sample_150000_abs.bed sample_150000_iced.matrix --perchr
# convert matrix from HiC-Pro software output to insulation format matrix
python runchangematrix.insulation.py -i sample_150000_iced_CM029833.1_dense.matrix -g Hu -c CM029833.1 -o 150000 -s 150000
## -i: dense-format matrix file of a single chromosome at 150 kb resolution
## -g: species name
## -c: chromosome identifier
## -o: output prefix
## -s: resolution
# performs a PCA analysis on the input matrix
perl matrix2compartment.pl -i Hu_CM029833.1_150000.insulation.matrix -o Hu_CM029833.1_150000 --et
python matrix2EigenVectors.py -i Hu_CM029833.1_150000.zScore.matrix.gz -r util/geneDensity/Hu.refseq.txt -v
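The PCA step behind compartment calling can be sketched in pure Python as follows. This is an illustration under assumptions, not the cworld-dekker code: the matrix is normalized by genomic distance, rows are correlated, the leading eigenvector of the correlation matrix is taken by power iteration, and its sign is set so that it correlates positively with a gene density track. All function names are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def observed_over_expected(matrix: Matrix) -> Matrix:
    """Divide each contact by the mean contact at the same genomic distance."""
    n = len(matrix)
    expected = []
    for d in range(n):
        diag = [matrix[i][i + d] for i in range(n - d)]
        expected.append(sum(diag) / len(diag))
    return [
        [matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom > 0 else 0.0


def leading_eigenvector(sym: Matrix, iterations: int = 500) -> List[float]:
    """Power iteration on a symmetric positive semi-definite matrix."""
    rng = random.Random(0)
    vec = [rng.random() + 0.1 for _ in sym]
    for _ in range(iterations):
        nxt = [sum(a * b for a, b in zip(row, vec)) for row in sym]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return vec


def compartment_pc1(matrix: Matrix, gene_density: List[float]) -> List[float]:
    """PC1 per bin, oriented so that positive values follow gene density (A)."""
    oe = observed_over_expected(matrix)
    corr = [[pearson(r1, r2) for r2 in oe] for r1 in oe]
    pc1 = leading_eigenvector(corr)
    if pearson(pc1, gene_density) < 0:
        pc1 = [-v for v in pc1]
    return pc1
```

The orientation step matters because the sign of an eigenvector is arbitrary; without an external track such as gene density, A and B could be swapped between chromosomes.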
- TAD
TAD boundaries at a 40 kb resolution were identified using the matrix2insulation function in cworld-dekker.
# calculate the insulation score to identify TAD boundaries; the output file contains the TAD boundary information
perl -I /project/software/cworld-dekker-master/lib/ matrix2insulation.pl -i Hu_CM029833.1_40000.insulation.matrix --is 1000000 --ids 240000
# extract TAD region from TAD boundary
perl boundaries2bed.pl Hu_CM029833.1_40000.insulation--is1000000--nt0--ids240000--ss0--immean.insulation.boundaries 83093940 CM029833.1 > CM029833.1.tad.bed
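As a rough illustration of the insulation approach, the Python sketch below slides a square along the diagonal, averages the contacts between the bins upstream and downstream of each position, log2-normalizes the result by the chromosome-wide mean, and reports local minima as boundaries. It simplifies the cworld-dekker procedure (no delta vector, no boundary strength filter), and the function names are assumptions. With the settings above, `--is 1000000` at 40 kb resolution would correspond to `window_bins = 25`.

```python
import math
from typing import List, Optional


def insulation_profile(
    matrix: List[List[float]], window_bins: int
) -> List[Optional[float]]:
    """log2(insulation / mean insulation) per bin; None where the square does not fit."""
    n = len(matrix)
    raw: List[Optional[float]] = [None] * n
    for i in range(window_bins, n - window_bins):
        # Contacts between the window upstream and the window downstream of bin i.
        total = sum(
            matrix[a][b]
            for a in range(i - window_bins, i)
            for b in range(i + 1, i + window_bins + 1)
        )
        raw[i] = total / (window_bins * window_bins)
    positive = [s for s in raw if s is not None and s > 0]
    if not positive:
        return [None] * n
    mean = sum(positive) / len(positive)
    return [math.log2(s / mean) if s is not None and s > 0 else None for s in raw]


def call_boundaries(profile: List[Optional[float]]) -> List[int]:
    """Bins that are local minima of the profile and lie below the mean."""
    boundaries = []
    for i in range(1, len(profile) - 1):
        left, mid, right = profile[i - 1], profile[i], profile[i + 1]
        if left is None or mid is None or right is None:
            continue
        if mid < 0 and mid < left and mid <= right:
            boundaries.append(i)
    return boundaries
```

Consecutive boundaries then delimit the TAD regions, which is the conversion that boundaries2bed.pl performs on the real output.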
- loop
Inter-chromosomal significant interactions at a 20 kb resolution were identified using FitHiC.
# change format to fithic input format from HiC-Pro software output
python hicpro2fithic.py -i sample_20000.matrix -b sample_20000_abs.bed -s sample_20000_iced.matrix.biases
# calculate significant_interactions by fithic software
fithic -f fithic.fragmentMappability.gz -i fithic.interactionCounts.gz -t fithic.biases.gz -o hu_fithic -l hjf -v -x All -r 20000
# filter significant interactions by p-value, q-value and contact count
python runfilterpvalue.qvalue.count.py hu_fithic/hjf.spline_pass1.res20000.significances.txt.gz hjf.spline_pass1.res20000.significances.txt.bed
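A filter of this kind might look like the following Python sketch. It assumes that the FitHiC significances file is a gzipped, tab-separated table with the header columns chr1, fragmentMid1, chr2, fragmentMid2, contactCount, p-value and q-value. The thresholds are placeholders, since the cut-offs used by runfilterpvalue.qvalue.count.py are not stated here, and the function names are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterable, Iterator, Tuple

Interaction = Tuple[str, int, str, int, int, float, float]


def read_significances(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dictionary per row of a gzipped FitHiC significances file."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def filter_interactions(
    rows: Iterable[Dict[str, str]],
    max_p: float = 0.01,   # placeholder threshold
    max_q: float = 0.01,   # placeholder threshold
    min_count: int = 2,    # placeholder threshold
) -> Iterator[Interaction]:
    """Keep interactions passing the p-value, q-value and contact count cut-offs."""
    for row in rows:
        p_value = float(row["p-value"])
        q_value = float(row["q-value"])
        count = int(row["contactCount"])
        if p_value <= max_p and q_value <= max_q and count >= min_count:
            yield (
                row["chr1"], int(row["fragmentMid1"]),
                row["chr2"], int(row["fragmentMid2"]),
                count, p_value, q_value,
            )
```

Because the filter works on any iterable of dictionaries, it can be checked on a few hand-written rows before being run on the full output.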
- pyGenomeTracks was used to visualize the Hi-C matrices.
- Seurat is an R package designed for the analysis of single-cell RNA-seq data.
- In this study, the in-house script is stored in script/scRNA-seq/scRNA-seq.R.