xiaolongliang/TibetanSheep_SVs

Structural variation

ONT-based SV identification

NGMLR was used to map the raw reads to the reference genome, and Sniffles was used to call SVs. A tutorial is available here: SV calling with Sniffles

calculate the frequency differences of SVs between populations

python SV_Vcf_FilterBy_SampleFreq.py sheep 0.4 Tibet_Hu.ONT.anno.vcf
- 0.4: frequency threshold for the SV being missing in the specified samples
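
The idea of comparing SV frequencies between populations can be sketched as below. This is a minimal illustration, not the logic of SV_Vcf_FilterBy_SampleFreq.py: the function names, the sample names, and the choice of carrier frequency (samples with at least one alternate allele, ignoring missing genotypes) are all assumptions made for the example.

```python
MISSING = {"./.", ".|.", "."}

def sv_frequency(genotypes):
    """Fraction of called samples that carry the SV (at least one alt allele).

    Returns None when every genotype is missing.
    """
    called = [g for g in genotypes if g not in MISSING]
    if not called:
        return None
    carriers = sum(1 for g in called if "1" in g)
    return carriers / len(called)

def frequency_difference(sample_genotypes, pop_a, pop_b):
    """Carrier frequency in pop_a minus carrier frequency in pop_b."""
    freq_a = sv_frequency([sample_genotypes[s] for s in pop_a])
    freq_b = sv_frequency([sample_genotypes[s] for s in pop_b])
    if freq_a is None or freq_b is None:
        return None
    return freq_a - freq_b

# Hypothetical genotypes for one SV record (sample names are made up).
record = {"T1": "1/1", "T2": "0/1", "T3": "./.", "H1": "0/0", "H2": "0/0"}
diff = frequency_difference(record, ["T1", "T2", "T3"], ["H1", "H2"])
```

An SV with a large absolute difference is a candidate population-specific variant; the cutoff to apply is a study decision and is not implied by this sketch.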

annotate the selected SVs

python SV_GeneAnno_2023.py sheep_sv_info.txt

population-scale short-read SV identification

Population-scale short-read SVs were used to verify the frequencies of the SVs identified from ONT data

Manta
Delly
Lumpy

large inversion

identify genomic rearrangements from whole-genome alignments

nucmer --maxmatch -c 100 -b 500 -l 50 refgenome qrygenome       # Whole genome alignment. 
delta-filter -m -i 90 -l 100 out.delta > out.filtered.delta     # Remove small and lower quality alignments
show-coords -THrd out.filtered.delta > out.filtered.coords      # Convert alignment information to a .TSV format as required by SyRI
syri -c out.filtered.coords -d out.filtered.delta -r refgenome -q qrygenome

filter inversions longer than 1 Mb
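
A minimal sketch of this length filter is shown below. It assumes the tab-separated syri.out layout in which columns 1 to 3 hold the reference chromosome, start, and end, and column 11 holds the annotation type (`INV` for inversions). The function name and the example rows are hypothetical.

```python
def large_inversions(syri_lines, min_len=1_000_000):
    """Return (chrom, start, end, length) for INV rows longer than min_len.

    Length is measured on the reference coordinates.
    """
    hits = []
    for line in syri_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed rows and every annotation type except inversions.
        if len(fields) < 11 or fields[10] != "INV":
            continue
        start, end = int(fields[1]), int(fields[2])
        length = abs(end - start) + 1
        if length > min_len:
            hits.append((fields[0], start, end, length))
    return hits

# Made-up rows in the assumed syri.out column order.
example = [
    "Chr1\t100\t2000100\t-\t-\tChr1\t2000200\t150\tINV1\t-\tINV\t-",
    "Chr1\t10\t5000\t-\t-\tChr1\t5100\t20\tINV2\t-\tINV\t-",
    "Chr2\t1\t3000000\t-\t-\tChr2\t1\t3000000\tSYN1\t-\tSYN\t-",
]
inversions = large_inversions(example)
```

On a real run the lines would come from the syri.out file written by the `syri` command above, for example `large_inversions(open("syri.out"))`.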
verify by contig

de novo assembly of the Tibetan sheep genome at contig level, saved as Tibet4_Flye_contig.fasta

flye --nano-raw Tibetan.fq.gz --out-dir Flye_assembly --threads 60

The contig-level genome of Tibetan sheep was aligned to the Hu sheep reference genome

nucmer -p Tibetan-Hu Hu.fa Tibet4_Flye_contig.fasta
delta-filter -l 10000 -q Tibetan-Hu.delta > Tibetan-Hu.filter.delta
show-coords -q -T -o Tibetan-Hu.filter.delta > Tibetan-Hu.txt
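
One way to read the resulting coordinates is to check how much of a candidate inversion interval is covered by reverse-oriented contig alignments. The sketch below assumes each alignment has already been reduced to the first four show-coords columns (S1, E1, S2, E2) and that a reverse-strand match is reported with S2 greater than E2. The function name and the example numbers are hypothetical, and overlapping alignments are counted more than once.

```python
def reverse_aligned_bases(coords_rows, region_start, region_end):
    """Sum reference bases inside the region that lie in reverse alignments.

    coords_rows: iterable of (s1, e1, s2, e2) integer tuples.
    """
    total = 0
    for s1, e1, s2, e2 in coords_rows:
        if s2 <= e2:
            continue  # forward-oriented alignment, not evidence for an inversion
        lo = max(min(s1, e1), region_start)
        hi = min(max(s1, e1), region_end)
        if hi >= lo:
            total += hi - lo + 1
    return total

# Made-up alignments: one forward, two reverse.
rows = [(1, 1000, 1, 1000), (2000, 3000, 5000, 4000), (3500, 4000, 3900, 3400)]
covered = reverse_aligned_bases(rows, 2500, 3800)
```

A single contig whose reverse alignment spans both breakpoints is stronger support than the summed coverage alone, so this number is best treated as a first screen.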
visualizing

verify inversions in the population by local PCA

bcftools view CM029825.1.vcf.gz -O b -o CM029825.1.bcf.gz   # convert vcf to bcf for chromosome CM029825.1
Rscript localpca.R CM029825.1                               # perform local PCA
Rscript plot.R                                              # visualize

Hi-C data analysis

The HiC-Pro pipeline was used to map the raw Hi-C reads to the reference genome to obtain the raw contact matrix, which was then normalized by ICE (iterative correction and eigenvector decomposition).
A/B compartment
The A/B compartments were identified at a 150 kb resolution by PCA using the matrix2compartment function in cworld-dekker. Bins with positive PC1 were assigned to the A compartment, which is characterized by high gene density and GC content; bins with negative PC1 were assigned to the B compartment, which shows the opposite features.

# convert matrix from HiC-Pro software output to dense format matrix of each chromosome (for example with 150 kb resolution)
python sparseToDense.py -b sample_150000_abs.bed sample_150000_iced.matrix --perchr

# convert matrix from HiC-Pro software output to insulation format matrix
python runchangematrix.insulation.py -i sample_150000_iced_CM029833.1_dense.matrix -g Hu -c CM029833.1 -o 150000 -s 150000
## -i with each chromosome 150kb resolution dense format matrix file
## -g with species name
## -c with chromosome number
## -o with output prefix
## -s with resolution

# performs a PCA analysis on the input matrix
perl matrix2compartment.pl -i Hu_CM029833.1_150000.insulation.matrix -o Hu_CM029833.1_150000 --et
python matrix2EigenVectors.py -i Hu_CM029833.1_150000.zScore.matrix.gz -r util/geneDensity/Hu.refseq.txt -v
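
Because the sign of an eigenvector is arbitrary, PC1 has to be oriented before "positive PC1" can mean the A compartment. The sketch below illustrates that step under the assumption that per-bin gene density is available: PC1 is flipped if it is negatively associated with gene density, then bins are labelled by sign. It is not the cworld-dekker implementation, and the function name and example values are hypothetical.

```python
def label_compartments(pc1, gene_density):
    """Label each bin 'A' or 'B' from PC1, oriented by gene density.

    Bins whose oriented PC1 is exactly zero are labelled 'B'.
    """
    n = len(pc1)
    mean_pc1 = sum(pc1) / n
    mean_density = sum(gene_density) / n
    # The sign of the covariance tells us whether PC1 needs to be flipped.
    covariance = sum(
        (p - mean_pc1) * (g - mean_density) for p, g in zip(pc1, gene_density)
    )
    sign = -1.0 if covariance < 0 else 1.0
    return ["A" if sign * p > 0 else "B" for p in pc1]

# Made-up values: gene-rich bins happen to have negative PC1, so it is flipped.
labels = label_compartments([-0.5, -0.2, 0.3, 0.4], [10, 8, 1, 2])
```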

TAD
TAD boundaries at a 40 kb resolution were identified using the matrix2insulation function in cworld-dekker.

# calculate the insulation score to identify TAD boundaries; the output file contains the TAD boundary information
perl -I /project/software/cworld-dekker-master/lib/ matrix2insulation.pl -i Hu_CM029833.1_40000.insulation.matrix --is 1000000 --ids 240000
# extract TAD region from TAD boundary
perl boundaries2bed.pl Hu_CM029833.1_40000.insulation--is1000000--nt0--ids240000--ss0--immean.insulation.boundaries 83093940 CM029833.1 > CM029833.1.tad.bed
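
The insulation score idea can be sketched as follows. For each bin, the mean contact frequency is taken inside a square that links the bins upstream of it to the bins downstream of it; a low value means few contacts cross that bin, which is what a boundary looks like. At 40 kb resolution, `--is 1000000` corresponds to a square of 25 bins and `--ids 240000` to 6 bins. This is a simplified illustration on a plain list-of-lists matrix, not cworld-dekker's code: it calls boundaries as simple local minima rather than with a delta vector, and the function names are hypothetical.

```python
def insulation_scores(matrix, window):
    """Mean contact count in the window x window square crossing each bin.

    Bins closer than `window` to either end get None.
    """
    n = len(matrix)
    scores = []
    for i in range(n):
        if i - window < 0 or i + window >= n:
            scores.append(None)
            continue
        values = [
            matrix[a][b]
            for a in range(i - window, i)
            for b in range(i + 1, i + window + 1)
        ]
        scores.append(sum(values) / len(values))
    return scores

def local_minima(scores):
    """Indices whose score is lower than the left and not higher than the right neighbour."""
    minima = []
    for i in range(1, len(scores) - 1):
        left, mid, right = scores[i - 1], scores[i], scores[i + 1]
        if None in (left, mid, right):
            continue
        if mid < left and mid <= right:
            minima.append(i)
    return minima

# Toy matrix with two domains (bins 0-3 and 4-7): 10 contacts inside, 1 between.
toy = [[10 if (a < 4) == (b < 4) else 1 for b in range(8)] for a in range(8)]
toy_scores = insulation_scores(toy, window=2)
toy_boundaries = local_minima(toy_scores)
```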

loop
Inter-chromosomal significant interactions at a 20 kb resolution were identified using FitHiC.

# change format to fithic input format from HiC-Pro software output
python hicpro2fithic.py -i sample_20000.matrix -b sample_20000_abs.bed -s sample_20000_iced.matrix.biases
# calculate significant_interactions by fithic software
fithic -f fithic.fragmentMappability.gz -i fithic.interactionCounts.gz -t fithic.biases.gz -o hu_fithic -l hjf -v -x All -r 20000
# filter significant interactions by pvalue,qvalue and count number
python runfilterpvalue.qvalue.count.py hu_fithic/hjf.spline_pass1.res20000.significances.txt.gz hjf.spline_pass1.res20000.significances.txt.bed
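
The filtering step can be sketched as below. This is not the content of runfilterpvalue.qvalue.count.py: the thresholds are placeholder values, the function name is hypothetical, and the column order (chr1, fragmentMid1, chr2, fragmentMid2, contactCount, p-value, q-value) is an assumption about the FitHiC significances file.

```python
def filter_interactions(lines, max_p=0.01, max_q=0.01, min_count=2):
    """Keep interactions that pass the p-value, q-value and count thresholds."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue
        try:
            mid1, mid2 = int(fields[1]), int(fields[3])
            count = int(fields[4])
            p_value, q_value = float(fields[5]), float(fields[6])
        except ValueError:
            continue  # header line or malformed row
        if p_value <= max_p and q_value <= max_q and count >= min_count:
            kept.append((fields[0], mid1, fields[2], mid2, count, p_value, q_value))
    return kept

# Made-up rows in the assumed column order.
example_rows = [
    "chr1\tfragmentMid1\tchr2\tfragmentMid2\tcontactCount\tp-value\tq-value\tbias1\tbias2",
    "CM029833.1\t10000\tCM029833.1\t90000\t12\t1e-6\t1e-4\t1.0\t1.0",
    "CM029833.1\t10000\tCM029833.1\t130000\t1\t1e-6\t1e-4\t1.0\t1.0",
    "CM029833.1\t30000\tCM029833.1\t150000\t8\t0.2\t0.9\t1.0\t1.0",
]
significant = filter_interactions(example_rows)
```

For the gzipped output above, the lines could be read with `gzip.open(path, "rt")` from the standard library.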

pyGenomeTracks was used to visualize the Hi-C matrices.

scRNA-seq

Seurat is an R package designed for the analysis of single-cell RNA-seq data.
In this study, the in-house script is stored in script/scRNA-seq/scRNA-seq.R.
