xiaolongliang / TibetanSheep_SVs

Enhancing high-altitude adaptation in Tibetan sheep through selecting genomic structural variations

Structural variation

ONT-based SV identification

NGMLR was used to map the raw reads to the reference genome, and Sniffles was used to call SVs; a tutorial is available in "SV calling with Sniffles".

  1. Calculate the frequency difference of SVs between populations
python SV_Vcf_FilterBy_SampleFreq.py sheep 0.4 Tibet_Hu.ONT.anno.vcf
- 0.4: threshold for SVs missing in the specified samples
  2. Annotate the selected SVs
python SV_GeneAnno_2023.py sheep_sv_info.txt
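
As a rough illustration of step 1, the sketch below computes the ALT allele frequency of a single SV in two populations from VCF-style genotype strings and flags the SV when the frequencies differ strongly. It is not the logic of SV_Vcf_FilterBy_SampleFreq.py: the function names, the example genotypes, and the `MIN_DIFF` cutoff are all hypothetical.

```python
def alt_allele_frequency(genotypes):
    """Return the ALT allele frequency from VCF-style GT strings such as '0/1'.

    Missing alleles ('.') are skipped; returns None if no allele was called.
    """
    alt = called = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            called += 1
            if allele != "0":
                alt += 1
    return alt / called if called else None


def frequency_difference(pop_a, pop_b):
    """Absolute difference in ALT allele frequency between two populations."""
    freq_a = alt_allele_frequency(pop_a)
    freq_b = alt_allele_frequency(pop_b)
    if freq_a is None or freq_b is None:
        return None
    return abs(freq_a - freq_b)


# Hypothetical genotypes for one SV in two populations.
tibetan = ["1/1", "1/1", "0/1", "./."]
hu = ["0/0", "0/0", "0/1", "0/0"]

MIN_DIFF = 0.5  # hypothetical cutoff, not taken from the repository
diff = frequency_difference(tibetan, hu)
is_differentiated = diff is not None and diff >= MIN_DIFF
```

Skipping missing calls when counting alleles keeps a population with many uncalled samples from being treated as if it carried the reference allele.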

Population-scale short-read SV identification

Population-scale short-read SVs were used to verify the frequency of the SVs identified from ONT data. Three callers were used:

  1. Manta
  2. Delly
  3. Lumpy
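
One common way to decide whether a short-read call and an ONT call describe the same SV is reciprocal overlap. The sketch below is an assumption about how such a comparison could be done, not the procedure used in this repository; the tuple layout, the 50% threshold, and the example coordinates are hypothetical.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the overlap as a fraction of the longer interval (0.0 if disjoint)."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return overlap / max(end_a - start_a, end_b - start_b)


def same_sv(sv_a, sv_b, min_overlap=0.5):
    """Match two (chrom, start, end, svtype) tuples on chromosome, type, and overlap."""
    if sv_a[0] != sv_b[0] or sv_a[3] != sv_b[3]:
        return False
    return reciprocal_overlap(sv_a[1], sv_a[2], sv_b[1], sv_b[2]) >= min_overlap


# Hypothetical calls: one ONT deletion and three short-read candidates.
ont_sv = ("CM029825.1", 10_000, 12_000, "DEL")
short_read_svs = [
    ("CM029825.1", 10_100, 12_050, "DEL"),  # near-identical breakpoints
    ("CM029825.1", 11_900, 20_000, "DEL"),  # overlaps only at the edge
    ("CM029825.1", 10_000, 12_000, "DUP"),  # same interval, different type
]
supported_by = [sv for sv in short_read_svs if same_sv(ont_sv, sv)]
```

Dividing by the longer interval is equivalent to requiring the overlap fraction to hold for both calls. Insertions have no interval length on the reference, so they would need a breakpoint-distance criterion instead.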

Large inversions

  1. Identify genomic rearrangements from whole-genome alignments
nucmer --maxmatch -c 100 -b 500 -l 50 refgenome qrygenome       # Whole genome alignment. 
delta-filter -m -i 90 -l 100 out.delta > out.filtered.delta     # Remove small and lower quality alignments
show-coords -THrd out.filtered.delta > out.filtered.coords      # Convert alignment information to a .TSV format as required by SyRI
syri -c out.filtered.coords -d out.filtered.delta -r refgenome -q qrygenome
  2. Filter inversions longer than 1 Mb
  3. Verify by contig
  • De novo genome assembly of Tibetan sheep at the contig level, saved as Tibet4_Flye_contig.fasta

    flye --nano-raw Tibetan.fq.gz --out-dir Flye_assembly --threads 60

  • The contig-level genome of Tibetan sheep was aligned to the Hu sheep reference genome

    nucmer -p Tibetan-Hu Hu.fa Tibet4_Flye_contig.fasta
    delta-filter -l 10000 -q Tibetan-Hu.delta > Tibetan-Hu.filter.delta
    show-coords -q Tibetan-Hu.filter.delta -T -o Tibetan-Hu.txt

  • visualizing

  4. Verify the inversion in the population by local PCA

    bcftools view CM029825.1.vcf.gz -O b -o CM029825.1.bcf.gz   # convert VCF to BCF for chromosome CM029825.1
    Rscript localpca.R CM029825.1                               # perform local PCA
    Rscript plot.R                                              # visualize
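
The length filter on inversions (keeping only those longer than 1 Mb) can be sketched as below. This is a minimal example rather than the script used in this study. It assumes SyRI's tab-separated output, with the reference chromosome, start, and end in columns 1 to 3 and the annotation type in column 11; the example rows and the function name are hypothetical.

```python
MIN_LENGTH = 1_000_000  # 1 Mb length cutoff


def large_inversions(lines, min_length=MIN_LENGTH):
    """Yield (chrom, start, end, length) for inversions longer than min_length.

    Assumes SyRI-style tab-separated rows: reference chromosome, start, and end
    in columns 1-3 and the annotation type (e.g. 'INV') in column 11.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Check the type first so rows with '-' coordinates are never parsed.
        if len(fields) < 11 or fields[10] != "INV":
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        length = end - start + 1
        if length > min_length:
            yield chrom, start, end, length


# Hypothetical rows in the assumed 12-column layout.
rows = [
    "CM029825.1\t5000000\t7500000\t-\t-\tchr1\t5100000\t7600000\tINV1\t-\tINV\t-",
    "CM029825.1\t9000000\t9200000\t-\t-\tchr1\t9100000\t9300000\tINV2\t-\tINV\t-",
    "CM029825.1\t1\t4000000\t-\t-\tchr1\t1\t4000000\tSYN1\t-\tSYN\t-",
]
kept = list(large_inversions(rows))
```

Because the function accepts any iterable of lines, an open file handle for the SyRI output can be passed in place of `rows`.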

Hi-C data analysis

  1. The HiC-Pro pipeline was used to map the raw Hi-C reads to the reference genome to obtain the raw contact matrix, which was then normalized by ICE (iterative correction and eigenvector decomposition).

  2. A/B compartment
    The A/B compartments were identified at a 150 kb resolution by PCA using the matrix2compartment function in cworld-dekker. Bins with positive PC1 values, characterized by high gene density and GC content, were assigned to the A compartment; bins with negative PC1 values, which show the opposite characteristics, were assigned to the B compartment.

# convert matrix from HiC-Pro software output to dense format matrix of each chromosome (for example with 150 kb resolution)
python sparseToDense.py -b sample_150000_abs.bed sample_150000_iced.matrix --perchr

# convert matrix from HiC-Pro software output to insulation format matrix
python runchangematrix.insulation.py -i sample_150000_iced_CM029833.1_dense.matrix -g Hu -c CM029833.1 -o 150000 -s 150000
## -i with each chromosome 150kb resolution dense format matrix file
## -g with species name
## -c with chromosome number
## -o with output prefix
## -s with resolution

# performs a PCA analysis on the input matrix
perl matrix2compartment.pl -i Hu_CM029833.1_150000.insulation.matrix -o Hu_CM029833.1_150000 --et
python matrix2EigenVectors.py -i Hu_CM029833.1_150000.zScore.matrix.gz -r util/geneDensity/Hu.refseq.txt -v
  3. TAD
    TAD boundaries at a 40 kb resolution were identified using the matrix2insulation function in cworld-dekker.
# calculate the insulation score to identify TAD boundaries; the output file contains the TAD boundary information
perl -I /project/software/cworld-dekker-master/lib/ matrix2insulation.pl -i Hu_CM029833.1_40000.insulation.matrix --is 1000000 --ids 240000
# extract TAD region from TAD boundary
perl boundaries2bed.pl Hu_CM029833.1_40000.insulation--is1000000--nt0--ids240000--ss0--immean.insulation.boundaries 83093940 CM029833.1 > CM029833.1.tad.bed
  4. Loop
    Inter-chromosomal significant interactions at a 20 kb resolution were identified using FitHiC.
# convert the HiC-Pro output to the FitHiC input format
python hicpro2fithic.py -i sample_20000.matrix -b sample_20000_abs.bed -s sample_20000_iced.matrix.biases
# calculate significant_interactions by fithic software
fithic -f fithic.fragmentMappability.gz -i fithic.interactionCounts.gz -t fithic.biases.gz -o hu_fithic -l hjf -v -x All -r 20000
# filter significant interactions by p-value, q-value, and contact count
python runfilterpvalue.qvalue.count.py hu_fithic/hjf.spline_pass1.res20000.significances.txt.gz hjf.spline_pass1.res20000.significances.txt.bed
  5. pyGenomeTracks was used to visualize the Hi-C matrices.
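
The sign convention for A/B compartments (positive PC1 for the gene-dense A compartment) can be illustrated with a small sketch. The sign of a principal component is arbitrary, so PC1 is commonly oriented against gene density before labelling. The code below is only an illustration of that convention, not the cworld-dekker implementation; the function names and the example values are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def label_compartments(pc1, gene_density):
    """Orient PC1 so it correlates positively with gene density, then label bins."""
    if pearson(pc1, gene_density) < 0:
        pc1 = [-value for value in pc1]
    return ["A" if value > 0 else "B" for value in pc1]


# Hypothetical PC1 values and gene counts for six 150 kb bins.
pc1 = [-0.8, -0.5, 0.1, 0.6, 0.9, -0.2]
gene_density = [12, 9, 3, 1, 0, 5]
labels = label_compartments(pc1, gene_density)
```

In this example PC1 is negatively correlated with gene density, so its sign is flipped before the bins are labelled.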

scRNA-seq

  • Seurat is an R package designed for the analysis of single-cell RNA-seq data.
  • The in-house script used in this study is stored in script/scRNA-seq/scRNA-seq.R.
