yeeus / GCI

A program for assessing the T2T genome continuity

Genome Continuity Inspector

About

Genome Continuity Inspector (GCI) is an assembly assessment tool for high-quality genomes (e.g. T2T genomes), in base resolution. After stringently filtering the alignments generated by mapping long reads (PacBio HiFi and/or Oxford Nanopore long reads) back to the genome assembly, GCI will report potential assembly issues and also a score to quantify the continuity of assembly.

https://github.com/yeeus/GCI/images/pipeline.png

Requirements

GCI requires:

  • python3.x (tested in python3.10)
  • pysam (stable version)
  • biopython (stable version)
  • numpy (stable version)
  • matplotlib (stable version)
  • bamsnap (for plotting in utility filter_bam.py)

Parameters

python GCI.py --help

usage: GCI.py [-r FILE] [--hifi  [...]] [--nano  [...]] [--chrs] [-R FILE] [-ts INT] [-dp FLOAT] [-d PATH] [-o STR] [-mq INT] [--mq-cutoff INT] [-ip FLOAT] [-op FLOAT] [-cp FLOAT] [-fl INT] [-p]
              [-dmin FLOAT] [-dmax FLOAT] [-ws INT] [-it STR] [-f] [-h] [-v]

A program for assessing the T2T genome

Input/Output:
  -r FILE, --reference FILE
                        The reference file
  --hifi  [ ...]        PacBio HiFi reads alignment files (at least one bam file)
  --nano  [ ...]        Oxford Nanopore long reads alignment files (at least one bam file)
  --chrs                A list of chromosomes separated by comma
  -R FILE, --regions FILE
                        Bed file containing regions
                        Be cautious! If both specify `--chrs` and `--regions`, chromosomes in regions bed file should be included in the chromosomes list
  -ts INT, --threshold INT
                        The threshold of depth to be reported as issues [0]
  -dp FLOAT, --dist-percent FLOAT
                        The distance between the candidate gap intervals for combining in chromosome units [0.005]
  -d PATH               The directory of output files [.]
  -o STR, --output STR  Prefix of output files [GCI]

Filter Options:
  -mq INT, --map-qual INT
                        Minium mapping quality for alignments [30]
  --mq-cutoff INT       The cutoff of mapping quality for keeping the alignment [50]
                        (only used when inputting more than one alignment files)
  -ip FLOAT, --iden-percent FLOAT
                        Minimum identity (num_match_res/len_aln) of alignments [0.9]
  -op FLOAT, --ovlp-percent FLOAT
                        Minimum overlapping percentage of the same read alignment if inputting more than one alignment files [0.9]
  -cp FLOAT, --clip-percent FLOAT
                        Maximum clipped percentage of the alignment [0.1]
  -fl INT, --flank-len INT
                        The flanking length of the clipped bases [15]

Plot Options:
  -p, --plot            Visualize the finally filtered whole genome (and regions if providing the option `-R`) depth [False]
  -dmin FLOAT, --depth-min FLOAT
                        Minimum depth in folds of mean coverage for plotting [0.1]
  -dmax FLOAT, --depth-max FLOAT
                        Maximum depth in folds of mean coverage for plotting [4.0]
  -ws INT, --window-size INT
                        The window size when plotting [50000]
  -it STR, --image-type STR
                        The format of the output images: png or pdf [png]

Other Options:
  -f, --force           Force rewriting of existing files [False]
  -h, --help            Show this help message and exit
  -v, --version         Show program's version number and exit

Examples:
python GCI.py -r ref.fa --hifi hifi.bam hifi.paf ... --nano nano.bam nano.paf ...
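
The filter options above can be read as three per-alignment cutoffs. The following is only a minimal sketch of that idea using the documented defaults (`-mq 30`, `-ip 0.9`, `-cp 0.1`); the `Alignment` class and `passes_filters` function are hypothetical names that do not come from GCI, and measuring the clipped fraction against the read length is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one read alignment (not a GCI data structure)."""
    mapq: int      # mapping quality
    n_match: int   # number of matching residues (num_match_res)
    aln_len: int   # alignment length (len_aln)
    clipped: int   # total clipped bases
    read_len: int  # full read length


def passes_filters(aln, min_mapq=30, min_identity=0.9, max_clip=0.1):
    """Return True if the alignment survives all three cutoffs."""
    if aln.mapq < min_mapq:
        return False
    # Identity as defined in the help text: num_match_res / len_aln.
    if aln.n_match / aln.aln_len < min_identity:
        return False
    # Assumption: the clipped percentage is relative to the read length.
    if aln.clipped / aln.read_len > max_clip:
        return False
    return True


good = Alignment(mapq=60, n_match=14850, aln_len=15000, clipped=100, read_len=15100)
low_identity = Alignment(mapq=60, n_match=12000, aln_len=15000, clipped=0, read_len=15000)
print(passes_filters(good), passes_filters(low_identity))
```

GCI.py applies further rules (for example `--mq-cutoff`, `-op` and `-fl` when several alignment files are given) that this sketch does not cover.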

Usage

  1. (For haplotype-resolved genome) Prepare parental (specific) reads (if parental sequencing data are available, please skip this step)
# we recommend using canu for binning
# (to bin ONT reads, replace `-pacbio-hifi $hifi` with `-nanopore $ont`)
canu -haplotype \
    -p $prefix -d $directory \
    genomeSize=3g \
    maxThreads=$threads \
    -haplotypePat $pat \
    -haplotypeMat $mat \
    -pacbio-hifi $hifi \
    useGrid=false

# because some unknown reads couldn't be reliably binned, we suggest combining them with the haplotype-specific reads
cat ${canu_mat.fa.gz} ${canu_unknown.fa.gz} > ${canu_mat.final.fa.gz}
cat ${canu_pat.fa.gz} ${canu_unknown.fa.gz} > ${canu_pat.final.fa.gz}
  2. Map HiFi and/or ONT reads to assemblies (using minimap2 and winnowmap)
# minimap2 
minimap2 -t $threads -ax map-hifi $mat_asm $mat_hifi > ${mat.minimap2.hifi.sam}   ## mapping ONT reads with -ax map-ont
paftools.js sam2paf ${mat.minimap2.hifi.sam} | sort -k6,6V -k8,8n > ${mat.minimap2.hifi.paf} ## please sort the paf file, because GCI does not automatically sort it by target name!

# winnowmap
meryl count k=15 output $mat_merylDB $mat_asm
meryl print greater-than distinct=0.9998 $mat_merylDB > $mat_repetitive_k15.txt
winnowmap -W $mat_repetitive_k15.txt -ax map-pb $mat_asm $mat_hifi > ${mat.winnowmap.hifi.sam}   ## mapping ONT reads with -ax map-ont

samtools view -@ $threads -Sb ${mat.winnowmap.hifi.sam} | samtools sort -@ $threads -o ${mat.winnowmap.hifi.bam}
samtools index ${mat.winnowmap.hifi.bam} ## this is necessary!!!
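
Since the paf file has to be sorted before it is given to GCI, a quick check can catch an unsorted file early. The sketch below assumes the standard PAF column layout (target name in column 6, target start in column 8); `paf_is_sorted` and `rec` are hypothetical helpers, not part of GCI. It only verifies that records of the same target are contiguous and ordered by start; it does not verify the version-sort order of target names produced by `sort -k6,6V`.

```python
def paf_is_sorted(lines):
    """Check that PAF records are grouped by target and ordered by target start."""
    seen = set()
    current = None
    last_start = -1
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        target, start = fields[5], int(fields[7])
        if target != current:
            if target in seen:
                return False  # the same target appears in two separate blocks
            seen.add(target)
            current, last_start = target, -1
        if start < last_start:
            return False  # starts go backwards within one target
        last_start = start
    return True


def rec(target, start):
    """Build a toy 12-column PAF record for demonstration."""
    cols = ["read", "1000", "0", "1000", "+", target, "50000",
            str(start), str(start + 1000), "990", "1000", "60"]
    return "\t".join(cols)


sorted_paf = [rec("chr1", 10), rec("chr1", 500), rec("chr2", 0)]
unsorted_paf = [rec("chr1", 500), rec("chr2", 0), rec("chr1", 10)]
print(paf_is_sorted(sorted_paf), paf_is_sorted(unsorted_paf))
```

For a real file, the same function can be applied to an open file handle, for example `paf_is_sorted(open(path))`.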
  3. Filter the mapping files and get the genome continuity index

We recommend providing only one alignment file per aligner (minimap2 and winnowmap), generated from the same set of long reads. Importantly, each type of long reads needs at least one bam file, so providing only paf files will cause errors.

# Before this, make sure you've generated the index file (.bai) for each bam file
# We recommend providing one bam file and one paf file produced by two different aligners (for example, one bam file from winnowmap and one paf file from minimap2)
# PDF is recommended because PNG files may lose some detail, although GCI outputs png files by default
python GCI.py -r ref.fa --hifi hifi.bam hifi.paf --nano ont.bam ont.paf -d mat -o mat -p -it pdf ...
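
The two input requirements above (at least one bam file per read type, and an index for every bam file) can be checked before launching a long run. This is a minimal sketch, not GCI's own validation; `check_alignment_inputs` is a hypothetical helper, and accepting either `x.bam.bai` or `x.bai` as the index name is an assumption.

```python
import os


def check_alignment_inputs(paths):
    """Return a list of problems found for one read type's alignment files."""
    problems = []
    bams = [p for p in paths if p.endswith(".bam")]
    if not bams:
        problems.append("no bam file given for this read type")
    for bam in bams:
        # Assumption: the index sits next to the bam as x.bam.bai or x.bai.
        candidates = (bam + ".bai", bam[:-4] + ".bai")
        if not any(os.path.exists(c) for c in candidates):
            problems.append(f"missing index for {bam}")
    return problems


# Only a paf file: reported as a problem because a bam file is required.
print(check_alignment_inputs(["hifi.paf"]))
```

An empty list means both requirements are met for that read type.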

Outputs

Providing only one type of reads

  • ${directory}/
    • (${prefix}.gaps.bed) ## the positions of Ns in the assembly

    • ${prefix}.depth.gz ## the gzipped whole-genome depth file

    • ${prefix}.${threshold}.depth.bed ## the merged depth file in bed format

    • ${prefix}.gci ## an index file containing expected N50, observed N50, expected number of contigs, observed number of contigs, genome continuity index of each chromosome and reads type

    • (${prefix}.regions.gci) ## gci file for regions (if providing the parameter -R)

    • (images/) ## the depth plots across the whole chromosome (if providing the parameter -p)

      • ${prefix}.${target}.${image_type}

      https://github.com/yeeus/GCI/images/MH63.chr01.png

      • (${prefix}.${target}:${start}-${end}.${image_type}) (if providing the parameter -R)

      https://github.com/yeeus/GCI/images/MH63.Chr02_MH63:13000000-15000000.png
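
To illustrate how a merged depth file such as ${prefix}.${threshold}.depth.bed could be derived from per-base depth, here is a minimal sketch. It assumes that positions with depth at or below the `-ts` threshold become candidate intervals and that candidates closer than `-dp` times the chromosome length are combined; both rules are a reading of the help text, and the function names are hypothetical rather than taken from GCI.

```python
def low_depth_intervals(depths, threshold=0):
    """Return half-open [start, end) runs where depth <= threshold."""
    intervals = []
    start = None
    for pos, depth in enumerate(depths):
        if depth <= threshold:
            if start is None:
                start = pos
        elif start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(depths)))
    return intervals


def merge_close(intervals, max_gap):
    """Combine consecutive intervals separated by at most max_gap bases."""
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Toy chromosome of 20 bases; a larger dist-percent than the default is used
# so that merging is visible at this tiny scale.
depths = [12, 0, 0, 9, 0, 11] + [10] * 12 + [0, 0]
candidates = low_depth_intervals(depths, threshold=0)
merged = merge_close(candidates, max_gap=0.05 * len(depths))
for start, end in merged:
    print(f"chr_toy\t{start}\t{end}")
```

With the default `-dp 0.005`, two candidates on a 100 Mb chromosome would be combined under this assumption when they are at most 500 kb apart.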

Providing two types of reads

  • ${directory}/
    • (${prefix}.gaps.bed)

    • ${prefix}_hifi.depth.gz ## the gzipped whole-genome depth file generated by the hifi alignment file

    • ${prefix}_nano.depth.gz ## the gzipped whole-genome depth file generated by the ont alignment file

    • ${prefix}_two_type.depth.gz ## the gzipped whole-genome depth file generated by the two types of reads alignment file

    • ${prefix}_hifi.${threshold}.depth.bed ## the merged depth file in bed format generated by the hifi alignment file

    • ${prefix}_nano.${threshold}.depth.bed ## the merged depth file in bed format generated by the ont alignment file

    • ${prefix}_two_type.${threshold}.depth.bed ## the merged depth file in bed format generated by the two types of reads alignment file

    • ${prefix}.gci

    • (${prefix}.regions.gci)

    • (images/)

      • ${prefix}.${target}.${image_type}

      https://github.com/yeeus/GCI/images/chm13.chr19.png

      • (${prefix}.${target}:${start}-${end}.${image_type})

      https://github.com/yeeus/GCI/images/arabidopsis.Chr2:0-500000.png
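
The ${prefix}.gci file reports expected and observed N50 values and contig counts. As a rough illustration of those quantities, the sketch below treats every chromosome as one expected contig and splits it at issue intervals to obtain the observed contigs. That interpretation is an assumption, the helper names are hypothetical, and the GCI score formula itself is not reproduced here.

```python
def n50(lengths):
    """Length of the contig at which the running sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def split_at_issues(chrom_len, issues):
    """Return the lengths of the pieces left after removing issue intervals."""
    pieces = []
    cursor = 0
    for start, end in sorted(issues):
        if start > cursor:
            pieces.append(start - cursor)
        cursor = max(cursor, end)
    if cursor < chrom_len:
        pieces.append(chrom_len - cursor)
    return pieces


# Toy data: two chromosomes and one issue interval on chr1.
chrom_lengths = {"chr1": 1000, "chr2": 600}
issues = {"chr1": [(400, 410)], "chr2": []}

expected = list(chrom_lengths.values())
observed = []
for chrom, length in chrom_lengths.items():
    observed.extend(split_at_issues(length, issues[chrom]))

print("expected N50:", n50(expected), "contigs:", len(expected))
print("observed N50:", n50(observed), "contigs:", len(observed))
```

In this toy case a single 10-base issue on chr1 lowers the N50 from 1000 to 590 and raises the contig count from 2 to 3.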

Benchmark

We benchmarked GCI on several genomes:

Genome                   HiFi (depth; GCI)   Nano (depth; GCI)   HiFi + Nano (GCI)
CHM13.v.2.0              ~58x; 41.8259       ~134x; 87.0425      87.0425
CN1.mat.v0.9             ~44x; 22.8391       ~39x; 51.5398       66.7940
CN1.pat.v0.9             ~44x; 22.4743       ~39x; 63.0391       77.8956
HG002.mat.cur.20211005   ~83x; 7.2645        ~257x; 18.3920      18.7177
HG002.pat.cur.20211005   ~83x; 11.9397       ~257x; 27.1588      27.7796
GGswu                    ~51x; 7.9901        ~103x; 30.4181      29.3659
Col-CEN.v1.2             ~90x; 30.7545       ~480x; 99.9999      99.9999
MH63RS3                  ~39x; 49.8945       -                   -

Note: all results were computed using one bam file from winnowmap and one paf file from minimap2.

Utility

  • filter_bam.py

    • Usage

      This script is used to get the filtered bam files (and get the final plots).

      • Example1

        After getting the ${prefix}.${threshold}.depth.bed file, we'd like to get the detailed filtering information. So, we can just extract the alignments in the bed file:

        samtools view -@ $threads -Sb -L ${prefix}.${threshold}.depth.bed ${hifi.bam} > test.bam
        samtools index test.bam
        

        Then we can input the bam file with other paf file(s):

        python filter_bam.py test.bam test.paf -d test -o test ## if no prefix provided, the output file would be `test.filter.bam`
        

        Finally we would get the filtered bam file test.bam. Next, we can visualize the raw and filtered bam files in IGV:

        https://github.com/yeeus/GCI/images/igv_test.png

        The top track is the raw bam file and the bottom track is the filtered one.

        Don't be confused about the results, which are generated by filtering both test.bam and test.paf files, while only the former is plotted here. If there is any contradiction with the final bed file, please check the paf file to figure out the conflict.

      • Example2

        We can immediately visualize the alignments after getting the filtered file without using IGV. We provide a convenient method for visualizing via bamsnap.

        # first install the prerequisites and bamsnap
        # then we can visualize the alignments in one command
        python filter_bam.py test.bam test.paf -d test -o test -p -ref test.fasta -r chrxxx:xxx-xxx  ## Be cautious, in this case, the filtered file is `test.filter.bam`
        

        The result:

        https://github.com/yeeus/GCI/images/bamsnap_test.png

  • GCI_score.py

    • Usage

      This script is used for computing GCI score using depth files generated by GCI.py instead of running it again (#5)

      • Example
      python GCI_score.py -r ref.fa --hifi hifi.depth.gz --nano nano.depth.gz --two-type two_type.depth.gz  ## `--two-type` is recommended which is used to compute the final score (HiFi + Nano) in GCI.py
      

      Feel free to compute score for regions (#5) or specific chromosomes (#3) as in GCI.py

  • plot_depth.py

    • Usage

      This script generates the depth plots as a stand-alone step (#1); pdf output is recommended.

      • Example
      python plot_depth.py -r ref.fasta --hifi hifi.depth.gz --nano nano.depth.gz -it pdf
      

      Unlike GCI.py, this script will output plots only for regions (#5).

  • convert_samtools_depth.py

    • Usage

      This script is used for converting depth file generated by samtools depth into the format compatible with plot_depth.py (#6)

      • Example
      python convert_samtools_depth.py $samtools_depth_file $output_prefix
      

      This will generate $output_prefix.depth.gz which can be used for plot_depth.py
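
The plot options `-ws`, `-dmin` and `-dmax` suggest that depth is averaged in windows and bounded relative to the mean coverage before plotting. The sketch below shows that idea on a toy depth list; whether GCI clips values or only sets axis limits is not stated in this README, so the clipping is an assumption, and `window_means` and `clip_to_folds` are hypothetical names.

```python
def window_means(depths, window_size):
    """Average per-base depth in consecutive windows (the last one may be shorter)."""
    means = []
    for i in range(0, len(depths), window_size):
        chunk = depths[i:i + window_size]
        means.append(sum(chunk) / len(chunk))
    return means


def clip_to_folds(means, mean_coverage, dmin=0.1, dmax=4.0):
    """Clip window means to [dmin, dmax] folds of the mean coverage."""
    low, high = dmin * mean_coverage, dmax * mean_coverage
    return [min(max(m, low), high) for m in means]


# Toy data: three windows of four bases each (GCI's default window is 50000).
depths = [30] * 4 + [0] * 4 + [270] * 4
mean_coverage = sum(depths) / len(depths)
means = window_means(depths, window_size=4)
print(clip_to_folds(means, mean_coverage))
```

Here the zero-depth window is raised to 0.1 times the mean coverage, while the other two windows stay unchanged because they lie inside the allowed range.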

Citation

Chen, Quanyu, et al. "GCI: a continuity inspector for complete genome assembly." bioRxiv (2024): 2024-04.

Help

If you run into any problems, please raise an issue first. For other help, please contact quanyu_chen@outlook.com.

To do

  • speed up and reduce memory usage

License:MIT License

