Compare similar sequences (*.fasta) to a reference (.fasta).

See the example report at: https://brwnj.github.io/idplot/

Setup

Nextflow is used to run the pipeline. Installation instructions are available at https://www.nextflow.io/; alternatively, Nextflow can be installed via conda from the bioconda channel. Bioconda provides a complete setup guide at https://bioconda.github.io/user/install.html.

Once conda is installed and your channels are configured, run:

conda install nextflow

Usage

Execute the workflow using Nextflow:

nextflow run brwnj/idplot -latest -with-docker \
    --reference data/MN996532.fasta \
    --fasta 'data/query_seqs/*.fasta'

This generally takes only a few minutes to complete, which enables rapid screening for localized sequence similarity.

Parameters

The reference sequence (--reference) should be a FASTA file containing exactly one sequence. Query sequences (--fasta) may be single-sequence or multi-sequence FASTA files, and you can specify more than one file using wildcards ('*'). Quote the pattern, as in the example above, so that the shell passes it to Nextflow unexpanded.
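Because the reference must hold exactly one sequence, it can be worth checking the file before launching the workflow. The following is a minimal sketch, not part of idplot; the function names `count_fasta_records` and `is_single_sequence` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path


def count_fasta_records(text: str) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def is_single_sequence(path: str) -> bool:
    """Return True if the FASTA file at `path` contains exactly one record."""
    return count_fasta_records(Path(path).read_text()) == 1
```

For example, `is_single_sequence("data/MN996532.fasta")` should return True for a file that is suitable as --reference.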

Example sequences are found in data/query_seqs.

By default, output is written to ./results/idplot.html and can be opened with a web browser.

An example report is available at: https://brwnj.github.io/idplot/

Using a custom alignment

In some cases it may be necessary to manually correct an alignment. idplot can then accept the corrected alignment and skip its internal alignment step. To do so, run:

nextflow run brwnj/idplot -latest -with-docker \
    --alignment my_alignment_msa.fasta

The first sequence in the file will be used as the reference (root) sequence.

Options --reference and --fasta are both omitted in this case.
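Before passing a hand-edited alignment, a quick sanity check can confirm that every row has the same length and show which sequence will be treated as the reference. This is a sketch under the assumption that the alignment is a plain aligned FASTA file; `parse_fasta` and `check_alignment` are hypothetical helpers, not idplot functions.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs, preserving file order."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_alignment(text: str) -> str:
    """Validate an aligned FASTA and return the name of the first (reference) row."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("alignment contains no sequences")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"aligned rows differ in length: {sorted(lengths)}")
    return records[0][0]
```

If the returned name is not the intended reference, reorder the file so that the reference row comes first.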

Including breakpoint detection

We use GARD, run via HyPhy, to identify breakpoints. For each GARD fit iteration, we extract the sequences for each region between breakpoints and build a tree using FastTree.

An example command enabling GARD with 12 MPI processes:

nextflow run brwnj/idplot -latest -with-docker \
    --reference data/MN996532.fasta \
    --fasta 'data/query_seqs/*.fasta' \
    --gard --cpus 12

Including a reference annotation

Publicly available reference sequences, such as those in NCBI, often have an accompanying annotation that can be included within the ANI plot. On NCBI, a GFF can be downloaded from the reference page under Send to. Pass it to the workflow with --gff:

nextflow run brwnj/idplot -latest -with-docker \
    --reference data/MN996532.fasta \
    --fasta 'data/query_seqs/*.fasta' \
    --gff MN996532.gff3

Within the report, the annotation is drawn within the ANI plot. As GFFs may have multiple feature types, the reader can select their preferred feature type from the report header.

Coordinates in the original GFF will likely not match what is being displayed. Start and end coordinates are updated based on gaps introduced into the reference sequence during multiple sequence alignment.
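The coordinate shift described above can be illustrated with a short sketch. It assumes 1-based, inclusive GFF coordinates and '-' as the gap character; the helpers `ungapped_to_aligned` and `shift_feature` are hypothetical, and idplot's own implementation may differ.

```python
from typing import List, Tuple


def ungapped_to_aligned(aligned_ref: str, gap: str = "-") -> List[int]:
    """Map each ungapped reference position to its 1-based alignment column.

    Index i of the returned list corresponds to ungapped position i + 1.
    """
    return [col for col, base in enumerate(aligned_ref, start=1) if base != gap]


def shift_feature(start: int, end: int, aligned_ref: str) -> Tuple[int, int]:
    """Convert 1-based inclusive GFF coordinates into alignment columns."""
    columns = ungapped_to_aligned(aligned_ref)
    return columns[start - 1], columns[end - 1]
```

With an aligned reference of `AC--GT-A`, a feature spanning positions 3 to 4 in the original GFF lands on alignment columns 5 to 6, because two gap columns precede it.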

Interpreting the report

Multiple sequence alignment

The reference sequence is fully colored in. Hovering along the reference shows the base for a given color.

Query sequences are colored only at mismatches and gaps; gaps are shown in gray.

Percent ID (ANI)

Percent ID is calculated across a sliding window (default 500 bp), and each value is plotted at the window's center point. Consequently, a 500 bp window leaves 250 bp without values at the beginning and end of the reference. No special treatment is given with respect to sequence content.
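The windowed calculation can be sketched as follows. This is an illustration rather than idplot's code: it assumes two rows of equal length taken from the alignment, compares characters directly (so a gap is treated like any other character), and reports each value at the window's center index. The function name `windowed_identity` is hypothetical.

```python
from typing import List, Tuple


def windowed_identity(ref: str, query: str, window: int = 500) -> List[Tuple[int, float]]:
    """Return (center_index, percent_id) for every full window over two aligned rows."""
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have the same length")
    if window <= 0:
        raise ValueError("window must be positive")

    # 1 where the two rows agree at a column, 0 otherwise.
    matches = [1 if r == q else 0 for r, q in zip(ref, query)]
    half = window // 2
    results: List[Tuple[int, float]] = []

    running = sum(matches[:window])
    for start in range(0, len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old one.
            running += matches[start + window - 1] - matches[start - 1]
        results.append((start + half, 100.0 * running / window))
    return results
```

Positions closer than half a window to either end never appear as a center, which is why the plot has no values there.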

Sequences

Sequence selection is based on the level of x-axis zoom of the plot. Sequence gaps can be removed using the toggle. The selected region can be copied to the clipboard, sent directly to BLAST (when the selection is shorter than 8 kb), or exported to FASTA for all sequences.
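The same kind of export can be reproduced outside the report. The sketch below assumes the alignment is held as a mapping of sequence name to aligned row, uses 0-based, end-exclusive column indices, and treats '-' as the gap character; `extract_region`, `to_fasta`, and `export_region` are hypothetical names, not part of idplot.

```python
from typing import Dict


def extract_region(aligned_seq: str, start: int, end: int, remove_gaps: bool = True) -> str:
    """Slice alignment columns [start, end) and optionally drop gap characters."""
    region = aligned_seq[start:end]
    return region.replace("-", "") if remove_gaps else region


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Format name -> sequence pairs as FASTA text wrapped at `width` columns."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def export_region(alignment: Dict[str, str], start: int, end: int,
                  remove_gaps: bool = True) -> str:
    """Export the same column range from every aligned row as FASTA text."""
    return to_fasta({
        name: extract_region(seq, start, end, remove_gaps)
        for name, seq in alignment.items()
    })
```

Removing gaps mirrors the toggle in the report; leaving them in keeps the exported rows aligned to one another.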

With GARD results

Including --gard in your Nextflow command adds breakpoint detection and updates the data and visualizations available in the report.

Breakpoints track

Regions identified by GARD as breakpoints are highlighted between the MSA and ANI plots. Clicking a region navigates to its dendrogram.

Dendrograms

Per-region dendrograms are generated using FastTree from the regions identified by GARD.

Hovering over regions highlights the respective region in the GARD breakpoints track.

Clicking the region link will zoom the plot to facilitate downloading the sequence content for a given region.

Refinements

GARD identifies breakpoints iteratively, often to an unhelpful degree. This plot allows the user to explore breakpoints and trees across all GARD iterations. Selecting a new point will update the dendrograms and GARD breakpoints track.
