Phased Assembly Variant Caller
PAV is a tool for discovering variation using assembled genomes aligned to a reference. It supports both phased and unphased assemblies.
PAV was developed for the Human Genome Structural Variation Consortium (HGSVC):
Ebert et al., “Haplotype-Resolved Diverse Human Genomes and Integrated Analysis of Structural Variation”, Science, February 25, 2021, eabf7117, https://doi.org/10.1126/science.abf7117.
PAV was originally developed as part of the Eichler lab at UW and is now updated and maintained by the Beck lab at JAX. Both labs continue to contribute to the HGSVC.
Eichler lab: https://eichlerlab.gs.washington.edu/
Beck lab: https://www.jax.org/research-and-faculty/research-labs/the-beck-lab
Change to a clean directory (the ANALYSIS directory) to run PAV. PAV will read config.json from this directory and write output to this directory. If you have a native install, do not run PAV from the PAV install location (the SITE directory where Snakefile and pavlib are found).
PAV gets its configuration from two files:
- config.json: Points to the reference genome and can be used to set optional parameters.
- assemblies.tsv: A table of input assemblies.
A JSON configuration file, config.json, configures PAV. Default options are built in, and the only required option is reference, which points to the reference FASTA file that variants are called against.
Example:
{
"reference": "/path/to/hg38.no_alt.fa.gz"
}
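If the configuration is generated by a script, a minimal Python sketch such as the following may help. It uses only the standard library. The helper name write_pav_config is illustrative and not part of PAV; the only key taken from the text above is "reference", and any extra keyword arguments are passed through unchecked.

```python
import json
from pathlib import Path


def write_pav_config(analysis_dir, reference, **optional):
    """Write config.json into the analysis directory (hypothetical helper, not part of PAV)."""
    if not reference:
        raise ValueError("'reference' is the only required option and must be set")
    # Optional parameters are copied as-is; this sketch does not validate them.
    config = {"reference": str(reference), **optional}
    config_path = Path(analysis_dir) / "config.json"
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    return config_path
```

For example, write_pav_config(".", "/path/to/hg38.no_alt.fa.gz") would produce a file equivalent to the example above.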
Note: The HGSVC reference for long reads can be found here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/technical/reference/20200513_hg38_NoALT/
A no-ALT version of a reference is essential for long read assemblies. A reference containing alternate loci, decoys, or patches may produce unexpected results and will likely lead to a loss of sensitivity.
PAV can find assemblies in one of two ways:
- assemblies.tsv: Use for a small number of samples. Most will use this option.
- asm_pattern in config.json: If paths to the assemblies are consistent, then this method can be used to find the input without creating assemblies.tsv with an entry for each. This is useful for processing a large collection of assemblies (i.e. consortia-sized data).
Assemblies may be in FASTA, FASTQ, or GFA (hifiasm compatible) formats, and files may be optionally gzipped. File indexes such as ".fai" files are not needed; PAV will create the indices it needs. PAV can also take an FOFN (File Of File Names) pointing to multiple input files and process them as one.
For phased assemblies, PAV expects two haplotypes separated into two files. It can process unphased assemblies if you leave the second haplotype column (HAP2) blank in assemblies.tsv or give it a path to a 0-byte file (the 0-byte file works for both assemblies.tsv and asm_pattern input).
Create an assemblies table, assemblies.tsv, as a tab-separated values (TSV) file with one line per assembly and three columns:
- NAME: Assembly name
- HAP1: Path to haplotype 1 FASTA
- HAP2: Path to haplotype 2 FASTA
The configuration option assembly_table (in config.json) may be used to set the name of the assemblies TSV file, and the file may be gzipped.
Most spreadsheet applications, including MS Excel and LibreOffice Calc, can be used to edit this file. Be sure to save it as the right file type: it must be plain text and end with ".tsv" (not ".tsv.txt" unless the assembly_table option is adjusted to look for it).
If you have one assembly per sample (unphased), then leave HAP2 blank for those samples or give it a path to a 0-byte file.
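The table can also be written by a script instead of a spreadsheet. The sketch below uses Python's csv module with a tab delimiter. The helper name write_assemblies_table and the sample paths in the usage note are assumptions for illustration; only the NAME, HAP1, and HAP2 column names come from the description above.

```python
import csv


def write_assemblies_table(path, assemblies):
    """Write a three-column assemblies table (hypothetical helper, not part of PAV).

    `assemblies` is an iterable of (name, hap1_path, hap2_path) tuples. Pass None
    or "" as hap2_path for an unphased assembly so the HAP2 column is left blank.
    """
    with open(path, "w", newline="") as out_file:
        writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
        writer.writerow(["NAME", "HAP1", "HAP2"])
        for name, hap1, hap2 in assemblies:
            writer.writerow([name, hap1, hap2 or ""])
```

A call such as write_assemblies_table("assemblies.tsv", [("HG00733", "/path/to/h1.fa.gz", "/path/to/h2.fa.gz"), ("UNPHASED", "/path/to/asm.fa.gz", None)]) writes one phased and one unphased entry.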
Use this method if you have multiple assemblies in a consistent path structure. With this method, it is possible to process hundreds of assemblies without creating an entry in assemblies.tsv for each one.
Two wildcards need to be inserted into the path, "asm_name" and "hap". "asm_name" can be any name for the assembly or sample. "hap" must be "h1" or "h2". The input path goes into the config.json parameter asm_pattern.
For example:
{
"reference": "/path/to/hg38.no_alt.fa.gz",
"asm_pattern": "/path/to/assemblies/{asm_name}/{hap}.fa.gz"
}
In this example, if an assembly with asm_name "HG00733_CCS_SS_PG_PRR" is run, then PAV will expect to find two files:
- /path/to/assemblies/HG00733_CCS_SS_PG_PRR/h1.fa.gz
- /path/to/assemblies/HG00733_CCS_SS_PG_PRR/h2.fa.gz
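The expansion can be mimicked with Python string formatting to check a pattern before starting a run. This is only an illustration of how the wildcards are filled in; PAV's own resolution logic may differ, and the function name expected_inputs is hypothetical.

```python
def expected_inputs(asm_pattern, asm_name, haps=("h1", "h2")):
    """Expand an asm_pattern into one path per haplotype (illustration only)."""
    return [asm_pattern.format(asm_name=asm_name, hap=hap) for hap in haps]


pattern = "/path/to/assemblies/{asm_name}/{hap}.fa.gz"
for path in expected_inputs(pattern, "HG00733_CCS_SS_PG_PRR"):
    print(path)
# /path/to/assemblies/HG00733_CCS_SS_PG_PRR/h1.fa.gz
# /path/to/assemblies/HG00733_CCS_SS_PG_PRR/h2.fa.gz
```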
If the path contains neither a "hap" nor a "parent" wildcard (see below), PAV treats it as an unphased assembly: it reads the assembly as h1 and never tries to read h2.
To support hifiasm-trio, you may substitute the "parent" wildcard for "hap" (do not include both "hap" and "parent" wildcards). In this case, "mat" becomes an alias for "h1", and "pat" becomes an alias for "h2" when searching for files. The PAV output will still contain "h1" and "h2" for the maternal and paternal haplotypes, respectively.
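The alias rule can be sketched as a small lookup. The function resolve_input and the example path are assumptions for illustration, not PAV's actual code; the h1/mat and h2/pat mapping is taken from the paragraph above.

```python
# Alias used when the pattern contains the "parent" wildcard instead of "hap".
PARENT_ALIAS = {"h1": "mat", "h2": "pat"}


def resolve_input(asm_pattern, asm_name, hap):
    """Fill the pattern for one haplotype ("h1" or "h2"); illustration only."""
    if "{hap}" in asm_pattern and "{parent}" in asm_pattern:
        raise ValueError("Use either the 'hap' or the 'parent' wildcard, not both")
    # str.format ignores keyword arguments that the pattern does not use.
    return asm_pattern.format(asm_name=asm_name, hap=hap, parent=PARENT_ALIAS[hap])
```

With the hypothetical pattern "/path/to/assemblies/{asm_name}/{parent}.fa.gz", haplotype "h1" would resolve to a file named mat.fa.gz and "h2" to pat.fa.gz.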
Optionally, "sample" may be a wildcard in the path. When PAV sees this, it assumes the sample name is the first part of "asm_name" delimited by underscores. For example, if "asm_name" is "HG00733_CCS_SS_PG_PRR", then "sample" is "HG00733". This feature was used mainly for HGSVC, where we used a "sample/assembly" directory structure (e.g. assemblies/HG00733/HG00733_CCS_SS_PG_PRR_h1.fa.gz from pattern assemblies/{sample}/{asm_name}_{hap}.fa.gz). This may be useful for consortia with several assemblies per sample.
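The sample rule amounts to taking everything before the first underscore. A minimal sketch, assuming Python string formatting stands in for PAV's wildcard handling (the function name is hypothetical):

```python
def sample_from_asm_name(asm_name):
    """Return the part of asm_name before the first underscore."""
    return asm_name.split("_", 1)[0]


pattern = "assemblies/{sample}/{asm_name}_{hap}.fa.gz"
asm_name = "HG00733_CCS_SS_PG_PRR"
path = pattern.format(sample=sample_from_asm_name(asm_name), asm_name=asm_name, hap="h1")
print(path)  # assemblies/HG00733/HG00733_CCS_SS_PG_PRR_h1.fa.gz
```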
Additional information about configuration parameters for config.json can be found in CONFIG.md.
Change to the ANALYSIS directory (where config.json is found), then run the container:
Docker:
sudo docker run --rm -v ${PWD}:${PWD} --user "$(id -u):$(id -g)" --workdir ${PWD} becklab/pav:latest -c 16
Singularity:
singularity run --bind "$(pwd):$(pwd)" library://becklab/pav/pav:latest -c 16
Notes:
- Cores: Set the maximum number of cores -c (or --cores) to be used simultaneously.
- Directory binding: You may need to adjust the directory bindings for your machine, but these parameters should work for most.
- Version: You may change "latest" to an explicit PAV version to ensure compatibility among samples.
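When many samples are launched from a script, the Docker command above can be assembled programmatically so that the core count and image version stay consistent across runs. The sketch below only builds the argument list and does not run anything. The function name docker_command is hypothetical, the version argument is a placeholder for whatever tag you choose, and sudo is left out (prepend it if your Docker setup requires it). It relies on os.getuid and os.getgid, which exist on Unix-like systems only.

```python
import os
import shlex


def docker_command(workdir, cores, version="latest"):
    """Build the docker argument list for one PAV run (illustration only)."""
    workdir = os.path.abspath(workdir)
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:{workdir}",                # bind the analysis directory
        "--user", f"{os.getuid()}:{os.getgid()}",    # write output as the current user
        "--workdir", workdir,
        f"becklab/pav:{version}",
        "-c", str(cores),
    ]


# Print a shell-quoted command line for inspection.
print(shlex.join(docker_command(".", 16)))
```

The resulting list can be passed to subprocess.run once it looks right for your machine.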
PAV can process a phased human genome in 4.5 to 5.5 hours with 64 GB of memory and 32 cores with minimap2 alignments. Actual memory usage is around 52 GB.
See NATIVE_INSTALL.md
for help installing and PAV natively on a machine. This option necessary if Docker and
Singularity are not available or if distribute individual PAV steps over a cluster.
See EXAMPLE.md to set up a small example run to test PAV on your system.
Most projects will read from the VCF in the root of the run directory, but PAV outputs some other useful information.
The output directory (results/{asm_name}) has several subdirectories:
- align: Information about contig alignments.
  - Post-trimming alignments.
  - The BED and FASTA files in this directory could be used to reconstruct a SAM file.
- bed: Variant calls in formatted BED files.
  - One file for each variant type (sv_ins, sv_del, indel_ins, indel_del, snv_snv).
- bed/fa: FASTA for inserted and deleted sequences.
  - A unique ID links each sequence to its variant call.
  - No FASTA for SNVs (see REF and ALT in variant calls).
- callable: BED files of callable regions (where contigs aligned) smoothed by 500 bp windows.
- inv_caller: Intermediate output from the inversion caller.
  - Flagged loci queried for inversions (not all produce calls).
  - Contains data useful for visualizing inversions.
- lg_sv: Intermediate output from large SV calls.
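To check which variant BED files a run produced, a short script can group the contents of the bed subdirectory by variant type. This sketch assumes that each file name begins with its variant type (for example, a name starting with "sv_ins"); the exact file names and extensions are not stated above, so adjust the glob to match your PAV version. The function name find_variant_beds is hypothetical.

```python
from pathlib import Path

# Variant types listed for the bed output directory.
VARIANT_TYPES = ("sv_ins", "sv_del", "indel_ins", "indel_del", "snv_snv")


def find_variant_beds(results_dir, asm_name):
    """Map each variant type to the files in results_dir/asm_name/bed whose names start with it."""
    bed_dir = Path(results_dir) / asm_name / "bed"
    found = {}
    for variant_type in VARIANT_TYPES:
        # Keep regular files only, so subdirectories such as bed/fa are skipped.
        found[variant_type] = sorted(
            path for path in bed_dir.glob(f"{variant_type}*") if path.is_file()
        )
    return found
```

An empty list for a variant type means no matching file was found, which may indicate an incomplete run or a different naming scheme.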
Information about how PAV resolves two haplotypes as one diploid sample can be found in HAP_MERGING.md.
Ebert et al., “Haplotype-Resolved Diverse Human Genomes and Integrated Analysis of Structural Variation,” Science, February 25, 2021, eabf7117, https://doi.org/10.1126/science.abf7117 (PMID: 33632895).
PAV was also presented at ASHG 2021:
Audano et al., "PAV: An assembly-based approach for discovering structural variants, indels, and point mutations in long-read phased genomes," ASHG Annual Meeting, October 20, 2021 (10:45 - 11:00 AM), PrgmNr 1160
Please open an issue on the GitHub page for problems.
You may also contact Peter Audano directly (e-mail omitted to reduce spam). PAV was developed in the lab of Dr. Evan Eichler at the University of Washington and is currently maintained in the lab of Dr. Christine Beck at The Jackson Laboratory.