VaLiAnT

The Variant Library Annotation Tool (VaLiAnT) is an oligonucleotide library design and annotation tool for Saturation Genome Editing and other Deep Mutational Scanning experiments.

A selection of libraries is included in the examples directory, including all necessary inputs and instructions to generate them.

Please also see the VaLiAnT Wiki for more information on use cases.

VaLiAnT

Citation

Please cite this paper when using VaLiAnT for your publications:

Variant Library Annotation Tool (VaLiAnT): an oligonucleotide library design and annotation tool for saturation genome editing and other deep mutational scanning experiments.

Barbon L, Offord V, Radford EJ, Butler AP, Gerety SS, Adams DJ, Tan HK, Waters AJ.

Bioinformatics. 2022 Jan 27;38(4):892-899.

DOI: 10.1093/bioinformatics/btab776. PMID: 34791067; PMCID: PMC8796380.

Usage

See the command line interface section for the full list and a more detailed description of the parameters.

Input parameters:

species
assembly
5' adaptor (optional)
3' adaptor (optional)

Main command input files:

configuration file (JSON)

SGE input files:

SGE targeton file (TSV)
reference genome sequence (FASTA)
reference genome index (FAI)
PAM protection file (VCF, optional)
VCF manifest file (CSV, optional)
custom variant files (VCF, optional)
features file (GTF/GFF2, optional)
codon table with frequencies (CSV, optional)
background variant file (VCF, optional)
background variant mask file (BED, optional)

cDNA input files:

cDNA targeton file (TSV)
cDNA sequences (single multi-FASTA)
cDNA annotation file (TSV, optional)
codon table with frequencies (CSV, optional)

Output files:

reference sequence retrieval quality check file (CSV, SGE-only)
oligonucleotide metadata file (CSV)
variant file (VCF, SGE-only)
unique oligonucleotides file (CSV)
configuration file (JSON)

The reference directory should contain both the FASTA file and its index; e.g., if the FASTA file is named genome.fa, a genome.fa.fai file should also be present in the same directory. When running the tool in a container, the directory containing both files should therefore be mounted.

The features file (SGE-only gff option) is required to detect exonic regions in the targeton, and should therefore be provided in most circumstances. The files should only contain features for one transcript per gene (the gene_id and transcript_id attributes are required to perform this check). The features file should match the assembly of the target reference genome. Any features of type other than CDS and UTR are ignored.

If the codon-table option is not set, this table will be used.

Ambiguous nucleotides are not allowed in the reference sequence. Soft-masking is ignored.

Oligonucleotides exceeding a given length (max-length option) will not be included in the unique oligonucleotide files and their metadata will be stored in separate files marked as 'excluded'.

Background variants, if provided, are applied before PAM protection variants; when the CDS features of a transcript are provided (via a GTF/GFF2 file), to keep the annotation consistent across targetons, such variants are applied in the minimal range of positions that spans at least the entire CDS, further extended to the boundaries of any targeton overlapping the CDS, and finally to the start and end position of the first and last background variant intersecting the resulting range, respectively.

Background variants may be filtered out by position by providing a set of genomic ranges to be excluded via a BED file (bg-mask option). Excluding frame-shifting variants may affect the annotation.

By default, errors are raised when background variants are not synonymous or shift the reading frame; the force-bg-ns and force-bg-indels flags may be passed to allow them.

Mutations that overlap background variants are discarded; warnings are always raised to identify them, with their positions expressed in absolute genomic coordinates for custom variants, and targeton-relative coordinates for pattern variants.

Python package

After installing the package in an appropriate virtual environment:

valiant sge \
    "${TARGETONS_FILE}" \
    "${REFERENCE_FILE}" \
    "${OUTPUT_DIR}" \
    "${SPECIES}" \
    "${ASSEMBLY}" \
    --gff "${GTF_FILE}" \
    --adaptor-5 "${ADAPTOR_5}" \
    --adaptor-3 "${ADAPTOR_3}"

Alternatively, a configuration file can be provided to the main command:

valiant -c config.json

Docker image

After building or pulling the Docker image (quay.io/wtsicgp/valiant:X.X.X, where X.X.X is a version tag):

docker run \
    -v "${HOST_INPUTS}":"${INPUT_DIR}":ro \
    -v "${HOST_REF}":"${REF_DIR}":ro \
    -v "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
    valiant \
        valiant sge \
            "${INPUT_DIR}/${TARGETONS_FILE}" \
            "${REF_DIR}/${REFERENCE_FILE}" \
            "${OUTPUT_DIR}" \
            "${SPECIES}" \
            "${ASSEMBLY}" \
            --gff "${INPUT_DIR}/${GTF_FILE}" \
            --adaptor-5 "${ADAPTOR_5}" \
            --adaptor-3 "${ADAPTOR_3}"

The HOST_* environment variables represent local paths to be mounted by the Docker container.

Singularity image

After pulling the Docker image with Singularity:

singularity exec \
    --cleanenv \
    -B "${HOST_INPUTS}":"${INPUT_DIR}":ro \
    -B "${HOST_REF}":"${REF_DIR}":ro \
    -B "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
    ${SINGULARITY_IMAGE} \
        valiant sge \
            "${INPUT_DIR}/${TARGETONS_FILE}" \
            "${REF_DIR}/${REFERENCE_FILE}" \
            "${OUTPUT_DIR}" \
            "${SPECIES}" \
            "${ASSEMBLY}" \
            --gff "${INPUT_DIR}/${GTF_FILE}" \
            --adaptor-5 "${ADAPTOR_5}" \
            --adaptor-3 "${ADAPTOR_3}"

Command line interface

Separate subcommands are provided depending on the target sequence origin:

SGE (sge): sequences from reference, genomic coordinates, three target regions;
cDNA DMS (cdna): user-provided sequences, relative coordinates, single target region.

The arguments and a few options are the same for both subcommands (see here), but the file formats may vary.

valiant

Main command.

Option	Format	Default	Description
`version`	flag	`false`	Show the version of the tool and quit.
`config`	file path	-	Path to the configuration file.

valiant (sge|cdna)

Arguments and options required or supported by both subcommands. The format of the input files may be different for SGE and cDNA targets (see the argument descriptions).

Argument	Format	Description
`OLIGO_INFO`	file path	Path to the targeton file (SGE or cDNA format).
`REF_FASTA`	file path	Path to the FASTA file of the target reference genome (`sge`) or cDNA sequences (`cdna`).
`OUTPUT`	file path	Output path (should exist already).
`SPECIES`	species name	Target species, to be reported in the oligonucleotide metadata.
`ASSEMBLY`	assembly name	Target assembly, to be reported in the oligonucleotide metadata.

Option	Format	Default	Description
`codon-table`	file path	-	Path to a codon table with frequencies.
`max-length`	integer	300	Maximum oligonucleotide length.
`adaptor-5`	DNA sequence	-	DNA sequence to be added at the 5' end of the oligonucleotide.
`adaptor-3`	DNA sequence	-	DNA sequence to be added at the 3' end of the oligonucleotide.
`log`	log level	`WARNING`	Name of the preferred log level (see the official documentation of the `logging` module).

valiant sge

Options specific to SGE.

The REF_FASTA path is expected to point to a reference genome in FASTA format.

Option	Format	Default	Description
`gff`	file path	-	Path to GTF/GFF2 file containing CDS and UTR features; one transcript per gene only.
`bg`	file path	-	Path to a background variant VCF file.
`pam`	file path	-	Path to a PAM protection file.
`vcf`	file path	-	Path to a VCF manifest file.
`revcomp-minus-strand`	flag	`false`	For minus strand targets, include the reverse complement of the mutated reference sequence in the oligonucleotide.
`sequences-only`	flag	`false`	Generate the reference sequence retrieval quality check file and quit.
`mask_bg_fp`	file path	-	Path to a BED file to exclude background variants from being applied to the specified genomic intervals.
`force-bg-ns`	flag	`false`	Allow non-synonymous background variants.
`force-bg-indels`	flag	`false`	Allow frame-shifting background variants.

valiant cdna

Options specific to cDNA DMS.

The REF_FASTA path is expected to point to a multi-FASTA containing cDNA sequences.

Option	Format	Default	Description
`annot`	file path	-	Path to a cDNA annotation file.

Mutation types

Types of mutation that apply to any target (label):

parametric deletion (e.g.: 1del, 2del0, 2del1)
single-nucleotide variant (snv)

Types of mutation that apply to CDS targets only (label):

in-frame deletion (inframe)
alanine codon substitution (ala)
stop codon substitution (stop)
all amino acid codon substitution (aa)
SNVRE (snvre)

Variants imported from VCF files are labelled as custom.

Parametric deletion

Non-overlapping stretches of nucleotides of a given length are deleted starting from a given offset. No partial deletions are performed at the end of the target regions. Format: <SPAN>del[<OFFSET>] (the offset is assumed to be zero if not set).

For backwards compatibility, in the metadata table, 1del0 is reported as 1del.

Given the target ACGTAAA, span two, and start offset zero (2del0), e.g.:

GTAAA
ACAAA
ACGTA

With start offset one (2del1), e.g.:

ATAAA
ACGAA
ACGTA

Single-nucleotide variant

Each nucleotide is replaced with all the alternatives.

Given the target AA, e.g.:

CA
GA
TA
AC
AG
AT

For CDS targets, the resulting amino acid change is reported.

In-frame deletion

Only for CDS targets.

Delete each triplet so that the reading frame is preserved.

Given the target GAAATTTGG with frame 2, e.g.:

GTTTGG
GAAAGG

Alanine codon substitution

Only for CDS targets.

Replace each codon with the top-ranking alanine codon.

Given the target GCAAAATTT, with GCC being the top-ranking alanine codon, e.g.:

GCCAAATTT
GCAGCCTTT
GCAAAAGCC

Stop codon substitution

Only for CDS targets.

Replace each codon with the top-ranking stop codon.

Given the target TAACCCGGG, with TGA being the top-ranking stop codon, e.g.:

TGACCCGGG
TAATGAGGG
TAACCCTGA

All amino acid codon substitution

Only for CDS targets.

Replace each codon with the top-ranking codon of all amino acids. Given the default codon table, this results in 19 mutated sequences for each codon mapping to an amino acid (the reference amino acid being excluded) and 20 for each stop codon.

Given the target AAATGA on the plus strand, e.g. (each column representing the sequences generated from one codon):

ATCTGA  AAAATC
ATGTGA  AAAATG
ACCTGA  AAAACC
AACTGA  AAAAAC
        AAAAAG
AGCTGA  AAAAGC
CGGTGA  AAACGG
CTGTGA  AAACTG
CCCTGA  AAACCC
CACTGA  AAACAC
CAGTGA  AAACAG
GTGTGA  AAAGTG
GCCTGA  AAAGCC
GACTGA  AAAGAC
GAGTGA  AAAGAG
GGCTGA  AAAGGC
TTCTGA  AAATTC
TACTGA  AAATAC
TGCTGA  AAATGC
TGGTGA  AAATGG

SNVRE

Only for CDS targets.

Given a set of SNV's, replace triplets according to the following rules:

if the SNV results in a synonymous mutation, replace the triplet with all the synonymous triplets of the variant
if the SNV results in a missense mutation, replace the triplet with the top-ranking synonymous triplet of the variant
if the SNV results in a nonsense mutation, replace the triplet with the top-ranking stop codon

For a given missense or nonsense SNV mutation, if the resulting triplet is already the top-ranking one, the second highest ranking triplet is used to generate the SNVRE mutation instead.

Given the following SNV's for sequence AAAAGT, e.g.:

mseq	ref	alt
CAAAGT	K	Q
GAAAGT	K	E
TAAAGT	K	STOP
ACAAGT	K	T
AGAAGT	K	R
ATAAGT	K	I
AACAGT	K	N
AAGAGT	K	K
AATAGT	K	N
...
AAAAGC	S	S
...

There is only one synonymous mutation for the first triplet (AAGAGT), but since lysine maps to only two codons and one of them is the reference, no SNVRE variants are generated from it. The one for the second triplet (AAAAGC), though, results in the top-ranking codon for serine, that maps to six codons, and therefore the following four SNVRE's are generated:

mseq	ref	alt
AAATCA	S	S
AAATCC	S	S
AAATCG	S	S
AAATCT	S	S

For missense mutations, the top-ranking codon (the current being excluded) for each alternative amino acid replaces the reference sequence:

mseq	ref	alt	snv	svnre
CAAAGT	K	Q	CAA	CAG
GAAAGT	K	E	GAA	GAG
ACAAGT	K	T	ACA	ACC
AGAAGT	K	R	AGA	CGG
ATAAGT	K	I	ATA	ATC
AACAGT	K	N	AAC	AAT
AATAGT	K	N	AAT	AAC

The resulting SNVRE variants would be:

mseq	ref	alt
CAGAGT	K	Q
GAGAGT	K	E
ACCAGT	K	T
CGGAGT	K	R
ATCAGT	K	I
AATAGT	K	N
AACAGT	K	N

For the nonsense mutation:

mseq	ref	alt	snv	svnre
TAAAGT	K	STOP	TAA	TGA

The resulting SNVRE variant would be:

mseq	ref	alt
TGAAGT	K	STOP

Unique codons do not generate SNVRE variants.

Custom variants

Applied to the targeton reference sequence as a whole. Only simple variants such as the following are supported:

substitutions
insertions (see below)
deletions (see below)
indels

The classification of the variants is based exclusively on the POS, REF, and ALT fields to be agnostic with respect to the VCF source.

While in the VCF format insertion and deletion positions refer to the base preceding the event and the reference and alternative sequences both include the preceding (or following, if the variants start at position one) base, for consistency with the conventions adopted for generated mutations, in the metadata table such variants are reported as shifted right by one and omitting the preceding (or following) base in the reference and alternative sequences.

Installation

Some of the dependencies are unsupported on Windows, and the tool cannot therefore be installed natively on it. The following options are available:

installing the Windows Subsystem for Linux (WSL) and creating a Python virtual environment
installing Docker (requires the WSL or Windows 10 Pro) and building or pulling the Docker image
installing Singularity (requires a virtualisation solution) and building a Singularity image from the Docker image

The instructions that follow apply to Linux and macOS.

Python virtual environment

Please take care to read errors during the dependency installation step carefully. HTSlib (pysam) has system dependencies and will highlight the packages that need to be installed.

Requirements:

Python 3.11 or above

To install in a virtual environment:

# Initialise the virtual environment
python3.11 -m venv .env

# Activate the virtual environment
source .env/bin/activate

# Install the valiant package
pip install .

Docker image

To build the Docker container:

docker build -t valiant .

File formats

Configuration file

JSON file collecting the execution parameters. It is always generated as an output (config.json) and can optionally be used as input by the main command, e.g.:

valiant -c config.json

Property	Format	Description
`appName`	`valiant`	Name of the application (constant).
`appVersion`	`x.y.z`	Version of the application.
`mode`	`sge`\|`cdna`	Execution mode.
`params`	`object`	Execution parameters.

An application version mismatch will result in a warning.

The execution parameters depend on the execution mode, and each corresponds to one of the command line arguments or options.

Common parameters

CLI argument	JSON property
`oligo_info_fp`	`oligoInfoFilePath`
`ref_fasta_fp`	`refFASTAFilePath`
`output_dir`	`outputDirPath`
`species`	`species`
`assembly`	`assembly`

CLI option	JSON property
`adaptor-5`	`adaptor5`
`adaptor-3`	`adaptor3`
`min-length`	`minOligoLength`
`max-length`	`maxOligoLength`
`codon-table`	`codonTableFilePath`

SGE parameters

CLI option	JSON property
`revcomp-minus-strand`	`reverseComplementOnMinusStrand`
`gff`	`GFFFilePath`
`bg`	`backgroundVCFFilePath`
`pam`	`PAMProtectionVCFFilePath`
`vcf`	`customVCFManifestFilePath`
`mask_bg_fp`	`maskBackgroundFilePath`
`force-bg-ns`	`forceBackgroundNonSynonymous`
`force-bg-indels`	`forceBackgroundFrameShifting`
`include-no-op-oligo`	`includeNoOpOligo`

Example:

{
    "appName": "valiant",
    "appVersion": "4.0.0",
    "mode": "sge",
    "params": {
        "species": "homo sapiens",
        "assembly": "GRCh38",
        "adaptor5": "AATGATACGGCGACCACCGA",
        "adaptor3": "TCGTATGCCGTCTTCTGCTTG",
        "minOligoLength": 1,
        "maxOligoLength": 300,
        "codonTableFilePath": null,
        "backgroundVCFFilePath": null,
        "oligoInfoFilePath": "parameter_input_files/brca1_nuc_targeton_input.txt",
        "refFASTAFilePath": "reference_input_files/chr17.fa",
        "outputDirPath": "brca1_nuc_output",
        "reverseComplementOnMinusStrand": true,
        "includeNoOpOligo": false,
        "GFFFilePath": "reference_input_files/ENST00000357654.9.gtf",
        "PAMProtectionVCFFilePath": "parameter_input_files/brca1_protection_edits.vcf",
        "customVCFManifestFilePath": "reference_input_files/brca1_custom_variants_manifest.csv",
        "maskBackgroundFilePath": null,
        "forceBackgroundNonSynonymous": false,
        "forceBackgroundFrameShifting": false
    }
}

cDNA parameters

CLI option	JSON property
`annot`	`annotationFilePath`

Example:

{
    "appName": "valiant",
    "appVersion": "4.0.0",
    "mode": "cdna",
    "params": {
        "species": "human",
        "assembly": "pCW57.1",
        "adaptor5": "AATGATACGGCGACCACCGA",
        "adaptor3": "TCGTATGCCGTCTTCTGCTTG",
        "minOligoLength": 1,
        "maxOligoLength": 300,
        "codonTableFilePath": null,
        "oligoInfoFilePath": "examples/cdna/input/cdna_targeton.tsv",
        "refFASTAFilePath": "examples/cdna/input/BRCA1_NP_009225_1_pCW57_1.fa",
        "outputDirPath": "examples/cdna/output",
        "annotationFilePath": "examples/cdna/input/cdna_annot.tsv"
    }
}

SGE targeton file

Tab-separated values (TSV) file describing the reference sequence coordinates and the types of mutation to be applied to the three target regions therein contained (collectively referred to as targeton). Multiple types of mutations can be applied to each target region. The coordinates of the target regions are derived from the genomic range of the second target region and an extension vector describing the lengths of the preceding and following regions.

Duplicate mutation types in any given group within the action vector are ignored.

Spacing is ignored when parsing the extension and action vectors.

The chromosome name needs to match the naming in the GTF/GFF2 file and in the reference genome.

Field	Format	Description
`ref_chr`	string	Chromosome name.
`ref_strand`	`+` or `-`	DNA strand.
`ref_start`	integer	Start position of the reference sequence.
`ref_end`	integer	End position of the reference sequence.
`r2_start`	integer	Start position of the second target region.
`r2_end`	integer	End position of the second target region.
`ext_vector`	`<int>, <int>`	Lengths of the first and third target regions.
`action_vector`	`(<str>, ...), (<str>, ...), (<str>, ...)`	Type of mutation labels grouped by target region.
`sgrna_vector`	`<str>, ...`	sgRNA identifiers matching with `SGRNA` tags in the PAM protection VCF file.

Example:

ref_chr	ref_strand	ref_start	ref_end	r2_start	r2_end	ext_vector	action_vector	sgrna_vector
chrX	+	41334132	41334320	41334253	41334297	25, 15	(1del), (1del, snv), (1del)	sgrna_1, sgrna_2

cDNA targeton file

Tab-separated values (TSV) file describing the target cDNA and the types of mutation to be applied to the target region therein contained (expressed in relative coordinates). Multiple types of mutations can be applied the the target region.

The cDNA identifier (seq_id) has to correspond to an entry in the multi-FASTA and (optionally) annotation files.

Field	Format	Description
`seq_id`	string	cDNA identifier.
`targeton_start`	integer	Targeton start position.
`targeton_end`	integer	Targeton stop position.
`r2_start`	integer	Target region start position.
`r2_end`	integer	Target region stop position.
`action_vector`	`<str>, ...`	Type of mutation labels.

Example:

seq_id	targeton_start	targeton_end	r2_start	r2_end	action_vector
ENST00000357654.9	114	121	114	121	snv,1del,snvre
ENST00000357654.9	114	150	120	130	1del,2del0

cDNA annotation file

TSV file describing the CDS region of each cDNA in relative coordinates (one-based and end-inclusive). Gene and transcript identifiers can also be provided.

Field	Format	Description
`seq_id`	string	cDNA identifier.
`gene_id`	string	Gene ID.
`transcript_id`	string	Transcript ID.
`cds_start`	string	cDNA CDS relative start position.
`cds_end`	string	cDNA CDS relative end position.

Example:

seq_id	gene_id	transcript_id	cds_start	cds_end
brca1_357654.9	ENSG00000012048.23	ENST00000357654.9	114	5705

VCF manifest file

CSV file listing the VCF files from which to import variants. Each VCF file is given an alias. If a tag is specified (vcf_id_tag), the VCF INFO field will be expected to contain it and its values will be used as variant identifiers; if no tag is specified, the ID field will be used instead.

Field	Format	Description
`vcf_alias`	string	VCF file alias.
`vcf_id_tag`	VCF tag	(Optional) Variant ID tag.
`vcf_path`	file path	VCF file path.

Example:

vcf_alias,vcf_id_tag,vcf_path
clinvar_1,ALLELEID,clinvar_abc.vcf
clinvar_2,ALLELEID,clinvar_xyz.vcf
gnomad,,gnomad_abc.vcf

PAM protection VCF file

VCF file containing single-nucleotide substitution variants linked to sgRNA identifiers via the SGRNA tag.

Example:

##fileformat=VCFv4.3
##INFO=<ID=SGRNA,Number=1,Type=String,Description="sgRNA identifier">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chrX	41334252	.	G	C	.	.	SGRNA=sgRNA_1
chrX	41337416	.	C	T	.	.	SGRNA=sgRNA_2
chrX	41339064	.	G	A	.	.	SGRNA=sgRNA_3
chrX	41341504	.	T	C	.	.	SGRNA=sgRNA_4
chrX	41341509	.	G	A	.	.	SGRNA=sgRNA_4

Oligonucleotide metadata file

Comma-separated values (CSV) file containing name, label, and all metadata of the oligonucleotides generated for any given targeton.

For cDNA targets, the reference chromosome (ref_chr) and strand (ref_strand) will be missing and all positions will be relative to the cDNA sequence. All fields related to PAM protection (pam_seq) and custom VCF variants (vcf_alias, vcf_var_id, and vcf_var_in_const), features unavailable for this target type, will also be empty (except for vcf_var_in_const, which will be set to zero).

The MAVE-HGVS strings are all linear genomic (relative to the start of the targeton) and do not include the reference. Because in HGVS insertion positions are described by the flanking nucleotides, those occurring at either end of the reference sequence should be treated differently (see the 3' rule in the relevant HGVS documentation); for consistency between SGE and cDNA mode, simplicity, and given the limited usefulness of liminal insertions, this is not the case in the current implementation, and therefore the invalid position zero might be found in insertion names.

Array fields use the semicolon as separator.

Index	Field	Format	Description
1	`oligo_name`	string	Name of the oligonucleotide.
2	`species`	species name	Species.
3	`assembly`	assembly name	Assembly.
4	`gene_id`	string	Gene ID.
5	`transcript_id`	string	Transcript ID.
6	`src_type`	`ref`\|`cdna`	Sequence source type (reference genome or cDNA).
7	`ref_chr`	string	Chromosome name.
8	`ref_strand`	`+`\|`-`	DNA strand.
9	`ref_start`	integer	Start position of the reference sequence.
10	`ref_end`	integer	End position of the reference sequence.
11	`revc`	0\|1	Whether the oligonucleotide contains the reverse complement of the reference sequence (minus strand transcripts only).
12	`ref_seq`	DNA sequence	Reference sequence.
13	`pam_seq`	DNA sequence	PAM-protected reference sequence.
14	`vcf_alias`	string	VCF file alias (custom mutations only).
15	`vcf_var_id`	string	Variant ID (custom mutations only).
16	`mut_position`	integer	Start position of the mutation.
17	`ref`	DNA sequence	Reference nucleotide or triplet.
18	`new`	DNA sequence	Mutated nucleotide or triplet. Not set for deletions.
19	`ref_aa`	amino acid	Reference amino acid.
20	`alt_aa`	amino acid	Alternative amino acid.
21	`mut_type`	`syn`\|`mis`\|`non`	Mutation type.
22	`mutator`	type of mutator	Label of the type of mutator that generated the oligonucleotide.
23	`oligo_length`	integer	Oligonucleotide length.
24	`mseq`	DNA sequence	Full oligonucleotide sequence (with adaptors, if any).
25	`mseq_no_adapt`	DNA sequence	Oligonucleotide sequence excluding adaptors.
26	`pam_mut_annot`	Array of `syn`\|`mis`\|`non`\|`ncd`	Applied PAM protection variant mutation types (or `ncd` if affecting a noncoding region).
27	`pam_mut_sgrna_id`	Array of sgRNA ID's	sgRNA ID's bound to the PAM protection variants spanned by the mutation or affecting the same codons as the mutation, if any.
28	`mave_nt`	MAVE-HGVS string	MAVE-HGVS string corresponding to the mutation.
29	`mave_nt_ref`	MAVE-HGVS string	MAVE-HGVS string corresponding to the mutation, where `REF` does not include PAM protection.
30	`vcf_var_in_const`	0\|1	Whether the variant is in a region defined as constant (custom mutations only).
31	`background_variants`	MAVE-HGVS strings	MAVE-HGVS strings corresponding to the background variants overlapping the targeton range, semicolon-separated.
32	`background_seq`	DNA sequence	Reference sequence altered by the background variants.

Example:

oligo_name,species,assembly,gene_id,transcript_id,src_type,ref_chr,ref_strand,ref_start,ref_end,revc,ref_seq,pam_seq,vcf_alias,vcf_var_id,mut_position,ref,new,ref_aa,alt_aa,mut_type,mutator,oligo_length,mseq,mseq_no_adapt,pam_mut_annot,pam_mut_sgrna_id,mave_nt,mave_nt_ref,vcf_var_in_const,background_variants,background_seq
ENST00000357654.9.ENSG00000012048.23_chr17:43104102_1del_rc,homo sapiens,GRCh38,ENSG00000012048.23,ENST00000357654.9,ref,chr17,-,43104080,43104330,1,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCGCGATTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,,,43104102,A,,,,,1del,291,AATGATACGGCGACCACCGAGTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCTTCGTATGCCGTCTTCTGCTTG,GTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCT,syn;syn,,g.23del,g.23del,0,,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC

Variant file

VCF files containing a subset of the metadata in VCF format. The metadata are stored in the INFO field. The REF field reports the reference sequence including (*_pam.vcf) or excluding (*_ref.vcf) PAM protection edits.

The variants can be linked to the corresponding oligonucleotides via the SGE_OLIGO tag, and, for custom variants, to the original VCF files via the SGE_VCF_ALIAS and SGE_VCF_VAR_ID tags.

INFO tags:

Tag	Metadata field	Description
`SGE_OLIGO`	`oligo_name`	Corresponding oligonucleotide name.
`SGE_SRC`	`mutator`	Variant source.
`SGE_REF`	`ref`	(Optional) Reference sequence, if different from the PAM-protected reference sequence (PAM VCF only).
`SGE_VCF_ALIAS`	`vcf_alias`	(Optional) VCF variant identifier, only for `custom` variants.
`SGE_VCF_VAR_ID`	`vcf_var_id`	(Optional) VCF variant source file alias, only for `custom` variants.

Unique oligonucleotides file

Comma-separated values (CSV) file containing only the label and the sequence of the oligonucleotides generated for any given targeton, where the sequences are unique. This is a subset of the oligonucleotide metadata file fields (oligo_name and mseq) and rows. When multiple oligonucleotides have the same sequence, the first name in lexicographic order is chosen.

Example:

oligo_name,mseq
ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>A_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAAGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>C_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATACGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>T_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATATGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146474_A_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCCCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146475_C_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
ENST00000256474.3.ENSG00000134086.8_chr3:10146477_G_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG

Reference sequence retrieval quality check file

Comma-separated values (CSV) file with no header reporting the reference sequences as retrieved based on the genomic coordinates and extension vector provided in the SGE targeton file.

The targeton name is derived from the genomic coordinates of the reference sequence.

Field	Format	Description
(Targeton name)	`<CHR>_<START>_<END>_<STRAND>`	Name of the targeton.
(Reference genomic range)	`<CHR>:<START>-<END>`	Reference sequence region.
(5' constant region start)	integer	Start position of the 5' constant region.
(5' constant region sequence)	DNA sequence	Sequence of the 5' constant region.
(Target region 1 start)	integer	Start position of target region 1.
(Target region 1 sequence)	DNA sequence	Sequence of target region 1.
(Target region 2 start)	integer	Start position of target region 2.
(Target region 2 sequence)	DNA sequence	Sequence of target region 2.
(Target region 3 start)	integer	Start position of target region 3.
(Target region 3 sequence)	DNA sequence	Sequence of target region 3.
(3' constant region start)	integer	Start position of the 3' constant region.
(3' constant region sequence)	DNA sequence	Sequence of the 3' constant region.

Example:

chr3_10146443_10146687_plus,chr3:10146443-10146687,10146443,GGATTACAGGTGTGGGCCACCGTGCCCAGCC,10146474,ACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAG,10146514,GTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAG,10146637,GTACTGACGTTTTACTTTTTAAAAAGATAAGGTTG,10146672,TTGTGGTAAGTACAGG

Development

To run the unit tests, install the extra requirements first:

pip install -r test-requirements.txt
./run_tests.sh

LICENSE

VaLiAnT
Copyright (C) 2020, 2021, 2022, 2023, 2024 Genome Research Ltd

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

VaLiAnT

Citation

Usage

Python package

Docker image

Singularity image

Command line interface

valiant

valiant (sge|cdna)

valiant sge

valiant cdna

Mutation types

Parametric deletion

Single-nucleotide variant

In-frame deletion

Alanine codon substitution

Stop codon substitution

All amino acid codon substitution

SNVRE

Custom variants

Installation

Python virtual environment

Docker image

File formats

Configuration file

Common parameters

SGE parameters

cDNA parameters

SGE targeton file

cDNA targeton file

cDNA annotation file

VCF manifest file

PAM protection VCF file

Oligonucleotide metadata file

Variant file

Unique oligonucleotides file

Reference sequence retrieval quality check file

Development

LICENSE

About

Languages