This repository contains code and steps used for investigating Plasmodium vivax evolutionary history in central Africa using whole genome sequencing data. This work is now published in Malaria Journal:
All samples used in this study can be found in the metadata table: sample_info/metadata_table.csv
Multiplicity of Infection (MOI) was determined using the following steps:
- Generate two VCFs: one called with the polyclone model (maximum of 3 clones), and one gVCF that records the level of coverage at each base. The commands to generate these are in
$ octopus -I ${DEDUP_BAM} -R ${REF} -T LT635626 -o api.poly3.vcf.gz --annotations AD -C polyclone --max-clones 3 --threads 16 --sequence-error-model PCR
$ octopus -I ${DEDUP_BAM} -R ${REF} -T LT635626 -o api.g.vcf.gz --annotations AD --refcall POSITIONAL --threads 16 --sequence-error-model PCR
scans the genomes2 directory, which contains directories/files like ERR12355/api.poly3.vcf.gz. It produces the simple text file genomes2/mono-0.9.txt with lines like
ERR773745 OK
ERR773746 PolyClonal
ERR773747 NoGVCF
ERR773748 OK
The 0.9 indicates that a site is considered homozygous if the major allele frequency is at least 0.9. Run it like --cutoff 0.9 --allowed-het-sites=1 > mono-0.9.txt
Create the list of accessions with grep OK mono-0.9.txt | cut -f1 > mono-0.9-accs.txt
Accessions with no MOI data were included by default.
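As a rough illustration of the 0.9 cutoff described above, the sketch below classifies a single site from its AD (allelic depth) value. It is not the repository's script: the helper name classify_site, the use of >= for the comparison, and the example depths are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one site from its AD string (e.g. "45,3").
# Assumption: a site is "hom" when major allele depth / total depth >= cutoff.
classify_site() {
  local ad="$1" cutoff="$2"
  awk -v ad="$ad" -v cutoff="$cutoff" 'BEGIN {
    n = split(ad, depth, ",")
    total = 0
    major = 0
    for (i = 1; i <= n; i++) {
      total += depth[i]
      if (depth[i] + 0 > major) major = depth[i] + 0
    }
    if (total == 0) {
      print "no_coverage"
      exit
    }
    if (major / total >= cutoff) {
      print "hom"
    } else {
      print "het"
    }
  }'
}

classify_site "45,3" 0.9    # major allele frequency 45/48
classify_site "30,18" 0.9   # major allele frequency 30/48
```

With --allowed-het-sites=1, an accession would presumably still be reported as OK if at most one site is classified as heterozygous in this way.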
All scripts needed to download, map to PvP01 reference genome, and do variant calling are included in the genome_processing
using the command
$ sbatch accessions.txt
where "accessions.txt" contains one run accession number per line. This script launches a Slurm job array with one task per accession so that accessions run in parallel, limited by the array throttle (currently set to run 10 accessions at one time: #SBATCH --array=1-${NACC}%10).
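For readers unfamiliar with Slurm job arrays, a minimal sketch of how one array task could select its accession is shown here. SLURM_ARRAY_TASK_ID is the variable Slurm sets inside each array task; the helper pick_accession, the temporary accession list, and the fallback task ID are assumptions for illustration and are not taken from the repository's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print line N (the task ID) of an accession list.
pick_accession() {
  local list="$1" task_id="$2"
  sed -n "${task_id}p" "$list"
}

# Example accession list (accessions borrowed from the MOI example above).
ACCESSIONS="$(mktemp)"
printf 'ERR773745\nERR773746\nERR773747\n' > "$ACCESSIONS"

# Number of accessions, i.e. the upper bound of the array range.
NACC="$(wc -l < "$ACCESSIONS" | tr -d ' ')"

# Inside a real array task Slurm sets SLURM_ARRAY_TASK_ID; fall back to 2 here.
TASK_ID="${SLURM_ARRAY_TASK_ID:-2}"
ACC="$(pick_accession "$ACCESSIONS" "$TASK_ID")"
echo "task ${TASK_ID} of ${NACC} would process ${ACC}"
```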
This script will create a new directory for each accession that contains the mapped and deduplicated BAM file as well as the gVCF file. Before combining individual gVCF files for the joint calling step, each gVCF file needs to be updated to include the sample name for the genotype information. Run
with each accession number as the input to update the gVCF file.
$ for i in $(< accessions.txt) ; do sbatch ${i} ; done
This script will create a new file called <accession>-samp.g.vcf
in the same directory as the original gVCF file.
To combine individual gVCFs into one file, first create a list of file locations:
$ for i in $(< accessions.txt) ; do echo ${i}/${i}-samp.g.vcf ; done > accessions.list
Then run
$ sbatch accessions.list
This script will create a new gVCF file named accessions-combined.g.vcf.gz
Run the joint calling step with the command:
$ sbatch accessions-combined.g.vcf.gz
The output of this script will be a file called accessions-combined-joint-called.g.vcf.gz
Before using this gVCF file for analyses, remove variants on unplaced contigs and in hypervariable regions: keep only chromosomes (no contigs) and remove masked regions using the command:
$ sbatch accessions-combined-joint-called.g.vcf.gz
Note: the script
can also be used to extract biallelic SNPs by uncommenting the final lines.
All scripts can be found in analysis/admixture_analysis/
Starting with SNPs-only VCF file (/genome_processing/
set to output biallelic SNPs only):
$ sort -R min_filt_no-singletons.recode.pruned.genotypes.bim | head -n 100000 | awk '{print $2}' > random100k.snps
# change format
# old: chr:pos
# now: chr pos
$ sed "s/\:/\t/g" random100k.snps > random100k.snps.txt
# extract random positions
$ bcftools view -R random100k.snps.txt min_filt_no-singletons.recode.vcf.gz > random100k_min_filt_no-singletons.recode.vcf
# SORT positions with vcftools 'vcf-sort' tool
$ cat random100k_min_filt_no-singletons.recode.vcf | vcf-sort > random100k-SORTED_min_filt_no-singletons.recode.vcf
Convert PvP01 chromosome names to integers
$ for i in $(<replace-chr-w-ints_sed-arguments.txt ) ; do sed -i ${i} chr-as-int_global_vivax.vcf ; done
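The contents of replace-chr-w-ints_sed-arguments.txt are not shown here. One plausible layout, assumed for this sketch, is one sed substitution per line with no spaces, so that the loop above can pass each line as a single argument. The helper make_sed_args and the chromosome-to-integer pairs in the example are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "name integer" pairs on stdin into sed
# substitution expressions, one per line and free of spaces.
make_sed_args() {
  awk '{ printf "s/%s/%s/g\n", $1, $2 }'
}

# Assumed mapping of two PvP01 chromosome accessions to integers.
printf 'LT635612 1\nLT635617 6\n' | make_sed_args
```

Redirecting the output of such a helper to a file would give an arguments file in the form the sed loop expects.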
AdmixturePipeline can be downloaded from GitHub:
To use on haploid data, update the command string in script to indicate haploid mode:
command_string = "admixture" + " -s " + str(np.random.randint(1000000)) + " --cv=" + str( + " " + self.prefix + ".ped " + str(i) + " " + haploid_str
To run, you will need a tab-separated file named popmap.txt
that contains:
accession1 Population1
accession2 Population2
... etc.
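Because a malformed popmap.txt (spaces instead of tabs, or a missing population) is an easy mistake to make, a quick check before submitting may help. The helper check_popmap below is a hypothetical sketch and is not part of AdmixturePipeline; it only verifies that every line has exactly two non-empty tab-separated fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if every line has two non-empty
# tab-separated fields, otherwise report the bad lines and exit 1.
check_popmap() {
  awk -F'\t' '
    NF != 2 || $1 == "" || $2 == "" {
      bad++
      printf("line %d is malformed\n", NR) > "/dev/stderr"
    }
    END { exit (bad > 0) }
  ' "$1"
}

# Example with placeholder accession and population names.
POPMAP="$(mktemp)"
printf 'accession1\tPopulation1\naccession2\tPopulation2\n' > "$POPMAP"
if check_popmap "$POPMAP"; then
  echo "popmap looks well formed"
fi
```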
Submit the pipeline to Slurm using the script:
$ sbatch chr-as-int_global_vivax.vcf
Pong software can be downloaded from the GitHub repo:
Prepare information files for generating Admixture visualization
# File map
$ for i in *.Q ; do j=$i ; i=${i##*.pruned.genotypes.} ; echo -e "k-${i%%.Q}\t${i%%_[0-9]*.Q}\t${j}" ; done > filemap.txt
# Pop order
$ awk '{print $2}' popmap.txt |sort|uniq > pop_order.txt
# then manually arrange the order in this file to represent left to right on the map, and save it as pop_order_revised.txt
# Mapping individuals to populations
$ awk '{print $2}' popmap.txt > ind2pop.txt
# Run Pong
$ pong -m filemap.txt -i ind2pop.txt -n pop_order_revised.txt -v
Then view results in browser.
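The file map loop above relies on shell parameter expansion to derive a run ID and a K value from each .Q file name. The sketch below applies the same expansions to a single name so the three tab-separated columns (run ID, K, path) are easier to follow. The helper filemap_row and the example file name are assumptions; real .Q file names depend on the prefix used when running AdmixturePipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build one filemap row for a single .Q file.
filemap_row() {
  local path="$1"
  # Drop everything up to and including ".pruned.genotypes."
  local tail="${path##*.pruned.genotypes.}"
  # Column 1: run ID (tail without ".Q"), column 2: K (tail without "_<rep>.Q")
  printf 'k-%s\t%s\t%s\n' "${tail%%.Q}" "${tail%%_[0-9]*.Q}" "$path"
}

# Assumed example name: K=3, replicate 1.
filemap_row "min_filt_no-singletons.recode.pruned.genotypes.3_1.Q"
```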
To generate Cross Validation Error box plots:
$ for i in *.stdout ; do grep -h "CV error" ${i} >> overall_cv_summary.txt ; done
$ awk '{print $3"\t"$4}' overall_cv_summary.txt | tr -d '():K=' >> cv_summary_table.txt
and visualize with analysis/admixture_analysis/CrossValidationError_boxplots.Rmd
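To show what the awk and tr step does to each line, the sketch below runs the same transformation on one example line in the format ADMIXTURE prints for cross-validation. The helper name cv_line_to_row and the error value are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert "CV error (K=3): 0.52345" lines on stdin
# into "K<TAB>error" rows, using the same awk and tr calls as above.
cv_line_to_row() {
  awk '{print $3"\t"$4}' | tr -d '():K='
}

# Example line with a made-up error value.
printf 'CV error (K=3): 0.52345\n' | cv_line_to_row
```

Each output row then holds the K value and its cross-validation error, which is the two-column shape a box plot per K would need.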
This analysis uses Plink files generated by AdmixturePipeline in the previous step.
$ plink --bfile global_population.genotypes --pca
And visualize with analysis/plink_pca/global-vivax-pca.Rmd
Bash scripts to start Slurm jobs are included in analysis/trees
Starting with gvcf file produced by /genome_processing/
, first convert VCF to phylip format using script: vcf2phylip
$ sbatch vivax.vcf.gz
Then remove invariant sites using script from
$ sbatch vivax.phy
which will produce a new phylip file vivax.invariants-rm_snps-only.phy
To run IQ-TREE, use the command:
$ sbatch vivax.invariants-rm_snps-only.phy
And visualize using FigTree or software of your choice.
Scripts for this section can be found in analysis/summary_statistics
First filter the VCF in two ways:
- Keep only biallelic sites (sites where every individual either has the reference allele or has a single alternate allele)
$ bcftools view -m2 -M2 -v snps ${VCF} > ${VCF%%.vcf.gz}_biallelic_snps_only.vcf
- Keep only sites where at least one individual in the VCF file has the alternate allele in the GT field (1)
$ bcftools view -S ${SUBSAMPLE} ${VCF} --min-ac=1 > ${SUBSAMPLE%%-accessions.txt}_${VCF%%.vcf.gz}_min-ac1.vcf
Then calculate the number of segregating sites in each population:
$ for i in *_biallelic_snps_only_min-ac1.vcf.gz ; do ./ --vcf ${i} ; done
For each population, you will run the script with the population VCF file and a tab separated population map file with accession number in the first column and the population in the second column:
$ sbatch population-SNPs.vcf popmap.txt
which will output a file called XX that looks like:
pop chromosome window_pos_1 window_pos_2 avg_pi no_sites count_diffs count_comparisons count_missing
eastafrica LT635612 1 1000 NA 0 NA NA NA
eastafrica LT635612 1001 2000 NA 0 NA NA NA
To get the genome-wide average Pi value, run:
$ Rscript genome-ave-pi.R eastafrica-popfile_allchroms_1kb-windows_pixy_pi.txt
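As a cross-check on the genome-wide value, one common way to aggregate windowed output of this shape is to divide the summed count_diffs by the summed count_comparisons, skipping NA windows. The sketch below does that with awk. It is an assumption that this matches what genome-ave-pi.R computes; the helper genome_wide_pi and the numeric rows are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: genome-wide pi as sum(count_diffs) / sum(count_comparisons).
# Assumes a tab-separated file with a header and the column order shown above
# (count_diffs in column 7, count_comparisons in column 8).
genome_wide_pi() {
  awk -F'\t' '
    NR > 1 && $7 != "NA" && $8 != "NA" { diffs += $7; comps += $8 }
    END {
      if (comps > 0) {
        printf "%.6f\n", diffs / comps
      } else {
        print "NA"
      }
    }
  ' "$1"
}

# Small made-up example: one NA window and two windows with data.
DEMO="$(mktemp)"
printf 'pop\tchromosome\twindow_pos_1\twindow_pos_2\tavg_pi\tno_sites\tcount_diffs\tcount_comparisons\tcount_missing\n' > "$DEMO"
printf 'eastafrica\tLT635612\t1\t1000\tNA\t0\tNA\tNA\tNA\n' >> "$DEMO"
printf 'eastafrica\tLT635612\t1001\t2000\t0.01\t900\t10\t1000\t0\n' >> "$DEMO"
printf 'eastafrica\tLT635612\t2001\t3000\t0.01\t900\t30\t3000\t0\n' >> "$DEMO"
genome_wide_pi "$DEMO"
```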
To find the private alleles in a population (i.e. alleles that are unique to one population and not found in any other), create a text file with all the accession numbers/sample names for that population, then run this script with a VCF file containing only biallelic SNPs:
$ sbatch all-populations-biallelic-snps.vcf population-accessions.txt
script is included in /analysis/summary_statistics
if needed
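The idea of a private allele can be sketched directly on an uncompressed biallelic VCF: a site counts as private to a population if at least one sample inside the population carries the alternate allele and no sample outside it does. The awk sketch below is an illustration under those assumptions and not the repository's script; the helper private_alt_sites and the tiny example VCF are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print CHROM and POS of sites whose ALT allele is seen
# only in the listed accessions. Assumes an uncompressed, biallelic VCF with
# GT as the first FORMAT field.
private_alt_sites() {
  local vcf="$1" accs="$2"
  awk -F'\t' '
    NR == FNR { target[$1] = 1; next }     # first file: population accessions
    /^##/ { next }                         # skip meta-information lines
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) in_pop[i] = ($i in target)
      next
    }
    {
      inside = 0
      outside = 0
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")
        if (f[1] ~ /1/) {                  # genotype carries the ALT allele
          if (in_pop[i]) inside++; else outside++
        }
      }
      if (inside > 0 && outside == 0) print $1 "\t" $2
    }
  ' "$accs" "$vcf"
}

# Tiny made-up example: samples A and B form the population, C is outside.
VCF="$(mktemp)"
POP="$(mktemp)"
printf 'A\nB\n' > "$POP"
printf '##fileformat=VCFv4.2\n' > "$VCF"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB\tC\n' >> "$VCF"
printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t1\t0\t0\n' >> "$VCF"
printf 'chr1\t200\t.\tA\tT\t.\tPASS\t.\tGT\t1\t0\t1\n' >> "$VCF"
private_alt_sites "$VCF" "$POP"
```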
Figure S1: P. vivax genome private alleles as a measure of population variation, separated by continent.
Visualize private alleles and segregating sites per country using /analysis/summary_statistics/visualize-private-alleles.Rmd
After calculating genome-wide pi (nucleotide diversity) for each region in Africa using
and genome-ave-pi.R
as described above, visualize the data with analysis/summary_statistics/Visualize-Ave-Pi.Rmd
Supplementary Table 2 Identification of potential gene duplications in DRC P. vivax using read depth
See scripts in /analysis/duplication
for this section. Starting with BAM files from samples that have been aligned to PvP01 reference genome and optical duplicates removed (see /analysis/duplication/
), run:
$ for i in *.dedup.bam ; do sbatch ${i} ; done
This will produce a file with the genome-wide per site coverage for each deduped bam file in the directory (filename will end in .persitedepth.bedgraph
). Sort these bed files with
$ for i in *.persitedepth.bedgraph ; do sbatch ${i} ; done
Next pull out individual chromosomes for each sample:
$ for i in *.persitedepth.bedgraph ; do sbatch ${i} ; done
Make a new directory for each gene of interest:
$ mkdir RBP2c RBP2b RBP2a RBP1b RBP1a DBP2 DBP
From each sample's chromosome coverage file, pull out just the gene subregion:
$ for i in *_LT635617.persitedepth.bedgraph ; do sbatch ${i} ; done
then move those files to the appropriate directory, e.g.
$ mv *_DBP_20kb-each-side.persitedepth.bedgraph DBP/
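Conceptually, pulling out a gene subregion with 20 kb on each side is a filter on the position column of the per-site depth file. The sketch below shows that filter with awk, assuming three whitespace-separated columns (chromosome, position, depth). The helper extract_region and the coordinates are hypothetical and use tiny numbers so the output is easy to inspect; real gene coordinates are not given here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep rows whose position (column 2) lies within
# [start - flank, end + flank].
extract_region() {
  local file="$1" start="$2" end="$3" flank="$4"
  awk -v s="$((start - flank))" -v e="$((end + flank))" \
    '$2 >= s && $2 <= e' "$file"
}

# Made-up per-site depth file: positions 1..10 on one chromosome.
DEMO="$(mktemp)"
for pos in 1 2 3 4 5 6 7 8 9 10; do
  echo "LT635617 ${pos} $((pos * 2))"
done > "$DEMO"

# Gene at positions 5-6 with a flank of 2 keeps positions 3..8.
extract_region "$DEMO" 5 6 2
```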
These files need to be converted from space-delimited to tab-delimited. To do this for all files in each gene subdirectory, run the following command to produce a new tab-delim file (original file will be saved with _space-delim.original
file extension):
$ for i in */ ; do (cd ${i} ; for j in *.bedgraph ; do sed -i'_space-delim.original' 's/ /\t/g' ${j} ; done ) ; done
Make a named bedgraph for one file (in this case, the DRC sample of interest was named SANRU, and this is the one I used). Other samples will be appended to this table since they all have reads for every site.
$ for i in */ ; do (cd ${i} ; echo -e "chr\tpos\tSANRU" > SANRU_${i%/}_10kb-each-side.persitedepth_named.bedgraph ; cat SANRU_${i%/}_20kb-each-side.persitedepth.bedgraph >> SANRU_${i%/}_10kb-each-side.persitedepth_named.bedgraph ) ; done
Pull out only depth column and add sample name as header:
$ for j in */ ; do (cd ${j} ; for i in *persitedepth.bedgraph ; do echo -e "${i%%_*}" > ${i%%_*}_depth-col.bedgraph ; awk '{print $3}' ${i} >> ${i%%_*}_depth-col.bedgraph ; done ) ; done
If needed, create a symlink in each gene folder for the merge script
$ for j in */ ; do (cd ${j} ; ln -s ../merge-bedgraphs.R . ) ; done
Then merge all bedgraphs within each gene folder
$ for j in */ ; do (cd ${j} ; Rscript merge-bedgraphs.R SANRU_${j%/}_10kb-each-side.persitedepth_named.bedgraph ; mv merged.bedgraph ${j%/}_merged.bedgraph ) ; done
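Because every sample has a depth value for every site, the merge amounts to placing each sample's depth column next to the named chr/pos table. The sketch below shows that column-wise join with paste on two tiny files. It is an assumption that merge-bedgraphs.R produces an equivalent table; the helper merge_depth_columns, the file names, and the depth values are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join a named bedgraph (chr, pos, first sample) with any
# number of single-column depth files, line by line, separated by tabs.
merge_depth_columns() {
  paste "$@"
}

# Made-up inputs: a named bedgraph and one extra sample's depth column.
WORK="$(mktemp -d)"
printf 'chr\tpos\tSANRU\nLT635617\t1\t12\nLT635617\t2\t15\n' > "$WORK/named.bedgraph"
printf 'sampleA\n8\n9\n' > "$WORK/sampleA_depth-col.bedgraph"

merge_depth_columns "$WORK/named.bedgraph" "$WORK/sampleA_depth-col.bedgraph"
```

This only works when all files list the same sites in the same order, which is why the sorting step earlier in this section matters.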
Copy the merged bedfile for each gene to working directory:
$ for j in */ ; do cp ${j}${j%/}_merged.bedgraph . ; done
Then visualize read depth levels with Genes_of_interest_CNV.Rmd
(for an individual gene, such as DBP hard coded in this script). This R markdown document produces a new file called DBP-CNV-per-country_large-font.png
, which is how Supplemental figure 4B was generated.
To run the analysis for each gene and save an image of each graph, run:
$ for i in */ ; do Rscript ./cnv_in_genes.R ${i%/} ; done
where each directory (*/
) is the name of a gene of interest.