Two datasets were used:
- All genome assemblies available on NCBI, retrieved as a single FASTA file:
  - List of accessions: Assemblies_db_15-03-19.txt
  - Script to create the FASTA file: viral_assemblies.py
- Earth's virome database: https://www.nature.com/articles/nature19094#methods
cat /home/fodelian/Desktop/ViralGenomes/assembly_db/refseq_viral_genomes.fasta \
/home/fodelian/Desktop/ViralGenomes/assembly_db/mVGs_sequences_v2.fasta \
/home/fodelian/Desktop/ViralGenomes/SNG/SNG_contigs.fasta \
/home/fodelian/Desktop/ViralGenomes/VDN/VDN_contigs.fasta \
/home/fodelian/Desktop/ViralGenomes/VEV/VEV_contigs.fasta \
> raw_db_ctgs.fasta
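One caveat with plain `cat`: if an input file does not end with a newline, its last sequence line is glued to the next file's header. The sketch below is one possible way to do the same merge in Python while guarding against that and reporting repeated sequence IDs. The function name `concatenate_fasta` is hypothetical and not part of the scripts listed here.

```python
def concatenate_fasta(inputs, output):
    """Concatenate FASTA files into `output`.

    Returns (number_of_records, list_of_duplicate_ids). A newline is added
    after any input file that does not end with one, so the next header
    always starts on its own line.
    """
    seen = set()
    duplicates = []
    n_records = 0
    with open(output, "w") as out:
        for path in inputs:
            last = "\n"  # an empty input file needs no extra newline
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        n_records += 1
                        fields = line[1:].split()
                        seq_id = fields[0] if fields else ""
                        if seq_id in seen:
                            duplicates.append(seq_id)
                        seen.add(seq_id)
                    out.write(line)
                    last = line
            if not last.endswith("\n"):
                out.write("\n")
    return n_records, duplicates


# Example call (paths as in the cat command above):
# n, dups = concatenate_fasta(
#     ["refseq_viral_genomes.fasta", "mVGs_sequences_v2.fasta"],
#     "raw_db_ctgs.fasta",
# )
```

Duplicate IDs are only reported, not removed, so the output stays identical to what `cat` would produce for well-formed inputs.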
Vsearch: https://github.com/torognes/vsearch
vsearch --cluster_fast raw_db_ctgs.fasta --consout 95_database.fasta --id 0.95 --iddef 0 --maxseqlength 3000000 --threads 6 --usersort
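With `--iddef 0`, VSEARCH uses the CD-HIT definition of identity: matching alignment columns divided by the length of the shorter sequence, so a short contig fully contained in a longer one scores 100%. The following is only an illustration of that formula on an already aligned pair (gaps written as `-`). The function name `cdhit_identity` is hypothetical and this is not VSEARCH's own code.

```python
def cdhit_identity(aligned_a, aligned_b):
    """CD-HIT style identity: matching columns / length of the shorter sequence.

    Both inputs are rows of a pairwise alignment, so they must have the
    same length; '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column matches when both rows carry the same non-gap character.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    shortest = min(
        len(aligned_a.replace("-", "")), len(aligned_b.replace("-", ""))
    )
    if shortest == 0:
        raise ValueError("sequences must not be empty")
    return matches / shortest


# A 4 bp sequence fully contained in an 8 bp one counts as identical:
# cdhit_identity("ACGTACGT", "ACGT----")  -> 1.0
```

Under this definition, `--id 0.95` clusters a sequence with a centroid when at least 95% of the shorter sequence's length is covered by matching columns.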
Script: preprocess.py
Reads from the 214 bulk soil metagenomes were quality trimmed using Trimmomatic v0.36 (ref. 35) and then paired reads were mapped to the viral contig database with Bowtie2 (ref. 36), using default parameters. The output bam files were passed to BamM ‘filter’ v1.7.2 (http://ecogenomics.github.io/BamM/, accessed 15 December 2015) and reads that were aligned over ≥90% of their length at ≥95% nucleic acid identity were retained.
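To make the two thresholds concrete, here is a minimal sketch of how they could be evaluated for a single alignment from its CIGAR string and NM tag (edit distance). It assumes aligned fraction = (matched + inserted bases) / read length, and identity = (alignment columns - NM) / alignment columns. BamM's internal calculation may differ in detail, and the function name `passes_filter` is hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def passes_filter(cigar, nm, min_aligned=0.90, min_identity=0.95):
    """Return True if an alignment meets both thresholds.

    cigar -- CIGAR string of the alignment, e.g. "90M10S"
    nm    -- value of the NM tag (mismatches + inserted + deleted bases)
    """
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Full read length, including soft- and hard-clipped bases.
    read_len = sum(n for n, op in ops if op in "MIS=XH")
    # Read bases that take part in the alignment.
    aligned = sum(n for n, op in ops if op in "MI=X")
    # Alignment columns: matches/mismatches, insertions and deletions.
    columns = sum(n for n, op in ops if op in "MID=X")
    if read_len == 0 or columns == 0:
        return False
    aligned_fraction = aligned / read_len
    identity = (columns - nm) / columns
    return aligned_fraction >= min_aligned and identity >= min_identity


# passes_filter("100M", 2)    -> True  (100% aligned, 98% identity)
# passes_filter("85M15S", 0)  -> False (only 85% of the read is aligned)
```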
First, we need to merge the files per sample:
cat SNG1_R1.fq.gz SNG2_R1.fq.gz > SNG_R1.fq.gz
cat VDN1_R1.fq.gz VDN2_R1.fq.gz > VDN_R1.fq.gz
cat VEV1_R1.fq.gz VEV2_R1.fq.gz > VEV_R1.fq.gz
cat SNG1_R2.fq.gz SNG2_R2.fq.gz > SNG_R2.fq.gz
cat VDN1_R2.fq.gz VDN2_R2.fq.gz > VDN_R2.fq.gz
cat VEV1_R2.fq.gz VEV2_R2.fq.gz > VEV_R2.fq.gz
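The same merge can be scripted. Concatenating gzip files byte for byte gives a valid multi-member gzip file, which is why plain `cat` works on `.fq.gz` inputs. A minimal sketch, assuming the `<sample><run>_<mate>.fq.gz` naming used above; the function names `merge_gzip` and `merge_sample` are hypothetical.

```python
import os
import shutil


def merge_gzip(parts, merged):
    """Concatenate gzip files byte for byte into `merged`."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def merge_sample(sample, directory=".", runs=("1", "2"), mates=("R1", "R2")):
    """Merge all runs of one sample, separately for each mate file.

    The runs are concatenated in the same order for R1 and R2, so that
    read pairs stay at matching positions in the two merged files.
    """
    for mate in mates:
        parts = [
            os.path.join(directory, f"{sample}{run}_{mate}.fq.gz")
            for run in runs
        ]
        merge_gzip(parts, os.path.join(directory, f"{sample}_{mate}.fq.gz"))


# for sample in ("SNG", "VDN", "VEV"):
#     merge_sample(sample)
```

Keeping the run order identical for both mates matters because paired-end mappers match reads across the R1 and R2 files by position.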
- Trimming: Trimmomatic
- Mapping: BWA
- BAM filter: BamM 'filter'
References: