This repository contains a fully reproducible computational pipeline for analyzing the SARS-CoV-2 and chordate mitochondrial metagenomic content of deep sequencing of environmental samples collected from the Huanan Seafood Market by Liu et al (2023).
The pipeline was created by Jesse Bloom.
The analysis and results are described in Bloom et al, Virus Evolution, 9:vead050 (2023).
Key results are in the ./results/ subdirectory. These include:
- results/metadata/merged_metadata.csv: metadata about the samples extracted by processing files provided on the NGDC by Liu et al (2023).
- results/crits_christoph_data/check_sha512_vs_crits_christoph.csv: comparison of SHA-512 hashes for the FASTQ files downloaded from the NGDC to those reported in the earlier analysis by Crits-Christoph et al (2023).
- results/mitochondrial_genomes/retained.csv: the set of chordate mitochondrial genomes to which reads were aligned for the metagenomic analysis.
- results/aggregated_counts/sars2_aligned_by_run.csv: number of aligned SARS-CoV-2 reads for each sequencing run.
- results/aggregated_counts/sars2_aligned_by_sample.csv: number of aligned SARS-CoV-2 reads for each sample. Note that a few samples (such as A20) are listed twice because they were sequenced in two different ways; the description column indicates how each listing was sequenced.
- results/aggregated_counts/mito_composition_by_run.csv: chordate mitochondrial composition for each sequencing run.
- results/aggregated_counts/mito_composition_by_sample.csv: chordate mitochondrial composition for each sample. Note that a few samples (such as A20) are listed twice because they were sequenced in two different ways; the description column indicates how each listing was sequenced.
- results/rt_qpcr/rt_qpcr.csv: SARS-CoV-2 content of samples determined in current sequencing and Ct values from RT-qPCR reported by Liu et al (2023).
- results/plots/susceptible_table.csv: SARS-CoV-2 content of samples with high chordate mitochondrial composition from susceptible species sold live at the market.
- results/plots/susceptible_mammal_table.csv: SARS-CoV-2 content of samples with high mammalian mitochondrial composition from susceptible species sold live at the market.
- results/plots/raccoon_dog_long.csv: SARS-CoV-2 content of all samples ordered by the percent of the chordate mitochondrial composition from raccoon dogs.
- results/contigs/counts_and_coverage/processed_counts.csv: results for aligning assembled contigs to full genomes for selected samples and genomes.
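Because samples such as A20 appear once per sequencing approach in the by-sample files, rows must be keyed on the combination of sample and description rather than on sample alone. A minimal sketch of spotting such duplicate listings (the column names and values here are illustrative assumptions, not taken from the actual files):

```shell
# Build a small stand-in for a by-sample counts file; real column names
# in results/aggregated_counts/*.csv may differ.
cat > demo.csv <<'EOF'
sample,description,aligned_reads
A20,RNA sequencing of total nucleic acids,12
A20,amplicon sequencing,345
B5,RNA sequencing of total nucleic acids,0
EOF
# Count rows per sample to find samples sequenced more than one way.
awk -F, 'NR > 1 {n[$1]++} END {for (s in n) if (n[s] > 1) print s, n[s]}' demo.csv
rm demo.csv
```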
Note that the pipeline also produces many other results files (some of which are very large) that are not tracked in this repo.
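The hash comparison in check_sha512_vs_crits_christoph.csv follows the standard pattern of checking a downloaded file against a recorded SHA-512 digest. A minimal sketch (the file name and hash below are stand-ins, using an empty demo file and its well-known SHA-512, not values from the actual CSV):

```shell
# Check one downloaded file against a recorded SHA-512 hash. Real file
# names and expected hashes would come from
# results/crits_christoph_data/check_sha512_vs_crits_christoph.csv.
printf '' > demo.fastq   # empty stand-in for a downloaded FASTQ file
expected=cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e
actual=$(sha512sum demo.fastq | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    echo "demo.fastq: hash matches"
else
    echo "demo.fastq: hash MISMATCH" >&2
fi
rm demo.fastq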
Interactive plots of the results created using Altair are rendered from the ./docs/ subdirectory via GitHub Pages at https://jbloom.github.io/Huanan_market_samples/
The entire analysis can be run in automated fashion using snakemake.
The pipeline itself is in Snakefile.
The configuration for the pipeline is specified in config.yaml.
The pipeline uses the conda environment in environment.yml, which specifies the precise versions of all software used.
The one exception is that the rule build_contigs
in Snakefile uses an environment module that is pre-built on the Fred Hutch computing cluster to run the Trinity to build contigs---to run this rule, you will need to specify a comparable module for whatever computing system you are using, or skip the contig building by commenting out the file results/contigs/counts_and_coverage/processed_counts.csv
as an input to the all
rule in Snakefile.
The scripts and Jupyter notebooks used by the pipeline are in ./scripts/ and ./notebooks/, respectively.
Most data used by the pipeline is downloaded by the pipeline, but it takes the following to input files, both found in ./data/:
-
data/CRA010170.xlsx is the GSA BioProject metadata sheet downloaded from the NGDC GSA page https://ngdc.cncb.ac.cn/gsa/browse/CRA010170 on March-29-2023.
-
data/positive_table.csv is a version of Supplementary Table 2 from Liu et al (2023), taken from this link (archived here).
To run the pipeline on the Fred Hutch computing cluster, use the commands in run_Hutch_cluster.bash.