Association between SARS-CoV-2 and metagenomic content of samples from the Huanan Seafood Market

This repository contains a fully reproducible computational pipeline for analyzing the SARS-CoV-2 and chordate mitochondrial metagenomic content of the deep sequencing of environmental samples taken from the Huanan Seafood Market by Liu et al. (2023).

The pipeline was created by Jesse Bloom.

The analysis and results are described in Bloom et al., Virus Evolution, 9:vead050 (2023).

Results and plots

Key results are in the ./results/ subdirectory. These include:

Note that the pipeline also produces many other results files (some of which are very large) that are not tracked in this repo.

Interactive plots of the results, created using Altair, are rendered from the ./docs/ subdirectory via GitHub Pages at https://jbloom.github.io/Huanan_market_samples/

Understanding and running the pipeline

The entire analysis can be run in an automated fashion using snakemake.

The pipeline itself is in Snakefile, and its configuration is specified in config.yaml. The pipeline uses the conda environment in environment.yml, which pins the precise versions of all software used. The one exception is the build_contigs rule in Snakefile, which uses an environment module pre-built on the Fred Hutch computing cluster to run Trinity for contig assembly. To run this rule, you will need to specify a comparable module for whatever computing system you are using, or skip contig building by commenting out results/contigs/counts_and_coverage/processed_counts.csv as an input to the all rule in Snakefile. The scripts and Jupyter notebooks used by the pipeline are in ./scripts/ and ./notebooks/, respectively.
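For orientation, a minimal local invocation might look like the sketch below. This is not the repository's documented command set (those are in run_Hutch_cluster.bash), and the conda environment name is an assumption; check environment.yml for the actual name.

```bash
# Minimal sketch of a local run, assuming conda and snakemake are available.
# The environment name below is an assumption: check environment.yml.
conda env create -f environment.yml
conda activate Huanan_market_samples
snakemake --cores 8   # adjust --cores for your machine
```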

Most of the data used by the pipeline is downloaded automatically by the pipeline itself, but it requires the following two input files, both found in ./data/:

To run the pipeline on the Fred Hutch computing cluster, use the commands in run_Hutch_cluster.bash.
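As a rough illustration only (the authoritative commands are in run_Hutch_cluster.bash), a SLURM-based submission on a cluster might look like the following; the time limit and any per-rule cluster settings here are assumptions, not the repository's actual configuration.

```bash
# Hedged sketch of submitting the top-level snakemake job to SLURM.
sbatch --time=3-0 --wrap="snakemake --cores all"
```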

About

Analysis of SARS-CoV-2 read counts versus metagenomic content for Huanan Seafood market sequencing data from Chinese CDC

License: MIT License


Languages

TeX 81.9%, Jupyter Notebook 12.6%, Python 5.4%, Shell 0.1%