rneher / augur

Flu divination

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Augur

Augur is Python package to track (and eventually forecast) flu evolution. It currently

  • imports public sequence data
  • subsamples, cleans and aligns sequences
  • builds a phylogenetic tree from this data

The program is live on Amazon EC2 with results pushed to Amazon S3. The latest JSON-formatted flu tree is available as tree_streamline.json. This tree is visualized at blab.github.io/auspice/.

Run

You can run across platforms using Docker. An image is up on the Docker hub repository as trvrb/augur. With this public image, you can immediately run augur with

docker pull trvrb/augur
docker run -ti -e "GISAID_USER=$GISAID_USER" -e "GISAID_PASS=$GISAID_PASS" -e "S3_KEY=$S3_KEY" -e "S3_SECRET=$S3_SECRET" -e "S3_BUCKET=$S3_BUCKET" --privileged trvrb/augur

This starts up Supervisor to keep augur and helper programs running. This uses supervisord.conf as a control file.

To run augur, you will need a GISAID account (to pull sequences) and an Amazon S3 account (to push results). Account information is stored in environment variables:

  • GISAID_USER: GISAID user name
  • GISAID_PASS: GISAID password
  • S3_KEY: Amazon S3 key
  • S3_SECRET: Amazon S3 secret
  • S3_BUCKET: Amazon S3 bucket

Develop

Full dependency information can be seen in the Dockerfile. To run locally, pull the docker image with

docker pull trvrb/augur

And start up a bash session with

docker run -ti -e "GISAID_USER=$GISAID_USER" -e "GISAID_PASS=$GISAID_PASS" trvrb/augur /bin/bash

From here, the build pipeline can be run with

python augur/run.py

Pipeline notes

Virus ingest, alignment and filtering

Using Selenium to automate downloads from GISAID. GISAID requires login access. User credentials are stored in the ENV as GISAID_USER and GISAID_PASS.

Keeps viruses with full HA1 sequences, fully specified dates, cell passage and only one sequence per strain name. Subsamples to 100 sequences per month for the last 3 years before present.

Align sequences with mafft. Testing showed a much lower memory footprint than muscle.

Keep only sequences that have the full 1701 bases of HA in the alignment.

Tree processing

Using FastTree to get a starting tree. FastTree will build a tree for ~5000 sequences in a few minutes. Then using RAxML to refine this initial tree. A full RAxML run on a tree with ~5000 sequences could take days or weeks, so instead RAxML is run for a fixed 1 hour and the best tree found during this search is kept. This will always improve on FastTree.

Reroot the tree based on outgroup strain, collapse nodes with zero-length branches and ladderize the tree.

About

Flu divination