Syllabus

Comparative Genomics

Yale University, Spring 2019
E&EB 723

Time: Thursday 1-3
Location: ESC 100

Instructor: Casey Dunn
Prepend the subject line of all course related emails with "genomics: "
Office hours: Thursdays 10AM-11:30AM

The field of evolutionary biology is increasingly drawing on genomic data. The field of genomic biology is becoming more evolutionary as genomes are sequenced for a broader diversity of organisms. This course focuses on the evolution of genome sequence and function at macroevolutionary timescales, with an emphasis on building practical computational skills for genomic and phylogenetic comparative analyses. There will be more focus on using phylogenies to understand genome evolution than on using genomes to build phylogenies.

Technical details

Git

This course is organized with GitHub education tools. The course will make heavy use of git for sharing, communication, and collaboration. All students need to have a GitHub account, preferably one that is registered to their Yale email address so that they get the full academic features.

HPC

The class will be given student cluster accounts on one of the Yale High Performance Compute (HPC) clusters. Access to both compute and storage resources will last the course of the semester, after which data need to be copied elsewhere if they are important enough to be saved.

Course format

Classes will consist of lectures, student led discussions, and computational labs.

Course site

All materials for the course, including the syllabus, are available at the course site. The syllabus will be updated as the course progresses, so please check it weekly. Please submit suggestions and corrections for the class via the issue tracker.

Assignments

All assignments will be distributed and submitted via GitHub. Here are the basic steps:

Go to the class repository for the assignment, for example https://github.com/Yale-EEB723/finalproject . Click the "Fork" button to create a fork. This creates your own copy of the repository with its own url, for example https://github.com/YourGitHubUsername/finalproject .
Clone the repository to your laptop. On your repository website, click on the green "Clone or download" button and copy the link there. It will be something like https://github.com/YourGitHubUsername/finalproject.git
Open a Terminal window on your laptop. cd to a directory where you keep your git repositories. Then clone your fork of the repository (substitute the link copied above): git clone https://github.com/YourGitHubUsername/finalproject.git
Edit the files and save changes. Use git add to add any new changes.
Commit the changes with git commit -am "my message" where "my message" describes the changes you made. Commit often as you work.
Push the changes back to GitHub with git push.
Once the assignment is complete, return to the repo page at GitHub, for example https://github.com/YourGitHubUsername/finalproject. Make sure all your changes are reflected there. Save, commit, and push again if not.
Click the "New pull request" button. Then click the "Create pull request" button and submit the pull request. This will notify the instructor that you have submitted the assignment. Some assignments will be submitted multiple times as sequential tasks are completed, which will require multiple pull requests.

Final Projects

Each student will work on a project, either in collaboration or individually. The project must relate to one or more themes covered in the course of the class. Final project plans will be presented in week 3 of the course. After the team and topic are set, fork the repository at https://github.com/Yale-EEB723/finalproject to create a repository for your project. Submit the project as a pull request. Your forked repository can be private if it includes unpublished original data, but if private all course members should be granted access so they can view it and provide feedback.

The final project can consider a research project already in progress (eg something that is part of thesis research), analysis of publicly available data, analysis of simulated data, development and testing of statistical methods or software, etc. Ideally each project will advance the existing research goals of each student, or advance an interesting topic identified in the course.

Here are some suggested final project ideas:

A deep dive on a specific technical challenge of de novo genome sequencing and assembly, eg repeats or heterozygosity
Assembly and annotation of an original or publicly available de novo genome
Examine the evolution of genome structure (eg synteny, size, intron distribution, etc) with phylogenetic methods
Explore the fit of models of evolution to genomic or functional genomic data
Test phylogenetic hypotheses with genomic data
Analyze one or more categories of functional genomic data in a phylogenetic context to test hypotheses about the evolution of genome function
Use comparative functional genomic and/or genomic data to identify genes that may relate to specific phenotypes
Compare within population genome variation to variation at broader phylogenetic scales

Exercises

Several exercises will be assigned. They will usually be started in class, and then due by the following week.

Reading

Reading includes manuscripts, book chapters, online resources, and videos to be watched ahead of class. The readings will be posted by the Monday before each class. In most weeks, the 15-20 minute discussion of the reading will be led by a group of students. All students will get a chance to participate in these groups. A bibliography at the end of this document includes a variety of references that readings can be selected from.

In addition to reading assigned for each class, the following will be used as references throughout the course:

Haddock, SHD and CW Dunn (2011). Practical Computing for Biologists. I wrote this book with my colleague Steve Haddock as an introduction to general computing skills for biologists. If you are not already comfortable at the command line then you should get this book as a reference.
Wickham, H (2017). R for Data Science. http://r4ds.had.co.nz This book is free online at the provided link. It is an excellent introduction to data analysis in R, and more broadly how to think about data structure and analysis. It presents a coherent introduction to the Tidyverse, a set of R packages for general data manipulation and analysis. Our R coding will follow conventions in this book.

Setting up your computer

In this course, you will perform some exercises and analyses on your own laptop in class, and some on the cluster. Below are instructions on how to set up your laptop.

Set up an account at GitHub using your educational email address.

Install git.

Install the Atom text editor.

Install Docker.

Schedule

Week 1, January 17 - Intro to comparative genomics. Questions, methods, history

Reading

Discussion leader: Casey Dunn

Felsenstein, J. 1985. Phylogenies and the Comparative Method. American Naturalist, 125:1–15. https://www.jstor.org/stable/2461605
Dunn CW, Zapata F, Munro C, Siebert S, Hejnol A. 2018 Pairwise comparisons across species are problematic when analyzing functional genomic data. Proc. Natl. Acad. Sci., 115:E409–E417. https://www.doi.org/10.1073/pnas.1707515115

Agenda and notes

Introductions
Discussion of course goals and structure
- Readings
- Projects
- Class formats
Course logistics
- Bring laptop to each class
- github account required
- YCRC account setup description, needed prior to week 3
Review readings
Overview of computational framework and tools

Exercises

First, confirm that docker is working by running a container:

docker run -it rocker/rstudio /bin/bash

Next, we will walk through regular expressions in the exercises at https://github.com/Yale-EEB723/syllabus/blob/master/regular_expressions.txt .

Week 2, January 24 - Sequencing technology and applications

Reading

Discussion leader: Ian Gilman

Goodwin et al. 2016. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics. https://doi.org/10.1038/nrg.2016.49 This review covers a lot of ground. Focus on the bits about Illumina, PacBio, and Oxford Nanopore Technologies (ONT).
Practical Computing for Biologists Chapters 2-3. This optional reading provides background for, and builds on, the regular expressions exercises.

Agenda and notes

Sequencing technology and instruments
- Conceptual overview
  - Single molecule vs. populations of molecules
  - Multiplexing
  - Sequencing overview
    - Sample preparation
    - Data acquisition
    - Data preprocessing
    - Base calling
    - Read processing (trimming, binning, etc) and export
    - Downstream analysis (application specific)
  - Tradeoffs
    - Cost (initial and realtime)
    - Read length
    - Error rate and error profile (base miscalls, phasing noise, homopolymer length, etc)
    - Throughput
    - Hands-on limitations (sample prep cost, instrument portability, ease of use, run time, etc)
- Current sequencing technologies
  - Illumina
    - https://www.youtube.com/watch?v=fCd6B5HRaZ8
    - The recent shift to reduced colors
  - PacBio
    - https://www.youtube.com/watch?v=NHCJ8PtYCFc
    - Very long molecules can be sequenced or the same molecule can be sequenced repeatedly with Circular Consensus Sequencing
  - Oxford Nanopore
    - https://www.youtube.com/watch?v=GUb1TZvMWsw
    - https://www.youtube.com/watch?v=hs0FdiTHMbc
Genome sequencing
- Challenging factors
  - Large size
  - Repeats
  - Heterozygosity
  - Tissue limitation
Take homes
- Focus on inputs and outputs, not intermediates. For example, assembly quality is usually much more important than read quality.
- Take a holistic perspective on costs, including your time. Saving a bit of money on sequencing can sometimes incur large data analysis costs, for example.
- Focus your time and resources on what differentiates your project from others.
- Always, always be thinking about the end goal and evaluate intermediate decisions in terms of these final objectives.

Exercises

We will walk through regular expressions in the exercises at https://github.com/Yale-EEB723/syllabus/blob/master/regular_expressions.txt .

To get the files for the exercises, make a local clone of the syllabus repository:

git clone https://github.com/Yale-EEB723/syllabus.git

Other

Hand out git chapter
Ask students to prepare to present preliminary ideas on final projects next week

Week 3, January 31 - Practical computing skills

Reading

Jain et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotech. https://doi.org/10.1038/nbt.4060. (Discussion leaders: Edgar Benavides and Vincent Dimassa)
Practical Computing for Biologists, git chapter draft (to be provided as hard copy in the previous week)
Practical Computing for Biologists, chapters 4-6 and 20. This optional reading provides background on working in bash and remote access to computers.

Agenda and notes

Quick (<1 minute) description of final project plan
Working on your laptop
Working on the cluster via your account
Getting started with git

Ben's cluster presentation

Week 4 - Genome assembly

Reading

Sedlazeck et al. 2018. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0003-4 (Discussion leader: Ava Ghezlayagh )

Agenda

The agenda:

Review paper
Discuss final projects
Walk through alignment exercise
Discuss items below about assembly

Notes

We often want to know the full sequence of a genome, but data are fragmented and redundant because:

DNA isolation leads to fragmentation by mechanical and chemical processes
Sample preparation leads to fragmentation by mechanical and chemical processes, often deliberately to adjust the length of input molecules
Sequencing often doesn’t span the full length of a molecule, due to technical limitations, damaged template, damaged sequencing
The same regions of the genome are sequenced multiple times because:
- It is far easier to randomly sequence regions of the genome than to systematically tile sequencing effort across the genome. To sequence with enough depth to ensure coverage everywhere, some places will have quite deep coverage.
- Sequencing is error prone. Sequencing each spot multiple times enables error detection and correction.
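
The link between random sampling and uneven depth can be seen in a small simulation. The sketch below is only an illustration, assuming reads of fixed length placed uniformly at random on a linear genome; the function name simulate_coverage and all parameter values are hypothetical. Under this simplified model the fraction of bases with no coverage is approximately exp(-depth), the usual Poisson approximation.

```python
import math
import random


def simulate_coverage(genome_length, read_length, n_reads, seed=0):
    """Place reads uniformly at random and return the depth at each base."""
    rng = random.Random(seed)
    coverage = [0] * genome_length
    for _ in range(n_reads):
        start = rng.randrange(genome_length - read_length + 1)
        for pos in range(start, start + read_length):
            coverage[pos] += 1
    return coverage


# Hypothetical values chosen only to keep the example fast.
genome_length = 10_000
read_length = 100
mean_depth = 5
n_reads = mean_depth * genome_length // read_length

coverage = simulate_coverage(genome_length, read_length, n_reads)
observed_mean = sum(coverage) / genome_length
uncovered = sum(1 for depth in coverage if depth == 0) / genome_length
expected_uncovered = math.exp(-mean_depth)  # Poisson approximation

print(f"mean depth: {observed_mean:.2f}")
print(f"maximum depth: {max(coverage)}")
print(f"fraction uncovered: {uncovered:.4f} (Poisson expectation {expected_uncovered:.4f})")
```

Raising mean_depth shrinks the uncovered fraction quickly, but the maximum depth climbs as well, which is the redundancy described above.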

In the broad sense,

Sequencing takes large biological molecules in and generates character strings in computer memory that are redundant overlapping estimates (reads) of subsequences of the original molecules. Underlying this is a generative model, ie an idea of how molecular structure impacts observed read sequences.
Assembly is in some respects the reverse process - it starts with reads and generates an estimate of the sequences of the input biological molecules. It runs the generative model in reverse, and generates descriptions of molecules in computer memory rather than

Common tasks:

de novo assembly: going from reads (and sometimes additional new structural data) to a genome assembly without reference to an existing assembly.
mapping: Tiling reads onto an existing reference genome sequence. Used to assess how the reads cover the reference sequence (more on this later when we discuss functional genomics), or to identify how the genome from which the reads are derived differs from the reference genome.
reference based assembly: assembling a genome by reference to an existing genome, usually by mapping new reads to an existing reference genome sequence. Does not require as much sequencing depth as de novo assembly, also much easier computationally.

The challenge:

Find similar sequences
Categorize differences between similar sequences according to whether they are
- Sample prep errors
- Sequencing errors
- Different regions of the genome (eg paralogy, repeats)
- Different alleles
- Mixtures (eg somatic variations)
Estimate the sequence of the original molecules

The general assembly process (see Figure 1 from https://doi.org/10.1038/s41576-018-0003-4)

Identify overlaps between reads
Construct a string graph
- Nodes are unambiguous sequences
- Edges are possible connections between those sequences
- Each path through the string graph is a possible assembly
Contig construction
- modify and traverse the string graph to derive contig sequences
  - Errors create bubbles with low coverage that can be popped
  - Some nodes cannot be combined with neighboring nodes because there isn't enough information to know which alternative path to take, and those nodes are emitted as contigs
  - Adjacent nodes (ie nodes connected directly by edges) can often be combined
- Ends of contigs are usually due to ambiguity of some sort: there are multiple paths and the assembler doesn't know which to take, so it chops them all off
  - Sources of ambiguity include error, repeats not spanned by reads, or heterozygosity
  - As read quality and length improve, extent of contigs determined in larger part by heterozygosity
    - Makes heterozygous-aware assembly even more important
Scaffolding
- Physical ordering of contigs, sometimes introducing gaps
- Usually based on additional structural information
  - HiC
  - Optical mapping

Error correcting can occur at multiple steps

Improvements to base calling
Use short high quality reads to correct lower quality long reads
- Risks introducing errors, eg by mistaking one instance of a repeat for another similar instance of the same repeat
Use short high quality reads to correct contigs derived from lower quality long reads
- Same risks as above
Use lower quality long reads to correct each other (requires greater depth)
- In the last year this is where things have started to head

Characterizing an assembly

contiguity (eg N50)
completeness (eg BUSCO)
correctness (eg base level, structural, phasing)
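
As a concrete example of a contiguity statistic, N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch follows; the function name n50 and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


example_lengths = [80, 70, 50, 40, 30, 20, 10]  # total of 300 bases
print(n50(example_lengths))  # 70, because 80 + 70 reaches half of 300
```

N50 says nothing about completeness or correctness, so a misassembly that joins contigs incorrectly can raise it.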

Genome assembly challenging factors

Large size
Repeats
Heterozygosity
There are tools for assessing all of these before attempting a full assembly, eg https://github.com/schatzlab/genomescope

Phasing

Collapse haplotypes into a single consensus. Can introduce many errors and fragment the assembly
Assemble into regions that are collapsed and unzipped
- Can arbitrarily resolve into primary assembly and alternate haplotigs, or pseudophased diploid genome
- Phase into two haploid genomes

Core algorithmic concepts

Similarity and extension
- Identify similarity by identification of similar seed sequences in different reads
  - Expensive because each read needs to be compared to every other read
  - Exact matches are really fast, but often need to allow for variation due to errors
  - A few methods
    - Canu uses MHAP, a kmer method
    - Falcon etc use DALIGNER, dynamic programming and kmers
- Extension searches for regions of similarity beyond the seed.
  - It essentially sees if it is possible to zip the reads together starting at the seed
  - Extension is generally not as expensive as initial identification of similarity, because extension is
k-mers
- Short sequences of length k (often 15-70 nucleotides)
- Very cheap to work with
  - Defined memory footprint
  - If short relative to frequency of errors, can focus on exact matches
  - Easy to code
- Hash tables map k-mer sequences to values, often a count of how many times the sequence occurs
- de Bruijn assembly
- identification of similarity seeds

Week 5 - Genome annotations

Reading

Yandell and Ence. 2012. A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics. https://doi.org/10.1038/nrg3174 (Discussion leaders: Diego and Jasmine)

Agenda

The agenda:

Discuss assembly notes from last class
Go over today's paper
Identify two papers for next week
Walk through forking, cloning, and editing final project git repository. Make preliminary edits and push them.
- Add a few sentences for Goal and Data sections. Make a list of tools you plan to use in the Methods section. Send a pull request when you have added these stubs.
Explain how to finish alignment exercise from last class as an assignment
Discuss genome annotation notes

Notes

A genome assembly is just a big fasta file. Much more interesting with annotations.

Annotation goals include:

Identifying repeats
- Biologically interesting
- Technically important for understanding genome assembly and so that repeats can be masked for some downstream analyses that they can negatively impact
Identifying protein coding genes
- Build inventory of protein coding genes
- Identify introns, exons, promoters
- Predict mRNA structure
Identifying other regions
- Noncoding RNA sequences
- Promoters
- etc...
Understanding genome structure
- Centromeres, telomeres, how scaffolds map to chromosomes

The set of annotations can be encoded in standard formats and loaded into graphical browsers or interrogated computationally (as when comparing genome features across species)

Annotations can be made

With ab initio methods based on the understanding of what these structures look like or particular properties they have
Based on new evidence, like mRNA-seq
Based on comparison to a reference of similar sequences
- Looking for known repetitive DNA elements
- Blasting known protein coding genes

Assessing annotations

Sensitivity = (True positives) / (True positives + False negatives)
- BUSCO is a common tool for this
Specificity = (True positives) / (True positives + False positives)
Accuracy = (Sensitivity + Specificity) / 2. If accuracy is high you know you have good sensitivity and specificity, but if it is low there could be problems with sensitivity, specificity, or both.
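
The three measures above can be computed directly from counts of true positives, false negatives, and false positives. The sketch below uses the definitions exactly as given in these notes; note that the quantity called specificity here is what many other fields call precision. The function name and the example counts are hypothetical.

```python
def annotation_metrics(true_positives, false_negatives, false_positives):
    """Return sensitivity, specificity, and accuracy as defined in the notes above."""
    sensitivity = true_positives / (true_positives + false_negatives)
    specificity = true_positives / (true_positives + false_positives)
    accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy


# Made-up counts: 80 annotated features are correct, 20 real features were missed,
# and 10 annotated features are wrong.
sensitivity, specificity, accuracy = annotation_metrics(80, 20, 10)
print(f"sensitivity: {sensitivity:.3f}")
print(f"specificity: {specificity:.3f}")
print(f"accuracy:    {accuracy:.3f}")
```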

Repeats

The extent and composition of repeats varies widely across species
Many repeats are well-known conserved sequences
- LINEs Long Interspersed Nuclear Elements. Retrotransposons about 7kb in length. They are transcribed and encode a reverse transcriptase that facilitates integration at new sites.
- SINEs Short Interspersed Nuclear Elements. 100-700 bp retrotransposons. They do not have their own reverse transcriptase, so they are dependent on those of other elements
Repeat annotation is complicated by the fact that different instances of the same repeat are not identical, and may be quite different
Tools for identifying and masking repeats
- RepeatMasker

Protein coding genes

Multiple nested annotation steps
- Gene finding
- What is the entire region needed for the gene to function, including upstream regions, the transcribed region, and downstream regions
- What region is transcribed
  - Identify promoter where transcription is initiated
    - For eukaryotic protein coding genes, the binding site of RNA Polymerase II
  - Identify the location where transcription is terminated and mRNA is polyadenylated
- Introns and exons
- Mature mRNA sequence prediction
- Protein sequence prediction
Commonly used tools
- AUGUSTUS http://bioinf.uni-greifswald.de/augustus/ . Developed for human genome.
- MAKER http://www.yandell-lab.org/software/maker.html . Annotation workflow that integrates multiple tools.
- Funannotate https://funannotate.readthedocs.io/en/latest/ . Initially made for fungal genomes.

Week 6 - Genome annotations

Reading

Discussion leaders: Elyse Parker and Dan MacGuigan

Holt and Yandell. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. https://doi.org/10.1186/1471-2105-12-491

Kim et al. 2018. The genome of common long-arm octopus Octopus minor. GigaScience. https://doi.org/10.1093/gigascience/giy119

Markdown guide - https://guides.github.com/features/mastering-markdown/

Agenda

Notes

Success of ab initio methods

Tests of ab initio methods work well in species with well annotated genomes, because the gene training datasets are so good. It is easy to find genes when you know what they look like.
Ab initio methods work less well alone in poorly studied species without good existing gene models to serve as training datasets. Additional evidence, like RNAseq, greatly improves outcome in these projects.

Annotation Edit Distance (AED)

A measure of the distance between intron and exon coordinates of two annotations for the same gene, where 0 is identical and 1 is completely different
This distance indicates incongruence between annotation methods and is interpreted as uncertainty about an annotation
Genes with higher AED tend to
- Have fewer PFAM domains
- Change more in subsequent annotations that include additional information
- Have less evidence of orthologs in closely related species

MAKER is now part of the GMOD family of genomic tools - http://gmod.org/wiki/Main_Page

Ian notes that conda install -c bioconda maker works well for installation

A detailed guide on how to use MAKER - https://gist.github.com/darencard/bb1001ac1532dd4225b030cf0cd61ce2

Identify and mask repeats
Build gene models based on RNAseq data and homology to genes of related species
Train ab initio programs based on these evidence-based models
Rerun MAKER with ab initio prediction
Rerun MAKER again to refine everything

Mapping

Mapping is a general task that comes up in many types of genomic analyses.
It is a highly asymmetric sequence comparison, usually between reads and reference sequences
Burrows-Wheeler transform
- Transforms the sequence to a sorted string that is easy to compress
- Can work with compressed strings, which is more computationally efficient
- Has the very special property that the original string can be recovered from the sorted form
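
The Burrows-Wheeler transform and its inverse are short enough to write out naively. This is a teaching sketch only: it sorts all rotations of the text, which is far too slow and memory hungry for real genomes, and mapping tools use suffix arrays and FM-indexes instead. The function names are hypothetical, and "$" is assumed to be a terminator that does not occur in the text and sorts before every other character.

```python
def bwt(text, terminator="$"):
    """Return the last column of the sorted rotations of text + terminator."""
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending the transform and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))             # annb$aa
print(inverse_bwt(bwt("banana")))  # banana

sequence = "GATTACAGATTACA"
transformed = bwt(sequence)
print(transformed)
print(inverse_bwt(transformed) == sequence)  # True
```

The transform tends to group identical characters into runs when the text contains repeated substrings, which is what makes the result easy to compress.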

Week 7 - Functional genomics

Reading

Discussion leader: Andrew Verdegaal

Primary paper:

Zhang et al. 2019. Comparative Analysis of Droplet-Based Ultra-High-Throughput Single-Cell RNA-Seq Systems. Molecular Cell. https://doi.org/10.1016/j.molcel.2018.10.020

Background (optional paper that covers many technical concepts in the above paper):

Hrdlickova et al. 2016. RNA-Seq methods for transcriptome analysis. WIREs RNA. https://doi.org/10.1002/wrna.1364

Agenda

Notes

RNAseq

Randomly sequence RNA from a cell or collection of cells
In nearly all cases mRNA is reverse transcribed into complementary DNA (cDNA) and that is analyzed with short read DNA sequencing (eg Illumina)
The frequency of reads for each gene is used as a proxy for abundance of transcripts of each gene (often referred to as expression)
If a suitable high quality reference is available, it can also be used as a proxy for abundance of splice variants.
Replaced Expressed Sequence Tags (ESTs), shotgun sequencing of cDNA with Sanger sequencing
Full length sequencing of cDNA (eg PacBio isoseq, Oxford Nanopore cDNA sequencing) provides a better understanding of splice variants than short reads, but lower throughput provides less statistical power.
What it does and doesn't do
- It does not measure absolute transcript count
- It does not measure the relative expression of different genes
- It does come close to measuring the differential expression of the same gene in different tissues.
Analysis steps
- Map to reference, which can be gene models from genome or transcriptome assemblies. Many short read aligners can do this, eg Bowtie
- Process mapping data to derive the counts of reads for transcripts and genes.
GRO-seq
- Actively transcribed mRNA is labeled with BrdU, isolated, reverse transcribed, and sequenced.
- Similar to RNA-seq, but only sequences mRNA that is transcribed during BrdU pulse.
- Gives a way to observe active transcription rather than just transcript abundance.
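
The RNAseq analysis steps above end with a table of read counts per gene. The sketch below shows one simple thing that can be done with such counts: scale each library to counts per million (CPM), then compare the same gene between two tissues as a log2 ratio. It is a simplified illustration with made-up counts and hypothetical function names, and CPM is an assumption here rather than the course's method; real analyses use dedicated statistical packages and replicates.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each library sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(cpm_a, cpm_b, gene, pseudocount=1.0):
    """Log2 ratio of one gene between two libraries; the pseudocount avoids log of zero."""
    return math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))


# Made-up read counts for two genes in two tissues sequenced to different depths.
tissue_a = {"geneA": 100, "geneB": 900}
tissue_b = {"geneA": 400, "geneB": 1600}

cpm_a = counts_per_million(tissue_a)
cpm_b = counts_per_million(tissue_b)

# The raw count of geneA is 4 times higher in tissue_b, but the library is also
# twice as deep, so the scaled difference is about 2-fold (log2 close to 1).
print(f"geneA log2 fold change: {log2_fold_change(cpm_a, cpm_b, 'geneA'):.3f}")
```

The comparison is made within a gene and across tissues, in line with the caution above that read counts do not give the relative expression of different genes.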

Week 8 - Genome structure

Please post two commits during the week before this class.

Reading

Discussion leader: Nick Fisk

Rowley and Corces. 2018. Organizational principles of 3D genome architecture. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0060-8

Agenda

Notes

Definitions
- Topologically Associating Domains (TADs)
- Nucleosome - A set of 8 histone proteins (two each of H2A, H2B, H3, and H4) wrapped by about 147bp of DNA, along with about 80bp of linker DNA that connects to the next nucleosome. If not packed further, they look like beads on a string. Addition of H1 packs nucleosomes into a further coiled 30 nm fiber.
- Compartmental domains - a linear domain containing one or more genes that is in the same transcriptional or chromatin state.
- Transcription complex - the association of DNA-bound proteins, including polymerase and transcription factors, that is responsible for the initiation of transcription.
- Promoter - the site where the transcription complex assembles. Includes the TATA box.
- Enhancer - a sequence that enhances the transcription rate of a gene. May be upstream, within, or downstream of a gene. Can be quite far.
- Conserved Noncoding Elements (CNE) - regions outside of coding regions that show high sequence conservation between species.
- Genomic Regulatory Blocks (GRB) - clusters of syntenic CNEs.

Several components of genome consideration

1D linear structure
- Linkage, physical proximity as originally detected by low recombination rate
- Synteny, the conservation of linear gene location (ie linkage)
Chromatin accessibility and association with proteins such as transcription factors
3D packing of chromatin in nucleus

Packing of DNA in space and time

https://en.wikipedia.org/wiki/Chromatin#/media/File:Chromatin_Structures.png
Interphase packing is most relevant to most functional genomics questions
Loop extrusion
- https://www.youtube.com/watch?v=Tn5qgEqWgW8
- Acts on beads-on-a-string packed DNA
- Loop extent controlled by CTCF

Sample prep / sequencing tools relevant to genome structure

Innovations in sample prep and enrichment allow sequencers to be adapted into instruments that measure all kinds of structural and functional features.
Hi-C
- Crosslink chromatin
- Digest, label, and ligate
- Fragment, isolate, and sequence
- Results in read pairs that come from regions that were in close physical proximity
- Uses
  - Identify sequences that were in same cell together, eg to assist with metagenomes
  - Identify linked sequences, eg to assist with assembly scaffolding
  - Identify how chromosomes are packed in nucleus
- HiC related methods https://www.nature.com/articles/nrg3454 Box 1
  - 3C, chromosome conformation capture. Provides many interactions for sites throughout the genome.
  - 4C, circular 3C. Observe the regions that interact with a particular locus. Allows deeper data for one spot.
  - 5C, investigate how associations correspond to other processes, like transcription.
  - ChIA-PET, interrogate long range interactions facilitated by particular proteins.
- Interpreting HiC data - https://www.nature.com/articles/nrg3454/figures/1
ATACseq
- Assay for Transposase Accessible Chromatin using sequencing
- Transposase preferentially targets regions that are free of nucleosomes
- Can also map nucleosome composition
ChIPseq
- CHromatin ImmunopreciPitation Sequencing
- Use antibodies to enrich chromatin with specific proteins, eg transcription factors
- Sequence isolated Chromatin
- Identify transcription factor binding sites etc...

The relationship between genome structure and genome function

Structure may regulate transcription
- Harmston https://doi.org/10.1038/s41467-017-00524-5
  - "GRB boundaries coincide with the boundaries of TADs"
  - TADs may be highly conserved regulatory domains
  - Presents a cautionary tale
    - An intronic SNP in FTO is strongly associated with obesity.
    - Many theses and drug companies tried to figure out what FTO does to target it
    - But it turns out that the intronic FTO SNP is in a CNE (conserved non-coding element) that physically associates with a distant gene and regulates IRX3
Structure may be a result of transcriptional state
- Rowley https://doi.org/10.1038/s41576-018-0060-8
Compartmental domains may be more highly conserved than specific sequences

Week 9 - Phylogenetic comparative methods

Reading

Please read the following material from Harmon 2018 https://lukejharmon.github.io/pcm/ :

Chapter 3 Brownian Models (a detailed reading of section 3.3 is not required, though it is quite interesting).
Chapter 7

Agenda

Trait evolution

Background from first Paul Lewis Likelihood lecture, through slide 40.
- Exercise: Make a Newick tree, for example (A,(B,(C,(D,E)))) or (A:4,(B:3,(C:2,(D:2,E:1)))), and view it at http://etetoolkit.org/treeview/ .
Lecture on the evolution of discrete traits in the context of molecular sequence data. Slides at https://github.com/Phylogenetics-Brown-BIOL1425/phylogeneticbiology/blob/master/lectures/Lecture_3.pdf .
Discussion of discrete traits Chapter 7 from Harmon reading.
Discussion of continuous traits Chapter 3 from Harmon reading - https://lukejharmon.github.io/pcm/chapter3_bmintro/.
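
For the Newick exercise above, it can help to see how the nested parentheses map onto a tree data structure. The sketch below is a minimal recursive parser written for this illustration only; it assumes unquoted labels, no comments, and treats missing branch lengths as 0. The function names are hypothetical, and real projects should use an established phylogenetics library instead.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, and children."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": 0.0, "children": []}
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node["children"].append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional label, read up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node["name"] = text[start:pos]
        # Optional branch length after a colon.
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return parse_node()


def tip_depths(node, depth=0.0):
    """Return the summed branch length from the root to each tip."""
    depth += node["length"]
    if not node["children"]:
        return {node["name"]: depth}
    depths = {}
    for child in node["children"]:
        depths.update(tip_depths(child, depth))
    return depths


tree = parse_newick("(A:4,(B:3,(C:2,(D:2,E:1))))")
print(tip_depths(tree))
print(list(tip_depths(parse_newick("(A,(B,(C,(D,E))))"))))  # tip names only
```

In the second example tree from the exercise, the internal branches have no lengths, so each tip's depth is just its own terminal branch length.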

Application of model based approaches to genomes: beyond sequence evolution

Notes

Week 10 (April 4) - Genome evolution

Reading

Discussion leader: Arianna Lord

Zhao and Schranz. 2019. Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. PNAS. https://www.pnas.org/content/116/6/2165

Agenda

Paper discussion
Everyone provides summary of goal and state of their project
Identify overlap between projects
Time to work on projects, including discussion of shared aspects of projects

Notes

Week 11 (April 11) - Comparative functional genomics

Make at least 4 commits to your project before today's class.

Reading

Discussion leaders: Jessica Glass and Spencer Irvine

Bhattacharya et al. 2016. Comparative genomics explains the evolutionary success of reef-forming corals. eLife. https://doi.org/10.7554/eLife.13288

Agenda

Paper discussion
Pair-review projects
Work on projects

Notes

Week 12 (April 18) - Project

Project presentations:

Nick Fisk
Ian Gilman
Arianna Lord

Week 13 (April 25) - Project

Bibliography

You can suggest references to add to this list via a pull request or the issue tracker. The intent of this bibliography is to serve as a resource for class participants in their own work and as a list of potential readings for class.

Genome sequencing

Goodwin et al. 2016. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics. https://doi.org/10.1038/nrg.2016.49

Heather and Chain. 2016. The sequence of sequencers: The history of sequencing DNA. Genomics. https://doi.org/10.1016/j.ygeno.2015.11.003

Reuter et al. 2015. High-Throughput Sequencing Technologies. Molecular Cell. https://doi.org/10.1016/j.molcel.2015.05.004

Shendure et al. 2017. DNA sequencing at 40: past, present and future. Nature. https://doi.org/10.1038/nature24286

Vurture et al. 2017. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. https://doi.org/10.1093/bioinformatics/btx153

Genome assembly

Alkan et al. 2011. Genome structural variation discovery and genotyping. Nature Reviews Genetics. https://www.nature.com/articles/nrg2958

Bradnam et al. 2013. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. https://doi.org/10.1186/2047-217X-2-10

Gurevich et al. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics. https://doi.org/10.1093/bioinformatics/btt086

Koren et al. 2018. De novo assembly of haplotype-resolved genomes with trio binning. Nature Biotechnology. https://doi.org/10.1038/nbt.4277 . Good overview of phasing.

Paajanen et al. 2019. A critical comparison of technologies for a plant genome sequencing project. GigaScience. https://doi.org/10.1093/gigascience/giy163

Rice and Green. 2019. New Approaches for Genome Assembly and Scaffolding. Annual Review of Animal Biosciences. https://doi.org/10.1146/annurev-animal-020518-115344

Schatz et al. 2010. Assembly of large genomes using second-generation sequencing. Genome Research. https://doi.org/10.1101/gr.101360.109

Sedlazeck et al. 2018. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0003-4

Sohn and Nam. 2016. The present and future of de novo whole-genome assembly. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbw096

Example genome projects

Edgar et al. 2018. Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity. GigaScience. https://doi.org/10.1093/gigascience/gix124

Jain et al. 2018. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology. https://doi.org/10.1038/nbt.4060 . (For an interesting set of follow-up analyses see https://genomeinformatics.github.io/na12878update/ )

Kim et al. 2018. The genome of common long-arm octopus Octopus minor. GigaScience. https://doi.org/10.1093/gigascience/giy119

Mohr et al. 2017. Improved de novo Genome Assembly: Linked-Read Sequencing Combined with Optical Mapping Produce a High Quality Mammalian Genome at Relatively Low Cost. bioRxiv. https://doi.org/10.1101/128348

Wenger et al. 2018. Highly-accurate long-read sequencing improves variant detection and assembly of a human genome. bioRxiv. https://www.biorxiv.org/content/10.1101/519025v2

Jiang et al. 2018. A Hybrid de novo Assembly of the Sea Pansy (Renilla muelleri) Genome. bioRxiv. https://doi.org/10.1101/424614

Vertebrate Genomes Project - https://vgp.github.io/genomeark/

Genome annotation

Aken et al. 2016. The Ensembl gene annotation system. Database. https://doi.org/10.1093/database/baw093

Holt and Yandell. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. https://doi.org/10.1186/1471-2105-12-491

Mudge and Harrow. 2016. The state of play in higher eukaryote gene annotation. Nature Reviews Genetics. https://doi.org/10.1038/nrg.2016.119

Yandell and Ence. 2012. A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics. https://doi.org/10.1038/nrg3174

Genome structure

A collection of papers on 3D genome structure in Nature: https://www.nature.com/collections/rsxlmsyslk

Li et al. 2011. Chromosome Size in Diploid Eukaryotic Species Centers on the Average Length with a Conserved Boundary. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msr011

Harmston et al. 2017. Topologically associating domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation. Nature Communications. https://doi.org/10.1038/s41467-017-00524-5

Rowley and Corces. 2018. Organizational principles of 3D genome architecture. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0060-8

Spielmann et al. 2018. Structural variation in the 3D genome. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0007-0

Functional genomics

Hrdlickova et al. 2016. RNA-Seq methods for transcriptome analysis. WIREs RNA. https://doi.org/10.1002/wrna.1364

Klemm et al. 2019. Chromatin accessibility and the regulatory epigenome. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0089-8

La Manno et al. 2018. RNA velocity of single cells. Nature. https://doi.org/10.1038/s41586-018-0414-6

Zhang et al. 2019. Comparative Analysis of Droplet-Based Ultra-High-Throughput Single-Cell RNA-Seq Systems. Molecular Cell. https://doi.org/10.1016/j.molcel.2018.10.020

Other sequencing applications and perspectives

Cheifet 2019. Where is genomics going next? Genome Biology. https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-019-1626-2

Pomerantz et al. 2018. Real-time DNA barcoding in a rainforest using nanopore sequencing: opportunities for rapid biodiversity assessments and local capacity building. GigaScience. https://academic.oup.com/gigascience/article/7/4/giy033/4958980

Phylogenetic comparative methods

Harmon 2018. Phylogenetic Comparative Methods. https://lukejharmon.github.io/pcm/

My previous course on phylogenetic biology - https://github.com/Phylogenetics-Brown-BIOL1425/phylogeneticbiology

Reconstructing the history of genome evolution

Kim et al. 2017. Reconstruction and evolutionary history of eutherian chromosomes. PNAS. https://doi.org/10.1073/pnas.1702012114

Damas et al. 2018. Reconstruction of avian ancestral karyotypes reveals differences in the evolutionary history of macro- and microchromosomes. Genome Biology. https://doi.org/10.1186/s13059-018-1544-8

O'Connor et al. 2018. Reconstruction of the diapsid ancestral genome permits chromosome evolution tracing in avian and non-avian dinosaurs. Nature Communications. https://www.nature.com/articles/s41467-018-04267-9

Comparative genomics

Compara - pre-built comparative genomics analyses. https://useast.ensembl.org/info/genome/compara/index.html

Alföldi and Lindblad-Toh. 2013. Comparative genomics as a tool to understand evolution and disease. Genome Research. https://doi.org/10.1101/gr.157503.113

Roy and Gilbert. 2005. Rates of intron loss and gain: Implications for early eukaryotic evolution. PNAS. https://doi.org/10.1073/pnas.0500383102

Simakov and Kawashima. 2017. Independent evolution of genomic characters during major metazoan transitions. Developmental Biology. https://doi.org/10.1016/j.ydbio.2016.11.012

Zhao and Schranz. 2019. Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. PNAS. https://www.pnas.org/content/116/6/2165

Making associations between genomes and phenotypes with comparative methods

Hiller et al. 2012. A "Forward Genomics" Approach Links Genotype to Phenotype using Independent Phenotypic Losses among Related Species. Cell Reports. https://doi.org/10.1016/j.celrep.2012.08.032

Sharma et al. 2018. A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nature Communications. https://doi.org/10.1038/s41467-018-03667-1

General computational skills and methods

Links and other general resources

MarkDown guide - https://guides.github.com/features/mastering-markdown/ . Useful for writing text that will be pushed to GitHub, among many other things.
