Here is a constantly updated set of resources that I have found useful and that should help prime you for working with 'high-dimensional data'. I will keep updating the table as you go along, but please add to the list, with your own notes, if you find something you would recommend. This field is constantly developing, and the techniques described are used across a spectrum of disciplines, so you may find answers to your questions on websites that are not immediately obvious to you!

  • Please sign up for an Amazon Web Services account using your educational email address, to access their 'free tier', where you can experiment. Please then join 'AWS Educate', so you can obtain free credits for your account, to use more powerful hardware. On AWS, always use 'Europe - London (EU-West-2)'. Set up a t2.micro (free-tier) 'EC2' instance, with 30GB (free-tier) of 'EBS' storage. If this is confusing, I agree! Google helps a lot.

  • Please sign up for a GitHub account.

  • Note: Do not install R version 4.0.0! Stay on 3.6.1. Do a quick google to find out why, and we can discuss it if you are interested!


Topic | Link(s) | Notes
Next-generation sequencing | StatQuest Intro to RNA-Seq | NGS is getting cheaper every day, and the volume of data being output is huge. It is vital to understand the principles of NGS and the basics of how an 'Illumina Sequencer' works. Find a video/article that works for you and we will talk about it.
Learn the command line | Codecademy | A large amount of pre-processing has to be done to our raw data before we visualise and analyse it in software such as R or Python. The tools we use to pre-process raw data are designed to be run 'from the command line', specifically the 'Bash' command line. Codecademy provides a great primer on how to use some of its functions, which will be essential when you need to handle the raw data files.
Fast-what? | Fasta/Fastq/SAM | An example of a great StackOverflow answer discussing the different file formats you will encounter in genetic data.
PCA and many other guides | PCA StatQuest | StatQuest often provides a good basic introduction to many statistical concepts. PCA is integral to our work, as we will see later on. I recommend watching other StatQuest videos, and then exploring other videos/online resources if you are interested.
Learn Base R | Codecademy | This site provides an environment to learn the very basics of R in a short space of time.
Learn R the 'tidy' way | R4DS | Follow this guide's instructions on installing RStudio and the tidyverse package, and use the example datasets it provides to test out some of R's analysis and plotting features. We will use these features with our own data, once we have pre-processed it.
StackExchange/Overflow | StackOverflow | I highly encourage getting in the habit of googling errors you encounter, as you will find the answer to these errors on sites such as StackOverflow 99% of the time. Googling problems and finding solutions is a valid learning method!
Biostars | Biostars | A forum providing discussion topics on many aspects of genetic data analysis.
Bioconductor | Bioconductor | Data analysis packages for R are often submitted to 'Bioconductor', a repository of packages that are required to be maintained and documented to a certain standard, to ensure the public can use the package properly. An example package is 'DESeq2', below. While you are on Bioconductor, check out the list of top packages to see what packages people are using, and think about why you might want to use them. They are going to help you!
Differential Expression Analysis | DESeq2 Vignette | DESeq2 is often the best R package for differential gene expression analysis. This link will take you to the user guide, which is referred to as a 'vignette' for Bioconductor packages. Start to notice the formats of data that DESeq2 accepts and how DESeq2 normalises samples, which is explained plainly in the StatQuest video below.
DESeq2 Analysis Workflow | Vignette | The lead author of DESeq2, Mike Love, also maintains a workflow vignette that you may find easier to interpret in the early stages. This workflow uses an example RNA-seq dataset, 'airway'. This dataset can be installed in R as a package (google this!). Notice how, in the workflow, he refers to the PubMed accession and GEO accession for 'metadata' and 'raw data': these are important sources of information about the experimental details.
Library normalisation | StatQuest | The first video of a series on normalisation. This is one of many resources you will find on the internet about the importance of normalisation in data analysis. You may find that watching the FPKM/TPM video alongside helps.
Visualising data with Shiny | Shiny Gallery | Getting familiar with different ways of plotting data helps you understand your data better and share your findings. R provides 'Shiny', a package for turning your analysis results into a great-looking interactive website that you can share. Check out the gallery and then google some examples of RNA-seq analyses that are in R Shiny format.


Here are some of the papers from our lab that concern RNA-seq data. Make a note of the methods used to produce and analyse the data, and think of the challenges that may be present in analysing it. We will be using these datasets by first replicating the analysis, then looking to compare independent studies (GBA & LRRK2 - Tara), or investigate splicing (LRRK2 transcriptome-wide - Eugenio, LRRK2 specifically - Guusje).

Analysis package papers/examples of interesting analysis methods

RNA-Seq/Statistics papers

Packages/commands often used

This is a growing list of packages/commands that I use very often

- ls, cd, mv, cp, rm, htop, ssh, screen, cut, sort, uniq, >, |, for i in *, parallel, echo
- biomaRt
- tidyverse (ggplot2, dplyr, tidyr...)
- prcomp
- kallisto
- tximport
- DESeq2
- fastqc
- multiqc
- trim_galore
- Cluster Window Manager for Google Chrome: A Godsend

Week 1 Task list - Updated Wednesday

  • Learn command line essentials (cd, ls, mv, cp, rm, wget, echo)
  • Set up an AWS t2.micro instance with 30 GB EBS storage and ssh into this instance
  • Install conda on your AWS instance
  • Install fastqc in a conda environment named 'week_1' on your AWS instance
  • Download some fastq files from an RNA-seq project (GBA bulk, LRRK2 neurons, etc.) using wget
  • Install 'filezilla'/WinSCP, or any program capable of 'sftp' and download your fastqc report(s) to your computer
  • Evaluate the quality of a fastq file from the report generated by fastqc
  • Practice using tab autocompletion on the command line
  • Practice using the up and down arrows to cycle through command history
  • Copy your command 'history' to a GitHub document for reference later
  • Start using 'screen' to run programs in the background (screen -S name to start a named session, Ctrl-a then d to detach, screen -r name to reattach, Ctrl-a then k to kill the current window)

Wednesday: Guide

  1. Find a paper with bulk RNA sequencing. I recommend this airway paper, as we may be using this paper's data for more practice later on.
  2. Locate the EBI 'Nucleotide Sequencing' Project page for the paper. Here is the page for the above paper
  3. Use 'Select columns' to choose which columns you need. We need the 'fastq ftp' column, and a column that helps us identify which sample is which
  4. Download the 'TSV' (tab-separated values) file and open it in Excel.
  5. Use some of the 'fastq ftp' links to 'wget' the fastq files to your AWS instance
  6. Run fastqc on each fastq file
  7. Install Filezilla, WinSCP or equivalent on your local machine
  8. Add a new 'site': Protocol is SFTP, Host is the 'public DNS' of your AWS instance, Port is 22, User is ubuntu and key file is the '.pem' file you made yesterday.
  9. Connect and download all the 'fastqc' output files
  10. Open the '.html' files in your browser and take a look at the information. Google these quality checks and read the 'Babraham' guidance on them: Example
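
Steps 3 to 5 above can also be scripted instead of copying links out of Excel by hand. Here is a minimal sketch, assuming the TSV has a header row containing a column called 'fastq_ftp', that each cell holds one or two links separated by ';', and that the links come without the 'ftp://' prefix (check your own file, as this may differ). The file name filereport.tsv, the accessions and the host name are all made up for illustration. The function only prints the wget commands, so you can read them before running anything.

```shell
# Build a tiny example report (hypothetical rows; your real TSV comes from the EBI project page).
workdir=$(mktemp -d)
report="$workdir/filereport.tsv"
printf 'run_accession\tfastq_ftp\tsample_title\n' > "$report"
printf 'SRR0000001\tftp.example.org/SRR0000001_1.fastq.gz;ftp.example.org/SRR0000001_2.fastq.gz\tcontrol\n' >> "$report"
printf 'SRR0000002\tftp.example.org/SRR0000002_1.fastq.gz;ftp.example.org/SRR0000002_2.fastq.gz\ttreated\n' >> "$report"

# Print one wget command per fastq link found in the fastq_ftp column of the TSV given as $1.
make_wget_commands() {
    awk -F '\t' '
        NR == 1 {                       # header row: find the fastq_ftp column
            for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
            next
        }
        col {                           # data rows: split the cell on ";"
            n = split($col, links, ";")
            for (j = 1; j <= n; j++) if (links[j] != "") print "wget ftp://" links[j]
        }
    ' "$1"
}

make_wget_commands "$report"
```

Once the printed commands look right, you could save them with make_wget_commands "$report" > download.sh and run that file inside a 'screen' session.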

Thursday: Guide

  1. Access the EC2 instance (ec2-3-11-80-159.eu-west-2.compute.amazonaws.com) (password: trinity)
  2. Install miniconda in your home folder
  3. Install STAR in a conda environment
  4. Download the fastq files from the airway project (perhaps divide up the task of downloading)
  5. Download the Gencode primary assembly FASTA and GTF for human
  6. Generate a genome index using STAR: Follow section 2.1 of the manual
  7. Use a for loop to map your fastq reads to the genome index in STAR: See section 3.1 of the manual
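Example for loop (echo prints each command so you can check it first; remove echo and the surrounding quotes to actually run STAR):
for file in *1.fastq.gz ; do echo "STAR --runThreadN 4 --genomeDir folder_where_genome_index_is --readFilesCommand zcat --readFilesIn $file ${file%1.fastq.gz}2.fastq.gz --outFileNamePrefix ${file%1.fastq.gz}" ; done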
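
The ${file%1.fastq.gz} part of the loop is the shell's suffix removal: it strips '1.fastq.gz' from the end of the read-1 file name, so that '2.fastq.gz' can be appended to get the matching read-2 file. A small sketch of just that step, using a made-up file name:

```shell
# Hypothetical read-1 file name; any name ending in 1.fastq.gz behaves the same way.
file="SRR0000001_1.fastq.gz"

# ${file%1.fastq.gz} removes the shortest match of "1.fastq.gz" from the end of $file.
prefix="${file%1.fastq.gz}"

# Appending 2.fastq.gz to the prefix gives the name of the mate file.
mate="${prefix}2.fastq.gz"

echo "read 1: $file"
echo "read 2: $mate"
echo "prefix: $prefix"
```

This only works if your paired files really are named with the same prefix followed by 1.fastq.gz and 2.fastq.gz, so check with ls first.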

Friday: Guide

  1. Our objective is to complete mapping the airway fastq files to the GENCODE human reference genome (GRCh38).
  2. Your AWS instance is accessible using: ssh -i "msc_tt2020.pem" ubuntu@ec2-18-132-67-54.eu-west-2.compute.amazonaws.com from the directory in which your .pem file is stored.
  3. One person installs miniconda, then notifies everyone when it is installed.
  4. Each person creates their own environment, including their initials.
  5. Thursday's work is stored in the /home/ubuntu/thursday folder
  6. Friday's space is /home/ubuntu/friday
  7. To see how much space is free, type df -h: You will see that the thursday folder is currently almost full.
  8. To see how much space each folder in a given directory takes up, type du -chd 1
  9. Using these commands, try to decide whether you can optimise the amount of space you are using in the thursday folder (you could do this while waiting for STAR to generate an index in the friday folder)
  10. One way to save space is to avoid having duplicates of large files. You could agree upon a single folder to store the fastq files, reference genome files, and STAR index, and then create symbolic links to your own working folders (or just reference this single shared folder in each of your commands).
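
The symbolic link idea in step 10 can be tried out safely in a temporary folder before touching the real data. This is a minimal sketch under made-up names: the 'shared' and 'ab_work' folders and the empty placeholder fastq file are hypothetical stand-ins for the real layout on the instance.

```shell
# Stand-in for the instance layout: a shared folder plus one personal working folder.
base=$(mktemp -d)
mkdir -p "$base/shared/fastq" "$base/friday/ab_work"

# One (empty, placeholder) fastq file stored once in the shared folder.
touch "$base/shared/fastq/SRR0000001_1.fastq.gz"

# Link the shared folder into the working folder instead of copying it.
ln -s "$base/shared/fastq" "$base/friday/ab_work/fastq"

# The link shows up with an arrow pointing at the shared folder...
ls -l "$base/friday/ab_work"

# ...and files inside it can be read as if they were stored locally.
ls "$base/friday/ab_work/fastq/"

# du does not follow the link, so the working folder itself stays tiny.
du -sh "$base/friday/ab_work"
```

Deleting the link (rm on the link name, without a trailing slash) removes only the link, not the shared files it points to.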

Monday 11th May: Guide

  1. Sync your personal AWS S3 folder to a personal folder within the Monday folder on the instance.
  2. Set up miniconda + an environment with samtools, picard, rseqc and subread.
  3. Convert your SAM files to BAM files, sort them, index them.
  4. Run Picard CollectRnaSeqMetrics on each BAM file
  5. Run any RSeQC modules you see as informative (e.g. junction_saturation, inner_distance, read_distribution, read_duplication...)
  6. Compile QC reports from fastqc, STAR, Picard and RSeQC into a MultiQC report (see MultiQC documentation)
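
Step 3 is a natural fit for a for loop. The sketch below assumes your SAM files end in '.sam' (STAR names them <prefix>Aligned.out.sam by default) and that 4 threads is a sensible number for your instance; the '.sorted.bam' naming is just my suggestion. Like the STAR example, it prints the samtools commands rather than running them, and the demo files are empty placeholders.

```shell
# Print (not run) the samtools commands for every SAM file in the folder given as $1.
sam_to_bam_commands() {
    for sam in "$1"/*.sam; do
        [ -e "$sam" ] || continue              # skip if the folder has no SAM files
        bam="${sam%.sam}.sorted.bam"           # swap the .sam ending for .sorted.bam
        # sort reads by coordinate and write BAM in one step
        echo "samtools sort -@ 4 -O bam -o $bam $sam"
        # index the sorted BAM (creates a .bai file next to it)
        echo "samtools index $bam"
    done
}

# Demo on a temporary folder with two empty placeholder files.
demo=$(mktemp -d)
touch "$demo/SRR0000001_Aligned.out.sam" "$demo/SRR0000002_Aligned.out.sam"
sam_to_bam_commands "$demo"
```

Sorting has to happen before indexing, which is why the two commands are printed in that order for each file.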

Wednesday 13th May: Guide

  • Merge/join the featureCounts count tables.
  • Convert sample names to just SRR accession number
  • Create sample metadata/coldata table with relevant biological and technical groups labelled
  • Import count data and metadata into DESeq2 in R
  • Run a treatment group comparison in DESeq2, by expressing a 'design' (e.g. ~ treatment_group)
  • Produce the DESeq2 results table and filter for genes with an adjusted p-value below 0.01 and an absolute log2FoldChange greater than 1 (i.e. both up- and down-regulated genes)
  • Plot a gene's expression between treatment groups
  • Export a normalised counts table from DESeq2
  • Use a normalised counts table to run PCA with prcomp
  • Plot a PCA biplot and label the samples according to their biological or technical groups.
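
The first two bullets (merging the count tables and shortening the sample names) can be done on the command line before you open R. This is only a sketch under several assumptions: one featureCounts output file per sample, every file made with the same GTF so the genes are in the same order, counts in column 7, and BAM file names that contain the SRR accession. The demo files, gene names and counts are invented.

```shell
# Write a tiny stand-in featureCounts file: make_demo_counts SAMPLE COUNT_A COUNT_B
make_demo_counts() {
    printf '# Program:featureCounts (comment line)\n'
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\t/data/bam/%s_Aligned.out.sorted.bam\n' "$1"
    printf 'geneA\tchr1\t1\t100\t+\t100\t%s\n' "$2"
    printf 'geneB\tchr1\t200\t300\t-\t101\t%s\n' "$3"
}
fc=$(mktemp -d)
make_demo_counts SRR0000001 10 15 > "$fc/SRR0000001.counts.txt"
make_demo_counts SRR0000002 20 25 > "$fc/SRR0000002.counts.txt"

# Merge the count column of each file into one table, with SRR accessions as column names.
merge_counts() {
    tmp=$(mktemp -d)
    n=0
    for f in "$@"; do
        n=$((n + 1))
        awk -F '\t' -v OFS='\t' -v first="$n" '
            /^#/ { next }                          # skip the comment line
            !header_done {                         # header row: shorten the sample name
                header_done = 1
                name = $7
                sub(/.*\//, "", name)              # drop the folder part of the BAM path
                if (match(name, /SRR[0-9]+/)) name = substr(name, RSTART, RLENGTH)
                if (first == 1) { print $1, name } else { print name }
                next
            }
            { if (first == 1) { print $1, $7 } else { print $7 } }
        ' "$f" > "$tmp/$(printf '%04d' "$n").col"
    done
    paste "$tmp"/*.col                             # gene IDs come from the first file only
}

merge_counts "$fc/SRR0000001.counts.txt" "$fc/SRR0000002.counts.txt"
```

Because the columns are simply pasted side by side, this goes wrong silently if the files list genes in different orders, so it is worth spot-checking a few genes against the original files. The merged table can then be read into R for the DESeq2 steps.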

Monday 18th May: Info

  • Google drive link to RWM data

  • Go to 'experiments' folder and download the fastq files for gba_bulk and lrrk2

  • Apply the FastQC/MultiQC approach to these files

  • Acquaint yourself with trim_galore

  • Acquaint yourself with kallisto

  • For kallisto, you will need a transcriptome reference, not a primary genome reference. Beware of this when downloading a reference Fasta from Gencode/Ensembl.

  • Kallisto does not produce a log file, but outputs useful information to the terminal. MultiQC can use this output, so redirect everything kallisto prints (both stdout and stderr) to a file

  • You will need for loops for some of this

  • Guusje: Read the Snaptron paper and Snaptron User Guide.

  • We should be able to query LRRK2 splicing using the snaptron client to their web service. We then want all the metadata possible for this.
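
As a starting point for the kallisto for loop, here is a dry-run sketch in the same style as the STAR example. It assumes paired-end files named <sample>_1.fastq.gz and <sample>_2.fastq.gz (the real files may be named differently, or be single-end, in which case kallisto needs other options), an index you have already built from a transcriptome Fasta, and 4 threads. The index name transcripts.idx and the output naming are made up. Each printed command sends everything kallisto prints to a per-sample log file for MultiQC to pick up.

```shell
# Print (not run) one kallisto quant command per read-1 file.
# Usage: kallisto_commands INDEX FASTQ_DIR
kallisto_commands() {
    index="$1"
    fastq_dir="$2"
    for r1 in "$fastq_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                       # skip if nothing matches
        r2="${r1%_1.fastq.gz}_2.fastq.gz"              # matching read-2 file
        sample=$(basename "${r1%_1.fastq.gz}")         # sample name without folder or suffix
        # "> file 2>&1" captures both stdout and stderr in the log file
        echo "kallisto quant -i $index -o ${sample}_kallisto -t 4 $r1 $r2 > ${sample}_kallisto.log 2>&1"
    done
}

# Demo on a temporary folder with empty placeholder files.
kdemo=$(mktemp -d)
touch "$kdemo/S1_1.fastq.gz" "$kdemo/S1_2.fastq.gz"
kallisto_commands transcripts.idx "$kdemo"
```

Check the printed commands against the kallisto manual before running them for real.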

Adding directories to the PATH variable

  • Recall that the PATH variable is the list of folders where our shell looks for applications. We can see its current value by running echo $PATH
  • To add a folder containing an application (a "binary") to our PATH, you can run a command in the terminal, but that only lasts for the current session. Often the simplest lasting way is to edit the file where the PATH variable is set. On Linux, this is ~/.bashrc; on Mac, this is ~/.bash_profile. In that file, you may see the PATH variable already being set.
  • Add a line at the bottom: export PATH="/path/to/new/binaries:$PATH"
  • Save the file and reload it: source ~/.bashrc
  • Test it: echo $PATH
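
To see what changing PATH actually does, you can try it in the current session with a throwaway folder. In this sketch, 'my_tool' is a made-up two-line script standing in for a downloaded binary. Nothing here touches ~/.bashrc, so the change disappears when you close the terminal.

```shell
# Make a throwaway folder containing a tiny executable (stand-in for a downloaded binary).
bindir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from my_tool"\n' > "$bindir/my_tool"
chmod +x "$bindir/my_tool"

# Put the folder at the front of PATH for this shell session only.
export PATH="$bindir:$PATH"

# The shell can now find the tool by name: command -v prints where it found it.
command -v my_tool
my_tool
```

Putting the new folder before $PATH means it is searched first, which matters if two folders contain a program with the same name.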
