AWS Tutorial Resources

Overview of Page Contents

Biomedical Workflows on AWS
Artificial Intelligence
Clinical Informatics
Download SRA Data
GWAS
Medical Imaging
RNAseq
scRNAseq
BLAST
Protein Folding
Long Read Sequencing Analysis
Drug Discovery
CryoEM
Open Data

Biomedical Workflows on AWS

There are many ways to run workflows on AWS. Here we list a few possibilities, each of which may suit different research aims. As you walk through the tutorials below, think about how you could run each workflow more efficiently using one of the other methods listed here. If you are unfamiliar with any of the terms or concepts here, please review the AWS Jumpstart page.

The simplest option is probably to spin up an EC2 instance and run your command interactively, inside screen, or as a startup script attached as user data. See the GWAS tutorial below for more info on how to run a pipeline using EC2.
You could also run your pipeline via a SageMaker notebook, either by splitting each command into its own cell or by running a workflow manager (Nextflow, etc.). See here about scheduling a notebook to let it run longer. You can find some example notebooks in the tutorials below.
If you are running bioinformatic workflows, you can leverage the serverless functionality of AWS using Amazon HealthOmics. Read this blog for more detailed information, and check whether any newer blogs have come out. If you want hands-on experience with HealthOmics using Cloud Lab, follow this on-demand workshop from Amazon. Since you already have an account set up, skip directly to the Workshop section, then decide whether to complete the tutorial via the console, the CLI, or notebooks. If you go the notebook route, just spin up a notebook via SageMaker. If you want to create a private workflow using Nextflow, you will need to migrate your containers to a private Amazon Elastic Container Registry (ECR). You can follow this workshop to learn how that process works.
If you are using a workflow manager other than WDL, Nextflow, or CWL (e.g., Snakemake), use the Amazon Genomics CLI (AGC), which is a wrapper for genomics workflow managers and AWS Batch (serverless computing cluster). See our docs on how to set up AGC for Cloud Lab. You can also just run Snakemake locally within a VM. See our Pangolin tutorial for one example.
Finally, one benefit of the cloud is access to GPUs for workflow acceleration. Although GPUs are most often discussed in the context of AI/ML workflows, NVIDIA offers software called Parabricks that accelerates genomic workflows at fairly low cost. See the full list of command line options here to check whether your specific workflow is accelerated. The easiest way to run Parabricks right now is via AWS HealthOmics Ready2Run workflows, but to run it via EC2 see our guide.
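
As a rough illustration of the first option above (an EC2 instance with a startup script), the sketch below builds a bash script and the parameters for a boto3 `run_instances` call. The AMI ID, instance type, and pipeline command are placeholders chosen for illustration, not values from the tutorials, and the script assumes the instance's shutdown behavior is set to stop so the machine halts once the commands finish.

```python
def build_user_data(commands):
    """Return a bash startup script that runs each command in order."""
    lines = ["#!/bin/bash"]
    lines.extend(commands)
    # Power the instance off after the last command so it stops accruing compute charges.
    lines.append("shutdown -h now")
    return "\n".join(lines) + "\n"


def build_run_instances_params(ami_id, instance_type, user_data):
    """Collect keyword arguments for ec2.run_instances (one instance)."""
    return {
        "ImageId": ami_id,              # placeholder AMI, replace with your own
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,          # boto3 base64-encodes this string for you
        "InstanceInitiatedShutdownBehavior": "stop",
    }


def launch(params, region="us-east-1"):
    """Launch the instance; requires boto3 and valid credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]
```

To try it, build the script with something like `build_user_data(["echo pipeline-start"])`, pass it to `build_run_instances_params`, and hand the result to `launch`. Depending on your account, you may also need to supply a key pair, subnet, or security group.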

For many of these tutorials, you will need Short Term Access Keys to create and use resources, particularly whenever a tutorial calls for "access key ID" and "secret key." Use this guide for an explanation of how to obtain and use Short Term Access Keys. If you are an NIH-affiliated researcher (in other words, you do not work at the NIH but have a Cloud Lab account), you will not have access to keys. If there is a tutorial you are unable to complete, reach out to us for help at CloudLab@nih.gov.
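
One way to use short-term keys from Python is to export them as the standard AWS environment variables and build a boto3 session from them. The sketch below assumes your keys come with a session token, as temporary credentials normally do; the helper names are made up for this example.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def session_kwargs_from_env(env=None):
    """Read short-term credentials from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "aws_session_token": env["AWS_SESSION_TOKEN"],
    }


def make_session(env=None, region="us-east-1"):
    """Create a boto3 session from the keys; requires boto3 to be installed."""
    import boto3  # imported lazily so the helper above works without boto3

    return boto3.Session(region_name=region, **session_kwargs_from_env(env))
```

Short-term keys expire, so if calls start failing with an expired-token error, obtain a fresh set and export them again.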

Please also note that GPU machines cost more than most CPU machines, so be sure to shut these machines down after use, or apply an EC2 lifecycle configuration. You may also encounter service quotas, which protect you from the accidental use of expensive machine types. If that happens and you still want to use a certain instance type, follow these instructions.
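
If you want a quick check for GPU machines left running, something along these lines could help. It is a minimal sketch that assumes GPU instance families start with "g" or "p" (for example g4dn or p3), which is a heuristic rather than an official rule, and it only inspects the first page of `describe_instances` results.

```python
# Assumption: EC2 GPU families begin with "g" or "p" (e.g. g4dn, g5, p3).
GPU_FAMILY_PREFIXES = ("g", "p")


def running_gpu_instance_ids(describe_response):
    """Pick out running GPU instances from an ec2.describe_instances response."""
    ids = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            family = instance["InstanceType"].split(".")[0]
            is_running = instance["State"]["Name"] == "running"
            if is_running and family.startswith(GPU_FAMILY_PREFIXES):
                ids.append(instance["InstanceId"])
    return ids


def stop_running_gpu_instances(region="us-east-1"):
    """Stop every running GPU instance found; requires boto3 and credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ids = running_gpu_instance_ids(ec2.describe_instances())
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids
```

Stopping an instance ends compute charges but not charges for attached storage, so terminate machines you no longer need.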

Artificial Intelligence

Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. Artificial intelligence and machine learning algorithms are being applied to a variety of biomedical research questions, ranging from image classification to genomic variant calling. AWS has a long list of AI/ML tutorials available, and we have compiled a list here. Most recent development focuses on generative AI, including use cases such as extracting information from text, transforming speech to text, and generating images from text. SageMaker Studio allows the user to rapidly create, test, and train generative AI models and has ready-to-use models, all contained within JumpStart. These models range from foundation models to fine-tunable models and task-specific solutions.

For examples of generative AI, view our GenAI tutorials, which use several AWS products such as Bedrock and JumpStart, along with other tools like LangChain and Hugging Face, to deploy, train, and prompt GenAI models and to apply techniques like Retrieval-Augmented Generation (RAG). Also take a look at the AWS GitHub repo for more GenAI tutorials.
For other AI use cases, we recommend you start with this comprehensive on-demand workshop on how to use SageMaker Studio for a variety of AI/ML use cases, including applying a classifier to RNAseq data, classifying tabular breast cancer data, building graph neural nets on HIV data, training a medical imaging model on chest scans, summarizing scientific literature using foundation models, running MLOps on gene expression data, and finally, performing antibody structure prediction.
To learn more about Bedrock, check out this on-demand workshop featuring use cases for prompt engineering, summarization, Q/A, chatbots, and image, code, and text generation within Bedrock.
AWS has a very general tutorial here on how to build out an AI pipeline on SageMaker.
These general examples will teach you how to use SageMaker tools more broadly.
You can also submit a training job to SageMaker using PyTorch, TensorFlow, or Apache MXNet and have your final model uploaded to S3.
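
To give a feel for what submitting a training job looks like, here is a minimal sketch using the `PyTorch` estimator from version 2 of the SageMaker Python SDK. The role ARN, bucket, script name, instance type, and framework versions are all placeholders or assumptions for illustration; check which framework and Python versions your SDK supports before running it.

```python
def build_estimator_kwargs(role_arn, bucket, entry_point="train.py"):
    """Collect arguments for sagemaker.pytorch.PyTorch (SDK v2 style)."""
    return {
        "entry_point": entry_point,          # your training script (hypothetical name)
        "role": role_arn,                    # IAM role SageMaker assumes for the job
        "instance_count": 1,
        "instance_type": "ml.g4dn.xlarge",   # assumption: a single small GPU instance
        "framework_version": "2.1",          # assumption: a PyTorch version your SDK supports
        "py_version": "py310",
        "output_path": f"s3://{bucket}/models",  # where the model artifact is uploaded
    }


def submit(estimator_kwargs, training_data_uri):
    """Start the training job; requires the sagemaker package and credentials."""
    from sagemaker.pytorch import PyTorch  # imported lazily; not needed for the helper above

    estimator = PyTorch(**estimator_kwargs)
    estimator.fit({"training": training_data_uri})
    return estimator.model_data  # S3 URI of the trained model artifact
```

The TensorFlow and MXNet estimators in the same SDK follow a similar pattern, with their own framework version strings.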

Clinical Informatics

Clinical informatics, also known as healthcare informatics or medical informatics, is an interdisciplinary field that applies data science to healthcare data to improve patient care, enhance clinical processes, and facilitate medical research. It often involves integrating diverse data types, including electronic health records and demographic or environmental data. AWS offers two on-demand workshops that walk you through AWS HealthLake for population health data analysis. The first workshop shows you how to ingest data into HealthLake, query those data using Athena, visualize them using QuickSight, then join FHIR data with environmental data and visualize the combined dataset. The second workshop also ingests data into HealthLake, then visualizes medical device data, uses AI to summarize clinical notes, and finally transcribes and summarizes clinical audio files.

Download Data From the Sequence Read Archive (SRA)

Next-generation sequencing data are housed in the NCBI Sequence Read Archive (SRA). You can access these data using the SRA Toolkit. We walk you through this using this notebook, which also shows how to set up and search Athena tables to generate an accession list. You can also read this guide for more information on available dataset tables. Additional example notebooks can be found at this NCBI repo. In particular, we recommend this notebook (https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb), which goes into more detail on using Athena to access the results of the SRA Taxonomic Analysis Tool, which often differ from the user-submitted species name because of contamination, error, or samples being metagenomic in nature.
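
Once you have an accession list, the download step can be scripted. The sketch below only assembles the SRA Toolkit commands (`prefetch` followed by `fasterq-dump`) for each accession; the output directory and thread count are arbitrary defaults, and it assumes the toolkit is already installed and on your PATH when you run the commands.

```python
def parse_accessions(text):
    """Turn a one-accession-per-line list into a clean Python list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def sra_download_commands(accessions, outdir="fastq", threads=4):
    """Build prefetch and fasterq-dump commands for each accession."""
    commands = []
    for accession in accessions:
        # prefetch downloads the .sra file; fasterq-dump converts it to FASTQ.
        commands.append(["prefetch", accession])
        commands.append(
            ["fasterq-dump", accession, "--outdir", outdir, "--threads", str(threads)]
        )
    return commands


def run_all(commands):
    """Execute the commands in order, stopping at the first failure."""
    import subprocess

    for command in commands:
        subprocess.run(command, check=True)
```

For example, `run_all(sra_download_commands(parse_accessions(open("accessions.txt").read())))` would process a file named accessions.txt, which is a hypothetical filename here.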

Genome Wide Association Studies

Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes.

This NIH CFDE written tutorial walks you through running a simple GWAS using EC2. The tutorial asks you to select the Ohio region; make sure you change your region to N. Virginia instead, otherwise you will have network issues. Note that the CFDE page has a few other bioinformatics-related tutorials, such as BLAST and Illumina read simulation. We also converted the GWAS tutorial to a simplified notebook version if you prefer that format. See our notebook guide for help with setting up a Jupyter environment.

Medical Imaging Analysis

Medical imaging analysis requires the analysis of large image files and often requires elastic storage and accelerated computing.

Most medical imaging analyses are done using notebooks, so we recommend accessing this Jupyter Notebook and cloning it into SageMaker. The tutorial walks through image segmentation.
This SageMaker Studio on-demand workshop has a nice section on building a model on medical imaging data.
You can also view this AWS blog on how to annotate DICOM images and build a custom AI model with the data.
You can learn to deidentify medical images following this AWS tutorial.

RNAseq

RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks.

You can run this Nextflow tutorial for RNAseq in a variety of ways on AWS. Following the instructions outlined above, you could use EC2, SageMaker, or AWS Batch (/docs/Genomics_Workflows.md).
This AWS on-demand workshop shows how to analyze gene expression data using Amazon SageMaker Studio.
For a notebook version of a complete RNAseq pipeline, from FASTQ to Salmon quantification, use this notebook from the King Lab of the University of Maine INBRE, which we rewrote to work on AWS.

Single Cell RNAseq

Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems.

This AWS blog lays out a potential method that integrates a lot of the AWS native tools for running an scRNAseq pipeline. It is less of a tutorial, and more of a demo of what is possible.
This NVIDIA blog details how to run an accelerated scRNAseq pipeline using RAPIDS. You can find a link to the GitHub that has lots of example notebooks here. For each example use case they show some nice benchmarking data with time and cost for each machine type. You will see that most runs cost less than $1.00 with GPU machines. If you want a CPU version that uses Scanpy, you can use this notebook. Pay careful attention to the environment setup, as these notebooks have many dependencies. Create a conda environment in the terminal, then run the notebook. Consider using mamba to speed up environment creation. We also created a guide for conda environment setup.

ElasticBLAST

NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. The NCBI team has written a version of BLAST for the cloud called ElasticBLAST, and you can read all about it here. Essentially, ElasticBLAST helps you submit BLAST jobs to AWS Batch and write the results back to S3. Feel free to experiment with the example tutorial in Cloud Shell, or try our notebook version.
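
ElasticBLAST searches are described in an INI-style configuration file. As a minimal sketch, the snippet below writes one using Python's `configparser`. The section and key names follow the layout used in the ElasticBLAST documentation as best we recall it, and the bucket paths, database name, and node count are placeholders, so compare the output against the official docs before submitting.

```python
import configparser
import io


def build_elastic_blast_config(results_uri, queries_uri, program="blastp",
                               db="swissprot", region="us-east-1", num_nodes=1):
    """Return the text of an ElasticBLAST-style INI configuration."""
    config = configparser.ConfigParser(interpolation=None)
    config["cloud-provider"] = {"aws-region": region}
    config["cluster"] = {"num-nodes": str(num_nodes)}
    config["blast"] = {
        "program": program,
        "db": db,                 # assumption: an NCBI-provided database name
        "queries": queries_uri,   # FASTA file with your query sequences
        "results": results_uri,   # S3 location where results are written
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

You would save the returned text to a file (for example elastic-blast.ini, a name chosen here for illustration) and pass that file to the `elastic-blast` command line tool as described in its documentation.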

Protein Folding

You can run several protein folding algorithms, including AlphaFold, on AWS. Because the databases are so large, the setup is normally pretty difficult, but AWS has created a CloudFormation stack that automates spinning up all the resources necessary for running AlphaFold and other protein folding algorithms. You can read about the AWS resources here, and view the GitHub page here. To get this to work, you will need to modify your security groups following these instructions. You will also likely have to grant additional permissions to the Role that CloudFormation is using. If you get stuck, reach out to CloudLab@nih.gov. You can also run ESMFold using this tutorial.

Long Read Sequence Analysis

Long read DNA sequence analysis involves analyzing sequencing reads typically longer than 10,000 base pairs (bp) in length, compared with short read sequencing, where reads are about 150 bp in length. Oxford Nanopore has a pretty complete offering of notebook tutorials for handling long read data to do a variety of things, including variant calling, RNAseq, SARS-CoV-2 analysis, and much more. Access the notebooks here. These notebooks expect that you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server; the rest of the notebook should then run correctly, with a few tweaks. If you are just looking to try out notebooks, don't start with these. If you are interested in long read sequence analysis, then some troubleshooting may be needed to adapt these to the Cloud Lab environment. You may even need to rewrite them in a fresh notebook by adapting the commands. Feel free to reach out to our support team for help.

Drug Discovery

The Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium created a series of Jupyter notebooks that walk you through the ATOM approach to Drug Discovery.

These notebooks were created to run in Google Colab, so if you run them in AWS, you will need to make a few modifications. First, we recommend you use a SageMaker Studio Notebook rather than a User-Managed notebook simply because it will have TensorFlow and other dependencies installed. Be sure to attach a GPU to your instance (T4 is fine). Also, you will need to comment out %tensorflow_version 2.x since that is a Colab-specific command. You will also need to pip install a few packages as needed. If you get errors with deepchem, try running pip install --pre deepchem[tensorflow] and/or pip install --pre deepchem[torch]. Also, some notebooks will require a TensorFlow kernel, while others require PyTorch. You may also run into a Pandas error; if so, reach out to the ATOM GitHub developers for the best solution, or review their issues.
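
If you would rather not edit each notebook by hand, the Colab-specific magic can be commented out programmatically. The following is a small sketch that works on the notebook's JSON structure; the function names are made up for this example, and it only touches code cells whose lines begin with `%tensorflow_version`.

```python
import json


def comment_out_colab_magic(notebook, magic="%tensorflow_version"):
    """Prefix lines starting with the given magic with '# ' in every code cell."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # Notebook sources may be stored as a list of lines or as one string.
        was_string = isinstance(source, str)
        lines = source.splitlines(keepends=True) if was_string else list(source)
        patched = [
            "# " + line if line.lstrip().startswith(magic) else line
            for line in lines
        ]
        cell["source"] = "".join(patched) if was_string else patched
    return notebook


def patch_notebook_file(path):
    """Rewrite an .ipynb file in place with the magic commented out."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    comment_out_colab_magic(notebook)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)
```

Keep a copy of the original notebook before running `patch_notebook_file`, since it overwrites the file.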

CryoEM

Cryo-Electron Microscopy (cryoEM) is a powerful imaging technique used in structural biology to visualize the structures of biological macromolecules, such as proteins, nucleic acids, and large molecular complexes, at near-atomic or even atomic resolution. It has revolutionized the field of structural biology by providing detailed three-dimensional structures of biomolecules, which is crucial for understanding their functions.

AWS created a hands-on workshop for you to stand up a cryoEM environment using RELION.
You can also read this blog on how to set up cryoSPARC, as well as docs from cryoSPARC.

Open Data

AWS has a lot of public data that you can integrate into your testing or use in your own research. You can access these datasets at the Registry of Open Data on AWS. There you can click on any of the datasets to view the S3 path to the data, as well as publications that have used those data and tutorials if available. To demonstrate, click the gnomAD dataset (https://registry.opendata.aws/broad-gnomad/), copy the S3 path shown on that page, and list the files at the command line.
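
Public buckets in the Registry can usually be listed without credentials. As a sketch of how that might look from Python, the code below splits an S3 path into bucket and prefix and lists the top level using unsigned requests. The function names are made up for this example, and it assumes the dataset allows anonymous access; copy the actual path from the dataset's Registry page.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    if not uri.startswith("s3://"):
        raise ValueError("Expected an S3 URI starting with s3://")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


def list_public_prefix(uri, max_keys=20):
    """List folders and files under a public S3 path; requires boto3."""
    import boto3  # imported lazily so split_s3_uri works without boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_uri(uri)
    # Unsigned requests let you read public data without AWS credentials.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = s3.list_objects_v2(
        Bucket=bucket, Prefix=prefix, Delimiter="/", MaxKeys=max_keys
    )
    folders = [item["Prefix"] for item in response.get("CommonPrefixes", [])]
    files = [item["Key"] for item in response.get("Contents", [])]
    return folders + files
```

Calling `list_public_prefix` with the S3 path from a dataset page returns the first few folder and file names under that path.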

STRIDES / NIHCloudLabAWS