GTPB/2021-11-ml-elixir-pt

Overview of the course material for the ELIXIR-PT "Introduction to Machine Learning Using R" course

When: 15-17 November 2021, 09:30 - 18:30 UTC

Where: Instituto Gulbenkian de Ciência, Oeiras, PT

Registration: To express interest, email bicourses [at] igc.gulbenkian.pt, as explained under "Contact" at https://tess.elixir-europe.org/events/machine-learning

Instructors and helpers

Instructors:

Wandrille Duchemin (ELIXIR-CH, Basel University, SIB Swiss Institute of Bioinformatics)
Crhistian Cardona (ELIXIR-UK, University of Tuebingen)

Overview

With the rise of high-throughput sequencing technologies, the volume of omics data has grown exponentially in recent years. A major challenge is to mine useful knowledge from these data, which are also heterogeneous in nature. Machine learning (ML) is a discipline in which computers learn from data without being explicitly programmed, helping humans make sense of large and complex data sets. Analysing complex, high-volume data is not trivial, and classical tools cannot exploit its full potential. Machine learning can thus be very useful for mining large omics datasets and uncovering new insights that advance the field of bioinformatics.

This 3-day course will introduce participants to the machine learning taxonomy and to the application of common machine learning algorithms to omics data. The course will cover the common methods used to analyse different omics data sets, providing a practical context through basic but widely used R libraries. It will comprise a number of hands-on exercises and challenges through which participants will acquire a first understanding of standard ML processes, as well as practical skills in applying them to familiar problems and publicly available real-world data sets.

Learning objectives

At the end of the course, the participants will be able to:

Understand the ML taxonomy and the commonly used machine learning algorithms for analysing “omics” data
Understand the differences between categories of ML algorithms and the kinds of problems to which each can be applied
Understand different applications of ML in different -omics studies
Use some basic, widely used R packages for ML
Interpret and visualize the results obtained from ML analyses on omics datasets
Apply the ML techniques to analyse their own datasets

Audience and requirements

This course is intended for master's and PhD students, post-docs and staff scientists who are familiar with different omics data technologies and interested in applying machine learning to analyse these data. No prior knowledge of machine learning concepts and methods is expected or required.

Prerequisites

Knowledge / competencies

Familiarity with at least one programming language is required; familiarity with R is preferable.

Technical

This course will be held in person. You are not required to bring your own computer. To ensure clear communication between instructors and participants, we will use collaborative tools such as Google Drive and/or Google Docs.

Maximum participants: 20

Schedule

Note: this schedule is tentative and will adapt to the trainees' needs and questions, with the exception of start, stop, break and lunch times, which will be scrupulously respected.

Day 1

Time	Details
09:30 - 10:00	Course Introduction. - Welcome. - Introduction and Code of Conduct (CoC). - Ways to interact - Practicalities (agenda, breaks, etc.). - Setup Link to material
10:00 - 10:30	Introduction to Machine Learning (theory)
10:30 - 11:00	What is Exploratory Data Analysis (EDA) and why is it useful? (hands-on) - Loading omics data - PCA Link to material
11:00 - 11:30	Coffee Break
11:30 - 12:30	Exploratory Data Analysis - continued (hands-on)
12:30 - 14:00	Lunch break
14:00 - 14:30	Introduction to Unsupervised Learning (theory)
14:30 - 15:00	Partitional Clustering: k-means (practical) Link to material
15:00 - 15:30	Coffee Break
15:30 - 18:30	Partitional Clustering: k-means - continued (practical)
18:30	Closing of Day 1

Day 2

Time	Details
09:30 - 10:00	Welcome Day 2. - Questions from Day 1 - Recap
10:00 - 10:30	Divisive Clustering: hierarchical clustering (theory)
10:30 - 11:00	Divisive Clustering: hierarchical clustering (practical) Link to material
11:00 - 11:30	Coffee Break
11:30 - 12:30	Divisive Clustering: hierarchical clustering - continued (practical)
12:30 - 14:00	Lunch break
14:00 - 15:00	Classification - didactical introduction (practical) - Decision trees - the classification pipeline Link to material
15:00 - 15:30	Coffee Break
15:30 - 17:30	Classification - metrics and evaluation (theory/practical) - F1 Score, Precision, Recall - Confusion Matrix, ROC-AUC Link to material
17:30 - 18:30	Classification - random forests (practical) Link to material

Day 3

Time	Details
09:30 - 10:00	Welcome Day 3. - Questions from Day 2 - Recap
10:00 - 11:00	Classification - more algorithms (theory) - Naive Bayes - SVMs
11:00 - 11:30	Coffee Break
11:30 - 12:00	Regression (theory)
12:00 - 12:30	Linear regression (practical) Link to material
12:30 - 14:00	Lunch break
14:00 - 15:00	Linear regression - continued (practical)
15:00 - 15:30	Coffee Break
15:30 - 17:00	Generalized Linear Model (GLM) (practical) Link to material
17:00 - 17:30	Recap and outlook on advanced topics (theory)
17:30 - 18:30	Closing questions, Discussion

Other examples

If you finish all the exercises and want more practice, the following examples will help you become more familiar with the different ML techniques and packages.

RNASeq Analysis in R
Use R's built-in iris data set to run clustering and some supervised classification, then compare the results obtained by the different methods.
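
As a starting point for the iris exercise, here is a minimal sketch in R. It assumes the rpart package (normally shipped with R as a recommended package) is installed. The choice of three clusters, the 70/30 train/test split, the seed and all variable names are illustrative assumptions, not part of the course material.

```r
# Minimal sketch: clustering and classification on the built-in iris data.
# Assumption: rpart is installed (it is a recommended package in standard R).
library(rpart)

data(iris)

# Scale the four numeric measurements so that no single feature dominates
features <- scale(iris[, 1:4])

# --- Unsupervised: k-means and hierarchical clustering ---
set.seed(42)  # arbitrary seed, only for reproducibility
km <- kmeans(features, centers = 3, nstart = 25)

hc <- hclust(dist(features), method = "average")
hc_clusters <- cutree(hc, k = 3)

# Cross-tabulate cluster assignments against the known species
km_vs_species <- table(kmeans = km$cluster, species = iris$Species)
hc_vs_species <- table(hclust = hc_clusters, species = iris$Species)
print(km_vs_species)
print(hc_vs_species)

# --- Supervised: decision tree on a 70/30 train/test split ---
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

tree <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")

# Confusion matrix and accuracy on the held-out rows
confusion <- table(predicted = pred, actual = test$Species)
accuracy <- mean(pred == test$Species)
print(confusion)
print(accuracy)
```

Comparing the two cross-tabulations with the tree's confusion matrix gives a first impression of how well the unsupervised groupings recover the species labels relative to a supervised model.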

Sources / References

The material in the workshop is based on the following resources:

ELIXIR CODATA Advanced Bioinformatics Workshop
Machine Learning in R, by Hugo Bowne-Anderson and Jorge Perez de Acha Chavez
Practical Machine Learning in R, by Kyriakos Chatzidimitriou, Themistoklis Diamantopoulos, Michail Papamichail, and Andreas Symeonidis.
Linear models in R, by the Monash Bioinformatics Platform
Relevant blog posts from the R-Bloggers website.
Predicting the breast cancer by characteristics of the cell nuclei present in the image

Relevant literature includes:

Pattern Recognition and Machine Learning by Christopher M. Bishop.
Machine learning in bioinformatics, by Pedro Larrañaga et al.
Ten quick tips for machine learning in computational biology, by Davide Chicco
Statistics versus machine learning
Machine learning and systems genomics approaches for multi-omics data
A review on machine learning principles for multi-view biological data integration
Generalized Linear Model

Additional information

Coordination: Pedro L. Fernandes, Training Coordinator of ELIXIR-PT, Instituto Gulbenkian de Ciência

ELIXIR-PT abides by the ELIXIR Code of Conduct. Participants in this course are also required to abide by the same code.

License

This material is made available under the Creative Commons Attribution 4.0 International license. Please see LICENSE for more details.

Citation

Additionally, we would like to acknowledge that this training material draws heavily on:

Shakuntala Baichoo, Wandrille Duchemin, Geert van Geest, Thuong Van Du Tran, Fotis E. Psomopoulos, & Monique Zahn. (2020, July 23). Introduction to Machine Learning (Version v1.0.0). Zenodo. http://doi.org/10.5281/zenodo.3958880
