BhadraNivedita / Machine_Learning_Resource

Biological concepts:

What is a 'housekeeping gene'?

Housekeeping genes are genes that are essential for the maintenance of basic cellular functions and are typically expressed in all cells of an organism under normal and healthy conditions. They perform fundamental roles in the upkeep of cellular physiology and survival. Here are some key points about housekeeping genes:

Key Characteristics of Housekeeping Genes:

  1. Essential Functions: Housekeeping genes are involved in crucial cellular processes such as energy production, metabolism, cell structure maintenance, DNA repair, and protein synthesis.

  2. Constitutive Expression: These genes are usually expressed at relatively constant levels across different cell types and conditions because their protein products are required continuously for the cell to function properly.

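  3. Universal Presence: Housekeeping genes are expressed in essentially all cells of an organism, irrespective of the tissue type or developmental stage. Nearly every cell carries the same set of genes; what sets housekeeping genes apart is that they are switched on everywhere because they are fundamental to the basic operations of every cell.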

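  4. Stable Expression Levels: The expression levels of housekeeping genes are relatively stable, which is why they are used as internal controls in experiments such as quantitative PCR and other gene expression studies. Stability should still be checked for the tissue and treatment at hand, since common reference genes such as GAPDH and ACTB can vary under some conditions.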

Examples of Housekeeping Genes:

  • GAPDH (Glyceraldehyde-3-phosphate dehydrogenase): Involved in glycolysis, the process of breaking down glucose to produce energy.
  • ACTB (Beta-actin): Part of the cytoskeleton, playing a critical role in cell structure and integrity.
  • RPLP0 (Ribosomal protein, large, P0): A component of the ribosome, essential for protein synthesis.
  • HPRT1 (Hypoxanthine-guanine phosphoribosyltransferase): Involved in the purine salvage pathway of nucleotide metabolism.
  • B2M (Beta-2-microglobulin): The light chain of major histocompatibility complex (MHC) class I molecules, important for antigen presentation in the immune response.

Importance in Research:

  • Normalization Controls: Because their expression is relatively stable, housekeeping genes are often used as reference genes to normalize data in gene expression studies, so that differences in sample input, RNA quality, or reaction efficiency are not mistaken for biological changes.
  • Cellular Health Indicators: Consistent expression of housekeeping genes is an indicator of normal cellular function and health, whereas deviations can signal cellular stress or pathology.
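
To make the normalization idea concrete, here is a minimal sketch of one widely used scheme, the 2^-ΔΔCt method for qPCR. The method itself is not named above, so treat it as an assumed example; the function name and the Ct values are hypothetical, and the formula assumes roughly 100% amplification efficiency for both genes.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^-ddCt method.

    Assumes both the target and the reference (housekeeping) gene
    amplify with roughly 100% efficiency.
    """
    # dCt: target Ct minus reference-gene Ct, computed within each sample
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    # ddCt: treated sample relative to the control sample
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: target gene and a reference gene such as GAPDH
fold = relative_expression(
    ct_target_treated=24.0, ct_ref_treated=18.0,
    ct_target_control=26.0, ct_ref_control=18.0,
)
print(fold)  # 4.0, i.e. the target is about 4-fold higher in the treated sample
```

If the reference gene's Ct shifts between conditions, the fold change shifts with it, which is why the stability of the chosen reference gene matters.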

In summary, housekeeping genes are essential for the fundamental operations of cells, consistently expressed to support critical cellular functions, and serve as vital tools in molecular biology research for ensuring accurate and reliable experimental results.

Tools for bioinformatics analysis

  1. Gene ID conversion tool: https://www.syngoportal.org/convert
  2. Connectivity Map analysis tool: https://clue.io/query
  3. Gprofiler: https://biit.cs.ut.ee/gprofiler/gost
  4. Gene set enrichment tool: http://bioinformatics.sdstate.edu/go/
  5. Pantherdb: https://pantherdb.org/tools/compareToRefList.jsp?&showAll=false

GWAS studies practical guide

  1. https://www.youtube.com/watch?v=nrbgly0Bcv8
  2. https://www.r-bloggers.com/2017/10/genome-wide-association-studies-in-r/

What is a GWAS study?

A Genome-Wide Association Study (GWAS) is a research approach used to identify genetic variations associated with specific diseases or traits. Here's a detailed explanation of what GWAS involves and its significance:

What is GWAS?

  1. Objective:

    • The primary goal of a GWAS is to uncover the genetic basis of complex traits or diseases by scanning the genome for single nucleotide polymorphisms (SNPs) that occur more frequently in individuals with a particular condition compared to those without.
  2. Methodology:

    • Sample Collection: GWAS begins with the collection of DNA samples from two groups: individuals with the disease or trait of interest (cases) and individuals without it (controls).
    • Genotyping: The DNA samples are genotyped to identify SNPs across the genome. Modern GWAS typically use high-throughput genotyping arrays that can examine hundreds of thousands to millions of SNPs simultaneously.
    • Statistical Analysis: Each SNP is tested for a statistically significant association with the disease or trait. For a case-control design this means comparing allele or genotype frequencies at each SNP between cases and controls.
    • Correction for Multiple Testing: Given the large number of SNPs tested, corrections for multiple comparisons are necessary to reduce the likelihood of false positives. Common methods include the Bonferroni correction or the False Discovery Rate (FDR) approach. In human GWAS the conventional genome-wide significance threshold is p < 5 × 10^-8.
    • Replication: Findings from the initial analysis are often validated in independent cohorts to confirm the associations.
  3. Output:

    • The results of a GWAS are typically presented as a Manhattan plot, where each dot represents a SNP, placed by genomic position on the x-axis and by -log10(p-value) on the y-axis. Peaks in the plot indicate regions of the genome that are significantly associated with the trait.
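
The statistical analysis and multiple-testing steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: it uses a Pearson chi-square test on allele counts (one common choice among several), the SNP names and counts are made up, and all function names are hypothetical. Real analyses are normally run with dedicated tools such as PLINK.

```python
import math


def allelic_chi2_pvalue(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case, row_ctrl = case_alt + case_ref, ctrl_alt + ctrl_ref
    col_alt, col_ref = case_alt + ctrl_alt, case_ref + ctrl_ref
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        return 1.0  # empty group or monomorphic SNP: nothing to test
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (
        row_case * row_ctrl * col_alt * col_ref
    )
    # Upper tail of the chi-square distribution with 1 df
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni_significant(pvalues, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests."""
    threshold = alpha / len(pvalues)
    return [p < threshold for p in pvalues]


def benjamini_hochberg_significant(pvalues, fdr=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg FDR procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0  # largest rank k whose p-value is <= (k / m) * fdr
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            max_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref)
snps = {
    "snp_1": (300, 700, 200, 800),
    "snp_2": (250, 750, 245, 755),
    "snp_3": (120, 880, 100, 900),
}
pvals = [allelic_chi2_pvalue(*counts) for counts in snps.values()]
bonf = bonferroni_significant(pvals)
bh = benjamini_hochberg_significant(pvals)
for name, p, b, f in zip(snps, pvals, bonf, bh):
    print(f"{name}: p={p:.3g}, -log10(p)={-math.log10(p):.2f}, "
          f"Bonferroni={b}, BH={f}")
```

The `-log10(p)` value printed for each SNP is the quantity a Manhattan plot shows on its y-axis. With only three toy SNPs both corrections agree; on genome-scale data Bonferroni is typically the stricter of the two.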

Significance of GWAS

  1. Understanding Genetic Architecture:

    • GWAS has helped identify numerous genetic loci associated with a wide range of diseases and traits, providing insights into their genetic architecture and biological pathways.
  2. Disease Mechanisms:

    • By pinpointing genetic variants linked to diseases, GWAS can reveal new biological mechanisms and pathways involved in disease development, which can inform the development of new therapeutic targets.
  3. Personalized Medicine:

    • GWAS findings contribute to the field of personalized medicine by identifying genetic markers that can predict disease risk, treatment response, or adverse drug reactions, allowing for more tailored healthcare strategies.
  4. Polygenic Risk Scores:

    • GWAS data can be used to create polygenic risk scores, which aggregate the effects of multiple genetic variants to estimate an individual's genetic predisposition to a particular disease.
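
As a rough illustration of the polygenic risk score idea, the sketch below computes a score as a weighted sum of risk-allele counts. It assumes the simplest additive form, with per-SNP effect sizes (for example log odds ratios from GWAS summary statistics) as weights; the effect sizes, genotypes, and function name are hypothetical.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive score: sum over SNPs of effect size times risk-allele dosage (0, 1 or 2)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * beta for d, beta in zip(dosages, effect_sizes))


# Hypothetical per-SNP effect sizes and two individuals' risk-allele dosages
betas = [0.5, 0.25, -0.125]
person_a = [2, 1, 0]
person_b = [0, 1, 2]

score_a = polygenic_risk_score(person_a, betas)
score_b = polygenic_risk_score(person_b, betas)
print(score_a, score_b)  # 1.25 0.0
```

In practice the raw score is usually compared against the distribution of scores in a reference population rather than interpreted on its own.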

Challenges and Limitations

  1. Complex Traits:

    • Many complex traits and diseases are influenced by numerous genetic variants, each contributing a small effect, as well as environmental factors. This makes it challenging to identify all relevant variants.
  2. Population Stratification:

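    • Systematic differences in allele frequencies between subpopulations can lead to spurious associations if not properly controlled for, for example by including principal components of ancestry as covariates or by using mixed models. Separately, including diverse populations in GWAS helps ensure that findings are broadly applicable.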
  3. Missing Heritability:

    • Despite identifying many genetic associations, a large portion of the heritability of complex traits remains unexplained. This "missing heritability" suggests that other factors, such as rare variants, gene-gene interactions, and gene-environment interactions, also play significant roles.

Conclusion

GWAS is a powerful tool for uncovering the genetic underpinnings of diseases and traits. By scanning the genome for associations between genetic variants and specific conditions, GWAS has significantly advanced our understanding of human genetics and contributed to the development of personalized medicine. However, it also faces challenges that require ongoing research and methodological improvements to fully realize its potential.

About PLINK: https://www.cog-genomics.org/plink/2.0/input

Algorithms and Data Structures

https://www.youtube.com/watch?v=8hly31xKli0

Website for specific tools:

Plotting Upsetplot: https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html

Machine_Learning_Resource

Below are some of the resource links I often come across and rely on for my research.

  1. https://github.com/louisfb01/Best_AI_paper_2020

  2. Machine Learning Mastery by Jason - https://machinelearningmastery.com/ (This one is my favourite.)

  3. Probability concepts explained: Maximum likelihood estimation (https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1)

  4. Some intuitive questions on Data Science: https://career-accelerator.corsairs.network/99-questions-every-entry-level-analyst-should-be-able-to-answer-68cb45f9c91a

  5. https://whats-ai.medium.com/top-10-computer-vision-papers-2020-aa606985f688

  6. https://github.com/louisfb01/Top-10-Computer-Vision-Papers-2020

  7. https://medium.com/towards-artificial-intelligence/start-machine-learning-in-2020-become-an-expert-from-nothing-for-free-f31587630cf7 by Louis (What’s AI) Bouchard

Some uploaded files are collections of interview questions from different resources. Enjoy reading!

Resources on specific topics (collected over the years through my research and work with several collaborators)

Deep learning

Bayesian inference/Advanced Statistics/Probabilistic models

  • All MCMC/SMC packages: https://gabriel-p.github.io/pythonMCMC/
  • Bayesian deep learning: https://zhusuan.readthedocs.io/en/latest/
  • Duke: https://people.duke.edu/~ccc14/sta-663/MCMC.html
  • BayesPy: http://bayespy.org/examples/examples.html
  • Statsmodels: https://github.com/statsmodels/statsmodels
  • XGBoost: https://github.com/dmlc/xgboost
  • LightGBM: https://github.com/Microsoft/LightGBM
  • Catboost: https://github.com/catboost/catboost
  • PyBrain: https://github.com/pybrain/pybrain
  • Eli5: https://github.com/TeamHG-Memex/eli5

Deep Learning- Variational autoencoders

  1. Variational autoencoders: https://zhusuan.readthedocs.io/en/latest/tutorials/vae.html
  2. https://github.com/kvfrans/variational-autoencoder
  3. https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/

Python big data

  • Parallel computing: https://wiki.python.org/moin/ParallelProcessing
  • GPU compatibility: https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/
  • PyCUDA: https://documen.tician.de/pycuda/tutorial.html
  • PyGPU: http://fileadmin.cs.lth.se/cs/Personal/calle_lejdfors/pygpu/
  • AWS: https://aws.amazon.com/developer/language/python/ (see the sample code and the 10-minute tutorial)
  • Apache Spark: https://spark.apache.org/docs/0.9.1/python-programming-guide.html
  • PySpark: https://spark.apache.org/docs/0.9.1/python-programming-guide.html
  • Apache Hadoop: https://hadoop.apache.org/

Reinforcement Learning

  • keras-rl: https://github.com/keras-rl/keras-rl
  • OpenAI Gym: https://github.com/openai/gym

Optimization

  • Convex: https://cvxopt.org/
  • Platypus: https://platypus.readthedocs.io/en/latest/
  • PyGMO: http://esa.github.io/pygmo/
  • DEAP: https://deap.readthedocs.io/en/master/examples/index.html
  • GAFT: https://github.com/pytlab/gaft

Some short reads:

5 Beginner-Friendly Steps to Learn Machine Learning and Data Science with Python — Daniel Bourke

What is Machine Learning? — Roberto Iriondo

Machine Learning for Beginners: An Introduction to Neural Networks — Victor Zhou

A Beginners Guide to Neural Networks — Thomas Davis

Understanding Neural Networks — Prince Canuma

Reading lists for new MILA students — Anonymous

The 80/20 AI Reading List — Vishal Maini

Some useful YouTube links for simple demonstrations of ML topics:

  1. LSTM: https://www.youtube.com/watch?v=8HyCNIVRbSU

Interview Questions:

  1. https://www.analyticsvidhya.com/blog/2020/04/comprehensive-popular-deep-learning-interview-questions-answers/

  2. https://intellipaat.com/blog/interview-question/deep-learning-interview-questions/

  3. https://www.edureka.co/blog/interview-questions/machine-learning-interview-questions/ (from Edureka)

  4. https://medium.com/modern-nlp/nlp-interview-questions-f062040f32f7 (Medium post by Pratik Bhavsar)

  5. A simple way of understanding the transformer model in NLP: https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/

  6. A brief discussion on BERT: https://towardsdatascience.com/understanding-bert-bidirectional-encoder-representations-from-transformers-45ee6cd51eef

AI conference paper and presentation list: https://crossminds.ai/explore/

  7. An excellent resource for text similarities in NLP

  8. What is a CNN? Convolutional Neural Networks: The Biologically-Inspired Model: https://www.codementor.io/@james_aka_yale/convolutional-neural-networks-the-biologically-inspired-model-iq6s48zms

  9. AI in drug discovery: https://practicalcheminformatics.blogspot.com/2021/01/ai-in-drug-discovery-2020-highly.html

  10. An interesting GitHub page on Data Science and Machine Learning: https://github.com/achuthasubhash/Complete-Life-Cycle-of-a-Data-Science-Project

Bayesian Neural Network

  1. https://analyticsindiamag.com/hands-on-guide-to-bayesian-neural-network-in-classification/

  2. https://keras.io/examples/keras_recipes/bayesian_neural_networks/

How Can You Distinguish Yourself from Hundreds of Other Data Science Candidates?

https://towardsdatascience.com/how-to-distinguish-yourself-from-hundreds-of-data-science-candidates-62457dd8f385

Good blog post on NLP problem solving

Machine Learning in Bioinformatics

  1. https://www.kdnuggets.com/2019/09/explore-world-bioinformatics-machine-learning.html

  2. https://medium.com/@alenaharley/tumor-classification-using-gene-expression-data-poking-at-a-problem-using-fast-ai-again-8633c2256c85

Pathway analysis

Pathway enrichment analysis of metabolites

  1. Lilikoi: an R package for personalized pathway-based classification modeling using metabolomics data (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6290884/)

Feature importance:

https://machinelearningmastery.com/calculate-feature-importance-with-python/

Phylogenetic association analysis related resources:

  1. https://dendropy.org/programs/sumtrees.html

AutoML Resources:

  1. H2O AutoML - https://lnkd.in/gcxQSEW2
  2. AutoGluon - https://lnkd.in/gXcrqnU9
  3. AutoKeras - https://lnkd.in/ghsjphDt
  4. Auto-PyTorch - https://lnkd.in/gbbNQy5R
  5. Auto-sklearn - https://lnkd.in/g4MxeeVT
  6. EvalML - https://lnkd.in/gDjQX3At
  7. FLAML - https://lnkd.in/gUkiwqyb
  8. LightAutoML - https://lnkd.in/gU2-jccZ
  9. MLJAR - https://mljar.com/
  10. PyCaret AutoML - https://lnkd.in/gvw8DNv8
  11. TPOT - https://lnkd.in/g3z9YtuU
  12. GradsFlow - https://docs.gradsflow.com/en/latest/

A notebook by Rohan Rao with examples of the above-mentioned tools/libraries:

https://www.kaggle.com/rohanrao/automl-tutorial-tps-september-2021

Some online resources for motivation

  1. Kobe Bryant: https://www.youtube.com/watch?v=VSceuiPBpxY
  2. Bollywood Actor Anupam Kher with Gaur Gopal Das, Best Indian Motivational Speaker: https://www.youtube.com/watch?v=DGIjuVbGP_A

Videos on ML and healthcare

  1. https://www.youtube.com/watch?v=oyVnONlEZoA

Some YouTubers who share free advice on learning data science:

  1. Ken Jee: https://www.youtube.com/c/KenJee1
  2. Dhaval Patel: https://www.youtube.com/playlist?list=PLeo1K3hjS3us_ELKYSj_Fth2tIEkdKXvV
  3. Tina Huang: https://www.youtube.com/channel/UC2UXDak6o7rBm23k3Vv5dww
  4. Andrew Mo: https://www.youtube.com/channel/UC23emuGbNM7twofQIrEgPBQ

Some blogs on Statistics:

  • https://learningstatisticswithr.com/
  • https://advstats.psychstat.org/book/power/index.php (free online book with example code in R; I found it handy)
  • https://worthylab.org/statistics/
  • https://r4ds.had.co.nz/
  • https://xcelab.net/rm/statistical-rethinking/

What is the odds ratio in an exact test?

In statistics, especially in the context of hypothesis testing, the odds ratio (OR) is a measure of association between an exposure and an outcome. It quantifies the strength and direction of the relationship between two variables. The odds ratio is often used in logistic regression analysis and in studies where the outcome of interest is binary (e.g., success or failure, presence or absence).

In the context of an exact test, such as Fisher's exact test, the odds ratio is used to compare the odds of an event (e.g., having a certain characteristic or outcome) between two groups. Fisher's exact test is used to determine if there is a significant association between two categorical variables by examining the relationship between their frequencies.

Here's how the odds ratio is typically calculated in the context of Fisher's exact test:

  • For a 2x2 contingency table: If you have a 2x2 table where rows represent two groups (e.g., treatment and control) and columns represent the presence or absence of an outcome (e.g., success or failure), the odds ratio is calculated as the ratio of the odds of success in one group to the odds of success in the other group.

  • Formula: Let's say the 2x2 table looks like this:

            Outcome Present   Outcome Absent
    Group A      a                 b
    Group B      c                 d
    

    Then the odds ratio (OR) is given by:

    \[ \text{OR} = \frac{a/b}{c/d} = \frac{ad}{bc} \]

  • Interpretation: An odds ratio greater than 1 indicates that the odds of the event (e.g., success) are higher in the first group (Group A) than in the second group (Group B). An odds ratio less than 1 indicates the opposite. An odds ratio of 1 suggests that there is no association between the exposure and the outcome.

In Fisher's exact test, the p-value tests the null hypothesis that the two variables are independent, which is equivalent to a true odds ratio of 1. If the p-value is below a predetermined significance level (often 0.05), a table at least as extreme as the observed one would be unlikely under independence, and this is taken as evidence of an association between the variables.
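
A small worked example may help tie the formula and the p-value together. The sketch below uses only the Python standard library; the counts are hypothetical and the function names are my own. It computes the sample odds ratio ad/bc and a two-sided Fisher p-value by summing the hypergeometric probabilities of every table with the same margins that is no more likely than the observed one, which is one common definition of "two-sided" for this test.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio ad/bc for the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        # Division by zero: report infinity, or NaN when the numerator is also zero
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical table: Group A has 8 present / 2 absent, Group B has 1 present / 5 absent
a, b, c, d = 8, 2, 1, 5
or_value = odds_ratio(a, b, c, d)
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"OR = {or_value:.1f}, p = {p_value:.4f}")  # OR = 20.0, p = 0.0350
```

Here the odds of the outcome are 20 times higher in Group A, and the p-value of about 0.035 falls below 0.05. Library routines such as `scipy.stats.fisher_exact` are the usual choice in practice; note that R's `fisher.test` reports a conditional maximum-likelihood estimate of the odds ratio, which generally differs slightly from ad/bc.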
