carpentries-incubator / high-dimensional-stats-r

High-dimensional statistics with R

Home Page: https://carpentries-incubator.github.io/high-dimensional-stats-r


Issues spotted during second delivery

ailithewing opened this issue

Regression with many features:

coef_df (the vector of p-values) is not defined/created in the episode code.
Code for the heatmap was requested (maybe we could hide it in a box, like solutions?)

Regularisation:

Sum of squared residuals isn't squared (Fixed)
glmnet scales and centres internally so no need to scale/center separately

Regularisation:

Currently, scaling is applied to the whole dataset before the train/test split. However, scaling the training set is not actually required, as glmnet standardises the predictors by default when fitting the regularised model. Nor is scaling required on the test set, because glmnet transforms the coefficients back to the original scale. The code and related text need to be updated. We should also include a brief explanation of how scaling affects the inferred regression coefficients and how the original scale can be recovered post hoc.
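A minimal base-R sketch of that last point, using lm() and simulated data rather than the lesson's glmnet code: scaling the predictors rescales the coefficients, and the original scale can be recovered post hoc from the column means and SDs.

```r
# Sketch with lm() and made-up data (not the episode's glmnet code):
set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)
y <- drop(x %*% c(1, -2, 0.5)) + rnorm(100)

fit_raw    <- lm(y ~ x)
x_scaled   <- scale(x)               # centre and scale each column
fit_scaled <- lm(y ~ x_scaled)

sds   <- attr(x_scaled, "scaled:scale")
means <- attr(x_scaled, "scaled:center")

# Slopes on the original scale: divide by the column SDs;
# intercept: subtract the contribution of the column means.
beta_orig      <- coef(fit_scaled)[-1] / sds
intercept_orig <- coef(fit_scaled)[1] - sum(coef(fit_scaled)[-1] * means / sds)

all.equal(unname(beta_orig), unname(coef(fit_raw)[-1]))      # TRUE
all.equal(unname(intercept_orig), unname(coef(fit_raw)[1]))  # TRUE
```

The same algebra applies to glmnet's internal standardisation, which is why its returned coefficients can be reported on the original scale.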

Regularisation:

Issue detected by one of the students (see here). Whilst

fit_horvath <- lm(train_age ~ train_mat)

and

fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))

can be used to fit the same model, the first gives a warning when used in combination with predict, the second does not:

pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat))

It's unclear why this happens.

the first gives a warning when used in combination with predict, the second does not:

Here's an example with mtcars. The coefficients have the variable name slapped on the front, so when you go to predict, the variable names in newdata are all wrong. E.g. here it's all matcyl, matdisp, etc., not just cyl, disp.

r$> mat <- as.matrix(mtcars[-1])                                                
r$> lm(mtcars[[1]] ~ mat)                                                       

Call:
lm(formula = mtcars[[1]] ~ mat)

Coefficients:
(Intercept)       matcyl      matdisp        mathp      matdrat        matwt  
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
    matqsec        matvs        matam      matgear      matcarb  
    0.82104      0.31776      2.52023      0.65541     -0.19942  
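For completeness, a small sketch of the workaround with the same mtcars data: fitting through a data frame keeps the plain column names, so predict() can match them in newdata without the mat prefix or the warning.

```r
# Fit via a data frame instead of a matrix: coefficient names stay
# "cyl", "disp", ... rather than "matcyl", "matdisp", ...
fit <- lm(mpg ~ ., data = mtcars)
names(coef(fit))[2:3]                  # "cyl" "disp"

# predict() now finds matching columns in newdata, with no warning:
pred <- predict(fit, newdata = mtcars[1:3, -1])
```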

thanks, that's very useful!

Cheers. Actually I ran into that issue myself last time so I should've noted it down. Hope the delivery's going okay

Some things I noticed

K-means clustering

To make silhouette plot work, specify border arg: plot(sil, border = NA)
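A minimal sketch of that fix, using the built-in iris data and assuming the cluster package (shipped with R as a recommended package) is available:

```r
library(cluster)                          # for silhouette()

set.seed(42)
km  <- kmeans(iris[, 1:4], centers = 3)
sil <- silhouette(km$cluster, dist(iris[, 1:4]))

# On some graphics devices the default bar borders hide the bars,
# so the silhouette plot appears empty; border = NA avoids this:
plot(sil, border = NA)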

Factor analysis

Change first figure?
The first figure of the blue table makes it look like one can specify which features to subsume into which factors. But this is not the point of FA, I understand. Would it be good to make it (more) obvious in the figure that students' achievements in each subject group (writing and maths) are correlated? Could colour cells according to values in Excel/use heatmap. The grouping factors could then be shown beneath the figure to make clear they are a result of the analysis. Also, I realise that each feature can contribute to multiple factors. So, perhaps a different figure would be better.

factanal() with high-dim data?
I've input these student data into R and factanal did not seem to run until I removed features. Does factanal actually work on high-dim data in the sense of "more features than observations"?
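A quick way to probe this, as a sketch with simulated "wide" data rather than the student data:

```r
# Simulated wide data: more features (20) than observations (10).
set.seed(1)
x_wide <- matrix(rnorm(10 * 20), nrow = 10)

# factanal() fits maximum-likelihood factor analysis from the sample
# correlation matrix; with p > n that matrix is rank-deficient, so the
# fit can fail. try() captures any error instead of stopping:
res <- try(factanal(x_wide, factors = 2), silent = TRUE)
inherits(res, "try-error")
```

In my understanding this is consistent with factanal() being intended for the classical n > p setting, but it would be good to confirm and state that in the episode.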

High-dim data definition

We started off saying that (number of features) > (number of observations) in high-dim data. But later on we repeatedly used datasets where there were more observations than features. The Wikipedia article states "the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis". Would be on the safe side to follow this view.

re: High-dim data definition, it may be worth pointing out somewhere that large n (observations) has its own problems for computation but is generally good news re: asymptotic assumptions, etc. While large p (features) is mainly what we're dealing with.

Episode: Introduction to high-dimensional data

Not sure what the motivation is behind the green/red/black dots and the exercise in the section "What statistical methods are used to analyse high-dimensional data?"

Could this be dropped?

Aim of an analysis

Sometimes we use stats to predict/classify and sometimes we want to infer parameters or test hypotheses. High-dim data may appear in all these contexts. Perhaps that could be mentioned in the intro.

Also, could we make clearer which analyses are used when? E.g. regression with many features (SNPs/transcripts/proteins) is used to find outliers, whereas regularisation is used if we want to predict. (I hope this is correct.)

ATM the course is structured by methods. It would mean some work, but would it be better to re-structure by applications? At least there could be an overview in the intro or at the end to show what's used when.

I just noticed that inference and prediction are discussed in the episode "Feature selection for regression (optional lesson)". I think this could be moved to a more prominent place.

Episode: Regularised regression

This took us ages to get through in the last run. Could we shorten or make optional the LASSO part?

Cross validation and model selection are touched on here. They could be treated in their own right. (I realise it's in the next (optional) episode.)

Also, is regularised regression covered in the machine learning course? Could we drop this episode altogether?

Transferability of this training

I understand that the point of this course is to teach how to analyse high-dim data (a transferable skill). Once an attendee has learned the principles, they may go away and apply these using different software packages and programming languages of their choice. In the latest run of this course, one attendee asked whether there were Python examples.

I think it might be good not to (over)use tidyverse functions and paradigms such as piping and ggplot. While they are useful, it's a steep learning curve for people unfamiliar with them. Making familiarity with tidyverse a requirement would be overkill, as it is not at all necessary for understanding stats for high-dim data. Also, tidyverse functionality often does not generalise well to other languages.

I'd be more than happy to suggest alternative code if that is OK.

Hi @hannesbecher, could you please check whether the issues here have already been addressed, or whether they are in #64.

Most points are implemented, about to be implemented, or covered elsewhere.