carpentries-incubator / high-dimensional-stats-r

High-dimensional statistics with R

Home Page: https://carpentries-incubator.github.io/high-dimensional-stats-r


Issues spotted during second delivery

ailithewing opened this issue

Regression with many features:

coef_df (the vector of p-values) is not defined/created in the episode code.
Code for the heatmap was requested (maybe we could hide it in a box, like solutions?)

Regularisation:

Sum of squared residuals isn't squared (Fixed)
glmnet scales and centres internally so no need to scale/center separately

Regularisation:

Currently, scaling is applied to the whole dataset before the train/test split. However, scaling the training set is not actually required, as glmnet standardises the predictors by default when fitting the regularised model. Nor is scaling required on the test set, because glmnet transforms the coefficients back to the original scale. The code and related text need to be updated. We should also include a brief explanation of how scaling affects the inferred regression coefficients and how the original scale can be recovered post hoc.
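A minimal base-R sketch of that last point, using lm() and simulated data rather than the lesson's glmnet code: scaling the predictors rescales the coefficients, and the original scale can be recovered post hoc from the column means and SDs.

```r
# Sketch with lm() and made-up data (not the episode's glmnet code):
set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)
y <- drop(x %*% c(1, -2, 0.5)) + rnorm(100)

fit_raw    <- lm(y ~ x)
x_scaled   <- scale(x)               # centre and scale each column
fit_scaled <- lm(y ~ x_scaled)

sds   <- attr(x_scaled, "scaled:scale")
means <- attr(x_scaled, "scaled:center")

# Slopes on the original scale: divide by the column SDs;
# intercept: subtract the contribution of the column means.
beta_orig      <- coef(fit_scaled)[-1] / sds
intercept_orig <- coef(fit_scaled)[1] - sum(coef(fit_scaled)[-1] * means / sds)

all.equal(unname(beta_orig), unname(coef(fit_raw)[-1]))      # TRUE
all.equal(unname(intercept_orig), unname(coef(fit_raw)[1]))  # TRUE
```

The same algebra applies to glmnet's internal standardisation, which is why its returned coefficients can be reported on the original scale.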

Regularisation:

Issue detected by one of the students (see here). Whilst

fit_horvath <- lm(train_age ~ train_mat)

and

fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))

can be used to fit the same model, the first gives a warning when used in combination with predict, the second does not:

pred_lm <- predict(fit_horvath, newdata = as.data.frame(test_mat))

It's unclear why this happens.

the first gives a warning when used in combination with predict, the second does not:

Here's an example with mtcars. The coefficients have the variable name slapped on the front, so when you go to predict, the variable names in newdata are all wrong. E.g. here it's all matcyl, matdisp, etc., not just cyl, disp.

r$> mat <- as.matrix(mtcars[-1])                                                
r$> lm(mtcars[[1]] ~ mat)                                                       

Call:
lm(formula = mtcars[[1]] ~ mat)

Coefficients:
(Intercept)       matcyl      matdisp        mathp      matdrat        matwt  
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
    matqsec        matvs        matam      matgear      matcarb  
    0.82104      0.31776      2.52023      0.65541     -0.19942  
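For completeness, a small sketch of the workaround with the same mtcars data: fitting through a data frame keeps the plain column names, so predict() can match them in newdata without the mat prefix or the warning.

```r
# Fit via a data frame instead of a matrix: coefficient names stay
# "cyl", "disp", ... rather than "matcyl", "matdisp", ...
fit <- lm(mpg ~ ., data = mtcars)
names(coef(fit))[2:3]                  # "cyl" "disp"

# predict() now finds matching columns in newdata, with no warning:
pred <- predict(fit, newdata = mtcars[1:3, -1])
```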

thanks, that's very useful!

Cheers. Actually I ran into that issue myself last time so I should've noted it down. Hope the delivery's going okay

Some things I noticed

K-means clustering

To make silhouette plot work, specify border arg: plot(sil, border = NA)
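A minimal sketch of that fix, using the built-in iris data and assuming the cluster package (shipped with R as a recommended package) is available:

```r
library(cluster)                          # for silhouette()

set.seed(42)
km  <- kmeans(iris[, 1:4], centers = 3)
sil <- silhouette(km$cluster, dist(iris[, 1:4]))

# On some graphics devices the default bar borders hide the bars,
# so the silhouette plot appears empty; border = NA avoids this:
plot(sil, border = NA)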

Factor analysis

Change first figure?
The first figure of the blue table makes it look like one can specify which features to subsume into which factors. But this is not the point of FA, I understand. Would it be good to make it (more) obvious in the figure that students' achievements in each subject group (writing and maths) are correlated? Could colour cells according to values in Excel/use heatmap. The grouping factors could then be shown beneath the figure to make clear they are a result of the analysis. Also, I realise that each feature can contribute to multiple factors. So, perhaps a different figure would be better.

factanal() with high-dim data?
I've input these student data into R and factanal did not seem to run until I removed features. Does factanal actually work on high-dim data in the sense of "more features than observations"?
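A quick way to probe this, as a sketch with simulated "wide" data rather than the student data:

```r
# Simulated wide data: more features (20) than observations (10).
set.seed(1)
x_wide <- matrix(rnorm(10 * 20), nrow = 10)

# factanal() fits maximum-likelihood factor analysis from the sample
# correlation matrix; with p > n that matrix is rank-deficient, so the
# fit can fail. try() captures any error instead of stopping:
res <- try(factanal(x_wide, factors = 2), silent = TRUE)
inherits(res, "try-error")
```

In my understanding this is consistent with factanal() being intended for the classical n > p setting, but it would be good to confirm and state that in the episode.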

High-dim data definition

We started off saying that (number of features) > (number of observations) in high-dim data. But later on we repeatedly used datasets where there were more observations than features. The Wikipedia article states "the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis". Would be on the safe side to follow this view.

re: High-dim data definition, it may be worth pointing out somewhere that large n (observations) has its own problems for computation but is generally good news re: asymptotic assumptions, etc. While large p (features) is mainly what we're dealing with.

Episode: Introduction to high-dimensional data

Not sure what the motivation is behind the green/red/black dots and the exercise in the section "What statistical methods are used to analyse high-dimensional data?"

Could this be dropped?

Aim of an analysis

Sometimes we use stats to predict/classify and sometimes we want to infer parameters or test hypotheses. High-dim data may appear in all these contexts. Perhaps that could be mentioned in the intro.

Also, could we make clearer which analyses are used when? E.g. regression with many features (SNPs/transcripts/proteins) is used to find outliers, whereas regularisation is used if we want to predict. (I hope this is correct.)

ATM the course is structured by methods. It would mean some work, but would it be better to re-structure by applications? At least there could be an overview in the intro or at the end to show what's used when.

I just noticed that inference and prediction are discussed in the episode "Feature selection for regression (optional lesson)". I think this could be moved to a more prominent place.

Episode: Regularised regression

This took us ages to get through in the last run. Could we shorten or make optional the LASSO part?

Cross validation and model selection are touched on here. They could be treated in their own right. (I realise it's in the next (optional) episode.)

Also, is regularised regression covered in the machine learning course? Could we drop this episode altogether?

Transferability of this training

I understand that the point of this course is to teach how to analyse high-dim data (a transferable skill). Once an attendee has learned the principles, they may go away and apply these using different software packages and programming languages of their choice. In the latest run of this course, one attendee asked whether there were Python examples.

I think it might be good not to (over)use tidyverse functions and paradigms such as piping and ggplot. While they are useful, it's a steep learning curve for people unfamiliar with them. Making familiarity with tidyverse a requirement would be overkill, as it is not at all necessary for understanding stats for high-dim data. Also, tidyverse functionality often does not generalise well to other languages.

I'd be more than happy to suggest alternative code if that is OK.

Hi @hannesbecher, could you please check whether the issues here have already been addressed, or whether they are in #64.

Most points are implemented, about to be implemented, or covered elsewhere.