vishal-keshav / yet-another-ml-lectures

Lecture notes from my weekly ML sessions.


yet-another-ml-lectures

Books followed

| Week | Topic | Extra notes |
| --- | --- | --- |
| [4](#week-4) | Introduction | --- |
| [5](#week-5) | Probability | Notes |
| [8](#week-8) | Information theory, bias-variance tradeoff | --- |
| [9](#week-9) | PyTorch hands-on | Code |
| [11](#week-11) | Linear Algebra 1 | --- |
| [12](#week-12) | Metrics for imbalanced data | --- |
| [13](#week-13) | Bayesian Shit | --- |

Week 4

  • Introduction and the expectations from these sessions.
  • Difference between the classical ML approach and the deep learning approach; pros and cons of each.
  • Importance of intuitive understanding built on basic math knowledge.
  • Practical approach and the importance of coding exercises; the gap between theory and practical implementation.

Week 5

  • Basics of probability
  • Important discrete and continuous distributions
  • Self-information, mutual information, KL divergence and cross entropy.
  • A good article on cross entropy theory here
  • Clarification of the binary and multi-class cross-entropy interfaces in PyTorch (see the sketch after this list):
    • torch.nn.functional.binary_cross_entropy takes probabilities (logistic sigmoid outputs) as inputs
    • torch.nn.functional.binary_cross_entropy_with_logits takes logits as inputs
    • torch.nn.functional.cross_entropy takes logits as inputs (performs log_softmax internally)
    • torch.nn.functional.nll_loss is like cross_entropy but takes log-probabilities (log_softmax outputs) as inputs
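
A minimal sketch (not from the sessions) that checks these interfaces agree on made-up tensors; the shapes and random values are arbitrary and chosen only for illustration:

```python
import torch
import torch.nn.functional as F

# Multi-class case: raw scores (logits) for 4 samples and 3 classes
logits = torch.randn(4, 3)
targets = torch.randint(0, 3, (4,))

# cross_entropy applies log_softmax internally, so it takes logits directly
loss_ce = F.cross_entropy(logits, targets)
# nll_loss expects log-probabilities, so apply log_softmax first
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
assert torch.allclose(loss_ce, loss_nll)

# Binary case: one raw score per sample
bin_logits = torch.randn(4)
bin_targets = torch.randint(0, 2, (4,)).float()

# binary_cross_entropy_with_logits takes raw logits
loss_bce_logits = F.binary_cross_entropy_with_logits(bin_logits, bin_targets)
# binary_cross_entropy takes probabilities, i.e. the sigmoid of the logits
loss_bce = F.binary_cross_entropy(torch.sigmoid(bin_logits), bin_targets)
assert torch.allclose(loss_bce_logits, loss_bce)
```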

Week 8

  • Self-information: Number of bits required to convey an event x, given by -log(p(x)).
  • Self-entropy: Average uncertainty over the whole probability distribution, given by E_{x~p(x)}[-log(p(x))]
  • KL divergence: Measures the distance between two distributions P and Q over the same random variable X, given by E_{x~P}[log(P(x)/Q(x))]
  • KL divergence is the number of extra bits incurred when encoding samples from P using a code designed for samples from Q.
  • The characteristics of forward KL (zero spreading) and backward KL (zero inducing) are described here
  • (Advanced) Density ratio estimation for KL divergence in TensorFlow here
  • Cross-entropy: Uncertainty of a distribution Q averaged over samples from another (cross) distribution P, given by E_{x~P}[-log(Q(x))].
  • Cross-entropy also equals the self-entropy of P plus the KL divergence between P and Q, i.e. H(P, Q) = H(P) + KL(P||Q). So minimizing cross-entropy with respect to Q minimizes the forward KL divergence between P and Q (see the first sketch after this list).
  • Empirical distribution: Since the real distribution is not available and we only get data points sampled from it, the empirical distribution replaces the real distribution by putting equal probability on each data sample (with the IID assumption).
  • So cross-entropy minimization usually means minimizing E_{x~P_hat}[-log(Q(x))], where P_hat(x_i) = 1/N (N is the number of samples).
  • Point estimators: A point estimator of a parameter is a function of the random variables representing the data observations. For example, in neural networks, the point estimator is the gradient computation followed by the additive parameter update.
  • Bias: The distance of the expected estimate from the real parameter (which is unknown), given by E[theta] - real_theta.
  • Variance: The variation of the estimated parameter under small perturbations of the observed data (and hence of the point estimation).
  • MSE and bias-variance tradeoff (see the second sketch below): E[(theta - real_theta)^2] = E[((theta - E[theta]) + (E[theta] - real_theta))^2] = E[(theta - E[theta])^2] + (E[theta] - real_theta)^2 + 2*(E[theta] - real_theta)*E[theta - E[theta]] = Var + Bias^2 + 0, since E[theta - E[theta]] = 0.
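
A minimal sketch (with made-up distributions P and Q over four outcomes) that checks H(P, Q) = H(P) + KL(P||Q) numerically:

```python
import torch

# Two discrete distributions P and Q over the same 4 outcomes (values are made up)
P = torch.tensor([0.1, 0.4, 0.3, 0.2])
Q = torch.tensor([0.25, 0.25, 0.25, 0.25])

entropy_P = -(P * P.log()).sum()        # H(P)     = E_{x~P}[-log P(x)]
kl_PQ = (P * (P / Q).log()).sum()       # KL(P||Q) = E_{x~P}[log(P(x)/Q(x))]
cross_PQ = -(P * Q.log()).sum()         # H(P, Q)  = E_{x~P}[-log Q(x)]

# Cross-entropy decomposes into self-entropy plus forward KL
assert torch.allclose(cross_PQ, entropy_P + kl_PQ)
print(entropy_P.item(), kl_PQ.item(), cross_PQ.item())
```

And a small Monte Carlo sketch of MSE = Bias^2 + Var, using the biased (1/N) and unbiased (1/(N-1)) variance estimators of a standard normal as example point estimators (sample size and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n_samples, n_trials = 1.0, 10, 50_000

biased, unbiased = [], []
for _ in range(n_trials):
    x = rng.normal(0.0, 1.0, n_samples)
    biased.append(np.var(x))            # divides by N   -> biased estimator
    unbiased.append(np.var(x, ddof=1))  # divides by N-1 -> unbiased estimator

for name, est in [("biased", np.array(biased)), ("unbiased", np.array(unbiased))]:
    bias = est.mean() - true_var             # E[theta] - real_theta
    var = est.var()                          # E[(theta - E[theta])^2]
    mse = ((est - true_var) ** 2).mean()     # E[(theta - real_theta)^2]
    print(f"{name:8s}  Bias^2 + Var = {bias**2 + var:.4f}  MSE = {mse:.4f}")
```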

Week 9

  • When PyTorch? (cheap, easy-to-write code, not as fast). When TensorFlow? (production-level code, very fast, established infra).
  • Start with PyTorch; once the proof-of-concept is ready, migrate the code to TensorFlow.
  • Structure the code: refer to the framework's test directory, and be ready to do a lot of hyper-parameter search (but not blindly).
  • Debugging through visualization is very, very important, e.g. distribution plots of the floats coming out of a layer, or of the weights (see the sketch below).
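
A minimal sketch of that kind of visualization; the two-layer model, layer name, and batch below are hypothetical and only there to illustrate the forward-hook-plus-histogram pattern (assumes matplotlib is installed):

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Hypothetical model used only for illustration
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

activations = {}

def save_activation(name):
    # Forward hook that records the output tensor of a layer
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model[0].register_forward_hook(save_activation("fc1"))

# Push a dummy batch through the model, then plot the distributions
x = torch.randn(64, 16)
model(x)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(activations["fc1"].flatten().numpy(), bins=50)
axes[0].set_title("fc1 activations")
axes[1].hist(model[0].weight.detach().flatten().numpy(), bins=50)
axes[1].set_title("fc1 weights")
plt.tight_layout()
plt.show()
```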

Week 11

  • Vector spaces and sub-spaces. A sub-space must contain the zero element and must satisfy the closure property.
  • A generating set is a set of vectors that spans the vector space. A basis of a vector space is a linearly independent set of vectors that spans the vector space.
  • Rank is the number of linearly independent rows (equivalently, the number of linearly independent columns).
  • Linear mappings (and their properties, such as being a homomorphism) and the linear transformation matrix.
  • The kernel of a linear mapping (the null space of its transformation matrix) is the set of all vectors that are mapped to the zero vector.
  • Norms map V to R and inner products map V x V to R, where V is the vector space. They measure length and distance.
  • Orthonormal/orthogonal bases and orthogonal projection. The Gram-Schmidt process finds such a basis (see the sketch below).
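
A minimal sketch of (classical) Gram-Schmidt orthonormalization; the input vectors are made up and the tolerance is an arbitrary choice:

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a set of vectors given as the rows of an array."""
    basis = []
    for v in vectors:
        # Subtract the projections onto the already-built orthonormal basis
        w = v - sum(np.dot(v, b) * b for b in basis)
        norm = np.linalg.norm(w)
        if norm > 1e-10:  # skip (near-)linearly-dependent vectors
            basis.append(w / norm)
    return np.array(basis)

V = np.array([[3.0, 1.0], [2.0, 2.0]])
Q = gram_schmidt(V)
print(np.round(Q @ Q.T, 6))  # ~identity matrix: the rows are orthonormal
```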

Week 12

  • True positive: hit; red squares with a dog face.
  • False negative: miss; yellow squares with a dog face.
  • False positive: low-skill model; red squares without a dog face.
  • True negative: all regions other than the red and yellow squares.
  • Precision: Out of the positive predictions, how many were actually correct.
  • Recall: Out of all the actual positive labels, how many we got (see the precision/recall demonstration figure).
  • The ROC curve (and its AUC) measures the model's skill on balanced data as the probability threshold is varied.
  • The precision-recall curve summarizes the model's skill on imbalanced data, where the positive examples are the real concern (see the sketch below).
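
A minimal sketch of these metrics on a made-up imbalanced toy set (the labels, scores, and 0.5 threshold are arbitrary), using scikit-learn:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

# Toy imbalanced labels and predicted probabilities, made up for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.05, 0.3, 0.4, 0.35, 0.6, 0.7, 0.55])
y_pred = (y_score >= 0.5).astype(int)  # threshold the probabilities at 0.5

print("precision:", precision_score(y_true, y_pred))           # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))               # TP / (TP + FN)
print("ROC AUC:  ", roc_auc_score(y_true, y_score))             # threshold-free, balanced view
print("PR AUC:   ", average_precision_score(y_true, y_score))   # focuses on the positive class
```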

Week 13

  • Bayesian approach to machine learning
  • MLE and MAP (see the sketch below)
  • Functional approximation
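
A minimal sketch contrasting MLE and MAP for a coin's heads probability; the flips and the Beta(2, 2) prior are made-up choices for illustration:

```python
import numpy as np

# Coin-flip data: 1 = heads
flips = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
heads, tails = flips.sum(), len(flips) - flips.sum()

# MLE: maximize the likelihood -> the sample mean
theta_mle = heads / len(flips)

# MAP with a Beta(a, b) prior: the posterior is Beta(heads + a, tails + b),
# and its mode is the MAP estimate
a, b = 2.0, 2.0
theta_map = (heads + a - 1) / (heads + a + tails + b - 2)

print("MLE:", theta_mle)  # 0.8
print("MAP:", theta_map)  # 0.75, pulled toward 0.5 by the prior
```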
