We know the Bayes rule. How does it relate to machine learning? Bayesian inference is based on using probability to represent all forms of uncertainty.
- Aleatory variability is the natural(intrinsic) randomness in a process; it is supposed irreducible and inherent natural to the process involved.
- Heteroscedastic: No one can sure the measurements done by your collegues are perfect..damn noise...(heteroscedastic means a different uncertainty for every input)
- Homoscedastic: model variance? you assumes identical observation noise for every input point x? Instead of having a variance being dependent on the input x, we must determine a so-called model precision
τ
and multiply it by the identity matrix I, such that all outputs y have the same variance and no co-variance among them exists. This model precisionτ
is the inverse observation standard deviation.
- Epistemic uncertainty is the scientific uncertainty in the model of the process; it is supposedly reducible with better knowledge, since it is not inherent in the real-world process under consideration (due to lack of knowledge and limited data..This can be reduced in time, if more data are collected and new models are developed).
- Inference for θ aims to understand the model.
- Prediction for Data aims to utilize the model you discovered.
- Frequentists' probability refers to past events..Do experiment and that's it.
- Bayesians' probability refers to future events..Do update !
As Bayesians, we start with a belief, called a prior. Then we obtain some data and use it to update our belief. The outcome is called a posterior. Should we obtain even more data, the old posterior becomes a new prior and the cycle repeats. It's very honest. We cannot 100% rely on the experiment result. There is always a discrepency and there is no guarantee that the relative frequency of an event will match the true underlying probability of the event. That’s why we are approximating the probability by the long-run relative frequency in Bayesian. It's like calibrating your frequentist's subjective belief
P( θ | Data ) = P( Data | θ ) * P( θ ) / P( data )
P( θ )
is a prior, our belief of what the model parameters might be.- Prior is a weigth or regularizor.
- The final inference should converge to probable
θ
as long as it’s not zero in the prior. - Two aspects of your prior selection:
-
Subjective: Informative Prior ... your belief based Prior
- conjugate prior
- a class of distributions that present the same parametric form of the likelihood and their choice is frequently related to mathematical convenience and the likelihood.
- conjugate prior
-
Objective: Non-Informative (vague) Prior when there is no information about the problem at hand.
-
Flat prior
- Uniform, Normal with huge variance, etc. The use of a flat prior typically yields results which are not too different from conventional statistical analysis.
-
Improper prior
- It, in their parametric space, does not integrate to 1. For instance, in some cases Jeffery's priors are improper, but the posterior distribution is proper.
- Jeffery's prior is proportional to the Fisher Information, which is the expected value of the second derivative of the log-likelihood function with respect to the parameter. Although it is non-informative, improper prior, the Fisher Information quantifies the variability of the parameter based on the available data. That is, the higher the value of the Fisher Information, the more concave is the log-likelihood, thus evidencing that the data helps to estimate the quantity of interest.
- *He argues that any "non-informative prior" should be invariant to the parameterization(transformation) that we are using. If we create a
prior
that is proportional to theSqrt(FisherInf)
then theprior
is invariant to the parameterization used.
- *He argues that any "non-informative prior" should be invariant to the parameterization(transformation) that we are using. If we create a
-
Non-conjugate prior
- When the posterior distribution does not appear as a distribution that we can simulate or integrate.
- It makes the posterior to have an Open-form, but Metropolis-Hasting of MCMC solves the problem.
-
-
- why a paricular prior was chosen?
- The reality is that many of these prior distributions are making assumptions about the
type of data
we have. - There are some distributions used again and again, but the others are special cases of these dozen or can be created through a clever combination of two or three of these simpler distributions. A prior is employed because the assumptions of the prior match what we know about the parameter generation process. *Actually, there are multiple effective priors for a particular problem. A particular prior is chosen as some combination of
analytic tractability
+computationally efficiency
, which makes other recognizable distributions when combined with popular likelihood functions. - Examplary Prior Distributions
-
Uniform Prior
Beta(1,1)
= Unif(0,1)- Whether you use this one in its continuous case or its discrete case, it is used for the same thing:
You have a set of events that are equally likely
.- ex) see "binomial likelihood" case. Unif(0,1) says
θ
can be any value (ranging from 0 to 1) for any X.
- ex) see "binomial likelihood" case. Unif(0,1) says
- Note, the uniform distribution from ∞ to −∞ is not a probability distribution.
- Need to give lower and upper bounds for our values.
- Not used as often as you’d think, since its rare we want hard boundaries on our values.
-
Gaussian Prior
- Taking a center and spread as arguments, it states that 67% of your data is within 1SD of the center, and 95% is within 2SD.
- No need to check our value boundaries.
- coming up a lot because if you have multiple signals that come from any distribution (with enough signals), their average always converges to the normal distribution.
hist(np.array([np.mean(your_distribution) for i in range(your_samples)]))
.
- Taking a center and spread as arguments, it states that 67% of your data is within 1SD of the center, and 95% is within 2SD.
-
Beta Prior [0,1]
-
Gamma Prior [0,∞]
- It comes up all over the place. The intuition for the gamma is that it is the prior on positive real numbers.
- Now there are many ways to get a distribution over positive numbers.
- take the
absolute-value of a normal distribution
and get what’s called a Half-Normal distribution. - take the
exp(Y)
andY^2
...Log-Normal, and χ-square.
- take the
- Now there are many ways to get a distribution over positive numbers.
- So why use the gamma prior?
- If you use a Log-Normal, you are implicitly saying that you expect the log of your variable is symmetric.
- If you use a χ-square, you are implicitly saying that your variable is the sum of k?-squared factors, where each factor came from the normal(0, 1) distribution.
- Some people suggest using gamma because it is conjugate with lots of distributions. so it makes performing a computation easier...but it would be better to have your priors actually encode what you believe.
- When gamma is a used as the prior to something like normal, the posterior of this distribution also is a gamma.
- The gamma distribution is the main way to encode something to be a postive number. Actually many distributions can be built from gamma.
- It’s parameters
shape
(k) andscale
(θ) roughly let you tune gamma like the normal distribution. kθ specifies the mean, and kθ^2 specifies the variance. - Taking the reciprocal of a variable from the gamma gives you a value from the Inv-gamma distribution.
- If we normalize this positive number, we get the Beta distribution.
def beta(a, b): def samples(s): x = r.gamma(a, 1, s) y = r.gamma(b, 1, s) return x/(x + y) return(samples)
- If we want to a prior on "categorical", which takes as an argument a list of numbers that sum to 1, we can use a gamma to generate k-numbers and then normalize. This is precisely the definition of the Dirichlet distribution.
- It’s parameters
- It comes up all over the place. The intuition for the gamma is that it is the prior on positive real numbers.
-
Heavy-tailed Prior
- The major advantage of using a heavy-tail distribution is it’s more robust towards outliers (we cannot be too optimistic about how close a value stays near the mean..)..let's start to care outliers..
t-distribution
can be interpretted as the distribution over a sub-sampled population from the normal distribution sample. Since here our sample size is so small, atypical values can occur more often than they do in the general population. As our sub-population grows, the t-distribution becomes the normal distribution.- The t-distribution can also be generalized to not be centered at 0.
- The parameter
ν
lets you state how large you believe this subpopulation to be.
Laplace-distribution
as an interesting modification to the normal distribution(replacingexp(L2-norm)
withexp(L1-norm)
in the formula). A Laplace centered on 0 can be used to put a strong sparsity prior on a variable while leaving a heavy-tail for it if the value has strong support for another value.
-
- The reality is that many of these prior distributions are making assumptions about the
P( Data | θ )
is called likelihood of data given model parameters. The goal is to maximize the likelihood function probabilityL(x,x,x,x..|θ)
to choose the best θ.
- The formula for likelihood is model-specific.
- People often use likelihood for evaluation of models: a model that gives higher likelihood to real data is better.
- If one also takes the prior into account, then it’s maximum a posteriori estimation (MAP).
P(Data|θ)
xP(θ)
. What it means is that, the likelihood is now weighted with some weight coming from the prior. MLE and MAP are the same if the prior is uniform.
P( θ | Data )
, a posterior, is what we’re after. It’s a parametrized distribution over model parameters obtained from prior beliefs and data. The goal is to maximize the posterior probabilityL(x,x,x,x..|θ)*P(θ)
that is thevalue x Distribution
to choose the best θ.
- we assume the model - Joint:
P(θ, Data)
which isP(Data|θ)
xP(θ)
- MAP can unlike MLE, avoid overfitting. MAP gives you the
L2 Regularization
term. - But we still anyhow prefer to obtain Full Distribution rather than just point estimate. We want to address the uncertainty.
- They are similar, as they compute a single estimate, instead of a full distribution.
- "Inference" refers to how you learn parameters of your model. Unlike MLE and MAP, Bayesian inference means that it fully calculates the posterior probability distribution, hence the output is not a
single value
but apdf or pmf
. - It's complex since we now have to deal with the Evidence(with the integral computation). But if we are allowed to use conjugation method, we can do Bayesian inference since it's easy. However, it’s not always the case in real-world applications. We then need to use MCMC or other algorithms as a substitute for the direct integral computation.
- There are three main flavours:
- 0. Conjugation method
- Find a conjugate prior(very clear) based on the given likelihood then compute posterior using math!
- It simply implies the integral of the
joint
is a closed form!
- 1. MCMC: a gold standard, but slow. We still need a prior? Yes! even fake prior because we still need the
joint
!- It implies the integral of the
joint
is an open form! - Obtain a posterior by sampling from the "Envelop".
- It implies the integral of the
- 2. Variational inference: faster but less accurate. Its drawback is that it’s model-specific..(use when likelihood & prior is
clear
)- It implies the integral of the
joint
is an open form! - Obtain a posterior by "appropriating other distribution".
- It implies the integral of the
- 0. Conjugation method
- If you have a truly infinite computational budget, MCMC should give more accurate solution than Variational Inference that trades some accuracy for speed. With a finite budget (say 1 year of computation), Variational Inference can be more accurate for very large models, but if the budget is large enough MCMC should give a better solution for any model of reasonable size.
Variational inference seeks to approximate the true posterior with an approximate variational distribution, which we can calculate more easily. The difference of EM-algorithm and Variational-Inference is the kind of results they provide; EM is just a point while VI is a distribution
. However, they also have similarities. EM and VI can both be interpreted as minimizing some sort of distance between the true value and our estimate.
- For EM: which is the
Maximum-Likelihood
- ...assign fake param -> develop soft fake villages -> calculate weighted param from the village -> obtain the MLE value of villages by developing new soft fake villages again based on the weighted param -> Repeat from the start until the MLE value gets to the maximum...Now, finally you obtain the best param.
- For VI: which is the
Kullback-Leibler divergence
.
The term variational comes from the field of variational calculus. Variational calculus is just calculus over functionals instead of functions. Functionals are just a function of function(inputs a function and outputs a value). For example, the KL-divergence are functionals. The variational inference algorithms are simply optimizing functionals which is how they got the name "variational Bayes".
- We have perfect likelihood and prior. But we don't have Evidence. So the un-normalized posterior(joint) is always the starting point.
- The main idea behind variational methods is to pick a fake? posterior
q(z)
as a family of distributions over thelatent variables
with its own variational parameters. Go with the exponential family in general? This is your FINGERS!!!! - Then,find the setting of the best parameters that makes
q(z)
close to the posterior of interest. Useq(z)
with the fitted parameters as a proxy for the posterior to predict about future data or to investigate the posterior distribution of the hidden variables (Typically, the true posterior is not in the variational family). - Typically, in the true posterior distribution, the latent variables are not independent given the data, but if we restrict our family of variational distributions to a distribution that
factorizes over each variable in Z
(this is called a mean field approximation), our problem becomes a lot easier. - We can easily pick each variational distribution(V_i) when measured by Kullback Leibler (KL) divergence because we compare this
Q(Z)
with ourun-normalized posterior
that we already have (KL divergence formula has a sum of terms involving V, which we can minimize...So the estimation procedure turns into an optimization problem). Once we arrive at the bestV*
, we can useQ(Z|V*)
as our best guess at the posterior.
- Step_01: Select the family distribution Q called a "variational family": a pool of Q
- Step_02: Try to approximate the full posterior
P*(z)
with some variational distributionQ(z)
by searching the best matching distribution, minimizing "KL-divergence" value.- minimizing KL-divergence value(E[log Q over P]) between
Q(z)
andP*(z)
- minimizing KL-divergence value(E[log Q over P]) between
- Kullback Leibler-Divergence measures the difference(distance) b/w two distributions, so we minimize this value between your variational distribution choice and the un-normalized posterior (not differ from normalized real posterior...coz the evidence would become a constant...in the end.)
If you additionally require that the variational distribution factors completely over your parameters, then this is called the variational mean-field approximation.
- Step_01: Select the family distribution Q called a "variational family" by product of
Q(z1)
,Q(z2)
,...where z is the latent variable. - Step_02: Try to approximate the full posterior
P*(z)
with some variational distributionQ(z)
by searching the best matching distribution, minimizing "KL-divergence" value.- minimizing KL-divergence value(E[log Q over P*]) between
Q(z)
andP*(z)
- minimizing KL-divergence value(E[log Q over P*]) between
Evidence is discussed in the process of inference (not in the prediction...?) Bayesian methods are appealing for prediction problems thanks to their ability to naturally incorporate both sample variability
and parameter uncertainty
into a predictive distribution. Let's train data points X and Y. We want predict the new Y at the end. In Bayesian Prediction, the predicted value is a weighted average of output of our model for all possible values of parameters.
Alternative perspective on the prediction method is Bayesian Prediction with Copulas. Handling data arriving in real time requires a flexible non-parametric model, and the Monte Carlo methods necessary to evaluate the predictive distribution in such cases can be too expensive to rerun each time new data arrives. With respect to this, Bayesian Prediction with Copulas' approach facilitates the prediction without computing a posterior
.
-
However, in cases where it is not possible to work directly with the posterior, this natural Bayesian updating formula is out of reach.
-
In our context of estimating the predictive distribution in real time, it is not possible to look at the entire dataset all at once, thus we seek the flexibility of a non-parametric model, largely to avoid potential model misspecification. That is, it is necessary to start with a sufficiently flexible model that can adapt to the shape of the distribution as they arrive. In these non-parametric cases, θ is not a finite-dimensional parameter, but it is an infinite-dimensional index - - of the distribution clusters(Gaussian, Beta, whatever...)that explaining the dataset. The most common strategy, in the present context of modelling densities, is the so-called Dirichlet process mixture model. The problem is that given the posterior based on the full data, when new data arrives, the MCMC must be rerun on the full data to get the posterior or the predictive density . This can be prohibitively slow, thereby motivating a fast recursive approximation.
-
To circumvent the aforementioned computational difficulties in Bayesian updating in the predictive models, we turn to a new strategy: A Recursive Approximation with Copulas. A Copula as a mathematical object captures the joint behavior of two different Random Variables, each of which follows different distribution, and returns a single bivariate distribution formula. Sklar theorem implies that there exists a symmetric copula density such that That is, for each Bayesian model, there exists a unique sequence {} of copula densities. This representation reveals that it is possible to directly and recursively update the predictive distribution without help of MCMC. It has the advantage of directly estimating the predictive density and does not require numerical integration to compute normalising constants.
For a Dirichlet process mixture model, with
Gaussian kernel
- N(x|u, 1) - andDP sample prior
- where .The
Gaussian Copula Density
is In particular, we consider the following recursive sequence of predictive densities But it's too complicate... On the CDF distribution function scale, the algorithm is a bit more transparent, that is, The take-away message is that there exists a recursive update of the predictive density in the Dirichlet process mixture model formulation, characterised by a copula density. (???really???) -
The choice of
ρ
is entirely up to the discretion of the researcher, with values closer to 1 corresponding to less smoothing (ρ=0.90 is a reasonable choice?). For the weights, a choice like =(i+1)^-r for r ∈ (0.5, 1]...asi
grows,α
decreases (r=1 as a default choice?). In choosing the initial guess of , try to capture the support of given dataset distribution? Since this predictive function is not sure (if there is little or no data to use as a guide), we go with some kernel density? such as t-distribution?? But we totally ignore DP prior or kernel likelihood????????
In parametric method, we define a model that depends on some parameter "theta" and then we find optimal values for "theta" by taking MLE, or MAP. And as data becomes more and more complex, we need to add more and more parameters(think about LM's coefficients, linear? polynomial?) so we can say the number of parameters are fixed.
- Fixed number of parameters => so the complexity is limited.
- Fast Inference coz you just simply feed the weights then the prediction would be just the scalar multiplication.
- But training is complicated and takes time.
In Non-parametric method, the number of parameters depend on the dataset size. That is, as the number of data points increases, the decision boundary becomes more and more complex.
- Not Fixed number of parameters => so the complexity is arbitrary.
- Slow Inference coz you have to process all the data points to make a prediction.
- But training is simple coz it in most cases just remembers all points .
[Parametric]
- A. Bayesian Network as PGM
- Bayesian Network is "Directed" and "Acyclic". It cannot have interdependent variables.
In the settings where data is scarce and precious and hard to obtain, it is difficult to conduct a large-scale controlled experiment, thus we cannot spare any effort to make the best use of available input. With small data, it is important to **quantify uncertainty**
and that’s precisely what Bayesian approach is good at. In Bayesian Modeling, there are two main flavours:
- B. Statistical Modeling:
- Multilevel/Hierarchical Modeling(Regression?)
- C. probabilistic Machine Learning approach: using data for a computer to learn automatically from it. It outputs probabilistic predictions...that's why probabilistic.. also these probabilities are only statements of belief from a classifier.
- 1) Generative modeling: One can sample or generate examples from it. Compare with classifiers(discriminative model to model
P(y|x)
to discriminate between classes based on x), a generative model is concerned with joint distributionP(y,x)
. It’s more difficult to estimate that distribution, but it allows sampling and of course one can getP(y|x)
fromP(y,x)
.- LDA: You start with a matrix where
rows
are documents,columns
are words andeach element
is a count of a given word in a given document. LDA “factorizes” this matrix of size n x d into two matrices, documents/topics (n x k) and topics/words (k x d). you can’t multiply those two matrices to get the original, but since the appropriate rows/columns sum to one, you can “generate” a document.
- LDA: You start with a matrix where
- 1) Generative modeling: One can sample or generate examples from it. Compare with classifiers(discriminative model to model
[Non-Parametric]
- A. Bayesian non-parametrics Modeling: the number of parameters in a model can grow as more data become available. This is similar to SVM, for example, where the algorithm chooses support vectors from the training points. Nonparametrics include Hierarchical Dirichlet Process version of LDA(where the number of topics chooses itself automatically), and Gaussian Processes.
-
1) Gaussian Processes: It is somewhat similar to SVM - both use kernels and have similar scalability(which has been vastly improved throughout the years by using approximations).
- A natural formulation for GP is
regression
, with classification as an afterthought. For SVM, it’s the other way around. - As most "normal" methods provide point estimates, "Bayesian" counterparts like GP also output uncertainty estimates while SVM are not.
- Even a sophisticated method like GP normally operates on an assumption of homoscedasticity, that is, "uniform noise" levels. In reality, noise might differ across input space (be heteroscedastic).
- GP outputs a mean curve and CI(cov) curves.
- A natural formulation for GP is
-
2) Dirichlet Process: The infinite-dimensional generalization of the Dirichlet distribution is the Dirichlet process. In short, the Dirichlet Process is a generalization of Dirichlet distributions where a sample from DP generates a Dirichlet distribution. Interestingly, the generalization allows the Dirichlet Process to have an infinite number of components (or clusters), which means that there is no limit on the number of Hyper-parameters. Using DP, we sample proportion of each element in a vector or multinomial random variable from the undefined dimension that can go to infinity.
-
-
a) Frequentist LM
- typically go through the process of checking the 1.
residuals against a set of assumptions
, 2.adjusting/selecting features
, 3.rerunning the model
, 4.checking the assumptions again
....- Frequentist diagnose is based on the
fitted model
using MLE of the model parameters.- "likelihood":
f(x|β)
- "likelihood function":
L(x,x,x,x|β)
by fitting a distribution to the certain data so...producting them, then differentiating to get the bestβ
. But the result is just a point estimate(also subject to the overfitting issue)...it cannot addressUncertainty
! - subject to overfitting!
- "likelihood":
- Frequentist diagnose is based on the
- typically go through the process of checking the 1.
-
b) Bayesian Hierarchical LM
- It allows a useful mechanism to deal with insufficient data, or poorly distributed data. If we have fewer data points, the posterior distribution will be more spread out. As the amount of data points increases, the likelihood washes out the prior.
- It puts a prior on the coeffients and on the noise so that in the absence of data, the priors can take over !
- Once fitting it to our data, we can ask:
- What is the estimated
linear relationship
- what is the confidence on that relation, and the full posterior distribution on that relation?
- What is the estimated
noise
and the full posterior distribution on that noise? - What is the estimated
gradient
and the full posterior distribution on that gradient?
- What is the estimated
-
Posterior Computation by Bayesian Inference: How to avoid computing the Evidence?
- A> When we want to get the model parameter, the Evidence is always a trouble. There is a way to avoid
computing the **Evidence**
. In MAP, we don't need the "Evidence". But the problem is that we cannot use its result as a prior for the next step since the output is a single point estimate.- Below is MAP for LM parameter vector
w
. - The result says it's the traditional MLE value +
L2 regularization
term (because of the prior) that fix overfitting. - But it still does not have any representation of Uncertainty!
- Below is MAP for LM parameter vector
-
B> There is another way to avoid
computing the **Evidence**
- Use Conjugate prior. We can, but do not need to compute the Evidence.-
Conjugate
Prior
as a member of certain family distributions, is conjugate to alikelihood
if the resulting posterior is also the member of the same family.- Discrete Likelihood
Beta prior
is conjugate to Bernoulli likelihood. (so Bernoulli model? then choose Beta)Beta prior
is conjugate to Binomial likelihood. (so Binomial model? then choose Beta)Dirichlet prior
is conjugate to Muiltinomial likelihood. (so Multinomial model? then choose Dirichlet)Gamma prior
is conjugate to Poisson likelihood. (so Possion model? then choose Gamma)Beta prior
is conjugate to Geometric likelihood. (so Geometric model? then choose Beta)
- Continous Likelihood
Gaussian prior
is conjugate to Gaussian likelihood + known SD.Inverse Gamma prior
is conjugate to Gaussian likelihood + Known μ.Pareto prior
is conjugate to Uniform likelihood.Gamma prior
is conjugate to Pareto likelihood.Gamma prior
is conjugate to Exponential likelihood.
- If the likelihood is a member of Exponential-family, it always guarantees the presence of the conjugate prior.
- Discrete Likelihood
-
Gaussian Prior for Gaussian likelihood + known SD
- Now we can take advantage of having access to the full posterior distribution of the model parameter(Coefficient): we can either obtain a point estimator from this distribution (e.g. posterior mean, posterior median, ...) or conduct the same analysis using this estimate...now we can say
Uncertainty
. - Check the goodness of fit of the estimated model based on the predictive residuals. It is possible to conduct the same type of diagnose analysis of Frequentist's LM.
-
-
C> To approximate the posterior, we use the technique of drawing random samples from a posterior distribution as one application of Monte Carlo methods.
-
- Specify a prior
π(β)
.
- Specify a prior
-
- Create a model mapping the training inputs to the training outputs.
-
- Have a MCMC algorithm draw samples from the posterior distributions for the parameters.
-
- A> When we want to get the model parameter, the Evidence is always a trouble. There is a way to avoid
- What do you want the model to do well at?
- How both
regularizing priors
, andinformation criteria
help you improve and estimate the "out-of-sample"(yet-to-be-observed) deviance of a model ?- deviance: approximation of relative distance from "perfect accuracy".
What is "information"? How much we have learned? It refers to the reduction in uncertainty when we learn an outcome.
How to measure uncertainty? There is only one function: Information Entropy
.
- The uncertainty contained in a probability distribution can be expressed as: log-probability of an event
E[ log(p) ]
p <- c(0.3, 0.7)
H <- -sum( p*log(p) )
It gives...0.61: it's quite uncertain... High Entropy..Chaos..high disorder..very big uncertainty!
p <- c(0.01, 0.99)
H <- -sum( p*log(p) )
It gives...0.06: That's nice! it's quite certain... Low Entropy..low disorder..very small uncertainty!
How to use Information Entropy to say how far your model is from the target model? The key lies in: Kullback-Leibler Divergence
.
- Suppose there is a true distribution (with
p1, p2,..
), but we only have a slightly different distribution (withq1, q2,..
) to describe the true distribution. How much additional uncertainty we might introduce as a consequence? - The additional uncertainty introduced from using the distribution in our hand can be expressed as:
E[ log(p)-log(q) ]
, but there is a catch! You need to use "cross entropy"..because you are trying to findp
, usingq
.
Since predictive models specify probabilities of events(obv), We can use KL Divergence to compare the accuracy of models.
Then How to estimate the divergence? There is "no way" to access the target p
directly. Luckily, we simply compare the divergences of different candidates - r
vs q
-, using 'deviance' (model fit measure). But we need to know E[log(r)]
and E[log(q)]
, which are exactly like what you've been using in MLE. Hence, summing the log probabilities of (x
,r
) or (x
,q
) gives an approximation of E[log(r)]
or E[log(q)]
, but we don't have to know the real parameter p
inside the expectation terms. So we can compare E[log(r)]
VS E[log(q)]
to get an estimate of the relative distance of each model from the target. (Having said that, however, the absolute magnitude of E[log(r)]
or E[log(q)]
cannot be known, we do not know they are good model or bad model. Only the difference E[log(r)] - E[log(q)]
informs about divergence from the target p
.
SUM( log(pdf) )
(total log probability score) is the gold standard way to compare the predictive accuracy of different models. It is an estimate of the cross entropy: E[log(pdf)]
w/o multiplying the probability term. To compute this, we need the full posterior distribution because in order to get log(pdf)
, we need to find log( E[probability of obv] )
where the E[.]
is taken over the full posterior distribution of θ
...This is called total "Log-point wise predictive density" a.k.a "total log probability score".
Deviance
is simply the "log pointwise predictive density" multiplied by - 2
. The model with a "smaller deviance value" is expected to show higher accuracy.
However, just like R^2,...it's a measure of retrodictive accuracy rather than predictive accuracy. It always improves as the model gets complex. So they are absurd!
Then...what predictive criteria are available?
- AIC (Akaike IC) is shit coz...
-
- prior should be flat
-
- posterior should follow Gaussian
-
- sample size should be greater than the No.of parameters
-
- DIC (Deviance IC)
-
- Ok.
-
- posterior should follow Gaussian
-
- sample size should be greater than the No.of parameters
-
- WAIC (Widely Applicable IC)
- No Assumptions....OK?
WAIC
is simply the "log pointwise predictive density" plus a "penalty proportional to the variance" in the prediction.
The penalty term means the summation of the variance in log probability(likelihood) for each obv. Each obv has its own penalty score that measures overfitting risk (we are assessing overfitting risk at th level of each obv). FYI, this penalty term, in the analogy of AIC, is the number of parameters.
Q. Which observed data point contribute to overfitting the most?
Q. WAIC computation on train/test gives different value because...the sample size
scales the deviance. It is the distance b/w models that is useful, not the absolute value of the deviance.
When there are several plausible (and hopefully un-confounded) models for the same set of observations, how should we compare the accuracy of these models? Following the fit to the sample is no good, because fit will always favor more complex models. Information divergence is the right measure of model accuracy, but even it will just lead us to choose more and more complex and wrong models. We need to somehow evaluate models out-of-sample. How can we do that? A meta-model of forecasting tells us two important things.
- First, flat priors produce bad predictions. Regularizing priors—priors which are skeptical of extreme parameter values—reduce fit to sample but tend to improve predictive accuracy.
- Second, we can get a useful guess of predictive accuracy with the criteria: CV, Pareto Smoothed Importanmt Sampling-CV, and WAIC.
Regularizing priors and CV / PSIS-CV / WAIC are complementary. Regularization reduces overfitting, and predictive criteria measure overfitting... ....T.B.D