We are often interested in quantifying the uncertainty of some event. The most common tool for this is classical probability, which assigns to each event a number between 0 and 1: 0 for an impossible event and 1 for a certain one.
Some of the best-known applications of probability are:
- Prediction in a Classification Model (Logistic Regression, Decision Tree)
- Outlier detection
- Measurement of an event of interest
- Greater interpretability of some observed variable
For example, suppose we have the histogram of the heights of the employees of a certain company:
We may want to determine whether a certain value is an outlier. Under a normal distribution, one way to decide is to check whether the probability of observing a value at least that extreme is small enough, that is, whether the value lies far enough from the center of the data mass.
In the normal distribution, a value more than three standard deviations above or below the mean can be considered an outlier. This region has a probability of approximately 0.27%, which is very small indeed. The figure below illustrates the shape of the normal distribution in relation to its standard deviation and probabilities:
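The probability of falling more than a given number of standard deviations from the mean can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `prob_beyond` is introduced here for illustration only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)


def prob_beyond(k: float) -> float:
    """Two-sided probability of lying more than k standard deviations from the mean."""
    return 2.0 * (1.0 - z.cdf(k))


for k in (1, 2, 3):
    print(f"P(|Z| > {k}) = {prob_beyond(k):.4f}")
# P(|Z| > 1) = 0.3173
# P(|Z| > 2) = 0.0455
# P(|Z| > 3) = 0.0027
```

Because the normal distribution is symmetric, each tail beyond three standard deviations carries half of that probability, about 0.00135.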
We can fit the normal distribution to see if it is a reasonable choice:
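One way such a fit might be carried out is sketched below with the standard library. The `heights` list is made-up illustrative data, not the company's actual measurements; `NormalDist.from_samples` estimates the mean and the sample standard deviation from the data.

```python
from statistics import NormalDist

# Hypothetical heights in metres; replace with the real employee data.
heights = [1.62, 1.68, 1.70, 1.71, 1.73, 1.75, 1.76, 1.78, 1.80, 1.85]

# Estimate mu (sample mean) and sigma (sample standard deviation).
fitted = NormalDist.from_samples(heights)
print(f"mean = {fitted.mean:.3f}, sd = {fitted.stdev:.3f}")

# Rough sanity check: under a normal model, roughly 68% of the
# observations are expected within one standard deviation of the mean.
within_one_sd = sum(
    abs(h - fitted.mean) <= fitted.stdev for h in heights
) / len(heights)
print(f"share within one sd = {within_one_sd:.2f}")
```

With a sample this small the check is only indicative; overlaying the fitted density on the histogram, as in the figure, remains the more informative diagnostic.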
It looks like a reasonable fit. With a fitted distribution, we can draw several important conclusions and interpretations for this variable. Let us review some analysis possibilities:
- Check whether the observed value of 1.87 is an outlier or not
- Calculate events of interest such as
- Obtain confidence intervals
- Perform hypothesis tests
Let us calculate the probability of observing a value at least as extreme as 1.87. Because height is a continuous variable, this probability is the integral of the density over the desired interval, given by:
This is equivalent to calculating the following area:
As the probability of this event is greater than 0.0013 (the probability of falling more than three standard deviations above the mean), we can consider 1.87 to be a typical value of the dataset.
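The tail-probability check can be sketched in code. The fitted mean and standard deviation are not stated in the text, so `mu = 1.74` and `sigma = 0.07` below are placeholder assumptions, and the cutoff is taken to be the one-sided tail beyond three standard deviations.

```python
from statistics import NormalDist

# Assumed parameters of the fitted height distribution (placeholders).
mu, sigma = 1.74, 0.07
heights = NormalDist(mu=mu, sigma=sigma)

observed = 1.87

# P(X > observed): area under the density to the right of the observed value.
tail = 1.0 - heights.cdf(observed)

# Cutoff: probability of exceeding the mean by more than three standard deviations.
threshold = 1.0 - NormalDist().cdf(3.0)

is_outlier = tail < threshold
print(f"P(X > {observed}) = {tail:.4f}, cutoff = {threshold:.5f}, outlier = {is_outlier}")
```

Under these assumed parameters the tail probability is well above the cutoff, so 1.87 would not be flagged; with different fitted values the conclusion could change.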
Now, let us calculate some events of interest through probability:
These probabilities are calculated from the following areas, respectively:
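Each such area is a difference of cumulative distribution values. The sketch below uses three hypothetical events (below 1.60, between 1.70 and 1.80, above 1.90) and the same assumed parameters `mu = 1.74`, `sigma = 0.07`; none of these numbers come from the original analysis.

```python
from statistics import NormalDist

# Assumed fitted distribution (placeholder parameters).
heights = NormalDist(mu=1.74, sigma=0.07)


def prob_between(dist: NormalDist, low: float, high: float) -> float:
    """P(low < X < high): area under the density between low and high."""
    return dist.cdf(high) - dist.cdf(low)


p_below = heights.cdf(1.60)                      # P(X < 1.60), left tail
p_between = prob_between(heights, 1.70, 1.80)    # P(1.70 < X < 1.80)
p_above = 1.0 - heights.cdf(1.90)                # P(X > 1.90), right tail

print(f"P(X < 1.60)        = {p_below:.4f}")
print(f"P(1.70 < X < 1.80) = {p_between:.4f}")
print(f"P(X > 1.90)        = {p_above:.4f}")
```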
If the distribution is normal, we can obtain confidence intervals for its parameters. An interval estimate shows how much the estimates vary, which lets us draw more reliable conclusions from our results.
In our case, the confidence interval for the population mean is very narrow, which indicates a precise estimate.
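A confidence interval for the mean can be sketched as follows. This version uses the normal approximation (mean plus or minus a z-quantile times the standard error); for small samples a Student's t quantile would usually be preferred. The function name and the sample data are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    centre = fmean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return centre - z * standard_error, centre + z * standard_error


# Hypothetical heights in metres.
heights = [1.62, 1.68, 1.70, 1.71, 1.73, 1.75, 1.76, 1.78, 1.80, 1.85]
low, high = mean_confidence_interval(heights)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```

The interval narrows as the sample size grows, since the standard error shrinks with the square root of the number of observations.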
In other situations, a skewed probability distribution may fit the variable of interest better. To illustrate, imagine that we are studying employees' salaries at a company suspected of corruption. The histogram of this variable is given below:
The histogram suggests an asymmetric distribution. One option is the exponential distribution; the fit is given below:
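A minimal sketch of an exponential fit: the maximum-likelihood estimate of the rate parameter is the reciprocal of the sample mean. The `salaries` values below are invented for illustration, and the helper names are not from the original analysis.

```python
from math import exp
from statistics import fmean

# Hypothetical monthly salaries; replace with the real data.
salaries = [1200.0, 1500.0, 1800.0, 2100.0, 2600.0, 3400.0, 4800.0, 7500.0, 12000.0]

# Maximum-likelihood estimate of the exponential rate: 1 / sample mean.
rate = 1.0 / fmean(salaries)


def exponential_cdf(x: float, rate: float) -> float:
    """P(X <= x) for an exponential distribution with the given rate."""
    return 1.0 - exp(-rate * x) if x >= 0 else 0.0


def prob_above(x: float, rate: float) -> float:
    """P(X > x): right-tail probability, useful for flagging unusually high salaries."""
    return exp(-rate * x) if x >= 0 else 1.0


print(f"estimated rate = {rate:.6f}")
print(f"P(salary > 10000) = {prob_above(10000.0, rate):.4f}")
```

The exponential distribution places its mass starting at zero, so if salaries have a clear minimum, a shifted exponential or another right-skewed family (gamma, log-normal) may be worth comparing.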