Japan covid-19 correlation matrix analysis
Stage 1
The dataset is obtained from kaggle
This is a covid-19 dataset in Japan. The dataset include the following.
- Total number of cases in Japan
- The number of cases at prefecture level
- Metadata of each prefecture
There are attributes for respective dataset. The variables are the following.
Date
: YYYY-MM-DDPrefecture
: charPositive
: intTested
: intDischarged
: intFatal
: intHosp_required
: intHosp_severe
: int
Additionally, I have added one variable Active_Case
to track the actual positive cases.
Active_Case
= Positive
- Discharged
There is no other missing data or errors (NA etc) in this dataset.
Basic Distribution Graph
Stage 2
Compute Central Tendency and Variability Measures
In our dataset, the variable Positive
of the latest date is picked.
As per boxplots/hisogram of this variable, the following factors are discovered.
-
Mode
: It is extremely higher (5.2k)and derivative scalar in the graph. This is due to the fact that Tokyo has so much higher infection rate and population. -
Median
: This was (161) inHyogo Prefecture
. -
Mean
: The average positive cases is (434.4) among 47 prefectures. This is higher thanMedian
value '161'. It can be inferred that there are scalar valuesPositive
with significant higher values which gives a skew to thisMean
andMedian
values in comparison.
Central tendency
Stage 3
Compute correlation matrix.
Correlation Analysis
Correlated variables are the following
Fatal
andHosp_require
Positive
andPopulation
Fatal
and Hosp_require
are correlated since the more people are in severe infected condition, the fatality cases are likely to increase.
Positive
and Population
are also correlated because the more population means higher probablity of infection.
However, it is observable that Hosp_severe
is only correlated with Fatal
and shares lower correlation rate with other variables.
Presumably, Severe cases on covid-19 is due to the fact that although many population can be infected, only minority number of the infected experience severe cases because of variant factors such as their age, pre-medical condition and limited treatment in the hospital in country-side area.
Therefore, We can conclude that Hosp_severe
is indepent variable except with Fatal
.
We also conclude for dependent variable in this dataset is Positive
and the rest are more or less dependent variables except Hosp_severe
Stage 5
After multiple linear regression analysis, Positive
is defined as dependent variable.
However, straight regression line is not able to calculate the patterns which in the dataset is underfitting. In order to resolve this issue, the complexity of the model must be increased which the original features of the model are converted into the higher order polynomial terms. This operation is possible with the help of scikit-learn
module.
The model is trained with 70-30 ratio.
According to the different types of regression models used for the analysis, the R-squared gets the value around 0.80~. This almost represents a model that explains all of the variation in the response variable around its mean.
Various analysis are conducted by using Linear and Polynomial Regression Models.
The x variable is Positive
and Date
(Dataframe Date's Index Values)
Caution for the Date
, it is converted into the number of days passed. (Integer)
X = pd.DataFrame(np.c_[df.Positive], columns=['Infect Toll'])
Y = df.index.values
Linear Regression coefficient was calculated as 0.00162564
RMSE, R2's value and MAPE are calculated respectively.
Root Mean Square Error is higher in Linear Model since there's considerable deviation in predicted values. Mean Absolute Percentage Error is significantly lower in the Polynomial Model because of the dataset with 80% pf R-squared percentage.
The Linear Model's performance for training:
- RMSE : 35.755953979357685
- R2 : 0.8491951960374975
The Linear Model's performance for testing:
- RMSE : 31.36388399092678
- R2 : 0.8816123287196085
- MAPE: 39.87470903917527
The Polynomial Model's performance for training:
- RMSE : 21.040343639302094
- R2 : 0.9477815358294585
The Polynomial Model's performatnce for testing:
- RMSE :17.514006809561216
- R2 :0.9630837287177351
- MAPE: : 18.110589685632135
Redisual Plot
Residuals in regression models are basically the difference between the observed value of the target variable Positive
and the predicted value Date
in this case.
The Linear Model's Performance on Redisual Compulation
The Polynomial Model's Performance on Redisual Compulation
Deviation between Training and Testing in Residuals Plot