rajeshidumalla / Classification-tree-to-the-housing-data-using-the-R-package-rpart

This project about the fitting a classification tree to the housing data sing R package rapart.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Classification tree to the housing data using the R package rpart

The aim of this project is to fit a classification tree to the housing data using the R package rpart. So, I begin by loading the data, and indicating which variables are categorical (note that some of them are ordinal variables, and we should treat them as such!)

library(rpart)
info=c("Y","sex","marital_status","age","education","occupation","annual_income","length",
"dual_incomes","size_household","nb_minors","householder_status","ethnic","language")
X=read.table("Housetype_Data.csv",sep=",",header=F)
Y=X[,1]
X=data.frame(X)
colnames(X)=info
X$Y<-as.factor(X$Y)
X$sex<-as.factor(X$sex)
X$marital_status<-as.factor(X$marital_status)
X$occupation<-as.factor(X$occupation)
X$dual_incomes<-as.factor(X$dual_incomes)
X$ethnic<-as.factor(X$ethnic)
X$householder_status<-as.factor(X$householder_status)
X$language<-as.factor(X$language)
#### The goal is too infer Y from X by fitting an optimal classification tree

Let's begin by analyzing the data: nb of missing values, spread, etc (just to get the feel of the data)

##### We begin by analyzing the data: nb of missing values, spread, etc
K= 5 ## number of possible classes for the housing
summary(X)

Screen Shot 2021-10-10 at 12 47 05 am

# grow tree
fit <- rpart(Y~ .,method="class", data=X,cp=0) ### Y vs X (the rest of the variables)
### This method does exactly what we want: it gives back a classification tree according to the formula given
### The complexity paramter
plotcp(fit) # visualize cross-validation results

Screen Shot 2021-10-10 at 12 48 29 am

### Prune the tree: select the CP that is 1MSE above the cp-value that minimizes the misclassification error
# get index of CP with lowest xerror
opt_cp <- which.min(fit$cptable[,'xerror'])
candidate_cp<- which(fit$cptable[,'xerror']>(fit$cptable[opt_cp,'xerror']+fit$cptable[opt_cp,'xstd']))
cp_mse_ind<- min(which(fit$cptable[,'xerror']<(fit$cptable[opt_cp,'xerror']+fit$cptable[opt_cp,'xstd'])))
cp_mse=fit$cptable[cp_mse_ind,'CP']
#get its value
#prune tree
pruned_model <- prune(fit,cp_mse)
### as input
printcp(pruned_model) # display the results in terms of complexity paramter
#plot tree
library(rpart.plot)

Warning: package  rpart.plot  was built under R version 3.3.2

rpart.plot(pruned_model, uniform=TRUE,main="Classification Tree for Household_Data")

Screen Shot 2021-10-10 at 12 49 49 am

We note that:

• class 2 (Condominium) and 4 (mobile home) are not predicted by our classification tree. This is due to the fact that these are relatively rare classes, which account for only respectively 7.5 and 1.8 % of the observations.

• The most important variable in our case if the householder status (own, rent, live with parent)

• The misclassification cross-validated error rate is roughly 0.409 x 0.636 = 0.2601

About

This project about the fitting a classification tree to the housing data sing R package rapart.


Languages

Language:R 100.0%