Kaggle

The projects on Kaggle, using R or Python

This repository consists of three Kaggle topics:

  • Ames Housing Price Prediction
  • Titanic Survival Prediction
  • Tripadvisor Reviews Clustering

(i) Ames Housing Price Prediction

    1. Background
      Purpose: apply three different algorithms to the prediction task. This project covers data splitting, data cleaning, data exploration, feature engineering and model building.
    2. Data split
      Before building the model, we divide the data into a training set and a test set. The training part is used to build the predictive model, and the test part is used to check its accuracy.
    3. Data cleaning
      Raw data usually contains outliers and missing values that can hurt the accuracy of the model, so we deal with them before feature engineering.
    4. Feature engineering
      Before conducting feature engineering, we take a quick look at variable importance after data cleaning and pick out the variables we consider important.
    5. Prepare the data for the model
      First, drop highly correlated variables. Then normalize the numeric predictors. Factor variables are one-hot encoded with model.matrix(). After that, drop variables whose levels occur fewer than 10 times, because such sparse levels are unreliable in modeling.
    6. Log-transform the sale price
      The response variable (sale price) is skewed; taking its logarithm brings it close to a normal distribution (a short illustrative sketch follows this list).
    7. Build the model
    • XGBoost
      XGBoost takes the shortest time to build the prediction model and gives the best accuracy of the three.
    library(xgboost)
    # cross-validate to pick a good number of boosting rounds
    xgbcv <- xgb.cv(params = default_param, data = dtrain, nrounds = 1000,
                    nfold = 7, showsd = TRUE, stratified = TRUE,
                    print_every_n = 40, early_stopping_rounds = 10, maximize = FALSE)
    xgb_mod <- xgb.train(data = dtrain, params = default_param, nrounds = 611)
    XGBpred <- predict(xgb_mod, dtest)
    (figures: XGBoost modeling, XGBoost RMSE)
    • Lasso
    • GAM
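
The modeling code for this part is written in R, but as a rough, non-authoritative illustration of the preparation steps above (log-transforming the skewed sale price, one-hot encoding the factors, and splitting the data), the idea in Python looks roughly like this; the file name and the exact column handling are assumptions, not the project's actual code:

# illustrative sketch only -- the project's real pipeline for this step is in R
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

ames = pd.read_csv("ames.csv")                        # hypothetical file name
y = np.log1p(ames["SalePrice"])                       # log-transform the skewed target
X = pd.get_dummies(ames.drop(columns=["SalePrice"]))  # one-hot encode the factor columns
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)              # hold out a test set
# predictions made on the log scale are mapped back with np.expm1()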

(ii) Classification (Titanic data set)

In this project, I used the Titanic data set from Kaggle (https://www.kaggle.com/broaniki/titanic), which was the basis for the competition Titanic: Machine Learning from Disaster. I used classification methods to analyze the relationship between the predictors and the passengers' survival probability, and to choose the most significant predictors for the model.

  • Note: the steps I conducted below involve data cleaning, model building, evaluation and plotting...

  • 1. Data overview
    I used the function below to generate the train and test sets, and then selected the 'Survived' factor (1 or 0) to count how many females, males and children died or survived.

# predictors and target are assumed to be built from Kaggle's train.csv
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    predictors, target, test_size=0.10, random_state=0)
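
To get the survivor counts mentioned above, a simple cross-tabulation works; the child threshold of 18 below is my assumption, not something fixed by the project:

# count survivors (1) and casualties (0) by sex, and flag children by age
import pandas as pd
train = pd.read_csv("train.csv")                      # Kaggle's Titanic training file
train["IsChild"] = train["Age"] < 18                  # assumed cut-off for "children"
print(pd.crosstab(train["Sex"], train["Survived"]))
print(pd.crosstab(train["IsChild"], train["Survived"]))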
  • 2. Balance check
    We have to make sure that the training and test sets are balanced, i.e. have similar class proportions.
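
A quick way to check this, using the y_train and y_test produced by the split above, is to compare the class proportions (passing stratify=target to train_test_split would enforce equal proportions):

# compare the survived/died proportions of the two splits
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))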

  • 3. Logistic Regression
    Logistic regression is also a good algorithm for this prediction task. Its performance on the test data is again measured with the accuracy score and the confusion matrix. I chose ["Pclass","Sex","Age","Fare","Embarked"] as the predictors, then imported the methods below to build the logistic model and the ROC curve.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

# Sex and Embarked are assumed to be numerically encoded at this point
col_n = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
x_train = pd.DataFrame(train, columns=col_n)
x_val = pd.DataFrame(test, columns=col_n)
y_train = train["Survived"]

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
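
The block above only fits the model; a sketch of the evaluation described (accuracy score, confusion matrix and ROC curve) could look like the following, assuming labels y_val exist for x_val — Kaggle's test.csv itself ships without a Survived column, so in practice this is done on a labelled hold-out split:

# evaluate the fitted logistic model on a labelled validation set (assumed to exist)
from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = logreg.predict(x_val)                        # hard 0/1 predictions
y_prob = logreg.predict_proba(x_val)[:, 1]            # predicted survival probabilities
print(accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
fpr, tpr, _ = roc_curve(y_val, y_prob)                # points of the ROC curve
print(auc(fpr, tpr))                                  # area under the ROC curve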
  • 4. Linear Regression
    For linear regression, I used the columns ['Fare', 'Age'] and regressed Age on Fare. The libraries I used for modeling, measurement and plotting are shown below.
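
The original snippet is not reproduced in this README; a minimal sketch of such a regression, reusing the train frame from the data-overview step, might look like:

# sketch: regress Age on Fare and plot the fitted line
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = train[["Fare", "Age"]].dropna()                  # keep rows where both values exist
X, y = df[["Fare"]], df["Age"]
linreg = LinearRegression().fit(X, y)
print(mean_squared_error(y, linreg.predict(X)))       # in-sample error
plt.scatter(X, y, s=5)
plt.plot(X, linreg.predict(X), color="red")
plt.xlabel("Fare"); plt.ylabel("Age")
plt.show()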

  • 5. Decision Tree
    After feature engineering, I converted the features into numerically coded categories. Then I used the code below to import what I need and build a decision tree; here I chose ["Pclass","Sex"] as the two attributes for the model:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Pclass and Sex are assumed to be numerically encoded categories
col_n = ["Pclass", "Sex"]
x_train = pd.DataFrame(train, columns=col_n)
y_train = train['Survived']
decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
print(accuracy_score(y_train, decisiontree.predict(x_train)))   # training accuracy

Below is what the decision tree looks like:
(figure: decision tree)
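
A figure like that can be drawn, for example, with scikit-learn's plot_tree; this is one plausible way to produce it, not necessarily how the original image was generated:

# draw the fitted tree (requires matplotlib and scikit-learn >= 0.21)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(10, 6))
plot_tree(decisiontree, feature_names=col_n, class_names=["died", "survived"], filled=True)
plt.show()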

(iii) Clustering (Trip Advisor Reviews data set)

In this project, I used the Trip Advisor Reviews data set from (https://archive.ics.uci.edu/ml/datasets/Travel+Reviews), part of the UCI Machine Learning Repository hosted at the University of California, Irvine, which contains a large number of data sets used for testing data mining methods. I used multiple clustering methods on the 10 attributes to build and compare clustering models.

  • 1. Data overview
    I used data.describe() to take a look at the data. There are 980 instances in the whole data set, and there are no missing values in any attribute, which is a good start.
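
A quick overview along those lines, assuming the UCI file is saved locally as tripadvisor_review.csv:

# load the reviews and check size, summary statistics and missing values
import pandas as pd
reviews = pd.read_csv("tripadvisor_review.csv")       # assumed local copy of the UCI file
print(reviews.shape)                                  # 980 rows: one user id column plus 10 attributes
print(reviews.describe())
print(reviews.isnull().sum())                         # should show no missing values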

  • 2. K-means clustering
    In this case, I ran K-means for each K in the list 3, 5, 10. The results include the overall clustering score, the silhouette coefficient, the Calinski-Harabasz score, and a 2D plot of the clusters. I used the code below to load the libraries I need for the process.
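
A sketch of that loop, continuing from the loading step above; the id column name is an assumption:

# run K-means for each K and print the scores mentioned above
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = reviews.drop(columns=["User ID"])                 # keep only the 10 numeric attributes (id column name assumed)
for k in [3, 5, 10]:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    print(k, km.inertia_,                             # overall clustering score (within-cluster sum of squares)
          silhouette_score(X, km.labels_),
          calinski_harabasz_score(X, km.labels_))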

  • 3. Agglomerative clustering
    For comparison, K here should also be one of 3, 5, 10. This algorithm differs from K-means: it has no overall score for evaluation, but there are still other ways, including the 2D plot, to assess its performance. I used the code below to load the libraries I need for the process.
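
A matching sketch for the agglomerative case, reusing the X built for K-means above:

# agglomerative clustering for the same values of K, scored with the silhouette coefficient
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

for k in [3, 5, 10]:
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    print(k, silhouette_score(X, labels))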

  • 4. DBSCAN clustering
    The difference between DBSCAN and the two methods above is that DBSCAN does not need the number of clusters to be set in advance. Its two key parameters are eps and min_samples: a larger eps enlarges each point's neighbourhood, so clusters tend to merge and their number usually drops, while a smaller min_samples makes it easier to form dense regions, so the number of clusters tends to grow. To keep the comparison fair, I set these parameters so that the resulting number of clusters matches the 3, 5 or 10 used by the other two algorithms.
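
A minimal DBSCAN sketch; the eps and min_samples values below are placeholders to be tuned until the cluster count matches 3, 5 or 10, not the values actually used in the project:

# DBSCAN: tune eps / min_samples instead of fixing the number of clusters
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)            # placeholder parameter values
labels = db.labels_                                   # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)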
