limcheekin / r-flight-delay-prediction

Predict Flight Delay using R

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Predict Flight Delay using R

Use the Machine Learning Workflow to process and transform DOT data to create a prediction model. This model must predict whether a flight would arrive 15+ minutes after the scheduled arrival time with 70+% accuracy.

Download data

Download 2015 January raw data in csv file from the following URL: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

The downloaded data stored in raw_201501.csv.

Preparing data

Find out more from Steps of Data Preparation and it's corresponding source file.

Selecting algorithm

The selected initial algorithm is Logistic Regression as it is simple(easy to understand), fast(up to 100x faster) and stable to data changes.

Next, switched to Random Forest to improve prediction result.

Train model

  • Training the model using caret package

    install.packages('caret')
    library(caret)
  • Set the seed so that the random number generated in the same sequence to yield the same training results

    set.seed(12345)
  • Retrieve feature columns only

    featureColumns <- c('ARR_DEL15', 'DAY_OF_WEEK', 'CARRIER', 'DEST', 'ORIGIN', 'DEP_TIME_BLK')
    onTimeDataFiltered <- onTimeData[,featureColumns]
  • Retrieve 70% of data for training

    trainRows <- createDataPartition(onTimeDataFiltered$ARR_DEL15, p=0.7, list=FALSE)
    head(trainRows, 10)
    trainData <- onTimeDataFiltered[trainRows,]
  • Retrieve the remaining 30% of data for testing

    testData <- onTimeDataFiltered[-trainRows,]
  • Simple verification

    nrow(trainData)/(nrow(trainData) + nrow(testData))
    nrow(testData)/(nrow(trainData) + nrow(testData))
  • Train the model with train data

    logisticRegModel <- train(ARR_DEL15 ~ ., data=trainData, method="glm", family="binomial")
    Error in train.default(x, y, weights = w, ...) : 
    One or more factor levels in the outcome has no data: ''
    

About

Predict Flight Delay using R


Languages

Language:R 100.0%