- In the following midterm project , I decided to use the Hotel booking demand data set from Kaggle, to predict whenever a customer's booking will be cancelled or not. I thought this was a really good data set to practice and learn a lot about ML techniques and challenges , as this dataset contained tons of features and clean data to work with , furthermore , the proposed Article was really helpful to understand how this dataset worked.
This data set contains a single file which compares various booking information between two hotels: a resort hotel(H1), and a city hotel(H2) , comprehending bookings due to arrive between July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Article
Notebook Description: -Reading the data set and Introduction -Exploratory Data Analysis with Pandas and -NumPy -Data Preparation using Sklearn -Selecting and Training a few Machine -Learning Models -Cross-Validation and Hyperparameter -Tuning using Sklearn
Description | Link |
---|---|
Notebook | Explanatory notebook with EDA and Training |
Hotel booking demand data set | csv file |
DockerFile commands | dockerfile |
Pipenv file | Pipfile |
Train.py | Training of final XGBoost model script |
Predict.py | Predict app |
predict_test.py | To predict customer cancelling probability |
Variable | Type | Description |
---|---|---|
ADR | Numeric | Average Daily Rate |
Adults | Integer | Number of adults |
Agent | Categorical | ID of the travel agency that made the bookinga |
ArrivalDateDayOfMonth | Integer | Day of the month of the arrival date |
ArrivalDateMonth | Categorical | Month of arrival date with 12 categories: “January” to “December” |
ArrivalDateWeekNumber | Integer | Week number of the arrival date |
ArrivalDateYear | Integer | Year of arrival date |
AssignedRoomType | Categorical | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
Babies | Integer | Number of babies |
BookingChanges | Integer | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
Children | Integer | Number of children |
Company | Categorical | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
Country | Categorical | Country of origin. |
CustomerType | Categorical | Type of booking, assuming one of four categories: |
DaysInWaitingList | Integer | Number of days the booking was in the waiting list before it was confirmed to the customer |
DepositType | Categorical | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: - No Deposit – no deposit was made; - Non Refund – a deposit was made in the value of the total stay cost; - Refundable – a deposit was made with a value under the total cost of stay. |
DistributionChannel | Categorical | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
IsCanceled | Categorical | Value indicating if the booking was canceled (1) or not (0) |
IsRepeatedGuest | Categorical | Value indicating if the booking name was from a repeated guest (1) or not (0) |
LeadTime | Integer | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
MarketSegment | Categorical | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
Meal | Categorical | Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner) |
PreviousBookingsNotCanceled | Integer | Number of previous bookings not cancelled by the customer prior to the current booking |
PreviousCancellations | Integer | Number of previous bookings that were cancelled by the customer prior to the current booking |
RequiredCardParkingSpaces | Integer | Number of car parking spaces required by the customer |
ReservationStatus | Categorical | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
ReservationStatusDate | Date | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
ReservedRoomType | Categorical | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
StaysInWeekendNights | Integer | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
StaysInWeekNights | Integer | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
TotalOfSpecialRequests | Integer | Number of special requests made by the customer (e.g. twin bed or high floor) |
Obviously there's a lot of features in this dataset, In this case ,I deleted some features I thought harmed the performance and generality of the model predictions , such as : Country
, ReservationStatus
,ReservationStatusDate
,AssignedRoomType
, among others.
With the main one being ReservationStatus
, this feature being the clear one that can cause data leakage , and the model may reach 100 % accurate predictions with it , causing over-fitting .
4 Types of Models were used in this project. Including :
Logistic Regression
Decision Trees
Random Forests
(Tuned and not Tuned)XGBoost
(Tuned and not Tuned)
Python version: Python 3.9.7
🐍
Versions/requirements used inside the virtual environment:
xgboost 1.5.0
scikit-learn 1.0
gunicorn 20.1.0
Flask 2.0.2
Before running this dockerbuild , please verify you got docker daemon running.
$ sudo systemctl start docker
OR:
$ sudo /etc/init.d/docker start
To build the docker image from this project , move inside the code
directory , and run the following command :
$ docker build -t zoomcamp_project .
Now, wait for it to install the pipenv dependencies, it should look like this:
Now run the docker build mapping the port 9696
to your host computer.
docker run -it --rm -p 9696:9696 zoomcamp_project
Inside another terminal , simply run:
python predict_test.py
You should see:
In this project , I used AWS Elastic Beanstalk to deploy my docker container to the cloud , I followed the steps described in zoomcamp week 5 .
To run this easily , you should simply uncomment the following lines from predict_test.py
host = 'hotel-serving-env.eba-sf8r37gc.us-east-1.elasticbeanstalk.com.'
url = f'http://{host}/predict'
#url = 'http://localhost:9696/predict'
Now, simply run
python predict_test.py
Inside the code Folder, you should see the following output: