seifip / hackathon-ml-kickstarter

HackerEarth Machine Learning Challenge #2: Kickstarter

HackerEarth Machine Learning Challenge #2

Predicting success on Kickstarter

Brief: https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-2/

Dataset: https://he-s3.s3.amazonaws.com/media/hackathon/machine-learning-challenge-2/funding-successful-projects/3149def2-5-datafiles.zip

Problem statement

Kickstarter is a community of more than 10 million people, made up of creative and tech enthusiasts who help bring creative projects to life. To date, members have contributed more than $3 billion to fund creative projects. A project can be almost anything: a device, a game, an app, a film, etc.

Kickstarter works on an all-or-nothing basis: if a project doesn't meet its goal, the project owner gets nothing. For example, if a project's goal is $500 and it raises only $499, the project is not a success.

Recently, Kickstarter released its public data repository so that researchers and enthusiasts like us can help answer one question: will a project get fully funded?

In this challenge, you have to predict if a project will get successfully funded or not.

Input data description

Three files are available for download: train.csv, test.csv and sample_submission.csv. The training data consists of sample projects from May 2009 to May 2015. The test data consists of projects from June 2015 to March 2017.

| Variable | Description |
| --- | --- |
| project_id | unique id of the project |
| name | name of the project |
| desc | description of the project |
| goal | the goal (amount) required for the project |
| keywords | keywords which describe the project |
| disable_communication | whether the project author has disabled the communication option with people donating to the project |
| country | country of the project author |
| currency | currency in which the goal (amount) is required |
| deadline | date by which the goal must be achieved (Unix time) |
| state_changed_at | time at which the project status changed; the status could be successful, failed, suspended, cancelled, etc. (Unix time) |
| created_at | time at which the project was posted on the website (Unix time) |
| launched_at | time at which the project went live on the website (Unix time) |
| backers_count | number of people who backed the project |
| final_status | whether the project got successfully funded (target variable: 1 or 0) |
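
The four time columns above are stored as Unix timestamps, so they need converting before any date-based work. A minimal sketch with pandas, assuming the values are in seconds (not milliseconds); the two-row `sample` frame and the `parse_timestamps` helper are made up for illustration and are not part of the repository:

```python
import pandas as pd

# Hypothetical two-row sample holding only the timestamp columns.
sample = pd.DataFrame({
    "created_at": [1241000000, 1433000000],
    "launched_at": [1241100000, 1433100000],
    "deadline": [1243692000, 1435692000],
    "state_changed_at": [1243692000, 1435692000],
})

TIME_COLS = ["deadline", "state_changed_at", "created_at", "launched_at"]


def parse_timestamps(df):
    """Return a copy with the Unix-second columns converted to datetimes."""
    out = df.copy()
    for col in TIME_COLS:
        # Assumption: timestamps are whole seconds since 1970-01-01 UTC.
        out[col] = pd.to_datetime(out[col], unit="s")
    return out


parsed = parse_timestamps(sample)
print(parsed.dtypes)
```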

Custom features

| Variable | Description |
| --- | --- |
| goal_usd | goal converted to USD using historic exchange rate |
| keywords_len | length of keywords |
| keywords_count | number of keywords |
| name_len | length of project name |
| name_count | number of words in project name |
| desc_len | length of description |
| desc_count | number of words in description |
| name_capitals | number of uppercase letters in project name |
| name_digits | number of digits in the project name |
| name_digits_any | boolean indicating whether there are any digits in project name |
| desc_digits | number of digits in description |
| launched_at_weekday | day of week when the campaign was launched |
| deadline_weekday | day of week of campaign deadline |
| campaign_len | campaign length |
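
Most of the features above can be derived with a few pandas string and datetime operations. The sketch below is one possible way to do it, not the notebook's actual code. It assumes that keywords are hyphen-separated, that only ASCII capitals are counted, that weekdays are numbered with Monday as 0, and that `campaign_len` is measured in days between launch and deadline. `goal_usd` is left out because it needs a table of historic exchange rates. The `add_custom_features` helper and the one-row `sample` are hypothetical.

```python
import pandas as pd


def add_custom_features(df):
    """Return a copy of df with text- and date-derived feature columns."""
    out = df.copy()
    name = out["name"].fillna("")
    desc = out["desc"].fillna("")
    keywords = out["keywords"].fillna("")

    out["keywords_len"] = keywords.str.len()
    # Assumption: keywords look like "my-first-album" (hyphen-separated).
    out["keywords_count"] = keywords.apply(
        lambda s: len([k for k in s.split("-") if k])
    )
    out["name_len"] = name.str.len()
    out["name_count"] = name.str.split().str.len()
    out["desc_len"] = desc.str.len()
    out["desc_count"] = desc.str.split().str.len()
    out["name_capitals"] = name.str.count(r"[A-Z]")
    out["name_digits"] = name.str.count(r"\d")
    out["name_digits_any"] = out["name_digits"] > 0
    out["desc_digits"] = desc.str.count(r"\d")

    launched = pd.to_datetime(out["launched_at"], unit="s")
    deadline = pd.to_datetime(out["deadline"], unit="s")
    out["launched_at_weekday"] = launched.dt.weekday  # Monday = 0
    out["deadline_weekday"] = deadline.dt.weekday
    # Assumption: campaign length is expressed in days.
    out["campaign_len"] = (deadline - launched).dt.total_seconds() / 86400
    return out


# Hypothetical project launched 2015-06-01 with a 30-day campaign.
sample = pd.DataFrame({
    "name": ["My 2nd Album"],
    "desc": ["Help fund 10 songs"],
    "keywords": ["my-2nd-album"],
    "launched_at": [1433116800],
    "deadline": [1433116800 + 30 * 86400],
})
features = add_custom_features(sample)
print(features.iloc[0])
```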

Solution

After experimenting with several algorithms (Naive Bayes, SVM) and selecting optimal features and parameters using KBest and GridSearchCV respectively, I settled on AdaBoost with the following features:

['campaign_len'
 ,'keywords_count'
 ,'keywords_len'
 ,'name_count'
 ,'name_capitals'
 ,'desc_digits'
 ,'name_digits_any'
 ]

This led to a prediction score of 0.67484 on the test set.
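
For readers who want to reproduce the general approach, here is a rough sketch of feature selection plus a parameter search around AdaBoost in scikit-learn. It is not the notebook's code: it combines `SelectKBest` and `GridSearchCV` in a single `Pipeline` for brevity, whereas the description above suggests they were used as separate steps. The data is random and synthetic, and the parameter grid, the `f_classif` scoring function and the `accuracy` metric are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

FEATURES = [
    "campaign_len", "keywords_count", "keywords_len", "name_count",
    "name_capitals", "desc_digits", "name_digits_any",
]

# Synthetic stand-in for the engineered training data (not real projects).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "campaign_len": rng.integers(1, 61, n),
    "keywords_count": rng.integers(1, 10, n),
    "keywords_len": rng.integers(5, 60, n),
    "name_count": rng.integers(1, 12, n),
    "name_capitals": rng.integers(0, 10, n),
    "desc_digits": rng.integers(0, 6, n),
    "name_digits_any": rng.integers(0, 2, n),
})
# Synthetic target: shorter campaigns succeed more often, plus label noise.
y = ((X["campaign_len"] < 30) ^ (rng.random(n) < 0.2)).astype(int)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", AdaBoostClassifier(random_state=0)),
])
# Assumed grid; the values actually searched in the notebook are not stated.
param_grid = {
    "select__k": [3, 5, "all"],
    "clf__n_estimators": [25, 50],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X[FEATURES], y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the selector inside the pipeline means the features are re-selected within each cross-validation fold, which avoids leaking information from the validation split into the selection step.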

Ideas for improvement

Seasonality could be explored further. I found that the day of the week on which a campaign starts or finishes plays no role in the outcome, but longer-term seasonal fluctuations are likely. Could there be a downturn in funding around summer, or an upturn around the time when people get bonuses?
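
A first look at seasonality could be as simple as comparing the share of successful projects per launch month. The sketch below assumes `launched_at` is in Unix seconds and `final_status` is 0 or 1, as in the data description. The `success_rate_by_month` helper and the four-row `sample` are hypothetical, and the sample is far too small to say anything about the real data.

```python
import pandas as pd


def success_rate_by_month(df):
    """Mean of final_status per calendar month of launch (1 = January)."""
    launched = pd.to_datetime(df["launched_at"], unit="s")
    month = launched.dt.month.rename("launch_month")
    return df.groupby(month)["final_status"].mean()


# Hypothetical sample: two January launches, two July launches.
sample = pd.DataFrame({
    "launched_at": [1421280000, 1421712000, 1404172800, 1404950400],
    "final_status": [1, 0, 0, 0],
})
rates = success_rate_by_month(sample)
print(rates)
```

On the full training set, the same grouping could be repeated per year to check whether any monthly pattern is stable over time rather than an artifact of a single year.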
