HackerEarth Machine Learning Challenge #2

Predicting success on Kickstarter

Brief: https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-2/

Dataset: https://he-s3.s3.amazonaws.com/media/hackathon/machine-learning-challenge-2/funding-successful-projects/3149def2-5-datafiles.zip

Problem statement

Kickstarter is a community of more than 10 million creative and tech enthusiasts who help bring creative projects to life. So far, members have contributed more than $3 billion to fund creative projects. A project can be almost anything: a device, a game, an app, a film, etc.

Kickstarter works on an all-or-nothing basis, i.e. if a project doesn't meet its goal, the project owner gets nothing. For example, if a project's goal is $500 and it raises only $499, the project is not a success.

Recently, Kickstarter released its public data repository to allow researchers and enthusiasts like us to help them solve a problem: will a project get fully funded?

In this challenge, you have to predict if a project will get successfully funded or not.

Input data description

There are three files to download: train.csv, test.csv and sample_submission.csv. The train data consists of sample projects from May 2009 to May 2015. The test data consists of projects from June 2015 to March 2017.

Variable	Description
project_id	unique id of project
name	name of the project
desc	description of project
goal	the goal (amount) required for the project
keywords	keywords which describe project
disable communication	whether the project author has disabled the option to communicate with people donating to the project
country	country of project author
currency	currency in which goal (amount) is required
deadline	date by which the goal must be reached (Unix timestamp)
state_changed_at	time at which the project status changed; status can be successful, failed, suspended, cancelled, etc. (Unix timestamp)
created_at	time at which the project was posted on the website (Unix timestamp)
launched_at	time at which the project went live on the website (Unix timestamp)
backers_count	no. of people who backed the project
final_status	whether the project got successfully funded (target variable – 1,0)

Custom features

Variable	Description
goal_usd	goal converted to USD using historic exchange rate
keywords_len	length of keywords
keywords_count	number of keywords
name_len	length of project name
name_count	number of words in project name
desc_len	length of description
desc_count	number of words in description
name_capitals	number of uppercase letters in project name
name_digits	number of digits in the project name
name_digits_any	boolean indicating whether there are any digits in project name
desc_digits	number of digits in description
launched_at_weekday	day of week when the campaign was launched
deadline_weekday	day of week of campaign deadline
campaign_len	campaign length
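
As a rough illustration of how the text and date features above could be derived, here is a minimal sketch in plain Python. It is not the repository's actual code: the function name `derive_features` is hypothetical, keywords are assumed to be hyphen-separated slugs, weekdays are assumed to be evaluated in UTC, and `campaign_len` is assumed to be the number of whole days between `launched_at` and `deadline`. `goal_usd` is left out because it depends on a historic exchange-rate source that is not described here.

```python
from datetime import datetime, timezone


def derive_features(row):
    """Derive text and date features from one raw project row (a dict)."""
    name = row.get("name") or ""
    desc = row.get("desc") or ""
    keywords = row.get("keywords") or ""

    # Timestamps in the dataset are Unix times (seconds since the epoch).
    launched = datetime.fromtimestamp(row["launched_at"], tz=timezone.utc)
    deadline = datetime.fromtimestamp(row["deadline"], tz=timezone.utc)

    name_digits = sum(ch.isdigit() for ch in name)
    return {
        "keywords_len": len(keywords),
        # Assumption: keywords look like "my-first-album".
        "keywords_count": len([k for k in keywords.split("-") if k]),
        "name_len": len(name),
        "name_count": len(name.split()),
        "desc_len": len(desc),
        "desc_count": len(desc.split()),
        "name_capitals": sum(ch.isupper() for ch in name),
        "name_digits": name_digits,
        "name_digits_any": name_digits > 0,
        "desc_digits": sum(ch.isdigit() for ch in desc),
        # Monday = 0 ... Sunday = 6, evaluated in UTC.
        "launched_at_weekday": launched.weekday(),
        "deadline_weekday": deadline.weekday(),
        # Assumption: campaign length in whole days.
        "campaign_len": (deadline - launched).days,
    }


example = derive_features({
    "name": "My 2nd Album",
    "desc": "Help fund 10 songs",
    "keywords": "my-2nd-album",
    "launched_at": 1420070400,  # 2015-01-01 00:00:00 UTC
    "deadline": 1422662400,     # 30 days later
})
print(example["campaign_len"], example["name_capitals"])
```

Working on one row at a time keeps the sketch dependency-free; the same expressions translate directly to column operations if the data is loaded into a data frame.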

Solution

After experimenting with several algorithms (Naive Bayes, SVM) and selecting optimal features and parameters using KBest and GridSearchCV respectively, I settled on AdaBoost with the following features:

['campaign_len'
 ,'keywords_count'
 ,'keywords_len'
 ,'name_count'
 ,'name_capitals'
 ,'desc_digits'
 ,'name_digits_any'
 ]
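
The selection process described above might look roughly like the following scikit-learn sketch. It assumes that "KBest" refers to `SelectKBest`, and the parameter grid, scoring metric, and helper name `build_search` are illustrative guesses rather than the settings actually used. The data is random placeholder data so the snippet runs on its own; it says nothing about the real Kickstarter results.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

FEATURES = [
    "campaign_len", "keywords_count", "keywords_len", "name_count",
    "name_capitals", "desc_digits", "name_digits_any",
]


def build_search(k_values=(3, 5, 7), n_estimators=(25, 50)):
    """Grid-search over the number of kept features and AdaBoost rounds."""
    pipe = Pipeline([
        ("select", SelectKBest(score_func=f_classif)),
        ("clf", AdaBoostClassifier(random_state=0)),
    ])
    # Hypothetical grid: the original search space is not documented.
    param_grid = {
        "select__k": list(k_values),
        "clf__n_estimators": list(n_estimators),
    }
    return GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")


# Placeholder data only: one informative column, the rest noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(FEATURES)))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

search = build_search()
search.fit(X, y)
print(search.best_params_)
```

Putting the feature selector inside the pipeline means it is refitted on each cross-validation fold, which avoids leaking information from the held-out fold into the feature choice.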

This led to a prediction score of 0.67484 on the test version of the dataset.

Ideas for improvement

Seasonality is something that could be explored further. I found that the day of the week on which a campaign starts or finishes plays no role in the outcome, but longer-term seasonal fluctuations are likely. Could there be a downturn in funding around summer, or an upturn around the time when people get bonuses?
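
One simple way to start probing the seasonality question would be to compare success rates by launch month. The sketch below is only a starting point under stated assumptions: the function name `success_rate_by_month` is hypothetical, months are taken in UTC from `launched_at`, and the three rows are made-up placeholders, not real projects.

```python
from collections import defaultdict
from datetime import datetime, timezone


def success_rate_by_month(rows):
    """Return {month (1-12): share of projects with final_status == 1}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for row in rows:
        launched = datetime.fromtimestamp(row["launched_at"], tz=timezone.utc)
        totals[launched.month] += 1
        wins[launched.month] += row["final_status"]
    return {month: wins[month] / totals[month] for month in sorted(totals)}


# Placeholder rows for illustration only.
rows = [
    {"launched_at": 1420070400, "final_status": 1},  # 2015-01-01 UTC
    {"launched_at": 1420156800, "final_status": 0},  # 2015-01-02 UTC
    {"launched_at": 1435708800, "final_status": 1},  # 2015-07-01 UTC
]
print(success_rate_by_month(rows))
```

Because the training data spans several years, it may be worth grouping by year and month as well, so that a platform-wide trend is not mistaken for a seasonal effect.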

seifip / hackathon-ml-kickstarter