Brief: https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-2/
Kickstarter is a community of more than 10 million people, made up of creative and tech enthusiasts who help bring creative projects to life. To date, members have contributed more than $3 billion to fund creative projects. The projects can be literally anything: a device, a game, an app, a film, etc.
Kickstarter works on an all-or-nothing basis, i.e. if a project doesn’t meet its goal, the project owner gets nothing. For example, if a project’s goal is $500 and it raises only $499, the project is not a success.
Recently, Kickstarter released its public data repository to allow researchers and enthusiasts like us to help them solve a problem: will a project get fully funded?
In this challenge, you have to predict if a project will get successfully funded or not.
There are three files to download: train.csv, test.csv and sample_submission.csv. The train data consists of sample projects from May 2009 to May 2015. The test data consists of projects from June 2015 to March 2017.
Variable | Description |
---|---|
project_id | unique id of project |
name | name of the project |
desc | description of project |
goal | the goal (amount) required for the project |
keywords | keywords which describe project |
disable_communication | whether the project authors have disabled the option to communicate with people donating to the project |
country | country of project author |
currency | currency in which goal (amount) is required |
deadline | date by which the goal must be achieved (in Unix time format) |
state_changed_at | time at which the project status changed. Status could be successful, failed, suspended, cancelled, etc. (in Unix time format) |
created_at | time at which the project was posted on the website (in Unix time format) |
launched_at | time at which the project went live on the website (in Unix time format) |
backers_count | no. of people who backed the project |
final_status | whether the project got successfully funded (target variable – 1,0) |

Features derived from the variables above:

Variable | Description |
---|---|
goal_usd | goal converted to USD using historic exchange rate |
keywords_len | length of keywords |
keywords_count | number of keywords |
name_len | length of project name |
name_count | number of words in project name |
desc_len | length of description |
desc_count | number of words in description |
name_capitals | number of uppercase letters in project name |
name_digits | number of digits in the project name |
name_digits_any | boolean indicating whether there are any digits in project name |
desc_digits | number of digits in description |
launched_at_weekday | day of week when the campaign was launched |
deadline_weekday | day of week of campaign deadline |
campaign_len | campaign length |
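
The derived features above can be approximated with pandas. This is a minimal sketch, not the exact code used for this project: it assumes keywords are hyphen-separated, that "length" means character count, that weekdays are numbered with Monday as 0, and that `campaign_len` is measured in days from `launched_at` to `deadline`. The function name `add_features` and the one-row sample frame are made up for illustration, and `goal_usd` is left out because it needs historic exchange rates.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with most of the derived features added."""
    out = df.copy()

    # Treat missing text as an empty string so the counts below are 0.
    name = out["name"].fillna("").astype(str)
    desc = out["desc"].fillna("").astype(str)
    keywords = out["keywords"].fillna("").astype(str)

    out["name_len"] = name.str.len()
    out["name_count"] = name.str.split().str.len()
    out["name_capitals"] = name.str.count(r"[A-Z]")
    out["name_digits"] = name.str.count(r"\d")
    out["name_digits_any"] = out["name_digits"] > 0

    out["desc_len"] = desc.str.len()
    out["desc_count"] = desc.str.split().str.len()
    out["desc_digits"] = desc.str.count(r"\d")

    # Assumption: keywords look like "my-first-album" (hyphen-separated).
    out["keywords_len"] = keywords.str.len()
    out["keywords_count"] = keywords.apply(
        lambda k: len([word for word in k.split("-") if word])
    )

    # The time columns are Unix timestamps in seconds.
    launched = pd.to_datetime(out["launched_at"], unit="s")
    deadline = pd.to_datetime(out["deadline"], unit="s")
    out["launched_at_weekday"] = launched.dt.dayofweek  # Monday=0 ... Sunday=6
    out["deadline_weekday"] = deadline.dt.dayofweek

    # Assumption: campaign length is expressed in days.
    out["campaign_len"] = (out["deadline"] - out["launched_at"]) / 86400
    return out


# Made-up sample row, launched 2015-01-01 00:00 UTC with a 30-day campaign.
sample = pd.DataFrame(
    {
        "name": ["Album 2 by ME"],
        "desc": ["Help us record 10 songs in 2015"],
        "keywords": ["album-2-by-me"],
        "launched_at": [1420070400],
        "deadline": [1420070400 + 30 * 86400],
    }
)
features = add_features(sample)
print(features.iloc[0])
```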
After experimenting with several algorithms (Naive Bayes, SVM) and selecting optimal features and parameters using `KBest` and `GridSearchCV` respectively, I settled on `AdaBoost` with the following features:
`['campaign_len', 'keywords_count', 'keywords_len', 'name_count', 'name_capitals', 'desc_digits', 'name_digits_any']`
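
For readers who want to reproduce this kind of selection, here is a minimal scikit-learn sketch. It is not the project's actual code: chaining `SelectKBest` and `AdaBoostClassifier` in one `Pipeline`, the `f_classif` scoring function, the parameter grid and the synthetic data are all assumptions made for illustration. Only the seven column names come from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

FEATURES = [
    "campaign_len", "keywords_count", "keywords_len", "name_count",
    "name_capitals", "desc_digits", "name_digits_any",
]

# Synthetic stand-in for the real training data (illustration only).
rng = np.random.default_rng(0)
n_rows = 400
X = pd.DataFrame(rng.normal(size=(n_rows, len(FEATURES))), columns=FEATURES)
noise = rng.normal(scale=0.5, size=n_rows)
y = ((X["campaign_len"] + 0.5 * X["keywords_count"] + noise) > 0).astype(int)

# Feature selection and classifier are tuned together, so each
# cross-validation fold selects features on its own training split only.
pipeline = Pipeline(
    [
        ("select", SelectKBest(score_func=f_classif)),
        ("clf", AdaBoostClassifier(random_state=0)),
    ]
)

# Hypothetical grid; the values actually searched are not known.
param_grid = {
    "select__k": [3, 5, 7],
    "clf__n_estimators": [25, 50],
    "clf__learning_rate": [0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Keeping the selection step inside the pipeline avoids leaking information from the validation folds into the choice of features, which would otherwise inflate the cross-validated score.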
This led to a prediction score of 0.67484 on the test version of the dataset.
Seasonality could be explored further. I found that the day of the week on which a campaign starts or finishes plays no role in the outcome, but longer-term seasonal fluctuations are likely. Could there be a downturn in funding around summer, or an upturn around the time when people get bonuses?
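
One simple starting point would be to compare success rates by launch month. The sketch below assumes `launched_at` is a Unix timestamp in seconds and `final_status` is 1 for funded projects, as in the variable table; the function name `success_rate_by_month` and the four-row sample are made up for illustration and say nothing about the real data.

```python
import pandas as pd


def success_rate_by_month(df: pd.DataFrame) -> pd.DataFrame:
    """Count projects and compute the share funded per launch month (1-12)."""
    launched = pd.to_datetime(df["launched_at"], unit="s")
    return (
        df.assign(launch_month=launched.dt.month)
        .groupby("launch_month")["final_status"]
        .agg(projects="count", success_rate="mean")
    )


# Made-up sample: two January 2015 launches and two July 2015 launches.
sample = pd.DataFrame(
    {
        "launched_at": [
            1420070400,              # 2015-01-01
            1420070400 + 10 * 86400, # 2015-01-11
            1435708800,              # 2015-07-01
            1435708800 + 5 * 86400,  # 2015-07-06
        ],
        "final_status": [1, 1, 0, 1],
    }
)
rates = success_rate_by_month(sample)
print(rates)
```

On the real training data, months with few projects would give noisy rates, so the `projects` column is worth checking before reading anything into a difference.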