TalkingData_competition
Downloading the dataset
- First, clone this repository:
git clone https://github.com/minging234/TalkingData_competition.git
- For this project, we will use the
train.csv
file. Navigate to the Kaggle download link and download the zip file into the repo folder.
- Unzip the downloaded file. The csv should automatically be placed in the
./mnt/ssd/kaggle-talkingdata2/competition_files
folder.
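If the archive does not unpack into that folder on its own, the extraction step can be done by hand. The following is a minimal sketch, not part of this repository: the function name `extract_csv` and the archive file name in the usage note are placeholders, and only the target folder comes from the instructions above.

```python
import zipfile
from pathlib import Path

# Target folder named in this README; everything else here is illustrative.
TARGET_DIR = Path("./mnt/ssd/kaggle-talkingdata2/competition_files")


def extract_csv(archive_path, target_dir=TARGET_DIR):
    """Extract every .csv member of the archive into target_dir.

    Returns the list of extracted file paths.
    """
    target_dir = Path(target_dir)
    # Create the folder (and any missing parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Skip anything that is not a csv file.
            if name.lower().endswith(".csv"):
                archive.extract(name, path=target_dir)
                extracted.append(target_dir / name)
    return extracted
```

For example, `extract_csv("train.csv.zip")` would unpack the csv into the default folder, assuming the downloaded archive has that name.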
Getting cleaned data
Due to the size of the dataset, it is not practical to work with all of the data. For now, I limit the dataset to the first 50 million samples (out of 178 million total samples). To get cleaned data that can be readily used for classification:
- Run the clean_data.py script:
python3 clean_data.py
- If you downloaded the data according to the instructions in Downloading the dataset, everything should run automatically. Otherwise, modify the
path-train
variable in clean_data.py.
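To illustrate what limiting the dataset to its first rows can look like, here is a small sketch that streams the header and the first `n_rows` data rows of a csv into a smaller file without loading the whole file into memory. It is not taken from clean_data.py, which may do this differently; `write_head` and the file names in the usage note are hypothetical.

```python
import csv
from itertools import islice


def write_head(src_path, dst_path, n_rows):
    """Copy the header plus the first n_rows data rows from src to dst.

    Returns the number of data rows actually written.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            # Empty input file: nothing to copy.
            return 0
        writer.writerow(header)
        count = 0
        # islice stops after n_rows rows, so the rest of the file is never read.
        for row in islice(reader, n_rows):
            writer.writerow(row)
            count += 1
    return count
```

A call such as `write_head("train.csv", "train_head.csv", 50_000_000)` would keep the first 50 million samples mentioned above.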
Running algorithms
Once the data is cleaned, you can use Jupyter Notebook to open workbook.ipynb
in this repository. The notebook loads the data for you in a style similar to the one used throughout class (X, y variables).
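For readers unfamiliar with the X, y convention, the sketch below shows one way to split a numeric csv into a feature matrix `X` and a label vector `y`. It is an illustration only, not the notebook's loading code: `load_xy` is a hypothetical helper, and it assumes the cleaned file has a header row, contains only numeric columns, and keeps the `is_attributed` label column from the competition's train.csv.

```python
import csv


def load_xy(path, label_column="is_attributed"):
    """Return (X, y, feature_names) from a numeric csv with a header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        label_idx = header.index(label_column)
        # Every column except the label is treated as a feature.
        feature_names = [h for i, h in enumerate(header) if i != label_idx]
        X, y = [], []
        for row in reader:
            y.append(int(row[label_idx]))
            X.append([float(v) for i, v in enumerate(row) if i != label_idx])
    return X, y, feature_names
```

Each entry of `X` is one sample's feature values and the matching entry of `y` is its label, which is the shape most classifiers expect.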