This is my entry into DiDi Research's machine learning algorithm competition (http://research.xiaojukeji.com/competition/main.action?competitionId=DiTech2016). Overall, I ranked 460th out of 775 in the competition (see didi_comp_rank.png).
DiDi Research is a division of DiDi Chuxing, a Chinese ride-hailing company (https://en.wikipedia.org/wiki/Didi_Chuxing). The goal of this competition is to predict the demand-supply gap, i.e. the number of unfilled ride requests at a given time in a given physical district. For full details about this competition, please refer to http://research.xiaojukeji.com/competition/main.action?competitionId=DiTech2016
The provided data includes information about:
- Order details (can be used to calculate demand-supply gap)
- Points-of-interest (POI) for each district (e.g. entertainment, shopping, sports, etc.)
- Traffic time-series information for each district
- Weather time-series information for each district
DiDi provides the training data containing all the information categories above, in addition to test data without demand-supply gap information.
The full data description and download link are available at http://research.xiaojukeji.com/competition/detail.action?competitionId=DiTech2016
I pre-processed the time-invariant data and fed it into a fully-connected neural network to predict the demand-supply gap. Since the demand-supply gap is a continuous variable, I approached this as a regression problem.
Side note: initially, I considered using a recurrent neural network (RNN) to predict the demand-supply gap from the prior time-series data as well as the time-invariant data. However, given the timestamps of the data points in the test data set, I could not construct a continuous time series from the test data. With more time and a better understanding of time-series prediction and RNNs, a future enhancement would be to leverage the time-series data, possibly using an RNN as the basis for prediction.
The data was pre-processed using Python and NumPy. The pre-processed inputs were:
- Points-of-interest (POI) data for each district, represented as a tensor with shape (NUM_DISTRICTS, NUM_POI_CLASSES)
- Each of the NUM_POI_CLASSES "columns" holds the count of POI instances of that class in the district, e.g. how many entertainment venues are in this district, how many sports venues, etc.
- Date information, to create 2 feature tensors:
- Weekday (Mon/Tue/etc.) one-hot encoding, a tensor with shape (NUM_DATA_POINTS, 7)
- Daily 10-minute timeslot one-hot encoding, with slots numbered 1-144 inclusive; a tensor with shape (NUM_DATA_POINTS, 144) (see the encoding sketch after this list)
- Traffic data, shifted backwards in time by 2 timeslots (20 minutes). A tensor with shape (NUM_DATA_POINTS, 1)
- The traffic data was shifted backwards by 2 timeslots because the test data set requires a prediction of the demand-supply gap given traffic data from 20 minutes prior
- I created my own metric to measure the overall traffic based on the raw traffic data provided; more details are available in 'preprocess_data/calc_overall_traffic.py'. The overall traffic score is a single float.
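The one-hot encodings and the traffic shift are conceptually simple; below is a minimal NumPy sketch. The array names and the random placeholder data are illustrative, not from the repository.

```python
import numpy as np

NUM_DATA_POINTS = 1000  # hypothetical number of data points

# Hypothetical raw inputs: weekday index 0-6 (Mon-Sun) and 10-minute
# timeslot index 1-144 for each data point.
weekday_idx = np.random.randint(0, 7, size=NUM_DATA_POINTS)
slot_idx = np.random.randint(1, 145, size=NUM_DATA_POINTS)

# Weekday one-hot encoding -> shape (NUM_DATA_POINTS, 7)
weekday_onehot = np.zeros((NUM_DATA_POINTS, 7))
weekday_onehot[np.arange(NUM_DATA_POINTS), weekday_idx] = 1.0

# Timeslot one-hot encoding -> shape (NUM_DATA_POINTS, 144); slots are 1-based
slot_onehot = np.zeros((NUM_DATA_POINTS, 144))
slot_onehot[np.arange(NUM_DATA_POINTS), slot_idx - 1] = 1.0

# Shift a per-district traffic series backwards by 2 timeslots, so the
# feature at timeslot t is the traffic observed at t-2 (20 minutes prior).
traffic = np.random.rand(NUM_DATA_POINTS)  # overall traffic score per timeslot
traffic_shifted = np.empty_like(traffic)
traffic_shifted[2:] = traffic[:-2]
traffic_shifted[:2] = traffic[0]  # pad the first two slots
```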
All the data above was standardized (subtract mean and divide by standard deviation) as necessary.
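Standardization itself is a one-liner per feature column. A minimal sketch, using a hypothetical feature matrix `x`:

```python
import numpy as np

x = np.random.rand(1000, 10)  # hypothetical feature matrix

# Standardize each column: subtract the mean, divide by the standard deviation.
mean = x.mean(axis=0)
std = x.std(axis=0)
std[std == 0] = 1.0  # guard against constant (e.g. all-zero) columns
x_standardized = (x - mean) / std
```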
Note that I did not make use of the weather data, because it was missing for many timeslots.
I also pre-processed the order information to calculate the demand-supply gap for each district in all applicable timeslots.
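The exact logic lives in the pre-processing scripts; conceptually, assuming (as in the competition's data description) that unanswered ride requests appear as order rows with no driver assigned, the gap can be counted per (district, date, timeslot) as in this hypothetical sketch:

```python
from collections import defaultdict

def timeslot(time_str):
    """Map an 'HH:MM:SS' time-of-day to a 10-minute slot in 1-144 (hypothetical helper)."""
    hh, mm, _ = time_str.split(":")
    return int(hh) * 6 + int(mm) // 10 + 1

# Hypothetical order rows: (driver_id, district, date, time);
# driver_id is None when the request went unanswered.
orders = [
    (None, "d1", "2016-01-01", "08:05:00"),
    ("drv42", "d1", "2016-01-01", "08:07:00"),
    (None, "d1", "2016-01-01", "08:09:00"),
]

# Demand-supply gap = number of unanswered requests per (district, date, slot).
gap = defaultdict(int)
for driver_id, district, date, time in orders:
    if driver_id is None:
        gap[(district, date, timeslot(time))] += 1

print(dict(gap))  # {('d1', '2016-01-01', 49): 2}
```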
Then, I tiled and concatenated all of the above as necessary to create the training feature tensor of shape (NUM_DATA_POINTS, NUM_FEATURES) and the label tensor of shape (NUM_DATA_POINTS, 1). Further, I removed all data points where the demand-supply gap is 0, because a gap of 0 is not factored into the competition's final score.
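Assembling the final tensors is mostly tiling, column-wise concatenation, and a boolean mask. A sketch with hypothetical shapes and placeholder data:

```python
import numpy as np

NUM_DATA_POINTS, NUM_POI_CLASSES = 1000, 25  # hypothetical sizes

poi_row = np.random.rand(NUM_POI_CLASSES)      # POI counts for one district
weekday_onehot = np.zeros((NUM_DATA_POINTS, 7))
slot_onehot = np.zeros((NUM_DATA_POINTS, 144))
traffic = np.random.rand(NUM_DATA_POINTS, 1)
gap = np.random.randint(0, 5, size=(NUM_DATA_POINTS, 1)).astype(float)

# Tile the per-district POI row so every data point carries it, then
# concatenate all feature blocks column-wise.
poi_tiled = np.tile(poi_row, (NUM_DATA_POINTS, 1))
x_train = np.concatenate([poi_tiled, weekday_onehot, slot_onehot, traffic], axis=1)
y_train = gap

# Drop data points whose gap is 0; they do not affect the competition score.
mask = y_train[:, 0] != 0
x_train, y_train = x_train[mask], y_train[mask]
```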
Finally, I ran principal component analysis (PCA) on the data to reduce its dimensionality, since the features included many one-hot vectors.
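With scikit-learn this is only a few lines; the key detail is fitting PCA on the training data and reusing that projection for the test data. A sketch (the component count is an arbitrary placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA

x_train = np.random.rand(1000, 177)  # hypothetical feature matrices
x_test = np.random.rand(200, 177)

# Fit the projection on the training data only, then apply it to both
# sets so they share the same reduced basis.
pca = PCA(n_components=50)  # 50 components is a placeholder choice
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)
```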
The creation of the neural network and training process was performed using TensorFlow.
Due to time constraints, I stuck with a 6-layer fully-connected neural network with the following layer sizes: [128, 128, 128, 128, 128, 256]. ReLU non-linearities and dropout were applied to all 6 layers. Since this is a regression problem, the final neural network output (the prediction) is the sum of all outputs from the final layer (the 256-wide layer).
For regularization, both L2 regularization and dropout were used.
For the TensorFlow code used to construct the neural network, see the code between "#BEGIN model construction" and "#END model construction" in 'training/tf_nn_final_out.py'.
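For orientation, here is a minimal TensorFlow 1.x-style sketch of the architecture described above (layer widths from the text; variable names, the loss, and all hyper-parameter values are placeholders of mine, so the actual code in 'training/tf_nn_final_out.py' may differ):

```python
import tensorflow as tf  # written against the TensorFlow 1.x API

NUM_FEATURES = 50  # hypothetical post-PCA feature count
LAYER_SIZES = [128, 128, 128, 128, 128, 256]

x = tf.placeholder(tf.float32, [None, NUM_FEATURES])
y = tf.placeholder(tf.float32, [None, 1])
keep_prob = tf.placeholder(tf.float32)  # dropout keep-probability

# Six fully-connected layers, each with ReLU and dropout.
h, in_dim, l2_terms = x, NUM_FEATURES, []
for out_dim in LAYER_SIZES:
    w = tf.Variable(tf.truncated_normal([in_dim, out_dim], stddev=0.1))
    b = tf.Variable(tf.zeros([out_dim]))
    h = tf.nn.dropout(tf.nn.relu(tf.matmul(h, w) + b), keep_prob)
    l2_terms.append(tf.nn.l2_loss(w))
    in_dim = out_dim

# Regression output: sum the 256 activations of the final layer per example.
prediction = tf.reduce_sum(h, axis=1, keep_dims=True)

# Mean squared error is a placeholder loss; the L2 strength and learning
# rate stand in for the values found by the hyper-parameter sweep.
loss = tf.reduce_mean(tf.square(prediction - y)) + 1e-4 * tf.add_n(l2_terms)
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)
```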
I used random uniform search across hyper-parameter value combinations for the following hyper-parameters:
- L2 regularization strength
- Dropout keep-probability
- Learning rate (for Adam optimizer)
The code for the hyper-parameter search is at 'training/tf_nn_sweep.py'.
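Random uniform search is simple to implement: draw each hyper-parameter independently and uniformly from a range, train, and keep the best combination. A self-contained sketch (the ranges and trial count are illustrative, and the training call is a stand-in):

```python
import random

def sample_hparams():
    """Draw one hyper-parameter combination uniformly from illustrative ranges."""
    return {
        "l2_strength": random.uniform(1e-6, 1e-2),
        "keep_prob": random.uniform(0.5, 1.0),
        "learning_rate": random.uniform(1e-5, 1e-2),
    }

def train_and_eval(hparams):
    """Stand-in for training the network; returns a dummy validation loss."""
    return random.random()

best = None
for _ in range(50):  # number of random trials, chosen arbitrarily
    hparams = sample_hparams()
    val_loss = train_and_eval(hparams)
    if best is None or val_loss < best[0]:
        best = (val_loss, hparams)

print("best hyper-parameters:", best[1])
```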
After finding the optimal hyper-parameters, I used them to perform the final training and test result generation. The code to do this is located at 'training/tf_nn_final_out.py'.
- Python 2 or 3
- NumPy
- scikit-learn
- TensorFlow
- git clone (this repository)
- cd ml_competition_didi
- (download the training and test data into 'data/' directory)
- cd preprocess_data
- python gen_training_data_nozero.py
- Generates training data, stores the feature tensor and labels into 'x_train_nozero.npy' and 'y_train_nozero.npy'
- python gen_test_data.py
- Generates test set, stores the feature tensor into 'x_test.npy'
- python run_pca.py
- Runs PCA on both training and test data
- In my setup, scikit-learn's PCA did not work inside the TensorFlow environment. Running the PCA step in a new terminal with the vanilla Anaconda environment worked for me.
- cd ../training
- OPTIONAL: python tf_nn_sweep.py
- Runs the hyper-parameter search
- python tf_nn_final_out.py
- Performs training on the model, generates final test results
- cd ../preprocess_data
- python gen_test_result.py > final.csv
- 'final.csv' will be the csv file submitted to the competition