softengers / recsys2019

The complete code and notebooks used for the ACM Recommender Systems Challenge 2019

ACM RecSys Challenge 2019

About the challenge

Trivago is a global hotel search platform focused on reshaping the way travelers search for and compare hotels, while enabling advertisers of hotels to grow their businesses by providing access to a broad audience of travelers via our websites and apps. We provide aggregated information about the characteristics of each accommodation to help travelers to make an informed decision and find their ideal place to stay. Once a choice is made, the users get redirected to the selected booking site to complete the booking.

It’s in the interest of the traveler, advertising booking site, and trivago to suggest suitable accommodations that fit the needs of the traveler best to increase the chance of a redirect (click-out) to a booking site. We face a few challenges when it comes to recommending the best options for our visitors, so it’s important to effectively make use of the explicit and implicit user signals within a session (clicks, search refinement, filter usage) to detect the users’ intent as quickly as possible and to update the recommendations to tailor the result list to these needs.

The goal of the challenge is to develop a session-based and context-aware recommender system that adapts a list of accommodations to the needs of the user. Participants have to predict which accommodations were clicked in the search results during the last part of a user session. Predictions are then evaluated offline and the scores are displayed on a leaderboard.

Visit the challenge website for more information about the challenge.

You can read the article in which we describe our work here.

Team members

We participated in the challenge as PoliCloud8, a team of 8 MSc students from Politecnico di Milano.

We worked under the supervision of two PhD students.

Dataset

The dataset is available on the challenge dataset page. You must be registered to download it.

Run the code

Preprocessing

Run this to preprocess the original data:

python preprocess.py

Files are saved inside dataset/preprocessed. The preprocessing can run in different modes:

  • full: all the samples from train.csv are used as training set and all the test samples are used as test set
  • local: 80% of the train.csv is used, the remaining 20% is stashed as validation set
  • small: only a small number of samples is taken from train.csv (you can choose how many, default is 100k samples). This is useful for debugging purposes.
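
The three modes can be pictured with a small sketch. It is only an illustration under stated assumptions: the helper name `split_by_mode`, its parameters, and the plain row-order cut are hypothetical, and the actual preprocess.py may split differently (for example by session or by timestamp). How the small mode builds its validation part is also an assumption here.

```python
from typing import List, Sequence, Tuple


def split_by_mode(
    train_rows: Sequence[dict],
    test_rows: Sequence[dict],
    mode: str = "full",
    small_size: int = 100_000,
    train_fraction: float = 0.8,
) -> Tuple[List[dict], List[dict]]:
    """Return (train, test) row lists for a mode (hypothetical helper)."""
    if mode == "full":
        # All of train.csv for training, all official test rows for testing.
        return list(train_rows), list(test_rows)
    if mode == "local":
        # First 80% for training, remaining 20% held out for validation.
        # Assumption: a plain row-order cut.
        cut = int(len(train_rows) * train_fraction)
        return list(train_rows[:cut]), list(train_rows[cut:])
    if mode == "small":
        # Keep only the first `small_size` rows, then cut them like local mode.
        # Assumption: the source does not say how the small set is validated.
        subset = train_rows[:small_size]
        cut = int(len(subset) * train_fraction)
        return list(subset[:cut]), list(subset[cut:])
    raise ValueError(f"unknown mode: {mode!r}")
```

With ten training rows, `split_by_mode(rows, test, "local")` would return eight rows for training and two for validation.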

Training and test set

Inside the folder preprocess_utils, there are files used to create the datasets suitable for each model:

Core models

  • dataset_xgboost.py: create the dataset for XGBoost
  • create_dataset_classification_rnn.py: create the dataset for RNN
  • tfranking_dataset_creator_2.py: create the dataset for TensorflowRanking
  • dataset_catboost.py.py: create the dataset for Catboost
  • dataset_lightGBM.py.py: create the dataset for LightGBM

Support models

  • dataset_xgboost_classifier.py: create the dataset for XGBoost classifier
  • create_dataset_binary_classification_rnn.py: create the dataset for RNN classifier

Stacking Ensemble

  • dataset_stacking.py: create the dataset for the Stacking Ensemble

Features

One of the hardest phases of the competition was feature engineering. We crafted around a hundred different features. You can find all of them in the folder extract_features. Each file creates a single feature and contains a description of it.

Models

You can find all the recommenders we developed inside the folder recommenders. They inherit from a common base class called RecommenderBase. This abstract class exposes some useful methods, like:

fit()

Fit the model on the data. Each subclass should override this method with its own training logic.

recommend_batch()

Return the list of recommendations.

evaluate()

Return the MRR computed on the local test set.
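
The interface above could look roughly like the following sketch. The class name `RecommenderSketch`, the method signatures, and the return format of `recommend_batch()` are assumptions made for illustration; the real RecommenderBase in the recommenders folder may differ. The MRR (mean reciprocal rank) averages 1/rank of the clicked item over all clickouts.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class RecommenderSketch(ABC):
    """Illustrative stand-in for RecommenderBase; signatures are assumptions."""

    @abstractmethod
    def fit(self) -> None:
        """Train the model; each subclass supplies its own logic."""

    @abstractmethod
    def recommend_batch(self) -> List[Tuple[int, List[int]]]:
        """Return (clickout id, ranked item ids) pairs."""

    def evaluate(self, clicked: Dict[int, int]) -> float:
        """Mean reciprocal rank of the clicked item over all clickouts."""
        recommendations = self.recommend_batch()
        total = 0.0
        for clickout_id, ranked_items in recommendations:
            target = clicked[clickout_id]
            if target in ranked_items:
                # Ranks are 1-based, list indices are 0-based.
                total += 1.0 / (ranked_items.index(target) + 1)
        return total / len(recommendations) if recommendations else 0.0


class ImpressionOrderRecommender(RecommenderSketch):
    """Toy subclass: keeps the items in the order they were displayed."""

    def __init__(self, impressions: Dict[int, List[int]]):
        self.impressions = impressions

    def fit(self) -> None:
        pass  # nothing to learn for this baseline

    def recommend_batch(self) -> List[Tuple[int, List[int]]]:
        return [(cid, list(items)) for cid, items in self.impressions.items()]


model = ImpressionOrderRecommender({0: [10, 20, 30], 1: [40, 50]})
model.fit()
print(model.evaluate({0: 20, 1: 40}))  # (1/2 + 1/1) / 2 = 0.75
```

Keeping `evaluate()` in the base class means every model is scored by the same code, so local MRR values stay comparable across models.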

We developed several models, which were finally combined in an ensemble:

  • XGBoost
  • Tensorflow Ranking
  • Recurrent Neural Network
  • Catboost
  • LightGBM

Read the paper for more details.

Ensemble

We trained a new model on a new training set composed of the predictions of the previously trained models (stacking), together with all the other features we crafted.
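
As a rough sketch of how such a stacking training set can be assembled, the snippet below joins per-row scores of the base models with the handcrafted features. The function `build_stacking_rows`, the `(session_id, item_id)` key, and the default value for missing entries are assumptions for illustration, not the contents of dataset_stacking.py.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (session_id, item_id): assumed row identifier


def build_stacking_rows(
    base_scores: Dict[str, Dict[Key, float]],
    features: Dict[Key, Dict[str, float]],
    missing: float = 0.0,
) -> Tuple[List[Key], List[str], List[List[float]]]:
    """Join base-model scores and handcrafted features into one matrix."""
    keys = sorted(features)
    model_names = sorted(base_scores)
    feature_names = sorted({name for row in features.values() for name in row})
    columns = model_names + feature_names
    matrix = []
    for key in keys:
        # One column per base model, then one column per handcrafted feature.
        row = [base_scores[m].get(key, missing) for m in model_names]
        row += [features[key].get(f, missing) for f in feature_names]
        matrix.append(row)
    return keys, columns, matrix


scores = {
    "xgb": {("s1", 10): 0.9, ("s1", 20): 0.1},
    "rnn": {("s1", 10): 0.7, ("s1", 20): 0.4},
}
feats = {
    ("s1", 10): {"impression_position": 1.0},
    ("s1", 20): {"impression_position": 2.0},
}
keys, columns, matrix = build_stacking_rows(scores, feats)
print(columns)  # ['rnn', 'xgb', 'impression_position']
print(matrix)   # [[0.7, 0.9, 1.0], [0.4, 0.1, 2.0]]
```

A common precaution in stacking, which is not described here, is to feed the second-level model with predictions the base models made on data they were not trained on, so that the ensemble does not learn from leaked labels.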

We report below an analysis of the permutation importance of the various features of the ensemble.

Weight Feature
0.4177 ± 0.0013 xgb_700
0.0029 ± 0.0003 scores_softmax_loss
0.0027 ± 0.0002 rnn_no_bias_balanced
0.0025 ± 0.0003 rnn_classifier
0.0009 ± 0.0003 xgboost_impr_features
0.0009 ± 0.0002 past_actions_involving_impression_session_clickout_item
0.0008 ± 0.0003 personalized_popularity
0.0007 ± 0.0001 future_actions_involving_impression_session_clickout_item
0.0005 ± 0.0002 num_impressoins_in_clickout
0.0004 ± 0.0002 last_pos_interacted
0.0003 ± 0.0001 scores_pairwise_soft_zero_one_loss
0.0003 ± 0.0000 num_clickouts
0.0003 ± 0.0003 length_timestamp
0.0003 ± 0.0002 past_time_from_closest_interaction_impression
0.0002 ± 0.0002 rnn_GRU_2layers_64units_2dense_noclass0
0.0002 ± 0.0001 top_pop_per_impression
0.0002 ± 0.0001 impression_position
0.0002 ± 0.0002 elapsed_last_action_click
0.0002 ± 0.0002 session_length_timestamp
0.0002 ± 0.0001 step_from_last_interaction
0.0002 ± 0.0001 future_time_from_closest_interaction_impression
0.0002 ± 0.0001 impression_time
0.0001 ± 0.0001 fraction_pos_price
0.0001 ± 0.0001 times_impression_appeared_in_clickouts_session
0.0001 ± 0.0002 frenzy_factor
0.0001 ± 0.0001 platform_similarity
0.0001 ± 0.0000 past_actions_involving_impression_session_search_for_item
0.0001 ± 0.0001 timestamp_from_last_interaction
0.0001 ± 0.0001 feature
0.0001 ± 0.0001 mean_price_interacted
0.0001 ± 0.0002 mean_pos_interacted
0.0001 ± 0.0000 actions_num_ref_diff_from_impressions_clickout_item
0.0001 ± 0.0001 mean_time_action
0.0001 ± 0.0001 past_mean_cheap_pos_interacted
0.0001 ± 0.0002 variance_last_action
0.0001 ± 0.0001 price
0.0001 ± 0.0001 mean_time_per_step
0.0001 ± 0.0001 scores_cosine
0.0001 ± 0.0001 destination_change_distance_from_first_action
0 ± 0.0001 max_pos_interacted
0 ± 0.0001 percentage_of_total_city_clk
0 ± 0.0001 perc_click_appeared
0 ± 0.0000 past_pos_clicked_4_8
0 ± 0.0002 elapsed_last_action_click_log
0 ± 0.0000 future_pos_clicked_4_8
0 ± 0.0000 past_pos_clicked_2
0 ± 0.0000 past_pos_clicked_3
0 ± 0.0000 actions_involving_impression_session_clickout_item
0 ± 0.0000 past_pos_closest_reference
0 ± 0.0001 session_length_step
0 ± 0.0001 length_step
0 ± 0.0000 num_interactions_impr
0 ± 0.0000 search_for_poi_distance_from_first_action
0 ± 0.0000 search_for_poi_distance_from_last_clickout
0 ± 0.0000 satisfaction_percentage
0 ± 0.0001 top_pop_interaction_clickout_per_impression
0 ± 0.0001 past_actions_involving_impression_session_item_deals
0 ± 0.0001 past_times_interacted_impr
0 ± 0.0001 num_times_item_impressed
0 ± 0.0000 actions_num_ref_diff_from_impressions_interaction_item_rating
0 ± 0.0000 actions_num_ref_diff_from_impressions_interaction_item_info
0 ± 0.0000 past_pos_clicked_9_15
0 ± 0.0001 perc_clickouts
0 ± 0.0000 past_session_num
0 ± 0.0000 actions_involving_impression_session_interaction_item_rating
0 ± 0.0000 actions_num_ref_diff_from_impressions_search_for_item
0 ± 0.0000 past_pos_clicked_16_25
0 ± 0.0000 future_pos_clicked_3
0 ± 0.0000 actions_involving_impression_session_interaction_item_info
0 ± 0.0001 destination_change_distance_from_last_clickout
0 ± 0.0000 platform_FI
0 ± 0.0000 last_action_search for item
0 ± 0.0000 platform_FR
0 ± 0.0000 platform_CZ
0 ± 0.0000 last_action_search for destination
0 ± 0.0000 last_action_interaction item rating
0 ± 0.0000 last_action_search for poi
0 ± 0.0000 platform_GR
0 ± 0.0000 platform_DE
0 ± 0.0000 last_action_no_action
0 ± 0.0000 platform_CH
0 ± 0.0000 platform_AR
0 ± 0.0000 platform_CO
0 ± 0.0000 platform_AA
0 ± 0.0000 platform_AE
0 ± 0.0000 platform_AT
0 ± 0.0000 platform_AU
0 ± 0.0000 platform_BE
0 ± 0.0000 platform_BG
0 ± 0.0000 platform_ES
0 ± 0.0000 last_action_interaction item info
0 ± 0.0000 platform_EC
0 ± 0.0000 platform_BR
0 ± 0.0000 platform_CN
0 ± 0.0000 platform_DK
0 ± 0.0000 platform_CL
0 ± 0.0000 platform_CA
0 ± 0.0000 past_closest_action_involving_impression_no_action
0 ± 0.0000 day_6
0 ± 0.0000 last_action_interaction item image
0 ± 0.0000 future_closest_action_involving_impression_no_action
0 ± 0.0000 price_log
0 ± 0.0000 last_action_involving_impression_clickout item
0 ± 0.0000 last_action_involving_impression_interaction item deals
0 ± 0.0000 last_action_involving_impression_interaction item image
0 ± 0.0000 last_action_involving_impression_interaction item info
0 ± 0.0000 last_action_involving_impression_interaction item rating
0 ± 0.0000 last_action_involving_impression_no_action
0 ± 0.0000 last_action_involving_impression_search for item
0 ± 0.0000 session_device_d
0 ± 0.0000 session_device_m
0 ± 0.0000 session_device_t
0 ± 0.0000 future_closest_action_involving_impression_search for item
0 ± 0.0000 future_closest_action_involving_impression_not_present
0 ± 0.0000 std_last_action
0 ± 0.0000 day_0
0 ± 0.0000 last_action_interaction item deals
0 ± 0.0000 day_1
0 ± 0.0000 day_2
0 ± 0.0000 day_3
0 ± 0.0000 day_4
0 ± 0.0000 day_5
0 ± 0.0000 platform_HR
0 ± 0.0000 moment_A
0 ± 0.0000 moment_E
0 ± 0.0000 moment_M
0 ± 0.0000 moment_N
0 ± 0.0000 future_closest_action_involving_impression_interaction item rating
0 ± 0.0000 last_action_change of sort order
0 ± 0.0000 last_action_clickout item
0 ± 0.0000 last_action_filter selection
0 ± 0.0000 platform_HK
0 ± 0.0000 platform_KR
0 ± 0.0000 platform_HU
0 ± 0.0000 platform_TR
0 ± 0.0000 platform_UK
0 ± 0.0000 platform_US
0 ± 0.0000 platform_UY
0 ± 0.0000 platform_VN
0 ± 0.0000 platform_ZA
0 ± 0.0000 future_closest_action_involving_impression_interaction item |image
0 ± 0.0000 sort_order_active_when_clickout_distance and recommended
0 ± 0.0000 sort_order_active_when_clickout_distance only
0 ± 0.0000 sort_order_active_when_clickout_interaction sort button
0 ± 0.0000 sort_order_active_when_clickout_our recommendations
0 ± 0.0000 sort_order_active_when_clickout_price and recommended
0 ± 0.0000 sort_order_active_when_clickout_price only
0 ± 0.0000 sort_order_active_when_clickout_rating and recommended
0 ± 0.0000 sort_order_active_when_clickout_rating only
0 ± 0.0000 actions_num_ref_diff_from_impressions_no_action
0 ± 0.0000 platform_ID
0 ± 0.0000 actions_involving_impression_session_no_action
0 ± 0.0000 future_closest_action_involving_impression_interaction item |deals
0 ± 0.0000 future_closest_action_involving_impression_clickout item
0 ± 0.0000 past_closest_action_involving_impression_search for item
0 ± 0.0000 past_closest_action_involving_impression_not_present
0 ± 0.0000 past_closest_action_involving_impression_clickout item
0 ± 0.0000 past_closest_action_involving_impression_interaction item deals
0 ± 0.0000 past_closest_action_involving_impression_interaction item image
0 ± 0.0000 past_closest_action_involving_impression_interaction item info
0 ± 0.0000 platform_TW
0 ± 0.0000 actions_involving_impression_session_interaction_item_deals
0 ± 0.0000 platform_TH
0 ± 0.0000 platform_PH
0 ± 0.0000 platform_IE
0 ± 0.0000 platform_IL
0 ± 0.0000 platform_IN
0 ± 0.0000 platform_IT
0 ± 0.0000 platform_JP
0 ± 0.0000 past_closest_action_involving_impression_interaction item rating
0 ± 0.0000 platform_MX
0 ± 0.0000 platform_MY
0 ± 0.0000 platform_NL
0 ± 0.0000 platform_NO
0 ± 0.0000 platform_NZ
0 ± 0.0000 platform_PE
0 ± 0.0000 platform_PL
0 ± 0.0000 future_closest_action_involving_impression_interaction item info
0 ± 0.0000 platform_RU
0 ± 0.0000 platform_SI
0 ± 0.0000 platform_SG
0 ± 0.0000 platform_SE
0 ± 0.0000 platform_PT
0 ± 0.0000 platform_RS
0 ± 0.0000 platform_RO
0 ± 0.0000 platform_SK
0 ± 0.0000 past_actions_involving_impression_session_item_info
0 ± 0.0000 future_actions_involving_impression_session_item_rating
0 ± 0.0000 past_mean_pos
0 ± 0.0000 actions_involving_impression_session_search_for_item
0 ± 0.0000 actions_num_ref_diff_from_impressions_interaction_item_deals
0 ± 0.0000 future_position_impression_same_closest_clickout
0 ± 0.0001 mean_price_clickout
0 ± 0.0000 past_times_impr_appeared
0 ± 0.0000 change_sort_order_distance_from_last_clickout
0 ± 0.0000 future_pos_clicked_2
0 ± 0.0001 mean_cheap_pos_interacted
0 ± 0.0000 past_actions_involving_impression_session_item_rating
0 ± 0.0001 percentage_of_total_plat_inter
0 ± 0.0001 future_actions_involving_impression_session_search_for_item
0 ± 0.0000 future_mean_cheap_pos_interacted
0 ± 0.0000 actions_involving_impression_session_interaction_item_image
0 ± 0.0001 future_pos_closest_reference
0 ± 0.0000 past_mean_price_interacted
0 ± 0.0000 future_pos_clicked_1
0 ± 0.0000 future_mean_price_interacted
0 ± 0.0002 percentage_of_total_city_inter
0 ± 0.0001 past_times_user_interacted_impression
0 ± 0.0000 future_session_num
0 ± 0.0001 actions_num_ref_diff_from_impressions_interaction_item_image
0 ± 0.0001 scores_manhatthan
0 ± 0.0000 past_pos_clicked_1
0 ± 0.0001 change_sort_order_distance_from_first_action
0 ± 0.0000 future_mean_pos_impr_appeared
0 ± 0.0000 past_position_impression_same_closest_clickout
0 ± 0.0000 future_actions_involving_impression_session_item_image
0 ± 0.0000 future_mean_pos
0 ± 0.0001 impression_pos_price
0 ± 0.0001 past_actions_involving_impression_session_no_action
0 ± 0.0000 future_times_interacted_impr
0 ± 0.0000 future_pos_clicked_16_25
0 ± 0.0001 rating
0 ± 0.0000 past_mean_pos_impr_appeared
0 ± 0.0000 future_times_impr_appeared
0 ± 0.0000 future_actions_involving_impression_session_item_info
0 ± 0.0000 past_actions_involving_impression_session_item_image
-0.0001 ± 0.0000 future_actions_involving_impression_session_no_action
-0.0001 ± 0.0001 future_actions_involving_impression_session_item_deals
-0.0001 ± 0.0000 future_times_user_interacted_impression
-0.0001 ± 0.0000 future_pos_clicked_9_15
-0.0001 ± 0.0001 percentage_of_total_plat_clk
-0.0001 ± 0.0001 stars
