First of all, thanks to Hoon Pyo (Tim) Jeon and the Kaggle team for such an interesting competition. And congratulations to all the winning teams!
I would like to thank three Kagglers. @limerobot, for sharing the DSB 3rd-place solution. I'm a beginner with Transformers for time-series data, so I learned a lot from your solution! @takoi, for inviting me to form a team. If it weren't for you, I couldn't have reached this rank! @wangsg, for sharing the notebook https://www.kaggle.com/wangsg/a-self-attentive-model-for-knowledge-tracing. I used this notebook as a baseline and finally got 0.810 CV for a single Transformer.
What follows is team takoi + kurupical's solution.
## Validation
- @tito's validation strategy: https://www.kaggle.com/its7171/cv-strategy
# takoi side

## NN

I made 1 LightGBM model and 8 NN models. The model that combined a Transformer and an LSTM had the best CV. Here are the architecture and a brief description.
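The post doesn't include the model code, so here is a minimal sketch of this kind of Transformer-encoder + LSTM architecture. All sizes, layer counts, and the causal-mask handling are assumptions for illustration, not the team's exact configuration:

```python
import torch
import torch.nn as nn

class TransformerLSTM(nn.Module):
    """Sketch of a Transformer-encoder + LSTM model for knowledge tracing.
    All dimensions and layer counts here are illustrative guesses."""
    def __init__(self, n_content=13523, n_features=17, d_model=256,
                 n_heads=8, n_layers=2):
        super().__init__()
        self.content_emb = nn.Embedding(n_content, d_model)
        self.feature_proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, content_ids, features, attn_mask=None):
        # content_ids: (batch, seq); features: (batch, seq, n_features)
        x = self.content_emb(content_ids) + self.feature_proj(features)
        x = self.encoder(x, mask=attn_mask)  # a causal mask prevents peeking ahead
        x, _ = self.lstm(x)                  # LSTM on top of the encoder output
        return self.head(x).squeeze(-1)      # per-step correctness logits
```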
I used 17 features: 15 were computed per user_id and 2 per content_id. The main ones (a pandas sketch of how such features can be computed follows this list):
- sum of answered correctly
- average of answered correctly
- lag time
- same content_id lag time
- distance between the same content_id
- average of answered correctly for each content_id
- average of lag time for each content_id
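As an illustration, here is a minimal pandas sketch of these per-user and per-content features, assuming a dataframe with user_id, content_id, timestamp, and answered_correctly columns; the output column names are hypothetical, and in a real pipeline the content-level averages would be computed on training folds only to avoid leakage:

```python
import numpy as np
import pandas as pd

# df: user_id, content_id, timestamp, answered_correctly
df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# per-user cumulative stats, excluding the current row
grp = df.groupby("user_id")["answered_correctly"]
df["user_sum_correct"] = grp.cumsum() - df["answered_correctly"]
df["user_count"] = grp.cumcount()
df["user_avg_correct"] = df["user_sum_correct"] / df["user_count"].replace(0, np.nan)

# lag time: time since the user's previous interaction
df["lag_time"] = df.groupby("user_id")["timestamp"].diff()

# same content_id lag time and distance between the same content_id
df["row_in_user"] = df.groupby("user_id").cumcount()
gc = df.groupby(["user_id", "content_id"])
df["same_content_lag_time"] = gc["timestamp"].diff()
df["same_content_distance"] = df["row_in_user"] - gc["row_in_user"].shift()

# content-level averages (compute on train data only in practice)
df["content_avg_correct"] = df.groupby("content_id")["answered_correctly"].transform("mean")
df["content_avg_lag"] = df.groupby("content_id")["lag_time"].transform("mean")
```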
## LightGBM

I used 97 features. The following are the main ones (a sketch of two of these features follows the list):
- sum of answered correctly
- average of answered correctly
- lag time
- same part lag time
- same content_id lag time
- distance between the same content_id
- Word2Vec features of content_id
- swem (content_id)
- decayed features (average of answered correctly)
- average of answered correctly for each content_id
- average of lag time for each content_id
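Here is a minimal sketch of two of these features: Word2Vec embeddings of content_id (with a SWEM-style pooled feature) and a decayed average of answered_correctly. Parameters such as vector_size=16 and decay=0.99 are assumptions, not the team's actual values:

```python
import numpy as np
from gensim.models import Word2Vec

# Word2Vec over each user's content_id sequence, treated as a "sentence"
sentences = (df.sort_values(["user_id", "timestamp"])
               .groupby("user_id")["content_id"]
               .apply(lambda s: s.astype(str).tolist())
               .tolist())
w2v = Word2Vec(sentences, vector_size=16, window=5, min_count=1, seed=42)

def swem_mean(content_ids):
    # SWEM-style feature: average-pool the Word2Vec vectors of a history
    return np.mean([w2v.wv[str(c)] for c in content_ids], axis=0)

def decayed_avg(values, decay=0.99):
    # exponentially decayed mean, using only rows before the current one
    num = den = 0.0
    out = []
    for v in values:
        out.append(num / den if den > 0 else np.nan)
        num = num * decay + v
        den = den * decay + 1.0
    return out

df["decayed_avg_correct"] = (df.groupby("user_id")["answered_correctly"]
                               .transform(lambda s: decayed_avg(s.tolist())))
```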
# kurupical side

## model

* 20 epochs
* AdamW(lr=1e-3, weight_decay=0.1)
* linear_with_warmup(lr=1e-3, warmup_epoch=2)
* baseline (SAKT, https://www.kaggle.com/wangsg/a-self-attentive-model-for-knowledge-tracing)
* use all data (the baseline notebook uses only the last 100 interactions per user)
* concatenate embeddings (instead of adding) and apply a Linear layer after the concatenated embeddings (@limerobot's DSB2019 3rd-place solution; see the sketch after this list) (+0.03)
* add min(timestamp_delta // 1000, 300) as a feature (+0.02)
* add "the index at which the user last answered the same content_id" (+0.005)
* Transformer Encoder n_layers 2 -> 4 (+0.002)
* weight_decay 0.01 -> 0.1 (+0.002)
* LIT structure in EncoderLayer (+0.002)
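As an illustration of the embedding trick that gave the largest single gain (+0.03): concatenate the per-feature embeddings rather than summing them, then project back to the model dimension with a Linear layer. This is a minimal sketch of that pattern; the cardinalities and sizes below are hypothetical:

```python
import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    """Concatenate per-feature embeddings, then project back to d_model with a
    Linear layer (the pattern from @limerobot's DSB2019 3rd-place solution).
    Cardinalities and embedding sizes are illustrative."""
    def __init__(self, cardinalities, d_emb=64, d_model=256):
        super().__init__()
        self.embs = nn.ModuleList(nn.Embedding(c, d_emb) for c in cardinalities)
        self.proj = nn.Linear(d_emb * len(cardinalities), d_model)

    def forward(self, cats):
        # cats: (batch, seq, n_cat_features) of integer category ids
        x = torch.cat([emb(cats[..., i]) for i, emb in enumerate(self.embs)],
                      dim=-1)
        return self.proj(x)

# e.g. content_id, min(timestamp_delta // 1000, 300) -> 301 buckets, prior answer
emb = ConcatEmbedding(cardinalities=[13523, 301, 3])
```

Summing embeddings forces every categorical feature into the same space; concatenating and projecting lets the Linear layer learn how to weight and mix them.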
I did over 300 experiments, and only about 20 of them were successful. Among the things that did not work:
* SAINT structure (Transformer Encoder/Decoder)
* Positional Encoding
* Consider timeseries:
    * timedelta.cumsum() / timedelta.sum()
    * np.log10(timedelta.cumsum()).astype(int) as a category feature with embedding
    * etc...
* optimizers: AdaBelief, LookAhead(Adam), RAdam
* more n_layers (4 => 6), larger embedding dimension (256 => 512)
* output only the end of the sequence
* large binning for elapsed_time/timedelta (500, 1000, etc...)
* treat elapsed_time and timedelta as continuous