kurupical / riiid

First of all, thanks to Hoon Pyo (Tim) Jeon and the Kaggle team for such an interesting competition. And congratulations to all the winning teams!

There are three Kagglers I would like to thank. @limerobot for sharing the DSB 3rd-place solution. I'm a beginner with transformers for time-series data, so I learned a lot from your solution! @takoi for inviting me to form a team. If it weren't for you, I couldn't have reached this rank! @wangsg for sharing the notebook https://www.kaggle.com/wangsg/a-self-attentive-model-for-knowledge-tracing. I used this notebook as a baseline and finally got 0.810 CV for a single transformer.

The following is the team takoi + kurupical solution.

Team takoi + kurupical Overview

validation

takoi side

I made 1 LightGBM model and 8 NN models. The model that combined a Transformer and an LSTM had the best CV. Here are the architecture and a brief description.

Transformer + LSTM
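
The architecture figure from the original post is not reproduced here. As a rough, hypothetical sketch only (layer sizes, how the features enter the model, and how the LSTM is attached are my assumptions, not the team's actual design), a Transformer encoder followed by an LSTM head could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class TransformerLSTM(nn.Module):
    """Hypothetical sketch: Transformer encoder over the interaction
    sequence, followed by an LSTM and per-step binary predictions."""

    def __init__(self, n_content, d_model=256, n_heads=8, n_layers=2,
                 n_cont_features=17):
        super().__init__()
        self.content_emb = nn.Embedding(n_content, d_model)
        self.cont_proj = nn.Linear(n_cont_features, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, content_id, cont_features, attn_mask=None):
        # content_id: (batch, seq); cont_features: (batch, seq, n_cont_features)
        x = self.content_emb(content_id) + self.cont_proj(cont_features)
        x = self.encoder(x, mask=attn_mask)  # causal mask for knowledge tracing
        x, _ = self.lstm(x)
        return self.head(x).squeeze(-1)      # per-step logits
```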

features

I used 17 features: 15 were computed per user_id and 2 per content_id.

main features

  • sum of answered correctly
  • average of answered correctly
  • lag time
  • same content_id lag time
  • distance between the same content_id
  • average of answered correctly for each content_id
  • average of lag time for each content_id
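
The post does not include the feature code; the sketch below shows one way such per-user and per-content features could be computed with pandas, using the competition's column names (user_id, content_id, timestamp, answered_correctly). It is an illustration, not the team's implementation:

```python
import pandas as pd

def add_user_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: train data with user_id, content_id, timestamp, answered_correctly."""
    df = df.sort_values(["user_id", "timestamp"]).copy()
    g = df.groupby("user_id")["answered_correctly"]
    # shift(1) so the current answer is not leaked into its own feature
    df["user_correct_sum"] = g.transform(lambda s: s.shift(1).cumsum())
    df["user_correct_avg"] = g.transform(lambda s: s.shift(1).expanding().mean())
    # lag time: time since the user's previous interaction
    df["lag_time"] = df.groupby("user_id")["timestamp"].diff()
    # lag time since the same user last saw the same content_id
    df["same_content_lag_time"] = (
        df.groupby(["user_id", "content_id"])["timestamp"].diff()
    )
    return df

def add_content_features(df: pd.DataFrame) -> pd.DataFrame:
    """Call after add_user_features (uses the lag_time column it creates)."""
    # per-content aggregates; in practice compute these on training folds
    # only, to avoid leakage into validation
    content_avg = df.groupby("content_id")["answered_correctly"].mean()
    df["content_correct_avg"] = df["content_id"].map(content_avg)
    content_lag_avg = df.groupby("content_id")["lag_time"].mean()
    df["content_lag_avg"] = df["content_id"].map(content_lag_avg)
    return df
```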

LightGBM

I used 97 features. The following are the main features.

  • sum of answered correctly
  • average of answered correctly
  • lag time
  • same part lag time
  • same content_id lag time
  • distance between the same content_id
  • Word2Vec features of content_id
  • swem (content_id)
  • decayed features (average of answered correctly)
  • average of answered correctly for each content_id
  • average of lag time for each content_id
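
How the Word2Vec features of content_id were built is not shown in the post; a common approach, assumed in this sketch (gensim >= 4.0, reusing the df from the pandas sketch above), is to treat each user's chronological sequence of content_ids as a sentence and use the learned vectors as LightGBM feature columns:

```python
from gensim.models import Word2Vec

# one "sentence" per user: the chronological sequence of content_id tokens
sentences = (
    df.sort_values(["user_id", "timestamp"])
      .groupby("user_id")["content_id"]
      .apply(lambda s: s.astype(str).tolist())
      .tolist()
)

# hyperparameters here are illustrative, not the team's actual settings
w2v = Word2Vec(sentences, vector_size=32, window=5, min_count=1,
               sg=1, workers=4, seed=42)

# vector for a given content_id, usable as LightGBM feature columns
vec = w2v.wv["123"]
```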

kurupical side

model

hyper parameters

* 20 epochs
* AdamW(lr=1e-3, weight_decay=0.1)
* linear_with_warmup(lr=1e-3, warmup_epoch=2)
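
A minimal PyTorch sketch of this setup, converting the epoch-based warmup into optimizer steps; the linear-warmup schedule here uses the helper from Hugging Face transformers, which may differ from the original code (model and train_loader are assumed to exist):

```python
import torch
from transformers import get_linear_schedule_with_warmup

epochs = 20
steps_per_epoch = len(train_loader)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2 * steps_per_epoch,          # warmup_epoch=2
    num_training_steps=epochs * steps_per_epoch,
)

for epoch in range(epochs):
    for batch in train_loader:
        loss = model(**batch)   # hypothetical: the model returns its loss
        loss.backward()
        optimizer.step()
        scheduler.step()        # step the schedule every batch, not every epoch
        optimizer.zero_grad()
```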

worked for me

* baseline (SAKT, https://www.kaggle.com/wangsg/a-self-attentive-model-for-knowledge-tracing)
* use all data (this notebook uses only the last 100 interactions per user)
* embedding concat (not add) and a Linear layer after the concatenated embeddings (@limerobot's DSB2019 3rd-place solution) (+0.03; see the sketch after this list)
* Add min(timestamp_delta//1000, 300) (+0.02)
* Add "index that user answered same content_id at last" (+0.005)
* Transformer Encoder n_layers 2 -> 4 (+0.002)
* weight_decay 0.01 -> 0.1 (+0.002)
* LIT structure in EncoderLayer (+0.002)
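
A sketch of the "concat embeddings, then Linear" idea from the list above; the module name, embedding sizes, and cardinalities are illustrative assumptions, and this only mirrors the trick from @limerobot's DSB2019 solution as I understand it:

```python
import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    """Concatenate several categorical embeddings and project with a Linear
    layer, instead of summing equally sized embeddings."""

    def __init__(self, cardinalities, emb_dim=32, d_model=256):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(n, emb_dim) for n in cardinalities])
        self.proj = nn.Linear(emb_dim * len(cardinalities), d_model)

    def forward(self, cats):
        # cats: (batch, seq, n_categorical) integer-encoded features
        x = torch.cat([emb(cats[..., i]) for i, emb in enumerate(self.embs)], dim=-1)
        return self.proj(x)  # (batch, seq, d_model), fed to the Transformer

# illustrative cardinalities, e.g. content_id, part, previous answer
emb = ConcatEmbedding(cardinalities=[13523, 8, 3], emb_dim=32, d_model=256)
```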

did not work for me

I did over 300 experiments, and only about 20 of them were successful.

* SAINT structure (Transformer Encoder/Decoder)
* Positional Encoding
* Consider the time series more directly (see the sketch after this list)
  * timedelta.cumsum() / timedelta.sum()
  * np.log10(timedelta.cumsum()).astype(int) as a category feature and embedding
  * etc.
* optimizer AdaBelief, LookAhead(Adam), RAdam
* more n_layers (4 => 6), larger embedding dimension (256 => 512)
* output only the end of the sequence
* large binning for elapsed_time/timedelta (500, 1000, etc...)
* treat elapsed_time and timedelta as continuous 
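
For illustration, a sketch of the cumulative-timedelta features from the list above, applied per user; the fillna and the +1 inside the log are my additions to handle the first interaction, not part of the original experiments:

```python
import numpy as np
import pandas as pd

def cumulative_time_features(timedelta: pd.Series) -> pd.DataFrame:
    """timedelta: one user's time since their previous interaction (ms)."""
    timedelta = timedelta.fillna(0)      # the first interaction has no previous one
    cum = timedelta.cumsum()
    return pd.DataFrame({
        # fraction of the user's total elapsed time (continuous)
        "cum_ratio": cum / max(timedelta.sum(), 1),
        # order-of-magnitude bucket, used as a categorical feature + embedding
        "cum_log_bin": np.log10(cum + 1).astype(int),
    })
```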
