nlztrk / Learning-Equality-Curriculum-Recommendations


Solution for "Learning Equality - Curriculum Recommendations" @Kaggle

(Architecture diagram)

I used the sentence-transformers library and models from HuggingFace, and tried to implement the architecture shared here.

The pipeline consists of:

Splitting the Data into Train/Val

I've seen a lot of different approaches on the forum. I also wanted to make use of the imbalance in the language distribution in my approach. I assigned all the data coming from source to the train set. For the remaining data, I used the following scheme (a code sketch follows the list):

  • CV Scheme: Grouped Stratified K-Fold
  • Folds: 5 (Used only the first)
  • Group: Topic ID
  • Stratifier Label: Language
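
Below is a minimal sketch of this split using scikit-learn's StratifiedGroupKFold. File, column, and variable names are illustrative assumptions, not the ones used in the notebooks.

```python
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

# Illustrative inputs: one row per (topic_id, content_id) pair plus the topic metadata.
pairs = pd.read_csv("topic_content_pairs.csv")
topics = pd.read_csv("topics.csv")

# All pairs whose topic comes from "source" go straight into train.
source_ids = set(topics.loc[topics["category"] == "source", "id"])
is_source = pairs["topic_id"].isin(source_ids)

rest = pairs[~is_source].reset_index(drop=True)
rest = rest.merge(topics[["id", "language"]], left_on="topic_id", right_on="id", how="left")

# Grouped Stratified K-Fold on the remainder:
# group = topic ID (keeps all pairs of a topic in one fold), label = language,
# 5 folds (only the first one used).
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, val_idx = next(sgkf.split(rest, y=rest["language"], groups=rest["topic_id"]))

train_pairs = pd.concat([pairs[is_source], rest.iloc[train_idx][pairs.columns]], ignore_index=True)
val_pairs = rest.iloc[val_idx]
```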

Text Processing

  • Created topic tree
  • Created special tokens for each value that language and content kind can take.
  • Created identifier separators for topic title, topic tree, topic description, content title, content description and content text.

My final inputs to the model looked like this (a sketch of how these strings can be assembled follows the examples):

  • Topic: [<[language_en]>] [<[topic_title]>] videos [<[topic_tree]>] maths g3 to g10 > maths > g6 > 17. geometrical constructions > perpendicular and perpendicular bisector > videos [<[topic_desc]>] nan
  • Content: [<[language_en]>] [<[kind_exercise]>] [<[cntnt_title]>] level 3: identify elements of simple machine(axle,wheel,pulley and inclined plane etc [<[cntnt_desc]>] nan [<[cntnt_text]>] nan
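
A rough sketch of how such strings could be assembled. The special-token spelling is taken from the examples above, while the field names (`title`, `tree`, `description`, `text`, `kind`, `language`) are assumptions:

```python
def build_topic_text(row) -> str:
    # row: a dict or pandas Series holding one topic; missing fields fall back to "nan".
    return " ".join([
        f"[<[language_{row['language']}]>]",
        "[<[topic_title]>]", str(row.get("title", "nan")),
        "[<[topic_tree]>]", str(row.get("tree", "nan")),
        "[<[topic_desc]>]", str(row.get("description", "nan")),
    ])

def build_content_text(row) -> str:
    return " ".join([
        f"[<[language_{row['language']}]>]",
        f"[<[kind_{row['kind']}]>]",
        "[<[cntnt_title]>]", str(row.get("title", "nan")),
        "[<[cntnt_desc]>]", str(row.get("description", "nan")),
        "[<[cntnt_text]>]", str(row.get("text", "nan")),
    ])

# The new special tokens also have to be registered with the tokenizer, e.g. in HuggingFace style:
#   tokenizer.add_tokens(["[<[language_en]>]", "[<[topic_title]>]", ...])
#   model.resize_token_embeddings(len(tokenizer))
```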

Training Sentence-Transformer (Stage 1)
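
The README does not spell out the Stage 1 setup, so the following is only a plausible sketch: a bi-encoder fine-tuned on the positive (topic, content) pairs with MultipleNegativesRankingLoss from sentence-transformers. The base checkpoint, hyperparameters, and variable names (`train_topic_texts`, `train_content_texts`) are assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed multilingual base checkpoint; the actual model used may differ.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Positive (topic_text, content_text) pairs built from the train split.
train_examples = [
    InputExample(texts=[t, c])
    for t, c in zip(train_topic_texts, train_content_texts)
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=64)

# In-batch negatives: every other pair in the batch acts as a negative for a given topic.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=5,                        # assumed
    warmup_steps=100,                # assumed
    output_path="stage1-biencoder",  # hypothetical output path, reused below
)
```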

Retrieve with kNN using Stage 1 Embeddings

I used kNN from RAPIDS and retrieved the 100 closest content embeddings for each topic embedding using cosine similarity, for example:
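
A minimal sketch with cuML's NearestNeighbors, assuming `topic_embeddings` and `content_embeddings` are the Stage 1 embedding matrices (names are illustrative):

```python
import cupy as cp
from cuml.neighbors import NearestNeighbors

content_emb = cp.asarray(content_embeddings, dtype=cp.float32)
topic_emb = cp.asarray(topic_embeddings, dtype=cp.float32)

# RAPIDS kNN with cosine distance: 100 candidate contents per topic.
knn = NearestNeighbors(n_neighbors=100, metric="cosine")
knn.fit(content_emb)
distances, indices = knn.kneighbors(topic_emb)  # indices[i] = candidate content rows for topic i
```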

Training Cross-Encoder (Stage 2)

  • Base Model: Trained model from Stage 1
  • Output: Sigmoid
  • Sequence Length: 128
  • Epochs: 15
  • Batch Size: 256
  • Warm-Up Ratio: 0.05
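
The settings above map fairly directly onto sentence-transformers' CrossEncoder; the sketch below assumes the hypothetical `stage1-biencoder` output path from Stage 1 and candidate pairs (`pair_texts`, `pair_labels`) produced by the kNN step, labeled 1 for true topic-content matches and 0 otherwise.

```python
import math
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

samples = [
    InputExample(texts=[topic_text, content_text], label=float(y))
    for (topic_text, content_text), y in zip(pair_texts, pair_labels)
]
loader = DataLoader(samples, shuffle=True, batch_size=256)

# Stage 1 checkpoint as the base, one sigmoid output, sequences truncated at 128 tokens.
model = CrossEncoder("stage1-biencoder", num_labels=1, max_length=128)

num_epochs = 15
warmup_steps = math.ceil(len(loader) * num_epochs * 0.05)  # warm-up ratio 0.05

model.fit(train_dataloader=loader, epochs=num_epochs, warmup_steps=warmup_steps)
model.save("stage2-crossencoder")  # hypothetical output path
```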

Inference

  • Ran all the steps above sequentially in a single script.
  • Tuned the classification threshold on the hold-out validation set to maximize the F2-score (see the sketch after this list).
  • Imputed empty topic rows with the highest scoring content IDs.
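
A small sketch of the threshold search, using pair-level F2 on the validation candidates as a proxy for the competition's per-topic mean F2. `val_scores` (cross-encoder sigmoid outputs) and `val_labels` (0/1 ground truth) are assumed variable names.

```python
import numpy as np
from sklearn.metrics import fbeta_score

best_thr, best_f2 = 0.5, -1.0
for thr in np.arange(0.01, 0.99, 0.01):
    f2 = fbeta_score(val_labels, (val_scores >= thr).astype(int), beta=2)
    if f2 > best_f2:
        best_thr, best_f2 = thr, f2

print(f"best threshold = {best_thr:.2f}, F2 = {best_f2:.4f}")
```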

What Didn't Work or Improve

  • Language specific kNN
  • Smaller models
  • Lower sequence length
  • Lower batch size
  • Union submission blending
