jhashekhar / multilingual-clf

Classification of a multilingual dataset using pre-trained models fine-tuned only on English training data. Models are trained on TPUs using PyTorch and the torch_xla library.

multilingual-clf

Data

The data comes from the Kaggle competition Jigsaw Multilingual Toxic Comment Classification.

Workings

See my notebook for how everything fits together: Kaggle Notebook

  • Use PyTorch nightly. The combination of PyTorch and torch_xla is often unstable.

  • The bert-multilingual-uncased model works without trouble. There are no SIGKILL or other memory issues.

  • The xlm-roberta-base model also works, with batch_size=8.

  • xlm-roberta-large is a lot trickier. It requires explicit garbage collection and loading the dataloader only once.

    • The model needs to be created only once and wrapped with the wrapper provided by the torch_xla library.
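
The xlm-roberta-large notes above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the wrapper in question is `xmp.MpModelWrapper`, that the model is loaded through Hugging Face `transformers`, and that `batch_size=8` carries over from xlm-roberta-base. The names `FLAGS`, `build_loader`, `train_fn` and `main`, and the learning rate, sequence length and epoch count, are hypothetical.

```python
import gc

MODEL_NAME = "xlm-roberta-large"
# Hypothetical settings; batch_size=8 is carried over from the xlm-roberta-base note.
FLAGS = {"batch_size": 8, "epochs": 1, "lr": 1e-5, "max_len": 128}

# Both are created once in main() and inherited by the forked TPU workers.
WRAPPED_MODEL = None
TRAIN_LOADER = None


def build_loader(texts, labels, batch_size, max_len):
    """Tokenize a list of strings and wrap the tensors in a DataLoader."""
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                            torch.tensor(labels, dtype=torch.float))
    # For brevity no DistributedSampler is used, so every core sees all the data.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)


def train_fn(index, flags):
    """Runs once per TPU core."""
    import torch
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()
    # Move the shared, already-built model to this core's device.
    model = WRAPPED_MODEL.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=flags["lr"])
    loss_fn = torch.nn.BCEWithLogitsLoss()

    for _ in range(flags["epochs"]):
        model.train()
        device_loader = pl.ParallelLoader(TRAIN_LOADER, [device]).per_device_loader(device)
        for input_ids, attention_mask, labels in device_loader:
            optimizer.zero_grad()
            # Index 0 holds the logits when no labels are passed to the model.
            logits = model(input_ids=input_ids, attention_mask=attention_mask)[0]
            loss = loss_fn(logits.squeeze(-1), labels)
            loss.backward()
            xm.optimizer_step(optimizer)
        # Drop the per-epoch loader and collect garbage explicitly.
        del device_loader
        gc.collect()


def main(texts, labels):
    """Build the model and dataloader once, then fork one worker per core."""
    global WRAPPED_MODEL, TRAIN_LOADER
    import torch_xla.distributed.xla_multiprocessing as xmp
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
    WRAPPED_MODEL = xmp.MpModelWrapper(model)
    TRAIN_LOADER = build_loader(texts, labels, FLAGS["batch_size"], FLAGS["max_len"])
    # MpModelWrapper is meant to be used with the "fork" start method.
    xmp.spawn(train_fn, args=(FLAGS,), start_method="fork")
```

Calling `main(texts, labels)` with a list of comment strings and a matching list of 0/1 labels would start training on all available cores. Heavy imports are deferred into the functions so the module itself loads without a TPU runtime; in a notebook they would normally sit at the top.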

Todo

  • Add multi-sample dropout
  • Add mixed precision training


