danjohnvelasco / Filipino-ULMFiT

Pre-trained AWD-LSTM language model trained on Filipino text corpus using fastai v2. Instructions included.

Filipino-ULMFiT

This is an accompanying repository to my paper:

Contents

  • Instructions for downloading the pre-trained language model.
  • A Jupyter notebook showing how to use the pre-trained model for a text classification task with fastai v2. [notebook]

Contributions

  • Release a pre-trained AWD-LSTM language model for Filipino, trained with fastai v2.
  • Benchmark AWD-LSTM on the Hate Speech Dataset. [reference]

Requirements

  • fastai v2 or later
  • NVIDIA GPU (all experiments were run on Colab with a Tesla T4)
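
The requirements above can be checked before running anything else. The following is a minimal sketch, not part of the original repository: the helper names `parse_major_minor`, `fastai_is_v2_or_later`, and `gpu_available` are hypothetical, and the GPU check assumes PyTorch (which fastai depends on) is installed.

```python
import re
from importlib import metadata


def parse_major_minor(version):
    """Turn a version string such as '2.7.12' into the tuple (2, 7)."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def fastai_is_v2_or_later():
    """Return True if an installed fastai distribution is version 2.0 or newer."""
    try:
        installed = metadata.version("fastai")
    except metadata.PackageNotFoundError:
        return False
    return parse_major_minor(installed) >= (2, 0)


def gpu_available():
    """Return True if PyTorch is importable and reports a CUDA device."""
    try:
        import torch  # imported lazily so the version check works without it
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("fastai v2 or later:", fastai_is_v2_or_later())
    print("CUDA GPU available:", gpu_available())
```

On Colab, a GPU runtime has to be selected for the second check to report True.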

Language Model

| Total Epochs | Dataset Size | Train Set | Val Set | Accuracy | Perplexity | Total Training Time | Dataset |
|---|---|---|---|---|---|---|---|
| 20 | 160428 | 90% | 10% | 86.71% | 2.028250 | 26 h | WikiText-TL-39 |
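
To put the perplexity figure in context, it can be converted back to a loss value. This is a small worked example, assuming the reported perplexity is the exponential of the mean cross-entropy loss in nats, which is how fastai's `Perplexity` metric is defined. The function names are illustrative only.

```python
import math


def loss_from_perplexity(perplexity):
    """Cross-entropy loss (in nats) implied by a perplexity value."""
    return math.log(perplexity)


def perplexity_from_loss(loss):
    """Perplexity implied by a mean cross-entropy loss (in nats)."""
    return math.exp(loss)


reported_perplexity = 2.028250  # value from the results row above
implied_loss = loss_from_perplexity(reported_perplexity)
print(f"Implied cross-entropy loss: {implied_loss:.4f}")  # 0.7072
```

Under that assumption, the reported perplexity corresponds to a loss of roughly 0.71 nats per token.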

Download pre-trained language model

# Install gdown
pip install gdown

# Make directory
mkdir models

# Download the pre-trained model archive (pretrained.zip) from Google Drive
gdown --id 19jdv8-XEbDNiqlm_lPb1csbVZYkn3gfA

# Unzip
unzip pretrained.zip -d models

# Finally
The 'models' directory should now contain two files:
1. finetuned_weights_20.pth (pre-trained weights)
2. vocab.pkl (vocabulary)

Both files are used later for language model fine-tuning.
See the accompanying Jupyter notebook for usage.
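
As a rough illustration of how the two downloaded files could be plugged into fastai v2, here is a hedged sketch. It is not taken from the accompanying notebook, which remains the authoritative usage reference. The helpers `pretrained_fnames` and `build_learner` are hypothetical, `dls` is assumed to be language-model `DataLoaders` built from your own corpus, and the sketch assumes the weights are compatible with fastai's default `AWD_LSTM` configuration.

```python
from pathlib import Path

# File names as listed in the download instructions.
WEIGHTS_FILE = "finetuned_weights_20.pth"
VOCAB_FILE = "vocab.pkl"


def pretrained_fnames(model_dir="models"):
    """Check that both files exist and return their names without extensions.

    fastai's `pretrained_fnames` argument expects [weights, vocab] stems;
    it appends '.pth' and '.pkl' itself.
    """
    model_dir = Path(model_dir)
    missing = [name for name in (WEIGHTS_FILE, VOCAB_FILE)
               if not (model_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {model_dir}: {', '.join(missing)}")
    return [Path(WEIGHTS_FILE).stem, Path(VOCAB_FILE).stem]


def build_learner(dls, base_path=".", model_dir="models"):
    """Hypothetical helper: create a language-model learner from the download."""
    # Imported lazily so the file check above works without fastai installed.
    from fastai.text.all import AWD_LSTM, language_model_learner

    fnames = pretrained_fnames(Path(base_path) / model_dir)
    return language_model_learner(
        dls, AWD_LSTM,
        pretrained=True,
        pretrained_fnames=fnames,  # resolved relative to path/model_dir
        path=Path(base_path),
        model_dir=model_dir,
    )
```

Calling `build_learner(dls)` from the directory that contains `models` would return a learner ready for fine-tuning, provided the assumptions above hold.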

Acknowledgements

Big thanks to Blaise Cruz for answering my questions and for nudging me in the right direction.
