# Taglish-Electra

Kumusta mga kaibigan, and hello friends! This repository documents our process for building a bilingual Tagalog-English ELECTRA model.
Data Download
Our data consists of two Tagalog datasets equal to approxiamtely 1.5 GB of tagalog training data and 500MB of English.
- WikiText-TL-39: large-scale, unlabeled Filipino text dataset with 39 million tokens
- TLUnified Large Scale Corpus: unlabeled Filipino text dataset
- OpenWebText: a small subset of the open-source 38 GB corpus of text drawn from social media forum posts
## Code

### Environment Setup

Initialize your conda environment:
```bash
conda create -n your_env python=3.7
conda activate your_env
git clone https://github.com/charityking2358/Taglish-Electra.git
cd Taglish-Electra
pip install -r requirements.txt
```
### Processing the Data

This bash script combines all of our data and uses the bert-base-multilingual-cased tokenizer to tokenize our corpus.
```bash
bash pre-process-data-multi.sh
```
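For reference, the sketch below shows what this tokenization step does to a code-switched sentence. It is a minimal illustration using the Hugging Face `bert-base-multilingual-cased` tokenizer directly, not the pre-processing script itself, and the sample sentence is invented:

```python
from transformers import AutoTokenizer

# Load the same multilingual WordPiece tokenizer the pre-processing script uses.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# A hypothetical code-switched (Taglish) sentence for illustration.
sentence = "Kumusta ka? I hope you are doing well today."

tokens = tokenizer.tokenize(sentence)
print(tokens)                            # WordPiece sub-tokens of the mixed-language text
print(tokenizer(sentence)["input_ids"])  # token ids with [CLS]/[SEP] added
```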
### Pre-training our model

This script builds the pre-training dataset, runs the ELECTRA pre-training script with our parameters, and finally converts all ELECTRA checkpoints to TensorFlow checkpoints to enable Hugging Face model submission.
```bash
bash train.sh
```
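As a rough illustration of the final conversion step, Transformers can load a TensorFlow ELECTRA checkpoint into PyTorch via `from_tf=True` and re-save it in Hub-ready format. The paths and the small-discriminator config below are placeholders, not outputs guaranteed by `train.sh`:

```python
from transformers import ElectraConfig, ElectraForPreTraining

# Hypothetical checkpoint layout; train.sh's actual output paths may differ.
config = ElectraConfig.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained(
    "checkpoints/taglish-electra/model.ckpt.index",  # TF checkpoint index file
    from_tf=True,                                    # read TensorFlow weights into PyTorch
    config=config,
)
model.save_pretrained("hf-export/taglish-electra")   # PyTorch weights, ready for the Hub
```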
## Pre-trained ELECTRA Models

We released new ELECTRA models in small configurations for the discriminator. Our models are available on Hugging Face Transformers and can be used with both PyTorch and TensorFlow.

- Taglish-Electra: charityking2358/taglish-electra-55K
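For a quick check in PyTorch, the discriminator can be pulled straight from the Hub. A minimal sketch, assuming the Hub repo ships a tokenizer alongside the weights (the TensorFlow classes, e.g. `TFElectraForPreTraining`, work the same way):

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("charityking2358/taglish-electra-55K")
model = ElectraForPreTraining.from_pretrained("charityking2358/taglish-electra-55K")

# Score each token: the discriminator predicts whether it was replaced.
inputs = tokenizer("Masaya ako because the weather is nice.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.round(torch.sigmoid(logits)))  # 1 = predicted "replaced"
```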
## To Evaluate Against a Benchmark

- Hate Speech Setup: We use the benchmark Tagalog ELECTRA model and an annotated hate-speech dataset to measure our model's performance.
```bash
git clone https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks.git
mkdir Filipino-Text-Benchmarks/data

# Hate speech dataset
wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/hatenonhate/hatespeech_raw.zip
unzip hatespeech_raw.zip -d Filipino-Text-Benchmarks/data && rm hatespeech_raw.zip
```
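Before fine-tuning, it can help to sanity-check the splits. A small pandas sketch, assuming the CSVs use `text` and `label` columns (check the actual headers after unzipping):

```python
import pandas as pd

# Inspect row counts and class balance for each split.
for split in ("train", "valid", "test"):
    df = pd.read_csv(f"Filipino-Text-Benchmarks/data/hatespeech/{split}.csv")
    print(split, df.shape)
    print(df["label"].value_counts(), "\n")  # column name is an assumption
```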
- Run the benchmark ELECTRA model:
```bash
export DATA_DIR='Filipino-Text-Benchmarks/data/hatespeech'

python Filipino-Text-Benchmarks/train.py \
    --pretrained jcblaise/electra-tagalog-small-cased-discriminator \
    --train_data ${DATA_DIR}/train.csv \
    --valid_data ${DATA_DIR}/valid.csv \
    --test_data ${DATA_DIR}/test.csv \
    --data_pct 1.0 \
    --checkpoint finetuned_model \
    --do_train true \
    --do_eval true \
    --msl 128 \
    --optimizer adam \
    --batch_size 32 \
    --add_token [LINK],[MENTION],[HASHTAG] \
    --weight_decay 1e-8 \
    --learning_rate 2e-4 \
    --adam_epsilon 1e-6 \
    --warmup_pct 0.1 \
    --epochs 3 \
    --seed 42
```
- Compare with our model:
```bash
python Filipino-Text-Benchmarks/train.py \
    --pretrained charityking2358/taglish-electra-55K \
    --train_data ${DATA_DIR}/train.csv \
    --valid_data ${DATA_DIR}/valid.csv \
    --test_data ${DATA_DIR}/test.csv \
    --data_pct 1.0 \
    --checkpoint finetuned_model \
    --do_train true \
    --do_eval true \
    --msl 128 \
    --optimizer adam \
    --batch_size 32 \
    --add_token [LINK],[MENTION],[HASHTAG] \
    --weight_decay 1e-8 \
    --learning_rate 2e-4 \
    --adam_epsilon 1e-6 \
    --warmup_pct 0.1 \
    --epochs 3 \
    --seed 42
```
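After both runs finish, compare the reported validation and test scores. To try a fine-tuned checkpoint interactively, here is a hedged sketch, assuming `train.py`'s `--checkpoint` flag writes a standard Transformers-style checkpoint to `finetuned_model` and that label index 1 means hate speech (verify both against the benchmark repo):

```python
import torch
from transformers import AutoTokenizer, ElectraForSequenceClassification

# Assumed output directory of train.py's --checkpoint flag.
tokenizer = AutoTokenizer.from_pretrained("finetuned_model")
model = ElectraForSequenceClassification.from_pretrained("finetuned_model")
model.eval()

inputs = tokenizer("Ang ganda ng panahon today!", return_tensors="pt")
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1).item()
print("hate" if pred == 1 else "non-hate")  # label mapping is an assumption
```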