- This repo is dedicated to building a classification pipeline for the problem of Arabic dialect identification; the label set contains 18 classes/dialects, addressed with both classic ML and modern DL approaches.
- The dataset and the dialect identification problem were introduced by the Qatar Computing Research Institute. More details: https://arxiv.org/pdf/2005.06557.pdf
- I implemented a data scraper, a data preprocessor (using regex), data modeling (SVM with TF-IDF, and MARBERT using HuggingFace), and a deployment script that serves the model locally through Flask APIs.
- Data fetching is done by sending 1000 IDs (the maximum JSON length) per POST request using a start and end index, parsing the response content, and appending it to the data frame. There is a small trick in the request loop: if the end index would exceed the length of the ID list, it is clamped to the remaining IDs, as sketched below.
- The output of this process is a data frame containing the retrieved texts and their labels.
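
A minimal sketch of the batched fetching loop described above. The endpoint URL, the helper name, and the response shape (a JSON object mapping each ID to its text) are assumptions for illustration:

```python
import requests
import pandas as pd

BATCH_SIZE = 1000  # maximum JSON length accepted per request

def fetch_texts(ids, url="https://example.com/dialect-texts"):  # hypothetical endpoint
    """Fetch the text for each ID in batches, clamping the last batch."""
    frames = []
    start = 0
    while start < len(ids):
        # The trick: clamp the end index to the remaining IDs on the last batch.
        end = min(start + BATCH_SIZE, len(ids))
        response = requests.post(url, json=[str(i) for i in ids[start:end]])
        response.raise_for_status()
        payload = response.json()  # assumed shape: {id: text, ...}
        frames.append(pd.DataFrame({"id": list(payload.keys()),
                                    "text": list(payload.values())}))
        start = end
    return pd.concat(frames, ignore_index=True)
```
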
- I have implemented two classes, one for preprocessing and the other for tokenization.
- Preprocessing has a pipeline that applies letter normalization, tashkeel removal, character substitution, and removal of symbols, punctuation, extra spaces, quotation marks, and English letters. All of this is done with Python's re (regex) module.
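
A minimal sketch of what such a preprocessing class might look like; the class and method names are assumed, and the regex patterns are common choices for Arabic normalization rather than the repo's exact ones:

```python
import re

class Preprocessor:
    """Regex-based cleaning pipeline (a sketch; names assumed)."""

    TASHKEEL = re.compile(r"[\u0617-\u061A\u064B-\u0652]")  # Arabic diacritics

    def normalize_letters(self, text):
        # Unify common letter variants: alef forms, alef maqsura, teh marbuta.
        text = re.sub(r"[إأآا]", "ا", text)
        text = re.sub(r"ى", "ي", text)
        text = re.sub(r"ة", "ه", text)
        return text

    def clean(self, text):
        text = self.normalize_letters(text)
        text = self.TASHKEEL.sub("", text)        # remove tashkeel
        text = re.sub(r"[A-Za-z]", "", text)      # drop English letters
        text = re.sub(r"[^\w\s]", "", text)       # symbols, punctuation, quotes
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        return text
```
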
- Tokenization has a pipeline that applies standard word tokenization, stop-word removal, and removal of repeated words. There is also a small EDA in the notebook.
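
A minimal sketch of the tokenization class, assuming NLTK for word tokenization and its Arabic stop-word list (the repo's actual tokenizer may differ):

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

class Tokenizer:
    """Tokenization pipeline (a sketch; names assumed)."""

    def __init__(self):
        # Requires: nltk.download("punkt"); nltk.download("stopwords")
        self.stop_words = set(stopwords.words("arabic"))

    def tokenize(self, text):
        tokens = word_tokenize(text)                               # plain tokenization
        tokens = [t for t in tokens if t not in self.stop_words]   # drop stop words
        # Remove repeated words while preserving order (seen.add returns None).
        seen = set()
        return [t for t in tokens if not (t in seen or seen.add(t))]
```
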
- Before training any models, I prepared the data by dropping rows with missing values in the class (dialect) and text columns.
- There was an imbalance across the dialects, so I used SMOTE (Synthetic Minority Oversampling Technique) to balance the classes. The dataset is large, but undersampling would discard data, and more text data helps ML models learn better.
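
SMOTE operates on numeric features, so in a text pipeline it is typically applied after vectorization. A minimal sketch with imbalanced-learn, where `X` is assumed to be the TF-IDF feature matrix and `y` the dialect labels:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# X: TF-IDF feature matrix, y: dialect labels (both assumed to exist already).
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(Counter(y_resampled))  # every class now has equal support
```
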
- I used the TF-IDF vectorizer, as it is one of the best statistical vector-space representations used in classic NLP solutions.
- The trained model is a linear SVM, as it is one of the best linear models for text classification problems.
- SVM captures relationships between features better than naïve Bayes does, and it works well with well-prepared vector spaces, which is exactly the setting in text classification; see the sketch below.
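
A minimal sketch of the TF-IDF + linear SVM setup with scikit-learn; the column names, split ratio, and vectorizer hyperparameters are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# df is assumed to hold the cleaned "text" and "dialect" columns.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["dialect"], test_size=0.2,
    stratify=df["dialect"], random_state=42)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=100_000)),
    ("svm", LinearSVC()),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```
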
- Trained using Kaggle (CPU).
- I used MARBERT (a deep bidirectional transformer for Arabic).
- MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and MSA.
- Fine-tuning a pre-trained dialectal Arabic transformer should excel at this task beyond any BiLSTM/LSTM model, as its word embeddings were trained on 1B Arabic tweets drawn from a large in-house dataset of about 6B tweets.
- MARBERT has the same network architecture as ARBERT (BERT-base), but was trained without the next sentence prediction (NSP) objective, since tweets are short.
- The MARBERT tokenizer and the fine-tuned transformer will be reused later in deployment.
- Trained using Kaggle GPU (NVIDIA V100).
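
A minimal fine-tuning sketch using the HuggingFace Trainer API. `UBC-NLP/MARBERT` is the public Hub checkpoint; the hyperparameters, output path, and `train_dataset` are assumptions for illustration:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "UBC-NLP/MARBERT"  # MARBERT checkpoint on the HuggingFace Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels=18)

def tokenize_batch(batch):
    # Tweets are short, so a modest max_length is enough.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# train_dataset is assumed to be a `datasets` split already mapped through
# tokenize_batch, with a "labels" column holding encoded dialect IDs.
args = TrainingArguments(output_dir="marbert-dialect",  # assumed output path
                         num_train_epochs=3,            # assumed hyperparameters
                         per_device_train_batch_size=32)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("marbert-dialect")  # reused later by the Flask deployment
```
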
- This is a small script, based on the Flask micro-framework, that takes input from the user and displays the output on a static page. All back-end processing is done in the previous stages.
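
A minimal sketch of such a Flask script; the route, the `index.html` template, and the `predict_dialect` helper are hypothetical:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        text = request.form.get("text", "")
        # predict_dialect is a hypothetical helper that runs the cleaned text
        # through the saved MARBERT tokenizer and fine-tuned model.
        prediction = predict_dialect(text)
    return render_template("index.html", prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)
```
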
- The imbalanced-learn module uses macro averaging internally.
- Macro averaging is perhaps the most straightforward of the numerous averaging methods.
- The macro-averaged F1 score (macro F1) is computed by taking the arithmetic (unweighted) mean of the per-class F1 scores.
- This method treats all classes equally regardless of their support values. When working with an imbalanced dataset where all classes are equally important, macro averaging is a good choice.
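
A small worked example with scikit-learn, confirming that the macro F1 equals the unweighted mean of the per-class F1 scores:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2]  # toy imbalanced labels
y_pred = [0, 0, 1, 0, 1, 0, 2]

per_class = f1_score(y_true, y_pred, average=None)  # one F1 score per class
macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean of the above
assert abs(macro - per_class.mean()) < 1e-9
```
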
You need Python 3 and a set of Python libraries; all of them are provided in env.yml.
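Assuming env.yml is a conda environment file, it can typically be installed with `conda env create -f env.yml` and activated with `conda activate <env-name>`.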