OoFa99 / ArabBert_Tokenizer

A simple Arabic text tokenizer built on “aubmindlab/bert-base-arabertv2”, the AraBERT model from AUB MIND Lab.

ArabBert_Tokenizer

  • ArabBERT_Tokenizer: Open In Colab

Goal:

  • Write a sample tokenizer and test it, using sample code provided on GitHub and Google Colab.

Steps:

  1. Install the arabert and transformers packages.
  2. Import the tokenizer and model loaders with from transformers import AutoTokenizer, AutoModel.
  3. Import the text preprocessing tool with from arabert.preprocess import ArabertPreprocessor.
  4. Select the model by setting model_name = "aubmindlab/bert-base-arabertv2".
  5. Test the tokenizer and the preprocessor on different forms of Arabic text:
    • العربية الفصحى (Standard Arabic, without diacritics)
    • الْعَرَبِيَّةِ الْفُصْحَى (Standard Arabic with diacritics, produced using Shakkala)
    • Egyptian Arabic text.
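
The steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the names MODEL_NAME, SAMPLES and tokenize_arabic are hypothetical, and the two Arabic strings from the list above are reused as sample inputs. It assumes the packages from step 1 are installed (pip install arabert transformers) and that the model can be downloaded. AutoModel from step 2 is left out because only tokenization is shown.

```python
MODEL_NAME = "aubmindlab/bert-base-arabertv2"

# Sample inputs (hypothetical choice): the undiacritized and diacritized
# strings from the list above. An Egyptian Arabic sentence can be appended.
SAMPLES = [
    "العربية الفصحى",
    "الْعَرَبِيَّةِ الْفُصْحَى",
]


def tokenize_arabic(text, model_name=MODEL_NAME):
    """Preprocess `text` with ArabertPreprocessor, then split it into tokens.

    Returns a (preprocessed_text, tokens) pair.
    """
    # Imported inside the function so this file can be loaded before the
    # arabert and transformers packages are installed.
    from arabert.preprocess import ArabertPreprocessor
    from transformers import AutoTokenizer

    preprocessor = ArabertPreprocessor(model_name=model_name)
    # Downloads the tokenizer files on first use (needs network access).
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    cleaned = preprocessor.preprocess(text)
    tokens = tokenizer.tokenize(cleaned)
    return cleaned, tokens
```

Calling tokenize_arabic(SAMPLES[0]) and tokenize_arabic(SAMPLES[1]) and comparing the two results is one way to see how the preprocessor and tokenizer treat diacritized versus plain text.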
