dasayan05 / bnlp

BNLP is a natural language processing toolkit for the Bengali language.

Home Page: https://bnlp.readthedocs.io

bnlp

Bengali Natural Language Processing (BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, tag Bengali parts of speech, and construct neural models for Bengali NLP tasks.

NB: If you are a researcher and refer to this tool in a paper, please let us know and we will include a link to the paper here.

Installation

PIP installer (tested with Python 3.5, 3.6, and 3.7)

pip install bnlp_toolkit

Local Installer

$ git clone https://github.com/sagorbrur/bnlp.git
$ cd bnlp
$ python setup.py install

Pretrained Model

Download Link

Training Details

  • The SentencePiece, Word2Vec, Fasttext, and GloVe models were trained on the Bengali Wikipedia dump dataset.
  • SentencePiece training vocab size = 50000
  • Fasttext was trained with total words = 20M, vocab size = 1171011, epoch = 50, and embedding dimension = 300; the training loss was 0.318668.
  • Word2Vec word embedding dimension = 300
  • For details on the Bengali GloVe word vectors and their training process, see this repository.
  • The Bengali CRF POS tagger was trained on the nltr dataset and reaches 80% accuracy.

Tokenization

  • Bengali SentencePiece Tokenization

    • Tokenization Using Trained Model
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer()
      model_path = "./model/bn_spm.model"
      input_text = "আমি ভাত খাই। সে বাজারে যায়।"
      tokens = bsp.tokenize(model_path, input_text)
      print(tokens)
      text2id = bsp.text2id(model_path, input_text)
      print(text2id)
      id2text = bsp.id2text(model_path, text2id)
      print(id2text)
    • Training SentencePiece
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer()
      data = "test.txt"
      model_prefix = "test"
      vocab_size = 5  # toy value for a tiny test file; the pretrained model above used 50000
      bsp.train_bsp(data, model_prefix, vocab_size)
  • Basic Tokenizer

    from bnlp.basic_tokenizer import BasicTokenizer
    basic_t = BasicTokenizer()
    raw_text = "আমি বাংলায় গান গাই।"
    tokens = basic_t.tokenize(raw_text)
    print(tokens)
    
    # output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
  • NLTK Tokenization

    from bnlp.nltk_tokenizer import NLTK_Tokenizer
    
    text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
    bnltk = NLTK_Tokenizer()
    word_tokens = bnltk.word_tokenize(text)
    sentence_tokens = bnltk.sentence_tokenize(text)
    print(word_tokens)
    print(sentence_tokens)
    
    # output
    # word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]
    # sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]
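
To make the word and sentence outputs above concrete, here is a minimal sketch of rule-based Bengali tokenization using only the Python standard library. It is not how BasicTokenizer or NLTK_Tokenizer are implemented; the regular expressions, the punctuation set, and the function names word_tokenize and sentence_tokenize are assumptions chosen for illustration.

```python
import re

# Assumption: a token is either a run of characters that are neither
# whitespace nor punctuation, or a single punctuation mark
# (including the Bengali danda "।").
TOKEN_PATTERN = re.compile(r"[^\s।?!,;:\-]+|[।?!,;:\-]")


def word_tokenize(text):
    """Split text into words, keeping each punctuation mark as its own token."""
    return TOKEN_PATTERN.findall(text)


def sentence_tokenize(text):
    """Split on whitespace that follows a sentence-ending mark (।, ? or !)."""
    parts = re.split(r"(?<=[।?!])\s+", text.strip())
    return [part for part in parts if part]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
word_tokens = word_tokenize(text)
sentence_tokens = sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)
```

A rule-based splitter like this has no notion of abbreviations or numbers containing punctuation, which is one reason to prefer the library tokenizers for real text.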

Word Embedding

  • Bengali Word2Vec

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/bengali_word2vec.model"
      word = 'আমার'
      vector = bwv.generate_word_vector(model_path, word)
      print(vector.shape)
      print(vector)
    • Find Most Similar Word Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/bengali_word2vec.model"
      word = 'আমার'
      similar = bwv.most_similar(model_path, word)
      print(similar)
    • Train Bengali Word2Vec with your own data

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      bwv = Bengali_Word2Vec(True)
      data_file = "test.txt"
      model_name = "test_model.model"
      vector_name = "test_vector.vector"
      bwv.train_word2vec(data_file, model_name, vector_name)
      
  • Bengali FastText

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext()
      word = "গ্রাম"
      model_path = "model/bengali_fasttext.bin"
      word_vector = bft.generate_word_vector(model_path, word)
      print(word_vector.shape)
      print(word_vector)
      
    • Train Bengali FastText Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext()
      data = "data.txt"
      model_name = "saved_model.bin"
      epoch = 50
      bft.train_fasttext(data, model_name, epoch)
  • Bengali GloVe Word Vectors

    We trained a GloVe model on Bengali data (Wikipedia and news articles) and published the resulting Bengali GloVe word vectors.
    You can download them and use them for your own machine learning tasks.

    from bnlp.glove_wordvector import BN_Glove
    glove_path = "bn_glove.39M.100d.txt"
    word = "গ্রাম"
    bng = BN_Glove()
    res = bng.closest_word(glove_path, word)
    print(res)
    vec = bng.word2vec(glove_path, word)
    print(vec)
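
The closest-word and most-similar lookups above can be pictured as a nearest-neighbour search over word vectors. The sketch below shows the idea with the standard library only, assuming the vectors are stored in the usual GloVe text layout (one word per line followed by space-separated numbers). The helper names load_vectors, cosine_similarity, and closest_words are hypothetical, the three-dimensional vectors are invented toy data, and BNLP may rank words differently.

```python
import io
import math


def load_vectors(handle):
    """Parse GloVe-style text: a word followed by its vector components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_words(vectors, word, top_n=2):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for illustration; real files hold 100 or 300 dimensions.
toy_file = io.StringIO(
    "গ্রাম 0.9 0.1 0.0\n"
    "শহর 0.8 0.2 0.1\n"
    "নদী 0.1 0.9 0.2\n"
)
vectors = load_vectors(toy_file)
nearest = closest_words(vectors, "গ্রাম")
print(nearest)
```

For a real vector file, replace the StringIO object with open(path, encoding="utf-8"). A linear scan like this is slow on a large vocabulary, so it is best treated as a reading aid rather than a replacement for the library calls.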

Bengali POS Tagging

  • Bengali CRF POS Tagging

    • Find Pos Tag Using Pretrained Model

      from bnlp.bengali_pos import BN_CRF_POS
      bn_pos = BN_CRF_POS()
      model_path = "model/bn_pos_model.pkl"
      text = "আমি ভাত খাই।"
      res = bn_pos.pos_tag(model_path, text)
      print(res)
      # [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]
    • Train POS Tag Model

      from bnlp.bengali_pos import BN_CRF_POS
      bn_pos = BN_CRF_POS()
      model_name = "pos_model.pkl"
      tagged_sentences = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]
      
      bn_pos.training(model_name, tagged_sentences)
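
The tagged sentences above are lists of (token, tag) pairs, and the accuracy quoted under Training Details can be read as the share of tokens that receive the correct tag. The sketch below shows that data layout and a token-level accuracy calculation. The helper names split_tokens_and_tags and token_accuracy are hypothetical, the predicted tags are invented to include one mistake, and how the reported 80% was actually measured is not stated in this document.

```python
def split_tokens_and_tags(tagged_sentence):
    """Separate one tagged sentence into parallel token and tag lists."""
    tokens = [token for token, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    return tokens, tags


def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            if gold_tag == predicted_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0


gold = [[("আমি", "PPR"), ("ভাত", "NC"), ("খাই", "VM"), ("।", "PU")]]
# Hypothetical tagger output with one error: the verb is tagged NC.
predicted = [[("আমি", "PPR"), ("ভাত", "NC"), ("খাই", "NC"), ("।", "PU")]]

tokens, tags = split_tokens_and_tags(gold[0])
accuracy = token_accuracy(gold, predicted)
print(tokens)
print(tags)
print(accuracy)
```

To evaluate a model you trained yourself, hold back some tagged sentences from training, tag their raw text, and pass both versions to token_accuracy.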

Issue

  • If you get ModuleNotFoundError: No module named 'fasttext', run the following command:

pip install fasttext

  • If an nltk error occurs, run the following lines before importing bnlp:
import nltk
nltk.download("punkt")
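
Both issues above come down to a module that is not installed. A small check like the following sketch can report which ones are missing before you import bnlp. It uses only the standard library; the helper name missing_modules is hypothetical, and the check covers module availability only, not the nltk "punkt" data.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the two issues listed above.
missing = missing_modules(["fasttext", "nltk"])
for name in missing:
    print("Missing module: " + name + " (try: pip install " + name + ")")
```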

Contributor Guide

See the CONTRIBUTING.md page for details.

Thanks To

Contributor List

Extra Contributor

License: MIT License

