tuanlda78202 / nlps23

NLP Summer'23 - Vietnamese Poem Generator

Home Page: https://paperswithcode.com/search?q_meta=&q_type=&q=poem+generator


Vietnamese Poem Generator

Vietnam Reunification Day 30-4-1975

This is the source code for the "Vietnamese Poem Generator" project, developed for the course "Natural Language Processing", Summer 2023.


Abstract

Vietnamese poetry has a rich history that dates back to the 10th century, with the emergence of the "Lục Bát" (six-eight) form, which is characterized by alternating lines of six and eight syllables. Since then, Vietnamese poetry has undergone various transformations and has been influenced by different cultural and historical periods, such as Chinese Confucianism, French colonialism, and modernization. Vietnamese poetry often includes themes of love, nature, patriotism, and social issues, and follows strict rules and structures, such as tonal patterns, rhyming schemes, and word choice.
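As a concrete illustration of the Lục Bát constraint described above, the sketch below checks the 6-8 syllable alternation. This is an illustrative example, not project code; it relies on the fact that written Vietnamese separates syllables with spaces, so whitespace tokens approximate syllable counts.

```python
# Minimal sketch (assumption, not part of this repo): verify the alternating
# six- and eight-syllable line lengths of the Luc Bat form.
def luc_bat_lengths_ok(poem: str) -> bool:
    lines = [ln.strip() for ln in poem.strip().splitlines() if ln.strip()]
    expected = [6, 8]  # odd lines: 6 syllables, even lines: 8 syllables
    return all(len(ln.split()) == expected[i % 2] for i, ln in enumerate(lines))

# Opening couplet of "Truyen Kieu" (6 + 8 syllables).
couplet = "Trăm năm trong cõi người ta\nChữ tài chữ mệnh khéo là ghét nhau"
print(luc_bat_lengths_ok(couplet))  # True
```

Tonal patterns and rhyme schemes would require additional checks on each syllable's tone class, which this length-only sketch omits.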

This project proposes a Vietnamese poem generator that combines state-of-the-art natural language processing techniques, including diffusion models, transformers, GPT-2, and large language models (LLMs). The proposed model generates high-quality Vietnamese poems that adhere to the traditional rules and structures of Vietnamese poetry while also incorporating modern themes and language. The generator is trained on a large corpus of Vietnamese poetry and uses the diffusion technique to enhance the coherence and fluency of the generated poems. A transformer-based architecture handles encoding and decoding, while GPT-2 and LLMs are employed for language modeling and to improve the diversity of the generated poems.

The performance of the proposed model is evaluated through a set of quantitative and qualitative metrics, including perplexity, rhyme, and coherence. The experimental results demonstrate the effectiveness of the proposed model in generating high-quality Vietnamese poems that are both linguistically and aesthetically pleasing. The proposed model has potential applications in various fields, including literature, education, and art.
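Of the metrics listed above, perplexity is the standard quantitative one. The sketch below (an assumption, not the repo's evaluation code) shows how it is computed from the per-token log-probabilities a language model assigns to a poem:

```python
import math

# Hedged sketch (not this repo's eval code): perplexity is the exponential
# of the negative mean log-likelihood per token.
def perplexity(token_log_probs):
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns uniform probability 1/4 to each of 4 tokens
# has perplexity exactly 4.
print(perplexity([math.log(0.25)] * 4))  # 4.0
```

Lower perplexity means the model finds the text less surprising; rhyme and coherence, by contrast, are typically scored with task-specific heuristics or human judgment.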

Folder Structure

nlps23/
├── configs/ - training configs
│   ├── README.md - config naming conventions
│   ├── */README.md - model abstracts and experiment results
│   └── api-key/ - wandb API keys for monitoring
│
├── tools/ - scripts for downloading data, training, testing, inference, and the web interface
│
├── trainer/ - trainer classes
│
├── model/
│   ├── architecture/ - model architectures
│   └── README.md - loss and metric definitions
│
├── base/ - abstract base classes
│
├── data/ - input data storage
│
├── data_loader/ - custom datasets and dataloaders
│
├── saved/ - trained model configs, log dirs, and logging output
│
├── logger/ - module for wandb visualization and logging
│
└── utils/ - utility functions

Model Zoo

Traditional approach: Beam Search, HMM

Deep Learning approach: SP-GPT2 (ICMLA'2021), BARTpho (INTERSPEECH'2022), LD4LG (ACCV'2022)
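The traditional approach above relies on beam search over a scoring model. The following is a generic illustrative sketch (an assumption, not the repo's Beam Search/HMM implementation), where `score_next` is a hypothetical callback returning next-token candidates with log-probabilities:

```python
import heapq
import math

# Illustrative sketch (not this repo's code): generic beam search over a
# next-token scoring function score_next(seq) -> [(token, log_prob), ...].
def beam_search(score_next, start, beam_width=3, steps=4):
    beams = [(0.0, [start])]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for tok, tok_logp in score_next(seq):
                candidates.append((logp + tok_logp, seq + [tok]))
        # keep only the top-k highest-scoring partial sequences
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
    return beams[0][1]  # best-scoring sequence

# Toy scorer that always prefers token "b" over "a".
def toy(seq):
    return [("a", math.log(0.3)), ("b", math.log(0.7))]

print(beam_search(toy, "<s>", beam_width=2, steps=2))  # ['<s>', 'b', 'b']
```

In an HMM-based generator, `score_next` would combine transition and emission probabilities; with SP-GPT2 it would come from the model's softmax over the vocabulary.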

Usage

Install the required packages:

pip install -r requirements.txt

Running a private repository on Kaggle:

  1. Generate your personal access token
  2. Get the repo address from github.com/.../...git:

git clone https://your_personal_token@your_repo_address.git
cd nlps23

Config file format

Config files are in YAML format:
name: U2NetFull_scratch_1gpu-bs4_KNC_size320x320

n_gpu: 1

arch:
  type: u2net_full
  args: {}

data_loader:
  type: KNC_DataLoader
  args:
    batch_size: 4
    shuffle: true
    num_workers: 1
    validation_split: 0.1
    output_size: 320
    crop_size: 288

optimizer:
  type: Adam
  args:
    lr: 0.001
    weight_decay: 0
    eps: 1.e-8
    betas:
      - 0.9
      - 0.999

loss: multi_bce_fusion

metrics:
  - mae
  - sm

lr_scheduler:
  type: StepLR
  args:
    step_size: 50
    gamma: 0.1

trainer:
  type: Trainer

  epochs: 1000
  save_dir: saved/
  save_period: 10
  verbosity: 1

  visual_tool: wandb
  project: cvps23
  name: U2NetLite_scratch_1gpu-bs4_KNC_size320x320

  # Edit *username for tracking WandB multi-accounts
  api_key_file: ./configs/api-key/tuanlda78202
  entity: tuanlda78202
  
test:
  save_dir: saved/generated
  n_sample: 1000
  batch_size: 32
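Each component in the config above is described by a `type` name and an `args` mapping. A common pattern, sketched below under the assumption that the repo resolves types from a registry (the `Adam` stub and `REGISTRY` names are hypothetical, not project code), is to look up the class by `type` and construct it with `**args`:

```python
# Hedged sketch (assumption, not the repo's factory code): turning a
# "type"/"args" pair from the YAML config into an object.
class Adam:
    """Stand-in for an optimizer class; stores hyperparameters only."""
    def __init__(self, lr=0.001, weight_decay=0.0, eps=1e-8, betas=(0.9, 0.999)):
        self.lr, self.weight_decay, self.eps, self.betas = lr, weight_decay, eps, betas

REGISTRY = {"Adam": Adam}  # hypothetical name -> class lookup table

# Parsed fragment of the optimizer section above.
cfg = {"optimizer": {"type": "Adam", "args": {"lr": 0.001, "eps": 1e-8}}}

opt_cfg = cfg["optimizer"]
optimizer = REGISTRY[opt_cfg["type"]](**opt_cfg["args"])
print(optimizer.lr)  # 0.001
```

This keeps the config declarative: swapping the optimizer or scheduler only requires editing `type` and `args`, not the training code.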

Using config files

Modify the configurations in .yaml config files, then run:

python tools/train.py [CONFIG] [RESUME] [DEVICE] [BATCH_SIZE] [EPOCHS]
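A CLI of this shape is typically built with argparse. The sketch below is hypothetical (the actual flag names and defaults in tools/train.py may differ) and only illustrates how the options above could be wired up:

```python
import argparse

# Hypothetical sketch of the CLI implied by the command above;
# actual flags in tools/train.py may differ.
def build_parser():
    p = argparse.ArgumentParser(description="Train a Vietnamese poem generator")
    p.add_argument("--config", default="configs/default.yaml", help="path to a YAML config")
    p.add_argument("--resume", default=None, help="path to a checkpoint to resume from")
    p.add_argument("--device", default="cuda", help="device to train on")
    p.add_argument("--batch-size", type=int, default=4)
    p.add_argument("--epochs", type=int, default=1000)
    return p

args = build_parser().parse_args(["--batch-size", "8"])
print(args.batch_size)  # 8
```

Command-line values override the YAML defaults, which is convenient for quick experiments without editing config files.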

Resuming from checkpoints

You can resume from a previously saved checkpoint by:

python tools/train.py --resume path/to/the/ckpt

Evaluating

python tools/eval.py

Inference

Contributors
