JiaoPaner / fairseq-kd

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.



Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

Usage

This clone of fairseq supports Knowledge Distillation, Recurrent Stacking, LoRA, and RoPE for the Transformer model and the translation task. You can add the following flags to fairseq-train to use them:

  • Knowledge Distillation: The original implementation was sourced from LeslieOverfitting and MANGA-UOFA

    • Pure Word-Level Distillation (Hinton et al.) can be achieved by:

      • --task translation_with_kd --kd-strategy word_level --teacher-checkpoint-path $teacher_ckpt --criterion label_smoothed_cross_entropy_with_kd
      • Note that there is no NLL loss between the gold targets and the predictions; the only loss is the KL-divergence between the student and teacher distributions ($\mathcal{L} = \mathcal{L}_{KD}$)
    • Kim & Rush extend this idea and add an NLL loss between the predictions and the gold targets, modifying the loss to $\mathcal{L} = \mathcal{L}_{KD} + \mathcal{L}_{NLL}$ (a minimal sketch of this combined objective appears below, after the KD notes). The same can be achieved with the following flags:

      • --task translation_with_kd --kd-strategy word_seq_level --teacher-checkpoint-path $teacher_ckpt --criterion label_smoothed_cross_entropy_with_kd
    • Training with Batch-Level and Global-Level KD (Wang et al.) can be done as follows:

      • --task translation_with_kd --kd-strategy batch_level --teacher-checkpoint-path $teacher_ckpt --criterion label_smoothed_cross_entropy_with_kd --kd-rate $kd_rate
      • --task translation_with_kd --kd-strategy global_level --teacher-checkpoint-path $teacher_ckpt --criterion label_smoothed_cross_entropy_with_kd --kd-rate $kd_rate --kd-queue-size $kd_queue_sz
    • Lastly, the Global-Language-wise selection approach (Gumma et al.) can be used with:

      • --task translation_with_kd --kd-strategy global_language_wise --teacher-checkpoint-path $teacher_ckpt --criterion label_smoothed_cross_entropy_with_kd --kd-rate $kd_rate --kd-queue-size $kd_queue_sz --kd-language-tags $language_tags (note that $language_tags should be a comma-separated string of language tags)
    • Here, similar to Global-Level KD, each language has its own global FIFO queue, which makes it suitable for multilingual KD with imbalanced datasets. This technique requires adding language tags to each translation pair, similar to Ramesh et al. These tags help the model split the batch by language and push each part into the corresponding global language queue. Note that every FIFO language queue, irrespective of how abundant the language is, has the same size, i.e., $kd_queue_sz. I know this is not ideal, and I am working on an alternative.

    • UPDATE-1: Initially, the KD loss was implemented as the cross-entropy between the student and teacher model distributions, but it was very unstable in mixed-precision training and led to inf loss. Hence, the latest implementation uses KL-divergence, which is much more stable and easier to compute in PyTorch.

    • UPDATE-2: Based on Wen et al., newer variants of the KD loss have been implemented, viz. js_div and tvd. They can be used by setting the flag --kd-criterion $kd_criterion. By default, kl_div is used.
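
For intuition, here is a minimal, hedged sketch of the word-level objective described above: the KL-divergence between the student and teacher distributions, optionally combined with the gold-target NLL term of Kim & Rush. The function name and tensor layout are illustrative assumptions and do not reproduce this repository's exact criterion.

# Minimal sketch (not this repository's exact criterion) of word-level KD:
#   L = L_KD             for pure word-level distillation (Hinton et al.)
#   L = L_KD + L_NLL     for the Kim & Rush variant
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, targets, pad_idx, kd_only=False):
    # student_logits, teacher_logits: (batch, tgt_len, vocab); targets: (batch, tgt_len)
    student_lprobs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)

    # Token-level KL(teacher || student), summed over the vocabulary,
    # with padding positions masked out.
    kd_loss = F.kl_div(student_lprobs, teacher_probs, reduction="none").sum(-1)
    kd_loss = kd_loss.masked_fill(targets.eq(pad_idx), 0.0).sum()

    if kd_only:
        return kd_loss
    nll_loss = F.nll_loss(
        student_lprobs.reshape(-1, student_lprobs.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
        reduction="sum",
    )
    return kd_loss + nll_loss

The batch-level, global-level, and language-wise strategies above differ only in which target tokens the KD term is applied to (roughly, a --kd-rate fraction selected by ranking against the current batch or a FIFO queue of recent statistics), not in the form of the loss itself.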

  • Recurrent Stacking (Dabre & Fujita): RS is an extreme parameter sharing technique in which all the layers in the encoder/decoder are shared. Implementation-wise, only one layer exists in the module, and the rest $N-1$ are mere references to it. RS can be activated with the following flags: --encoder-recurrent-stacking $encoder_recurrent_stacking --decoder-recurrent-stacking $decoder_recurrent_stacking
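
As an illustration only (the class name is hypothetical, not this fork's actual code), recurrent stacking boils down to building a single layer and applying that same module object at every depth, so all $N$ positions share one set of parameters:

# Illustrative sketch of recurrent stacking: one layer is applied N times,
# so the "stack" has the depth of N layers but the parameters of one.
import torch.nn as nn

class RecurrentlyStackedBlock(nn.Module):
    def __init__(self, layer: nn.Module, num_layers: int):
        super().__init__()
        self.layer = layer            # the only real layer
        self.num_layers = num_layers  # N; the other N-1 "layers" are mere reuse

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)         # same weights (and gradients) every pass
        return x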

  • Low-Rank Adaptation (LoRA) (Hu et al.): LoRA is a technique for efficient model adaptation that trains a small number of added low-rank parameters while freezing the rest, enabling effective fine-tuning of large-scale pre-trained models with minimal computational overhead. The LoRA modules can be added and trained using the following flags: --use-native-attention --lora-r $r --lora-alpha $alpha --lora-dropout $dropout --lora-bias "none" --lora-modules "q_proj,k_proj" --load-checkpoint-liberally. The model will automatically merge the weights when the .eval() method is called and unmerge them with .train(). Note that training these modules will replace the required linear layers with LoRALinear layers, embedding layers with LoRAEmbedding, and attention blocks with NativeMultiheadAttention. These changes are yet to be reverted upon saving the final best model. Use the --load-checkpoint-liberally flag for fairseq-interactive as well when evaluating the model.
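
For intuition, here is a hedged sketch of what a LoRA-augmented linear layer computes (the frozen base projection plus a trainable low-rank update scaled by $\alpha / r$); the class below is illustrative and is not the LoRALinear module used in this fork:

# Illustrative LoRA linear: y = W x + (alpha / r) * B(A(x)), with W frozen and
# only the low-rank factors A (in -> r) and B (r -> out) trained.
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16, dropout=0.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the pretrained weights
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)           # the update starts as a no-op
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(self.dropout(x)))

Merging on .eval() then corresponds to folding scaling * (lora_B.weight @ lora_A.weight) into the frozen base weight, and unmerging on .train() subtracts it again, which is why the adapted model pays no extra latency at inference time.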

  • Rotary Positional Embedding (RoPE) (Su et al.): RoPE encodes the absolute position with a rotation matrix and, at the same time, incorporates explicit relative position dependency into the self-attention formulation. Notably, RoPE enables valuable properties, including flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping linear self-attention with relative position encoding. RoPE can be enabled with the following flags: --use-native-attention --use-rope --no-token-positional-embeddings. Note that when adding RoPE to models previously trained with sinusoidal positional embeddings, finetuning with LoRA added to the embedding, query, and key is most efficient. The RoPE implementation is directly sourced from here, and is one of the dependencies.
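
A minimal sketch of the rotary idea follows, rotating query/key feature pairs by position-dependent angles before the attention dot product so that the score depends only on the relative offset; it is illustrative and is not the RoPE implementation this fork depends on.

# Illustrative rotary position embedding (applied to both queries and keys):
# each (x1, x2) feature pair is rotated by angle theta_i * position, so the
# attention dot product depends only on the relative distance between tokens.
import torch

def apply_rope_sketch(x):
    # x: (batch, seq_len, num_heads, head_dim), head_dim must be even
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]   # (1, seq_len, 1, half)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)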

  • Miscellaneous:

    • Factorized Embedding Parameterization (Lan et al.): Similar to ALBERT, the large embeddings can be parameterized by adding an intermediate bottleneck layer, i.e., instead of being a single $|V| \times d_m$ matrix, the embedding consists of two pieces of sizes $|V| \times k$ and $k \times d_m$ respectively, where $k < d_m$. This helps curb the number of parameters in the embedding layer, which can be one of the bulkiest components (a minimal sketch appears after this list). Factorized embeddings can be used as: --encoder-factorized-embed-dim $encoder_fac_embed_dim --decoder-factorized-embed-dim $decoder_fac_embed_dim. A non-linear activation function can be applied to the intermediate bottleneck layer by specifying it in the flag --factorized-embed-activation-fn $fac_embed_activation_fn.

    • When using a penultimate linear transformation before the final projection onto the vocabulary, an activation function can be applied to it with --decoder-output-activation-fn $decoder_out_activation_fn

    • Sanity Validation Steps: Similar to the PyTorch Lightning Trainer, a full pass over the validation set can be run at the beginning of training to catch any bugs in the training/validation loop early. It can be activated with the flag --run-sanity-validation-steps

    • Added support for Python 3.11+ and bumped the version from 0.12.2 -> 0.12.4

    • A script to port fairseq transformer models to HuggingFace can be found here

    • Adapters are now deprecated and removed in favor of LoRA
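
As referenced in the Factorized Embedding Parameterization item above, here is a minimal sketch of the two-piece embedding; the class and argument names are illustrative, not this fork's API.

# Illustrative factorized embedding: a |V| x k lookup followed by a k x d_m
# projection (optionally with a non-linearity in between), replacing a single
# |V| x d_m embedding matrix.
import torch.nn as nn

class FactorizedEmbeddingSketch(nn.Module):
    def __init__(self, vocab_size, bottleneck_dim, embed_dim, activation=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, bottleneck_dim)    # |V| x k
        self.project = nn.Linear(bottleneck_dim, embed_dim)      # k x d_m
        self.activation = activation if activation is not None else nn.Identity()

    def forward(self, tokens):
        return self.project(self.activation(self.embed(tokens)))

For a rough sense of the savings (illustrative numbers): with $|V| = 64{,}000$, $d_m = 1024$, and $k = 128$, a full embedding has about 65.5M parameters, while the factorized version has about $64{,}000 \times 128 + 128 \times 1024 \approx 8.3$M.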

Requirements and Installation

  • PyTorch version >= 2.0.1
  • Python version >= 3.8
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install fairseq and develop locally:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e ./

# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./

# to install the latest stable release (0.10.x)
# pip install fairseq
  • For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run.

Getting Started

The full documentation contains instructions for getting started, training new models, and extending fairseq with new model types and tasks.


License

fairseq(-py) is MIT-licensed. The license applies to the pre-trained models as well.

Citation

Please cite as:

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}

Final Note

I will try my best to keep this repo synced with the upstream fairseq repository. This clone is very dynamic and can have broken stuff once in a while, so feel free to raise issues or open pull requests to fix bugs or introduce new features.
