⛔ [DEPRECATED] Transformers Domain Adaptation

Documentation | Colab Guide


This toolkit improves the performance of HuggingFace transformer models on downstream NLP tasks by domain-adapting models to the target domain of those tasks (e.g. BERT -> LawBERT).

The overall Domain Adaptation framework can be broken down into three phases:

  1. Data Selection

    Select a relevant subset of documents from the in-domain corpus that is likely to be beneficial for domain pre-training (see below)

  2. Vocabulary Augmentation

    Extending the vocabulary of the transformer model with domain-specific terminology

  3. Domain Pre-Training

    Continued pre-training of the transformer model on the in-domain corpus to learn the linguistic nuances of the target domain

After a model is domain-adapted, it can be fine-tuned on the downstream NLP task of choice, like any pre-trained transformer model.
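To make the Domain Pre-Training phase concrete, the snippet below sketches continued masked-language-model training with the standard Hugging Face `Trainer`. The model name, corpus path, and hyperparameters are illustrative placeholders, not the toolkit's prescribed training script:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder model and corpus -- substitute your own
model_name = "bert-base-uncased"
corpus_file = "in_domain_corpus.txt"  # one document per line

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load and tokenize the in-domain corpus
dataset = load_dataset("text", data_files={"train": corpus_file})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-pre-trained", num_train_epochs=3),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("domain-pre-trained")  # ready for downstream fine-tuning
```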

Components

This toolkit provides two classes, DataSelector and VocabAugmentor, to simplify the Data Selection and Vocabulary Augmentation steps respectively.
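The sketch below shows how the two classes might slot into a workflow. The constructor arguments and method names (e.g. `keep`, `target_vocab_size`, `get_new_tokens`) follow the general pattern of the Colab guide but should be verified against the documentation; the corpora and model name are placeholders:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers_domain_adaptation import DataSelector, VocabAugmentor

model_card = "bert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_card)
model = AutoModelForMaskedLM.from_pretrained(model_card)

# Placeholder corpora -- one string per document
in_domain_docs = ["...", "..."]
fine_tuning_texts = ["...", "..."]

# 1. Data Selection: keep the subset of in-domain documents most relevant
#    to the downstream fine-tuning texts (argument names assumed from the Colab guide)
selector = DataSelector(
    keep=0.5,
    tokenizer=tokenizer,
    similarity_metrics=["euclidean"],
    diversity_metrics=["entropy"],
)
selector.fit(fine_tuning_texts)
selected_docs = selector.transform(in_domain_docs)

# 2. Vocabulary Augmentation: learn domain-specific tokens, add them to the
#    tokenizer, and resize the model's embedding matrix to match
augmentor = VocabAugmentor(tokenizer=tokenizer, cased=False, target_vocab_size=31_000)
new_tokens = augmentor.get_new_tokens(selected_docs)
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```

The selected documents and augmented tokenizer then feed into the Domain Pre-Training step shown earlier.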

Installation

This package was developed on Python 3.6+ and can be installed using pip:

pip install transformers-domain-adaptation

Features

  • Compatible with the HuggingFace ecosystem:
    • transformers 4.x
    • tokenizers
    • datasets

Usage

Please refer to our Colab guide!


Results

TODO

License

Apache License 2.0

