chavinlo / trahiv

🤗 Transformers with Hivemind for decentralized finetuning across the Internet, à la BitTorrent

Home Page: https://github.com/learning-at-home/hivemind

TRAnsformers and HIVemind, trahiv

Simplified decentralized training of 🤗 transformers over the internet

🌎 trahiv is a fork of the 🤗 Transformers library with hivemind integration, enabling decentralized training in the simplest way possible.


How does it work?

Hivemind is a library for decentralized training over the internet. Using a DHT and parameter averaging, it lets you train a single model on multiple computers (stable or not) across the internet. As with BitTorrent, no peer needs to reach the entire network, and training is fault tolerant: peers can connect and disconnect whenever they wish.
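For intuition, here is a minimal sketch of plain hivemind, independent of trahiv: the first peer starts a DHT and prints how to reach it, and every other peer joins through those addresses. The multiaddress in the comment is only an example.

import hivemind

# First peer: start a fresh DHT and advertise how to reach it.
dht = hivemind.DHT(start=True)
print([str(maddr) for maddr in dht.get_visible_maddrs()])

# Any other peer: join the existing swarm instead of starting a new one,
# using one of the multiaddresses printed by the first peer, e.g.:
# peer_dht = hivemind.DHT(
#     initial_peers=["/ip4/127.0.0.1/tcp/42749/p2p/12D3KooW..."],
#     start=True,
# )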

Warning

At the moment it is heavily encouraged to only train extremely small models. A LoRA on a 7B model, with 8M parameters, takes anywhere from 15-30 seconds under ideal conditions (tested from Utah -> Indiana) to 40-50 minutes (Norway -> Czech Rep.). This should be taken as a proof of concept rather than a full library.

A better integration of hivemind with transformers is coming, but not right now.

Usage

Most, if not all, projects that use 🤗 Transformers should work with 🌎 trahiv without problems.

Let's take the QLoRA finetuning notebook as an example:

Almost everything remains unchanged except for the initialization of the trainer:

import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        hivemind_config={
            # kwargs forwarded to hivemind.DHT (empty dict = defaults)
            "dht": {},
            # kwargs forwarded to hivemind.Optimizer
            "opt": {
                "run_id": "spanish_qlora",
                "target_batch_size": 100,
                "verbose": True
            }
        }
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

As you can see, we only have to add hivemind_config, where:

  • dht: Although it's an empty dict here, you can pass any kwargs for hivemind.DHT. You can also omit it entirely; the DHT will always be initialized.

  • opt: Here go the kwargs for hivemind.Optimizer. The required parameters are run_id and target_batch_size: run_id is the name of the swarm you want to connect to, and target_batch_size is the number of samples accumulated per averaging round. More on this later.

This will initialize a local training run.
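For reference, hivemind's own Optimizer takes the same run_id and target_batch_size arguments and wraps a regular torch optimizer; the opt section of hivemind_config is presumably forwarded to it. A standalone sketch (the tiny model and Adam optimizer below are placeholders, not part of trahiv):

import torch
import hivemind

dht = hivemind.DHT(start=True)                 # plays the role of the "dht" section
model = torch.nn.Linear(16, 2)                 # placeholder model
base_optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

optimizer = hivemind.Optimizer(
    dht=dht,
    run_id="spanish_qlora",        # name of the swarm to join
    batch_size_per_step=4,         # samples each local step contributes
    target_batch_size=100,         # samples the whole swarm accumulates before averaging
    optimizer=base_optimizer,      # the regular torch optimizer being wrapped
    use_local_updates=True,        # apply local steps; average parameters in the background
    verbose=True,
)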

Connecting to the swarm

Local Network

Once the DHT has been set up, it will print the multiaddresses (maddrs) needed to connect to the swarm:

To join the training, use initial_peers = ['/ip4/127.0.0.1/tcp/42749/p2p/12D3KooWMEGSxULAnyzgdnsfDLBV5wjy7CThowe4rVoRqrYEy5Rv']

On another computer, start the same training script but add initial_peers to the DHT configuration:

hivemind_config={
    "dht": {
        "initial_peers": ['/ip4/127.0.0.1/tcp/42749/p2p/12D3KooWMEGSxULAnyzgdnsfDLBV5wjy7CThowe4rVoRqrYEy5Rv']
    },
    "opt": {
        "run_id": "spanish_qlora",
        "target_batch_size": 100,
        "verbose": True
    }
}

Make sure to use the same run_id as well. This will connect your training script to the swarm. Once the swarm has accumulated 100 samples (target_batch_size), it runs an averaging round, which can take anywhere from seconds to minutes depending on how large the model is.
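One convenient pattern is to hand the printed maddr to the second machine through an environment variable instead of editing the script; the TRAHIV_INITIAL_PEERS name below is made up for this example:

import os

# Hypothetical env var, e.g.
# TRAHIV_INITIAL_PEERS="/ip4/127.0.0.1/tcp/42749/p2p/12D3KooW..."
peers = [p for p in os.environ.get("TRAHIV_INITIAL_PEERS", "").split(",") if p]

hivemind_config = {
    "dht": {"initial_peers": peers} if peers else {},   # first peer starts the swarm, others join it
    "opt": {
        "run_id": "spanish_qlora",
        "target_batch_size": 100,
        "verbose": True,
    },
}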

Public Internet

The procedure over the public internet is almost the same, except that when starting the swarm (the first/initializer peer) you need to pass the host_maddrs the DHT should listen on:

hivemind_config={
    "dht": {
        "host_maddrs": ["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"]
    },
    "opt": {
        "run_id": "spanish_qlora",
        "target_batch_size": 100,
        "verbose": True
    }
}

These host maddrs make the DHT listen on a random port, on both TCP and UDP (QUIC). Note: TCP is far more stable than UDP.

If you want to set a specific port, or if the DHT has trouble detecting your public IP, you can always change it:

host_maddrs: ["/ip4/180.54.41.66/tcp/27015"]: In this scenario, the DHT will listen on port 27015 TCP.

Either way, this will also print the maddrs needed to connect to the swarm, now including the public address:

To join the training, use initial_peers = ['/ip4/180.54.41.66/tcp/27015/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE']

Use them the same way you would on a Local Network:

hivemind_config={
    "dht": {
        "initial_peers": ['/ip4/180.54.41.66/tcp/27015/p2p/QmaVTB2LwayToK2rzMkaCbkCaH7nF2rTHIS0IS0AN0EXAMPLE']
    },
    "opt": {
        "run_id": "spanish_qlora",
        "target_batch_size": 100,
        "verbose": True
    }
}

Averaging

Once the swarm has accumulated 100 samples (target_batch_size), the peers average their gradients. If a peer takes too long, it is simply dropped from the current averaging round, keeps its local gradients, and joins the next round.

Make sure to read the hivemind documentation for more information on how to configure timeouts, compression, and more: https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html
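For example, hivemind.Optimizer exposes matchmaking_time and averaging_timeout, which bound how long peers wait for each other. Assuming the opt section forwards kwargs unchanged, they could be tuned like this (the values are illustrative):

hivemind_config={
    "dht": {},
    "opt": {
        "run_id": "spanish_qlora",
        "target_batch_size": 100,
        "matchmaking_time": 10.0,    # seconds spent gathering peers before each averaging round
        "averaging_timeout": 300.0,  # give up on a round that takes longer than this
        "verbose": True
    }
}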

Credits

The Hivemind Library

@misc{hivemind,
  title = {{H}ivemind: a {L}ibrary for {D}ecentralized {D}eep {L}earning},
  author = {Learning{@}home team},
  year = 2020,
  howpublished = {\url{https://github.com/learning-at-home/hivemind}}
}

QLoRA: Efficient Finetuning of Quantized LLMs

@article{dettmers2023qlora,
  title={QLoRA: Efficient Finetuning of Quantized LLMs},
  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}

HuggingFace Transformers

@inproceedings{wolf-etal-2020-transformers,
    title = "Transformers: State-of-the-Art Natural Language Processing",
    author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
    pages = "38--45"
}


License: Apache License 2.0

