Huggingface Library Demos

This repo collects code examples and demos used for demonstrating different aspects of the Huggingface libraries, including transformers, datasets, evaluate, and (eventually) more. Originally prepared for UCSC's NLP 244 Winter 2023 Course, expanding to include new examples and for a wider audience.

Be sure to check out Huggingface's Course, for an in depth overview and tutorial from HF.

Preamble: Python packaging

Not required for anything in this course, but I often personally find it useful to organize my work as a Python package. This allows others to import elements from your work without using or extracting individual files from your repo. This also allows a simpler workflow for importing your code into a Google Colab notebook

This is accomplished in three main steps:

Create a pyproject.toml file like this. For typical use-cases, you can copy the one linked without edits.
Create a setup.cfg file like this. You'll need to edit everything under [metadata], install_requires for your dependencies, and possibly other attributes as you customize your package organization and contents.
Add all your code under a sources directory linked from setup.cfg. In this case, I have everything under src/hf_libraries_demo since my package_dir includes =src. You'll want to rename appropriately. For more details on the src layout and alternatives, see this article.

I've defined minimal example of a function to import in src/hf_libraries_demo/package_demo. Given this, you can install an editable version of the whole package with pip install -e . from its root directory, and import functions you've defined in different modules in src. You can also install it from the git link directly. See an example of doing this in Colab here.

Using Huggingface Datasets

See this directory of examples

Loading a dataset from Huggingface (official tutorial) (example)
Using map and filter for pre-processing (official tutorial) (example)
Aside: pre-modeling data analysis with datasets (example)

Setting up Evaluation w/ Huggingface Evaluate

See this directory of examples

Here we approach things in a round-about order: we set up evaluation for a model on our dataset without first defining the model. To do this, we build pipelines for two model-free baselines and or test cases:

A perfect model, in which we can verify expectations of our evaluation metrics
A random baseline, which we can use to test the evaluator and compare results against

calculating accuracy for a random and a perfect model with evaluators (official eval tutorial) (example)
- random baseline "pipeline"
- perfect "pipeline"
calculating F1 as a custom metric (example)

Fine-tuning a Transformer with the Trainer API!

We'll use a fairly small pre-trained model: microsoft/xtremedistil-l6-h384-uncased.

instantiating the model and making zero-shot predictions manually (example)
making zero-shot predictions with an evaluator (example)
fine-tuning with the Trainer API (official docs) (example)
- bonus: logging to Weights & Biases
Customizing Trainer via-subclass: compute an alternative loss function (label-smoothed cross entropy) (official docs) (example)

Efficient Inference

a worked example of translating snli to French, using T5, as in Quest 4 (en_snli_to_french.py)
- filtering to only the unique English strings for translation
- Adding task prefixes with worker parallelism (num_proc=32)
- Batch tokenization with max_length=512
- Using a torch DataLoader with batch size 512 and num_workers=8
- batch decoding and storing of results
- building a french_snli from our map of unique EN -> FR translations

Modifying Generation/Decoding

some worked examples for generating text with T5 as an encoder-decoder are shown in generation_examples.py
- normal greedy decoding from T5
- normal sampling from T5
- adding a decoder prefix to T5 before decoding

Pre-training from Scratch

A complete example for pre-training RoBERTa from scratch with the BabyLM dataset can be found in experiments/pretraining

Not Covered (Yet)

You can use both/either a sparse (BM25) and dense (FAISS) Search index on a huggingface dataset to retrieve data points.
- great for retrieval-augmented generation or retrieval augmented in-context-learning
Huggingface Accelerate and Deepspeed integrations can vastly improve training speed and capacity
Other methods and modalities:
- Transformers and Datasets for vision and audio
- Diffusion Models
- Reinforcement Learning Environments (HF Simulate) and Reinforcement Learning from Human Feedback

kingb12 / hf_libraries_demo