PIXEL

This repository contains code for PIXEL, the Pixel-based Encoder of Language. PIXEL is a language model that operates on text rendered as images, fully removing the need for a fixed vocabulary. This effectively allows for transfer to any language and script that can be typeset on your computer screen.

We pretrained a monolingual PIXEL model on the English Wikipedia and BookCorpus (around 3.2B words in total), the same data as BERT. We showed that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but is slightly weaker than BERT when working with Latin scripts.

For details about PIXEL, please have a look at our paper Language Modelling with Pixels. Information on how to cite our work can be found at the bottom.

PIXEL consists of three major components: a text renderer, which draws text as an image; an encoder, which encodes the unmasked regions of the rendered image; and a decoder, which reconstructs the masked regions at the pixel level. It is built on ViT-MAE.

During pretraining, the renderer produces images containing the training sentences. Patches of these images are linearly projected to obtain patch embeddings (rather than looked up in an embedding matrix, as in BERT), and 25% of the patches are masked out. The encoder, which is a Vision Transformer (ViT), then processes only the unmasked patches. The lightweight decoder, with hidden size 512 and 8 transformer layers, inserts learnable mask tokens into the encoder's output sequence and learns to reconstruct the raw pixel values at the masked positions.
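The patch-embedding and masking steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the repository's implementation: the function names, the toy image size, and the uniform random choice of masked patches are all hypothetical (the text above only fixes the 25% ratio and the linear projection).

```python
import numpy as np


def patchify(image, patch_size):
    """Cut a (patch_size, n * patch_size) image into n flattened square patches."""
    height, width = image.shape
    assert height == patch_size and width % patch_size == 0
    n = width // patch_size
    # (P, n*P) -> (P, n, P) -> (n, P, P) -> (n, P*P)
    return image.reshape(height, n, patch_size).transpose(1, 0, 2).reshape(n, -1)


def embed_and_mask(image, patch_size, weight, bias, mask_ratio=0.25, seed=0):
    """Linearly project patches, then keep only the unmasked ones for the encoder."""
    rng = np.random.default_rng(seed)
    patches = patchify(image, patch_size)
    embeddings = patches @ weight + bias  # linear projection, no vocabulary lookup
    n = len(patches)
    num_masked = int(n * mask_ratio)
    # Assumption: masked patches are drawn uniformly at random.
    order = rng.permutation(n)
    masked_idx = np.sort(order[:num_masked])
    visible_idx = np.sort(order[num_masked:])
    return embeddings[visible_idx], visible_idx, masked_idx


# Toy example: 16 patches of 4x4 pixels, projected to 8 dimensions.
rng = np.random.default_rng(0)
P, N, D = 4, 16, 8
image = rng.random((P, N * P))
weight = rng.normal(size=(P * P, D))
bias = np.zeros(D)

visible, visible_idx, masked_idx = embed_and_mask(image, P, weight, bias)
print(visible.shape, len(masked_idx))
```

In this sketch the encoder would receive only `visible`, while a decoder would be asked to reconstruct the pixels of the patches listed in `masked_idx`.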

After pretraining, the decoder can be discarded, leaving an 86M-parameter encoder on which task-specific classification heads can be stacked. Alternatively, the decoder can be retained and PIXEL can be used as a pixel-level generative language model (see Figures 3 and 6 in the paper for examples).

Coming Soon

Gradio demo
Rendering guide
Finetuned robustness models

Setup

This codebase is built on Transformers for PyTorch. We also took inspiration from the original ViT-MAE codebase. The default font GoNotoCurrent.ttf that we used for all experiments is a merged Noto font built with go-noto-universal.

You can set up this codebase as follows to get started with using PIXEL models:

Clone repo and initialize submodules

git clone https://github.com/xplip/pixel.git
cd pixel
git submodule update --init --recursive

Create a fresh conda environment

conda create -n pixel-env python=3.9
conda activate pixel-env

Install Python packages

conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
conda install -c conda-forge pycairo pygobject manimpango
pip install --upgrade pip
pip install -r requirements.txt
pip install ./datasets
pip install -e .

(Optional) Install Nvidia Apex

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Pretraining PIXEL

We provide instructions for pretraining PIXEL in PRETRAINING.md.

You can find our pretrained PIXEL-base at https://huggingface.co/Team-PIXEL/pixel-base.

Note: This link also gives access to all intermediate training checkpoints from 10k to 1M steps through the commit history. You can select one of these checkpoints when finetuning PIXEL via --model_revision=<commit_id>.
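One way to find the commit IDs mentioned above is to list the commits of the model repository with the `huggingface_hub` package. The sketch below assumes that package is installed and that the Hub is reachable; the helper names are hypothetical and are not part of this codebase.

```python
def list_checkpoint_revisions(repo_id="Team-PIXEL/pixel-base"):
    """Return (commit_id, title) pairs for every commit in the model repo."""
    # Imported lazily so this file can be loaded without huggingface_hub installed.
    from huggingface_hub import list_repo_commits

    return [(commit.commit_id, commit.title) for commit in list_repo_commits(repo_id)]


def revision_flag(commit_id):
    """Format the command-line flag that selects one checkpoint."""
    return f"--model_revision={commit_id}"


if __name__ == "__main__":
    # Requires network access to the Hugging Face Hub.
    for commit_id, title in list_checkpoint_revisions():
        print(revision_flag(commit_id), "#", title)
```

The commit titles are printed alongside each flag because they are the most likely place to find which training step a commit corresponds to, though that is an assumption about how the checkpoints were uploaded.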

Finetuning PIXEL

We provide instructions for finetuning PIXEL in FINETUNING.md. If you follow our training recipes, or simply evaluate the models we provide via the links below, you can expect results similar to those reported below.

Note: The links give access to all 5 random seeds that we averaged results over for each model (one in the main branch, and the others in branches seed2–seed5). You can select different seeds via --model_revision=<branch_name>.
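The branch layout described in the note can be captured by a small helper that turns a seed index into the matching `--model_revision` value. The numbering of seeds from 1 to 5, with seed 1 living in `main`, is an assumption made for illustration; the helper names are hypothetical.

```python
def seed_to_revision(seed: int) -> str:
    """Map a seed index (1-5) to the branch holding that finetuned model."""
    if not 1 <= seed <= 5:
        raise ValueError(f"expected a seed between 1 and 5, got {seed}")
    # Assumption: the first seed is the one stored in the main branch.
    return "main" if seed == 1 else f"seed{seed}"


def revision_args(seed: int) -> list:
    """Build the command-line argument that selects the given seed."""
    return [f"--model_revision={seed_to_revision(seed)}"]


for seed in range(1, 6):
    print(seed, revision_args(seed))
```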

Universal Dependencies (POS Tagging and Dependency Parsing)

	English-EWT	Arabic-PADT	Coptic-Scriptorium	Hindi-HDTB	Japanese-GSD	Korean-GSD	Tamil-TTB	Vietnamese-VTB	Chinese-GSD
POS Tagging Accuracy	96.7 Models	95.7 Models	96.0 Models	96.3 Models	97.2 Models	94.2 Models	81.0 Models	85.7 Models	92.8 Models
Dependency Parsing LAS	88.7 Models	77.3 Models	83.5 Models	89.2 Models	90.7 Models	78.5 Models	52.6 Models	50.5 Models	73.7 Models

MasakhaNER

	CoNLL-2003 English	Amharic	Hausa	Igbo	Kinyarwanda	Luganda	Luo	Naija Pidgin	Swahili	Wolof	Yorùbá
F1 Score	89.5 Models	47.7 Models	82.4 Models	79.9 Models	64.2 Models	76.5 Models	66.6 Models	78.7 Models	79.8 Models	59.7 Models	70.7 Models

GLUE Validation Sets

MNLI-M/MM Acc	QQP F1	QNLI Acc	SST-2 Acc	CoLA Matthews Corr.	STS-B Spearman's ρ	MRPC F1	RTE Acc	WNLI Acc	Avg
78.1 / 78.9 Models	84.5 Models	87.8 Models	89.6 Models	38.4 Models	81.1 Models	88.2 Models	60.5 Models	53.8 Models	74.1
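The Avg column is not defined in the table itself. A quick check suggests it is the unweighted mean over all ten scores, with MNLI-M and MNLI-MM counted separately; this reading is inferred from the arithmetic below rather than stated here.

```python
# Scores copied from the GLUE table above.
glue_scores = {
    "MNLI-M": 78.1,
    "MNLI-MM": 78.9,
    "QQP": 84.5,
    "QNLI": 87.8,
    "SST-2": 89.6,
    "CoLA": 38.4,
    "STS-B": 81.1,
    "MRPC": 88.2,
    "RTE": 60.5,
    "WNLI": 53.8,
}

# Unweighted mean over the ten scores.
glue_average = sum(glue_scores.values()) / len(glue_scores)
print(round(glue_average, 1))  # 74.1, matching the Avg column
```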

Question Answering (TyDiQA-GoldP, SQuAD, KorQuAD 1.0, JaQuAD)

	TyDiQA-GoldP										SQuADv1	KorQuADv1	JaQuAD
	English	Arabic	Bengali	Finnish	Indonesian	Korean	Russian	Swahili	Telugu	Avg	English	Korean	Japanese
F1 Score	59.6	57.3	36.3	57.1	63.6	26.1	50.5	65.9	61.7	52.3	81.4	78.0	34.1
URL	Models										Models	Models	Models
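The TyDiQA-GoldP Avg of 52.3 does not equal the mean of all nine languages. It does match the mean over the eight non-English languages, so the average appears to exclude English; this is inferred from the numbers in the table and not stated here.

```python
# F1 scores copied from the TyDiQA-GoldP columns above.
tydiqa_f1 = {
    "English": 59.6,
    "Arabic": 57.3,
    "Bengali": 36.3,
    "Finnish": 57.1,
    "Indonesian": 63.6,
    "Korean": 26.1,
    "Russian": 50.5,
    "Swahili": 65.9,
    "Telugu": 61.7,
}

# Mean over all nine languages.
mean_all = sum(tydiqa_f1.values()) / len(tydiqa_f1)

# Mean over the eight non-English languages.
non_english = [score for lang, score in tydiqa_f1.items() if lang != "English"]
mean_non_english = sum(non_english) / len(non_english)

print(round(mean_all, 1))          # 53.1
print(round(mean_non_english, 1))  # 52.3, matching the Avg column
```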

Citation & Contact

@article{rust-etal-2022-pixel,
  title={Language Modelling with Pixels},
  author={Phillip Rust and Jonas F. Lotz and Emanuele Bugliarello and Elizabeth Salesky and Miryam de Lhoneux and Desmond Elliott},
  journal={arXiv preprint},
  year={2022},
  url={https://arxiv.org/abs/2207.06991}
}

Feel free to open an issue here or send an email to ask questions about PIXEL or report problems with the code! We emphasize that this is experimental research code.

Contact person: Phillip Rust (p.rust@di.ku.dk)

If you find this repo useful, we would also be happy about a ⭐️ :).
