Add image tokenizer and patch position encoding
qgallouedec opened this issue
Quentin Gallouédec commented
Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized to the range [−1, 1] and divided by the square root of the patch size (i.e. √16 = 4).
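To make the description above concrete, here is a minimal NumPy sketch of the patch extraction and normalization. It is not the `gia` implementation: the function name `image_to_patches`, the channels-last `(H, W, C)` layout, and the assumption that inputs are `uint8` in [0, 255] are all illustrative choices.

```python
import numpy as np

PATCH_SIZE = 16  # patch side length, taken from the description above


def image_to_patches(image: np.ndarray, patch_size: int = PATCH_SIZE) -> np.ndarray:
    """Split an (H, W, C) uint8 image into normalized patches in raster order.

    Hypothetical helper for illustration only.
    Returns an array of shape (num_patches, patch_size, patch_size, C).
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size

    # Map pixel values from [0, 255] to [-1, 1] (assumes uint8 input).
    pixels = image.astype(np.float32) / 127.5 - 1.0
    # Divide by the square root of the patch size (sqrt(16) = 4).
    pixels = pixels / (patch_size ** 0.5)

    # (H, W, C) -> (rows, P, cols, P, C) -> (rows, cols, P, P, C)
    patches = pixels.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flattening the (rows, cols) grid row by row yields raster order.
    return patches.reshape(rows * cols, patch_size, patch_size, channels)


# Example: a 32 x 48 RGB image gives a 2 x 3 grid, i.e. 6 patches.
example = np.random.randint(0, 256, size=(32, 48, 3), dtype=np.uint8)
print(image_to_patches(example).shape)  # (6, 16, 16, 3)
```

With this scaling, every value in a patch ends up in [−0.25, 0.25]. Images whose sides are not multiples of 16 would need resizing or padding first, which this sketch deliberately leaves out.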
Usage should look like:
import numpy as np
import torch
from gia.model.embedding import Embeddings
from gia.model.tokenization import Tokenizer
# Define tokenizer and embedding layer
tokenizer = Tokenizer()
embedding_layer = Embeddings(embedding_dim=32)
# Load dataset (100k samples)
# First, clone it with `git clone https://huggingface.co/datasets/edbeeching/prj_gia_dataset_atari_2B_atari_yarsrevenge_1111`
dataset = np.load("prj_gia_dataset_atari_2B_atari_yarsrevenge_1111/dataset.npy", allow_pickle=True)
# Convert numpy object to dict. Keys are ['observations', 'actions', 'dones', 'rewards']
dataset = dataset.item()
observations = torch.from_numpy(dataset["observations"]).squeeze(1)  # TODO: remove squeeze when fixed in dataset
actions = torch.from_numpy(dataset["actions"]).to(torch.int64)  # TODO: remove cast when fixed in dataset
# Tokenize and embed
tokens = tokenizer(images=observations, actions=actions)
print(tokens.shape) # torch.Size([100000, K])
embeddings = embedding_layer(tokens)
print(embeddings.shape) # torch.Size([100000, K, 32])