# CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft
CLIP4MC is a vision-language model for Minecraft that aligns not only entities but also the actions implicitly contained in video and transcript clips. We construct and release a neat vision-language dataset for Minecraft based on the YouTube dataset from MineDojo, and we train our CLIP4MC model on the constructed dataset. Empirically, our method provides a friendlier reward signal for the RL training procedure.
## Demonstrations
Here are some demonstrations of agents trained with CLIP4MC.
| Harvest a leaf | Milk a cow | Shear a sheep |
| --- | --- | --- |
## Requirements
### Packages
Install the Python packages listed in `requirements.txt`. Note that we require `PyTorch>=1.10.0` and `x-transformers==0.27.1`.
### Data
The dataset should be prepared before training. Information on each data piece is available in our released dataset.
This project provides a naive implementation of the dataloader and dataset. To use them, the data should be organized in the following structure:
```
data_dir_0
├── text_input.pkl
├── video_input.pkl
data_dir_1
├── text_input.pkl
├── video_input.pkl
...
data_dir_n
├── text_input.pkl
├── video_input.pkl
```
Tokenized and padded text (via the CLIP tokenizer) and equidistantly sampled frames are stored in `text_input.pkl` and `video_input.pkl` respectively, and are meant to be loaded with the `pickle.load` function.
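For concreteness, here is a minimal loading sketch. It only assumes that each `.pkl` file holds a single picklable object (e.g. a token-ID array and a stack of sampled frames); `load_data_piece` is an illustrative helper, not part of this repository:

```python
import pickle
from pathlib import Path

def load_data_piece(data_dir):
    """Load one (text, video) pair from a data directory (illustrative only)."""
    data_dir = Path(data_dir)
    with open(data_dir / "text_input.pkl", "rb") as f:
        text_input = pickle.load(f)   # tokenized & padded transcript
    with open(data_dir / "video_input.pkl", "rb") as f:
        video_input = pickle.load(f)  # equidistantly sampled frames
    return text_input, video_input

text_input, video_input = load_data_piece("data_dir_0")
```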
A log file for each dataset is also required. The log file should be a `json` file with the following structure:
```json
{
    "train": ["data_dir_0", "data_dir_1", ..., "data_dir_n"],
    "test":  ["data_dir_0", "data_dir_1", ..., "data_dir_n"]
}
```
The `train` and `test` keys are both required: `train` should contain a list of data directories for training, and `test` a list of data directories for testing.
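Putting the two pieces together, a dataset in this layout could be consumed roughly as sketched below. `NaiveCLIP4MCDataset` is a hypothetical name and this is not the repository's implementation; it only illustrates how the log file and the per-directory pickles fit together:

```python
import json
import pickle
from pathlib import Path

from torch.utils.data import Dataset

class NaiveCLIP4MCDataset(Dataset):
    """Illustrative dataset over the directory layout described above."""

    def __init__(self, log_file, split="train"):
        # The log file maps "train"/"test" to lists of data directories.
        with open(log_file) as f:
            self.dirs = [Path(d) for d in json.load(f)[split]]

    def __len__(self):
        return len(self.dirs)

    def __getitem__(self, idx):
        d = self.dirs[idx]
        with open(d / "text_input.pkl", "rb") as f:
            text_input = pickle.load(f)
        with open(d / "video_input.pkl", "rb") as f:
            video_input = pickle.load(f)
        return video_input, text_input
```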
## Pretrained Models
- A ViT-B-16 version of pretrained CLIP is required for training from scratch. You can download it from here.
- The weights of our pretrained CLIP4MC model are available here for fine-tuning or evaluation.
## Usage
You can use the command pattern below to train the model. `XXX` means a path to fill in, `[...]` marks optional arguments, and `{...}` marks a choice between arguments.
```bash
torchrun --nproc_per_node=4 train_ddp.py --dataset_log_file XXX \
    [--use_pretrained_CLIP --pretrain_CLIP_path XXX \]
    [--use_pretrained_model --pretrain_model_path XXX \]
    --model_type {CLIP4MC,CLIP4MC_simple,MineCLIP}
```
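Once trained, a model of this kind is typically used as an RL reward by scoring how well the agent's recent frames match the task prompt. The sketch below shows that pattern; `encode_video` and `encode_text` are assumed interface names for illustration, not necessarily the ones used in this repository:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_reward(model, frames, prompt_tokens):
    """Cosine similarity between video and text embeddings as an RL reward.

    `model.encode_video` / `model.encode_text` are hypothetical method
    names standing in for the model's video and text encoders.
    """
    video_emb = F.normalize(model.encode_video(frames), dim=-1)
    text_emb = F.normalize(model.encode_text(prompt_tokens), dim=-1)
    return (video_emb * text_emb).sum(dim=-1)  # one scalar reward per sample
```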
## Citation
```bibtex
@article{ding2023clip4mc,
  title={CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft},
  author={Ding, Ziluo and Luo, Hao and Li, Ke and Yue, Junpeng and Huang, Tiejun and Lu, Zongqing},
  journal={arXiv preprint arXiv:2303.10571},
  year={2023}
}
```