LASST (Language-guided Semantic Style Transfer of 3D Indoor Scenes)

Accepted by ACM Multimedia PIES-ME 2022. Paper by Bu Jin, Beiwen Tian, Hao Zhao and Guyue Zhou from the Institute for AI Industry Research (AIR), Tsinghua University.

Introduction

3D content creation and editing is a long-standing multimedia demand. With the rise of the metaverse, tech giants and consumers are now looking forward to a high-quality virtual world that people can live in and interact with. We study the problem of 3D indoor scene style transfer, which would improve the user experience of metaverse residents.

In this repository, we address the new problem of language-guided semantic style transfer of 3D indoor scenes. The input is a 3D indoor scene mesh and several phrases that describe the target scene. First, 3D vertex coordinates are mapped to RGB residues by a multi-layer perceptron. Second, the colored 3D meshes are differentiably rendered into 2D images, using a viewpoint sampling strategy tailored for indoor scenes. Third, the rendered 2D images are compared to the phrases using pre-trained vision-language models. Finally, errors are back-propagated to the multi-layer perceptron to update the vertex colors that correspond to certain semantic categories. The whole LASST pipeline is shown below. Code and models will be made publicly available.
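The last step, restricting color updates to vertices of the targeted semantic categories, can be sketched in plain Python. This is only an illustration under stated assumptions, not the repository's code: the function name `apply_masked_residues`, the string labels, and the clamp to [0, 1] are hypothetical. In the actual pipeline the residues come from the multi-layer perceptron and are optimized through the differentiable renderer and the vision-language model.

```python
from typing import List, Sequence, Set, Tuple

Color = Tuple[float, float, float]


def apply_masked_residues(
    base_colors: Sequence[Color],
    residues: Sequence[Color],
    labels: Sequence[str],
    target_labels: Set[str],
) -> List[Color]:
    """Add per-vertex RGB residues only where the vertex label is targeted.

    Hypothetical helper: vertices whose semantic label is not in
    `target_labels` keep their input color unchanged.
    """
    if not (len(base_colors) == len(residues) == len(labels)):
        raise ValueError("base_colors, residues and labels must have equal length")

    styled: List[Color] = []
    for color, residue, label in zip(base_colors, residues, labels):
        if label in target_labels:
            # Add the residue channel-wise and clamp to the valid range
            # (the clamp is an assumption of this sketch).
            r, g, b = (min(1.0, max(0.0, c + d)) for c, d in zip(color, residue))
            styled.append((r, g, b))
        else:
            # Masked-out vertex: leave the original color as it is.
            styled.append((color[0], color[1], color[2]))
    return styled
```

For example, with labels `["floor", "wall"]` and `target_labels={"floor"}`, only the first vertex changes color; the wall vertex is returned as given.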

Getting Started

Installation

conda create --name LASST python=3.7
conda activate LASST
conda install --yes --file requirements.txt

System Requirements

Python 3.7
CUDA 11.0
GPU with at least 8 GB of memory

Data Preparation

We use the ScanNetV2 dataset. See HERE for more details. Remember to set the data path in src/local.py to your own data path.

Run examples

Run the following command for a room with wooden floor, steel refrigerator:

sh ./scripts/go.sh

The rendered images and final outputs will be saved to results/.
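Each phrase in the example prompt pairs a style with an object category ("wooden floor", "steel refrigerator"). How the repository maps phrases to semantic categories is not described here, so the following is a minimal sketch under an assumption: the last word of each comma-separated phrase names the category, and everything before it names the style. The helper `parse_prompt` is hypothetical.

```python
from typing import List, Tuple


def parse_prompt(prompt: str) -> List[Tuple[str, str]]:
    """Split a prompt into (style, category) pairs.

    Assumes (for illustration only) that each comma-separated phrase ends
    with the category word, e.g. "wooden floor" -> ("wooden", "floor").
    """
    pairs: List[Tuple[str, str]] = []
    for phrase in prompt.split(","):
        words = phrase.strip().split()
        if not words:
            # Ignore empty phrases such as a trailing comma.
            continue
        if len(words) < 2:
            raise ValueError(f"phrase needs a style and a category: {phrase!r}")
        style = " ".join(words[:-1])
        category = words[-1]
        pairs.append((style, category))
    return pairs
```

With this convention, `parse_prompt("wooden floor, steel refrigerator")` yields one pair for the floor and one for the refrigerator, which could then be matched against the scene's semantic labels.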

Outputs

semantic mask (input mesh, w/o semantic mask, w/ semantic mask)

text prompt: steel table

text prompt: marble floor

text prompt: wooden floor, silk sofa, wooden table

sampling (input mesh, text2mesh sampling, LASST sampling)

text prompt: marble floor, fabric sofa

text prompt: wooden floor, steel refrigerator

text prompt: golden chair, oak table

regularization (input mesh, None, rgb, hsv)

text prompt: leather sofa

text prompt: leather sofa, marble floor, oak table
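The comparison above lists three regularization settings (None, rgb, hsv), but this README does not spell out the regularization term. As a rough sketch only, assuming the term penalizes how far each stylized vertex color drifts from its input color in the chosen color space, it could look like the following. The function name `color_drift_penalty` and the squared-distance form are assumptions, not the authors' formulation.

```python
import colorsys
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def color_drift_penalty(
    original: Sequence[Color],
    stylized: Sequence[Color],
    space: str = "rgb",
) -> float:
    """Mean squared distance between input and stylized vertex colors.

    Hypothetical regularizer: `space` is "none", "rgb" or "hsv".
    Colors are RGB triples with channels in [0, 1].
    """
    if space not in ("none", "rgb", "hsv"):
        raise ValueError(f"unknown color space: {space!r}")
    if len(original) != len(stylized):
        raise ValueError("original and stylized must have equal length")
    if space == "none" or not original:
        return 0.0

    total = 0.0
    for a, b in zip(original, stylized):
        if space == "hsv":
            ha, sa, va = colorsys.rgb_to_hsv(*a)
            hb, sb, vb = colorsys.rgb_to_hsv(*b)
            # Hue is circular in [0, 1), so take the shorter way around.
            dh = abs(ha - hb)
            dh = min(dh, 1.0 - dh)
            total += dh * dh + (sa - sb) ** 2 + (va - vb) ** 2
        else:
            total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(original)
```

Working in HSV separates hue from saturation and brightness, so a penalty in that space weighs a change of tint differently from a change of lightness than the same penalty in RGB would.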

ground-truth label vs. predicted label

Citation

@article{jin2022language,
  title={Language-guided Semantic Style Transfer of 3D Indoor Scenes},
  author={Jin, Bu and Tian, Beiwen and Zhao, Hao and Zhou, Guyue},
  journal={arXiv preprint arXiv:2208.07870},
  year={2022}
}
