Text2Room generates textured 3D meshes from a given text prompt using 2D text-to-image models.
This is the official repository that contains source code for the ICCV 2023 paper Text2Room.
[arXiv] [Project Page] [Video]
If you find Text2Room useful for your work please cite:
@preprint{hoellein2023text2room,
title={Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models},
author={H{\"o}llein, Lukas and Cao, Ang and Owens, Andrew and Johnson, Justin and Nie{\ss}ner, Matthias},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2023}
}
Create a conda environment:
conda create -n text2room python=3.9
conda activate text2room
pip install -r requirements.txt
Then install Pytorch3D by following the official instructions. For example, to install Pytorch3D on Linux (tested with PyTorch 1.13.1, CUDA 11.7, Pytorch3D 0.7.2):
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
Download the pretrained model weights for the fixed depth inpainting model, that we use:
- refer to the official IronDepth implemention to download the files
normal_scannet.pt
andirondepth_scannet.pt
. - place the files under
text2room/checkpoints
(Optional) Download the pretrained model weights for the text-to-image model:
git clone https://huggingface.co/stabilityai/stable-diffusion-2-inpainting
git clone https://huggingface.co/stabilityai/stable-diffusion-2-1
ln -s <path/to/stable-diffusion-2-inpainting> checkpoints
ln -s <path/to/stable-diffusion-2-1> checkpoints
As default, we generate a living room scene:
python generate_scene.py
Outputs are stored in text2room/output
.
We generate the following outputs per generated scene:
Mesh Files:
<output_root>/fused_mesh/after_generation.ply: generated mesh after the first stage of our method
<output_root>/fused_mesh/fused_final.ply: generated mesh after the second stage of our method
<output_root>/fused_mesh/x_poisson_meshlab_depth_y.ply: result of applying poisson surface reconstruction on mesh x with depth y
<output_root>/fused_mesh/x_poisson_meshlab_depth_y_quadric_z.ply: result of applying poisson surface reconstruction on mesh x with depth y and then decimating the mesh to have at least z faces
Renderings:
<output_root>/output_rendering/rendering_t.png: image from pose t, that was rendered from the final mesh
<output_root>/output_rendering/rendering_noise_t.png: image from a slightly different/noised pose t, that was rendered from the final mesh
<output_root>/output_depth/depth_t.png: depth from pose t, that was rendered from the final mesh
<output_root>/output_depth/depth_noise_t.png: depth from a slightly different/noised pose t, that was rendered from the final mesh
Metadata:
<output_root>/settings.json: all arguments used to generate the scene
<output_root>/seen_poses.json: list of all poses in Pytorch3D convention used to render output_rendering (no noise)
<output_root>/seen_poses_noise.json: list of all poses in Pytorch3D convention used to render output_rendering (with noise)
<output_root>/transforms.json: a file in the standard NeRF convention (e.g. see NeRFStudio) that can be used to optimize a NeRF for the generated scene. It refers to the rendered images in output_rendering (no noise).
We also generate the following intermediate outputs during generation of the scene:
<output_root>/fused_mesh/fused_until_frame_t.ply: generated mesh using the content until pose t
<output_root>/rendered/rendered_t.png: image from pose t, that was rendered from mesh_t
<output_root>/mask/mask_t.png: mask from pose t, that signals unobserved regions
<output_root>/mask/mask_eroded_dilated_t.png: mask from pose t, after applying erosion/dilation
<output_root>/rgb/rgb_t.png: image from pose t, that was inpainted with the text-to-image model
<output_root>/depth/rendered_depth_t.png: depth from pose t, that was rendered from mesh_t
<output_root>/depth/depth_t.png: depth from pose t, that was predicted/aligned from rgb_t and rendered_depth_t
<output_root>/rgbd/rgbd_t.png: combination of rgb_t and depth_t placed next to each other
Already have an in-the-wild image, from which you want to start the generation?
Specify it as --input_image_path
and the generated scene kicks-off from there.
python generate_scene.py --input_image_path sample_data/0.png
Generate indoor-scenes of arbitrary rooms by specifying another --trajectory_file
as input:
python generate_scene.py --trajectory_file model/trajectories/examples/bedroom.json
We provide a bunch of example rooms.
We provide a highly configurable method. See opt.py for a complete list of the configuration options.
You can specify your own prompts and camera trajectories by simply creating your own trajectory.json
file.
Each trajectory.json
file should satisfy the following format:
[
{
"prompt": (str, optional) the prompt to use for this trajectory,
"negative_prompt": (str, optional) the negative prompt to use for this trajectory,
"n_images": (int, optional) how many images to render between start and end pose of this trajectory,
"surface_normal_threshold": (float, optional) the surface_normal_threshold to use for this trajectory
"fn_name": (str, required) the name of a trajectory_function as specified in model/trajectories/trajectory_util.py
"fn_args": (dict, optional) {
"a": value for an argument with name 'a' of fn_name,
"b": value for an argument with name 'b' of fn_name,
},
"adaptive": (list, optional) [
{
"arg": (str, required) name of an argument of fn_name that represents a float value,
"delta": (float, required) delta value to add to the argument during adaptive pose search,
"min": (float, optional) minimum value during search,
"max": (float, optional) maximum value during search
}
]
},
{... next trajectory with similar structure as above ...}
]
We provide a bunch of predefined trajectory functions in trajectory_util.py.
Each trajectory.json
file is a combination of the provided trajectory functions.
You can create custom trajectories by creating new combinations of existing functions.
You can also add custom trajectory functions in trajectory_util.py.
For automatic integration with our codebase, custom trajectory functions should have the following pattern:
def custom_trajectory_fn(current_step, n_steps, **args):
# n_steps: how many poses including start and end pose in this trajectory
# current_step: pose in the current trajectory
# your custom trajectory function here...
def custom_trajectory(**args):
return _config_fn(custom_trajectory_fn, **args)
This lets you reference custom_trajectory
as fn_name
in a trajectory.json
file.
We provide a script that renders images from a mesh at different poses:
python render_cameras.py -m <path/to/mesh.ply> -c <path/to/cameras.json>
where you can provide any cameras in the Pytorch3D convention via -c
.
For example, to re-render all poses used during generation and completion:
python render_cameras.py \
-m <output_root>/fused_mesh/fused_final_poisson_meshlab_depth_12.ply \
-c <output_root>/seen_poses.json
We provide an easy way to train a NeRF from our generated scene.
We save a transforms.json
file in the standard NeRF convention, that can be used to optimize a NeRF for the generated scene.
It refers to the rendered images in <output_root>/output_rendering
.
It can be used with standard NeRF frameworks like Instant-NGP or NeRFStudio.
Our work builds on top of amazing open-source networks and codebases. We thank the authors for providing them.
- IronDepth [1]: a method for monocular depth prediction, that can be used for depth inpainting.
- StableDiffusion [2]: a state-of-the-art text-to-image inpainting model with publicly released network weights.
[1] IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty, BMVC 2022, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla
[2] High-Resolution Image Synthesis with Latent Diffusion Models, CVPR 2022, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer