Cross-domain Compositing with Pretrained Diffusion Models

Abstract:
Diffusion models have enabled high-quality, conditional image editing capabilities. We propose to expand their arsenal, and demonstrate that off-the-shelf diffusion models can be used for a wide range of cross-domain compositing tasks. Among numerous others, these include image blending, object immersion, texture-replacement and even CG2Real translation or stylization. We employ a localized, iterative refinement scheme which infuses the injected objects with contextual information derived from the background scene, and enables control over the degree and types of changes the object may undergo. We conduct a range of qualitative and quantitative comparisons to prior work, and exhibit that our method produces higher quality and realistic results without requiring any annotations or training. Finally, we demonstrate how our method may be used for data augmentation of downstream tasks.

This is the official implementation of Cross-domain Compositing (CDC), a local, inference-time image editing method that uses pretrained diffusion models for image compositing in various domains.
We base our method on previous work on global image editing at inference time, and propose a localized extension that enables applications such as guided image inpainting, cross-domain image compositing, and object-guided Sim2Real. The fidelity-realism tradeoff is controlled by the parameters listed under Usage.

Setup

This code builds on the Stable Diffusion codebase.

Clone the repo:

git clone --recursive https://github.com/cross-domain-compositing/cross-domain-compositing.git
cd cross-domain-compositing/

Create a new environment:

conda env create -f environment.yaml
conda activate ldm

Or install the additional requirements into an existing Stable Diffusion environment:

conda activate ldm
conda install -c anaconda scikit-learn
conda install -c conda-forge h5py
conda install -c conda-forge plyfile
conda install -c conda-forge trimesh
conda install -c conda-forge natsort
pip install PyMCubes

Note that these are needed only if you intend to run SVR.

Add the submodule paths and set PYTHONPATH to point to the root of the repository (we use ResizeRight for image resizing):

export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/ResizeRight

Download pretrained Stable Diffusion checkpoints:

wget -P models/ldm/stable-diffusion-v1 https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
wget -P models/ldm/stable-diffusion-v1 https://huggingface.co/runwayml/stable-diffusion-inpainting/resolve/main/sd-v1-5-inpainting.ckpt

See Stable Diffusion for more information on available model checkpoints.

Usage

On top of the original img2img arguments (see Stable Diffusion), we add the following for controlled levels of local image editing:

--mask - Path to mask or directory containing masks (1 is for FG, 0 is for BG).
--T_in - Editing strength for the inner region (1 in mask). 0 means no conditioning at all, 1 means full guidance. Controls the similarity to the reference image: a high T allows adding finer details to the reference at the expense of similarity.
--T_out - Editing strength for the outer region (0 in mask). 0 means no conditioning at all, 1 means full guidance. Controls the similarity to the reference image: a high T allows adding finer details to the reference at the expense of similarity.
--down_N_in - Scaling (downsampling) factor for the inner region (1 in mask). Has a similar effect to T_in but relates more to structure.
--down_N_out - Scaling (downsampling) factor for the outer region (0 in mask). Has a similar effect to T_out but relates more to structure.
--blend_pix - Number of pixels for mask smoothing (see paper).
--repaint_start - When to start resampling for increased receptive field (see paper). 0 is for no resampling, 1 to start from first step.
--mask_dilate - Dilate mask by number of pixels.
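
To make the mask convention concrete, here is a minimal pure-Python sketch of what dilating a mask by a number of pixels and selecting an inner or outer value per pixel could look like. It is an illustration only and not the repository's code: the helper names dilate and per_pixel_value are hypothetical, and the square neighbourhood used for dilation is an assumption (the actual kernel shape may differ).

```python
from typing import List

Mask = List[List[int]]  # 1 = foreground (inner region), 0 = background (outer region)


def dilate(mask: Mask, pixels: int) -> Mask:
    """Grow the foreground by `pixels` using a square neighbourhood (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] != 1:
                continue
            # Mark every pixel within `pixels` rows/columns of a foreground pixel.
            for ny in range(max(0, y - pixels), min(h, y + pixels + 1)):
                for nx in range(max(0, x - pixels), min(w, x + pixels + 1)):
                    out[ny][nx] = 1
    return out


def per_pixel_value(mask: Mask, inner: float, outer: float) -> List[List[float]]:
    """Pick `inner` where the mask is 1 and `outer` where it is 0."""
    return [[inner if m == 1 else outer for m in row] for row in mask]


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
dilated = dilate(mask, 1)  # the single foreground pixel becomes a 3x3 block
strength_map = per_pixel_value(dilated, inner=0.4, outer=1.0)
```

The same selection applies conceptually to any in/out parameter pair (for example T_in/T_out or down_N_in/down_N_out): the value ending in _in governs pixels where the mask is 1, the value ending in _out governs the rest.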

We also implement Paint-By-Word from eDiff-I, which enables localized text guidance to some degree. To use it:

--prompt_in/prompt_out - Prompts for inner/outer regions (must appear in --prompt).
--prompt_amplifier_in/prompt_amplifier_out - Prompt weight for inner/outer regions.
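
The requirement that the regional prompts appear in --prompt can be checked before launching a run. The sketch below is a hypothetical helper (find_word_span is not part of this repository) that locates a regional prompt as a contiguous run of words inside the full prompt. It assumes exact, case-sensitive, whitespace-separated matching, which may be stricter or looser than what the script itself does.

```python
from typing import Tuple


def find_word_span(prompt: str, sub_prompt: str) -> Tuple[int, int]:
    """Return (start, end) word indices of `sub_prompt` inside `prompt`.

    Raises ValueError if the sub-prompt does not occur as a contiguous word run.
    """
    words = prompt.split()
    target = sub_prompt.split()
    if not target:
        raise ValueError("sub-prompt is empty")
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            return start, start + len(target)
    raise ValueError(f"{sub_prompt!r} does not appear in {prompt!r}")


full_prompt = "A photograph of a sofa in a living room"
span_in = find_word_span(full_prompt, "a sofa")          # words 3..4
span_out = find_word_span(full_prompt, "a living room")  # words 6..8
```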

Our arguments are also sweepable! Simply supply multiple values to the desired arguments, and the script will sweep over every combination of them. To define specific sets of sweeps, use --sweep_tuples.
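
To get a feel for how quickly a sweep grows, the following sketch enumerates every combination of a few swept arguments with itertools.product. It is not the script's own implementation: the dictionary layout and the helper expand_sweep are assumptions made for illustration, and the values are borrowed from the scribbles example in this README.

```python
from itertools import product
from typing import Dict, List, Sequence

# Swept values, taken from the "Locally Guided Image Editing" example command.
sweep: Dict[str, Sequence] = {
    "T_in": [0.0, 0.2, 0.4, 0.6, 0.8],
    "down_N_in": [1, 2, 4],
    "repaint_start": [0, 0.2, 0.4, 0.6, 0.8],
}


def expand_sweep(sweep: Dict[str, Sequence]) -> List[dict]:
    """Return one settings dict per combination of the swept values."""
    names = list(sweep)
    return [dict(zip(names, values)) for values in product(*(sweep[n] for n in names))]


runs = expand_sweep(sweep)
print(len(runs))  # 75 = 5 * 3 * 5
```

Each added swept argument multiplies the number of runs, so it is worth counting the combinations before starting a long sweep.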

Examples

Locally Guided Image Editing

python scripts_cdc/img2img.py --config configs/stable-diffusion/v1-inference.yaml --ckpt models/ldm/stable-diffusion-v1/sd-v1-4.ckpt --init_img examples/scribbles/images/ --mask examples/scribbles/masks/ --from_file examples/scribbles/prompts.txt --batch_size 1 --n_samples 1 --outdir outputs/scribbles --ddim_steps 50 --strength 1.0 --T_out 1.0 --T_in 0.0 0.2 0.4 0.6 0.8 --down_N_out 1 --down_N_in 1 2 4 --seed 42 --repaint_start 0 0.2 0.4 0.6 0.8 --skip_grid

You can also supply a config.yaml from a previous run:

python scripts_cdc/img2img.py --config [config]

Background augmentation

python scripts_cdc/img2img_inpaint.py --config configs/stable-diffusion/v1-inpainting-inference.yaml --ckpt models/ldm/stable-diffusion-v1/sd-v1-5-inpainting.ckpt --prompt "A photograph of a sofa in a living room" --init_img examples/sofas/images/ --mask examples/sofas/masks/ --n_samples 1 --outdir outputs/sofas --ddim_steps 50 --strength 1 --T_in 0 --T_out 0.5 --down_N_in 1 --down_N_out 1 --blend_pix 0 --seed 42 --repaint_start 0

Single View 3D Reconstruction

We adopt D^2IMNet as the SVR model architecture, DISN data for training, and OccNet data for 3D evaluation. The model training/testing/preprocessing scripts are forked from D^2IMNet.

Environment Configuration

We use ChamferDistancePytorch for chamfer distance evaluation.

git submodule add https://github.com/ThibaultGROUEIX/ChamferDistancePytorch.git
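
For readers unfamiliar with the metric, here is a brute-force, pure-Python sketch of a symmetric chamfer distance between two small point sets. It does not use or reproduce ChamferDistancePytorch. Conventions vary between implementations (squared versus unsquared distances, sum versus mean), and this sketch assumes squared distances averaged per direction.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean nearest-neighbour squared distance from a to b, plus from b to a."""
    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, ref))  # 1.0
```

The brute-force version is quadratic in the number of points, which is why a GPU implementation is used for real evaluation.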

Data Preparation

Download and unzip rendered image data and GT ShapeNet SDF models from DISN to ./SVR/data/
Preprocess image data by running scripts in ./SVR/preprocessing
[Optional] Download processed ShapeNet data from OccNet for 3D test result evaluation

Data Augmentation

Follow the steps below to perform background augmentation on the ShapeNet dataset.

Configure ./SVR/utils.py. Specify the ShapeNet category ID of interest, write the text prompt, choose the camera views, and set the foreground conditioning strength.
Run python ./SVR/augment_ShapeNet_data.py. Change the save directory in the script if needed.

Training and testing

Follow the steps below to train D^2IMNet on the augmented dataset.

python ./SVR/train_test_split.py - Create train test split lst files.
Configure ./SVR/utils.py. Specify the category of interest, the path to training images, and file names to save the trained model.
python ./SVR/train/train.py - Train SDFNet.
python ./SVR/train/train_cam.py - Train CamNet.
python ./SVR/test/test.py - Test on in-domain images.

Test on in-the-wild images

Follow the steps below to test and evaluate on in-the-wild images:

Source in-the-wild images and preprocess them to 224 x 224. Save the folder of test images to ./SVR/data. Alternatively, we provide our processed sofa samples here; save them to ./SVR/data/sofa_samples.
Run python ./SVR/test/test_external_cam.py and python ./SVR/test/test_external_images.py to get the predicted camera pose and 3D model. Remember to specify the input/output paths.
Run python ./SVR/blender_render_trans_bg.py to render the reconstructed model under the predicted camera pose. Remember to specify the input/output paths. We used the Blender 3.4.1 Python API to run this script.
Extract the reference masks of the input images (we used U^2Net). Save the folder of extracted reference masks in ./SVR/result.
Run python ./SVR/eval/eval_2D_iou.py to evaluate 2D IoU on in-the-wild images. Remember to specify the input/output paths.
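
The 2D IoU in the last step compares the rendered silhouette of the reconstruction with the reference mask. As a rough illustration of the metric only (this is not the code in eval_2D_iou.py, and the helper name mask_iou is hypothetical), intersection over union of two binary masks can be computed as follows:

```python
from typing import List

Mask = List[List[int]]  # 1 = object, 0 = background


def mask_iou(pred: Mask, ref: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Two empty masks are treated as a perfect match (an assumption of this sketch).
    return inter / union if union else 1.0


rendered = [
    [1, 1, 0],
    [0, 1, 0],
]
reference = [
    [1, 0, 0],
    [0, 1, 1],
]
print(mask_iou(rendered, reference))  # 0.5
```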

Data

We manually collected and created 24 scribble examples with masks and 45 immersion examples with masks, available at this link.

Updates

23/05/2023: Added mitigation for small objects. Use new parameters --crop_mask, --crop_scale[float], --crop_size[int] to apply CDC on a small, upscaled, bounding box around the object.
