Abstract:
Diffusion models have enabled high-quality, conditional image editing capabilities. We propose to expand their arsenal, and demonstrate that off-the-shelf diffusion models can be used for a wide range of cross-domain compositing tasks. Among numerous others, these include image blending, object immersion, texture-replacement and even CG2Real translation or stylization. We employ a localized, iterative refinement scheme which infuses the injected objects with contextual information derived from the background scene, and enables control over the degree and types of changes the object may undergo. We conduct a range of qualitative and quantitative comparisons to prior work, and exhibit that our method produces higher quality and realistic results without requiring any annotations or training. Finally, we demonstrate how our method may be used for data augmentation of downstream tasks.
This is the official implementation of Cross-domain Compositing (CDC), a local, inference-time, image editing method
which utilizes pretrained diffusion models for image compositing in various domains.
We base our method on previous work on global image editing at inference time, and propose a localized extension which
enables applications such as guided image inpainting, cross-domain image compositing and object-guided Sim2Real.
The fidelity-realism tradeoff is controlled by the per-region arguments described below (T_in, T_out, down_N_in, down_N_out).
This code builds on the Stable Diffusion codebase.
- Clone the repo:
git clone --recursive https://github.com/cross-domain-compositing/cross-domain-compositing.git
cd cross-domain-compositing/
- Create a new environment:
conda env create -f environment.yaml
conda activate ldm
Or install the additional requirements into an existing Stable Diffusion environment:
conda activate ldm
conda install -c anaconda scikit-learn
conda install -c conda-forge h5py
conda install -c conda-forge plyfile
conda install -c conda-forge trimesh
conda install -c conda-forge natsort
pip install PyMCubes
Note that these are needed only if you intend to run SVR (single-view reconstruction).
- Add the submodules path and set PYTHONPATH to point to the root of the repository (we used ResizeRight for image resizing):
export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/ResizeRight
- Download pretrained Stable Diffusion checkpoints:
wget -P models/ldm/stable-diffusion-v1 https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
wget -P models/ldm/stable-diffusion-v1 https://huggingface.co/runwayml/stable-diffusion-inpainting/resolve/main/sd-v1-5-inpainting.ckpt
See Stable Diffusion for more information on available model checkpoints.
On top of the original img2img arguments (see Stable Diffusion), we add the following for controlled levels of local image editing:
- --mask: Path to mask or directory containing masks (1 is for FG, 0 is for BG).
- --T_in: Editing strength for inner region (1 in mask). 0 is for no conditioning at all, 1 for full guidance. Controls the amount of similarity to the reference image; using a high T allows for adding finer details to the reference at the expense of similarity.
- --T_out: Editing strength for outer region (0 in mask). 0 is for no conditioning at all, 1 for full guidance. Controls the amount of similarity to the reference image; using a high T allows for adding finer details to the reference at the expense of similarity.
- --down_N_in: Scaling (downsampling) factor for inner region (1 in mask). Has a similar effect to T_in but relates more to structure.
- --down_N_out: Scaling (downsampling) factor for outer region (0 in mask). Has a similar effect to T_out but relates more to structure.
- --blend_pix: Number of pixels for mask smoothing (see paper).
- --repaint_start: When to start resampling for increased receptive field (see paper). 0 is for no resampling, 1 to start from the first step.
- --mask_dilate: Dilate mask by number of pixels.
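To make the mask convention concrete, here is a minimal pure-Python sketch of how a binary mask could be dilated and then mapped to a per-pixel editing strength. It is an illustration under assumptions, not the code in scripts_cdc: the helpers dilate and strength_map are hypothetical names, the dilation uses a square neighbourhood, and the smoothing controlled by --blend_pix is not modelled.

```python
def dilate(mask, pixels):
    """Grow the foreground (1) of a binary mask by `pixels`, using a square window."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - pixels), min(h, y + pixels + 1)
            x0, x1 = max(0, x - pixels), min(w, x + pixels + 1)
            # A pixel becomes foreground if any pixel in its window is foreground.
            out[y][x] = int(any(mask[j][i] for j in range(y0, y1) for i in range(x0, x1)))
    return out


def strength_map(mask, t_in, t_out):
    """Assign t_in to foreground pixels (1) and t_out to background pixels (0)."""
    return [[t_in if m else t_out for m in row] for row in mask]


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
grown = dilate(mask, 1)                           # single pixel grows to a 3x3 block
t_map = strength_map(grown, t_in=0.4, t_out=1.0)  # per-pixel editing strength
```

The same per-region selection would apply to the downsampling factors: one value inside the mask, another outside it.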
We also implement Paint-By-Word from eDiff-I, which enables localized text guidance to some degree. To use it:
- --prompt_in/--prompt_out: Prompts for inner/outer regions (must appear in --prompt).
- --prompt_amplifier_in/--prompt_amplifier_out: Prompt weight for inner/outer regions.
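Since the regional prompts must appear in the full prompt, a quick check before launching a run can save a wasted job. The sketch below assumes "appear in" means a literal substring match; the script itself may match at the token level, and check_regional_prompts is a hypothetical helper, not part of this repository.

```python
def check_regional_prompts(prompt, prompt_in=None, prompt_out=None):
    """Return the names of regional prompts that are not substrings of `prompt`."""
    missing = []
    for name, value in (("prompt_in", prompt_in), ("prompt_out", prompt_out)):
        if value is not None and value not in prompt:
            missing.append(name)
    return missing


full_prompt = "A photograph of a sofa in a living room"
ok = check_regional_prompts(full_prompt, prompt_in="sofa", prompt_out="living room")
bad = check_regional_prompts(full_prompt, prompt_in="armchair", prompt_out="living room")
```

Here ok is an empty list, while bad reports that prompt_in is not contained in the full prompt.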
Our arguments are also sweepable! To do so, simply supply multiple values to the desired arguments, and the script will sweep all permutations.
To define specific sets of sweeps, use --sweep_tuples.
python scripts_cdc/img2img.py --config configs/stable-diffusion/v1-inference.yaml --ckpt models/ldm/stable-diffusion-v1/sd-v1-4.ckpt --init_img examples/scribbles/images/ --mask examples/scribbles/masks/ --from_file examples/scribbles/prompts.txt --batch_size 1 --n_samples 1 --outdir outputs/scribbles --ddim_steps 50 --strength 1.0 --T_out 1.0 --T_in 0.0 0.2 0.4 0.6 0.8 --down_N_out 1 --down_N_in 1 2 4 --seed 42 --repaint_start 0 0.2 0.4 0.6 0.8 --skip_grid
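To estimate how many runs a sweep produces, the following sketch expands the swept values from the command above. It assumes that "all permutations" means the Cartesian product of the supplied value lists; the order in which the script visits the combinations may differ, and expand is a hypothetical helper.

```python
from itertools import product

# Values taken from the scribbles command above.
sweep = {
    "T_in": [0.0, 0.2, 0.4, 0.6, 0.8],
    "T_out": [1.0],
    "down_N_in": [1, 2, 4],
    "down_N_out": [1],
    "repaint_start": [0, 0.2, 0.4, 0.6, 0.8],
}


def expand(sweep):
    """Return one dict per combination of the swept argument values."""
    names = list(sweep)
    return [dict(zip(names, values)) for values in product(*(sweep[n] for n in names))]


runs = expand(sweep)
print(len(runs))  # 75
```

With 5 x 3 x 5 combinations per input image, a sweep like this grows quickly, which is where --sweep_tuples becomes useful for restricting it to specific sets.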
You can also supply a config.yaml from a previous run:
python scripts_cdc/img2img.py --config [config]
python scripts_cdc/img2img_inpaint.py --config configs/stable-diffusion/v1-inpainting-inference.yaml --ckpt models/ldm/stable-diffusion-v1/sd-v1-5-inpainting.ckpt --prompt "A photograph of a sofa in a living room" --init_img examples/sofas/images/ --mask examples/sofas/masks/ --n_samples 1 --outdir outputs/sofas --ddim_steps 50 --strength 1 --T_in 0 --T_out 0.5 --down_N_in 1 --down_N_out 1 --blend_pix 0 --seed 42 --repaint_start 0
We adopted D^2IMNet for the SVR model architecture, DISN data for training and OccNet data for 3D evaluation. The model training/testing/preprocessing scripts are forked from D^2IMNet.
We use ChamferDistancePytorch for chamfer distance evaluation.
git submodule add https://github.com/ThibaultGROUEIX/ChamferDistancePytorch.git
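For readers unfamiliar with the metric, here is a brute-force sketch of a symmetric Chamfer distance between two small point sets. It assumes one common convention (mean squared nearest-neighbour distance, summed over both directions); ChamferDistancePytorch is what the evaluation actually uses, and its scaling convention may differ from this illustration.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets (lists of tuples)."""

    def one_way(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(
            min(sum((s - d) ** 2 for s, d in zip(p, q)) for q in dst) for p in src
        ) / len(src)

    return one_way(a, b) + one_way(b, a)


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred, gt)  # 0.5 + 0.5 = 1.0
```

This version is quadratic in the number of points, so it is only suitable for sanity checks on small inputs.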
- Download and unzip rendered image data and GT ShapeNet SDF models from DISN to ./SVR/data/
- Preprocess image data by running scripts in ./SVR/preprocessing
- [Optional] Download processed ShapeNet data from OccNet for 3D test result evaluation
Follow the steps below to perform background augmentation on the ShapeNet dataset.
- Configure ./SVR/utils.py. Specify the ShapeNet category ID of interest, write the text prompt, choose camera views and set the foreground conditioning strength.
- Run python ./SVR/augment_ShapeNet_data.py. Change the save directory in the script if needed.
Follow the steps below to train D^2IMNet on the augmented dataset.
- python ./SVR/train_test_split.py - Create train test split lst files.
- Configure ./SVR/utils.py. Specify the category of interest, the path to training images, and file names to save the trained model.
- python ./SVR/train/train.py - Train SDFNet.
- python ./SVR/train/train_cam.py - Train CamNet.
- python ./SVR/test/test.py - Test on in-domain images.
Follow the steps below to test and evaluate in-the-wild images:
- Source in-the-wild images and preprocess them to 224 x 224. Save the folder of test images to ./SVR/data. Alternatively, we provide our processed sofa samples here; please save them to ./SVR/data/sofa_samples.
- Run python ./SVR/test/test_external_cam.py and python ./SVR/test/test_external_images.py to get the predicted camera pose and 3D model. Remember to specify the input/output paths.
- Run python ./SVR/blender_render_trans_bg.py to render the reconstructed model under the predicted camera pose. Remember to specify the input/output paths. We used the Blender 3.4.1 Python API to run this script.
- Extract the reference masks of the input images (we used U^2Net). Save the folder of extracted reference masks in ./SVR/result.
- Run python ./SVR/eval/eval_2D_iou.py to evaluate 2D-IOU on in-the-wild images. Remember to specify the input/output paths.
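As a reference for what the last step measures, here is a minimal sketch of 2D IoU between a rendered silhouette and a reference mask. It is not the code in eval_2D_iou.py: iou_2d is a hypothetical helper, masks are plain nested lists of 0/1, and returning 1.0 when both masks are empty is a convention chosen for this example.

```python
def iou_2d(pred, ref):
    """Intersection over union of two binary masks with identical shape."""
    inter = union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Convention for this sketch: two empty masks count as a perfect match.
    return inter / union if union else 1.0


rendered = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
reference = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
score = iou_2d(rendered, reference)  # 2 overlapping pixels out of 6 covered
```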
We manually collected and created 24 scribble examples + masks, and 45 immersion examples + masks, available at this link.
23/05/2023: Added mitigation for small objects. Use new parameters --crop_mask, --crop_scale [float], --crop_size [int] to apply CDC on a small, upscaled, bounding box around the object.
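The idea of working on a bounding box around the object can be sketched as follows. This is an assumption-laden illustration rather than the repository's logic: mask_bbox and expand_bbox are hypothetical helpers, the scale argument only stands in for the kind of role --crop_scale might play, and the resizing of the crop to a fixed size is left out.

```python
import math


def mask_bbox(mask):
    """Tight bounding box (x0, y0, x1, y1), end-exclusive, of the nonzero pixels."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("mask has no foreground pixels")
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def expand_bbox(box, scale, width, height):
    """Scale a box about its centre and clamp it to the image bounds."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w, half_h = (x1 - x0) * scale / 2, (y1 - y0) * scale / 2
    return (
        max(0, math.floor(cx - half_w)),
        max(0, math.floor(cy - half_h)),
        min(width, math.ceil(cx + half_w)),
        min(height, math.ceil(cy + half_h)),
    )


# An 8x8 mask with a 2x2 object in the middle.
mask = [[1 if 3 <= x <= 4 and 3 <= y <= 4 else 0 for x in range(8)] for y in range(8)]
tight = mask_bbox(mask)                                # (3, 3, 5, 5)
crop = expand_bbox(tight, scale=2, width=8, height=8)  # (2, 2, 6, 6)
```

Cropping with some surrounding context and upscaling the crop gives a small object more pixels to work with before the result is pasted back into the full image.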
If you use our work, please cite our paper.