ali-vilab / VGen

Official repo for VGen: a holistic video generation ecosystem for video generation building on diffusion models

Home Page: https://i2vgen-xl.github.io

t2v inference

xjxu21 opened this issue · comments

Hi, thanks for sharing the code and model.

I am trying to do some t2v inference with this codebase. I downloaded the t2v model text2video_pytorch_model.pth from ModelScope and modified the YAML config. Then I ran python inference.py --cfg configs/t2v_infer.yaml, but the results seem abnormal.

Is this model incompatible with the current codebase? If so, could you please give me a link to the right t2v model?

Thank you.

There are some differences. You may need to modify the Diffusion settings in t2v_train.yaml:

Diffusion: {
    'type': 'DiffusionDDIM',
    'schedule': 'linear_sd', # Stable Diffusion-style linear beta schedule (alternative: 'cosine')
    'schedule_param': {
        'num_timesteps': 1000,
        'init_beta': 0.00085,
        'last_beta': 0.0120,
        'zero_terminal_snr': False,
    },
    'mean_type': 'eps', # epsilon (noise) prediction rather than v-prediction
    'loss_type': 'mse',
    'var_type': 'fixed_small',
    'rescale_timesteps': False,
    'noise_strength': 0.0
}

Just replace your Diffusion settings with the block above. I haven't verified it yet, but you can give it a try. Thanks.
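
If the results are still abnormal after the change, a quick sanity check is to load the downloaded checkpoint and inspect its keys, to confirm the file really is a UNet state dict and not a wrapper. This is only a minimal sketch, assuming torch is installed and the ModelScope file was saved as models/text2video_pytorch_model.pth:

    # Minimal sketch: inspect the downloaded t2v checkpoint before pointing the config at it.
    # Assumes the file was saved to models/text2video_pytorch_model.pth (adjust the path if not).
    import torch

    state = torch.load("models/text2video_pytorch_model.pth", map_location="cpu")
    # Some releases wrap the weights under a 'state_dict' key; unwrap if present.
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]

    print(f"{len(state)} tensors")
    for name, tensor in list(state.items())[:10]:
        print(name, tuple(tensor.shape))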

It works, thank you!

It doesn't work for me.

Hey xjxu21, I have created my own workspace folder, following what is set in t2v_train.yaml:

[screenshot]

I have even added the model files that were previously missing:

[screenshot]

And I have applied Steven's suggestions accordingly:

[screenshot]

But no luck: it still does not output anything. Do you mind helping me? Thank you in advance.

Running Inference

How do I resolve this? I've downloaded text2video_pytorch_model.pth and open_clip_pytorch_model, but I get:
Exception: Failed to invoke function <function inference_text2video_entrance at 0x7f96796578b0>, with Failed to init class <class 'tools.modules.autoencoder.AutoencoderKL'>, with [Errno 2] No such file or directory: 'models/v2-1_512-ema-pruned.ckpt'
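
In case it helps, below is a sketch for fetching the two auxiliary checkpoints the config expects (models/v2-1_512-ema-pruned.ckpt for the autoencoder and models/open_clip_pytorch_model.bin for the ViT-H-14 embedder). The Hugging Face repo IDs are assumptions inferred from the file names; the ModelScope links in the VGen README remain the authoritative source:

    # Sketch: download the auxiliary weights the config expects into models/.
    # The Hugging Face repo IDs below are assumptions based on the file names;
    # check the VGen README (ModelScope) for the officially published checkpoints.
    from huggingface_hub import hf_hub_download

    hf_hub_download(
        repo_id="stabilityai/stable-diffusion-2-1-base",   # assumed source of the SD 2.1 VAE checkpoint
        filename="v2-1_512-ema-pruned.ckpt",
        local_dir="models",
    )
    hf_hub_download(
        repo_id="laion/CLIP-ViT-H-14-laion2B-s32B-b79K",   # assumed source of the OpenCLIP ViT-H/14 weights
        filename="open_clip_pytorch_model.bin",
        local_dir="models",
    )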

I have what seems like a similar issue with open_clip_pytorch_model. Is there an updated fix? Where do I find these weights? Which YAML files are currently supported and expected to run, and which are for models that have not been released?

/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
[2024-05-16 23:35:08,863] INFO: {'name': 'Config: VideoLDM Decoder', 'mean': [0.5, 0.5, 0.5], 'std': [0.5, 0.5, 0.5], 'max_words': 1000, 'num_workers': 6, 'prefetch_factor': 2, 'resolution': [448, 256], 'vit_out_dim': 1024, 'vit_resolution': [224, 224], 'depth_clamp': 10.0, 'misc_size': 384, 'depth_std': 20.0, 'frame_lens': [1, 16, 16, 16, 16, 32, 32, 32], 'sample_fps': [1, 8, 16, 16, 16, 8, 16, 16], 'vid_dataset': {'type': 'VideoDataset', 'data_list': ['data/vid_list.txt'], 'max_words': 1000, 'resolution': [448, 256], 'data_dir_list': ['data/videos/'], 'vit_resolution': [224, 224], 'get_first_frame': True}, 'img_dataset': {'type': 'ImageDataset', 'data_list': ['data/img_list.txt'], 'max_words': 1000, 'resolution': [448, 256], 'data_dir_list': ['data/images'], 'vit_resolution': [224, 224]}, 'batch_sizes': {'1': 32, '4': 8, '8': 4, '16': 4, '32': 2}, 'Diffusion': {'type': 'DiffusionDDIM', 'schedule': 'linear_sd', 'schedule_param': {'num_timesteps': 1000, 'init_beta': 0.00085, 'last_beta': 0.012, 'zero_terminal_snr': True}, 'mean_type': 'v', 'loss_type': 'mse', 'var_type': 'fixed_small', 'rescale_timesteps': False, 'noise_strength': 0.1, 'ddim_timesteps': 50}, 'ddim_timesteps': 50, 'use_div_loss': False, 'p_zero': 0.1, 'guide_scale': 9.0, 'vit_mean': [0.48145466, 0.4578275, 0.40821073], 'vit_std': [0.26862954, 0.26130258, 0.27577711], 'sketch_mean': [0.485, 0.456, 0.406], 'sketch_std': [0.229, 0.224, 0.225], 'hist_sigma': 10.0, 'scale_factor': 0.18215, 'use_checkpoint': True, 'use_sharded_ddp': False, 'use_fsdp': False, 'use_fp16': True, 'temporal_attention': True, 'UNet': {'type': 'UNetSD_TFT2V', 'in_dim': 4, 'dim': 320, 'y_dim': 1024, 'context_dim': 1024, 'out_dim': 4, 'dim_mult': [1, 2, 4, 4], 'num_heads': 8, 'head_dim': 64, 'num_res_blocks': 2, 'attn_scales': [1.0, 0.5, 0.25], 'dropout': 0.1, 'temporal_attention': True, 'temporal_attn_times': 1, 'use_checkpoint': True, 'use_fps_condition': False, 'use_sim_mask': False, 'config': 'None', 'num_tokens': 4, 'upper_len': 128, 'default_fps': 8, 'misc_dropout': 0.4}, 'guidances': [], 'auto_encoder': {'type': 'AutoencoderKL', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'video_kernel_size': [3, 1, 1]}, 'embed_dim': 4, 'pretrained': 'models/v2-1_512-ema-pruned.ckpt'}, 'embedder': {'type': 'FrozenOpenCLIPTextVisualEmbedder', 'layer': 'penultimate', 'pretrained': 'models/open_clip_pytorch_model.bin', 'vit_resolution': [224, 224]}, 'ema_decay': 0.9999, 'num_steps': 1000000, 'lr': 3e-05, 'weight_decay': 0.0, 'betas': [0.9, 0.999], 'eps': 1e-08, 'chunk_size': 2, 'decoder_bs': 2, 'alpha': 0.7, 'save_ckp_interval': 50, 'warmup_steps': 10, 'decay_mode': 'cosine', 'use_ema': False, 'load_from': None, 'Pretrain': {'type': 'pretrain_specific_strategies', 'fix_weight': False, 'grad_scale': 0.5, 'resume_checkpoint': 'workspace/model_bk/model_scope_0267000.pth', 'sd_keys_path': 'data/stable_diffusion_image_key_temporal_attention_x1.json'}, 'viz_interval': 5, 'visual_train': {'type': 'VisualTrainTextImageToVideo', 'partial_keys': [['y', 'fps']], 'use_offset_noise': False, 'guide_scale': 9.0}, 'visual_inference': {'type': 'VisualGeneratedVideos'}, 'inference_list_path': '', 'log_interval': 1, 'log_dir': 'workspace/experiments/text_list_for_tft2v', 'seed': 888, 'negative_prompt': 'Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms', 
'ENABLE': True, 'DATASET': 'webvid10m', 'TASK_TYPE': 'inference_tft2v_entrance', 'max_frames': 16, 'target_fps': 16, 'scale': 8, 'batch_size': 1, 'use_zero_infer': True, 'round': 1, 'test_list_path': 'data/text_list_for_tft2v.txt', 'vldm_cfg': 'configs/t2v_train.yaml', 'positive_prompt': ', cinematic, High Contrast, highly detailed, no blur, 4k render', 'test_model': 'models/tft2v_t2v_non_ema_512000.pth', 'video_compositions': ['text', 'image'], 'cfg_file': 'configs/tft2v_t2v_infer.yaml', 'init_method': 'tcp://localhost:9999', 'debug': False, 'opts': [], 'pmi_rank': 0, 'pmi_world_size': 1, 'gpus_per_machine': 1, 'world_size': 1, 'noise_strength': 0.1, 'gpu': 0, 'rank': 0, 'log_file': 'workspace/experiments/text_list_for_tft2v/log_00.txt'}
[2024-05-16 23:35:09,826] INFO: Going into inference_text2video_entrance inference on 0 gpu
[2024-05-16 23:35:09,847] INFO: Loading ViT-H-14 model config.
[2024-05-16 23:35:22,084] WARNING: Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.
[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 62, in build_from_config
[rank0]: return req_type_entry(**cfg)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/modules/clip_embedder.py", line 158, in init
[rank0]: model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 151, in create_model_and_transforms
[rank0]: model = create_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 122, in create_model
[rank0]: raise RuntimeError(f'Pretrained weights ({pretrained}) not found for model {model_name}.')
[rank0]: RuntimeError: Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 67, in build_from_config
[rank0]: return req_type_entry(**cfg)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/inferences/inference_tft2v_entrance.py", line 74, in inference_tft2v_entrance
[rank0]: worker(0, cfg, cfg_update)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/inferences/inference_tft2v_entrance.py", line 139, in worker
[rank0]: clip_encoder = EMBEDDER.build(cfg.embedder)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 107, in build
[rank0]: return self.build_func(*args, **kwargs, registry=self)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry_class.py", line 7, in build_func
[rank0]: return build_from_config(cfg, registry, **kwargs)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 64, in build_from_config
[rank0]: raise Exception(f"Failed to init class {req_type_entry}, with {e}")
[rank0]: Exception: Failed to init class <class 'tools.modules.clip_embedder.FrozenOpenCLIPTextVisualEmbedder'>, with Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/inference.py", line 18, in
[rank0]: INFER_ENGINE.build(dict(type=cfg_update.TASK_TYPE), cfg_update=cfg_update.cfg_dict)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 107, in build
[rank0]: return self.build_func(*args, **kwargs, registry=self)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry_class.py", line 7, in build_func
[rank0]: return build_from_config(cfg, registry, **kwargs)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 69, in build_from_config
[rank0]: raise Exception(f"Failed to invoke function {req_type_entry}, with {e}")
[rank0]: Exception: Failed to invoke function <function inference_tft2v_entrance at 0x7b6d423d5ea0>, with Failed to init class <class 'tools.modules.clip_embedder.FrozenOpenCLIPTextVisualEmbedder'>, with Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.
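
For reference, the RuntimeError above is raised by open_clip: when the pretrained string is not a recognised tag it is treated as a local file path, and the build fails if that file does not exist. A small pre-flight check (a sketch; the paths are the ones the config dump above references, adjust to your setup) can catch this before the registry builds anything:

    # Sketch: verify that the local weight files referenced in the merged config exist
    # before launching inference. Paths are taken from the log above.
    import os

    required = [
        "models/v2-1_512-ema-pruned.ckpt",       # auto_encoder.pretrained
        "models/open_clip_pytorch_model.bin",    # embedder.pretrained
        "models/tft2v_t2v_non_ema_512000.pth",   # test_model
    ]
    missing = [p for p in required if not os.path.isfile(p)]
    if missing:
        raise SystemExit(f"Missing weight files: {missing}")
    print("All weight files found.")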