ali-vilab / VGen

Official repo for VGen: a holistic video generation ecosystem for video generation building on diffusion models

Home Page: https://i2vgen-xl.github.io

t2v inference

xjxu21 opened this issue · comments

Hi, thanks for sharing the code and model.

I am trying to do some t2v inference with this codebase. I downloaded the t2v model text2video_pytorch_model.pth from ModelScope and modified the YAML config. Then I ran python inference.py --cfg configs/t2v_infer.yaml, but the results seem abnormal.

Is this model incompatible with the current codebase? If so, could you please give me a link to the right t2v model?

Thank you.

There are some differences. You may need to modify the Diffusion settings in t2v_train.yaml:

Diffusion: {
    'type': 'DiffusionDDIM',
    'schedule': 'linear_sd', # Stable Diffusion-style linear beta schedule (alternative: 'cosine')
    'schedule_param': {
        'num_timesteps': 1000,
        'init_beta': 0.00085,
        'last_beta': 0.0120,
        'zero_terminal_snr': False,
    },
    'mean_type': 'eps', # epsilon (noise) prediction rather than v-prediction
    'loss_type': 'mse',
    'var_type': 'fixed_small',
    'rescale_timesteps': False,
    'noise_strength': 0.0
}

Just replace your Diffusion settings with the block above. I haven't verified it yet, but you can give it a try. Thanks.
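
If the results are still abnormal after the change, a quick sanity check is to load the downloaded checkpoint and inspect its keys, to confirm the file really is a UNet state dict and not a wrapper. This is only a minimal sketch, assuming torch is installed and the ModelScope file was saved as models/text2video_pytorch_model.pth:

    # Minimal sketch: inspect the downloaded t2v checkpoint before pointing the config at it.
    # Assumes the file was saved to models/text2video_pytorch_model.pth (adjust the path if not).
    import torch

    state = torch.load("models/text2video_pytorch_model.pth", map_location="cpu")
    # Some releases wrap the weights under a 'state_dict' key; unwrap if present.
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]

    print(f"{len(state)} tensors")
    for name, tensor in list(state.items())[:10]:
        print(name, tuple(tensor.shape))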

It works, thank you!

It doesn't work for me.

Hey xjxu21, I have created my own workspace folder, following what is set in t2v_train.yaml:

[screenshot]

I have even added the model files that were previously missing:

[screenshot]

And I have applied Steven's suggestions accordingly:

[screenshot]

But no luck: it still does not output anything. Do you mind helping me? Thank you in advance.

Running Inference

How do I resolve this? I've downloaded text2video_pytorch_model.pth and open_clip_pytorch_model, but I get:
Exception: Failed to invoke function <function inference_text2video_entrance at 0x7f96796578b0>, with Failed to init class <class 'tools.modules.autoencoder.AutoencoderKL'>, with [Errno 2] No such file or directory: 'models/v2-1_512-ema-pruned.ckpt'
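
In case it helps, below is a sketch for fetching the two auxiliary checkpoints the config expects (models/v2-1_512-ema-pruned.ckpt for the autoencoder and models/open_clip_pytorch_model.bin for the ViT-H-14 embedder). The Hugging Face repo IDs are assumptions inferred from the file names; the ModelScope links in the VGen README remain the authoritative source:

    # Sketch: download the auxiliary weights the config expects into models/.
    # The Hugging Face repo IDs below are assumptions based on the file names;
    # check the VGen README (ModelScope) for the officially published checkpoints.
    from huggingface_hub import hf_hub_download

    hf_hub_download(
        repo_id="stabilityai/stable-diffusion-2-1-base",   # assumed source of the SD 2.1 VAE checkpoint
        filename="v2-1_512-ema-pruned.ckpt",
        local_dir="models",
    )
    hf_hub_download(
        repo_id="laion/CLIP-ViT-H-14-laion2B-s32B-b79K",   # assumed source of the OpenCLIP ViT-H/14 weights
        filename="open_clip_pytorch_model.bin",
        local_dir="models",
    )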

I have what seems like a similar issue with open_clip_pytorch_model. Is there an updated fix? Where do I find these weights? Which YAML files are currently supported and expected to run, and which are for models that have not been released?

/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
[2024-05-16 23:35:08,863] INFO: {'name': 'Config: VideoLDM Decoder', 'mean': [0.5, 0.5, 0.5], 'std': [0.5, 0.5, 0.5], 'max_words': 1000, 'num_workers': 6, 'prefetch_factor': 2, 'resolution': [448, 256], 'vit_out_dim': 1024, 'vit_resolution': [224, 224], 'depth_clamp': 10.0, 'misc_size': 384, 'depth_std': 20.0, 'frame_lens': [1, 16, 16, 16, 16, 32, 32, 32], 'sample_fps': [1, 8, 16, 16, 16, 8, 16, 16], 'vid_dataset': {'type': 'VideoDataset', 'data_list': ['data/vid_list.txt'], 'max_words': 1000, 'resolution': [448, 256], 'data_dir_list': ['data/videos/'], 'vit_resolution': [224, 224], 'get_first_frame': True}, 'img_dataset': {'type': 'ImageDataset', 'data_list': ['data/img_list.txt'], 'max_words': 1000, 'resolution': [448, 256], 'data_dir_list': ['data/images'], 'vit_resolution': [224, 224]}, 'batch_sizes': {'1': 32, '4': 8, '8': 4, '16': 4, '32': 2}, 'Diffusion': {'type': 'DiffusionDDIM', 'schedule': 'linear_sd', 'schedule_param': {'num_timesteps': 1000, 'init_beta': 0.00085, 'last_beta': 0.012, 'zero_terminal_snr': True}, 'mean_type': 'v', 'loss_type': 'mse', 'var_type': 'fixed_small', 'rescale_timesteps': False, 'noise_strength': 0.1, 'ddim_timesteps': 50}, 'ddim_timesteps': 50, 'use_div_loss': False, 'p_zero': 0.1, 'guide_scale': 9.0, 'vit_mean': [0.48145466, 0.4578275, 0.40821073], 'vit_std': [0.26862954, 0.26130258, 0.27577711], 'sketch_mean': [0.485, 0.456, 0.406], 'sketch_std': [0.229, 0.224, 0.225], 'hist_sigma': 10.0, 'scale_factor': 0.18215, 'use_checkpoint': True, 'use_sharded_ddp': False, 'use_fsdp': False, 'use_fp16': True, 'temporal_attention': True, 'UNet': {'type': 'UNetSD_TFT2V', 'in_dim': 4, 'dim': 320, 'y_dim': 1024, 'context_dim': 1024, 'out_dim': 4, 'dim_mult': [1, 2, 4, 4], 'num_heads': 8, 'head_dim': 64, 'num_res_blocks': 2, 'attn_scales': [1.0, 0.5, 0.25], 'dropout': 0.1, 'temporal_attention': True, 'temporal_attn_times': 1, 'use_checkpoint': True, 'use_fps_condition': False, 'use_sim_mask': False, 'config': 'None', 'num_tokens': 4, 'upper_len': 128, 'default_fps': 8, 'misc_dropout': 0.4}, 'guidances': [], 'auto_encoder': {'type': 'AutoencoderKL', 'ddconfig': {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'video_kernel_size': [3, 1, 1]}, 'embed_dim': 4, 'pretrained': 'models/v2-1_512-ema-pruned.ckpt'}, 'embedder': {'type': 'FrozenOpenCLIPTextVisualEmbedder', 'layer': 'penultimate', 'pretrained': 'models/open_clip_pytorch_model.bin', 'vit_resolution': [224, 224]}, 'ema_decay': 0.9999, 'num_steps': 1000000, 'lr': 3e-05, 'weight_decay': 0.0, 'betas': [0.9, 0.999], 'eps': 1e-08, 'chunk_size': 2, 'decoder_bs': 2, 'alpha': 0.7, 'save_ckp_interval': 50, 'warmup_steps': 10, 'decay_mode': 'cosine', 'use_ema': False, 'load_from': None, 'Pretrain': {'type': 'pretrain_specific_strategies', 'fix_weight': False, 'grad_scale': 0.5, 'resume_checkpoint': 'workspace/model_bk/model_scope_0267000.pth', 'sd_keys_path': 'data/stable_diffusion_image_key_temporal_attention_x1.json'}, 'viz_interval': 5, 'visual_train': {'type': 'VisualTrainTextImageToVideo', 'partial_keys': [['y', 'fps']], 'use_offset_noise': False, 'guide_scale': 9.0}, 'visual_inference': {'type': 'VisualGeneratedVideos'}, 'inference_list_path': '', 'log_interval': 1, 'log_dir': 'workspace/experiments/text_list_for_tft2v', 'seed': 888, 'negative_prompt': 'Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms', 
'ENABLE': True, 'DATASET': 'webvid10m', 'TASK_TYPE': 'inference_tft2v_entrance', 'max_frames': 16, 'target_fps': 16, 'scale': 8, 'batch_size': 1, 'use_zero_infer': True, 'round': 1, 'test_list_path': 'data/text_list_for_tft2v.txt', 'vldm_cfg': 'configs/t2v_train.yaml', 'positive_prompt': ', cinematic, High Contrast, highly detailed, no blur, 4k render', 'test_model': 'models/tft2v_t2v_non_ema_512000.pth', 'video_compositions': ['text', 'image'], 'cfg_file': 'configs/tft2v_t2v_infer.yaml', 'init_method': 'tcp://localhost:9999', 'debug': False, 'opts': [], 'pmi_rank': 0, 'pmi_world_size': 1, 'gpus_per_machine': 1, 'world_size': 1, 'noise_strength': 0.1, 'gpu': 0, 'rank': 0, 'log_file': 'workspace/experiments/text_list_for_tft2v/log_00.txt'}
[2024-05-16 23:35:09,826] INFO: Going into inference_text2video_entrance inference on 0 gpu
[2024-05-16 23:35:09,847] INFO: Loading ViT-H-14 model config.
[2024-05-16 23:35:22,084] WARNING: Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.
[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 62, in build_from_config
[rank0]: return req_type_entry(**cfg)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/modules/clip_embedder.py", line 158, in init
[rank0]: model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 151, in create_model_and_transforms
[rank0]: model = create_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 122, in create_model
[rank0]: raise RuntimeError(f'Pretrained weights ({pretrained}) not found for model {model_name}.')
[rank0]: RuntimeError: Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 67, in build_from_config
[rank0]: return req_type_entry(**cfg)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/inferences/inference_tft2v_entrance.py", line 74, in inference_tft2v_entrance
[rank0]: worker(0, cfg, cfg_update)
[rank0]: File "/content/drive/MyDrive/creative/VGen/tools/inferences/inference_tft2v_entrance.py", line 139, in worker
[rank0]: clip_encoder = EMBEDDER.build(cfg.embedder)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 107, in build
[rank0]: return self.build_func(*args, **kwargs, registry=self)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry_class.py", line 7, in build_func
[rank0]: return build_from_config(cfg, registry, **kwargs)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 64, in build_from_config
[rank0]: raise Exception(f"Failed to init class {req_type_entry}, with {e}")
[rank0]: Exception: Failed to init class <class 'tools.modules.clip_embedder.FrozenOpenCLIPTextVisualEmbedder'>, with Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/content/drive/MyDrive/creative/VGen/inference.py", line 18, in
[rank0]: INFER_ENGINE.build(dict(type=cfg_update.TASK_TYPE), cfg_update=cfg_update.cfg_dict)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 107, in build
[rank0]: return self.build_func(*args, **kwargs, registry=self)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry_class.py", line 7, in build_func
[rank0]: return build_from_config(cfg, registry, **kwargs)
[rank0]: File "/content/drive/MyDrive/creative/VGen/utils/registry.py", line 69, in build_from_config
[rank0]: raise Exception(f"Failed to invoke function {req_type_entry}, with {e}")
[rank0]: Exception: Failed to invoke function <function inference_tft2v_entrance at 0x7b6d423d5ea0>, with Failed to init class <class 'tools.modules.clip_embedder.FrozenOpenCLIPTextVisualEmbedder'>, with Pretrained weights (models/open_clip_pytorch_model.bin) not found for model ViT-H-14.
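
For reference, the RuntimeError above is raised by open_clip: when the pretrained string is not a recognised tag it is treated as a local file path, and the build fails if that file does not exist. A small pre-flight check (a sketch; the paths are the ones the config dump above references, adjust to your setup) can catch this before the registry builds anything:

    # Sketch: verify that the local weight files referenced in the merged config exist
    # before launching inference. Paths are taken from the log above.
    import os

    required = [
        "models/v2-1_512-ema-pruned.ckpt",       # auto_encoder.pretrained
        "models/open_clip_pytorch_model.bin",    # embedder.pretrained
        "models/tft2v_t2v_non_ema_512000.pth",   # test_model
    ]
    missing = [p for p in required if not os.path.isfile(p)]
    if missing:
        raise SystemExit(f"Missing weight files: {missing}")
    print("All weight files found.")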