jolibrain / joliGEN

Generative AI Image Toolset with GANs and Diffusion for Real-World Applications

Home Page: https://www.joligen.com


Problem running docker

YoannRandon opened this issue

Hi,
I have some questions about how to build the Dockerfiles.
As a first step I tried to build both the "Dockerfile.build" and "Dockerfile.server" files. The "build" one builds correctly, however when I try to run it, it exits immediately. Is that normal?
Moreover, I can't build the Dockerfile.server because of credentials. I have the credentials, but I don't know where to put them, and if I try to connect using the URL "https://docker.joligan.com/v2/joligan_build/manifests/latest" I end up with:
"{"errors":[{"code":"MANIFEST_UNKNOWN","message":"manifest unknown","detail":{"Tag":"latest"}}]}"
Can you help me build and run these Dockerfiles correctly?

In fact I'm not that interested in the server. I would like to know if it's possible to just build the Dockerfile and do inference using models downloaded from "https://confiance.joligan.com/#/models" (it's the joligan server, so I can already get models from there). I am more interested in Dockerfile.build and how to run it. Thanks

I would like to know if it's possible to just build the Dockerfile and do inference using models

Yes you can do this, though the build docker is not exactly designed for this, as follows:

nvidia-docker run -v /path/to/models/:/models/ -v /path/to/images/:/images/ --rm --gpus all -it --entrypoint bash jolibrain/joligan_build

This gets you a running docker with a root user inside it. The -v mounts the local path to models to /models/ inside the docker, and the path to images to /images/ inside the docker.

From there you can run inference, e.g.

cd scripts
python3 gen_single_image.py --model-in-file /models/xxx/latest_net_G_A.pth --img-in /images/xxx.png --img-out /path/to/out/image.png

I resolved my problem running Dockerfile.build by adding "tail -f /dev/null" after docker run (see the sketch below).
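For reference, a minimal sketch of that workaround (the paths and image tag follow the earlier example; the entrypoint override is an assumption):

nvidia-docker run -d -v /path/to/models/:/models/ -v /path/to/images/:/images/ --gpus all --entrypoint tail jolibrain/joligan_build -f /dev/null

You can then enter the running container with docker exec -it <container> bash.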
One problem remains: I tried to launch an inference using a model from the joligan server with the command:
"
python3 gen_single_image.py
--model-in-file /app/pretrained_weights_models/bdd100k_weather_det_clear2snowy_mm1/latest_net_G_A.pth
--img-size 512
--img-in /app/sample_bdd100k_img/8221f03e-7a27e32f.jpg
--img-out 8221f03e-7a27e32f_snowy.jpg
--gpuid 1
"
I received a cuda error: "invalid device ordinal", but this error seems to come from my docker/cuda setup.
So, to avoid it, I tried to use the CPU instead (for inference) by replacing "--gpuid" with "--cpu", according to the gen_single_image.py argument description, but it returns 'name "device" is not defined'.

but it returns 'name "device" is not defined'.

This is a bug, I just fixed it on master, see bb3c70c
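For context, the failure pattern is typical of a device variable that is only set on the GPU path. An illustrative sketch of the kind of fix (an assumption, not the literal commit):

import torch

# define the device in both branches so later calls like model.to(device) always see it
if args.cpu:
    device = torch.device("cpu")
else:
    device = torch.device("cuda:" + str(args.gpuid))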

I received a cuda error: "invalid device ordinal", but this error seems to come from my docker/cuda setup.

Try nvidia-smi and make sure you have two GPUs available, since you are asking for GPU 1 (0 is the first one).
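For example, to check what PyTorch itself sees, you can run a quick snippet inside the container:

import torch

# --gpuid must be strictly less than the number of visible devices
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))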

Hi, I still have some problems with the gen_single_image.py script. I did build the Dockerfile, and when I tried the command:

" python3 gen_single_image.py --model-in-file /app/pretrained_weights_models/bdd100k_weather_det_clear2snowy_mm1/latest_net_G_A.pth --img-in /app/sample_bdd100k_img/val/8221f03e-7a27e32f.jpg --img-out 8221f03e-7a27e32f_snowy.jpg --gpuid 1"

I got the following error:

"Traceback (most recent call last):
File "gen_single_image.py", line 60, in
model, opt = load_model(modelpath, os.path.basename(args.model_in_file), device)
File "gen_single_image.py", line 28, in load_model
opt = TrainOptions().parse_json(train_json)
File "/app/scripts/../options/base_options.py", line 925, in parse_json
self._json_parse_known_args(parser, opt, flat_json)
File "/app/scripts/../options/base_options.py", line 882, in _json_parse_known_args
raise ValueError(
ValueError: data_online_creation_mask_delta_A: Bad type (<class 'int'>, should be list of <class 'int'>)"

I already replaced "cut_semantic_mask" with "cut" in the train_config.json of the model "bdd100k_weather_det_clear2snowy_mm1" downloaded from the joligan server. It seems the problem comes from base_options.py, but I can't find what to change.

I think the problem may come from the train_config.json; I'll put it below.

I received a cuda error: "invalid device ordinal", but this error seems to come from my docker/cuda setup.

Try nvidia-smi and make sure you have two GPUs available, since you are asking for GPU 1 (0 is the first one).

I already checked, and both GPUs are shown in nvidia-smi. For an unknown reason, the problem seems to have resolved itself, but another one occurred. The error mentioned just before concerns "data_online_creation_mask_delta_A".

train_config.json

{
"D": {
"dropout": false,
"n_layers": 3,
"ndf": 64,
"netDs": [
"projected_d",
"basic",
"vision_aided"
],
"no_antialias": false,
"no_antialias_up": false,
"norm": "instance",
"proj_config_segformer": "models/configs/segformer/segformer_config_b0.py",
"proj_interp": 512,
"proj_network_type": "vitsmall",
"proj_weight_segformer": "models/configs/segformer/pretrain/segformer_mit-b0.pth",
"spectral": false,
"temporal_every": 4,
"temporal_frame_step": 30,
"temporal_num_common_char": -1,
"temporal_number_frames": 5,
"vision_aided_backbones": "clip+dino"
},
"G": {
"attn_nb_mask_attn": 10,
"attn_nb_mask_input": 1,
"backward_compatibility_twice_resnet_blocks": false,
"config_segformer": "models/configs/segformer/segformer_config_b0.py",
"dropout": false,
"netE": "resnet_512",
"netG": "segformer_attn_conv",
"ngf": 64,
"norm": "instance",
"padding_type": "reflect",
"spectral": false,
"stylegan2_num_downsampling": 1
},
"alg": {
"cut": {
"flip_equivariance": false,
"lambda_GAN": 1.0,
"lambda_NCE": 1.0,
"nce_T": 0.07,
"nce_idt": true,
"nce_includes_all_negatives_from_minibatch": false,
"nce_layers": "0,4,8,12,16",
"netF": "mlp_sample",
"netF_dropout": false,
"netF_nc": 256,
"netF_norm": "instance",
"num_patches": 256
},
"cyclegan": {},
"re": {
"P_lr": 0.0002,
"adversarial_loss_p": false,
"netP": "unet_128",
"no_train_P_fake_images": false,
"nuplet_size": 3,
"projection_threshold": 1.0
}
},
"data": {
"online_creation": {
"crop_delta_A": 64,
"crop_delta_B": 64,
"crop_size_A": 512,
"crop_size_B": 512,
"mask_delta_A": 0,
"mask_delta_B": 0,
"mask_square_A": false,
"mask_square_B": false
},
"crop_size": 512,
"dataset_mode": "unaligned_labeled_mask_online",
"direction": "AtoB",
"load_size": 512,
"max_dataset_size": 1000000000,
"num_threads": 4,
"online_context_pixels": 0,
"preprocess": "resize_and_crop",
"relative_paths": false,
"sanitize_paths": false,
"serial_batches": false
},
"f_s": {
"all_classes_as_one": false,
"class_weights": [
1,
10,
10,
1,
5,
5,
10,
10,
30,
50,
50
],
"config_segformer": "models/configs/segformer/segformer_config_b0.py",
"dropout": false,
"net": "segformer",
"nf": 64,
"semantic_nclasses": 11,
"semantic_threshold": 1.0,
"weight_segformer": ""
},
"output": {
"display": {
"G_attention_masks": false,
"diff_fake_real": false,
"env": "bdd100k_weather_det_clear2snowy_mm1",
"freq": 200,
"id": 1,
"ncols": 4,
"networks": false,
"port": 8097,
"server": "http://localhost",
"winsize": 256
},
"no_html": false,
"print_freq": 200,
"update_html_freq": 1000,
"verbose": false
},
"model": {
"init_gain": 0.02,
"init_type": "normal",
"input_nc": 3,
"multimodal": true,
"output_nc": 3
},
"train": {
"sem": {
"cls_B": false,
"cls_pretrained": false,
"cls_template": "basic",
"idt": true,
"l1_regression": false,
"lambda": 1.0,
"lr_f_s": 0.0002,
"net_output": false,
"regression": false,
"use_label_B": true
},
"mask": {
"charbonnier_eps": 1e-06,
"disjoint_f_s": false,
"f_s_B": true,
"for_removal": false,
"lambda_out_mask": 10.0,
"loss_out_mask": "L1",
"no_train_f_s_A": false,
"out_mask": false
},
"D_accuracy_every": 1000,
"D_lr": 0.0001,
"G_ema": true,
"G_ema_beta": 0.999,
"G_lr": 0.0002,
"batch_size": 2,
"beta1": 0.9,
"beta2": 0.999,
"compute_D_accuracy": false,
"compute_fid": false,
"compute_fid_val": false,
"continue": false,
"epoch": "latest",
"epoch_count": 1,
"fid_every": 1000,
"gan_mode": "lsgan",
"iter_size": 4,
"load_iter": 0,
"lr_decay_iters": 50,
"lr_policy": "linear",
"mm_lambda_z": 0.5,
"mm_nz": 16,
"n_epochs": 100,
"n_epochs_decay": 100,
"nb_img_max_fid": 1000000000,
"optim": "adam",
"pool_size": 50,
"save_by_iter": false,
"save_epoch_freq": 1,
"save_latest_freq": 5000,
"use_contrastive_loss_D": false
},
"dataaug": {
"APA": false,
"APA_every": 4,
"APA_nimg": 50,
"APA_p": 0,
"APA_target": 0.6,
"D_label_smooth": false,
"D_noise": 0.01,
"affine": 0.0,
"affine_scale_max": 1.2,
"affine_scale_min": 0.8,
"affine_shear": 45,
"affine_translate": 0.2,
"diff_aug_policy": "",
"diff_aug_proba": 0.5,
"imgaug": false,
"no_flip": false,
"no_rotate": true
},
"checkpoints_dir": "/data1/confiance_platform/checkpoints/",
"dataroot": "/data1/confiance/datasets/bdd100k_weather_clear2snowy/",
"ddp_port": "13456",
"gpu_ids": "2",
"model_type": "cut",
"name": "bdd100k_weather_det_clear2snowy_mm1",
"phase": "train",
"suffix": "",
"warning_mode": false
}

The error mentioned just before concerns "data_online_creation_mask_delta_A"

This is because the option has changed; you can fix it easily by editing the train_config.json file to set:

"mask_delta_A": [0],
"mask_delta_B": [0]

We had to do it ourselves on other models as well.
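If you prefer to patch the file programmatically, here is a minimal sketch (the model path is an assumption):

import json

path = "/models/bdd100k_weather_det_clear2snowy_mm1/train_config.json"  # assumed location
with open(path) as f:
    conf = json.load(f)
oc = conf["data"]["online_creation"]
for key in ("mask_delta_A", "mask_delta_B"):
    if isinstance(oc[key], int):
        oc[key] = [oc[key]]  # int -> single-element list, as the new option expects
with open(path, "w") as f:
    json.dump(conf, f, indent=4)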

[screenshot: cuda_error]

My problem with cuda is not gone; I think it comes from my setup, even though I built the Dockerfile.
I tested whether my GPUs were available by adding the following lines to the gen_single_image.py script:

"
modelpath = args.model_in_file.replace(os.path.basename(args.model_in_file), "")
print("modelpath=", modelpath)
use_cuda = torch.cuda.is_available()
print("cuda device is availaible :",use_cuda)
GPUtil.getAvailable()
"

and it seems that it's OK. If you have already seen something like this, can you give me a tip to correct it? Thanks

I have the same error when I use the "--cpu" argument.

If you haven't done so yet, you should rebuild the docker image so that it runs the latest code. Or you can patch it from within the docker, whichever you prefer.
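For example, from the directory containing the Dockerfile (the tag is an assumption; use whatever tag you run):

docker build -t jolibrain/joligan_build -f Dockerfile.build .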

can you give me a tip to correct it

First, make sure nvidia-smi works correctly from inside the docker, and look at the list of GPUs.

Try export CUDA_VISIBLE_DEVICES=1, and then use --gpuid 0. You may have to set the env variable in the Dockerfile as well...
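For example, inside the container:

# expose only the second physical GPU; PyTorch then sees it as cuda:0
export CUDA_VISIBLE_DEVICES=1
python3 gen_single_image.py --gpuid 0 ...

or bake it into the Dockerfile (an assumption, depending on your setup):

ENV CUDA_VISIBLE_DEVICES=1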

Hi,

I checked several things and I still can't find why I get this cuda error: invalid device ordinal.
nvidia-smi works well, and I can get my GPU names and ids with torch.

Using "export CUDA_VISIBLE_DEVICES=1" didn't solve the problem.
I also tried changing module versions, but I still get the same error.
The error also occurs when I use the "--cpu" argument of gen_single_image.py.
I currently use:

python 3.9.13
torch 1.12.1+cu116
torchvision 0.13.1+cu116
cuda version (nvidia-smi): 11.8

May I know what your config is when you run the gen_single_image.py script?
I'll try to reproduce it. Thanks


Training works fine; it seems the problem only occurs during inference.

Hi @YoannRandon ,
#322 should solve your issue. Please let us know if you still have any problems.

@YoannRandon you need to rebuild your docker though.