Hanjun-Dai / GLN

Implementation of Retrosynthesis Prediction with Conditional Graph Logic Network

Segmentation fault

fengjiaxin opened this issue · comments

Hi, excuse me.
I hit another issue when training the model: a segmentation fault (core dumped).
Could you update the code? I have no idea how to solve this problem.

Also:
I think GLN/gln/mods/mol_gnn/gnn_family/utils.py could be updated by replacing .cuda() with .to(DEVICE), roughly as in the sketch below.
Thanks a lot.
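
(A minimal sketch of what I mean; DEVICE here is a hypothetical module-level setting derived from the -gpu flag, not the repo's actual configuration.)

```python
import torch

# Hypothetical module-level device, picked once from a -gpu style flag.
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def to_device(t):
    # Drop-in replacement for hard-coded t.cuda() calls, so the same code
    # path works on both CPU-only and GPU machines.
    return t.to(DEVICE)
```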

Could you please provide more details about the segfault?

./run_mf.sh: line 60:  9301 Segmentation fault (core dumped) python ../main.py -gm $gm -fp_degree 2 -neg_sample $neg_sample -att_type $att_type -gnn_out $gnn_out -tpl_enc $tpl_enc -subg_enc $subg_enc -latent_dim $msg_dim -bn $bn -gen_method $gen -retro_during_train $retro -neg_num $neg_size -embed_dim $embed_dim -readout_agg_type $graph_agg -act_func $act -act_last True -max_lv $lv -dropbox $dropbox -data_name $data_name -save_dir $save_dir -tpl_name $tpl_name -f_atoms $dropbox/cooked_$data_name/atom_list.txt -iters_per_val 3000 -gpu 1 -topk 50 -beam_size 50 -num_parts 1

There is no other information. I don't think it's an environment issue.

Are you able to run the test with the existing model dumps?

And did you modify the script?

I use -gpu 0 in the script. Please try with the vanilla code and see if that works.

I got another issue: a GPU CUDA error.
Was the checkpoint file saved on GPU?

I use -gpu 1. Did you save the model on GPU 0? When I run the test script I get the following error:

Traceback (most recent call last):
  File "main_test.py", line 139, in <module>
    model = RetroGLN(cmd_args.dropbox, local_args.model_for_test)
  File "/home/fengjiaxin/GLN/gln/test/model_inference.py", line 43, in __init__
    self.gln.load_state_dict(torch.load(model_file))
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 613, in _load
    result = unpickler.load()
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 576, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 155, in default_restore_location
    result = fn(storage, location)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 135, in _cuda_deserialize
    return storage_type(obj.size())
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/cuda/__init__.py", line 634, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

Yes, it uses the GPU by default. Please always use -gpu 0 in your script.
If you want to change the GPU, please use CUDA_VISIBLE_DEVICES instead.
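
With CUDA_VISIBLE_DEVICES the process only sees the GPUs you list, so physical GPU 1 shows up as cuda:0 and -gpu 0 inside the script keeps working (e.g. CUDA_VISIBLE_DEVICES=1 ./run_mf.sh). A minimal sketch of the same restriction from Python, assuming the machine has at least two GPUs:

```python
import os

# Must be set before torch initializes CUDA; physical GPU 1 is then
# exposed to this process as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())    # 1 visible device
print(torch.cuda.current_device())  # 0, which maps to physical GPU 1
```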

Hi, I debugged the code. The error occurs at GLN/gln/graph_logic/soft_logic.py line 29,
in jagged_forward: graph_embed = graph_enc(list).
There is no other information.
Could you briefly introduce your code? I cannot find the error.
Thanks.

Could you provide a Docker image? I think it would be useful.

graph_enc is from another sub-package in this repo.

Can you first try without the GPU? Please take a look at this:
https://discuss.pytorch.org/t/on-a-cpu-device-how-to-load-checkpoint-saved-on-gpu-device/349

It shows how to load a GPU dump onto the CPU.
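
In case it helps, the usual pattern from that thread is to remap the checkpoint to CPU at load time via map_location; a minimal sketch (the checkpoint path below is just a placeholder):

```python
import torch

model_file = "path/to/model_dump.ckpt"  # placeholder path for illustration

# map_location remaps every CUDA tensor in the checkpoint to CPU during
# deserialization, so loading allocates no GPU memory.
state_dict = torch.load(model_file, map_location=torch.device("cpu"))

# The model can then be moved to whichever device is actually available:
# model.load_state_dict(state_dict); model.to(device)
```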

Hi, I debugged the training file and the test file and got the same error; it is not a CUDA error.
Would you briefly introduce your code? Thanks.

If the error is happening on that line, you may want to double check
https://github.com/Hanjun-Dai/GLN/blob/master/gln/mods/mol_gnn/gnn_family/utils.py#L64

Note that different graph NN implementations override this function.
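
One generic way to narrow it down (a debugging sketch, not GLN-specific; report_devices is a hypothetical helper) is to print where the encoder's parameters and its inputs live right before the failing call, since CPU/GPU mismatches often surface as opaque crashes:

```python
import torch

def report_devices(module, *tensors):
    # Hypothetical helper: list the devices of a module's parameters and of
    # the tensors about to be fed into it, to spot CPU/GPU mismatches.
    print("parameter devices:", {str(p.device) for p in module.parameters()})
    for i, t in enumerate(tensors):
        if torch.is_tensor(t):
            print(f"input {i} device: {t.device}")
```

Calling something like report_devices(graph_enc, ...) right before the graph_embed = graph_enc(...) line in jagged_forward should show whether everything sits on the same device.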