Why does the program run on the CPU?

Question

Why does the program run on the CPU?

Pro-flynn opened this issue 5 years ago · comments

Shufflewave commented 5 years ago

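When I run the program on my own data, why does it only run on the CPU? I see 550% CPU usage, while the GPU uses only 150M.
I hope someone who has hit this pitfall before can give me a hint. Thanks!!

Below are the settings in my /scripts/train.sh: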

set -x
set -e
export CUDA_VISIBLE_DEVICES=0
IMG_PER_GPU=32

TRAIN_DIR=$/pixel_link_info

OLD_IFS="$IFS"
IFS=","
gpus=($CUDA_VISIBLE_DEVICES)
IFS="$OLD_IFS"
NUM_GPUS=${#gpus[@]}

BATCH_SIZE=$(expr $NUM_GPUS \* $IMG_PER_GPU)

DATASET=thaiid
DATASET_DIR=$/tmp

CUDA_VISIBLE_DEVICES=0 python train_pixel_link.py \
    --train_dir=${TRAIN_DIR} \
    --num_gpus=${NUM_GPUS} \
    --learning_rate=1e-3 \
    --gpu_memory_fraction=-1 \
    --train_image_width=512 \
    --train_image_height=512 \
    --batch_size=${BATCH_SIZE} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=${DATASET} \
    --dataset_split_name=train \
    --max_number_of_steps=100 \
    --checkpoint_path=${CKPT_PATH} \
    --using_moving_average=1 \
    2>&1 | tee -a ${TRAIN_DIR}/log.log
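
For reference, the GPU count and batch size arithmetic in the script above can be sketched in Python. This is only an illustration of what the shell lines compute; the function name `batch_size_from_devices` is hypothetical and is not part of pixel_link.

```python
def batch_size_from_devices(visible_devices, img_per_gpu):
    """Mirror the script: split CUDA_VISIBLE_DEVICES on commas, then multiply.

    Returns (num_gpus, batch_size). Hypothetical helper for illustration only.
    """
    # Each non-empty comma-separated entry counts as one visible GPU.
    gpus = [d for d in visible_devices.split(",") if d.strip()]
    num_gpus = len(gpus)
    return num_gpus, num_gpus * img_per_gpu


if __name__ == "__main__":
    # Values taken from the script: CUDA_VISIBLE_DEVICES=0, IMG_PER_GPU=32.
    num_gpus, batch_size = batch_size_from_devices("0", 32)
    print(f"num_gpus={num_gpus} batch_size={batch_size}")
```

With the values from the script, one visible device and 32 images per GPU, the batch size passed to the trainer is 32.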

JiSheng · Answer 1 · Thu Aug 01 2019 15:36:31 GMT+0800 (China Standard Time)

How did you solve this problem? Please help me.

Shufflewave · Answer 2 · Thu Aug 01 2019 17:26:02 GMT+0800 (China Standard Time)

I have solved this issue. To solve it, check whether CUDA and cuDNN are set up in your bashrc.
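
A quick way to run that check is to see whether the environment that bashrc produces actually contains CUDA directories. The sketch below is a minimal example, assuming a typical Linux install where the CUDA (and usually cuDNN) libraries live under a directory whose name contains "cuda", such as /usr/local/cuda. The helper name `cuda_path_entries` is hypothetical.

```python
import os


def cuda_path_entries(env=None):
    """Return the PATH and LD_LIBRARY_PATH entries that mention 'cuda'.

    Assumption: the CUDA install directory has 'cuda' in its name.
    An empty list for LD_LIBRARY_PATH suggests the libraries may not be
    visible to the process, which is worth ruling out before training.
    """
    env = os.environ if env is None else env
    found = {}
    for var in ("PATH", "LD_LIBRARY_PATH"):
        entries = [p for p in env.get(var, "").split(os.pathsep) if p]
        found[var] = [p for p in entries if "cuda" in p.lower()]
    return found


if __name__ == "__main__":
    for var, entries in cuda_path_entries().items():
        status = ", ".join(entries) if entries else "no CUDA entry found"
        print(f"{var}: {status}")
```

Run it from the same shell (and the same conda environment, if any) that launches train.sh, since that is the environment the training process inherits.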

--- Original message --- Subject: Re: [ZJULearning/pixel_link] Why does the program run on the CPU? (#138)

JiSheng · Answer 3 · Thu Aug 01 2019 20:42:20 GMT+0800 (China Standard Time)

@Pro-xiaowen Are you using a conda environment? In my case, the code uses the full 20 GB of GPU memory and also fully uses the CPU, and I don't know why.

Shufflewave · Answer 4 · Tue Aug 06 2019 18:32:38 GMT+0800 (China Standard Time)

Sorry, I misspelled cuDNN in my earlier reply. I suggest you check your CUDA and cuDNN environment. Since your experiments are using the full GPU, your setup is probably fine.

JiSheng · Answer 5 · Tue Aug 06 2019 19:26:15 GMT+0800 (China Standard Time)

@Pro-xiaowen The problem is that it is also fully using my CPU. Have you faced this problem?

And I have another question: after about 3000 iterations, my loss is around 0.5 - 0.6 (the pretrained model is PixelLink VGG 2s) and it is not dropping any further. @@!
Have you trained successfully?
Could you give me some advice or ideas for getting out of this situation?

Shufflewave · Answer 6 · Tue Aug 06 2019 20:27:52 GMT+0800 (China Standard Time)

I do not know why the CPU was fully used in your experiments. As for the loss, you can try different learning rate settings, for example exponential_decay of the learning rate. A different optimizer may also give different performance. After those adjustments, the loss converged to around 0.4 when I trained PixelLink on my own data for 40K iterations, and the recall on the test data is 97% under the ICDAR 2013 evaluation criteria. You can try those adjustments.
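
To make the exponential decay suggestion concrete, here is a minimal sketch of the schedule in plain Python. It follows the usual formula lr = base_lr * decay_rate ^ (step / decay_steps). The base rate 1e-3 comes from the train.sh above, while `decay_steps=10000` and `decay_rate=0.5` are illustrative assumptions, not values reported in this thread.

```python
def exponential_decay(base_lr, step, decay_steps, decay_rate, staircase=False):
    """Exponentially decayed learning rate at a given training step.

    With staircase=True the rate drops in discrete jumps every decay_steps
    steps; otherwise it decays smoothly.
    """
    if staircase:
        exponent = step // decay_steps  # integer number of completed intervals
    else:
        exponent = step / decay_steps
    return base_lr * decay_rate ** exponent


if __name__ == "__main__":
    # decay_steps and decay_rate are assumed values for illustration.
    for step in (0, 10000, 20000, 40000):
        lr = exponential_decay(1e-3, step, decay_steps=10000, decay_rate=0.5)
        print(f"step {step}: lr = {lr:.6g}")
```

A schedule like this keeps the early steps at the original rate and shrinks later updates, which is one common thing to try when the loss plateaus.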

JiSheng · Answer 7 · Fri Aug 09 2019 14:41:41 GMT+0800 (China Standard Time)

@Pro-xiaowen My loss is now around 0.4, but the detected boxes come out empty and I don't know why @@!

JiSheng · Answer 8 · Mon Aug 12 2019 12:04:06 GMT+0800 (China Standard Time)

@Pro-xiaowen Can you share how you set up the dataset? I think the problem may come from my dataset.

Shufflewave · Answer 9 · Mon Aug 12 2019 15:15:41 GMT+0800 (China Standard Time)

I modeled my tfrecords conversion on synthtext_to_tfrecords.py. You can give it a try.

JiSheng · Answer 10 · Mon Aug 12 2019 16:16:44 GMT+0800 (China Standard Time)

@Pro-xiaowen Thank you for sharing. I will try it.

Shufflewave · Answer 11 · Mon Aug 12 2019 19:11:55 GMT+0800 (China Standard Time)

You're welcome. About the empty prediction results, I suggest that you train for more epochs. In my experiment, this model outputs valid boxes after 10-20k epochs.
