An error occurs when I set 'is_local=False'
wkr114 opened this issue
wkr114 commented
My program works well locally, but when I set 'is_local=False', an error occurs.
I submitted the job this way:
paddlecloud submit -jobname my-paddlecloud-job \
  -cpu 2 \
  -gpu 0 \
  -memory 4Gi \
  -parallelism 4 \
  -pscpu 1 \
  -pservers 2 \
  -psmemory 1Gi \
  -passes 1 \
  -entry "python trainer_config.py" \
  /pfs/[datacenter_name]/home/[username]/ctr_demo_package
Here is the error information:
==========================dpt-l1-sync-test-trainer-v0t8n==========================
label selector: paddle-job-pserver=dpt-l1-sync-test, desired: 2
current cnt: 1 sleep for 5 seconds...
label selector: paddle-job=dpt-l1-sync-test, desired: 2
Starting training job: /pfs/mulan/home/wangkairui@baidu.com/jobs/dpt-l1-sync-test, num_gradient_servers: 2, trainer_id: 1, version: v2
[INFO 2018-03-29 09:25:02,441 train.py:55] class number is : 28.
[INFO 2018-03-29 09:25:02,460 train.py:75] length of word dictionary is : 40201.
I0329 09:25:03.139626 131 Util.cpp:166] commandline: --num_gradient_servers=2 --ports_num_for_sparse=1 --use_gpu=False --trainer_id=1 --pservers=192.168.170.133,192.168.32.36 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
I0329 09:25:03.203243 131 GradientMachine.cpp:94] Initing parameters..
I0329 09:25:03.592674 131 GradientMachine.cpp:101] Init parameters done.
I0329 09:25:03.593145 131 ParameterClient2.cpp:113] pserver 0 192.168.170.133:7164
I0329 09:25:03.593410 131 ParameterClient2.cpp:113] pserver 1 192.168.32.36:7164
processing /pfs/mulan/home/wangkairui@baidu.com/ques_sync_test/train_data/train-00001
[INFO 2018-03-29 09:25:09,039 train.py:110] Pass 0, trainer 1, Batch 0, Cost 3.327363, {'__auc_evaluator_0__': 0.0, 'classification_error_evaluator': 0.875}
F0329 09:25:40.119244 171 SocketChannel.cpp:54] Check failed: len >= 0 peer=192.168.170.133
*** Check failure stack trace: ***
@ 0x7f2081f506fd google::LogMessage::Fail()
@ 0x7f2081f541ac google::LogMessage::SendToLog()
@ 0x7f2081f50223 google::LogMessage::Flush()
@ 0x7f2081f556be google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2081db51c4 paddle::SocketChannel::read()
@ 0x7f2081db56b0 paddle::SocketChannel::readMessage()
@ 0x7f2081db64e6 paddle::ProtoClient::recv()
@ 0x7f20824b93d4 paddle::ParameterClient2::sendParallel()
@ 0x7f2081ebff7c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
@ 0x7f20ae8c4c80 (unknown)
@ 0x7f20b97696ba start_thread
@ 0x7f20b949f3dd clone
@ (nil) (unknown)
Aborted
job returned 134...setting pod return message...
===============================
termination log wroted...
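For reference, the fatal line names peer=192.168.170.133, which the earlier ParameterClient2 lines list as pserver 0. Below is a minimal Python sketch of how that lookup could be automated on a saved trainer log. The function name find_failed_pservers and both regular expressions are illustrative assumptions based only on the log lines shown here, not part of PaddlePaddle or paddlecloud.

```python
import re

# Sample lines copied from the trainer log in this issue.
LOG = """\
I0329 09:25:03.593145 131 ParameterClient2.cpp:113] pserver 0 192.168.170.133:7164
I0329 09:25:03.593410 131 ParameterClient2.cpp:113] pserver 1 192.168.32.36:7164
F0329 09:25:40.119244 171 SocketChannel.cpp:54] Check failed: len >= 0 peer=192.168.170.133
"""

# Assumed patterns, derived only from the log format seen above.
PSERVER_RE = re.compile(r"pserver (\d+) (\d+\.\d+\.\d+\.\d+):(\d+)")
FATAL_RE = re.compile(r"^F\d{4} .*Check failed: .* peer=(\d+\.\d+\.\d+\.\d+)")


def find_failed_pservers(log_text):
    """Return (pserver_index, ip) pairs for peers named in fatal check lines."""
    pservers = {}  # ip -> pserver index, filled from the ParameterClient2 lines
    failed = []
    for line in log_text.splitlines():
        match = PSERVER_RE.search(line)
        if match:
            pservers[match.group(2)] = int(match.group(1))
            continue
        match = FATAL_RE.search(line)
        if match and match.group(1) in pservers:
            ip = match.group(1)
            failed.append((pservers[ip], ip))
    return failed


print(find_failed_pservers(LOG))
```

This only identifies which parameter server the trainer lost contact with. It does not say why the connection failed, so the log of that pserver pod is still the place to look for the cause.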
Yan Xu commented
It seems the parameter server failed. Please try increasing the PServer memory by passing the -psmemory argument, for example -psmemory 5Gi.
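As a rough illustration of the suggestion above, the following Python sketch rebuilds the original submit command with only the -psmemory value changed. All flags and values are copied from this issue. The helper with_flag is a hypothetical name introduced for this example, and 5Gi is just the example value from the comment, not a verified requirement for this job.

```python
import shlex

# Arguments copied from the original submission in this issue.
original_args = [
    "paddlecloud", "submit",
    "-jobname", "my-paddlecloud-job",
    "-cpu", "2",
    "-gpu", "0",
    "-memory", "4Gi",
    "-parallelism", "4",
    "-pscpu", "1",
    "-pservers", "2",
    "-psmemory", "1Gi",
    "-passes", "1",
    "-entry", "python trainer_config.py",
    "/pfs/[datacenter_name]/home/[username]/ctr_demo_package",
]


def with_flag(args, flag, value):
    """Return a copy of args with the value that follows `flag` replaced."""
    new_args = list(args)
    idx = new_args.index(flag)  # raises ValueError if the flag is absent
    new_args[idx + 1] = value
    return new_args


# Raise the parameter server memory from 1Gi to the suggested 5Gi.
resubmit = with_flag(original_args, "-psmemory", "5Gi")

# Print the command for review instead of running it.
print(shlex.join(resubmit))
```

The sketch prints the command rather than executing it, so the placeholders [datacenter_name] and [username] can be filled in before submitting.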