PaddlePaddle / PaddleCloud

PaddlePaddle Docker images and K8s operators for PaddleOCR/Detection developers to use on public/private cloud.

An error occurs when I set 'is_local=False'

wkr114 opened this issue

My program works well locally, but when I set 'is_local=False', an error occurs.

I submitted the job this way:

paddlecloud submit -jobname my-paddlecloud-job \
  -cpu 2 \
  -gpu 0 \
  -memory 4Gi \
  -parallelism 4 \
  -pscpu 1 \
  -pservers 2 \
  -psmemory 1Gi \
  -passes 1 \
  -entry "python trainer_config.py" \
  /pfs/[datacenter_name]/home/[username]/ctr_demo_package

Here is the error information:

==========================dpt-l1-sync-test-trainer-v0t8n==========================
label selector: paddle-job-pserver=dpt-l1-sync-test, desired: 2
current cnt: 1 sleep for 5 seconds...
label selector: paddle-job=dpt-l1-sync-test, desired: 2
Starting training job:  /pfs/mulan/home/wangkairui@baidu.com/jobs/dpt-l1-sync-test, num_gradient_servers: 2, trainer_id:  1, version:  v2
[INFO 2018-03-29 09:25:02,441 train.py:55] class number is : 28.
[INFO 2018-03-29 09:25:02,460 train.py:75] length of word dictionary is : 40201.
I0329 09:25:03.139626   131 Util.cpp:166] commandline:  --num_gradient_servers=2 --ports_num_for_sparse=1 --use_gpu=False --trainer_id=1 --pservers=192.168.170.133,192.168.32.36 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164 
I0329 09:25:03.203243   131 GradientMachine.cpp:94] Initing parameters..
I0329 09:25:03.592674   131 GradientMachine.cpp:101] Init parameters done.
I0329 09:25:03.593145   131 ParameterClient2.cpp:113] pserver 0 192.168.170.133:7164
I0329 09:25:03.593410   131 ParameterClient2.cpp:113] pserver 1 192.168.32.36:7164
processing  /pfs/mulan/home/wangkairui@baidu.com/ques_sync_test/train_data/train-00001
[INFO 2018-03-29 09:25:09,039 train.py:110] Pass 0, trainer 1, Batch 0, Cost 3.327363, {'__auc_evaluator_0__': 0.0, 'classification_error_evaluator': 0.875}

F0329 09:25:40.119244   171 SocketChannel.cpp:54] Check failed: len >= 0  peer=192.168.170.133
*** Check failure stack trace: ***
    @     0x7f2081f506fd  google::LogMessage::Fail()
    @     0x7f2081f541ac  google::LogMessage::SendToLog()
    @     0x7f2081f50223  google::LogMessage::Flush()
    @     0x7f2081f556be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2081db51c4  paddle::SocketChannel::read()
    @     0x7f2081db56b0  paddle::SocketChannel::readMessage()
    @     0x7f2081db64e6  paddle::ProtoClient::recv()
    @     0x7f20824b93d4  paddle::ParameterClient2::sendParallel()
    @     0x7f2081ebff7c  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
    @     0x7f20ae8c4c80  (unknown)
    @     0x7f20b97696ba  start_thread
    @     0x7f20b949f3dd  clone
    @              (nil)  (unknown)
Aborted
job returned 134...setting pod return message...
===============================
termination log wroted...
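
For readers unsure how to interpret the `job returned 134` line: under the usual shell convention, an exit status above 128 means the process was terminated by signal number (status - 128). The following is a minimal sketch under that assumption; `describe_exit_status` is a hypothetical helper written for illustration and is not part of PaddleCloud.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a shell-style exit status into a readable description.

    Assumes the common convention that a status above 128 encodes
    "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with code {status}"


# Status taken from the "job returned 134" line in the log above.
print(describe_exit_status(134))
```

On Linux, signal 6 is SIGABRT, which is consistent with the `Aborted` line: the fatal `Check failed` in SocketChannel.cpp aborted the trainer after its read from the parameter server at 192.168.170.133 failed.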

Seems Parameter Server failed, please try to turn up the memory of PServer by passing arg -psmemory, such as -psmemory 5Gi
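
To act on the suggestion above, the job can be resubmitted with only `-psmemory` changed. The sketch below rebuilds the command from the original post with the parameter server memory as the single variable; `build_submit_command` is a hypothetical helper, the bracketed path placeholders are kept exactly as posted, and 5Gi is simply the value the message suggests, not a verified fix.

```python
import shlex
from typing import List


def build_submit_command(psmemory: str) -> List[str]:
    """Return the submit command from the post, varying only -psmemory."""
    return [
        "paddlecloud", "submit",
        "-jobname", "my-paddlecloud-job",
        "-cpu", "2",
        "-gpu", "0",
        "-memory", "4Gi",
        "-parallelism", "4",
        "-pscpu", "1",
        "-pservers", "2",
        "-psmemory", psmemory,  # was 1Gi in the failing submission
        "-passes", "1",
        "-entry", "python trainer_config.py",
        # Placeholder path copied unchanged from the original post.
        "/pfs/[datacenter_name]/home/[username]/ctr_demo_package",
    ]


# Print a shell-safe command line that can be pasted into a terminal.
print(shlex.join(build_submit_command("5Gi")))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; replace the placeholders before submitting.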