dmlc / ps-lite

A lightweight parameter server interface

Home Page: http://ps-lite.readthedocs.org

Cannot train a model with multi-process KVWorkers

kangshantong opened this issue

Hi,
I am trying to train models with ps-lite. It works well in multi-threaded mode, as in test_kv_app_multi_workers, but in multi-process mode only one worker process runs and the others are blocked in the ps::Start stage.

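Tracing the code of ps::Start, we can see that there is a barrier in this stage. After the scheduler, servers, and workers have all sent the barrier command, every node is released by setting barrier_done_ to true. But the code below indexes barrier_done_ with 0 to size-1 instead of the customer IDs actually registered, so when a process holds a single customer it sets barrier_done_ to true only for customer_id 0, and a customer with any other ID keeps waiting: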

void Postoffice::Manage(const Message& recv) {
  CHECK(!recv.meta.control.empty());
  const auto& ctrl = recv.meta.control;
  if (ctrl.cmd == Control::BARRIER && !recv.meta.request) {
    barrier_mu_.lock();
    auto size = barrier_done_[recv.meta.app_id].size();
    for (size_t customer_id = 0; customer_id < size; customer_id++) {
      barrier_done_[recv.meta.app_id][customer_id] = true;
    }
    barrier_mu_.unlock();
    barrier_cond_.notify_all();
  }
}

The bug can be fixed with the following code, which iterates over the customer IDs stored in barrier_done_ instead of counting from 0:

void Postoffice::Manage(const Message& recv) {
  CHECK(!recv.meta.control.empty());
  const auto& ctrl = recv.meta.control;
  if (ctrl.cmd == Control::BARRIER && !recv.meta.request) {
    barrier_mu_.lock();
    for (auto iter = barrier_done_[recv.meta.app_id].begin();
         iter != barrier_done_[recv.meta.app_id].end(); iter++) {
      size_t customer_id = iter->first;
      barrier_done_[recv.meta.app_id][customer_id] = true;
    }
    barrier_mu_.unlock();
    barrier_cond_.notify_all();
  }
}

@eric-haibin-lin can you review this commit?