SymbioticLab / Salus

Fine-grained GPU sharing primitives

How does Salus use an existing lane?

jasperzhong opened this issue · comments

https://github.com/SymbioticLab/Salus/blob/master/src/oplibraries/tensorflow/device/gpu/lane/lanemgr.cpp#L336

Given that a lane has already been assigned to a DL job, it seems that other DL jobs cannot share this lane according to the following code:

    std::unique_ptr<LaneHolder> GpuLane::tryFit(size_t persistent, size_t peak)
    {
        auto g = sstl::with_guard(m_mu);
        auto maxPeak = peak;
        if (!m_maxPeak.empty()) {
            maxPeak = std::max(maxPeak, *m_maxPeak.cbegin());
        }
        if ((persistent + maxPeak) <= m_availableMemory) { // 1
            addHoldUnsafe(persistent, peak);
            return std::make_unique<LaneHolder>(sstl::add_ref(this), persistent, peak);
        }
        return {};
    }

Actually, maxPeak will equal m_availableMemory at // 1 when a second DL job that wants to share this lane invokes this function: the lane's size equals persistent + peak of the first job, so once the first job is admitted, the available memory equals the recorded maximum peak. Since persistent is greater than 0 for the second job, the if-block is skipped.

commented

Hmm, you are right; the logic seems to be incorrect there. The idea is that a lane keeps the size it had when first created, which equals persistent + peak of the first workload in the lane. A new workload should fit in if its persistent and peak memory are smaller than what is available.

I see. So just remove the following code?

    if (!m_maxPeak.empty()) {
        maxPeak = std::max(maxPeak, *m_maxPeak.cbegin());
    }
commented

I'm away from my PC, so I can't judge at the moment. You should be careful removing it, though; I probably added it for a reason. The lane should always maintain its largest peak, given that workloads can come and go. I think m_maxPeak is related to this logic, so check that before you proceed.

OK. Another question: can a lane change its size on the fly?

commented

No, it can't.