alibaba / PhotonLibOS

Probably the fastest coroutine lib in the world!

Home Page: https://PhotonLibOS.github.io

Missed mutex unlock notification in io_uring engine

louiswilliams opened this issue

The io_uring engine can miss mutex wakeup notifications. When that happens, the waiting thread sleeps for 10 seconds, stalling all threads. Reproducing this requires multiple mutexes and many threads.
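To make the failure mode concrete, here is a minimal, generic sketch of a lost wakeup. It is not Photon's code: `ParkingSpot`, `notify_lossy`, `notify_sticky`, and `park` are hypothetical names, and the model is single-threaded and deterministic. It only assumes that an unlock notification can arrive just before the waiter starts sleeping, which is the kind of window that could explain the stall.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a thread that parks in an event engine with a timeout.
// None of these names come from Photon; this only illustrates the pattern.
struct ParkingSpot {
    bool parked = false;   // true while the owner is blocked waiting
    bool pending = false;  // a wakeup that arrived while the owner was not parked

    // Lossy notify: has an effect only if the owner is already parked.
    void notify_lossy() {
        if (parked) parked = false;
    }

    // Sticky notify: remembers the wakeup if the owner is not parked yet.
    void notify_sticky() {
        if (parked) parked = false;
        else pending = true;
    }

    // Returns how long (in ms) the owner stays blocked, assuming no further
    // notification arrives after this call begins.
    uint64_t park(uint64_t timeout_ms) {
        assert(!parked);
        if (pending) {          // a wakeup was recorded earlier: do not sleep
            pending = false;
            return 0;
        }
        parked = true;          // nothing recorded: wait until the timeout
        parked = false;
        return timeout_ms;
    }
};

// The unlocker runs first, then the waiter parks. The wakeup is dropped.
uint64_t unlock_then_park_lossy(uint64_t timeout_ms) {
    ParkingSpot spot;
    spot.notify_lossy();
    return spot.park(timeout_ms);
}

// Same ordering, but the wakeup is remembered and consumed by park().
uint64_t unlock_then_park_sticky(uint64_t timeout_ms) {
    ParkingSpot spot;
    spot.notify_sticky();
    return spot.park(timeout_ms);
}
```

In this model, `unlock_then_park_lossy(10000)` blocks for the full 10000 ms, while `unlock_then_park_sticky(10000)` returns without blocking. Whether the io_uring engine actually drops a notification in this way is an assumption, not something established in this report.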

I can reliably reproduce by making the following changes:

Bypass the spinning code in mutex::lock:

diff --git a/thread/thread.cpp b/thread/thread.cpp
index d2a8542..9a19b97 100644
--- a/thread/thread.cpp
+++ b/thread/thread.cpp
@@ -1547,11 +1547,6 @@ R"(
     }
     int mutex::lock(uint64_t timeout)
     {
-        for (int tries = 0; tries < MaxTries; ++tries) {
-            if (try_lock() == 0)
-                return 0;
-            thread_yield();
-        }
         splock.lock();
         if (try_lock() == 0) {
             splock.unlock();

Compile this program with -O0 to make it slower:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <photon/photon.h>
#include <photon/thread/std-compat.h>
#include <unistd.h>
#include <vector>

static constexpr int nMutexes = 2;
static constexpr int nWorkers = 2;
static constexpr int nThreads = 16; // Increase if necessary

photon::mutex mutexes[nMutexes];
std::atomic<int64_t> acquisitionCounters[nMutexes];
std::atomic<int64_t> acquisitionDurations[nMutexes];

void timedLock(photon::mutex &mutex, int index) {
  auto start = std::chrono::steady_clock::now();

  {
    photon_std::lock_guard<photon::mutex> lk(mutex);
    auto end = std::chrono::steady_clock::now();
    acquisitionCounters[index].fetch_add(1);
    auto durationMicros =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start)
            .count();
    acquisitionDurations[index].fetch_add(durationMicros);

    if (durationMicros > 1'000'000) {
      printf("long acquisition. mutex %d, duration: %ldms\n", index,
             durationMicros / 1000);
    }
  }
}

void myThread(int tid) {
  printf("thread %d starting\n", tid);

  while (true) {
    for (int i = 0; i < nMutexes; i++) {
      timedLock(mutexes[i], i);
    }
  }
}

int main() {
  int ret = photon::init(photon::INIT_EVENT_IOURING, photon::INIT_IO_DEFAULT);
  if (ret != 0) {
    printf("failed to init photon with error %d\n", ret);
    return -1;
  }

  ret = photon_std::work_pool_init(nWorkers, photon::INIT_EVENT_IOURING,
                                   photon::INIT_IO_DEFAULT);
  if (ret != 0) {
    printf("failed to init photon work pool with error %d\n", ret);
    return -1;
  }

  std::vector<photon_std::thread> threads;
  for (int i = 0; i < nThreads; i++) {
    threads.emplace_back(myThread, i);
  }

  while (true) {
    for (int i = 0; i < nMutexes; i++) {
      auto count = acquisitionCounters[i].load();
      auto durationMs = acquisitionDurations[i].load() / 1000;

      printf("mutex %d: acquisitions: %ld, wait time: %ldms\n", i, count,
             durationMs);
    }
    printf("\n");
    sleep(1);
  }
}

Eventually, the program will stall and log mutex acquisitions that take 10 seconds:

mutex 0: acquisitions: 7730, wait time: 531ms
mutex 1: acquisitions: 7730, wait time: 0ms

long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms
long acquisition. mutex 0, duration: 10485ms

Thanks for reporting.

Will dig into this issue next week.

I'm also very interested in what kind of project you've been working on with Photon. Would you like to join our Slack channel?

@louiswilliams Are you using arm or x86? Is the bug reproducible when using epoll?

Hi, this was only tested on ARM, and I haven’t been able to reproduce with epoll.

Good to hear that. It suggests the mutex and epoll are OK, and the bug may lie in the io_uring event engine.

I also came across a similar situation yesterday on macOS with ARM. That may also be due to the event engine (kqueue).

@louiswilliams Please take part in the review.