google / ghost-userspace

Scheduler is not working

Msiavashi opened this issue

Hi,

My question might be naive, but I'm trying to run some experiments with the provided schedulers (e.g. the FIFO scheduler and the Shinjuku scheduler). I've noticed that the agent starts and initializes the cores successfully, but it does not appear to schedule any process. I placed a few printfs in the Enqueue method as well as in the constructor. The constructor's printf fires once the scheduler is initialized, but the printfs inside the Enqueue method are never reached. I also don't observe any significant difference in performance.

I have a web server running under the SCHED_FIFO policy with priority 99.

The provided tests also pass but the scheduler does not log anything except for core mapping and initialization prints.

I also debugged the kernel to check whether its SCHED_GHOST policy works. The kernel schedules the agent perfectly; however, the agent does not schedule my real-time process (or any other process of mine).

So, simply put, my question is: how can I make sure the agent's scheduler is working correctly when the Enqueue method does not seem to be called at all?

I appreciate your help.

Thanks

Hi Mohammad, are you trying to schedule the web server (with SCHED_FIFO policy) using one of the ghost schedulers (say fifo_scheduler)?

If yes, then you'll need to move the application threads into the ghost scheduling class. All of the various ghost schedulers only schedule tasks with policy == SCHED_GHOST.

There are a couple of ways to move application threads into ghost (for your use case I'd recommend #3):

  1. modify the application itself to create GhostThreads, but unless you are writing something from scratch this is usually not feasible.
  2. have the agent monitor cgroups tagged with a special file to indicate that all tasks in the cgroup should be moved into ghost (we use this approach extensively but haven't pushed it into the open source repository).
  3. move the application threads into ghost with a command line tool. I am attaching the source code for a tool we use internally (pushtosched.c). Let us know if this is useful and we can add that to our github repository.
    pushtosched.c.txt
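
For reference, here is a rough sketch of what a tool like this can look like. To be clear, this is not the attached pushtosched.c (which may enter ghost through a ghost-specific mechanism rather than plain sched_setscheduler()); it is just a guess based on the usage that appears later in this thread, i.e. the policy number as the only argument (e.g. 18 for SCHED_GHOST on this kernel, 0 to revert) and thread IDs piped on stdin:

// Hypothetical sketch of a pushtosched-style tool (not the attached file):
// reads whitespace-separated thread IDs on stdin and moves each one into the
// scheduling policy given as argv[1].
#include <sched.h>

#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <policy>\n", argv[0]);
    return 1;
  }
  const int policy = std::atoi(argv[1]);  // e.g. 18 for SCHED_GHOST, 0 to revert
  sched_param param = {};  // priority 0; ghost and SCHED_OTHER ignore rt priority
  int tid;
  while (std::scanf("%d", &tid) == 1) {
    if (sched_setscheduler(tid, policy, &param) != 0) {
      std::perror("sched_setscheduler");
    }
  }
  return 0;
}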

Thanks @neelnatu,

Yep, that is exactly what I'm trying to do. I had misunderstood and thought SCHED_GHOST was only for the agents.

I'll try the tool and update the issue with the result.

Thanks

I tried the tool and it seems to work fine. I pipe the pid of my desired process into the tool and the agents start logging immediately. However, when using the FIFO scheduler, my web server process no longer responds until I revert it from SCHED_GHOST using the tool: pushtosched 0

I tried again with agent_shinjuku to see whether the same problem occurs with that scheduler. The Shinjuku agent starts without error, with the following output:

Core map
(  0 )	(  1 )	(  2 )	(  3 )	(  4 )	(  5 )	(  6 )	(  7 )

Initializing...
Initialization complete, ghOSt active.

Once I migrate my web server's threads to SCHED_GHOST using the tool, the agent exits with the following error:

Core map
(  0 )	(  1 )	(  2 )	(  3 )	(  4 )	(  5 )	(  6 )	(  7 )

Initializing...
Initialization complete, ghOSt active.
PID 6149 Fatal segfault at addr 0x10: 
[0] 0x7fe13de823c0 : __restore_rt
[1] 0x55babc33b2cb : ghost::BasicDispatchScheduler<>::DispatchMessage()
[2] 0x55babc339f19 : ghost::ShinjukuAgent::AgentThread()
[3] 0x55babc33e3c4 : ghost::Agent::ThreadBody()
[4] 0x7fe13e0d2de4 : (unknown)

For example, when I run a Flask dev web server, I use the following command to pipe the pid of the running python process:

pidof python | sudo ./pushtosched 18

Where do you think the problem might be coming from?

Thanks

Hi Mohammad,

Neel is out of office today, so let me help you until he is back.

Regarding Shinjuku, I assume you compiled with optimizations turned on, which strips out many of the debug symbols from the crash trace. What I assume is happening though is that since Shinjuku requires the client app being scheduled to set up a shared memory region for communication (check out the RocksDB experiment), it is crashing in your case because this region is not set up. I would not use the Shinjuku scheduler for scheduling the web server at the moment since it requires more extensive setup.

As for the FIFO scheduler, you do not need to do any sort of setup beyond running pushtosched.c to move the threads into the ghOSt sched class, so let me get some more information from you.

  1. Can you see the FIFO scheduler committing scheduling decisions in FifoScheduler::FifoSchedule() in fifo_scheduler.cc?
  2. If you try to schedule a very simple app with the FIFO scheduler (e.g., spawn a pthread from main(), printf() in the pthread, then join the pthread), does that app make forward progress (e.g., can it print things?).

I want to make sure the policy is up and running correctly on your machine. If it is, then we can delve into the specifics of the server. For example, the FIFO policy is non-preemptive, so perhaps certain threads are taking a while to run which is why you do not see the server responding.
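
In case it is useful, a minimal test app along the lines of item 2 might look like the sketch below. It uses the ghost::GhostThread helper (included here from lib/ghost.h, though the header path may differ in your checkout), which puts the thread into ghost directly so you do not need pushtosched for this test; alternatively, spawn a plain pthread and move it with pushtosched as above.

// Minimal sanity check: if the FIFO agent is scheduling ghOSt tasks, the
// message below should print and the program should exit cleanly.
#include <cstdio>

#include "lib/ghost.h"

int main() {
  ghost::GhostThread t(ghost::GhostThread::KernelScheduler::kGhost,
                       [] { std::printf("hello from a ghOSt thread\n"); });
  t.Join();
  return 0;
}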

Hi Jack,

Thanks for your hints.

The FifoSchedule() method works. As you suggested, I wrote a very simple app with two pthreads and noticed that the problem was with the pidof command, which only returns the pid of the process, not of the spawned threads (ls /proc/$(pidof my_multi_thread_app)/task | sudo ./pushtosched 18 solved the problem). My web server was spawning pthreads on the fly as it received new requests. So it works pretty well now; however, I believe I should use the second solution Neel suggested, since the pushtosched tool can't keep track of pthreads that are created and terminated at runtime. Could you please share the code for that second solution? I can implement it and create a PR if you want me to. I'm going to need this for my research anyway.

Regarding the Shinjuku scheduler, I haven't taken a look at it yet, but since this is for research purposes I'm going to need to run that too. I'll investigate the RocksDB experiment as you suggested to see how it works, and will keep this issue updated with the result.

Many thanks to you guys for your kind help. 🌹

Hi Mohammad,

Glad that it is working now. We're happy to open source the cgroup code, but there is a complication. We use cgroups v1 internally in Google whereas others generally use v2 nowadays. So our code may not be useful to others, and since cgroups v1 does not support inotify (to detect when a new thread is added to a cgroup), we rely on a periodic polling mechanism to detect new threads which may be too slow for a webserver that spawns a pthread for every new request.

If you want to implement this functionality for cgroups v2, we would be thrilled to merge it in. Alternatively, is it possible for you to modify the pthread spawn code in the server to instead use GhostThreads?

For my current test, yes, I can modify the source of my web server. But in the end, I believe I should run my experiments on some latency-sensitive apps like Memcached (I'm not sure whether Memcached spawns new threads at runtime; I assume it doesn't). It would probably not be easy to modify the source in that case.

Hmm, once you move all threads of the web server into ghost, any new threads created by that application should automatically start out in the ghost sched_class. You can verify this by periodically doing a 'grep policy /proc/<pid>/sched' for all tasks you get via ls /proc/$(pidof webserver_app)/task.
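
For example (assuming the ghost policy number is 18, as in the pushtosched invocation above), something like for t in /proc/$(pidof webserver_app)/task/*; do grep -H policy "$t/sched"; done should report policy 18 for every thread once they are all in ghost.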

Is that not happening in your case?

Hi Neel,

You're right. I double-checked and it's working as you said. However, the throughput drops significantly with FIFO, probably because of its non-preemptive implementation.

Anyway, I'm trying to run Shinjuku to compare the performance, but I still have some problems running it. I'll ask for your help if I can't fix it today.

Thanks

For Shinjuku, here are the main gotchas:

  1. Integrate your web server with the PrioTable and make sure each pthread is marked as runnable. Our RocksDB experiment app demonstrates how to do this.
  2. Set the preemption time slice properly in Shinjuku (this is specified to the Shinjuku agent process via the --preemption_time_slice command line arg). The default is 50 µs but you probably want something higher than that for a web server.

And just to be clear, Shinjuku is a centralized scheduler that has preemption. So you should get better QPS for your web server, though the spinning Shinjuku global agent will take up a logical core.

Thanks,

I just started trying to get Shinjuku running. For now, it seems the UnzipPar function in the Setup.py file gets the wrong par path from GetPar(): it returns the shinjuku.runfiles path instead of the shinjuku.par path. Now that that's fixed, it progresses to the RunAllExperiments function but throws an error related to the number of cores: lib/topology.h:472(20266) CHECK FAILED: cpu < cpus_.size() [8 >= 8]

I'm looking into it.

That's likely because the Shinjuku experiments require a machine with at least 8 logical cores. You can lower this by updating _NUM_CPUS in shinjuku.py.

Also take a look at the _FIRST_CPU variable and related options in options.py. We tend to affine the experiment away from lower CPUs (e.g., CPU 0) since a lot of background daemons run on those CPUs inside Google.

The Shinjuku experiment worked successfully. Thanks to your help, I now have some clues for getting my web server working with the Shinjuku scheduler. I'll integrate my web server with the PrioTable and hope it goes without a hitch.

Thanks again

Yes let us know. A key point is that you need to add a pthread to the PrioTable as soon as it is spawned, so make sure to allocate a large enough PrioTable when the server starts.

Depending on the scheduling policy and the scheduling hints from the server that your research requires, it may make sense to ditch the PrioTable altogether since it becomes a bottleneck at high thread counts (since the agent needs to scan the table). You could use Shinjuku or FIFO as a starting point and then take out the PrioTable and slim down other irrelevant parts of the policies.

Another option is to use the SOL scheduler as a starting point. It is a centralized FIFO scheduler with no PrioTable. Maybe adding preemption support to it (which should be relatively easy) would be what you're looking for.

Dear @jackhumphries,

I used the orchestrators as samples and implemented a very simple demonstration to get familiar with the code. Here is my code:

// NOTE: the include paths below are guesses; adjust them to match your build
// (see the reply below for the paths used internally).
#include <functional>
#include <iostream>
#include <vector>

#include "experiments/shared/ghost.h"
#include "experiments/shared/thread_pool.h"
#include "lib/ghost.h"

using namespace std;

void printHelloWorld(uint32_t num){
    cout << "Hello World" << endl;
}

int main(int argc, char const *argv[])
{
    ghost_test::Ghost ghost_(1, 1);
    ghost_test::ExperimentThreadPool thread_pool_(1);
    vector<ghost::GhostThread::KernelScheduler> kernelSchedulers;
    vector<function<void (uint32_t)>> threadWork;
    kernelSchedulers.push_back(ghost::GhostThread::KernelScheduler::kGhost);
    threadWork.push_back(&printHelloWorld);
    thread_pool_.Init(kernelSchedulers, threadWork);
    ghost::sched_item si;
    ghost_.GetSchedItem(0, si);
    si.sid = 0;
    si.gpid = thread_pool_.GetGtids()[0].id();
    si.flags |= SCHED_ITEM_RUNNABLE;
    ghost_.SetSchedItem(0, si);
    return 0;
}

As you can see, I create only one ghOSt thread, assign my printHelloWorld function to it, and mark it as RUNNABLE. The Shinjuku agent does not crash and successfully schedules the task; however, I get multiple prints on my stdout, even though I only print once. Here is the output:

Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
[2] 0x5575a4913104 : main
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
[3] 0x7f14c895b0b3 : __libc_start_main
Hello World
Hello World
Hello World

Any idea why this repetition happens? Am I missing something?

My second question is: how is it possible to ditch the PrioTable? Is that easily doable with the current implementation, or does it require a new implementation of Shinjuku? I assume it would not be easy to achieve with the current implementation of its agent. Am I correct?

Thanks

Hi Mohammad,

In the experiments directory, we have a helper thread pool that runs a thread body over and over again until the thread is marked as ready to exit. Below is some updated code that prints "Hello World" only once, as you would expect. Just update the include paths to work with your build.

#include <functional>
#include <iostream>
#include <vector>

#include "third_party/absl/synchronization/notification.h"
#include "third_party/ghost/experiments/shared/ghost.h"
#include "third_party/ghost/experiments/shared/thread_pool.h"
#include "third_party/ghost/lib/base.h"
#include "third_party/ghost/lib/ghost.h"

void printHelloWorld(uint32_t num, absl::Notification* printed,
                     absl::Notification* wait) {
  std::cout << "Hello World" << std::endl;
  printed->Notify();
  wait->WaitForNotification();
}

int main(int argc, char const* argv[]) {
  ghost_test::Ghost ghost_(1, 1);
  ghost_test::ExperimentThreadPool thread_pool_(1);
  std::vector<ghost::GhostThread::KernelScheduler> kernelSchedulers;
  std::vector<std::function<void(uint32_t)>> threadWork;
  kernelSchedulers.push_back(ghost::GhostThread::KernelScheduler::kGhost);

  absl::Notification printed;
  absl::Notification wait;
  threadWork.push_back(
      std::bind(printHelloWorld, std::placeholders::_1, &printed, &wait));

  thread_pool_.Init(kernelSchedulers, threadWork);
  ghost::sched_item si;
  ghost_.GetSchedItem(0, si);
  si.sid = 0;
  si.gpid = thread_pool_.GetGtids()[0].id();
  si.flags |= SCHED_ITEM_RUNNABLE;
  ghost_.SetSchedItem(0, si);

  printed.WaitForNotification();
  thread_pool_.MarkExit(/*sid=*/0);
  wait.Notify();
  thread_pool_.Join();

  return 0;
}

Basically, I use notifications to wait for the ghOSt thread to print once, then I mark the ghOSt thread as ready to exit, which causes the thread pool to let it exit. Then I call Join() on the thread pool, which avoids triggering the CHECK in the thread pool destructor; I can see that CHECK was triggered in your output.

You can modify the thread pool to avoid the repeating behavior or just create the ghOSt threads directly yourself.

If you want to ditch the PrioTable altogether (which I would recommend for your case), then I would take a look at the SOL scheduler. It is a centralized FIFO scheduler without a PrioTable. The only downside is that it is not preemptive, but you can implement this very easily yourself. On each iteration of the global scheduling loop, see how long each thread has been running so far and then schedule something else on a CPU whose currently running thread has exceeded the time slice.
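
To make that concrete, a rough sketch of the time-slice check might look like the following. The names here (Task, CpuState, MaybePreempt, runnable) are purely illustrative and do not match the actual SOL code; the real agent would also commit the new placement through a ghOSt transaction rather than just swapping pointers.

#include <deque>

#include "absl/time/clock.h"
#include "absl/time/time.h"

struct Task {
  absl::Time last_placement;  // when this task was last put on a CPU
};

struct CpuState {
  Task* current = nullptr;  // task currently running on this CPU
};

// Called once per CPU on each iteration of the global scheduling loop.
void MaybePreempt(CpuState& cs, std::deque<Task*>& runnable,
                  absl::Duration time_slice) {
  const absl::Time now = absl::Now();
  if (cs.current == nullptr ||
      now - cs.current->last_placement < time_slice) {
    return;  // nothing running, or the slice has not expired yet
  }
  // Slice expired: requeue the running task and pick the next one. If the
  // preempted task is the only runnable task, it simply runs again.
  runnable.push_back(cs.current);
  cs.current = runnable.front();
  runnable.pop_front();
  cs.current->last_placement = now;
}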

Hi Mohammad,

Please let me know if you have any additional questions. I will close this thread for now.