Errors when reproducing experiments (also when running userspace agents)
liborui opened this issue
Description
Logs
I am trying to reproduce the experiments and then build something new with ghOSt.
However, I ran into the error below (before running this command, I had compiled ghost-userspace with bazel build -c opt ...):
sudo ./bazel-bin/experiments/scripts/centralized_queuing.par cfs # in the root of ghost-userspace.
# I use sudo because without it the Python script fails with: "Running CFS experiments... mount: only root can use "--options" option"
The output is:
Running CFS experiments...
mount: /dev/cgroup/cpu: cgroup already mounted on /sys/fs/cgroup/systemd.
mount: /dev/cgroup/memory: cgroup already mounted on /sys/fs/cgroup/systemd.
Output Directory: /tmp/ghost_data/2022-04-26 10:22:56
{"throughputs": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000, 320000, 330000, 340000,
350000, 360000, 370000, 380000, 390000, 400000, 410000, 420000, 430000, 440000, 450000, 451000, 452000, 453000, 454000, 455000, 456000, 457000, 458000, 459000, 460000, 461000, 462000, 463000, 464000, 465000, 466000, 467000, 468000, 469000, 470000, 471000, 472000, 473000, 474000,
475000, 476000, 477000, 478000, 479000, 480000], "output_prefix": "/tmp/ghost_data/2022-04-26 10:22:56", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution":
false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.0, "load_generator_cpu": 10, "cfs_dispatcher_cpu": 11, "num_workers": 6, "worker_cpus": [12, 13, 14, 15, 16, 17], "cfs_wait_type":
"spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "1us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.0', '--load_generator_cpu', '10', '--cfs_dispatcher_cpu', '11', '--num_workers', '6', '--worker_cpus', '12,13,14,15,16,17', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '1us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/cfs_orchestrator.cc:95(23984) CHECK FAILED: ghost::Ghost::SchedSetAffinity( ghost::Gtid::Current(), ghost::MachineTopology()->ToCpuList( std::vector<int>{options().load_generator_cpu})) == 0 [-1 != 0]
errno: 22 [Invalid argument]
PID 23984 Backtrace:
[0] 0x564e0ac5e487 : ghost_test::CfsOrchestrator::LoadGenerator()
[1] 0x564e0ac8561e : ghost_test::ExperimentThreadPool::ThreadMain()
[2] 0x564e0ac8756b : std::_Function_handler<>::_M_invoke()
[3] 0x7fe9aeb77de4 : (unknown)
I also tried the following command from the root of ghost-userspace:
sudo bazel run fifo_agent
which fails with:
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
ERROR: Skipping 'fifo_agent': no such target '//:fifo_agent': target 'fifo_agent' not declared in package '' defined by /home/emnets/ghost-userspace/BUILD
WARNING: Target pattern parsing failed.
ERROR: no such target '//:fifo_agent': target 'fifo_agent' not declared in package '' defined by /home/emnets/ghost-userspace/BUILD
INFO: Elapsed time: 10.222s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (1 packages loaded)
FAILED: Build did NOT complete successfully (1 packages loaded)
Env Info
Here is my environment information:
lsb_release -a
# LSB Version: core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
# Distributor ID: Ubuntu
# Description: Ubuntu 20.04.2 LTS
# Release: 20.04
# Codename: focal
uname -mrs
# Linux 5.11.0+ x86_64
ghost-kernel hash: 5da05ec77890217e85947ff3573e1480579687d2
ghost-userspace hash: 79ecaeb
P.S. I am using a virtual machine to reproduce ghOSt, and I wonder whether the virtual machine matters.
The VM runs on VMware Workstation with 8 GB of memory and 8 processors (one core per processor).
Suggestion
My colleagues and I appreciate the paper and this open-source project of ghost.
However, we had many difficulties running the experiments and reproducing the results, because the README does not explain how to do so.
I had to dig through the (closed) issues to find the scattered commands needed to run the experiments.
It would be great if you could update the README with more detailed steps, and I would be happy to help if needed.
Hi Borui,
Thanks for opening this issue. The CFS and ghOSt experiments affine threads to logical cores 10, 11, 12, and so on (this is controlled by _FIRST_CPU in options.py). ghost::Ghost::SchedSetAffinity() calls sched_setaffinity(), which is failing with EINVAL. This error generally means that those logical cores do not exist on your system, which makes sense given that you mentioned your machine has 8 logical cores. The experiment parameters in the Python files need to be adjusted for your machine; I would expect that setting _FIRST_CPU to 0 in options.py fixes your issue.
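As a quick pre-flight check before launching the .par, the sketch below compares the CPUs an experiment wants to pin threads to against the CPUs the current process is allowed to run on (os.sched_getaffinity, Linux-only). The CPU numbers are taken from the log above (load generator 10, CFS dispatcher 11, workers 12-17); missing_cpus is a hypothetical helper, not part of ghost-userspace. Pinning a thread to only one of the reported CPUs would be expected to fail with EINVAL, as in the log.

```python
import os


def missing_cpus(requested, available):
    """Return the requested CPU IDs that are not in `available`, sorted."""
    return sorted(set(requested) - set(available))


def main():
    # CPU layout taken from the failing run's log: load generator on 10,
    # CFS dispatcher on 11, RocksDB workers on 12-17.
    requested = [10, 11] + list(range(12, 18))
    # CPUs this process is allowed to run on (Linux-only API).
    available = os.sched_getaffinity(0)
    missing = missing_cpus(requested, available)
    if missing:
        print(f"Cannot pin to CPUs {missing}; available CPUs: {sorted(available)}")
    else:
        print("All requested CPUs are available.")


if __name__ == "__main__":
    main()
```

On the 8-vCPU VM described above (CPUs 0-7), this would report all of CPUs 10-17 as missing.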
Your run command for fifo_agent is failing because there is no fifo_agent target. We used to have a fifo_agent target, but we renamed it in 7ab27ed. We now have fifo_per_cpu_agent (a ghOSt scheduler with per-CPU ghOSt agents, each with its own FIFO runqueue) and fifo_centralized_agent (a ghOSt scheduler with a single global ghOSt agent that keeps one FIFO runqueue for the entire machine).
Thanks for the suggestion about a README with instructions. We definitely want to create this along with several extensive tutorials, though we do not have a set timeline for these yet. If you start writing a README/tutorials and want to push your work, we would be more than happy to accept it.
Please let me know if you have additional questions.
I have the same problem. I changed _FIRST_CPU in options.py to 0, but it still does not work:
seu@ubuntu:~$ cd ghost-userspace/
seu@ubuntu:~/ghost-userspace$ sudo su
[sudo] password for seu:
root@ubuntu:/home/seu/ghost-userspace# sudo ./bazel-bin/experiments/scripts/centralized_queuing.par cfs
Running CFS experiments...
Output Directory: /tmp/ghost_data/2022-12-13 17:03:08
{"throughputs": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000, 320000, 330000, 340000, 350000, 360000, 370000, 380000, 390000, 400000, 410000, 420000, 430000, 440000, 450000, 451000, 452000, 453000, 454000, 455000, 456000, 457000, 458000, 459000, 460000, 461000, 462000, 463000, 464000, 465000, 466000, 467000, 468000, 469000, 470000, 471000, 472000, 473000, 474000, 475000, 476000, 477000, 478000, 479000, 480000], "output_prefix": "/tmp/ghost_data/2022-12-13 17:03:08", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution": false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.0, "load_generator_cpu": 10, "cfs_dispatcher_cpu": 11, "num_workers": 6, "worker_cpus": [12, 13, 14, 15, 16, 17], "cfs_wait_type": "spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "1us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.0', '--load_generator_cpu', '10', '--cfs_dispatcher_cpu', '11', '--num_workers', '6', '--worker_cpus', '12,13,14,15,16,17', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '1us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/cfs_orchestrator.cc:87(2045) CHECK FAILED: ghost::GhostHelper()->SchedSetAffinity( ghost::Gtid::Current(), ghost::MachineTopology()->ToCpuList( std::vector{options().load_generator_cpu})) == 0 [-1 != 0]
errno: 22 [Invalid argument]
PID 2045 Backtrace:
[0] 0x55c7f5f521b9 : ghost_test::CfsOrchestrator::LoadGenerator()
[1] 0x55c7f5f792ee : ghost_test::ExperimentThreadPool::ThreadMain()
[2] 0x55c7f5f7b23b : std::_Function_handler<>::_M_invoke()
[3] 0x55c7f5f7e23d : std::thread::_State_impl<>::_M_run()
[4] 0x7efc43502de4 : (unknown)
Hello,
The error output indicates that the failure happens on line 87 of cfs_orchestrator.cc. This line affines the load generator thread to a logical core. It appears that your script assigns the load generator to core 10, which means your system needs at least 11 logical cores for this to work. The CFS dispatcher is also affined to core 11, and the worker threads to cores 12-17.
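The core-count arithmetic above generalizes. Judging from the logs, the load generator sits on the first CPU, the CFS dispatcher on the next one, and the workers on the CPUs after that. The sketch below reproduces that layout from a first CPU and a worker count, assuming this consecutive layout is how options.py derives it; cpu_layout and min_logical_cpus are hypothetical names, not functions from options.py.

```python
def cpu_layout(first_cpu, num_workers):
    """Assumed layout: load generator, then CFS dispatcher, then workers."""
    load_generator_cpu = first_cpu
    cfs_dispatcher_cpu = first_cpu + 1
    worker_cpus = list(range(first_cpu + 2, first_cpu + 2 + num_workers))
    return load_generator_cpu, cfs_dispatcher_cpu, worker_cpus


def min_logical_cpus(first_cpu, num_workers):
    """Smallest number of logical CPUs (IDs 0..N-1) that fits the layout."""
    # Highest CPU used is first_cpu + 1 + num_workers, so N must exceed it.
    return first_cpu + 2 + num_workers


print(cpu_layout(10, 6))        # (10, 11, [12, 13, 14, 15, 16, 17])
print(min_logical_cpus(10, 6))  # 18
print(min_logical_cpus(0, 6))   # 8
```

Under this assumption, the default values in the logs need 18 logical CPUs, while an 8-vCPU machine fits either a first CPU of 0 with 6 workers or a first CPU of 1 with 5 workers.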
I assume you want to affine the threads to cores with lower IDs. Are you sure that you are setting FIRST_CPU_ to 0 in options.py?
Thanks,
Jack
Addendum:
I also set NUM_ROCKSDB_WORKERS = 5 in options.py, but the output still reports "num_workers": 6,
so I guess my modification did not take effect.
I then rebooted, but it still did not work.
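One way to confirm whether an edit to options.py took effect, without waiting for a full run, is to read the JSON configuration the script prints right after "Output Directory:". The sketch below pulls the CPU-pinning fields out of that line; affinity_settings is a hypothetical helper, and the key names are simply the ones visible in the logs above.

```python
import json


def affinity_settings(config_line):
    """Extract the CPU-pinning settings from the JSON config the script prints."""
    rocksdb = json.loads(config_line)["rocksdb"]
    keys = ("load_generator_cpu", "cfs_dispatcher_cpu", "num_workers", "worker_cpus")
    return {key: rocksdb[key] for key in keys}


# Abbreviated version of the config line printed in the runs above.
line = ('{"rocksdb": {"load_generator_cpu": 10, "cfs_dispatcher_cpu": 11, '
        '"num_workers": 6, "worker_cpus": [12, 13, 14, 15, 16, 17]}}')
print(affinity_settings(line))
# {'load_generator_cpu': 10, 'cfs_dispatcher_cpu': 11, 'num_workers': 6, 'worker_cpus': [12, 13, 14, 15, 16, 17]}
```

If these values do not match what was just written in options.py, the script being run was not rebuilt from the edited file.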
I think I found the cause of the problem:
- change FIRST_CPU_ to 1
- recompile the project; after that, it works.
Thank you very much!
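The recompile step matters because, as far as I can tell, the .par file that Bazel builds is a self-contained archive that bundles the Python sources (including options.py) at build time, so edits to options.py have no effect until the target is rebuilt, for example with the same bazel build -c opt ... command used earlier. A minimal sketch that flags a stale archive by comparing modification times; the paths are assumptions about the checkout layout, and is_stale is a hypothetical helper.

```python
import os


def is_stale(artifact, sources):
    """Return True if any source file was modified after `artifact` was built."""
    built_at = os.path.getmtime(artifact)
    return any(os.path.getmtime(src) > built_at for src in sources)


if __name__ == "__main__":
    # Assumed paths, relative to the root of ghost-userspace.
    par = "bazel-bin/experiments/scripts/centralized_queuing.par"
    sources = ["experiments/scripts/options.py"]
    missing = [p for p in [par, *sources] if not os.path.exists(p)]
    if missing:
        print(f"Not found (run from the ghost-userspace root?): {missing}")
    elif is_stale(par, sources):
        print("options.py changed after the .par was built; rebuild it with Bazel.")
    else:
        print("The .par is newer than options.py.")
```

A timestamp check is only a heuristic: it says the archive might be outdated, not that the bundled values differ.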