Errors when reproducing experiments (also when running userspace agents)

Question

liborui opened this issue 2 years ago · comments

Description

Logs

I am trying to reproduce the experiments and then build something new on top of ghOSt.
But I ran into the error below.
(Before running this command, I compiled ghost-userspace with bazel build -c opt ...)

sudo ./bazel-bin/experiments/scripts/centralized_queuing.par cfs   # run from the root of ghost-userspace
# I use "sudo" because otherwise the Python script stops with "Running CFS experiments... mount: only root can use "--options" option"

The output is:

Running CFS experiments...
mount: /dev/cgroup/cpu: cgroup already mounted on /sys/fs/cgroup/systemd.
mount: /dev/cgroup/memory: cgroup already mounted on /sys/fs/cgroup/systemd.
Output Directory: /tmp/ghost_data/2022-04-26 10:22:56
{"throughputs": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000, 320000, 330000, 340000, 350000, 360000, 370000, 380000, 390000, 400000, 410000, 420000, 430000, 440000, 450000, 451000, 452000, 453000, 454000, 455000, 456000, 457000, 458000, 459000, 460000, 461000, 462000, 463000, 464000, 465000, 466000, 467000, 468000, 469000, 470000, 471000, 472000, 473000, 474000, 475000, 476000, 477000, 478000, 479000, 480000], "output_prefix": "/tmp/ghost_data/2022-04-26 10:22:56", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution": false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.0, "load_generator_cpu": 10, "cfs_dispatcher_cpu": 11, "num_workers": 6, "worker_cpus": [12, 13, 14, 15, 16, 17], "cfs_wait_type": "spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "1us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.0', '--load_generator_cpu', '10', '--cfs_dispatcher_cpu', '11', '--num_workers', '6', '--worker_cpus', '12,13,14,15,16,17', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '1us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/cfs_orchestrator.cc:95(23984) CHECK FAILED: ghost::Ghost::SchedSetAffinity( ghost::Gtid::Current(), ghost::MachineTopology()->ToCpuList( std::vector<int>{options().load_generator_cpu})) == 0 [-1 != 0]
errno: 22 [Invalid argument]
PID 23984 Backtrace:
[0] 0x564e0ac5e487 : ghost_test::CfsOrchestrator::LoadGenerator()
[1] 0x564e0ac8561e : ghost_test::ExperimentThreadPool::ThreadMain()
[2] 0x564e0ac8756b : std::_Function_handler<>::_M_invoke()
[3] 0x7fe9aeb77de4 : (unknown)

I also tried this command from the root of ghost-userspace:

sudo bazel run fifo_agent

which produced:

Extracting Bazel installation...
Starting local Bazel server and connecting to it...
ERROR: Skipping 'fifo_agent': no such target '//:fifo_agent': target 'fifo_agent' not declared in package '' defined by /home/emnets/ghost-userspace/BUILD
WARNING: Target pattern parsing failed.
ERROR: no such target '//:fifo_agent': target 'fifo_agent' not declared in package '' defined by /home/emnets/ghost-userspace/BUILD
INFO: Elapsed time: 10.222s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (1 packages loaded)
FAILED: Build did NOT complete successfully (1 packages loaded)

Env Info

And here is my environment version info:

lsb_release -a 
# LSB Version:	core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch 
# Distributor ID:	Ubuntu 
# Description:	Ubuntu 20.04.2 LTS 
# Release:	20.04 
# Codename:	focal

uname -mrs
# Linux 5.11.0+ x86_64

ghost-kernel hash: 5da05ec77890217e85947ff3573e1480579687d2
ghost-userspace hash: 79ecaeb

P.S. I am running ghOSt in a virtual machine, and I wonder whether that matters.
The VM runs on VMware Workstation with 8 GB of memory and 8 processors (one core each).

Suggestion

My colleagues and I appreciate the ghOSt paper and this open-source project.
But I had a lot of difficulty running the experiments and reproducing the results, because the README does not cover this.
I had to go through the (closed) issues to find the scattered commands for running the experiments.
It would be great if you could update the README with more detailed steps. I would be glad to help if needed.

jackhumphries · Answer 1 · Wed Apr 27 2022 17:20:20 GMT+0800 (China Standard Time)

Hi Borui,

Thanks for opening this issue. The CFS and ghOSt experiments affine threads to logical cores 10, 11, 12, and so on (this is controlled by _FIRST_CPU in options.py). ghost::Ghost::SchedSetAffinity() calls sched_setaffinity(), which is failing with EINVAL. This error generally means that those logical cores do not exist in your system, and this makes sense given that you mentioned you have 8 logical cores in your machine. The experiment parameters in the Python files need to be changed for your machine -- I would imagine that setting _FIRST_CPU to 0 in options.py would fix your issue.
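The diagnosis above can be checked before launching an experiment. The following is a minimal sketch, not code from ghost-userspace. It assumes the CPU layout visible in the log earlier in this thread (load generator on the first CPU, CFS dispatcher on the next one, workers on the CPUs after that); the function names planned_cpus, missing_cpus and available_cpus are made up for illustration.

```python
import os


def planned_cpus(first_cpu, num_workers):
    """Return the CPU IDs the experiment would pin threads to.

    Assumed layout, inferred from the log output rather than from
    options.py: load generator, then CFS dispatcher, then the workers.
    """
    load_generator = first_cpu
    dispatcher = first_cpu + 1
    workers = list(range(first_cpu + 2, first_cpu + 2 + num_workers))
    return [load_generator, dispatcher] + workers


def missing_cpus(required, available):
    """Return the required CPU IDs that are not in the available set."""
    return sorted(set(required) - set(available))


def available_cpus():
    """CPUs this process may run on (Linux), else a guess from the CPU count."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


if __name__ == "__main__":
    required = planned_cpus(first_cpu=10, num_workers=6)
    missing = missing_cpus(required, available_cpus())
    if missing:
        print("CPUs not available on this machine:", missing)
    else:
        print("All planned CPUs are available:", required)
```

On a machine with 8 logical cores (IDs 0 to 7), planned_cpus(10, 6) yields CPUs 10 to 17, none of which exist, while planned_cpus(0, 6) yields CPUs 0 to 7, which fits exactly.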

Your run command for fifo_agent is failing because there is no fifo_agent target. We used to have a fifo_agent target, but we renamed it in 7ab27ed. We have fifo_per_cpu_agent (a ghOSt scheduler with per-CPU ghOSt agents that each have their own FIFO runqueue) and fifo_centralized_agent (a ghOSt scheduler with a global ghOSt agent that has a single FIFO runqueue for the entire machine).

Thanks for the suggestion about a README with instructions. We definitely want to create this along with several extensive tutorials, though we do not have a set timeline for these yet. If you start writing a README/tutorials and want to push your work, we would be more than happy to accept it.

Please let me know if you have additional questions.

cashey · Answer 2 · Tue Dec 13 2022 17:06:12 GMT+0800 (China Standard Time)

I have the same problem. I changed _FIRST_CPU to 0 in options.py, but it still does not work:
seu@ubuntu:$ cd ghost-userspace/
seu@ubuntu:/ghost-userspace$ sudo su
[sudo] password for seu:
root@ubuntu:/home/seu/ghost-userspace# sudo ./bazel-bin/experiments/scripts/centralized_queuing.par cfs
Running CFS experiments...
Output Directory: /tmp/ghost_data/2022-12-13 17:03:08
{"throughputs": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000, 320000, 330000, 340000, 350000, 360000, 370000, 380000, 390000, 400000, 410000, 420000, 430000, 440000, 450000, 451000, 452000, 453000, 454000, 455000, 456000, 457000, 458000, 459000, 460000, 461000, 462000, 463000, 464000, 465000, 466000, 467000, 468000, 469000, 470000, 471000, 472000, 473000, 474000, 475000, 476000, 477000, 478000, 479000, 480000], "output_prefix": "/tmp/ghost_data/2022-12-13 17:03:08", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution": false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.0, "load_generator_cpu": 10, "cfs_dispatcher_cpu": 11, "num_workers": 6, "worker_cpus": [12, 13, 14, 15, 16, 17], "cfs_wait_type": "spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "1us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.0', '--load_generator_cpu', '10', '--cfs_dispatcher_cpu', '11', '--num_workers', '6', '--worker_cpus', '12,13,14,15,16,17', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '1us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/cfs_orchestrator.cc:87(2045) CHECK FAILED: ghost::GhostHelper()->SchedSetAffinity( ghost::Gtid::Current(), ghost::MachineTopology()->ToCpuList( std::vector{options().load_generator_cpu})) == 0 [-1 != 0]
errno: 22 [Invalid argument]
PID 2045 Backtrace:
[0] 0x55c7f5f521b9 : ghost_test::CfsOrchestrator::LoadGenerator()
[1] 0x55c7f5f792ee : ghost_test::ExperimentThreadPool::ThreadMain()
[2] 0x55c7f5f7b23b : std::_Function_handler<>::_M_invoke()
[3] 0x55c7f5f7e23d : std::thread::_State_impl<>::_M_run()
[4] 0x7efc43502de4 : (unknown)

jackhumphries · Answer 3 · Tue Dec 13 2022 17:51:56 GMT+0800 (China Standard Time)

Hello,

The error output indicates that the failure happens on line 87 in cfs_orchestrator.cc. This line affines the load generator thread to a logical core. It appears that your script sets the load generator to core 10, which means your system needs to have at least 11 cores for this to work. Also, it seems that the CFS dispatcher is affined to core 11 and the worker threads are affined to cores 12-17.
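The core counts in this reply can be read straight off the command list that the script prints. The sketch below is illustrative only: the helper names are hypothetical, and the command is an abbreviated copy of the one in the log, keeping just the three CPU-related flags.

```python
def cpu_ids_from_command(argv):
    """Collect every CPU ID named by the three CPU flags in a command list."""
    cpus = []
    for flag in ("--load_generator_cpu", "--cfs_dispatcher_cpu", "--worker_cpus"):
        # Each flag is followed by its value; --worker_cpus is comma-separated.
        value = argv[argv.index(flag) + 1]
        cpus.extend(int(part) for part in value.split(","))
    return cpus


def min_logical_cores(cpus):
    """CPU IDs start at 0, so the highest ID plus one cores must exist."""
    return max(cpus) + 1


# Abbreviated from the logged command; the other flags are omitted.
command = [
    "/dev/shm/rocksdb",
    "--load_generator_cpu", "10",
    "--cfs_dispatcher_cpu", "11",
    "--num_workers", "6",
    "--worker_cpus", "12,13,14,15,16,17",
]

if __name__ == "__main__":
    cpus = cpu_ids_from_command(command)
    print("CPU IDs used:", cpus)
    print("Logical cores needed:", min_logical_cores(cpus))
```

The load generator alone (core 10) already needs 11 logical cores; with the dispatcher and the workers included, the highest ID is 17, so the full command needs 18.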

I assume you want to affine the threads to cores with lower IDs. Are you sure that you are setting _FIRST_CPU to 0 in options.py?

Thanks,
Jack

cashey · Answer 4 · Thu Dec 15 2022 12:54:27 GMT+0800 (China Standard Time)

Oh, thanks for replying so quickly.
Yes, I am sure I set _FIRST_CPU to 0 in options.py,

as you can see in the picture.
Sorry for the delay.
Is there anything else that is not configured correctly?
Thank you very much!

cashey · Answer 5 · Thu Dec 15 2022 13:31:49 GMT+0800 (China Standard Time)

Addendum:
I also set NUM_ROCKSDB_WORKERS = 5 in options.py, but the output still reports "num_workers": 6,
so I guess the change did not take effect.
I then rebooted, but it still did not work.

cashey · Answer 6 · Sun Dec 18 2022 15:30:06 GMT+0800 (China Standard Time)

I think I found the cause of the problem:

Change _FIRST_CPU to 1.
It needs to be recompiled; after that it works well.
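An edit to options.py that has not reached the built .par shows up in the JSON config line that the script prints at startup, as in the earlier post where "num_workers" stayed at 6. The sketch below compares that line with the values you expect. It is a hedged illustration: the function name stale_settings is hypothetical, the config line is an abbreviated copy of the logged one, and mapping _FIRST_CPU to "load_generator_cpu" is an assumption based on the log.

```python
import json


def stale_settings(config_line, expected):
    """Return {key: (printed, expected)} for rocksdb settings that differ."""
    printed = json.loads(config_line)["rocksdb"]
    return {
        key: (printed.get(key), want)
        for key, want in expected.items()
        if printed.get(key) != want
    }


# Abbreviated copy of the config line from the log; other keys are omitted.
config_line = (
    '{"rocksdb": {"load_generator_cpu": 10, "cfs_dispatcher_cpu": 11, '
    '"num_workers": 6, "worker_cpus": [12, 13, 14, 15, 16, 17]}}'
)

# Values expected after the edits described in this thread (assumed mapping).
expected = {"load_generator_cpu": 0, "num_workers": 5}

if __name__ == "__main__":
    differences = stale_settings(config_line, expected)
    if differences:
        print("Printed config does not match the edits; rebuild may be needed:")
        print(differences)
    else:
        print("Printed config matches the expected values.")
```

If the printed values still show the old settings after editing options.py, rebuilding with the same bazel build command used originally and rerunning the .par is the step that resolved it here.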

cashey · Answer 7 · Sun Dec 18 2022 15:30:37 GMT+0800 (China Standard Time)

Thank you very much!