How to properly destroy/deconstruct the simulator?

Question

FirefoxMetzger opened this issue 3 years ago · comments

Sebastian Wallkötter commented 3 years ago

I want to run a series of simulations in scenario, and a single simulation has many moving parts. The robot will interact with different objects, modifying their initial state. To clean up between simulations, I thought it would be easiest to destroy the simulation completely and rebuild it from scratch, since this is the safest way to ensure that nothing is left over from the previous run.

My naive approach was

from scenario import gazebo as scenario_gazebo


for sim_idx in range(3):
    print(f"Simulation {sim_idx}")
    gazebo = scenario_gazebo.GazeboSimulator(step_size=0.001, rtf=1.0, steps_per_run=round((1 / 0.001) / 30))
    assert gazebo.insert_world_from_sdf("/usr/share/ignition/ignition-gazebo4/worlds/gpu_lidar_sensor.sdf")
    assert gazebo.initialize()
    for _ in range(10):
        gazebo.run()
    gazebo.close()  # should clean up the simulation?

which fails with a segmentation fault

Segmentation fault log

$ python3 foobar.py 
Simulation 0
[Err] [SystemPaths.cc:444] Could not resolve file [playground_diffuse.jpg]
Setting callback for signal SIGINT
Setting callback for signal SIGTERM
Setting callback for signal SIGABRT
Simulation 1
Setting callback for signal SIGINT
Setting callback for signal SIGTERM
Setting callback for signal SIGABRT
[Err] [SceneManager.cc:169] Visual: [ground_plane] already exists
[Err] [SceneManager.cc:169] Visual: [box] already exists
[Err] [SceneManager.cc:169] Visual: [model_with_lidar] already exists
[Err] [SceneManager.cc:169] Visual: [playground] already exists
[Err] [BaseStorage.hh:927] Another item already exists with name: sun
Segmentation fault

How would I correctly close the simulator and clean up once the simulation has finished?

Sebastian Wallkötter · Answer 1 · Tue Apr 13 2021 03:24:26 GMT+0800 (China Standard Time)

The above snippet works fine, i.e., it doesn't segfault (due to duplicate models?), if I don't use insert_world_from_sdf but instead use the default empty world and add items manually, like so:

from scenario import gazebo as scenario_gazebo
import gym_ignition_models

for sim_idx in range(3):
    print(f"Simulation {sim_idx}")
    gazebo = scenario_gazebo.GazeboSimulator(step_size=0.001, rtf=1.0, steps_per_run=round((1 / 0.001) / 30))
    assert gazebo.initialize()
    world = gazebo.get_world()
    assert world.insert_model(gym_ignition_models.get_model_file("ground_plane"))
    gazebo.run()
    gazebo.close()

Diego Ferigo · Answer 2 · Tue Apr 13 2021 21:02:07 GMT+0800 (China Standard Time)

Calling GazeboSimulator::close() is the right way to delete the ignition::gazebo::Server instance.

I remember segfaults occurring in the past due to plugins that do not get unloaded when the corresponding model or world is deleted (gazebosim/gz-sim#113). However, you destroy the simulator completely, so that issue should be unrelated.

You can try to find the offending line by running a script (as simple as possible) under valgrind: valgrind python script.py. I'm not 100% sure that a Release installation (e.g. from PyPI) will provide enough information. If it doesn't, you can switch to the Developer installation and compile the project in either Debug or RelWithDebInfo.

Just checking, do you have any custom plugins in your world?

Sebastian Wallkötter · Answer 3 · Wed Apr 14 2021 16:02:52 GMT+0800 (China Standard Time)

Valgrind traces the segfault back to ign-rendering and the sensor plugin. Apparently, there is a dangling reference to the previous sun, which has been destroyed, and dereferencing the null pointer leads to a segfault. This would also be in line with the error messages from SceneManager, which state for each model being inserted that it already exists.

==980== Process terminating with default action of signal 11 (SIGSEGV)
==980==  Access not within mapped region at address 0x0
==980==    at 0x2877B2DC: ignition::gazebo::v4::SceneManager::CreateLight(unsigned long, sdf::v10::Light const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libignition-gazebo4-rendering.so.4.6.0)
==980==    by 0x28728C9E: ignition::gazebo::v4::RenderUtil::Update() (in /usr/lib/x86_64-linux-gnu/libignition-gazebo4-rendering.so.4.6.0)
==980==    by 0x28652653: ignition::gazebo::v4::systems::SensorsPrivate::RunOnce() (in /usr/lib/x86_64-linux-gnu/ign-gazebo-4/plugins/libignition-gazebo4-sensors-system.so.4.6.0)
==980==    by 0x28652DE7: ignition::gazebo::v4::systems::SensorsPrivate::RenderThread() (in /usr/lib/x86_64-linux-gnu/ign-gazebo-4/plugins/libignition-gazebo4-sensors-system.so.4.6.0)
==980==    by 0x6676D83: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==980==    by 0x4A5F608: start_thread (pthread_create.c:477)
==980==    by 0x4986292: clone (clone.S:95)
==980==  If you believe this happened as a result of a stack
==980==  overflow in your program's main thread (unlikely but
==980==  possible), you can try to increase the size of the
==980==  main thread stack using the --main-stacksize= flag.
==980==  The main thread stack size used in this run was 8388608.

If I remove the sensor plugin from the simulation, it runs as expected. However, I need the plugin to extract sensor data from the simulation (images in my case). I tried inserting the plugin manually (via world.insert_world_plugin), but it doesn't seem to get unloaded/reloaded correctly that way either (same logs and same segfault).

Just checking, do you have any custom plugins in your world?

From ignition's point of view no. From scenario's point of view maybe, since I am using the sensor plugin to get camera images.

Diego Ferigo · Answer 4 · Wed Apr 14 2021 17:07:46 GMT+0800 (China Standard Time)

Valgrind traces the segfault back to ign-rendering and the sensor plugin.
[...]

Just checking, do you have any custom plugins in your world?

From ignition's point of view no. From scenario's point of view maybe, since I am using the sensor plugin to get camera images.

This was my suspicion, and I'm afraid we cannot do much from downstream.

My suggestion is to compile ign-gazebo in either Debug or RelWithDebInfo and find the exact line that segfaults. This requires a colcon installation. With that in place, you can either compile the entire workspace, passing:

colcon build --merge-install --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo [...]

or, if you already have an existing workspace, you can just rebuild ign-gazebo as follows:

cd <workspace>
cd build/ignition-gazebo<ver>
cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build . --target install  # could be INSTALL depending on your platform

Then, run valgrind again and find the exact line. The fix could be a one-liner, and submitting it upstream could be easy. Note that once you have all this set up, you can use your local colcon installation with the fix; you don't have to wait for the upstream PR to be merged and released.

Sebastian Wallkötter · Answer 5 · Wed Apr 14 2021 17:30:07 GMT+0800 (China Standard Time)

Thanks for the detailed explanation! I'll close this issue since it is an upstream problem.

Unfortunately, I currently don't have the time to fix this upstream myself, so it will have to wait until my schedule frees up and I can work on "extracurriculars" again.

That said, a workaround that I found is to run the simulator in a subprocess (via subprocess.call) and have the OS do the cleanup, i.e., have it unload the gazebo libraries. In my case, I log the initial environment configuration anyway (for reproducibility), so I can simply use that log to communicate between the processes and initialize the child.
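A minimal sketch of this kind of workaround, for illustration only: each simulation runs in a fresh Python interpreter, so the gazebo libraries are unloaded by the OS when the child exits. The helper name run_isolated, the config keys world_sdf and runs, and the use of a temporary JSON file to pass the configuration are assumptions made for this example, not part of scenario. The child program reuses only the calls from the snippet in the question.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Child program: builds one simulator, runs it, closes it, and exits.
# It mirrors the snippet from the question. The config keys ("world_sdf",
# "runs") are hypothetical names chosen for this sketch.
CHILD_SOURCE = """
import json
import sys
from scenario import gazebo as scenario_gazebo

with open(sys.argv[1]) as f:
    config = json.load(f)

gazebo = scenario_gazebo.GazeboSimulator(
    step_size=0.001, rtf=1.0, steps_per_run=round((1 / 0.001) / 30)
)
assert gazebo.insert_world_from_sdf(config["world_sdf"])
assert gazebo.initialize()
for _ in range(config["runs"]):
    gazebo.run()
gazebo.close()
"""


def run_isolated(config, child_source=CHILD_SOURCE):
    """Run one simulation in a fresh interpreter and return its exit code.

    The configuration is written to a temporary JSON file whose path is
    passed to the child as its first command-line argument.
    """
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "config.json"
        config_path.write_text(json.dumps(config))
        # With "python -c <source> <arg>", the child sees <arg> as sys.argv[1].
        return subprocess.call(
            [sys.executable, "-c", child_source, str(config_path)]
        )
```

The original loop then becomes three calls to run_isolated, each with a config such as {"world_sdf": "/usr/share/ignition/ignition-gazebo4/worlds/gpu_lidar_sensor.sdf", "runs": 10}. A non-zero return value indicates that the child failed or crashed, so the parent can log it and continue with the next simulation.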