corrupted double-linked list [1] 14443 IOT instruction (core dumped)
jamjamjon opened this issue · comments
When I use the CUDA execution provider to run the corresponding ONNX model, this problem appears from time to time after inference completes. (The inference result is totally correct!)

```
corrupted double-linked list
[1] 14443 IOT instruction (core dumped)
```
Environment
- cuda: 11.7
- ort: 2.0.0 alpha4
- onnxruntime: 1.7.1 & 1.7.0
- gpu: GeForce RTX 3060
- OS: Ubuntu 23.04, x86_64
Code snippet

```rust
pub fn new(config: &Config) -> Result<Self> {
    // build session first in order to get `shape, dtype, names` from the onnx model
    ort::init().commit()?;
    let session = Session::builder()?.with_model_from_file(onnx_path)?;

    // --------------------------------------------------------------------
    // do something about inputs and outputs here
    // --------------------------------------------------------------------

    // build session again
    let builder = Session::builder()?;
    let cuda = CUDAExecutionProvider::default().build();
    if cuda.is_available()? && cuda.register(&builder).is_ok() {
        println!("> Using CUDA");
    } else {
        println!("> Using CPU");
    }
    let session = builder
        .with_optimization_level(ort::GraphOptimizationLevel::Level3)?
        .with_model_from_file(onnx_path)?;
    // ...
}

pub fn run_fp32(
    &self,
    xs: &Array<f32, IxDyn>,
) -> Result<Vec<Array<f32, IxDyn>>> {
    // run
    let ys = self.session.run(ort::inputs![xs.view()]?)?;
    Ok(ys
        .iter()
        .map(|(_, v)| {
            v.extract_tensor::<f32>()
                .unwrap()
                .view()
                .clone()
                .into_owned()
        })
        .collect::<Vec<Array<_, _>>>())
}
```
When using the TensorRT and CPU providers, everything works fine. Need your help, please.
Is `run_fp32` being called from multiple threads, or from a different thread than the one the session was created on?
Yes! I use mpsc (`std::sync::mpsc::channel()`):

```rust
// use mpsc
let (tx, rx) = std::sync::mpsc::channel();
thread::spawn(move || {
    for (images, _paths) in dl.into_iter() {
        tx.send(images).unwrap();
    }
});
thread::spawn(move || {
    for (_i, message) in rx.iter().enumerate() {
        let _ys = model.run(&message).unwrap();
    }
})
.join()
.unwrap();
```
But this problem appears even if I use a single thread:

```rust
// load then run
let x = image::io::Reader::open("./assets/demo.jpg")?.decode()?;
let y = model.run(&vec![x])?;
```
Here is the output:
```
cargo run -r --example rtdetr
    Finished release [optimized] target(s) in 0.05s
     Running `target/release/examples/rtdetr`
> Using CUDA
[ORT Inference]: 5.165937265s
Results saved at: runs/RT-DETR/2024-03-09-13-39-23-465984633.jpg
[Results { probs: None, Bboxes: Some([Bbox { xmin: 23.76523, ymin: 229.94244, xmax: 804.8533, ymax: 730.4618, id: 5, confidence: 0.9469714 }, Bbox { xmin: 668.5972, ymin: 394.98087, xmax: 809.0648, ymax: 880.43445, id: 0, confidence: 0.9517705 }, Bbox { xmin: 49.653255, ymin: 399.2633, xmax: 247.06194, ymax: 904.75684, id: 0, confidence: 0.9512628 }, Bbox { xmin: 222.2634, ymin: 405.63873, xmax: 345.4751, ymax: 860.4706, id: 0, confidence: 0.9255427 }, Bbox { xmin: 0.29167414, ymin: 550.899, xmax: 74.54556, ymax: 867.50653, id: 0, confidence: 0.70238566 }, Bbox { xmin: 283.0564, ymin: 484.21506, xmax: 297.04556, ymax: 520.7864, id: 27, confidence: 0.42629278 }]), Keypoints: None, Masks: None }]
corrupted double-linked list
[1] 5290 IOT instruction (core dumped) cargo run -r --example rtdetr
```
Given that the results print before the crash, I assume it may be occurring when something is dropped. Are you able to use a debugger to step through and see where it crashes?
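For what it's worth, one way to narrow down heap corruption like this (a sketch, assuming a glibc-based Linux system with gdb installed; the example name is taken from the output above):

```shell
# Make glibc's allocator abort as soon as it detects heap corruption,
# which usually moves the crash closer to the offending free():
MALLOC_CHECK_=3 cargo run -r --example rtdetr

# Then get a backtrace at the abort point under gdb:
gdb --args ./target/release/examples/rtdetr
# (gdb) run
# (gdb) bt
```

The "corrupted double-linked list" message itself comes from glibc's allocator, so the corruption may have happened well before the abort; a tool like valgrind or AddressSanitizer would catch the original bad write rather than the later detection.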
I tried the yolov8 example and found that this problem can be reproduced.
Demo files
Run `cargo run -r --example yolov8` and you will see:

```
    Finished release [optimized] target(s) in 38.40s
     Running `target/release/examples/yolov8`
corrupted double-linked list
[1] 47699 IOT instruction (core dumped) cargo run -r --example yolov8
```
The model inference results are ok, and the annotated image shows up. This bug appears when I press the button.
I tried some code with `ort = 1.16.3` and `onnxruntime = 1.16.3`; this bug doesn't show up there.
2.0.0 rc0 also has this problem when running the yolov8 example using the CUDA execution provider.
@jamjamjon What does `rustc --version` say?
I could reproduce this on Windows using `rustc 1.78.0-nightly (2dceda4f3 2024-03-01)`, but only with `--release`; `--profile dev` doesn't crash.
`rustc 1.78.0-nightly (46b180ec2 2024-03-08)` (latest nightly) and `rustc 1.76.0 (07dca489a 2024-02-04)` (stable) do not crash in either profile. Seems like a regression in rustc that has already been fixed.
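In case anyone wants to verify on their own machine, the two builds can be pinned with rustup and compared directly (a sketch; the toolchain dates are my guess from the commit dates above, since a nightly is usually dated one day after the commit it was built from):

```shell
# Install a nightly near each commit date, then run the crashing example under both:
rustup toolchain install nightly-2024-03-02 nightly-2024-03-09
cargo +nightly-2024-03-02 run -r --example yolov8   # should reproduce the crash
cargo +nightly-2024-03-09 run -r --example yolov8   # should be fine
```

If the crash follows the toolchain like this, that confirms the rustc regression rather than a bug in ort itself.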