pykeio / ort

A Rust wrapper for ONNX Runtime

Home Page: https://ort.pyke.io/


corrupted double-linked list [1] 14443 IOT instruction (core dumped)

jamjamjon opened this issue · comments

When I use the CUDA execution provider to run the corresponding ONNX model, this problem appears from time to time after inference completes. (The inference results are entirely correct!)

corrupted double-linked list [1]    
14443 IOT instruction (core dumped)

Environment

cuda: 11.7
ort: 2.0.0-alpha.4
onnxruntime: 1.17.1 & 1.17.0
gpu: GeForce RTX 3060
OS: Ubuntu 23.04, x86_64

Code snippet

pub fn new(config: &Config) -> Result<Self> {
    // Build a throwaway session first to read `shape, dtype, names` from the ONNX model.
    ort::init().commit()?;
    let session = Session::builder()?.with_model_from_file(onnx_path)?;

    // --------------------------------------------------------------------
    //   do something about inputs and outputs here
    // --------------------------------------------------------------------

    // Build the real session, preferring the CUDA execution provider.
    let builder = Session::builder()?;
    let cuda = CUDAExecutionProvider::default().build();
    if cuda.is_available()? && cuda.register(&builder).is_ok() {
        println!("> Using CUDA");
    } else {
        println!("> Using CPU");
    }
    let session = builder
        .with_optimization_level(ort::GraphOptimizationLevel::Level3)?
        .with_model_from_file(onnx_path)?;



    pub fn run_fp32(
        &self,
        xs: &Array<f32, IxDyn>,
    ) -> Result<Vec<Array<f32, IxDyn>>> {
        // Run inference, then copy each borrowed output view into an owned array.
        let ys = self.session.run(ort::inputs![xs.view()]?)?;
        Ok(ys
            .iter()
            .map(|(_, v)| v.extract_tensor::<f32>().unwrap().view().into_owned())
            .collect::<Vec<Array<_, _>>>())
    }
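The `.into_owned()` in `run_fp32` is doing real work: `extract_tensor` yields views that borrow from `ys`, so each output has to be copied out before `ys` is dropped at the end of the function. A stdlib-only sketch of the same borrow-then-copy pattern (the `Outputs` type here is a hypothetical stand-in, not an ort type):

```rust
// Hypothetical stand-in for a session output map that owns its buffers.
struct Outputs {
    tensors: Vec<Vec<f32>>,
}

impl Outputs {
    // Stand-in for `extract_tensor(...).view()`: the slice borrows from `self`.
    fn view(&self, i: usize) -> &[f32] {
        &self.tensors[i]
    }
}

fn main() {
    let ys = Outputs { tensors: vec![vec![1.0, 2.0], vec![3.0]] };

    // Copy each borrowed view into an owned Vec, mirroring `.into_owned()`;
    // the results can now outlive `ys`.
    let owned: Vec<Vec<f32>> = (0..ys.tensors.len())
        .map(|i| ys.view(i).to_vec())
        .collect();

    drop(ys); // the owned copies remain valid after the source is gone
    assert_eq!(owned, vec![vec![1.0, 2.0], vec![3.0]]);
}
```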

With the TensorRT and CPU execution providers, everything works fine.

Need your help, please.

Is run_fp32 being called from multiple threads/or a different thread than the one the session was created on?

Yes! I use `std::sync::mpsc::channel()`:

// use mpsc
let (tx, rx) = std::sync::mpsc::channel();
thread::spawn(move || {
    for (images, _paths) in dl.into_iter() {
        tx.send(images).unwrap();
    }
});
thread::spawn(move || {
    for (_i, message) in rx.iter().enumerate() {
        let _ys = model.run(&message).unwrap();
    }
})
.join()
.unwrap();
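For comparison, a minimal stdlib-only sketch of the same producer/consumer pattern with both handles joined (the snippet above never joins the producer thread; all names below are hypothetical and no ort types are involved):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Producer: sends each "image" batch, then drops `tx` on exit,
    // which closes the channel and ends the consumer's loop.
    let producer = thread::spawn(move || {
        for batch in [vec![1u8], vec![2, 3], vec![4, 5, 6]] {
            tx.send(batch).unwrap();
        }
    });

    // Consumer: `rx.iter()` blocks until a message arrives and stops
    // once every sender has been dropped.
    let consumer = thread::spawn(move || {
        let mut total = 0usize;
        for batch in rx.iter() {
            total += batch.len(); // stand-in for `model.run(&batch)`
        }
        total
    });

    producer.join().unwrap();
    let total = consumer.join().unwrap();
    assert_eq!(total, 6);
    println!("processed {total} items");
}
```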

But the problem appears even when I use a single thread:

    // load then run
    let x = image::io::Reader::open("./assets/demo.jpg")?.decode()?;
    let y = model.run(&vec![x])?;

Here is the output:

 cargo run -r --example rtdetr
    Finished release [optimized] target(s) in 0.05s
     Running `target/release/examples/rtdetr`
> Using CUDA
[ORT Inference]: 5.165937265s
Results saved at: runs/RT-DETR/2024-03-09-13-39-23-465984633.jpg
[Results { probs: None, Bboxes: Some([Bbox { xmin: 23.76523, ymin: 229.94244, xmax: 804.8533, ymax: 730.4618, id: 5, confidence: 0.9469714 }, Bbox { xmin: 668.5972, ymin: 394.98087, xmax: 809.0648, ymax: 880.43445, id: 0, confidence: 0.9517705 }, Bbox { xmin: 49.653255, ymin: 399.2633, xmax: 247.06194, ymax: 904.75684, id: 0, confidence: 0.9512628 }, Bbox { xmin: 222.2634, ymin: 405.63873, xmax: 345.4751, ymax: 860.4706, id: 0, confidence: 0.9255427 }, Bbox { xmin: 0.29167414, ymin: 550.899, xmax: 74.54556, ymax: 867.50653, id: 0, confidence: 0.70238566 }, Bbox { xmin: 283.0564, ymin: 484.21506, xmax: 297.04556, ymax: 520.7864, id: 27, confidence: 0.42629278 }]), Keypoints: None, Masks: None }]
corrupted double-linked list
[1]    5290 IOT instruction (core dumped)  cargo run -r --example rtdetr

Given that the results print before the crash I assume it may be occurring when something is dropped. Are you able to use a debugger to step through & see where it crashes?
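A cheap way to test a crash-at-drop theory without stepping through a full debugger is to instrument drop order with a small guard type (a generic Rust sketch, not ort-specific; `DropGuard` and the value names are hypothetical):

```rust
use std::cell::RefCell;

thread_local! {
    // Records the order in which guards are dropped.
    static DROP_LOG: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

struct DropGuard(&'static str);

impl Drop for DropGuard {
    fn drop(&mut self) {
        // A println! or breakpoint here pinpoints which value's drop
        // runs right before the crash.
        DROP_LOG.with(|log| log.borrow_mut().push(self.0));
    }
}

fn main() {
    {
        let _session = DropGuard("session");
        let _outputs = DropGuard("outputs");
        // Locals drop in reverse declaration order: outputs first, then session.
    }
    let order = DROP_LOG.with(|log| log.borrow().clone());
    assert_eq!(order, vec!["outputs", "session"]);
    println!("drop order: {order:?}");
}
```

Placing guards next to the session and its outputs narrows down which destructor the allocator aborts in.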

I tried the yolov8 example and found that this problem can be reproduced.

Demo files

yolov8-ort-bug-demo.zip

run

cargo run -r --example yolov8

and you will see

    Finished release [optimized] target(s) in 38.40s
     Running `target/release/examples/yolov8`
corrupted double-linked list
[1]    47699 IOT instruction (core dumped)  cargo run -r --example yolov8

The model inference results are correct, and the annotated image shows up. This bug appears when I press the button.

I tried some code with ort = 1.16.3 and onnxruntime = 1.16.3, and this bug doesn't show up.

2.0.0-rc.0 also has this problem when running the yolov8 example with the CUDA execution provider.

@jamjamjon What does `rustc --version` say?

I could reproduce this on Windows using rustc 1.78.0-nightly (2dceda4f3 2024-03-01) but only with --release. --profile dev doesn't crash.

rustc 1.78.0-nightly (46b180ec2 2024-03-08) (latest nightly) and rustc 1.76.0 (07dca489a 2024-02-04) (stable) do not crash in either profile. Seems like a regression in rustc that's already been fixed.