tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.

Home Page: https://burn.dev


Too many bindings of type StorageBuffers in Stage ShaderStages(COMPUTE)

ianmarmour opened this issue · comments

Describe the bug

When attempting to train my model with a batch size > 1, too many bindings of type StorageBuffer are created. This leads to a panic in WGPU and crashes the training process. It appears that the correct limits for the adapter are being inferred, but they are not respected by fusion; if fusion is disabled, this error isn't encountered.
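Since the panic disappears with fusion disabled, one possible workaround (a sketch only; the exact feature names depend on the burn version, so verify them against burn's own Cargo manifest) is to opt out of the `fusion` feature:

```toml
# Hypothetical Cargo.toml fragment: disable burn's default features and
# re-enable only what this project needs, omitting "fusion".
# Feature names here are assumptions, not confirmed against burn 0.14.
[dependencies]
burn = { version = "0.14", default-features = false, features = ["std", "train", "wgpu", "autodiff"] }
```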

To Reproduce

  1. Use WGPU as the device type on a Mac.
  2. .... unsure...
  3. cargo run to perform training with batch size > 1
  4. On the second batch of items the panic will occur.

Expected behavior

Fused kernels should respect the adapter's `max_storage_buffers_per_shader_stage` limit rather than binding more StorageBuffers than the device allows.

Screenshots

2024-07-04T01:51:02.976015Z  INFO burn_fusion::stream::store::base: New execution plan 76 - Operations: 1 - Triggers 1    
2024-07-04T01:51:02.976063Z  INFO burn_fusion::stream::store::base: New execution plan 77 - Operations: 3 - Triggers 1    
2024-07-04T01:51:02.985188Z  INFO burn_jit::fusion::kernel: Compiling ... "mri0pvnx16y16z16g0vs7ubdfg"    
2024-07-04T01:51:03.194143Z  INFO burn_jit::fusion::kernel: Compiling ... "mri0pvnx16y16z1675o2aql5m4"    
2024-07-04T01:51:03.404701Z  INFO burn_compute::tune::tuner: Fastest result burn_jit::fusion::kernel::AutotunableKernel<burn_wgpu::runtime::WgpuRuntime>-Fusion ElemWise - num_operations: 4 shape: [32, 64, 2, 64]    
2024-07-04T01:51:03.428641Z  INFO burn_jit::fusion::kernel: Compiling ... "mi0o0ri0pvnx16y16z16g0vs7ubdfg"    
2024-07-04T01:51:03.515667Z  INFO burn_fusion::stream::store::base: New execution plan 78 - Operations: 99 - Triggers 1    
2024-07-04T01:51:03.595443Z  INFO burn_jit::fusion::kernel: Compiling ... "mri0pi2pi3pi4pi5pi6pi7pi8pi9pi10pi11pi12pi13pi14pi15pi16pi17pi18pi19pi20pi21pi22pi23pi24pi25pi26pi27pi28pi29pi30pi31pi32pi33pi34pi35pi36pi37pi38pi39pi40pi41pi42pi43pi44pi45pi46pi47pi48pi49pi50pi51pi52pi53pi54pi55pi56pi57pi58pi59pi60pi61pi62pi63pi64pi66pvnx16y16z160u635da9uo"    
2024-07-04T01:51:03.597239Z ERROR wgpu::backend::wgpu_core: Handling wgpu errors as fatal by default    
2024-07-04T01:51:03.597268Z ERROR burn_train::learner::application_logger: PANIC => panicked at /Users/ian/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.20.1/src/backend/wgpu_core.rs:2996:5:
wgpu error: Validation Error

Caused by:
    In Device::create_compute_pipeline
    Unable to derive an implicit layout
    Too many bindings of type StorageBuffers in Stage ShaderStages(COMPUTE), limit is 31, count was 102. Check the limit `max_storage_buffers_per_shader_stage` passed to `Adapter::request_device`
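The failing validation amounts to a simple count-versus-limit check: the fused kernel in the log binds 102 storage buffers against a Metal limit of 31. A minimal self-contained illustration of that check (names are illustrative, not wgpu's internals):

```rust
/// Illustrative check mirroring wgpu's per-stage binding validation.
/// This is not wgpu code; it only demonstrates the arithmetic of the failure.
fn check_storage_buffer_bindings(count: u32, limit: u32) -> Result<(), String> {
    if count > limit {
        Err(format!(
            "Too many bindings of type StorageBuffers: limit is {limit}, count was {count}"
        ))
    } else {
        Ok(())
    }
}

fn main() {
    // Values taken from the panic message above: count 102, limit 31.
    assert!(check_storage_buffer_bindings(102, 31).is_err());
    // A small fused kernel (e.g. 4 bindings) would pass validation.
    assert!(check_storage_buffer_bindings(4, 31).is_ok());
    println!("ok");
}
```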

Desktop (please complete the following information):

  • OS: macOS Sonoma
  • CPU: M3 Max
  • Version: 14.5
  • Burn Version: 0.14.0

Additional context

Training Configuration

// Imports below are assumed from burn 0.14 and were not part of the
// original report; `TasNetConfig` and the `training` module are user code.
use burn::backend::{wgpu::WgpuDevice, Autodiff, Wgpu};
use burn::config::Config;
use burn::optim::AdamConfig;

#[derive(Config)]
pub struct TrainingConfig {
    pub model: TasNetConfig,
    pub optimizer: AdamConfig,
    #[config(default = 10)]
    pub num_epochs: usize,
    #[config(default = 2)]
    pub batch_size: usize,
    #[config(default = 4)]
    pub num_workers: usize,
    #[config(default = 42)]
    pub seed: u64,
    #[config(default = 3.0e-4)]
    pub learning_rate: f64,
}

fn main() {
    let device = WgpuDevice::BestAvailable;

    training::train::<Autodiff<Wgpu>>(
        "/tmp/guide",
        training::TrainingConfig::new(TasNetConfig::new(500, 40, 1000, 10, 2), AdamConfig::new()),
        device,
    );
}

The Metal GPU Feature Set Table indicates that the inferred maximum limit of 31 StorageBuffers is correct for this hardware.