iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/


Serialize Executables crashing when compiling LLaMa on async-cpu

rsuderman opened this issue

The following dispatches appear to cause a crash when compiling a LLaMa model. Unrolling/vectorization produces 20K+ lines of generated code, which likely causes the final LLVM compilation to fail.

module_prefill_bs4$async_dispatch_1.zip
module_decode_bs4$async_dispatch_2.zip

It appears the issue is in LLVMCPUVectorTransferLowering. There is a full unrolling making the dispatch rather unruly.


The unrolling is needed because the LLVM backend wants 1-D vectors. The issue could be in tile size selection, and vector shape optimization could potentially help with it.
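
For illustration, a minimal sketch of what the unrolled form looks like, with the shapes shrunk down to 4x8 and all names invented (this is not actual pass output): every n-D transfer becomes one 1-D transfer per row, so a 32000-row tensor yields 32000 of these ops.

  func.func @unrolled_form(%src: tensor<4x8xf16>) -> (vector<8xf16>, vector<8xf16>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %pad = arith.constant 0.0 : f16
    // One vector.transfer_read per row; with 32000 rows this pattern
    // repeats 32000 times, which is what blows up the dispatch.
    %r0 = vector.transfer_read %src[%c0, %c0], %pad {in_bounds = [true]} : tensor<4x8xf16>, vector<8xf16>
    %r1 = vector.transfer_read %src[%c1, %c0], %pad {in_bounds = [true]} : tensor<4x8xf16>, vector<8xf16>
    return %r0, %r1 : vector<8xf16>, vector<8xf16>
  }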

Some additional guilty lines (note that the vector types span the entire 32000x3200 tensor, so the lowering has no choice but to unroll completely):

  %12 = vector.transfer_read %10[%c0, %c0], %cst_2 {in_bounds = [true, true]} : tensor<32000x3200xf16>, vector<32000x3200xf16>
  %13 = arith.extf %12 : vector<32000x3200xf16> to vector<32000x3200xf32>
  %14 = vector.transfer_write %13, %11[%c0, %c0] {in_bounds = [true, true]} : vector<32000x3200xf32>, tensor<32000x3200xf32>

If we defer the unrolling of the vector.transfer_write, the arith.extf gets unrolled inside convert-to-llvm instead. I would expect generic vectorization to generate an actual loop of vector instructions rather than unrolling the whole operation during LLVM lowering.
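
A rough sketch of the loop form I have in mind, assuming row-granularity tiling (the function name and tile choice are made up, and this is not what any pass currently emits): an scf.for over rows that extends one vector<3200xf16> slice per iteration instead of materializing the whole tensor as a single vector.

  func.func @extf_loop(%src: tensor<32000x3200xf16>, %dst: tensor<32000x3200xf32>) -> tensor<32000x3200xf32> {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32000 = arith.constant 32000 : index
    %pad = arith.constant 0.0 : f16
    // Loop over rows, processing one 1-D slice per iteration.
    %res = scf.for %i = %c0 to %c32000 step %c1 iter_args(%acc = %dst) -> (tensor<32000x3200xf32>) {
      %row = vector.transfer_read %src[%i, %c0], %pad {in_bounds = [true]} : tensor<32000x3200xf16>, vector<3200xf16>
      %ext = arith.extf %row : vector<3200xf16> to vector<3200xf32>
      %out = vector.transfer_write %ext, %acc[%i, %c0] {in_bounds = [true]} : vector<3200xf32>, tensor<32000x3200xf32>
      scf.yield %out : tensor<32000x3200xf32>
    }
    return %res : tensor<32000x3200xf32>
  }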

Okay, so this is similar to what I'm seeing in #17226 (comment)

IMO, we should not fuse these two generic ops. TileAndFuse is basically broken for this case: there is no dependency captured by the operands. I'll talk to Mahesh to see if we can disable such fusion.
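
To make the "no dependency captured by operands" point concrete, here is a hypothetical sketch (all shapes and names invented) of the kind of pattern involved: the lookup reads the table via tensor.extract inside the region, so the use never appears as an operand and operand-based fusion cannot see the producer-consumer edge.

  func.func @lookup(%table: tensor<32000x3200xf16>, %ids: tensor<4x128xi64>) -> tensor<4x128x3200xf16> {
    %empty = tensor.empty() : tensor<4x128x3200xf16>
    %out = linalg.generic {
        indexing_maps = [affine_map<(b, s, d) -> (b, s)>,
                         affine_map<(b, s, d) -> (b, s, d)>],
        iterator_types = ["parallel", "parallel", "parallel"]}
        ins(%ids : tensor<4x128xi64>)
        outs(%empty : tensor<4x128x3200xf16>) {
    ^bb0(%id: i64, %o: f16):
      %row = arith.index_cast %id : i64 to index
      %col = linalg.index 2 : index
      // %table is captured from the enclosing scope, not passed as an
      // operand, so the dependency is invisible to TileAndFuse.
      %v = tensor.extract %table[%row, %col] : tensor<32000x3200xf16>
      linalg.yield %v : f16
    } -> tensor<4x128x3200xf16>
    return %out : tensor<4x128x3200xf16>
  }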

@pashu123 please take a look and see if there are other issues apart from the fusion issue.

Do we have a workaround for this or any patches we could try?

I'm also seeing unusably slow compilation after LLVMCPUVectorTransferLowering runs on open_llama_3b_v2_f16_gguf from https://github.com/nod-ai/sharktank. Logs and IR here: https://gist.github.com/ScottTodd/17734adbbd570dbfa3d275c8c7a8e9a9

Perhaps you can try llvm/torch-mlir#3277 . It should fix the embedding lookup issue at torch level.


That gets further, yeah :D. Might be enough to call this particular issue fixed?

I do see another error with iree-compile open_llama_3b_v2_f16.mlir --iree-hal-target-backends=llvm-cpu -o /tmp/open_llama_3b_v2_f16_cpu.vmfb:

failed to legalize operation 'arith.extui'
note: see current operation: %1401 = "arith.extui"(%1398) : (i1) -> i64

It occurs pretty late in compilation: https://gist.github.com/ScottTodd/6fbe7edd118bbb53c0abc2582459158d


There is an action item at the Linalg level: #17226 (comment)


@ScottTodd can you provide the mlir file? @pashu123 please help triage and provide possible solutions


This is the input file I'm working with: https://sharkpublic.blob.core.windows.net/sharkpublic/scotttodd/issue_reports/open_llama_3b_v2_f16.mlir

@ScottTodd I think you should add the -iree-opt-demote-i64-to-i32 flag. Meanwhile, I'll double-check this.

Verified that adding -iree-opt-demote-i64-to-i32 generates the .vmfb.
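
For anyone else hitting this, the full invocation (the command from above plus the suggested flag) would look something like:

  iree-compile open_llama_3b_v2_f16.mlir \
    --iree-hal-target-backends=llvm-cpu \
    --iree-opt-demote-i64-to-i32 \
    -o /tmp/open_llama_3b_v2_f16_cpu.vmfb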

After thinking about it for a while, I think we can close this issue. The action item I mentioned is tracked in the other issue, and we don't have remaining action items for this one.