mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation

Home Page: https://llm.mlc.ai/

[Bug] mlc-llm compile bug when cutlass and cublas enabled

BBuf opened this issue

When I build relax (mlc-ai/relax) with cuBLAS and CUTLASS enabled and then compile a GPU model with mlc-llm using the q0f16 quantization, compilation crashes. Other configurations, such as q4f16_1, compile fine, and with cuBLAS and CUTLASS disabled, every configuration compiles normally in mlc-llm.
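For reference, "enabling cuBLAS and CUTLASS" means turning on the corresponding TVM build options when building the relax checkout. A minimal config.cmake sketch (these are the standard TVM option names; the exact configuration used for this report is my assumption):

set(USE_CUDA ON)     # CUDA toolkit, required for the cuda target below
set(USE_CUBLAS ON)   # offload matmuls to cuBLAS through BYOC
set(USE_CUTLASS ON)  # generate CUTLASS kernels through BYOC

The error stack from the failing run is: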

/bbuf> python3 -m mlc_llm.build --hf-path StarRing2022/RWKV-4-World-7B --target cuda --quantization q0f16
Weights exist at dist/models/RWKV-4-World-7B, skipping download.
Using path "dist/models/RWKV-4-World-7B" for model "RWKV-4-World-7B"
Target configured: cuda -keys=cuda,gpu -arch=sm_80 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_80 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 14.003204345703125 GB
Start storing to cache dist/RWKV-4-World-7B-q0f16/params
[0582/0582] saving param_581
All finished, 227 total shards committed, record saved to dist/RWKV-4-World-7B-q0f16/params/ndarray-cache.json
Finish exporting chat config to dist/RWKV-4-World-7B-q0f16/params/mlc-chat-config.json
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/bbuf/.local/lib/python3.8/site-packages/mlc_llm/build.py", line 13, in <module>
    main()
  File "/home/bbuf/.local/lib/python3.8/site-packages/mlc_llm/build.py", line 10, in main
    core.build_model_from_args(parsed_args)
  File "/home/bbuf/.local/lib/python3.8/site-packages/mlc_llm/core.py", line 584, in build_model_from_args
    mod = mod_transform_before_build(mod, param_manager, args, model_config)
  File "/home/bbuf/.local/lib/python3.8/site-packages/mlc_llm/core.py", line 407, in mod_transform_before_build
    mod = tvm.transform.Sequential(
  File "/bbuf/relax/python/tvm/ir/transform.py", line 238, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "/bbuf/relax/python/tvm/_ffi/_ctypes/packed_func.py", line 238, in __call__
    raise get_last_ffi_error()
tvm.error.InternalError: Traceback (most recent call last):
  22: TVMFuncCall
  21: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::transform::Pass, tvm::IRModule)>::AssignTypedLambda<tvm::transform::{lambda(tvm::transform::Pass, tvm::IRModule)#7}>(tvm::transform::{lambda(tvm::transform::Pass, tvm::IRModule)#7}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
  20: tvm::transform::Pass::operator()(tvm::IRModule) const
  19: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  18: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  17: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  16: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  15: _ZN3tvm7runtime13PackedFuncObj9ExtractorINS0_16PackedFuncSubObjIZNS0_15TypedPackedFuncIFNS_8IRModuleES5_NS_9transform11PassContextEEE17AssignTypedLambdaIZNS_5relax9transform16FuseOpsByPatternERKNS0_5ArrayINSC_13FusionPatternEvEEbbEUlS5_S7_E_EEvT_EUlRKNS0_7TVMArgsEPNS0_11TVMRetValueEE_EEE4CallEPKS1_SK_SO_
  14: tvm::relax::FuseOpsByPattern(tvm::runtime::Array<tvm::relax::transform::FusionPattern, void> const&, tvm::IRModule, bool, bool)
  13: tvm::relax::MakeGroupedFunctions(tvm::IRModule, std::unordered_map<tvm::runtime::Object const*, tvm::relay::GraphPartitioner::Group*, std::hash<tvm::runtime::Object const*>, std::equal_to<tvm::runtime::Object const*>, std::allocator<std::pair<tvm::runtime::Object const* const, tvm::relay::GraphPartitioner::Group*> > > const&, bool)
  12: tvm::relax::OperatorFusor::Transform()
  11: tvm::relax::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  10: _ZZN3tvm5relax11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  9: tvm::relax::ExprMutator::VisitExpr_(tvm::relax::FunctionNode const*)
  8: tvm::relax::ExprMutator::VisitWithNewScope(tvm::RelayExpr const&, tvm::runtime::Optional<tvm::runtime::Array<tvm::relax::Var, void> >)
  7: tvm::relax::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  6: _ZZN3tvm5relax11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlRKNS_7r
  5: tvm::relax::ExprMutator::VisitExpr_(tvm::relax::SeqExprNode const*)
  4: tvm::relax::OperatorFusor::VisitBindingBlock(tvm::relax::BindingBlock const&)
  3: tvm::relax::OperatorFusor::VisitBindingBlock_(tvm::relax::DataflowBlockNode const*)
  2: tvm::relax::OperatorFusor::CollectFuncBoundary(tvm::runtime::Array<tvm::relax::Binding, void> const&)
  1: tvm::relax::PostOrderVisit(tvm::RelayExpr const&, std::function<void (tvm::RelayExpr const&)>)
  0: tvm::relax::OperatorFusor::CollectFuncBoundary(tvm::runtime::Array<tvm::relax::Binding, void> const&)::{lambda(tvm::RelayExpr const&)#1}::operator()(tvm::RelayExpr const&) const
  File "/bbuf/relax/src/relax/transform/fuse_ops.cc", line 876
InternalError: Check failed: (depgroup != cur_group) is false: A cyclic dependency detected between the groups lv2757 and lv2756 are in.
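Schematically, the failed check means the pattern matcher assigned bindings to fusion groups that feed each other. A hypothetical binding sequence (illustrative only, not taken from the actual RWKV module) that would trip the check at fuse_ops.cc:876:

lv2756 = R.matmul(x, w1)       # matched into offload group A
lv2757 = R.add(lv2756, b)      # assigned to group B
lv2758 = R.matmul(lv2757, w2)  # matched back into group A

Group A then depends on group B's lv2757 while group B depends on group A's lv2756, so neither grouped function can be emitted before the other.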

The mlc-llm compile command is:

python3 -m mlc_llm.build --hf-path StarRing2022/RWKV-4-World-7B --target cuda --quantization q0f16
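For what it's worth, the crash happens inside the pattern-based offload pass, not in the quantization itself. Below is a rough sketch of how the cuBLAS/CUTLASS BYOC patterns are typically applied to a Relax module with the TVM Unity Python API; whether mlc_llm wires it up exactly this way is an assumption on my part.

import tvm
from tvm import relax
from tvm.relax.backend import get_patterns_with_prefix

def offload_byoc(mod: tvm.IRModule) -> tvm.IRModule:
    # Collect the fusion patterns registered for each backend.
    # (Assumption: this mirrors mlc_llm's offload step.)
    patterns = list(get_patterns_with_prefix("cutlass")) + list(
        get_patterns_with_prefix("cublas")
    )
    # FuseOpsByPattern groups matched subgraphs into composite
    # functions; its graph partitioner is where the cyclic-dependency
    # check in the traceback above fires for the q0f16 build.
    return relax.transform.FuseOpsByPattern(
        patterns, annotate_codegen=True
    )(mod)

With q4f16_1 the matmul inputs go through decode first, so different subgraphs match these patterns, which may be why only q0f16 hits the cycle.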

This should work now.