mlc-ai / web-llm

High-performance In-browser LLM Inference Engine

Home Page: https://webllm.mlc.ai

[MLC-LLM] Uncaught (in promise) LinkError: WebAssembly.instantiate(): Import #4 "env"

DavidGOrtega opened this issue

I have set up mlc-llm at the latest commit to compile my models; however, the compiled models do not work.

Uncaught (in promise) LinkError: WebAssembly.instantiate(): Import #4 "env" "_ZN3mlc3llm5serve16JSONSchemaToEBNFENSt3__212basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEENS2_8optionalIiEENS9_INS2_4pairIS8_S8_EEEEb": function import requires a callable
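
For reference, the mangled symbol in the failing "env" import demangles to the JSON-schema-to-EBNF grammar helper in the mlc-llm serve runtime. You can confirm this with c++filt (output abbreviated):

echo '_ZN3mlc3llm5serve16JSONSchemaToEBNFENSt3__212basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEENS2_8optionalIiEENS9_INS2_4pairIS8_S8_EEEEb' | c++filt
# mlc::llm::serve::JSONSchemaToEBNF(std::__2::basic_string<...>, std::__2::optional<int>, std::__2::optional<std::__2::pair<...>>, bool)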

How to reproduce:
Clone the phi-2 model, then run:

export TVM_HOME=/your/path/mlc-llm/3rdparty/tvm
export MLC_LLM_HOME=/your/path/mlc-llm

export MODEL=/your/path/models/phi-2
export QUANTIZATION=q0f16

mlc_llm convert_weight $MODEL --quantization $QUANTIZATION -o $MODEL/MLC
mlc_llm gen_config $MODEL --quantization $QUANTIZATION  --conv-template phi-2 -o $MODEL/MLC
mlc_llm compile $MODEL/MLC/mlc-chat-config.json --device webgpu -o $MODEL/MLC/webllm.wasm
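
To see which runtime symbols the generated module expects the host to provide, you can dump its import section — a diagnostic sketch assuming WABT's wasm-objdump is installed:

# List the module's import section and filter for the grammar helper.
wasm-objdump -x -j Import $MODEL/MLC/webllm.wasm | grep JSONSchemaToEBNF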

Output:

mlc_llm compile $MODEL/MLC/mlc-chat-config.json --device webgpu -o $MODEL/MLC/webllm.wasm
[2024-04-21 14:17:06] INFO auto_config.py:69: Found model configuration: /models/phi-2/MLC/mlc-chat-config.json
[2024-04-21 14:17:06] INFO auto_config.py:153: Found model type: phi. Use `--model-type` to override.
Compiling with arguments:
  --config          Phi1Config(vocab_size=51200, hidden_size=2560, intermediate_size=10240, num_hidden_layers=32, num_attention_heads=32, layer_norm_eps=1e-05, position_embedding_base=10000.0, partial_rotary_factor=0.4, num_key_value_heads=32, context_window_size=2048, prefill_chunk_size=2048, head_dim=80, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    NoQuantize(name='q0f16', kind='no-quant', model_dtype='float16')
  --model-type      phi
  --target          {"host": {"kind": "llvm", "tag": "", "keys": ["cpu"], "mtriple": "wasm32-unknown-unknown-wasm"}, "max_num_threads": 256, "kind": "webgpu", "tag": "", "keys": ["webgpu", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /models/phi-2/MLC/webllm.wasm
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None
[2024-04-21 14:17:06] INFO compile.py:137: Creating model from: Phi1Config(vocab_size=51200, hidden_size=2560, intermediate_size=10240, num_hidden_layers=32, num_attention_heads=32, layer_norm_eps=1e-05, position_embedding_base=10000.0, partial_rotary_factor=0.4, num_key_value_heads=32, context_window_size=2048, prefill_chunk_size=2048, head_dim=80, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-04-21 14:17:06] INFO compile.py:156: Exporting the model to TVM Unity compiler
[2024-04-21 14:17:06] INFO compile.py:162: Running optimizations using TVM Unity
[2024-04-21 14:17:06] INFO compile.py:176: Registering metadata: {'model_type': 'phi', 'quantization': 'q0f16', 'context_window_size': 2048, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-04-21 14:17:06] WARNING auto_target.py:123: --system-lib-prefix is not specified when building a static library
[2024-04-21 14:17:07] INFO pipeline.py:50: Running TVM Relax graph-level optimizations
[2024-04-21 14:17:08] INFO pipeline.py:50: Lowering to TVM TIR kernels
[2024-04-21 14:17:10] INFO pipeline.py:50: Running TVM TIR-level optimizations
[2024-04-21 14:17:15] INFO pipeline.py:50: Running TVM Dlight low-level optimizations
[2024-04-21 14:17:16] INFO pipeline.py:50: Lowering to VM bytecode
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 10.00 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 3.91 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 100.78 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 100.00 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.05 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 10.00 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 100.01 MB
[2024-04-21 14:17:17] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-21 14:17:18] INFO pipeline.py:50: Compiling external modules
[2024-04-21 14:17:18] INFO pipeline.py:50: Compilation complete! Exporting to disk
[14:17:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/target/llvm/codegen_llvm.cc:185: Warning: Set native vector bits to be 128 for wasm32
[2024-04-21 14:17:47] INFO model_metadata.py:96: Total memory usage: 5402.61 MB (Parameters: 5301.83 MB. KVCache: 0.00 MB. Temporary buffer: 100.78 MB)
[2024-04-21 14:17:47] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-04-21 14:17:47] INFO compile.py:198: Generated: /models/phi-2/MLC/webllm.wasm
(mlc-chat-venv) Davids-MacBook-Pro:mlc-llm davidgortega$ mlc_llm gen_config $MODEL --quantization $QUANTIZATION --conv-template phi-2 -o $MODEL/MLC
[2024-04-21 14:18:59] INFO auto_config.py:115: Found model configuration: /models/phi-2/config.json
[2024-04-21 14:18:59] INFO auto_config.py:153: Found model type: phi. Use `--model-type` to override.
[2024-04-21 14:18:59] INFO phi_model.py:53: context_window_size not found in config.json. Falling back to max_position_embeddings (2048)
[2024-04-21 14:18:59] INFO config.py:106: Overriding max_batch_size from 1 to 80
[2024-04-21 14:18:59] INFO gen_config.py:187: [generation_config.json] Setting eos_token_id: 50256
[2024-04-21 14:18:59] INFO gen_config.py:187: [generation_config.json] Setting bos_token_id: 50256
[2024-04-21 14:18:59] INFO gen_config.py:201: Not found tokenizer config: /models/phi-2/tokenizer.model
[2024-04-21 14:18:59] INFO gen_config.py:199: Found tokenizer config: /models/phi-2/tokenizer.json. Copying to /models/phi-2/MLC/tokenizer.json
[2024-04-21 14:18:59] INFO gen_config.py:199: Found tokenizer config: /models/phi-2/vocab.json. Copying to /models/phi-2/MLC/vocab.json
[2024-04-21 14:18:59] INFO gen_config.py:199: Found tokenizer config: /models/phi-2/merges.txt. Copying to /models/phi-2/MLC/merges.txt
[2024-04-21 14:18:59] INFO gen_config.py:199: Found tokenizer config: /models/phi-2/added_tokens.json. Copying to /models/phi-2/MLC/added_tokens.json
[2024-04-21 14:18:59] INFO gen_config.py:199: Found tokenizer config: /models/phi-2/tokenizer_config.json. Copying to /models/phi-2/MLC/tokenizer_config.json
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting pad_token_id: 0
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting temperature: 0.7
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting presence_penalty: 0.0
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting frequency_penalty: 0.0
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting repetition_penalty: 1.0
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting top_p: 0.95
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting mean_gen_len: 128
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting max_gen_len: 512
[2024-04-21 14:18:59] INFO gen_config.py:76: [System default] Setting shift_fill_factor: 0.3
[2024-04-21 14:18:59] INFO gen_config.py:263: Dumping configuration file to: /models/phi-2/MLC/mlc-chat-config.json
(mlc-chat-venv) Davids-MacBook-Pro:mlc-llm davidgortega$ mlc_llm compile $MODEL/MLC/mlc-chat-config.json --device webgpu -o $MODEL/MLC/webllm.wasm
[2024-04-21 14:19:07] INFO auto_config.py:69: Found model configuration: /models/phi-2/MLC/mlc-chat-config.json
[2024-04-21 14:19:07] INFO auto_config.py:153: Found model type: phi. Use `--model-type` to override.
Compiling with arguments:
  --config          Phi1Config(vocab_size=51200, hidden_size=2560, intermediate_size=10240, num_hidden_layers=32, num_attention_heads=32, layer_norm_eps=1e-05, position_embedding_base=10000.0, partial_rotary_factor=0.4, num_key_value_heads=32, context_window_size=2048, prefill_chunk_size=2048, head_dim=80, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    NoQuantize(name='q0f16', kind='no-quant', model_dtype='float16')
  --model-type      phi
  --target          {"host": {"kind": "llvm", "tag": "", "keys": ["cpu"], "mtriple": "wasm32-unknown-unknown-wasm"}, "max_num_threads": 256, "kind": "webgpu", "tag": "", "keys": ["webgpu", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /models/phi-2/MLC/webllm.wasm
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None
[2024-04-21 14:19:07] INFO compile.py:137: Creating model from: Phi1Config(vocab_size=51200, hidden_size=2560, intermediate_size=10240, num_hidden_layers=32, num_attention_heads=32, layer_norm_eps=1e-05, position_embedding_base=10000.0, partial_rotary_factor=0.4, num_key_value_heads=32, context_window_size=2048, prefill_chunk_size=2048, head_dim=80, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-04-21 14:19:07] INFO compile.py:156: Exporting the model to TVM Unity compiler
[2024-04-21 14:19:07] INFO compile.py:162: Running optimizations using TVM Unity
[2024-04-21 14:19:07] INFO compile.py:176: Registering metadata: {'model_type': 'phi', 'quantization': 'q0f16', 'context_window_size': 2048, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-04-21 14:19:07] WARNING auto_target.py:123: --system-lib-prefix is not specified when building a static library
[2024-04-21 14:19:08] INFO pipeline.py:50: Running TVM Relax graph-level optimizations
[2024-04-21 14:19:10] INFO pipeline.py:50: Lowering to TVM TIR kernels
[2024-04-21 14:19:12] INFO pipeline.py:50: Running TVM TIR-level optimizations
[2024-04-21 14:19:17] INFO pipeline.py:50: Running TVM Dlight low-level optimizations
[2024-04-21 14:19:18] INFO pipeline.py:50: Lowering to VM bytecode
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 10.00 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 3.91 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 100.78 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 100.00 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.05 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 10.00 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 100.01 MB
[2024-04-21 14:19:19] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-21 14:19:20] INFO pipeline.py:50: Compiling external modules
[2024-04-21 14:19:20] INFO pipeline.py:50: Compilation complete! Exporting to disk
[14:19:22] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/target/llvm/codegen_llvm.cc:185: Warning: Set native vector bits to be 128 for wasm32
[2024-04-21 14:19:48] INFO model_metadata.py:96: Total memory usage: 5402.61 MB (Parameters: 5301.83 MB. KVCache: 0.00 MB. Temporary buffer: 100.78 MB)
[2024-04-21 14:19:48] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-04-21 14:19:48] INFO compile.py:198: Generated: /models/phi-2/MLC/webllm.wasm

If I point the app at the wasm for MLC's prebuilt phi-2, it works.
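
One way to localize the mismatch is to diff the import sections of the failing module against the known-good prebuilt one (again assuming WABT; the prebuilt file name below is a placeholder):

wasm-objdump -x -j Import $MODEL/MLC/webllm.wasm > custom-imports.txt
wasm-objdump -x -j Import phi-2-prebuilt.wasm > prebuilt-imports.txt
# Any "env" import that appears only in custom-imports.txt is a symbol
# the web-llm runtime on the page does not export.
diff custom-imports.txt prebuilt-imports.txt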

I'm trying to add WebGPU tests using Cypress, but I'm spending more time looking at this than anything else.

It should be fixed now via mlc-ai/mlc-llm#2187.

We recently started using EMCC to include runtime code from https://github.com/mlc-ai/mlc-llm in the model WASM, currently mainly for grammar support; when the compile toolchain and the web-llm runtime are out of sync, the model WASM can end up importing symbols (like JSONSchemaToEBNF above) that the page runtime does not provide. At the moment I am prioritizing a smooth experience for users of the prebuilt models; for instance, we introduced WASM versioning.
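
As a rough, heuristic check of what a compiled module embeds (the exact version metadata layout is internal to web-llm/mlc-llm, so treat this only as a sketch):

# Grep the module's embedded strings for version-like markers.
strings $MODEL/MLC/webllm.wasm | grep -i version | head -n 5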

For compiling a customized model, the current workaround is to build against the commits specified in the WASM version PRs, which should guarantee a working WASM (though admittedly this is a bit inconvenient).
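
A minimal sketch of that workaround, with <commit> as a placeholder for the actual hash listed in the relevant WASM version PR:

cd /your/path/mlc-llm
# Check out the mlc-llm commit referenced by the WASM version PR that
# matches your installed web-llm release (<commit> is a placeholder).
git checkout <commit>
git submodule update --init --recursive
# Then re-run convert_weight / gen_config / compile as shown above.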

In the future, we will find ways to make sure that the compile-to-runtime flow is also smooth.