WasmEdge / WasmEdge

WasmEdge is a lightweight, high-performance, and extensible WebAssembly runtime for cloud native, edge, and decentralized applications. It powers serverless apps, embedded functions, microservices, smart contracts, and IoT devices.

Home Page: https://WasmEdge.org


feat: ggml: support more parameters from llama.cpp

dm4 opened this issue · comments

commented

Summary

We currently support some of the parameters from llama.cpp, such as n_gpu_layers, ctx-size, and threads, and we expect to support even more of them.

Details

Refer to gpt_params_find_arg() in llama.cpp/common/common.cpp; we plan to support additional parameters from that list.
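
For illustration, here is a minimal, self-contained sketch of how extra llama.cpp-style options could be gathered into a configuration struct with defaults. The `key=value` metadata format and the parser below are simplified stand-ins for illustration only (they are not the plugin's actual metadata handling), and the default values merely approximate llama.cpp's.

// Hypothetical sketch: collecting a few llama.cpp-style options into a
// config struct with defaults. The "key=value,key=value" metadata format
// and parser are simplified stand-ins, not the plugin's real handling.
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

struct SamplingConfig {
  uint64_t Seed = 0xFFFFFFFF;  // --seed
  uint64_t TopK = 40;          // --top-k
  double TopP = 0.95;          // --top-p
  double Temp = 0.8;           // --temp
  double RepeatPenalty = 1.1;  // --repeat-penalty
  uint64_t CtxSize = 512;      // --ctx-size
  uint64_t NGPULayers = 0;     // --n-gpu-layers
};

SamplingConfig parseMetadata(const std::string &Metadata) {
  SamplingConfig Config;
  // Split "key=value" pairs separated by commas into a map.
  std::unordered_map<std::string, std::string> Pairs;
  std::stringstream Stream(Metadata);
  std::string Item;
  while (std::getline(Stream, Item, ',')) {
    auto Pos = Item.find('=');
    if (Pos != std::string::npos) {
      Pairs[Item.substr(0, Pos)] = Item.substr(Pos + 1);
    }
  }
  // Override defaults only for the keys that were provided.
  if (Pairs.count("seed")) Config.Seed = std::stoull(Pairs["seed"]);
  if (Pairs.count("top-k")) Config.TopK = std::stoull(Pairs["top-k"]);
  if (Pairs.count("top-p")) Config.TopP = std::stod(Pairs["top-p"]);
  if (Pairs.count("temp")) Config.Temp = std::stod(Pairs["temp"]);
  if (Pairs.count("repeat-penalty"))
    Config.RepeatPenalty = std::stod(Pairs["repeat-penalty"]);
  if (Pairs.count("ctx-size")) Config.CtxSize = std::stoull(Pairs["ctx-size"]);
  if (Pairs.count("n-gpu-layers"))
    Config.NGPULayers = std::stoull(Pairs["n-gpu-layers"]);
  return Config;
}

int main() {
  auto Config = parseMetadata("ctx-size=4096,n-gpu-layers=35,temp=0.7");
  std::cout << "ctx-size=" << Config.CtxSize
            << " n-gpu-layers=" << Config.NGPULayers
            << " temp=" << Config.Temp << "\n";
  return 0;
}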

Appendix

The full list of options:

  • --seed
  • --threads
  • --threads-batch
  • --threads-draft
  • --threads-batch-draft
  • --prompt
  • --escape
  • --prompt-cache
  • --prompt-cache-all
  • --prompt-cache-ro
  • --binary-file
  • --file
  • --n-predict
  • --top-k
  • --ctx-size
  • --grp-attn-n
  • --grp-attn-w
  • --rope-freq-base
  • --rope-freq-scale
  • --rope-scaling
  • --rope-scale
  • --yarn-orig-ctx
  • --yarn-ext-factor
  • --yarn-attn-factor
  • --yarn-beta-fast
  • --yarn-beta-slow
  • --pooling
  • --defrag-thold
  • --samplers
  • --sampling-seq
  • --top-p
  • --min-p
  • --temp
  • --tfs
  • --typical
  • --repeat-last-n
  • --repeat-penalty
  • --frequency-penalty
  • --presence-penalty
  • --dynatemp-range
  • --dynatemp-exp
  • --mirostat
  • --mirostat-lr
  • --mirostat-ent
  • --cfg-negative-prompt
  • --cfg-negative-prompt-file
  • --cfg-scale
  • --batch-size
  • --ubatch-size
  • --keep
  • --draft
  • --chunks
  • --parallel
  • --sequences
  • --p-split
  • --model
  • --model-draft
  • --alias
  • --model-url
  • --hf-repo
  • --hf-file
  • --lora
  • --lora-scaled
  • --lora-base
  • --control-vector
  • --control-vector-scaled
  • --control-vector-layer-range
  • --mmproj
  • --image
  • --interactive
  • --embedding
  • --interactive-first
  • --instruct
  • --chatml
  • --infill
  • --dump-kv-cache
  • --no-kv-offload
  • --cache-type-k
  • --cache-type-v
  • --multiline-input
  • --simple-io
  • --cont-batching
  • --color
  • --mlock
  • --gpu-layers --n-gpu-layers
  • --gpu-layers-draft --n-gpu-layers-draft
  • --main-gpu
  • --split-mode
  • --tensor-split
  • --no-mmap
  • --numa
  • --verbose-prompt
  • --no-display-prompt
  • --reverse-prompt
  • --logdir
  • --lookup-cache-static
  • --lookup-cache-dynamic
  • --save-all-logits --kl-divergence-base
  • --perplexity --all-logits
  • --ppl-stride
  • --print-token-count
  • --ppl-output-type
  • --hellaswag
  • --hellaswag-tasks
  • --winogrande
  • --winogrande-tasks
  • --multiple-choice
  • --multiple-choice-tasks
  • --kl-divergence
  • --ignore-eos
  • --no-penalize-nl
  • --logit-bias
  • --help
  • --version
  • --random-prompt
  • --in-prefix-bos
  • --in-prefix
  • --in-suffix
  • --grammar
  • --grammar-file
  • --override-kv

Is this issue open for contributions? If yes, I would love to look into this.

commented

> Is this issue open for contributions? If yes, I would love to look into this.

Yes, this issue is open for contributions. We welcome your input and any code related to this issue.

Some parameters, such as --parallel and --draft, are not used directly in the internal implementation of llama.cpp (based on a search for "n_parallel" in the llama.cpp sources).
Only some parameters, such as the RoPE-related ones, affect the internal behavior of llama.cpp functions; for the others, integrating the processing logic needed to support them could completely change the implementation of compute(), as in the example below:

Sketch of integrating `--parallel` and `--draft` and parsing them as optional parameters in WasmEdge:
struct Graph {
    // ...
    uint64_t NParallel = 1;
    uint64_t NDraft = 1;
};

Expect<ErrNo> compute(WasiNNEnvironment &Env, uint32_t ContextId) noexcept {
    // ...
    // If --draft and --parallel are set, switch to speculative decoding.
    ReturnCode = SpeculativeDecoding(GraphRef, CxtRef);
    // Otherwise, keep the current implementation.
    // ...
}

ErrNo SpeculativeDecoding(Graph &GraphRef, Context &CxtRef) noexcept {
    // Implementation along the lines of
    // https://github.com/ggerganov/llama.cpp/blob/3292733f95d4632a956890a438af5192e7031c12/examples/speculative/speculative.cpp
}

detailed code: https://github.com/Fusaaaann/WasmEdge/blob/ae718df452658df555e2b4fe35e8c90e69c5c55f/plugins/wasi_nn/strategies/strategies.cpp#L234
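
For context, the control flow behind --draft works roughly like this: a small draft model speculatively proposes a short run of tokens, the target model verifies them, the longest accepted prefix is kept, and the target contributes one token of its own before the next round. Below is a toy, self-contained sketch of that loop; draftNextToken() and targetNextToken() are hypothetical placeholders standing in for real model evaluations, not llama.cpp APIs.

// Toy sketch of the draft-then-verify loop behind --draft / speculative
// decoding. The "models" are deterministic toy functions for illustration.
#include <cstdint>
#include <iostream>
#include <vector>

using Token = int32_t;

// Hypothetical stand-ins for draft-model and target-model sampling.
Token draftNextToken(const std::vector<Token> &Ctx) {
  return static_cast<Token>((Ctx.size() * 7 + 3) % 100);
}
Token targetNextToken(const std::vector<Token> &Ctx) {
  // Occasionally disagrees with the draft model, so some proposals get rejected.
  return static_cast<Token>((Ctx.size() * 7 + 3) % 100 + (Ctx.size() % 5 == 0 ? 1 : 0));
}

int main() {
  const size_t NDraft = 4;    // analogous to --draft
  const size_t NPredict = 16; // analogous to --n-predict
  std::vector<Token> Output;

  while (Output.size() < NPredict) {
    // 1. The draft model proposes NDraft tokens speculatively.
    std::vector<Token> Proposed = Output;
    std::vector<Token> Draft;
    for (size_t I = 0; I < NDraft; ++I) {
      Token T = draftNextToken(Proposed);
      Draft.push_back(T);
      Proposed.push_back(T);
    }

    // 2. The target model verifies the proposals; keep the longest matching
    //    prefix, then append one token from the target itself.
    size_t Accepted = 0;
    std::vector<Token> Verify = Output;
    for (Token T : Draft) {
      if (targetNextToken(Verify) != T)
        break;
      Verify.push_back(T);
      ++Accepted;
    }
    Output.insert(Output.end(), Draft.begin(), Draft.begin() + Accepted);
    Output.push_back(targetNextToken(Output));
  }

  std::cout << "generated " << Output.size() << " tokens\n";
  return 0;
}

In llama.cpp's actual examples/speculative/speculative.cpp, the verification step evaluates the drafted tokens with the target model in batched form, which is where the speed-up is expected to come from; the sketch above only shows the accept/reject control flow.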

What is WasmEdge's plan for supporting these parameters, if the wasi-nn functions become too complex to fit in one ggml.cpp file because of them?

commented

Hi @Fusaaaann,
We don't have a firm timeline for supporting the above parameters. If an application requires such options, we will raise their priority. There are already two different code paths in our plugin for handling normal LLM and LLaVA applications, so we don't mind if the complexity increases after adding more parameters.