feat: ggml: support more parameters from llama.cpp
dm4 opened this issue
Summary
We currently support some parameters from llama.cpp, such as `n-gpu-layers`, `ctx-size`, `threads`, etc., and we would like to support even more of them.
Details
Refer to `gpt_params_find_arg()` in `llama.cpp/common/common.cpp`; we plan to support the additional parameters listed below.
Appendix
Full list of options:
- --seed
- --threads
- --threads-batch
- --threads-draft
- --threads-batch-draft
- --prompt
- --escape
- --prompt-cache
- --prompt-cache-all
- --prompt-cache-ro
- --binary-file
- --file
- --n-predict
- --top-k
- --ctx-size
- --grp-attn-n
- --grp-attn-w
- --rope-freq-base
- --rope-freq-scale
- --rope-scaling
- --rope-scale
- --yarn-orig-ctx
- --yarn-ext-factor
- --yarn-attn-factor
- --yarn-beta-fast
- --yarn-beta-slow
- --pooling
- --defrag-thold
- --samplers
- --sampling-seq
- --top-p
- --min-p
- --temp
- --tfs
- --typical
- --repeat-last-n
- --repeat-penalty
- --frequency-penalty
- --presence-penalty
- --dynatemp-range
- --dynatemp-exp
- --mirostat
- --mirostat-lr
- --mirostat-ent
- --cfg-negative-prompt
- --cfg-negative-prompt-file
- --cfg-scale
- --batch-size
- --ubatch-size
- --keep
- --draft
- --chunks
- --parallel
- --sequences
- --p-split
- --model
- --model-draft
- --alias
- --model-url
- --hf-repo
- --hf-file
- --lora
- --lora-scaled
- --lora-base
- --control-vector
- --control-vector-scaled
- --control-vector-layer-range
- --mmproj
- --image
- --interactive
- --embedding
- --interactive-first
- --instruct
- --chatml
- --infill
- --dump-kv-cache
- --no-kv-offload
- --cache-type-k
- --cache-type-v
- --multiline-input
- --simple-io
- --cont-batching
- --color
- --mlock
- --gpu-layers, --n-gpu-layers
- --gpu-layers-draft, --n-gpu-layers-draft
- --main-gpu
- --split-mode
- --tensor-split
- --no-mmap
- --numa
- --verbose-prompt
- --no-display-prompt
- --reverse-prompt
- --logdir
- --lookup-cache-static
- --lookup-cache-dynamic
- --save-all-logits, --kl-divergence-base
- --perplexity, --all-logits
- --ppl-stride
- --print-token-count
- --ppl-output-type
- --hellaswag
- --hellaswag-tasks
- --winogrande
- --winogrande-tasks
- --multiple-choice
- --multiple-choice-tasks
- --kl-divergence
- --ignore-eos
- --no-penalize-nl
- --logit-bias
- --help
- --version
- --random-prompt
- --in-prefix-bos
- --in-prefix
- --in-suffix
- --grammar
- --grammar-file
- --override-kv
Is this issue open for contributions? If yes, I would love to look into this.
Yes, this issue is open for contributions. We welcome your input and any code related to this issue.
Some parameters, such as --parallel and --draft, are not directly used in the internal implementation of llama.cpp, according to the search results for "n_parallel" in llama.cpp.
Only some parameters affect the internal behavior of llama.cpp functions, such as the RoPE-related ones; for the others, integrating the processing logic needed to support them could completely change the implementation of compute(), as in the example below:
Sketch of integrating `--parallel` and `--draft` and parsing them as optional parameters in WasmEdge:

```cpp
struct Graph {
  // ...
  uint64_t NParallel = 1;
  uint64_t NDraft = 1;
};

Expect<ErrNo> compute(WasiNNEnvironment &Env, uint32_t ContextId) noexcept {
  // ...
  if (/* --draft and --parallel are set */) {
    ReturnCode = SpeculativeDecoding(GraphRef, CxtRef);
  } else {
    // Otherwise, fall back to the current implementation.
  }
  // ...
}

ErrNo SpeculativeDecoding(Graph &GraphRef, Context &CxtRef) noexcept {
  // Implementation along the lines of
  // https://github.com/ggerganov/llama.cpp/blob/3292733f95d4632a956890a438af5192e7031c12/examples/speculative/speculative.cpp
}
```
Detailed code: https://github.com/Fusaaaann/WasmEdge/blob/ae718df452658df555e2b4fe35e8c90e69c5c55f/plugins/wasi_nn/strategies/strategies.cpp#L234
What is WasmEdge's plan for supporting these parameters, if the wasi-nn functions become too complex to fit in a single ggml.cpp file as a result?
Hi @Fusaaaann
We don't have a firm timeline for supporting the above parameters. If an application requires such options, we will raise their priority. There are already two different code paths in our plugin for handling normal LLM and LLaVA applications; we don't mind if the complexity increases after adding more parameters.