predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai


[Question] Usage about the `adapter-memory-fraction`

thincal opened this issue · comments

commented

Feature request

  1. Does `adapter-memory-fraction` include the base model's memory?
  2. What is the difference between `adapter-memory-fraction` and `cuda-memory-fraction`? What happens if both are set?

Motivation

Just a question.

Your contribution

Just a question.

Btw, maybe we could create a new issue category for questions?

commented
# From the lorax server source: the kv-cache budget is what remains after
# subtracting the slice excluded by cuda-memory-fraction (1 - MEMORY_FRACTION)
# plus the slice reserved for adapters (ADAPTER_MEMORY_FRACTION).
free_memory = max(
    0, total_free_memory - (1 - MEMORY_FRACTION + ADAPTER_MEMORY_FRACTION) * total_gpu_memory
)
logger.info("Memory remaining for kv cache: {} MB", free_memory / 1024 / 1024)

OK, so the memory reserved via `cuda-memory-fraction` and the adapter reservation are both counted toward total usage, and the kv cache gets whatever is left over.
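To make the formula above concrete, here is a small self-contained sketch of the same computation with illustrative numbers (the GPU size, free-memory figure, and fraction values below are assumptions for the example, not values from lorax):

```python
# Sketch of the kv-cache budget formula quoted above.
# All sizes and fractions here are illustrative assumptions.
GIB = 1024**3

def kv_cache_budget(total_free_memory, total_gpu_memory,
                    memory_fraction, adapter_memory_fraction):
    # Reserved memory = the slice excluded by cuda-memory-fraction
    # (1 - memory_fraction) plus the adapter reservation, both taken
    # as fractions of the whole GPU.
    reserved = (1 - memory_fraction + adapter_memory_fraction) * total_gpu_memory
    return max(0, total_free_memory - reserved)

# Example: 80 GiB GPU with 70 GiB still free after loading base weights,
# cuda-memory-fraction=0.9, adapter-memory-fraction=0.1.
# Reserved = (1 - 0.9 + 0.1) * 80 GiB = 16 GiB, so 70 - 16 = 54 GiB
# remains for the kv cache.
budget = kv_cache_budget(70 * GIB, 80 * GIB, 0.9, 0.1)
print(budget / GIB)  # 54.0
```

This also suggests an answer to question 2: the two flags interact additively in the reservation term, so raising `adapter-memory-fraction` shrinks the kv-cache budget by the same amount as lowering `cuda-memory-fraction`.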