mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. It can generate text, audio, video, and images, and also offers voice cloning.

Home Page: https://localai.io

generation eval time is slower than llama-cli in pure llama.cpp

JimHeo opened this issue

LocalAI version:

185ab93 local build

Environment, CPU architecture, OS, and Version:

Intel i9-10850K CPU @ 3.60GHz, RTX 3090, Ubuntu 20.04
Linux Jiminthebox 5.15.0-113-generic 123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug

When running llama.cpp with LocalAI, the generation eval time is significantly slower compared to running it with pure llama.cpp.

To Reproduce

Pure llama.cpp

./llama-cli -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n -1 -c 0 -ngl 999 -t $(nproc) -p "Print all ascii code." --color --mlock --batch-size 512

LocalAI

./local-ai --debug=true # with CUDA build
name: llama3-8b-instruct-Q4_K_M
context_size: 8192
threads: 20
f16: true
mmap: true
mmlock: false
no_kv_offloading: false
low_vram: false
backend: llama-cpp
cuda: true
gpu_layers: 999
parameters:
  model: Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
rope_scaling: linear
stopwords:
- <|im_end|>
- <dummy32000>
- <|eot_id|>
- <|end_of_text|>
template:
  chat: |
    <|begin_of_text|>{{.Input }}
    <|start_header_id|>assistant<|end_header_id|>
  chat_message: |
    <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>

    {{ if .FunctionCall -}}
    Function call:
    {{ else if eq .RoleName "tool" -}}
    Function response:
    {{ end -}}
    {{ if .Content -}}
    {{.Content -}}
    {{ else if .FunctionCall -}}
    {{ toJson .FunctionCall -}}
    {{ end -}}
    <|eot_id|>
  completion: |
    {{.Input}}
  function: |
    <|start_header_id|>system<|end_header_id|>

    You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
    <tools>
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    </tools>
    Use the following pydantic model json schema for each tool call you will make:
    {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Function call:

Expected behavior

The generation evaluation time when using LocalAI should be comparable to running llama.cpp directly, considering the same hardware specifications.

Logs

Pure llama.cpp

llama_print_timings:        load time =     718.94 ms
llama_print_timings:      sample time =     112.66 ms /  1516 runs   (    0.07 ms per token, 13457.01 tokens per second)
llama_print_timings: prompt eval time =      14.48 ms /     6 tokens (    2.41 ms per token,   414.36 tokens per second)
llama_print_timings:        eval time =   13200.77 ms /  1515 runs   (    8.71 ms per token,   114.77 tokens per second)
llama_print_timings:       total time =   14213.27 ms /  1521 tokens

LocalAI

11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time     =      49.55 ms /    19 tokens (    2.61 ms per token,   383.46 tokens per second)","slot_id":0,"task_id":0,"t_prompt_processing":49.549,"num_prompt_tokens_processed":19,"t_token":2.607842105263158,"n_tokens_second":383.45879836121816}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time =   23612.37 ms /  1392 runs   (   16.96 ms per token,    58.95 tokens per second)","slot_id":0,"task_id":0,"t_token_generation":23612.366,"n_decoded":1392,"t_token":16.962906609195404,"n_tokens_second":58.952160914327685}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"print_timings","line":351,"message":"          total time =   23661.92 ms","slot_id":0,"task_id":0,"t_prompt_processing":49.549,"t_token_generation":23612.366,"t_total":23661.915}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"update_slots","line":1596,"message":"slot released","slot_id":0,"task_id":0,"n_ctx":8192,"n_past":1410,"n_system_tokens":0,"n_cache_tokens":1411,"truncated":false}
11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout {"timestamp":1720752788,"level":"INFO","function":"update_slots","line":1549,"message":"all slots are idle and system prompt is empty, clear the KV cache"}
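To put the two logs above on the same footing, here is a minimal Python sketch that pulls the JSON payload out of the LocalAI debug line and compares generation throughput with the llama-cli figures. The helper names (`parse_localai_timing`, `tokens_per_second`) are made up for this illustration, and the parsing assumes the payload always follows the literal text `stdout ` as it does in the lines above.

```python
import json


def parse_localai_timing(line: str) -> dict:
    """Extract the JSON object that follows 'stdout ' in a LocalAI debug line."""
    _, sep, payload = line.partition("stdout ")
    if not sep:
        raise ValueError("no 'stdout ' marker found in line")
    return json.loads(payload)


def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and a duration in milliseconds to tokens/second."""
    return n_tokens / (elapsed_ms / 1000.0)


# The "generation eval time" line from the LocalAI log above.
LOCALAI_LINE = (
    '11:53AM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:40159): stdout '
    '{"timestamp":1720752788,"level":"INFO","function":"print_timings","line":341,'
    '"message":"generation eval time =   23612.37 ms /  1392 runs   '
    '(   16.96 ms per token,    58.95 tokens per second)",'
    '"slot_id":0,"task_id":0,"t_token_generation":23612.366,"n_decoded":1392,'
    '"t_token":16.962906609195404,"n_tokens_second":58.952160914327685}'
)

timing = parse_localai_timing(LOCALAI_LINE)
localai_tps = tokens_per_second(timing["n_decoded"], timing["t_token_generation"])

# Values copied from the llama-cli "eval time" line: 13200.77 ms / 1515 runs.
llama_cli_tps = tokens_per_second(1515, 13200.77)

slowdown = llama_cli_tps / localai_tps
print(f"LocalAI: {localai_tps:.2f} tok/s")
print(f"llama-cli: {llama_cli_tps:.2f} tok/s")
print(f"llama-cli is {slowdown:.2f}x faster at generation")
```

The prompt eval figures in the two logs are close to each other, so the difference is concentrated in token generation.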

Additional context

Thank you for creating such a great project.
I am not sure how to achieve the same speed. Any assistance would be greatly appreciated.

The llama.cpp command you posted is quite basic and has very low defaults. For instance, the context size defaults to 512, which affects both speed and memory usage.

From a quick scan, I see several things that lead to a different benchmark result, and I suspect there are more:

Keep in mind that LocalAI uses vanilla llama.cpp and we tend to stay up to date with the latest llama.cpp commits, so the speed should be the same provided you run both with the same options.

I hope it makes sense - Cheers

Thank you for your prompt response.
I will check the information you provided and report back.

Based on your response, I conducted the tests again. However, the speed of llama.cpp when running with LocalAI is still slow. I understand that LocalAI uses vanilla llama.cpp and logically the speeds should be the same. Therefore, I am confused because I expected the speeds to be identical.

I am not sure what additional configurations might be set internally; the results from my tests show that it is slower when using LocalAI.

For the tests, I sent requests through the OpenAI Python client, and the code is as follows:

import time
from openai import OpenAI

prompts = [
    "What are the benefits of exercise?",
    "Translate the following sentence into French: 'The quick brown fox jumps over the lazy dog.'",
    "Correct the grammatical errors in the following sentence: 'She go to the market every days.'",
    "What is the capital of France?",
    "Summarize the following paragraph: 'Artificial Intelligence (AI) is the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.'",
    "Generate a dialogue between a customer and a support agent where the customer is asking for a refund.",
    "Complete the following sentence: 'In the near future, artificial intelligence will...'",
    "Identify the named entities in the following sentence: 'Barack Obama was born in Hawaii and served as the 44th President of the United States.'",
    "Say 10 times: \'abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ\'"
]

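api_url = 'http://localhost:8080/v1'
# The OpenAI client needs an API key (argument or OPENAI_API_KEY). A placeholder is
# assumed to be enough here, since neither server was started with key checking.
client = OpenAI(base_url=api_url, api_key="not-needed")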
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": ""
    }
]

response_times = []
for prompt in prompts:
    messages[1]["content"] = prompt
    # Time the full non-streaming round trip for each prompt.
    start_time = time.perf_counter()
    response = client.chat.completions.create(
        messages=messages,
        max_tokens=1024,
        model='llama3-8b-instruct-Q4_K_M',
        stream=False,
        temperature=0.1,
    )
    end_time = time.perf_counter()
    response_times.append(end_time - start_time)

    print(response.choices[0].message.content)

total_response_time = sum(response_times)
print(f"Total response time: {total_response_time:.2f} seconds")
average_response_time = total_response_time / len(response_times)
print(f"Average response time: {average_response_time:.2f} seconds")
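The script above compares wall-clock time only, and the two servers may return answers of different lengths, so total time mixes generation speed with output length. A rough way to separate the two is to divide by the number of generated tokens. The sketch below is hypothetical: `Sample` and `throughput` are names invented here, the numbers in the usage lines are placeholders rather than measurements, and it assumes the server fills in `response.usage.completion_tokens`.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One request: elapsed wall-clock seconds and tokens generated."""
    elapsed_s: float
    completion_tokens: int


def throughput(samples: list[Sample]) -> float:
    """Return generated tokens per second across all samples."""
    total_tokens = sum(s.completion_tokens for s in samples)
    total_time = sum(s.elapsed_s for s in samples)
    if total_time <= 0:
        raise ValueError("total elapsed time must be positive")
    return total_tokens / total_time


# Placeholder values for illustration only. In the benchmark loop each Sample
# would be built from (end_time - start_time) and
# response.usage.completion_tokens, assuming the server reports usage.
samples = [Sample(elapsed_s=2.0, completion_tokens=200),
           Sample(elapsed_s=1.0, completion_tokens=100)]
print(f"{throughput(samples):.1f} tokens/second")
```

Comparing this figure between the two servers is less sensitive to one model run happening to produce a longer answer than the other.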

For llama.cpp, I started the server with the following command:

$ ./llama-server -m ../local-ai/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --ctx_size 512 -ngl 999 -t 10 --host 0.0.0.0 --port 8080

Additionally, here is the YAML file I used to run the model in LocalAI:

name: llama3-8b-instruct-Q4_K_M
threads: 10
mmap: true
backend: llama-cpp
gpu_layers: 999
context_size: 512
parameters:
  model: Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
stopwords:
- <|im_end|>
- <dummy32000>
- <|eot_id|>
- <|end_of_text|>
template:
  chat: |
    <|begin_of_text|>{{.Input }}
    <|start_header_id|>assistant<|end_header_id|>
  chat_message: |
    <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>

    {{ if .FunctionCall -}}
    Function call:
    {{ else if eq .RoleName "tool" -}}
    Function response:
    {{ end -}}
    {{ if .Content -}}
    {{.Content -}}
    {{ else if .FunctionCall -}}
    {{ toJson .FunctionCall -}}
    {{ end -}}
    <|eot_id|>
  completion: |
    {{.Input}}
  function: |
    <|start_header_id|>system<|end_header_id|>

    You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
    <tools>
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    </tools>
    Use the following pydantic model json schema for each tool call you will make:
    {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Function call:

Both the llama.cpp and LocalAI tests were conducted after warm-up. Below are the results for each.

result of llama.cpp

Exercise is one of the most effective ways to improve overall health and well-being. Here are some of the many benefits of regular exercise:

1. **Weight Management**: Exercise helps burn calories and maintain a healthy weight, reducing the risk of obesity-related diseases.
2. **Cardiovascular Health**: Regular exercise strengthens the heart and lungs, improving circulation and reducing the risk of heart disease, stroke, and high blood pressure.
3. **Increased Strength and Flexibility**: Exercise, especially resistance training, helps build muscle mass and bone density, making daily activities easier and reducing the risk of osteoporosis.
4. **Improved Mental Health**: Exercise releases endorphins, also known as "feel-good" hormones, which can help alleviate symptoms of anxiety and depression.
5. **Better Sleep**: Regular physical activity can help improve sleep quality and duration.
6. **Increased Energy**: Exercise boosts energy levels and reduces fatigue, making it easier to tackle daily tasks.
7. **Improved Brain Function**: Exercise has been shown to improve cognitive function, including memory, concentration, and problem-solving skills.
8. **Reduced Risk of Chronic Diseases**: Regular exercise can reduce the risk of developing type 2 diabetes, certain types of cancer, and other chronic diseases.
9. **Improved Bone Density**: Exercise, especially weight-bearing and resistance exercises, can help improve bone density, reducing the risk of osteoporosis and fractures.
10. **Enhanced Immune Function**: Exercise has been shown to boost the immune system, reducing the risk of illness and infection.
11. **Better Digestion**: Regular physical activity can improve digestion and reduce the risk of constipation, diverticulitis, and other gastrointestinal disorders.
12. **Increased Self-Esteem**: Exercise can improve body image and self-esteem, leading to a more positive and confident outlook on life.
13. **Social Benefits**: Exercise can provide opportunities for social interaction, building relationships and a sense of community.
14. **Reduced Stress**: Exercise is a natural stress-reliever, helping to reduce anxiety and promote relaxation.
15. **Improved Overall Health**: Regular exercise can improve overall health and well-being, reducing the risk of premature death and increasing life expectancy.

Remember, every individual is unique, and the benefits of exercise may vary depending on factors such as age, fitness level, and health status. It's essential to consult with a healthcare professional before starting a new exercise program.
The translation of the sentence "The quick brown fox jumps over the lazy dog" into French is:

"Le renard rapide brun saute par-dessus le chien paresseux."

Note: This sentence is a well-known pangram, meaning it uses all the letters of the alphabet at least once.
The corrected sentence would be:

"She goes to the market every day."

The errors in the original sentence were:

* "She go" should be "She goes" (subject-verb agreement)
* "every days" should be "every day" (plural noun "days" is not needed, and "day" is the correct word to use in this context)
The capital of France is Paris!
Here is a summary of the paragraph:

Artificial Intelligence (AI) is the creation of machines that can think and act like humans, mimicking their intelligence and abilities. This includes machines that can learn, solve problems, and exhibit other human-like traits.
Here is a dialogue between a customer and a support agent:

**Customer:** Hi, I'm calling about the product I purchased from your company last week. I'm not satisfied with it and I'd like to request a refund.

**Support Agent:** Thank you for reaching out to us, [Customer's Name]. I apologize to hear that you're not satisfied with the product. Can you please tell me more about what's not meeting your expectations? Was there a specific issue with the product or did it not work as you had hoped?

**Customer:** Well, I was expecting it to be more durable. The material feels cheap and it's already started to show signs of wear and tear. I've only used it a few times.

**Support Agent:** I apologize for the inconvenience. Can you please provide me with your order number so I can look into this further?

**Customer:** Yeah, it's #123456.

**Support Agent:** Thank you, [Customer's Name]. I've located your order and I can see that you purchased the product on [Date]. I'd be happy to assist you with a refund. However, I do need to let you know that our return policy states that all returns must be made within 30 days of purchase.

**Customer:** Okay, I understand. But I'm still within that timeframe, right? I purchased it just last week.

**Support Agent:** That's correct. You are still within the 30-day window. I can process a full refund for you. Would you like me to send you a prepaid return shipping label so you can send the product back to us?

**Customer:** Yes, that would be great. Thank you for your help.

**Support Agent:** You're welcome, [Customer's Name]. I'll go ahead and process the refund and send you the return shipping label via email. You should receive it within the next 24 hours. If you have any other questions or concerns, please don't hesitate to reach out to us. We appreciate your business and hope to have the opportunity to serve you better in the future.

**Customer:** Thank you, I appreciate it.

**Support Agent:** You're welcome. Have a great day!
"...play a crucial role in augmenting human capabilities, transforming industries, and improving the quality of life for billions of people around the world. As AI becomes increasingly sophisticated, it will enable us to tackle complex challenges such as climate change, healthcare, education, and economic inequality, ultimately leading to a more sustainable, efficient, and equitable society."
The named entities in the sentence are:

* Barack Obama (person)
* Hawaii (location)
* United States (location)
* President (title)
* 44th (number)
Here it goes:

1. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
2. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
3. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
4. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
5. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
6. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
7. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
8. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
9. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
10. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
Total response time: 13.79 seconds
Average response time: 1.53 seconds

result of LocalAI

Exercise is an essential part of a healthy lifestyle, and it offers numerous benefits for the body, mind, and overall well-being. Here are some of the most significant advantages of regular exercise:

1. **Improves Physical Health**: Exercise helps to maintain a healthy weight, boost cardiovascular health, strengthen bones, and reduce the risk of chronic diseases like diabetes, certain types of cancer, and heart disease.
2. **Boosts Mental Health**: Exercise releases endorphins, which are natural mood-boosters that can help alleviate symptoms of anxiety and depression. It can also improve sleep quality and reduce stress levels.
3. **Increases Energy**: Regular physical activity can increase energy levels and reduce fatigue, making it easier to tackle daily tasks and activities.
4. **Enhances Cognitive Function**: Exercise has been shown to improve cognitive function, including memory, concentration, and problem-solving skills.
5. **Supports Bone Density**: Weight-bearing exercises, such as running or weightlifting, can help maintain strong bones and reduce the risk of osteoporosis.
6. **Reduces Inflammation**: Exercise has anti-inflammatory effects, which can help reduce inflammation and improve overall health.
7. **Improves Self-Esteem**: Regular exercise can enhance self-esteem and body image, leading to a more positive and confident outlook on life.
8. **Increases Social Connections**: Exercising with others can help build social connections and a sense of community, which is essential for overall well-being.
9. **Reduces Risk of Chronic Diseases**: Regular exercise has been shown to reduce the risk of chronic diseases, such as heart disease, stroke, and certain types of cancer.
10. **Increases Longevity**: Studies have shown that regular exercise can increase life expectancy and reduce the risk of premature death.
11. **Improves Digestion**: Exercise can help regulate bowel movements, reduce symptoms of irritable bowel syndrome (IBS), and improve overall digestive health.
12. **Enhances Immune Function**: Exercise has been shown to boost the immune system, reducing the risk of illness and infection.
13. **Reduces Stress**: Exercise is a natural stress-reliever and can help reduce symptoms of stress and anxiety.
14. **Improves Coordination and Balance**: Regular exercise can improve coordination, balance, and overall physical fitness.
15. **Increases Productivity**: Exercise has been shown to improve productivity, creativity, and
Here is the translation of the sentence "The quick brown fox jumps over the lazy dog" into French:

Le renard rapide brun saute par-dessus le chien paresseux.

Let me know if you need any further assistance!
The corrected sentence would be:

"She goes to the market every day."

The error was the use of the singular verb "go" instead of the plural verb "goes" to agree with the subject "she", which is a singular noun. Additionally, "days" is a plural noun, so it should be replaced with the singular noun "day" to match the verb "goes".
The capital of France is Paris!
Here is a summary of the paragraph:

Artificial Intelligence (AI) refers to the creation of machines that can think and act like humans, mimicking their intelligence and abilities, such as learning and problem-solving.
Here is a dialogue between a customer and a support agent:

**Customer:** Hi, I'm calling about my recent purchase of the "SmartFit" fitness tracker. I'm requesting a refund.

**Support Agent:** Thank you for reaching out to us. Can you please provide me with your order number so I can look into this further?

**Customer:** Sure thing. It's #123456.

**Support Agent:** Okay, thank you. Can you tell me a little bit more about the issue you're experiencing with the product? What's not meeting your expectations?

**Customer:** Honestly, I just didn't find it to be as accurate as I thought it would be. The tracking data was off by a significant amount, and the customer reviews I read online were misleading. I'm really disappointed in the product.

**Support Agent:** I apologize for the inconvenience. I'd be happy to help you with a refund. Can you please confirm that you've tried troubleshooting the issue and that it's not just a matter of adjusting the settings?

**Customer:** Yeah, I've tried everything I can think of. I've reset the device, updated the software, and even contacted the manufacturer's support team. Nothing seems to have worked.

**Support Agent:** Okay, thank you for trying those steps. In that case, I'd be happy to process a full refund for you. Would you like me to send you a prepaid return shipping label so you can send the device back to us?

**Customer:** That would be great, thank you.

**Support Agent:** You're welcome! I'll go ahead and process the refund and send you the return shipping label via email. You should receive it within the next 24 hours. If you have any other questions or concerns, please don't hesitate to reach out.

**Customer:** Thank you so much for your help. I really appreciate it.

**Support Agent:** You're welcome! We apologize again for the inconvenience and hope to have the opportunity to serve you better in the future.
"...revolutionize the way we live and work, enabling humans to focus on creative and high-value tasks while AI handles routine and mundane tasks, leading to increased productivity, efficiency, and innovation."
The named entities in the sentence are:

1. Barack Obama
2. Hawaii
3. United States
Here it goes:

1. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
2. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
3. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
4. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
5. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
6. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
7. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
8. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
9. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
10. abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ
Total response time: 28.28 seconds
Average response time: 3.14 seconds

Here are the timings from each server's debug log for the final prompt, "Say 10 times: 'abc def ghi jkl mno pqr stu vwxyz ABC DEF GHI JKL MNO PQR STU VWXYZ'":

llama.cpp

INFO [   launch_slot_with_task] slot is processing task | tid="140540703879168" timestamp=1720876541 id_slot=0 id_task=5641
INFO [            update_slots] kv cache rm [p0, end) | tid="140540703879168" timestamp=1720876541 id_slot=0 id_task=5641 p0=0
INFO [           print_timings] prompt eval time     =      20.90 ms /    54 tokens (    0.39 ms per token,  2584.10 tokens per second) | tid="140540703879168" timestamp=1720876543 id_slot=0 id_task=5641 t_prompt_processing=20.897 n_prompt_tokens_processed=54 t_token=0.3869814814814814 n_tokens_second=2584.10298128918
INFO [           print_timings] generation eval time =    2515.80 ms /   294 runs   (    8.56 ms per token,   116.86 tokens per second) | tid="140540703879168" timestamp=1720876543 id_slot=0 id_task=5641 t_token_generation=2515.799 n_decoded=294 t_token=8.557139455782313 n_tokens_second=116.86148217723276
INFO [           print_timings]           total time =    2536.70 ms | tid="140540703879168" timestamp=1720876543 id_slot=0 id_task=5641 t_prompt_processing=20.897 t_token_generation=2515.799 t_total=2536.696
INFO [            update_slots] slot released | tid="140540703879168" timestamp=1720876543 id_slot=0 id_task=5641 n_ctx=512 n_past=347 n_system_tokens=0 n_cache_tokens=0 truncated=false

LocalAI

10:21PM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:39633): stdout {"timestamp":1720876883,"level":"INFO","function":"update_slots","line":1785,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":4025,"p0":0}
10:21PM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:39633): stdout {"timestamp":1720876889,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time     =      33.51 ms /    57 tokens (    0.59 ms per token,  1700.88 tokens per second)","slot_id":0,"task_id":4025,"t_prompt_processing":33.512,"num_prompt_tokens_processed":57,"t_token":0.5879298245614035,"n_tokens_second":1700.8832656958703}
10:21PM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:39633): stdout {"timestamp":1720876889,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time =    5703.68 ms /   294 runs   (   19.40 ms per token,    51.55 tokens per second)","slot_id":0,"task_id":4025,"t_token_generation":5703.684,"n_decoded":294,"t_token":19.400285714285715,"n_tokens_second":51.545632612185386}
10:21PM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:39633): stdout {"timestamp":1720876889,"level":"INFO","function":"print_timings","line":351,"message":"          total time =    5737.20 ms","slot_id":0,"task_id":4025,"t_prompt_processing":33.512,"t_token_generation":5703.684,"t_total":5737.196}
10:21PM DBG GRPC(Meta-Llama-3-8B-Instruct-Q4_K_M.gguf-127.0.0.1:39633): stdout {"timestamp":1720876889,"level":"INFO","function":"update_slots","line":1596,"message":"slot released","slot_id":0,"task_id":4025,"n_ctx":512,"n_past":350,"n_system_tokens":0,"n_cache_tokens":351,"truncated":false}
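Both servers decoded the same 294 tokens for this prompt, so the two logs can be compared per token. The sketch below parses the `key=value` fields of the llama-server line with a small regex (the helper name `parse_key_values` is invented for this example) and takes the LocalAI numbers directly from the JSON fields shown above.

```python
import re


def parse_key_values(line: str) -> dict:
    """Collect unquoted numeric key=value pairs from a llama-server log line."""
    pairs = re.findall(r"(\w+)=(\d+(?:\.\d+)?)", line)
    return {key: float(value) for key, value in pairs}


# The "generation eval time" line from the llama.cpp server log above.
LLAMA_SERVER_LINE = (
    'INFO [           print_timings] generation eval time =    2515.80 ms /   294 runs   '
    '(    8.56 ms per token,   116.86 tokens per second) | tid="140540703879168" '
    'timestamp=1720876543 id_slot=0 id_task=5641 t_token_generation=2515.799 '
    'n_decoded=294 t_token=8.557139455782313 n_tokens_second=116.86148217723276'
)

llama = parse_key_values(LLAMA_SERVER_LINE)
# Copied from the LocalAI JSON fields for the same prompt.
localai = {"t_token_generation": 5703.684, "n_decoded": 294}

llama_ms_per_token = llama["t_token_generation"] / llama["n_decoded"]
localai_ms_per_token = localai["t_token_generation"] / localai["n_decoded"]

extra_ms_per_token = localai_ms_per_token - llama_ms_per_token
ratio = localai_ms_per_token / llama_ms_per_token
print(f"llama-server: {llama_ms_per_token:.2f} ms/token")
print(f"LocalAI:      {localai_ms_per_token:.2f} ms/token")
print(f"extra cost:   {extra_ms_per_token:.2f} ms/token ({ratio:.2f}x)")
```

The prompt eval times differ by only about 13 ms in total, so nearly all of the gap accumulates during token generation, one token at a time.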

Did you try disabling mirostat sampling? By default LocalAI enables mirostat for better results, but that has an impact on speed as well.

Yay!

Thanks to you, I finally found the cause. When I set mirostat to 0, the speeds are identical. I had no idea there was such a significant difference in speed when using mirostat sampling versus not using it.
I really appreciate your help!
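For anyone landing here with the same symptom, a small sketch of the change follows. The thread does not say where the setting was applied, so this assumes it is a top-level `mirostat` key in the model YAML file; please check the LocalAI model configuration documentation for your version. The helper `disable_mirostat` is hypothetical and only edits the text of a config.

```python
import re


def disable_mirostat(config_text: str) -> str:
    """Return config text with a top-level 'mirostat: 0' entry.

    Assumption: LocalAI reads mirostat from a top-level key of the model YAML.
    """
    pattern = re.compile(r"^mirostat:.*$", re.MULTILINE)
    if pattern.search(config_text):
        # Replace an existing top-level setting (mirostat_tau etc. are untouched).
        return pattern.sub("mirostat: 0", config_text)
    # Otherwise append the key at the end of the file.
    return config_text.rstrip("\n") + "\nmirostat: 0\n"


# Abbreviated version of the model config posted earlier in the thread.
config = (
    "name: llama3-8b-instruct-Q4_K_M\n"
    "threads: 10\n"
    "backend: llama-cpp\n"
    "gpu_layers: 999\n"
    "context_size: 512\n"
)
print(disable_mirostat(config))
```

After changing the config, rerunning the same benchmark against both servers is the simplest way to confirm the speeds now match.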

Glad to hear it's solved! It still makes sense to document this, and I will try to get to it today 👍 Thanks for the deep dive and the detective work :)