rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models

Home Page: https://docs.rs/llm/latest/llm/

Metal Prompt Feeding

jafioti opened this issue

I'm trying to run llama on a Mac using Metal, but I noticed that the accelerators doc states Metal cannot be used for feeding in a prompt with more than one token. Is this an underlying limitation of ggml, or of llm?

I'd love to help enable this, but I'm not sure where to begin.

I haven't been tracking this, but ggerganov/llama.cpp#2428 suggests it's still an issue, and that it's because the Metal kernels don't support batching. You can probably find more in that repo if you look around.

Not sure if there's anything we can do from our end, outside of implementing it and submitting it upstream 😅
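
To make the limitation concrete, here's a rough sketch of batched prompt evaluation versus a token-by-token fallback. The names (TokenId, evaluate) are purely illustrative, not the actual llm/ggml API; it's only meant to show why a backend that can evaluate just one token per call makes prompt feeding so much slower.

// Illustrative sketch only: TokenId and evaluate() stand in for the real
// ggml/llm evaluation call; nothing here is the actual API.
type TokenId = u32;

/// Stand-in for one model forward pass over `batch`.
fn evaluate(batch: &[TokenId], forward_passes: &mut usize) {
    *forward_passes += 1;
    let _ = batch; // a real backend would run the compute graph on `batch` here
}

fn main() {
    let prompt: Vec<TokenId> = (0..250).collect();

    // Batched prompt feeding: the whole prompt in a single forward pass.
    let mut batched = 0;
    evaluate(&prompt, &mut batched);

    // Token-by-token fallback: what you're forced into when the backend
    // (e.g. Metal kernels without batching support) only handles n_tokens == 1.
    let mut one_by_one = 0;
    for token in &prompt {
        evaluate(std::slice::from_ref(token), &mut one_by_one);
    }

    println!("batched: {batched} pass(es), token-by-token: {one_by_one} passes");
}

A 250-token prompt costs one forward pass in the first case and 250 in the second, and each single-token pass is a matrix-vector workload rather than one big matrix-matrix one, which is where the time goes.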

I've gotten a chance to look into this more, and I don't think it's an issue upstream. Here's my prompt:

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    
    mid = len(arr) // 2
    left = arr[:mid]
    right = arr[mid:]
    
    left = merge_sort(left)
    right = merge_sort(right)
    
    return merge(left, right)


def merge(left, right):
    if not left:
        return right
    if not right:
        return left
    
    result = []
    
    i = j = 0
    
    while len(left) > i and len(right) > j:
        if left[i] < right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
            
    result.extend(left[i:])
    result.extend(right[j:])
    
    return result

print(merge_sort([56,32,5,2,43]))

And when I run the same prompt through each project:
llama.cpp

llama_print_timings:        load time =   298.62 ms
llama_print_timings:      sample time =    90.26 ms /   128 runs   (    0.71 ms per token,  1418.17 tokens per second)
llama_print_timings: prompt eval time =  1676.02 ms /   250 tokens (    6.70 ms per token,   149.16 tokens per second)
llama_print_timings:        eval time =  3909.66 ms /   127 runs   (   30.78 ms per token,    32.48 tokens per second)
llama_print_timings:       total time =  5686.13 ms

llm

feed_prompt_duration: 6956ms
prompt_tokens: 251
predict_duration: 7818ms
predict_tokens: 269
per_token_duration: 29.063ms

llm did attach an extra token to the prompt, but the main issue is the difference in prompt processing time. Both were compiled and run on an M1 Pro with Metal enabled and all layers offloaded to the GPU.

For reference, here's the performance of llm without Metal enabled:

feed_prompt_duration: 8508ms
prompt_tokens: 251
predict_duration: 8565ms
predict_tokens: 252
per_token_duration: 33.988ms
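
Working out the per-token prompt cost from those numbers:

llama.cpp (Metal): 1676.02 ms / 250 tokens ≈ 6.7 ms per prompt token
llm (Metal):       6956 ms / 251 tokens ≈ 27.7 ms per prompt token
llm (no Metal):    8508 ms / 251 tokens ≈ 33.9 ms per prompt token

So Metal only buys llm a modest improvement in prompt feeding, while llama.cpp's prompt eval is roughly 4x faster than either.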

Also, this link was put in the code, referencing llama.cpp's fallback to Accelerate for matrix × matrix compute: https://github.com/ggerganov/llama.cpp/blob/e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a/llama.cpp#L1692

However, that branch is not in the newest llama.cpp, so I'm guessing it has since been worked out.

@philpax
For reference, this commit (ggerganov/llama.cpp@bf83bff) fixed the matrix × matrix mul. It's about two weeks newer than the llama.cpp version we currently use, so we'd need to upgrade to a llama.cpp version past it. I'm sure you're already doing an upgrade as part of the GGUF work, which should land on a version new enough to support the mul.

Thanks for the investigation! Yes, indeed, that's quite promising. Hoping the upgrade + changing the feeding logic will fix this 🙏