philschmid / easyllm

Home Page: https://philschmid.github.io/easyllm/

(Chat)Completion objects cannot generate diverse outputs

KoutchemeCharles opened this issue · comments

Hello,

I have noticed that the interface returns identical generations regardless of the number of responses requested (n > 1). Easy reproduction:

from easyllm.clients import huggingface

# helper to build llama2 prompt
huggingface.prompt_builder = "llama2"

response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
        {"role": "user", "content": "What is the sun?"},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=256,
    n=10
)

print(response)

You will notice that all the entries in 'choices' have exactly the same content.

Looking at the code base, it seems that the issue comes from the fact that you are performing 'n' independent HTTP requests with the same generation parameters (but with a fixed seed).

# Normally this would not have been an issue since most of the time we are
# sampling from the model; however, gen_kwargs have the same seed, so the
# output will be the same for each request.
for _i in range(request.n):
    res = client.text_generation(
        prompt,
        details=True,
        **gen_kwargs,
    )

I believe a solution would be either to change gen_kwargs to directly return n outputs by setting num_return_sequences to n, or to artificially generate a different seed for each request.

This is most likely due to the Inference API caching the requests.

Yes, it seems like it. Do you have a proper workaround? I forked the repo and tried disabling caching by specifying headers, but it did not work. I guess the API is using server-side caching.
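For reference, such an attempt would look roughly like the sketch below. The "x-use-cache" header name, the headers argument of InferenceClient, and the hard-coded model are my assumptions about what was tried, not something confirmed in this thread:

from huggingface_hub import InferenceClient

# Hypothetical attempt at disabling caching via request headers; the
# "x-use-cache" header name and its effect are assumptions.
client = InferenceClient(
    model="meta-llama/Llama-2-70b-chat-hf",
    headers={"x-use-cache": "false"},
)
res = client.text_generation(
    "What is the sun?",
    details=True,
    max_new_tokens=64,
)
print(res.generated_text)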

I think a workaround (from the user side) would be to slightly alter the generation parameters to avoid the caching mechanism. For instance, one could manually perform 'n' different calls to (Chat)Completion.create, specifying a different value of max_new_tokens (or slightly altering the temperature) for each generation. A sketch of this idea is below.
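For illustration, a minimal sketch of this user-side workaround, reusing the reproduction call from above; nudging max_tokens on each call so every request carries different parameters is the only change (whether this reliably bypasses the cache is an assumption):

from easyllm.clients import huggingface

huggingface.prompt_builder = "llama2"

messages = [
    {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
    {"role": "user", "content": "What is the sun?"},
]

n = 10
responses = []
for i in range(n):
    # Slightly vary max_tokens so each request has different generation
    # parameters and (hopefully) misses the server-side cache.
    responses.append(
        huggingface.ChatCompletion.create(
            model="meta-llama/Llama-2-70b-chat-hf",
            messages=messages,
            temperature=0.9,
            top_p=0.6,
            max_tokens=256 + i,
            n=1,
        )
    )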

easyllm is using the huggingface_hub library. I talked to @Wauplin; at the moment it is not possible to deactivate the cache when using the InferenceClient.
A workaround would be to deploy the model as an Inference Endpoint.

Another workaround could be to add a seed argument when sending the multiple requests; this should lead to non-cached outputs. @KoutchemeCharles could you try this?

You would need to extend the gen_kwargs here with a seed.

Yes, that worked for me! It's also quite straightforward to make the required changes:

for _i in range(request.n):
    # Give each request its own seed so the Inference API cache is not hit.
    gen_kwargs["seed"] = _i
    res = client.text_generation(
        prompt,
        details=True,
        **gen_kwargs,
    )
# rest of (Chat)Completion.create

This yields a situation where the user properly receives n different outputs when calling the create function, but those n outputs will be the same if (Chat)Completion.create is called again (within a short period of time) with the same arguments. I don't know if this is what you want. If we want to avoid the latter situation, we can do something like this:

import random  # at the top of the module

UPPER_RANGE = 100000

# Sample n distinct random seeds so that repeated calls to create() with the
# same arguments do not reuse the same seeds (and thus the same cached outputs).
n_seeds = random.sample(range(0, UPPER_RANGE + request.n), request.n)
for seed in n_seeds:
    gen_kwargs["seed"] = seed
    res = client.text_generation(
        prompt,
        details=True,
        **gen_kwargs,
    )
# rest of (Chat)Completion.create

Can you open a PR with that change?