Add contrastive search sampler
152334H opened this issue
Contrastive Search is a new decoding method that outperforms existing samplers on open-ended text generation. According to the paper, it achieves better results on code generation as well:
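For reference, the core of contrastive search is a deterministic re-ranking rule rather than a sampler: at each step, take the top-k candidate tokens by model probability and pick the one maximizing (1 − α)·p(v|x) − α·max-cosine-similarity between the candidate's hidden state and the context's hidden states (the "degeneration penalty"). A toy NumPy sketch of that scoring rule, with made-up hidden states rather than a real model:

```python
import numpy as np

def contrastive_pick(probs, cand_hidden, ctx_hidden, alpha=0.6, k=4):
    """Pick the next token with the contrastive search rule:
    score(v) = (1 - alpha) * p(v|x) - alpha * max_j cos(h_v, h_j).

    probs:       (vocab,) next-token probabilities
    cand_hidden: (vocab, d) hidden state the model would have after emitting v
    ctx_hidden:  (t, d) hidden states of the context so far
    """
    topk = np.argsort(probs)[-k:]  # top-k candidate token ids
    # cosine similarity of each candidate vs. every context position
    c = cand_hidden[topk] / np.linalg.norm(cand_hidden[topk], axis=1, keepdims=True)
    x = ctx_hidden / np.linalg.norm(ctx_hidden, axis=1, keepdims=True)
    degeneration = (c @ x.T).max(axis=1)  # worst-case similarity per candidate
    scores = (1 - alpha) * probs[topk] - alpha * degeneration
    return int(topk[np.argmax(scores)])
```

With alpha=0 this degenerates to greedy decoding over the top-k; larger alpha pushes the search away from tokens whose representations look like the existing context, which is what suppresses repetition.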
Anecdotally, I've tested this myself and like the results.
Implementing this does not appear to be a trivial task. AFAIK Triton has not implemented contrastive search upstream yet, and switching to Hugging Face Transformers would require significant effort.
This is a great idea! Also, using HF Transformers is in the works right now – I just need to finish testing and merging PR #86 :)
The PR is merged! I think all that would be needed to use contrastive search is to change these lines to use something like penalty_alpha=0.6, top_k=4?
https://github.com/moyix/fauxpilot/blob/main/python_backend/model.py#L79-L84
It would be nice to let people choose the sampling strategy at runtime as well though (the way they can with top-p/top-k now). Maybe we can add an extra optional parameter to the completion request...
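One way that optional parameter could work (a sketch only; the helper name and request shape here are hypothetical, while penalty_alpha mirrors the actual HF Transformers generate() kwarg): branch on whether the client sent penalty_alpha, since Transformers runs contrastive search when penalty_alpha > 0 and top_k > 1 with do_sample=False.

```python
def build_generate_kwargs(max_new_tokens, top_k, top_p, temperature,
                          penalty_alpha=None, n_samples=1):
    """Map completion-request parameters to transformers generate() kwargs.

    If the client sends penalty_alpha, switch to contrastive search
    (deterministic, so do_sample=False and no top_p/temperature);
    otherwise keep the current top-k/top-p sampling behavior.
    """
    if penalty_alpha is not None:
        return dict(max_new_tokens=max_new_tokens, do_sample=False,
                    penalty_alpha=penalty_alpha, top_k=top_k,
                    num_return_sequences=n_samples)
    return dict(max_new_tokens=max_new_tokens, do_sample=True,
                top_k=top_k, top_p=top_p, temperature=temperature,
                num_return_sequences=n_samples)
```

The backend would then call self.model.generate(input_ids=..., attention_mask=..., **kwargs) with whichever dict comes back.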
I'm getting a RuntimeError: "baddbmm_with_gemm" not implemented for 'Half' in my attempts to use it.
This is what I'm currently testing with:
```diff
diff --git a/python_backend/model.py b/python_backend/model.py
index 8960243..32b54da 100644
--- a/python_backend/model.py
+++ b/python_backend/model.py
@@ -73,14 +73,16 @@ class TritonPythonModel:
         top_k = pb_utils.get_input_tensor_by_name(request, "runtime_top_k").as_numpy().tolist()[0]
         top_p = pb_utils.get_input_tensor_by_name(request, "runtime_top_p").as_numpy().tolist()[0]
         temperature = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy().tolist()[0]
+        #penalty_alpha = pb_utils.get_input_tensor_by_name(request, "penalty_alpha").as_numpy().tolist()[0]
+        penalty_alpha, top_k, top_p, temperature, do_sample = 0.4, 3, None, None, False
         # n_samples = pb_utils.get_input_tensor_by_name(request, "n")
         n_samples = 1  # TODO: client doesn't send this yet. instead it duplicates the request n times
         # Generate
         output_ids = self.model.generate(
             input_ids=input_ids_torch, attention_mask=attention_mask,
-            max_new_tokens=max_new_tokens, do_sample=True, top_k=top_k, top_p=top_p, num_return_sequences=n_samples,
-            temperature=temperature,
+            max_new_tokens=max_new_tokens, do_sample=do_sample, top_k=top_k, top_p=top_p, num_return_sequences=n_samples,
+            temperature=temperature, penalty_alpha=penalty_alpha
         )
```
I'll uh, look into it more later.
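For what it's worth, that 'Half' error usually means the fp16 kernel for the op isn't available on the device being used (historically, CPU baddbmm had no half-precision implementation, so an fp16 model falling back to CPU hits exactly this). A minimal, hedged way to check on your own setup, plus the usual float32 workaround:

```python
import torch

# small batched matmul in half precision, mimicking the attention op
a = torch.randn(2, 3, 4, dtype=torch.half)
b = torch.randn(2, 4, 5, dtype=torch.half)
bias = torch.randn(2, 3, 5, dtype=torch.half)

try:
    # may raise RuntimeError on builds/devices without fp16 baddbmm kernels
    out = torch.baddbmm(bias, a, b)
except RuntimeError as e:
    print("half baddbmm failed:", e)
    # workaround: compute in float32 and cast the result back
    out = torch.baddbmm(bias.float(), a.float(), b.float()).half()

print(out.shape)
```

If the half call fails on CPU but works on GPU, the fix is to make sure the model actually runs on the GPU (or load it in float32 for CPU runs, e.g. via model.float()), rather than changing the decoding code.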
Are you able to run the 350M model with contrastive search on your host system (i.e. not inside FauxPilot+Triton in a container)?
BTW, I'm not able to reproduce their results on HumanEval: nucleus sampling does much better than they report, and I'm not sure why. Here's the pass@1 I got, using the same k and alpha as the paper, and nucleus sampling with temperature = 0.1:
Model | Nucleus | Contrastive |
---|---|---|
codegen-350M | 13.4% | 15.8% |
codegen-2B | 24.8% | 22.0% |
codegen-6B | 27.6% | 26.8% |
codegen-16B | 32.7% | 29.9% |
code-davinci-002 | 47.1% | N/A |
I'm sorry, I guess it is actually worse and not worth doing.
It might still be useful! I'll close for now though, but we can re-open if new results come out :)