Add contrastive search sampler
152334H opened this issue
Contrastive Search is a new decoding method that outperforms existing samplers on open-ended text generation. According to the paper, it achieves better results on code generation as well:
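For reference, the core of contrastive search is a deterministic re-ranking rule rather than a sampler: at each step, take the top-k candidate tokens by model probability and pick the one maximizing (1 − α)·p(v|x) − α·max-cosine-similarity between the candidate's hidden state and the context's hidden states (the "degeneration penalty"). A toy NumPy sketch of that scoring rule, with made-up hidden states rather than a real model:

```python
import numpy as np

def contrastive_pick(probs, cand_hidden, ctx_hidden, alpha=0.6, k=4):
    """Pick the next token with the contrastive search rule:
    score(v) = (1 - alpha) * p(v|x) - alpha * max_j cos(h_v, h_j).

    probs:       (vocab,) next-token probabilities
    cand_hidden: (vocab, d) hidden state the model would have after emitting v
    ctx_hidden:  (t, d) hidden states of the context so far
    """
    topk = np.argsort(probs)[-k:]  # top-k candidate token ids
    # cosine similarity of each candidate vs. every context position
    c = cand_hidden[topk] / np.linalg.norm(cand_hidden[topk], axis=1, keepdims=True)
    x = ctx_hidden / np.linalg.norm(ctx_hidden, axis=1, keepdims=True)
    degeneration = (c @ x.T).max(axis=1)  # worst-case similarity per candidate
    scores = (1 - alpha) * probs[topk] - alpha * degeneration
    return int(topk[np.argmax(scores)])
```

With alpha=0 this degenerates to greedy decoding over the top-k; larger alpha pushes the search away from tokens whose representations look like the existing context, which is what suppresses repetition.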
Anecdotally, I've tested this myself and like the results.
Implementing this does not appear to be a trivial task. AFAIK Triton has not implemented contrastive search upstream yet, and switching to Hugging Face Transformers would require significant effort.
This is a great idea! Also, using HF Transformers is in the works right now – I just need to finish testing and merging PR #86 :)
The PR is merged! I think all that would be needed to use contrastive search is to change these lines to use something like penalty_alpha=0.6, top_k=4?
https://github.com/moyix/fauxpilot/blob/main/python_backend/model.py#L79-L84
It would be nice to let people choose the sampling strategy at runtime as well though (the way they can with top-p/top-k now). Maybe we can add an extra optional parameter to the completion request...
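One way that optional parameter could work (a sketch only; the helper name and request shape here are hypothetical, while penalty_alpha mirrors the actual HF Transformers generate() kwarg): branch on whether the client sent penalty_alpha, since Transformers runs contrastive search when penalty_alpha > 0 and top_k > 1 with do_sample=False.

```python
def build_generate_kwargs(max_new_tokens, top_k, top_p, temperature,
                          penalty_alpha=None, n_samples=1):
    """Map completion-request parameters to transformers generate() kwargs.

    If the client sends penalty_alpha, switch to contrastive search
    (deterministic, so do_sample=False and no top_p/temperature);
    otherwise keep the current top-k/top-p sampling behavior.
    """
    if penalty_alpha is not None:
        return dict(max_new_tokens=max_new_tokens, do_sample=False,
                    penalty_alpha=penalty_alpha, top_k=top_k,
                    num_return_sequences=n_samples)
    return dict(max_new_tokens=max_new_tokens, do_sample=True,
                top_k=top_k, top_p=top_p, temperature=temperature,
                num_return_sequences=n_samples)
```

The backend would then call self.model.generate(input_ids=..., attention_mask=..., **kwargs) with whichever dict comes back.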
I'm getting a RuntimeError: "baddbmm_with_gemm" not implemented for 'Half' in my attempts to use it.
This is what I'm currently testing with:
```diff
diff --git a/python_backend/model.py b/python_backend/model.py
index 8960243..32b54da 100644
--- a/python_backend/model.py
+++ b/python_backend/model.py
@@ -73,14 +73,16 @@ class TritonPythonModel:
         top_k = pb_utils.get_input_tensor_by_name(request, "runtime_top_k").as_numpy().tolist()[0]
         top_p = pb_utils.get_input_tensor_by_name(request, "runtime_top_p").as_numpy().tolist()[0]
         temperature = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy().tolist()[0]
+        #penalty_alpha = pb_utils.get_input_tensor_by_name(request, "penalty_alpha").as_numpy().tolist()[0]
+        penalty_alpha, top_k, top_p, temperature, do_sample = 0.4, 3, None, None, False
         # n_samples = pb_utils.get_input_tensor_by_name(request, "n")
         n_samples = 1  # TODO: client doesn't send this yet. instead it duplicates the request n times
         # Generate
         output_ids = self.model.generate(
             input_ids=input_ids_torch, attention_mask=attention_mask,
-            max_new_tokens=max_new_tokens, do_sample=True, top_k=top_k, top_p=top_p, num_return_sequences=n_samples,
-            temperature=temperature,
+            max_new_tokens=max_new_tokens, do_sample=do_sample, top_k=top_k, top_p=top_p, num_return_sequences=n_samples,
+            temperature=temperature, penalty_alpha=penalty_alpha
         )
```
I'll uh, look into it more later.
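For what it's worth, that 'Half' error usually means the fp16 kernel for the op isn't available on the device being used (historically, CPU baddbmm had no half-precision implementation, so an fp16 model falling back to CPU hits exactly this). A minimal, hedged way to check on your own setup, plus the usual float32 workaround:

```python
import torch

# small batched matmul in half precision, mimicking the attention op
a = torch.randn(2, 3, 4, dtype=torch.half)
b = torch.randn(2, 4, 5, dtype=torch.half)
bias = torch.randn(2, 3, 5, dtype=torch.half)

try:
    # may raise RuntimeError on builds/devices without fp16 baddbmm kernels
    out = torch.baddbmm(bias, a, b)
except RuntimeError as e:
    print("half baddbmm failed:", e)
    # workaround: compute in float32 and cast the result back
    out = torch.baddbmm(bias.float(), a.float(), b.float()).half()

print(out.shape)
```

If the half call fails on CPU but works on GPU, the fix is to make sure the model actually runs on the GPU (or load it in float32 for CPU runs, e.g. via model.float()), rather than changing the decoding code.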
Are you able to run the 350M model with contrastive search on your host system (i.e. not inside FauxPilot+Triton in a container)?
BTW, I'm not able to reproduce their results on HumanEval: nucleus sampling does much better than they report, and I'm not sure why. Here's the pass@1 I got, using the same k and alpha as the paper, and nucleus sampling with temperature = 0.1:
Model | Nucleus | Contrastive |
---|---|---|
codegen-350M | 13.4% | 15.8% |
codegen-2B | 24.8% | 22.0% |
codegen-6B | 27.6% | 26.8% |
codegen-16B | 32.7% | 29.9% |
code-davinci-002 | 47.1% | N/A |
I'm sorry, I guess it is actually worse and not worth doing.
It might still be useful! I'll close for now though, but we can re-open if new results come out :)