FauxPilot - an open-source alternative to GitHub Copilot server

Add contrastive search sampler

152334H opened this issue · comments

Contrastive Search is a new decoding method that outperforms existing samplers on open-ended text generation. According to the paper, it achieves better results on code generation as well:

[Screenshot: code-generation results table from the contrastive search paper]

Anecdotally, I've tested this myself and like the results.
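For context, contrastive search re-ranks the top-k candidate tokens by balancing model confidence against a degeneration penalty: candidates whose representations are too similar to the context get penalized, which discourages repetition. A toy sketch of the scoring rule (all names and numbers here are illustrative, not FauxPilot code):

```python
import numpy as np

def pick_token(probs, cand_embs, ctx_embs, alpha=0.6):
    """Contrastive search scoring rule (sketch): pick the candidate
    maximizing (1 - alpha) * p(v) - alpha * max cosine similarity
    between the candidate's representation and the context's."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [
        (1 - alpha) * p - alpha * max(cos(e, c) for c in ctx_embs)
        for p, e in zip(probs, cand_embs)
    ]
    return int(np.argmax(scores))

# A candidate that duplicates the context is penalized even when the
# model is more confident in it.
ctx = [np.array([1.0, 0.0])]
probs = [0.6, 0.4]                    # model confidence per candidate
embs = [np.array([1.0, 0.0]),         # identical to context (repetitive)
        np.array([0.0, 1.0])]         # orthogonal to context (novel)
print(pick_token(probs, embs, ctx))   # → 1 (the less repetitive candidate)
```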


Implementing this does not appear to be a trivial task. AFAIK Triton has not implemented contrastive search upstream yet, and switching to Hugging Face Transformers would require significant effort.

This is a great idea! Also, using HF Transformers is in the works right now – I just need to finish testing and merging PR #86 :)

The PR is merged! I think all that would be needed to use contrastive search is to change these lines to use something like penalty_alpha=0.6, top_k=4?

https://github.com/moyix/fauxpilot/blob/main/python_backend/model.py#L79-L84

It would be nice to let people choose the sampling strategy at runtime as well though (the way they can with top-p/top-k now). Maybe we can add an extra optional parameter to the completion request...
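One way that could look (a sketch only: penalty_alpha as a request parameter is hypothetical, and it assumes contrastive search in HF Transformers is deterministic, so do_sample, top_p, and temperature don't apply to it):

```python
def build_generate_kwargs(max_new_tokens, top_k, top_p, temperature,
                          penalty_alpha=None):
    """Select the sampling strategy per completion request. If the
    client sends the (hypothetical) penalty_alpha parameter, switch
    to contrastive search; otherwise keep top-k/top-p sampling."""
    if penalty_alpha is not None:
        return dict(max_new_tokens=max_new_tokens, do_sample=False,
                    penalty_alpha=penalty_alpha, top_k=top_k)
    return dict(max_new_tokens=max_new_tokens, do_sample=True,
                top_k=top_k, top_p=top_p, temperature=temperature)

# The resulting dict would be splatted into model.generate(**kwargs, ...)
# alongside input_ids and attention_mask.
```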

I'm getting a `RuntimeError: "baddbmm_with_gemm" not implemented for 'Half'` in my attempts to use it:

This is what I'm currently testing with:

diff --git a/python_backend/model.py b/python_backend/model.py
index 8960243..32b54da 100644
--- a/python_backend/model.py
+++ b/python_backend/model.py
@@ -73,14 +73,16 @@ class TritonPythonModel:
             top_k = pb_utils.get_input_tensor_by_name(request, "runtime_top_k").as_numpy().tolist()[0]
             top_p = pb_utils.get_input_tensor_by_name(request, "runtime_top_p").as_numpy().tolist()[0]
             temperature = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy().tolist()[0]
+            #penalty_alpha = pb_utils.get_input_tensor_by_name(request, "penalty_alpha").as_numpy().tolist()[0]
+            penalty_alpha, top_k, top_p, temperature, do_sample = 0.4, 3, None, None, False
             # n_samples = pb_utils.get_input_tensor_by_name(request, "n")
             n_samples = 1  # TODO: client doesn't send this yet. instead it duplicates the request n times

             # Generate
             output_ids = self.model.generate(
                 input_ids=input_ids_torch, attention_mask=attention_mask,
-                max_new_tokens=max_new_tokens, do_sample=True, top_k=top_k, top_p=top_p, num_return_sequences=n_samples,
-                temperature=temperature,
+                max_new_tokens=max_new_tokens, do_sample=do_sample, top_k=top_k, top_p=top_p, num_return_sequences=n_samples,
+                temperature=temperature, penalty_alpha=penalty_alpha
             )

I'll uh, look into it more later.
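A guess at the cause (unverified): some PyTorch CPU builds have no Half (fp16) kernel for the batched matmul used inside attention, so a model kept in fp16 on CPU fails there. Upcasting to float32, or running on GPU, would sidestep it. A minimal illustration of the upcast:

```python
import torch

# fp16 tensors, as a model loaded with torch_dtype=torch.float16 would use
a = torch.randn(2, 3, 4, dtype=torch.float16)
b = torch.randn(2, 4, 5, dtype=torch.float16)

# Upcast before the batched matmul so only float32 kernels are needed;
# this trades memory for portability on CPU-only hosts.
out = torch.bmm(a.float(), b.float())
assert out.dtype == torch.float32 and out.shape == (2, 3, 5)
```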

Are you able to run the 350M model with contrastive search on your host system (i.e. not inside FauxPilot+Triton in a container)?

BTW, I'm not able to reproduce their results on HumanEval: nucleus sampling does much better than they report, and I'm not sure why. Here's the pass@1 I got, using the same k and alpha as the paper, and nucleus sampling with temperature = 0.1:

| Model | Nucleus | Contrastive |
| --- | --- | --- |
| codegen-350M | 13.4% | 15.8% |
| codegen-2B | 24.8% | 22.0% |
| codegen-6B | 27.6% | 26.8% |
| codegen-16B | 32.7% | 29.9% |
| code-davinci-002 | 47.1% | N/A |
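For reference, pass@1 numbers like these are usually computed with the unbiased pass@k estimator from the Codex/HumanEval paper, 1 - C(n-c, k)/C(n, k) over n samples of which c pass the tests. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate given n generated samples per problem,
    of which c passed the unit tests (Codex paper estimator)."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))  # → 0.5
```

Note that since contrastive search is deterministic, its pass@1 can be measured directly from a single sample per problem.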

I'm sorry, I guess it is actually worse and not worth doing.

It might still be useful! I'll close for now, but we can re-open if new results come out :)