nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Exactly the same output generations for the same prompt

anmolagarwal999 opened this issue

I was running inference on the HumanEval dataset using WizardCoder/src/humaneval_gen.py with the following parameters (the state of "generation_config"):

{'max_length': 2048,
 'max_new_tokens': 1000,
 'min_length': 0,
 'min_new_tokens': None,
 'early_stopping': False,
 'max_time': None,
 'do_sample': True,
 'num_beams': 5,
 'num_beam_groups': 1,
 'penalty_alpha': None,
 'use_cache': True,
 'temperature': 10.1,
 'top_k': 50,
 'top_p': 0.95,
 'typical_p': 1.0,
 'epsilon_cutoff': 0.0,
 'eta_cutoff': 0.0,
 'diversity_penalty': 0.0,
 'repetition_penalty': 1.0,
 'encoder_repetition_penalty': 1.0,
 'length_penalty': 1.0,
 'no_repeat_ngram_size': 0,
 'bad_words_ids': None,
 'force_words_ids': None,
 'renormalize_logits': False,
 'constraints': None,
 'forced_bos_token_id': None,
 'forced_eos_token_id': None,
 'remove_invalid_values': False,
 'exponential_decay_length_penalty': None,
 'suppress_tokens': None,
 'begin_suppress_tokens': None,
 'forced_decoder_ids': None,
 'sequence_bias': None,
 'guidance_scale': None,
 'num_return_sequences': 20,
 'output_attentions': False,
 'output_hidden_states': False,
 'output_scores': False,
 'return_dict_in_generate': False,
 'pad_token_id': 49152,
 'bos_token_id': None,
 'eos_token_id': 0,
 'encoder_no_repeat_ngram_size': 0,
 'decoder_start_token_id': None,
 'generation_kwargs': {},
 '_from_model_config': False,
 '_commit_hash': None,
 'transformers_version': '4.31.0'}

All 20 generations for the prompt appear to be exactly the same. I have tried setting a very high temperature (around 10) and a high top_p (0.99), and the observation still persists. Am I doing something wrong, or are the model outputs highly deterministic?
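For reference, below is a minimal sketch of how one might check whether plain sampling (num_beams=1, do_sample=True) yields distinct generations, independent of the repository's humaneval_gen.py. The model name, prompt, and parameter values are placeholder assumptions, not a claim about the script's actual setup.

# Minimal diversity check with pure sampling (sketch; not humaneval_gen.py).
# Model name, prompt, and decoding parameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "WizardLM/WizardCoder-15B-V1.0"  # assumed checkpoint; substitute your own
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)

prompt = "def fibonacci(n):\n"  # toy prompt for illustration
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Pure sampling: with num_beams=1 each returned sequence is drawn independently,
# so identical outputs across 20 samples would be unexpected.
gen_config = GenerationConfig(
    do_sample=True,
    num_beams=1,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=128,
    num_return_sequences=20,
    pad_token_id=tokenizer.eos_token_id,
)

with torch.no_grad():
    out = model.generate(**inputs, generation_config=gen_config)

texts = tokenizer.batch_decode(out, skip_special_tokens=True)
print(f"{len(set(texts))} distinct generations out of {len(texts)}")

One thing worth comparing against the config above: with num_beams greater than 1, transformers switches from plain sampling to a beam-based decoding mode, which behaves differently from independent sampling, so isolating num_beams=1 first may help narrow down the cause.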

Have you tried more samples?


I did try this for different prompts (including non-coding based instructions).

I think you have done something wrong, but I cannot figure out what from your config.
We checked the generated results on HumanEval with n=20; they are not the same.
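For anyone wanting to run the same check on their own outputs, here is a short sketch that counts distinct completions per task. It assumes a HumanEval-style JSONL file with "task_id" and "completion" fields; the file name and keys are assumptions, so adjust them to your actual output format.

# Count distinct completions per task in a samples file (sketch).
import json
from collections import defaultdict

per_task = defaultdict(list)
with open("samples.jsonl") as f:  # path is an assumption
    for line in f:
        rec = json.loads(line)
        per_task[rec["task_id"]].append(rec["completion"])

for task_id, completions in sorted(per_task.items()):
    print(f"{task_id}: {len(set(completions))}/{len(completions)} distinct")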