nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Exactly the same output generations for the same prompt

anmolagarwal999 opened this issue

I was running inference on the HumanEval dataset using WizardCoder/src/humaneval_gen.py with the following parameters (the state of "generation_config"):

{'max_length': 2048,
 'max_new_tokens': 1000,
 'min_length': 0,
 'min_new_tokens': None,
 'early_stopping': False,
 'max_time': None,
 'do_sample': True,
 'num_beams': 5,
 'num_beam_groups': 1,
 'penalty_alpha': None,
 'use_cache': True,
 'temperature': 10.1,
 'top_k': 50,
 'top_p': 0.95,
 'typical_p': 1.0,
 'epsilon_cutoff': 0.0,
 'eta_cutoff': 0.0,
 'diversity_penalty': 0.0,
 'repetition_penalty': 1.0,
 'encoder_repetition_penalty': 1.0,
 'length_penalty': 1.0,
 'no_repeat_ngram_size': 0,
 'bad_words_ids': None,
 'force_words_ids': None,
 'renormalize_logits': False,
 'constraints': None,
 'forced_bos_token_id': None,
 'forced_eos_token_id': None,
 'remove_invalid_values': False,
 'exponential_decay_length_penalty': None,
 'suppress_tokens': None,
 'begin_suppress_tokens': None,
 'forced_decoder_ids': None,
 'sequence_bias': None,
 'guidance_scale': None,
 'num_return_sequences': 20,
 'output_attentions': False,
 'output_hidden_states': False,
 'output_scores': False,
 'return_dict_in_generate': False,
 'pad_token_id': 49152,
 'bos_token_id': None,
 'eos_token_id': 0,
 'encoder_no_repeat_ngram_size': 0,
 'decoder_start_token_id': None,
 'generation_kwargs': {},
 '_from_model_config': False,
 '_commit_hash': None,
 'transformers_version': '4.31.0'}

All 20 generations for the prompt appear to be exactly the same. I have tried setting a very high temperature (around 10) and a high top_p (0.99), and the observation still persists. Am I doing something wrong, or are the model outputs highly deterministic?
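For reference, below is a minimal sketch of how one might check whether plain sampling (num_beams=1, do_sample=True) yields distinct generations, independent of the repository's humaneval_gen.py. The model name, prompt, and parameter values are placeholder assumptions, not a claim about the script's actual setup.

# Minimal diversity check with pure sampling (sketch; not humaneval_gen.py).
# Model name, prompt, and decoding parameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "WizardLM/WizardCoder-15B-V1.0"  # assumed checkpoint; substitute your own
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)

prompt = "def fibonacci(n):\n"  # toy prompt for illustration
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Pure sampling: with num_beams=1 each returned sequence is drawn independently,
# so identical outputs across 20 samples would be unexpected.
gen_config = GenerationConfig(
    do_sample=True,
    num_beams=1,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=128,
    num_return_sequences=20,
    pad_token_id=tokenizer.eos_token_id,
)

with torch.no_grad():
    out = model.generate(**inputs, generation_config=gen_config)

texts = tokenizer.batch_decode(out, skip_special_tokens=True)
print(f"{len(set(texts))} distinct generations out of {len(texts)}")

One thing worth comparing against the config above: with num_beams greater than 1, transformers switches from plain sampling to a beam-based decoding mode, which behaves differently from independent sampling, so isolating num_beams=1 first may help narrow down the cause.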

Have you tried more samples?


I did try this for different prompts (including non-coding based instructions).

I think you have done something wrong, but I cannot figure out what from your config.
We checked the generated results on HumanEval with n=20; they are not the same.
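For anyone wanting to run the same check on their own outputs, here is a short sketch that counts distinct completions per task. It assumes a HumanEval-style JSONL file with "task_id" and "completion" fields; the file name and keys are assumptions, so adjust them to your actual output format.

# Count distinct completions per task in a samples file (sketch).
import json
from collections import defaultdict

per_task = defaultdict(list)
with open("samples.jsonl") as f:  # path is an assumption
    for line in f:
        rec = json.loads(line)
        per_task[rec["task_id"]].append(rec["completion"])

for task_id, completions in sorted(per_task.items()):
    print(f"{task_id}: {len(set(completions))}/{len(completions)} distinct")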