ml-explore / mlx-swift-examples

Examples using MLX Swift

llm-eval: not responding to 'what is your name?' or 'what is the difference between star wars and star trek?'

CharlieTLe opened this issue

On my Mac, I see this error:

CLIENT ERROR: TUINSRemoteViewController does not override -viewServiceDidTerminateWithError: and thus cannot react to catastrophic errors beyond logging them

It does respond fine to 'compare python and swift', though.

That actually looks "right":

python -m mlx_lm.generate --model ~/Documents/huggingface/models/mlx-community/phi-2-hf-4bit-mlx --prompt 'Instruct: what is your name?. Output: '
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: Instruct: what is your name?. Output: 


==========
Prompt: 32.359 tokens-per-sec
Generation: 0.000 tokens-per-sec

The problem seems to be in the prompt template:

        "Instruct: \(prompt). Output: "

it should be:

        "Instruct: \(prompt)\nOutput: "

That fix gives a much better response, though (perhaps) in Chinese?

Still nothing from 'what is the difference between star wars and star trek?', but the python version doesn't answer it either.

It looks like phi2 can't answer that prompt -- maybe its training data doesn't cover that info, or maybe it is too small? mistral7B4bit aka mlx-community/Mistral-7B-v0.1-hf-4bit-mlx seems to do an OK job, though sometimes a bit silly.
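
For comparison, the same check can be run against Mistral with the python CLI, along the lines of the command above (use whatever local path or hub id you have):

python -m mlx_lm.generate --model mlx-community/Mistral-7B-v0.1-hf-4bit-mlx --prompt 'what is the difference between star wars and star trek?'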

Three changes were made, and I think they fix or greatly improve the responses here:

  • the prompt for Phi was adjusted to fit the format better -- it is sensitive to the exact wording
  • the temperature was set to 0.6 to match the python code
  • a new random seed is generated each time you generate -- so you can explore a little (see the sketch after this list)
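
A minimal sketch of the last two changes; the GenerateParameters shape here is an assumption (the real type in the LLM library may differ), and it uses MLXRandom.seed from mlx-swift:

    import Foundation
    import MLXRandom

    // Assumed parameter type -- the real one in the LLM library
    // may have a different shape.
    struct GenerateParameters {
        var temperature: Float = 0.6  // matches the python code
    }

    // Reseed before each generation so repeated runs of the same
    // prompt can produce different completions.
    func reseedForGeneration() {
        MLXRandom.seed(UInt64(Date.timeIntervalSinceReferenceDate))
    }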

You may need to switch to a larger model like Mistral 7B to see more interesting responses for a wider range of inputs.