When to stop in the LLMEval?
MatthewWaller opened this issue
In the LLMEval project, the generation stops after reaching a limit on tokens. Is there a way to configure it to stop when it finds a special token? I tried to look for Phi-3's end token, but the output seems to go off the rails earlier than when <|end|> or <|endoftext|> appear. Thoughts?
It should stop at the end of sequence (EOS) token id: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Evaluate.swift#L199
The fact that it's not stopping likely means it doesn't have the right EOS token ID set. Which model did you try?
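If it's useful for debugging, you can check what the tokenizer actually loaded by printing the special token ids (a minimal sketch; tokenizer is the swift-transformers tokenizer the LLM code already loads):

    // Sketch: inspect the special token ids the tokenizer loaded
    print("unknownTokenId:", tokenizer.unknownTokenId ?? -1)
    print("eosTokenId:", tokenizer.eosTokenId ?? -1)
    if let eos = tokenizer.eosTokenId {
        // decode the id back to text to see which token it actually is
        print("eos token:", tokenizer.decode(tokens: [eos]))
    }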
@awni I was working with the phi-3 4-bit model.
Looks like this is the eos token for that model: https://huggingface.co/mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed/blob/main/tokenizer_config.json#L340. We'll need to check to make sure the IDs match / the tokenizer is reading it correctly.
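A quick way to check is to round-trip the token string through the tokenizer (a sketch; assumes the swift-transformers convertTokenToId / convertIdToToken helpers are available on the loaded tokenizer):

    // Sketch: confirm the eos string from tokenizer_config.json maps to the
    // id that the stopping check compares against
    let eosString = "<|endoftext|>"
    print("id for \(eosString):", tokenizer.convertTokenToId(eosString) ?? -1)
    print("token for 32000:", tokenizer.convertIdToToken(32000) ?? "<none>")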
Specifically the code is looking for either the unknown token or the eos token:
if t == tokenizer.unknownTokenId || t == tokenizer.eosTokenId {
https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Evaluate.swift#L199
The didGenerate block that is passed in can also return .stop if you are implementing this yourself.
Alright, well unknownTokenId is 0 and eosTokenId is 32000, which I believe is correct, and it matches "eos_token": "<|endoftext|>", from HuggingFace. I can see in the debugger that the eosToken is <|endoftext|>. The model just never seems to produce that token. Hmmm. For instance, I can tell phi3 to "Write 3 words" and on HuggingFace chat, it appropriately stops. So I'm guessing it's producing that token for them. It just never shows up in the output I'm getting.
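One way to confirm whether the id ever shows up is to log the raw token ids from the didGenerate callback instead of the decoded text (a minimal sketch, reusing the generate call from the example):

    // Sketch: print each raw token id as it streams so you can see whether
    // 32000 (<|endoftext|>) is ever actually generated
    let result = await MLXLLM.generate(
        promptTokens: promptTokens, parameters: generateParameters, model: model,
        tokenizer: tokenizer
    ) { tokens in
        if let last = tokens.last {
            print("generated id:", last)
        }
        return tokens.count >= maxTokens ? .stop : .more
    }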
It may be related to this: huggingface/swift-transformers#92 -- we are not passing in a proper prompt and the generation may be impacted.
That issue is a bit terse but basically the extra tokens are not being honored when tokenizing.
Oh dang, yeah, I see that now: I pass in "<|user|>\nWrite 2 words<|end|>\n<|assistant|>\n" after preparePrompt, and that should be 9 tokens or so. But it's encoded as 24 tokens!
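You can see the splitting directly by decoding each id (a sketch; with the bug, the special tokens come back as several plain ids apiece instead of one added-token id each):

    // Sketch: encode the chat-formatted prompt and inspect how the special
    // tokens were split. With the bug, "<|user|>" tokenizes as several ids
    // ("<", "|", "user", ...) rather than a single id.
    let prompt = "<|user|>\nWrite 2 words<|end|>\n<|assistant|>\n"
    let ids = tokenizer.encode(text: prompt)
    print("token count:", ids.count)  // 24 with the bug, ~9 once special tokens are honored
    for id in ids {
        print(id, tokenizer.convertIdToToken(id) ?? "<none>")
    }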
Saw that huggingface/swift-transformers#92 has been closed and special tokens should now be accounted for. I'm still running into issues with the model itself returning the <|end|> token when the assistant is done. Has anyone found a more manual solution for getting the correct model (phi-3) response?
I made a little project where I directly looked for that token (32001) and returned .stop if I found it, in the LLMEvaluator. Once I did that, and got the correct tokens in preparePrompt, everything worked correctly.
Gotcha, so something similar to:
let result = await MLXLLM.generate(
    promptTokens: promptTokens, parameters: generateParameters, model: model,
    tokenizer: tokenizer
) { tokens in
    // stop as soon as the model has produced the <|end|> token id
    let endGen = tokens.contains(32001)

    // update the output -- this will make the view show the text as it generates
    if tokens.count % displayEveryNTokens == 0 {
        let text = tokenizer.decode(tokens: tokens)
        await MainActor.run {
            self.output = text
        }
    }

    if tokens.count >= maxTokens || endGen {
        return .stop
    } else {
        return .more
    }
}
Exactly. And heads up that there is a little bug you may run into at the end, below that bit. I had to change it to:
// update the text if needed, e.g. we haven't displayed because of displayEveryNTokens
// drop the stop token (32001) and anything after it before decoding
var validTokens = Array(result.tokens.prefix(while: { $0 != 32001 }))
if !validTokens.isEmpty {
    validTokens.removeLast()
}
let text = tokenizer.decode(tokens: validTokens)
await MainActor.run {
    if text != self.output {
        self.output = text
    }
    running = false
    self.stat = " Tokens/second: \(String(format: "%.3f", result.tokensPerSecond))"
}
Because you can still get the <|end|> token and more in there when it does that final bit of output.
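If you'd rather not hard-code a single id, a small helper can trim at the first of any stop ids before decoding (a sketch; the ids in the set are illustrative and should come from the model's tokenizer_config.json):

    // Sketch: drop the first stop token and everything after it before decoding.
    // The ids here are illustrative; read the real ones from tokenizer_config.json.
    func trimAtStopTokens(_ tokens: [Int], stopIds: Set<Int>) -> [Int] {
        if let stopIndex = tokens.firstIndex(where: { stopIds.contains($0) }) {
            return Array(tokens[..<stopIndex])
        }
        return tokens
    }

    let stopIds: Set<Int> = [32000, 32001]  // <|endoftext|> plus the id used above
    let finalText = tokenizer.decode(tokens: trimAtStopTokens(result.tokens, stopIds: stopIds))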
Closing now that the main issue has been resolved in swift-transformers.