OpenGenerativeAI / llm-colosseum

Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM

Home Page: https://huggingface.co/spaces/junior-labs/llm-colosseum

Larger model, worse performance?

WSPeng opened this issue

Hi, the leaderboard shows that larger models perform worse. Is that because of inference time? Smaller models have a higher action frequency. If so, the benchmark is not very useful.

I think the game could be changed so that it can pause; then we could compare models without bias from inference latency.

The goal here is to evaluate an LLM in real time. We give them the ability to make 3-5 moves ahead of time. Large LLMs can generate more moves, but yes, they take longer.
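To make the "3-5 moves ahead of time" idea concrete, here is a minimal sketch of how a model reply could be filtered into a short queue of playable moves. It is not the project's actual parser: `KNOWN_MOVES` is only the set of move names that appear in the logs in this thread, and `parse_moves` and `max_moves` are hypothetical names chosen for illustration.

```python
# Move names as they appear in the match logs in this thread.
# Assumption: the real project may define more (or different) moves.
KNOWN_MOVES = {
    "move closer", "move away", "jump closer", "jump away",
    "fireball", "megafireball", "megapunch", "hurricane",
    "super attack 2", "super attack 3", "super attack 4",
    "low punch", "medium punch", "high punch",
    "low kick", "medium kick", "high kick",
}


def parse_moves(reply: str, max_moves: int = 5) -> tuple[list[str], list[str]]:
    """Split a model reply into (valid moves, rejected lines).

    Hypothetical helper: one candidate move per line, bullet or
    numbering prefixes are stripped, and matching is case-insensitive.
    """
    valid: list[str] = []
    invalid: list[str] = []
    for line in reply.splitlines():
        cleaned = line.strip().lstrip("-*0123456789. ").strip()
        if not cleaned:
            continue
        key = cleaned.lower()
        if key in KNOWN_MOVES:
            if len(valid) < max_moves:  # keep only the first few moves
                valid.append(key)
        else:
            invalid.append(cleaned)
    return valid, invalid


reply = "- Move closer\n- Evaluate Opponent\n- High Punch\n"
print(parse_moves(reply))
```

A reply made of free-form advice, like the ones reported in the "Many invalid moves" warnings in the logs, would yield no playable moves under this sketch, so the time spent on that call would be wasted.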

Inference latency is intentionally part of the evaluation, but we could add an option, controlled by a parameter, to remove it for some games.

Please feel free to open a PR to put this in place, but as an option and not by default ;)
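The trade-off discussed above can be illustrated with a toy simulation. This is only a sketch under made-up assumptions, not the project's game loop: `ToyAgent`, `moves_in_round`, the `pause_during_inference` flag, and all latency numbers are hypothetical. It counts how many moves an agent gets to execute in a fixed-length round, with and without the game clock running during inference.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    name: str
    latency_s: float     # assumed seconds per LLM call (made-up number)
    moves_per_call: int  # assumed moves returned per call (made-up number)


def moves_in_round(agent, round_s=30.0, move_s=0.5, pause_during_inference=False):
    """Count the moves an agent executes before the round clock runs out."""
    if agent.moves_per_call < 1:
        raise ValueError("moves_per_call must be at least 1")
    clock = 0.0
    executed = 0
    while clock < round_s:
        if not pause_during_inference:
            # Real-time mode: the game keeps running while the model thinks.
            clock += agent.latency_s
        for _ in range(agent.moves_per_call):
            if clock >= round_s:
                break
            clock += move_s  # each move takes a fixed slice of game time
            executed += 1
    return executed


small = ToyAgent("small", latency_s=0.5, moves_per_call=3)
large = ToyAgent("large", latency_s=4.0, moves_per_call=5)

for agent in (small, large):
    realtime = moves_in_round(agent)
    paused = moves_in_round(agent, pause_during_inference=True)
    print(f"{agent.name}: realtime={realtime} paused={paused}")
```

With these made-up numbers, the small agent executes 45 moves against the large agent's 20 in real-time mode, and both execute 60 when the clock is paused during inference. The pause removes the latency advantage entirely, which is why it would have to stay optional if latency is meant to be part of the benchmark.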

In my experience, yes. A small model has high tokens/second and is always generating actions, while a big model waits for tokens before it knows how to react. @_@

The record shows that the small model can generate more actions thanks to its high tokens/second.

0.5b wins 3 rounds!

Player 1 using: ollama:qwen:14b-chat-v1.5-fp16
Player 2 using: ollama:qwen:0.5b-chat-v1.5-fp16

Round 1

🏟️ (0647) (0)Starting game
🏟️ (0647) (0)Waiting for fight to start
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
2024-03-30 12:20:26.448 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Evaluate Opponent', 'Assess Distance for Effective Attacks']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: super attack 2
Player 2 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megapunch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: medium kick
Player 2 move: high kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: high kick
Player 2 move: low kick
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: fireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: fireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 2
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump away
Player 1 move: megapunch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: low punch
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: low punch
Player 2 move: low punch
2024-03-30 12:21:41.329 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Mid Punch', 'Mid Punch']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
🏟️ (0647) (0)Round won by P2
(0)Moving to next round
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megapunch
Player 1 move: hurricane
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player2 ollama:qwen:0.5b-chat-v1.5-fp16 Daddy won!

----------

Round 2

🏟️ (2b8a) (0)Starting game
🏟️ (2b8a) (0)Waiting for fight to start
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: low kick
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: medium punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: jump closer
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: jump closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump away
Player 1 move: super attack 2
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megapunch
Player 1 move: low punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: super attack 2
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
Player 1 move: jump closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high punch
Player 2 move: high punch
Player 2 move: high punch
Player 2 move: megapunch
Player 2 move: low punch
Player 2 move: low punch
Player 2 move: low kick
🏟️ (2b8a) (0)Round won by P2
(0)Moving to next round
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: high kick
Player 1 move: megapunch
Player2 ollama:qwen:0.5b-chat-v1.5-fp16 Daddy won!

----------

Round 3

🏟️ (b34c) (0)Starting game
🏟️ (b34c) (0)Waiting for fight to start
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 3
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: move away
Player 2 move: medium punch
Player 2 move: super attack 2
Player 2 move: high punch
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: jump closer
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
2024-03-30 12:28:29.109 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Move Closer to get into better attacking range', 'Megafireball or Super attack 2 as a powerful offensive option while closing in']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: megapunch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: high punch
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: megafireball
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
Player 2 move: low punch
Player 2 move: medium punch
Player 2 move: high punch
Player 2 move: jump closer
Player 2 move: jump away
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: move away
Player 2 move: high punch
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: high kick
Player 2 move: fireball
Player 2 move: high kick
Player 2 move: fireball
Player 2 move: high kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megafireball
Player 1 move: medium punch
Player 1 move: fireball
2024-03-30 12:28:58.413 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Assess the distance to the opponent', 'If close', 'If far', 'Move Clo']
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: high kick
Player 2 move: low kick
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
Player 2 move: low kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: megafireball
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: low kick
Player 2 move: medium kick
Player 2 move: high kick
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: super attack 2
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: jump closer
Player 1 move: megafireball
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: fireball
Player 2 move: megapunch
Player 2 move: hurricane
Player 2 move: megafireball
Player 2 move: super attack 2
Player 2 move: super attack 3
Player 2 move: super attack 4
Player 2 move: low punch
Player 2 move: medium punch
🏟️ (b34c) (0)Round won by P2
(0)Moving to next round
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 1 move: move closer
INFO:httpx:HTTP Request: POST http://localhost:11434/v1/chat/completions "HTTP/1.1 200 OK"
Player 2 move: move away
Player 2 move: super attack 3
Player 2 move: low kick
Player 2 move: high kick
Player 2 move: jump closer
Player 2 move: jump away
Player2 ollama:qwen:0.5b-chat-v1.5-fp16 Daddy won!
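To check the "more actions" claim against logs like the ones above, the moves per player can be tallied with a few lines of Python. This is a minimal sketch: `tally_log` is a hypothetical helper, and it relies only on the `Player N move:` and `Many invalid moves:` line formats visible in this thread. The `httpx` request lines are ignored because they do not say which player made the call.

```python
import re
from collections import Counter

# Matches lines such as "Player 2 move: low kick".
MOVE_RE = re.compile(r"^Player (\d) move: (.+)$")


def tally_log(log_text: str) -> dict:
    """Count executed moves per player and invalid-move warnings."""
    moves = Counter()
    invalid_warnings = 0
    for raw in log_text.splitlines():
        line = raw.strip()
        match = MOVE_RE.match(line)
        if match:
            moves[f"P{match.group(1)}"] += 1
        elif "Many invalid moves:" in line:
            invalid_warnings += 1
    return {"moves": dict(moves), "invalid_warnings": invalid_warnings}


# A short excerpt from the start of Round 1 above.
sample = """\
2024-03-30 12:20:26.448 | WARNING | agent.robot:get_moves_from_llm:317 - Many invalid moves: ['Evaluate Opponent', 'Assess Distance for Effective Attacks']
Player 1 move: super attack 3
Player 2 move: low kick
Player 1 move: super attack 3
Player 2 move: fireball
Player 2 move: super attack 2
Player 2 move: super attack 3
"""

print(tally_log(sample))
```

Running it over a full round's log gives a per-player move count that can be compared directly with the round outcome.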

[Image: WechatIMG83]

Win rate 44% after 50 rounds.

@oulianov

Very interesting results!