Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
1 National University of Singapore, 2 University of Oxford
Building truly capable AI agents that can interact with the real world requires mastering visual perception, strategic planning, precise timing, and sustained action over long horizons. Video games offer an ideal testbed for these capabilities, but systematically evaluating them has been held back by inconsistent action interfaces and unreliable verification methods. GameWorld is a new benchmark designed to solve these problems. It provides a standardized, browser-based environment for evaluating multimodal large language models (MLLMs) as game agents. Two types of agent interfaces are studied: Computer-Use Agents that directly control keyboard and mouse, and Generalist Agents that operate through semantic action functions. Across 34 diverse games and 170 tasks, results from 18 model-interface pairs reveal that even the best-performing agents are far from matching human gameplay capabilities.
34 browser games spanning 5 genres (Arcade, Platformer, Puzzle, Runner, Simulation) with 170 tasks. Supports both Computer-Use Agents and Generalist Agents under a shared, executable action space via deterministic Semantic Action Parsing.
A universal outcome-based evaluator that programmatically checks agent performance through game state inspection. Unlike prior benchmarks relying on heuristics or LLM-as-judge, GameWorld verifies actual outcomes from the game's internal state.
Extensive experiments covering benchmark robustness through repeated full-benchmark reruns, plus focused studies on real-time interaction, context-memory sensitivity, action validity, and detailed failure mode analysis.
GameWorld studies two distinct approaches to controlling game agents. At each step, the agent observes a screenshot of the current game state, produces an action through the model, and the environment executes it. A verifiable evaluator then checks the resulting state against the task objective.
Computer-Use Agents directly emit low-level keyboard and mouse controls, acting like a human operating a computer. The model must translate its game understanding into precise physical inputs.
Generalist Agents use semantic, game-specific function calls (e.g., move_forward(), action_jump()). These calls are deterministically parsed into low-level controls via Semantic Action Parsing.
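A minimal sketch of what deterministic parsing of a semantic call could look like. The function names move_forward and action_jump come from the text above; the LowLevelAction type, the key bindings, the durations, and the lookup-table design are all assumptions made for illustration and are not GameWorld's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LowLevelAction:
    """Hypothetical low-level control: hold one key for a fixed duration."""
    key: str
    duration_ms: int


# Hypothetical bindings. The real per-game tables are not given in the source.
SEMANTIC_TABLE = {
    "move_forward": LowLevelAction(key="ArrowUp", duration_ms=200),
    "action_jump": LowLevelAction(key="Space", duration_ms=100),
}

# Accepts only zero-argument calls such as "action_jump()".
CALL_PATTERN = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\(\s*\)\s*$")


def parse_semantic_action(call_text: str) -> LowLevelAction:
    """Map a semantic call string to a low-level action via a fixed table."""
    match = CALL_PATTERN.match(call_text)
    if match is None:
        raise ValueError(f"not a zero-argument call: {call_text!r}")
    name = match.group(1)
    if name not in SEMANTIC_TABLE:
        raise ValueError(f"unknown semantic action: {name!r}")
    return SEMANTIC_TABLE[name]
```

Because the mapping is a pure table lookup, the same call string always yields the same low-level action, which is the property "deterministic" refers to in this sketch.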
Evaluating agents in games introduces challenges that existing benchmarks have not fully addressed. Most cover only a few games within narrow genres, and in real-time games, agent performance gets entangled with inference speed. GameWorld addresses these challenges through four design principles:
Standardized Action Interfaces via Semantic Action Parsing that deterministically maps semantic functions to low-level controls
Browser-Based Sandbox with pause-step execution that decouples reasoning latency from gameplay timing
Diverse Game Coverage across 5 genres with 34 games and 170 tasks testing distinct capabilities
State-Verifiable Evaluation using programmatic game state inspection for outcome-based assessment
GameWorld comprises 34 browser-based games spanning five genres: Runner, Arcade, Platformer, Puzzle, and Simulation. Each genre tests distinct agent capabilities — from fast reflexes and spatial navigation to long-term planning and resource management. Each task pairs a natural-language instruction with a quantitative target and a verifiable evaluator.
The central design goal is to decouple agent decision quality from inference speed. In real-time games, a slower model faces a harder game state by the time it acts, conflating thinking time with gameplay ability. GameWorld's sandbox pauses game execution between agent steps, ensuring each model is evaluated on the same game state regardless of its inference latency.
Each game runs in an isolated browser instance (Playwright) following a strict observation-action loop: capture screenshot, query the model, execute one action. A readiness gate ensures the game is fully loaded and in a stable state before evaluation begins.
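The observation-action loop and the effect of pausing can be sketched with a toy stand-in for the browser. ToyGame, run_episode, and both policies are hypothetical names invented for this example. The benchmark itself drives real games through Playwright, which is not used here so the sketch stays self-contained.

```python
import time
from typing import Callable, Dict


class ToyGame:
    """Stand-in for a paused game: its clock advances only inside step()."""

    def __init__(self) -> None:
        self.tick = 0
        self.position = 0

    def is_ready(self) -> bool:
        # Placeholder readiness gate. A real gate would wait for the page to load.
        return True

    def observe(self) -> Dict[str, int]:
        # Placeholder for a screenshot: returns the visible state as a dict.
        return {"tick": self.tick, "position": self.position}

    def step(self, action: str) -> None:
        # Execute exactly one action, advance one tick, then stay paused.
        self.tick += 1
        if action == "move_forward":
            self.position += 1


def run_episode(game: ToyGame,
                policy: Callable[[Dict[str, int]], str],
                max_steps: int) -> Dict[str, int]:
    """Capture observation, query the policy, execute one action, repeat."""
    if not game.is_ready():
        raise RuntimeError("game is not ready for evaluation")
    for _ in range(max_steps):
        observation = game.observe()
        action = policy(observation)  # The game does not advance while this runs.
        game.step(action)
    return game.observe()


def fast_policy(observation: Dict[str, int]) -> str:
    return "move_forward"


def slow_policy(observation: Dict[str, int]) -> str:
    time.sleep(0.01)  # Simulated inference latency.
    return "move_forward"
```

In this sketch, run_episode(ToyGame(), fast_policy, 5) and run_episode(ToyGame(), slow_policy, 5) end in the same state, because the game clock is tied to executed steps rather than to wall-clock time.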
Unlike benchmarks that rely on heuristic scoring or LLM-as-judge, GameWorld evaluates agents by inspecting the game's actual internal state. Each game exposes a structured JSON state containing score, level progress, player position, and task-specific metrics. Task success is verified programmatically against this state — no subjective judgment required.
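As a rough illustration, an outcome check against a JSON state might look like the sketch below. The field names, the task schema, and the progress formula (metric value divided by target, clipped to the range 0 to 1) are assumptions for this example. The source does not specify the state schema or how Progress is computed.

```python
import json
from typing import Dict, Union


def evaluate_task(state_json: str, task: Dict[str, Union[str, float]]) -> Dict[str, Union[bool, float]]:
    """Check a task target against the game's reported state.

    `task` uses a hypothetical schema: {"metric": <state key>, "target": <positive number>}.
    """
    state = json.loads(state_json)
    value = float(state[task["metric"]])
    target = float(task["target"])
    if target <= 0:
        raise ValueError("target must be positive in this sketch")
    # Assumed progress definition: fraction of the target reached, clipped to [0, 1].
    progress = min(max(value / target, 0.0), 1.0)
    return {"success": value >= target, "progress": progress}


# Hypothetical state and task, not taken from any GameWorld game.
example_state = json.dumps({"score": 45, "level": 2, "player": {"x": 10, "y": 3}})
example_task = {"metric": "score", "target": 50}
result = evaluate_task(example_state, example_task)
```

Here result reports partial progress without success, since the score of 45 falls short of the target of 50. The check reads only the reported state, so no judgment call is involved.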
GameWorld evaluates 13 base models across both CUA and Generalist interfaces, yielding 18 model-interface pairs. Models include proprietary systems (Claude-Sonnet-4.6, Gemini-3-Flash-Preview, GPT-5.2, Grok-4.1) and open-source models (Qwen3-VL-235B, Qwen3-VL-30B, UI-TARS-1.5). All models are evaluated under the same paused protocol so scores reflect decision quality, not response speed.
Best overall progress. Strong performance across Arcade and Runner genres
Close second. Particularly strong in Puzzle games with strategic reasoning
Solid third place. Consistent performance but gaps in real-time games
To verify GameWorld functions as a reproducible measurement platform rather than a one-off snapshot, repeated full-benchmark evaluations were performed on Qwen3-VL-30B and Qwen3-VL-235B, each in both CUA and Generalist modes.
The standard deviation of overall Progress remains in a low single-digit band across all four settings, and Success Rate variation is likewise limited. This confirms GameWorld provides stable, reproducible measurements suitable for reliable model comparison.
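The run-to-run spread described above can be computed as a sample standard deviation over repeated runs. In the sketch below, summarize_reruns is a hypothetical helper, and the numbers are placeholders chosen for easy arithmetic. They are not results from GameWorld.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_reruns(progress_by_run: Sequence[float]) -> Dict[str, float]:
    """Return the mean and sample standard deviation of per-run overall progress."""
    if len(progress_by_run) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {"mean": mean(progress_by_run), "std": stdev(progress_by_run)}


# Placeholder values only, NOT measurements reported by the benchmark.
placeholder_progress = [30.0, 32.0, 34.0]
summary = summarize_reruns(placeholder_progress)
```

Using the sample (n - 1) estimator is a design choice for this sketch. The source does not state which estimator or how many reruns were used.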
Genre-level averages alone cannot reveal whether failures stem from weak control grounding, slow reactive behavior, poor navigation, or limited reasoning. GameWorld organizes its 170 tasks into a 5-level capability curriculum to enable more targeted diagnosis:
Basic Control & Timing Grounding — Simple input-output mapping. Can the agent press the right button at the right time?
System-1 Reactive Control — Fast, reflexive responses to immediate stimuli without deliberation
System-2 Navigation — Deliberate spatial reasoning and pathfinding through complex environments
Reasoning & Strategy — Multi-step planning, resource management, and strategic decision-making
Long-Horizon & Coordination — Sustained complex planning over many steps with goal tracking
When switching from the default paused evaluation to real-time continuous execution, nearly all models see dramatic performance drops. The smaller 30B model is substantially faster but the 235B model achieves slightly higher progress. Success rates remain very low throughout, revealing that faster inference alone does not solve the real-time challenge. This exposes a fundamental limitation: current MLLM inference latency is incompatible with time-critical game interactions.
Increasing context memory (keeping recent action history and visual screenshots) substantially raises both prompt length and latency, but the impact on performance varies by interface. Generalist agents benefit more from memory because semantic action trajectories preserve useful task context. CUA agents see mixed results since their raw pixel-level action logs are less informative when replayed as text.
Agents cannot act freely: they must obey action-space rules at every step. Invalid actions fall into two categories: No-Tool-Call (the model fails to emit any executable action, often because its output was truncated) and Malformed-Call (the action format is incorrect or names a non-existent function). The Invalid Action Rate (IAR) therefore serves as a proxy for instruction-following ability, with lower values indicating better adherence to the action space.
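The two invalid-action categories and the resulting rate can be sketched as follows. The representation of a model step (None when no tool call was emitted, otherwise the raw call string), the set of valid actions, and the definition of IAR as invalid steps divided by total steps are assumptions for this illustration.

```python
import re
from collections import Counter
from typing import Optional, Sequence

# Hypothetical action space for one game.
VALID_ACTIONS = {"move_forward", "action_jump"}
CALL_PATTERN = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\(\)$")


def classify_step(tool_call: Optional[str]) -> str:
    """Label one step as 'valid', 'no_tool_call', or 'malformed_call'."""
    if tool_call is None:
        return "no_tool_call"  # The model emitted no executable action.
    match = CALL_PATTERN.match(tool_call.strip())
    if match is None or match.group(1) not in VALID_ACTIONS:
        return "malformed_call"  # Bad format or a function that does not exist.
    return "valid"


def invalid_action_rate(tool_calls: Sequence[Optional[str]]) -> float:
    """Fraction of steps whose action was invalid (assumed IAR definition)."""
    if not tool_calls:
        raise ValueError("no steps to score")
    counts = Counter(classify_step(call) for call in tool_calls)
    return (counts["no_tool_call"] + counts["malformed_call"]) / len(tool_calls)


# Hypothetical trajectory: two valid steps, one missing call, one unknown function.
example_steps = ["action_jump()", None, "fly()", "move_forward()"]
iar = invalid_action_rate(example_steps)
```

For this example trajectory, two of the four steps are invalid, so the computed rate is 0.5.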
Analysis of task failures across models reveals four major failure categories:
This case study compares matched trajectories of Mario under CUA and Generalist interfaces using the same model family. CUA generates low-level keyboard and mouse actions while the Generalist uses semantic action functions. The comparison reveals how the choice of interface fundamentally shapes action selection strategies, even when pursuing the same score-driven objective.
In this open-ended Minecraft resource collection task, the agent repeatedly mines toward a target number. The run reaches 90% progress yet fails to complete — the failure is not instruction-following but missing closure: the agent cannot bridge the final gap between near-completion and actual success, a common challenge in long-horizon tasks.
Consecutive frames look nearly identical, but the correct action alternates between waiting and flapping. A slightly early or late flap determines whether any progress can be earned from visually similar states. This highlights the critical real-time control difficulty — even perfect perception is not enough without precise timing.
Current multimodal game agents can often make partial progress, yet still struggle to convert that progress into reliable task completion across diverse browser games. The gap between reaching ~38% average progress and achieving consistent success reveals fundamental limitations in perception, timing, and long-horizon planning.
GameWorld provides a standardized and verifiable benchmark for evaluating these capabilities. Across 34 games, 170 tasks, and 18 model-interface pairs, the results establish that even the best-performing agents are far from human-level gameplay — while the benchmark itself is robust and reproducible enough to reliably measure future progress.
Looking ahead, expanding game diversity, adding multiplayer scenarios, and developing more sophisticated agent architectures that can handle real-time constraints and long-horizon planning will be key directions for advancing multimodal game agents.