GameWorld: Standardized Evaluation of Multimodal Game Agents

Abstract

34 games, 170 tasks, 18 model-interface pairs, 5 genres

Building truly capable AI agents that can interact with the real world requires mastering visual perception, strategic planning, precise timing, and sustained action over long horizons. Video games offer an ideal testbed for these capabilities, but systematically evaluating them has been held back by inconsistent action interfaces and unreliable verification methods. GameWorld is a new benchmark designed to solve these problems. It provides a standardized, browser-based environment for evaluating multimodal large language models (MLLMs) as game agents. Two types of agent interfaces are studied: Computer-Use Agents that directly control keyboard and mouse, and Generalist Agents that operate through semantic action functions. Across 34 diverse games and 170 tasks, results from 18 model-interface pairs reveal that even the best-performing agents are far from matching human gameplay capabilities.

Key Contributions

01. Standardized Benchmark

34 browser games spanning 5 genres (Arcade, Platformer, Puzzle, Runner, Simulation) with 170 tasks. Supports both Computer-Use Agents and Generalist Agents under a shared, executable action space via deterministic Semantic Action Parsing.

02. State-Verifiable Evaluation

A universal outcome-based evaluator that programmatically checks agent performance through game state inspection. Unlike prior benchmarks relying on heuristics or LLM-as-judge, GameWorld verifies actual outcomes from the game's internal state.

03. Comprehensive Analysis

Extensive experiments covering benchmark robustness through repeated full-benchmark reruns, plus focused studies on real-time interaction, context-memory sensitivity, action validity, and detailed failure mode analysis.

Game Agent Interfaces

GameWorld studies two distinct approaches to controlling game agents. At each step, the agent observes a screenshot of the current game state, produces an action through the model, and the environment executes it. A verifiable evaluator then checks the resulting state against the task objective.

**Figure 2:** GameWorld system architecture showing four core components: (i) MLLMs as Game Agents with two interface types, (ii) Browser-based Sandbox Environment, (iii) Games & Tasks Library, and (iv) Outcome-based State-Verifiable Evaluation.

Computer-Use Agent (CUA)

Directly emits low-level keyboard and mouse controls, acting like a human operating a computer. The model must translate its game understanding into precise physical inputs.

Actions: mouse click, scroll, key press, drag, text entry
Requires pixel-level understanding of UI elements
More general but less precise for game-specific tasks

Generalist Agent (GEN)

Uses semantic, game-specific function calls (e.g., move_forward(), action_jump()). Actions are deterministically parsed into low-level controls via Semantic Action Parsing.

Actions: game-specific semantic functions like move_right(), weapon_fire()
Semantic actions are deterministically mapped to controls
More precise for game tasks, but requires action space definition
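As a minimal sketch of what a deterministic Semantic Action Parsing step could look like, the snippet below maps a semantic call to a fixed sequence of key events through a lookup table. The `KeyEvent` type, the `ACTION_TABLE` contents, the key names, and the durations are all assumptions made for illustration; the actual per-game action tables are not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyEvent:
    """One low-level control: hold `key` for `duration_ms` milliseconds."""
    key: str
    duration_ms: int


# Hypothetical action table for a single game (illustrative values only).
ACTION_TABLE = {
    "move_right": (KeyEvent("ArrowRight", 200),),
    "move_left": (KeyEvent("ArrowLeft", 200),),
    "action_jump": (KeyEvent("Space", 100),),
    "weapon_fire": (KeyEvent("KeyX", 50),),
}


def parse_semantic_action(call: str) -> tuple:
    """Map a call such as 'move_right()' to its fixed key-event sequence.

    The mapping is a pure table lookup, so the same call always yields
    the same low-level controls.
    """
    text = call.strip()
    if not text.endswith("()"):
        raise ValueError(f"malformed call: {call!r}")
    name = text[:-2]
    if name not in ACTION_TABLE:
        raise ValueError(f"unknown action: {name!r}")
    return ACTION_TABLE[name]


events = parse_semantic_action("move_right()")
```

Because the parser involves no model call or randomness, two agents that emit the same semantic action produce identical low-level input, which is the property the shared action space relies on.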

Benchmark Design

Evaluating agents in games introduces challenges that existing benchmarks have not fully addressed. Most cover only a few games within narrow genres, and in real-time games, agent performance gets entangled with inference speed. GameWorld addresses these challenges through four design principles:

1. Standardized Action Interfaces via Semantic Action Parsing, which deterministically maps semantic functions to low-level controls.

2. Browser-Based Sandbox with pause-step execution that decouples reasoning latency from gameplay timing.

3. Diverse Game Coverage across 5 genres, with 34 games and 170 tasks testing distinct capabilities.

4. State-Verifiable Evaluation using programmatic game state inspection for outcome-based assessment.

**Table 2:** Comparison with existing game and computer-use agent benchmarks. GameWorld is the only benchmark combining visual input, online environments, standardized actions, state-verifiable evaluation, and browser-based architecture.

34 Games, 5 Genres

GameWorld comprises 34 browser-based games spanning five genres: Runner, Arcade, Platformer, Puzzle, and Simulation. Each genre tests distinct agent capabilities — from fast reflexes and spatial navigation to long-term planning and resource management. Each task pairs a natural-language instruction with a quantitative target and a verifiable evaluator.

**Figure 3:** Screenshots of all 34 games in GameWorld, spanning arcade classics, platformers, puzzles, runners, and simulation games.

Browser Sandbox & State-Verifiable Evaluation

Browser-Based Sandbox

The central design goal is to decouple agent decision quality from inference speed. In real-time games, a slower model faces a harder game state by the time it acts, conflating thinking time with gameplay ability. GameWorld's sandbox pauses game execution between agent steps, ensuring each model is evaluated on the same game state regardless of its inference latency.

Each game runs in an isolated browser instance (Playwright) following a strict observation-action loop: capture screenshot, query the model, execute one action. A readiness gate ensures the game is fully loaded and in a stable state before evaluation begins.
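The observation-action loop can be sketched without a browser. The toy `PausedGame` class below is a hypothetical stand-in for a Playwright-driven game page, and `think_seconds` simulates model latency; neither is part of GameWorld itself. The point of the sketch is that game time advances only when an action is executed, so a slow agent and a fast agent see the same sequence of states.

```python
import time


class PausedGame:
    """Toy stand-in for a browser game that only advances when stepped."""

    def __init__(self):
        self.tick = 0
        self.x = 0

    def screenshot(self) -> dict:
        # A real sandbox would return pixels; a dict keeps the sketch runnable.
        return {"tick": self.tick, "x": self.x}

    def step(self, action: str) -> None:
        # Game time moves forward exactly one tick per executed action.
        self.tick += 1
        if action == "move_right":
            self.x += 1


def run_episode(game, agent, max_steps, think_seconds=0.0):
    """Capture, query, execute; the game stays paused while the agent thinks."""
    trajectory = []
    for _ in range(max_steps):
        obs = game.screenshot()        # 1. capture the paused state
        time.sleep(think_seconds)      # simulated inference latency
        action = agent(obs)            # 2. query the model
        game.step(action)              # 3. execute one action, then pause again
        trajectory.append((obs["tick"], action))
    return trajectory


def always_right(obs):
    return "move_right"


fast = run_episode(PausedGame(), always_right, max_steps=5)
slow = run_episode(PausedGame(), always_right, max_steps=5, think_seconds=0.001)
```

Here `fast` and `slow` are identical trajectories even though the second run spends longer on every decision, which is the decoupling the paused protocol is meant to provide.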

State-Verifiable Evaluation

Unlike benchmarks that rely on heuristic scoring or LLM-as-judge, GameWorld evaluates agents by inspecting the game's actual internal state. Each game exposes a structured JSON state containing score, level progress, player position, and task-specific metrics. Task success is verified programmatically against this state — no subjective judgment required.

**Figure 10:** Example game state JSON for Super Mario, showing structured fields for score, level, player state, and metrics that enable programmatic verification.
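A minimal sketch of outcome-based verification against such a state is shown below. The field names (`score`, `level`, `player`, `metrics`, `coins_collected`), the example values, and the progress formula (achieved value divided by target, clipped to 1) are assumptions for illustration; the text does not specify the exact schema or how Progress is computed.

```python
import json


def evaluate_task(state_json: str, metric: str, target: float) -> dict:
    """Check a task objective against the game's exported state."""
    if target <= 0:
        raise ValueError("target must be positive")
    state = json.loads(state_json)
    achieved = float(state["metrics"].get(metric, 0))
    # Assumed progress definition: fraction of the target reached, capped at 1.
    progress = max(0.0, min(achieved / target, 1.0))
    return {"success": achieved >= target, "progress": progress}


# Hypothetical state snapshot; values are made up for the example.
example_state = json.dumps({
    "score": 400,
    "level": 1,
    "player": {"x": 512, "y": 96, "alive": True},
    "metrics": {"coins_collected": 3},
})

result = evaluate_task(example_state, metric="coins_collected", target=5)
```

Because the verdict depends only on the exported state, the same run always receives the same score, with no judge model or heuristic in the loop.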

Experiments & Results

GameWorld evaluates 13 base models across both CUA and Generalist interfaces, yielding 18 model-interface pairs. Models include proprietary systems (Claude-Sonnet-4.6, Gemini-3-Flash-Preview, GPT-5.2, Grok-4.1) and open-source models (Qwen3-VL-235B, Qwen3-VL-30B, UI-TARS-1.5). All models are evaluated under the same paused protocol, so scores reflect decision quality rather than response speed.

**Figure 4:** Performance heatmap across 34 games and multiple model-interface pairs. Green indicates higher progress; red indicates lower. Even top-performing models show significant weaknesses across certain game types.

Top Performers

1st: Claude-Sonnet-4.6 (GEN), 38.0%. Best overall progress, with strong performance across the Arcade and Runner genres.

2nd: Gemini-3-Flash-Preview (GEN), 36.2%. Close second, particularly strong in Puzzle games that require strategic reasoning.

3rd: GPT-5.2 (GEN), 30.1%. Solid third place, with consistent performance but gaps in real-time games.

**Table 5:** Full leaderboard showing Success Rate (SR) and Progress (PG) across 5 genres for all 18 model-interface pairs.

Benchmark Robustness

To verify that GameWorld functions as a reproducible measurement platform rather than a one-off snapshot, repeated full-benchmark evaluations were performed on Qwen3-VL-30B and Qwen3-VL-235B, each in both CUA and Generalist modes.

The standard deviation of overall Progress remains in a low single-digit band across all four settings, and Success Rate variation is likewise limited. This confirms GameWorld provides stable, reproducible measurements suitable for reliable model comparison.
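The robustness check reduces to a mean and standard deviation over repeated runs. The sketch below shows that arithmetic with Python's `statistics` module; the three progress values are placeholders chosen for easy mental checking and are not results from the benchmark.

```python
from statistics import mean, stdev


def summarize_reruns(progress_by_run):
    """Summarize overall Progress (in percent) across repeated full runs."""
    return {
        "runs": len(progress_by_run),
        "mean": mean(progress_by_run),
        # Sample standard deviation (n - 1 in the denominator).
        "std": stdev(progress_by_run),
    }


# Placeholder values for illustration only, NOT measured GameWorld results.
placeholder_runs = [20.0, 22.0, 24.0]
summary = summarize_reruns(placeholder_runs)
```

The sample standard deviation is used here on the assumption that the reruns are treated as a small sample of possible runs; a population standard deviation would give a slightly smaller value.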

Capability-Aligned Curriculum

Genre-level averages alone cannot reveal whether failures stem from weak control grounding, slow reactive behavior, poor navigation, or limited reasoning. GameWorld organizes its 170 tasks into a 5-level capability curriculum to enable more targeted diagnosis:

L1: Basic Control & Timing Grounding. Simple input-output mapping: can the agent press the right button at the right time?

L2: System-1 Reactive Control. Fast, reflexive responses to immediate stimuli without deliberation.

L3: System-2 Navigation. Deliberate spatial reasoning and pathfinding through complex environments.

L4: Reasoning & Strategy. Multi-step planning, resource management, and strategic decision-making.

L5: Long-Horizon & Coordination. Sustained complex planning over many steps with goal tracking.

Challenges & Analysis

Real-Time Interaction (GameWorld-RT)

When evaluation switches from the default paused protocol to real-time continuous execution, nearly all models suffer dramatic performance drops. The smaller 30B model is substantially faster, yet the 235B model still achieves slightly higher progress, and success rates remain very low throughout. Faster inference alone therefore does not solve the real-time challenge. This exposes a fundamental limitation: current MLLM inference latency is incompatible with time-critical game interactions.

Context-Memory Sensitivity

Increasing context memory (keeping recent action history and visual screenshots) substantially raises both prompt length and latency, but the impact on performance varies by interface. Generalist agents benefit more from memory because semantic action trajectories preserve useful task context. CUA agents see mixed results since their raw pixel-level action logs are less informative when replayed as text.
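One simple way to picture a bounded context memory is a sliding window over recent rounds. The sketch below is an assumption about how such a buffer might be structured, not the benchmark's implementation; the `ContextMemory` class, the `max_rounds` parameter, and the screenshot identifiers are hypothetical.

```python
from collections import deque


class ContextMemory:
    """Keep only the most recent (screenshot, action) rounds."""

    def __init__(self, max_rounds: int):
        # A deque with maxlen silently drops the oldest round when full.
        self.rounds = deque(maxlen=max_rounds)

    def add(self, screenshot_id: str, action: str) -> None:
        self.rounds.append((screenshot_id, action))

    def as_prompt_lines(self) -> list:
        """Render the retained history as text lines for the next prompt."""
        return [f"saw {shot}, did {action}" for shot, action in self.rounds]


memory = ContextMemory(max_rounds=2)
memory.add("frame_000", "move_right()")
memory.add("frame_001", "action_jump()")
memory.add("frame_002", "move_right()")
history = memory.as_prompt_lines()
```

Raising `max_rounds` lengthens every prompt, which matches the observation that more memory increases both prompt length and latency.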

Action Validity & Instruction Following

Agents cannot act freely: they must obey the action-space rules at every step. Invalid actions fall into two categories: No-Tool-Call (the model fails to emit any executable action, often because its output was truncated) and Malformed-Call (the action format is incorrect or the call names a non-existent function). The Invalid Action Rate (IAR) is the share of steps with an invalid action, so a lower IAR serves as a direct proxy for better instruction-following ability.
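The two invalid-action categories and the resulting rate can be sketched as follows. The representation of a missing action as `None` or an empty string, the zero-argument call pattern, and the allowed action names are assumptions for illustration only.

```python
import re

# Assumed call syntax for the sketch: an identifier followed by "()".
CALL_PATTERN = re.compile(r"^([A-Za-z_]\w*)\(\)$")


def classify_action(output, allowed):
    """Return 'valid', 'no_tool_call', or 'malformed_call' for one step."""
    if output is None or not output.strip():
        return "no_tool_call"
    match = CALL_PATTERN.match(output.strip())
    if match is None or match.group(1) not in allowed:
        return "malformed_call"
    return "valid"


def invalid_action_rate(outputs, allowed):
    """Fraction of steps whose action is not executable."""
    labels = [classify_action(output, allowed) for output in outputs]
    if not labels:
        return 0.0
    return sum(label != "valid" for label in labels) / len(labels)


allowed_actions = {"move_right", "action_jump"}
step_outputs = ["move_right()", None, "fly()", "action_jump("]
iar = invalid_action_rate(step_outputs, allowed_actions)
```

In this example one step is valid, one has no tool call, and two are malformed (an unknown function and a broken call), giving an IAR of 0.75.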

Failure Modes

Analysis of task failures across models reveals four major failure categories:

Perception failures: The agent misreads visual state (objects, UI cues, spatial layout), causing incorrect decisions. Often visible in the model's reasoning when it incorrectly describes on-screen elements.
Fine-grained action failures: The high-level intent is correct but execution is mistimed or imprecise — jump timing off, key-combo duration wrong. The agent knows what to do but cannot do it precisely enough.
Instruction-following failures: The agent proposes actions that violate declared controls or task constraints. Under longer interaction trajectories, models increasingly deviate from instructions.
Long-horizon memory failures: The agent loses critical historical context, repeats ineffective loops, or fails to preserve multi-step plans. Especially common in weaker models that repeatedly issue the same ineffective action.

Case Studies

CUA vs. Generalist: Super Mario Bros

This case study compares matched trajectories of Mario under CUA and Generalist interfaces using the same model family. CUA generates low-level keyboard and mouse actions while the Generalist uses semantic action functions. The comparison reveals how the choice of interface fundamentally shapes action selection strategies, even when pursuing the same score-driven objective.

**Figure 7:** Step-by-step comparison of CUA and Generalist agent trajectories playing Super Mario Bros, showing how different interfaces lead to different strategies.

Long-Horizon Planning: Minecraft

In this open-ended Minecraft resource collection task, the agent repeatedly mines toward a target number. The run reaches 90% progress yet fails to complete. The failure is not one of instruction following but of missing closure: the agent cannot bridge the final gap between near-completion and actual success, a common challenge in long-horizon tasks.

**Figure 8:** Minecraft resource collection showing 6 steps of long-horizon planning, demonstrating the challenge of sustained goal pursuit in 3D environments.

Real-Time Reaction: Flappy Bird

Consecutive frames look nearly identical, but the correct action alternates between waiting and flapping. A slightly early or late flap determines whether any progress can be earned from visually similar states. This highlights the critical real-time control difficulty — even perfect perception is not enough without precise timing.

**Figure 9:** Flappy Bird real-time control showing how slight timing differences between tap and wait determine success or failure.

Related Work

Conclusion

Current multimodal game agents can often make partial progress, yet still struggle to convert that progress into reliable task completion across diverse browser games. The gap between reaching ~38% average progress and achieving consistent success reveals fundamental limitations in perception, timing, and long-horizon planning.

GameWorld provides a standardized and verifiable benchmark for evaluating these capabilities. Across 34 games, 170 tasks, and 18 model-interface pairs, the results establish that even the best-performing agents are far from human-level gameplay — while the benchmark itself is robust and reproducible enough to reliably measure future progress.

Looking ahead, expanding game diversity, adding multiplayer scenarios, and developing more sophisticated agent architectures that can handle real-time constraints and long-horizon planning will be key directions for advancing multimodal game agents.

