OccuBench: Evaluating AI Agents on Real-World Professional Tasks

Key Findings

100

Professional Task Scenarios

10

Industry Categories

15

Frontier Models Evaluated

382

Evaluation Instances

Finding 1

No Single Model Dominates

Each model has a distinct capability profile across industries. GPT-5.2 leads overall at 79.6%, but Gemini 3.1 Pro tops Education (84%) and Claude Opus 4.6 excels in Transportation (77%). No single model is the best choice for every domain.

Finding 2

Implicit Faults Are Hardest

Implicit faults (truncated data, missing fields) hurt more than explicit errors such as HTTP 500s: from a 67.5% clean baseline, the average completion rate falls to 53.4% under implicit faults, versus 62.6% under explicit errors. Without clear error signals, agents fail to detect degraded data and make decisions on incomplete information.

Finding 3

Scaling Consistently Helps

Larger models, newer generations, and higher reasoning effort all improve performance. GPT-5.2 gains 27.5 points when scaling from no reasoning effort (54.7%) to the highest setting (82.2%).

Finding 4

Strong Agents ≠ Strong Simulators

GPT-5.2 ranks #1 as an agent (79.6%) but produces the worst environment simulation quality. Simulator choice significantly affects evaluation rankings, with pairwise agreement as low as 75%.

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains—from emergency department triage to nuclear reactor safety monitoring to customs import processing—yet existing benchmarks can only evaluate agents in the few domains where public environments exist. OccuBench is a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. A multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection. An evaluation of 15 frontier models across 8 model families reveals that no single model dominates all industries, implicit faults are harder than explicit errors, scaling consistently improves performance, and strong agents are not necessarily strong environment simulators.

Introduction

AI agents are increasingly expected to perform professional work across diverse occupational domains: triaging emergency patients, auditing financial reports, scheduling factory production lines, responding to network intrusions, processing customs declarations, and coordinating wildfire evacuations. These represent the highest-value applications of AI agent technology, where autonomous decision-making through multi-step tool use can augment or replace costly human expertise. However, a fundamental evaluation gap exists: the professional domains where agents would deliver the most value are precisely the domains where no benchmarks exist.

Can an agent triage patients in an emergency department? No public environment exists.
Can an agent manage a nuclear reactor safety alert? No benchmark covers this.
Can an agent process customs import declarations? No API is available.
Can an agent control greenhouse irrigation based on sensor data? No testbed exists.

The Untestable Majority

The professional domains where AI agents are most needed—healthcare, finance, legal, manufacturing, energy, governance, and logistics—are bound to enterprise systems with no public access. This makes benchmark construction impossible with traditional approaches.

Prohibitive Scaling Cost

Even within covered domains, each benchmark is constrained by its environment implementation. Adding a new domain requires deploying and configuring entire web applications or APIs—costs that scale linearly with domain count.

No Robustness Evaluation

Real-world environments are noisy: APIs time out, data arrives incomplete, services degrade silently. Yet existing benchmarks evaluate agents exclusively on the “happy path,” missing a critical dimension of deployment readiness.

The OccuBench Approach

The key insight is that the environment itself can be simulated by an LLM. A Language Environment Simulator (LES) takes an environment configuration—system prompt, tool schema, initial state, and state description—and generates realistic tool responses based on the LLM's pre-trained knowledge of domain-specific operational logic.

Based on LESs, OccuBench covers 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, with 382 evaluation instances. It evaluates agents on both task completion (multi-step decision-making across industries) and environmental robustness (performance under explicit errors, implicit data degradation, and mixed faults).

Language Environment Simulator

The core innovation of OccuBench is the Language Environment Simulator (LES)—a function that simulates domain-specific environments through LLM-driven tool response generation. The LES is formally defined as:

(s_{t+1}, o_{t+1}) = f_e(s_t, a_t; c)

What does this formula mean? Think of it like a turn-based game: the agent takes an action (a tool call like "check patient vitals"), the environment processes it and returns the next state plus an observation. The function f_e is the Language Environment Simulator—an LLM that plays the role of the environment. Instead of building a real hospital system or factory, you describe the rules in a prompt, and the LLM generates realistic responses. For example, if the agent calls get_patient_vitals(room_2), the LES generates a plausible JSON response with heart rate, blood pressure, etc., following the rules defined in the configuration c.

Here, c is the environment configuration (system prompt, tool schema, initial state, state description), s_t is the latent environment state maintained implicitly through the LLM's context window, a_t is the agent's action (a tool call), and o_{t+1} is the observation returned to the agent. Unlike traditional world models that learn from data, LESs leverage pre-trained knowledge of domain-specific operational logic.

LES evaluation loop diagram showing the interaction between Agent LLM, Language Environment Simulator, Verifier, History, and Configuration components — **Figure 1:** LES evaluation loop. The agent issues tool calls, the LES generates observations conditioned on its configuration and conversation history. The trajectory is scored by a rubric-based verifier.
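The loop in Figure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the OccuBench implementation: the class name, the prompt layout, and the `stub_llm` function (a deterministic stand-in for a real model call) are all hypothetical. It shows how the latent state s_t can live entirely in the accumulated history while each step returns only the observation o_{t+1}.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LanguageEnvironmentSimulator:
    """Sketch of f_e: maps (history, action; config) to the next observation."""
    config: dict                        # c: prompt, tool schema, initial state, state description
    llm: Callable[[str], str]           # any text-in/text-out model call (stubbed below)
    history: list = field(default_factory=list)  # carries the latent state s_t

    def step(self, action: dict) -> dict:
        prompt = json.dumps(
            {"config": self.config, "history": self.history, "action": action}
        )
        observation = json.loads(self.llm(prompt))           # o_{t+1}
        # s_{t+1} is never materialized; it is implied by the longer history.
        self.history.append({"action": action, "observation": observation})
        return observation


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for the simulator LLM so the sketch runs offline."""
    action = json.loads(prompt)["action"]
    if action["tool"] == "get_patient_vitals":
        return json.dumps(
            {"room": action["args"]["room"], "heart_rate": 88, "blood_pressure": "120/80"}
        )
    return json.dumps({"error": "unknown tool"})


les = LanguageEnvironmentSimulator(
    config={"system_prompt": "You simulate an emergency department system."},
    llm=stub_llm,
)
obs = les.step({"tool": "get_patient_vitals", "args": {"room": "room_2"}})
print(obs)
```

Swapping `stub_llm` for a real model call is the only change needed to turn the sketch into a live simulator; the verifier would then score `les.history` against the rubric.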

Environment Configuration

System Prompt

Defines the environment's behavioral rules, simulation logic, error handling protocols, and output format constraints. For example, a hotel revenue management environment specifies pricing rules, occupancy calculations, and metrics relationships.

Tool Schema

Defines the agent's action space as a set of callable functions with typed parameters and example outputs. Each environment contains 2–10 tools (median 5) reflecting realistic operational interfaces.

Initial State

A structured JSON object specifying the environment's starting conditions—for example, room inventory, patient queue, or network topology.

State Description

Semantic annotations for each state field, guiding the LLM to maintain causal consistency (e.g., “remaining inventory decreases after each booking”).
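As a rough illustration, the four configuration components could be held in a small container like the one below. The class, field names, tool names, and the hotel values are hypothetical; only the four-part structure and the 2–10 tool range come from the description above. The check that every state field has a description is an assumption added for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConfig:
    system_prompt: str        # behavioral rules and output format constraints
    tool_schema: list         # the agent's action space
    initial_state: dict       # starting conditions
    state_description: dict   # one semantic annotation per state field

    def validate(self) -> None:
        if not 2 <= len(self.tool_schema) <= 10:
            raise ValueError("expected 2-10 tools per environment")
        # Assumption for this sketch: every state field must be annotated.
        undocumented = set(self.initial_state) - set(self.state_description)
        if undocumented:
            raise ValueError(f"state fields lacking a description: {sorted(undocumented)}")


hotel = EnvironmentConfig(
    system_prompt="Simulate a hotel revenue management system. Reply with JSON only.",
    tool_schema=[
        {"name": "get_occupancy", "parameters": {"date": "string"}},
        {"name": "book_room", "parameters": {"room_type": "string", "nights": "integer"}},
    ],
    initial_state={"remaining_inventory": 40, "base_rate_usd": 120},
    state_description={
        "remaining_inventory": "decreases after each booking",
        "base_rate_usd": "nightly rate before dynamic adjustments",
    },
)
hotel.validate()  # raises ValueError if the configuration is malformed
```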

Why LLMs Work as Simulators

LLMs are effective environment simulators because: (1) Format priors—pre-training on API documentation provides strong priors for well-formatted tool responses. (2) Domain knowledge—LLMs encode operational logic for hundreds of professions. (3) Constraint enforcement—system prompts can impose state transition rules that maintain causal consistency.

Why is this approach powerful?

Traditional benchmarks for AI agents require building actual software environments—a real hospital management system, a real factory scheduling tool, a real customs processing API. This is why only a handful of domains (web browsing, coding) have proper benchmarks. The LES approach flips this: instead of engineering environments, you describe them in natural language. The LLM's pre-training on documentation for hundreds of industries means it already "knows" how an emergency department system should respond to a triage query, or how a logistics API should handle route optimization. This makes it possible to benchmark agents across 100 different professional domains—something that would cost millions of dollars with traditional approaches.

Multi-Agent Synthesis Pipeline

Each evaluation instance must satisfy four quality conditions: it must be solvable (a valid solution exists and is verified), verifiable (clear automated success criteria), discriminative (calibrated difficulty that distinguishes agent capabilities), and diverse (structural variation across instances).

✔ Solvable ✔ Verifiable ✔ Discriminative ✔ Diverse

For each scenario, the pipeline defines 16 non-overlapping sub-topics and constructs a professional reference document for each, covering domain terminology, workflows, state variables, edge cases, and constraints. A multi-agent pipeline powered by Gemini-3-Flash generates environment configurations, task instructions, tool definitions, solution plans, and verification rubrics. Quality filtering removes trivially easy (100% success), unsolvable (0% success), or invalid instances.
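The success-rate filter can be sketched as follows. The sketch assumes each candidate instance is attempted several times and that the outcomes are recorded as booleans; the number of trials, the models used for them, and the instance names are not given in the source and are hypothetical here.

```python
def keep_instance(outcomes: list[bool]) -> bool:
    """Keep an instance only if some, but not all, trial runs succeeded."""
    if not outcomes:
        return False  # no evidence of solvability: treat as invalid
    rate = sum(outcomes) / len(outcomes)
    return 0.0 < rate < 1.0


# Hypothetical trial outcomes for three candidate instances.
candidates = {
    "triage_017": [True, False, True, False],    # discriminative: kept
    "triage_018": [True, True, True, True],      # trivially easy: dropped
    "triage_019": [False, False, False, False],  # unsolved: dropped
}
kept = [name for name, runs in candidates.items() if keep_instance(runs)]
print(kept)  # ['triage_017']
```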

Why these four quality conditions matter

Solvable: Every test must have at least one correct solution—otherwise you cannot tell if an agent failed because it is incapable or because the test itself is impossible.
Verifiable: There must be clear, automated success criteria. In professional tasks, this is tricky—"good triage" is subjective, so rubrics must be carefully designed to make pass/fail unambiguous.
Discriminative: If every model scores 100% or 0%, the benchmark is useless. Tasks are calibrated so that the best models pass and weaker models fail, revealing meaningful differences.
Diverse: A benchmark with 100 variations of the same task type does not truly test breadth. Each instance must be structurally different—different tools, different domain knowledge, different failure modes.

OccuBench Benchmark

OccuBench covers 100 professional task scenarios across 10 industry categories and 65 specialized domains. Each scenario maps to a real human job role, ensuring evaluation results have direct practical relevance. After quality filtering, the benchmark contains 382 evaluation instances.

Category	#	Representative Scenarios
Business & Enterprise	19	Resume screening, expense auditing, AML review
Technology & IT	16	Linux ops, CI/CD recovery, intrusion response
Industrial & Engineering	12	Production scheduling, mine ventilation
Transportation & Logistics	11	Last-mile delivery, train dispatch
Commerce & Consumer	9	Dynamic pricing, hotel revenue mgmt.
Education & Culture	8	Adaptive curriculum, fact-checking
Healthcare & Life Sciences	7	Emergency triage, drug interaction screening
Public Service & Governance	7	Permit processing, wildfire evacuation
Agriculture & Environment	7	Irrigation control, crop disease diagnosis
Science & Research	4	Telescope scheduling, excavation planning

Environmental Fault Injection

Clean (E0)

No faults injected. Baseline performance measurement. All data is synthesized in clean environments.

Explicit Faults (E1)

The LES injects clearly visible error responses: HTTP 500, TimeoutError, ConnectionRefused, ServiceUnavailable. The agent knows the call failed. The correct behavior is to retry.

Implicit Faults (E2)

The LES returns degraded responses with no error signal: truncated data, missing fields, incomplete lists, or stale cached values. The response appears superficially correct. The agent must detect the quality issue and re-query.

Mixed (E3)

Approximately half explicit, half implicit faults. All faults are transient (retrying recovers normal results), parameterized by fault count and fault duration.
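The difference between explicit and implicit transient faults can be illustrated with a conventional wrapper around a tool function. In OccuBench the LES itself produces the degraded responses; the class below is only an analogue in ordinary code, and its names, the response shape, and the choice to measure fault duration in consecutive calls are assumptions.

```python
from typing import Callable


class FaultInjector:
    """Wraps a tool and degrades its responses during scheduled fault windows."""

    def __init__(self, tool: Callable[[], list], fault_starts: list[int],
                 duration: int, kind: str):
        self.tool = tool
        self.kind = kind  # "explicit" or "implicit"
        # Every call index covered by some fault window (count x duration).
        self.faulty_calls = {s + d for s in fault_starts for d in range(duration)}
        self.calls = 0

    def __call__(self) -> dict:
        index = self.calls
        self.calls += 1
        records = self.tool()
        if index not in self.faulty_calls:
            return {"status": 200, "records": records}
        if self.kind == "explicit":
            return {"status": 500, "error": "Internal Server Error"}
        # Implicit fault: looks like success, but the list is silently truncated.
        return {"status": 200, "records": records[:2]}


def list_units() -> list:
    return [f"unit_{i}" for i in range(15)]


tool = FaultInjector(list_units, fault_starts=[0], duration=2, kind="implicit")
sizes = [len(tool()["records"]) for _ in range(3)]
print(sizes)  # [2, 2, 15]
```

Because the faults are transient, the third call returns the full list; an agent that retries or re-queries recovers, while one that accepts the first response works from 2 of 15 records.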

Understanding fault injection levels

Real production systems do not work perfectly all the time. APIs time out, databases return stale data, services degrade silently. OccuBench tests whether AI agents can handle these situations, not just the "happy path."

The key insight is the difference between explicit faults (E1) and implicit faults (E2). When you get an HTTP 500 error, you know something went wrong—the fix is simple: retry. But when an API returns only 2 out of 15 data records with no error message, how would you know data is missing? This is why E2 is harder—the agent must independently recognize that something is off. In a real deployment, imagine an agent processing insurance claims that silently receives truncated patient records. It might approve claims based on incomplete information, a much more dangerous failure mode than a visible crash.

Cross-Industry Evaluation Results

The table below presents E0 (clean environment) completion rates across 10 industry categories for all 15 models evaluated. All models use thinking mode with reasoning effort set to high. Bold values indicate the best score in each category.

Model	Avg	Agri	Biz	Comm	Edu	Hlth	Ind	Pub	Sci	Tech	Trans
GPT-5.2	**79.6**	**84**	**86**	67	77	76	**85**	**84**	**94**	**80**	72
Gemini 3.1 Pro	72.3	68	73	75	**84**	62	73	72	81	78	60
Claude Opus 4.6	71.5	74	78	53	75	76	73	68	62	68	**77**
Qwen 3.5 Plus	69.9	77	70	**81**	56	**81**	71	76	69	74	55
DeepSeek V3.2	69.6	65	78	67	66	71	69	72	62	74	64
Claude Opus 4.5	65.2	58	76	56	62	52	65	72	56	68	66
Claude Sonnet 4.5	64.9	65	70	69	50	71	71	60	44	68	62
Claude Sonnet 4.6	64.4	58	71	64	69	67	64	64	69	64	57
Kimi K2.5	64.1	68	62	56	62	**81**	62	72	56	74	57
GLM-5	62.6	55	75	67	53	57	56	68	62	70	55
Claude Opus 4	61.3	52	75	50	53	57	58	76	81	66	51
Gemini 3.1 FL	61.3	68	70	58	53	67	58	68	62	68	45
Qwen 3.5 Flash	59.7	61	60	67	53	76	53	68	69	60	51
MiniMax M2.7	53.9	48	60	56	31	57	60	60	62	64	40
Claude Sonnet 4	53.4	35	63	61	38	57	51	76	31	60	47

No single model dominates all industries. GPT-5.2 leads overall (79.6%) with the highest scores in six categories (Agriculture, Business, Industrial, Public Service, Science, and Technology), but its Commerce score (67%) is far below that of Qwen 3.5 Plus (81%). Open-source models are highly competitive: Qwen 3.5 Plus (69.9%) and DeepSeek V3.2 (69.6%) outperform most Claude variants, challenging the assumption that closed-source models uniformly outperform open-source alternatives.
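The per-category leaders can be read off the table mechanically. The short script below copies the E0 scores from the table above and lists the top-scoring model (or models, on a tie) for each category; the variable names are arbitrary.

```python
CATEGORIES = ["Agri", "Biz", "Comm", "Edu", "Hlth", "Ind", "Pub", "Sci", "Tech", "Trans"]
SCORES = {  # E0 completion rates (%) copied from the table above
    "GPT-5.2":           [84, 86, 67, 77, 76, 85, 84, 94, 80, 72],
    "Gemini 3.1 Pro":    [68, 73, 75, 84, 62, 73, 72, 81, 78, 60],
    "Claude Opus 4.6":   [74, 78, 53, 75, 76, 73, 68, 62, 68, 77],
    "Qwen 3.5 Plus":     [77, 70, 81, 56, 81, 71, 76, 69, 74, 55],
    "DeepSeek V3.2":     [65, 78, 67, 66, 71, 69, 72, 62, 74, 64],
    "Claude Opus 4.5":   [58, 76, 56, 62, 52, 65, 72, 56, 68, 66],
    "Claude Sonnet 4.5": [65, 70, 69, 50, 71, 71, 60, 44, 68, 62],
    "Claude Sonnet 4.6": [58, 71, 64, 69, 67, 64, 64, 69, 64, 57],
    "Kimi K2.5":         [68, 62, 56, 62, 81, 62, 72, 56, 74, 57],
    "GLM-5":             [55, 75, 67, 53, 57, 56, 68, 62, 70, 55],
    "Claude Opus 4":     [52, 75, 50, 53, 57, 58, 76, 81, 66, 51],
    "Gemini 3.1 FL":     [68, 70, 58, 53, 67, 58, 68, 62, 68, 45],
    "Qwen 3.5 Flash":    [61, 60, 67, 53, 76, 53, 68, 69, 60, 51],
    "MiniMax M2.7":      [48, 60, 56, 31, 57, 60, 60, 62, 64, 40],
    "Claude Sonnet 4":   [35, 63, 61, 38, 57, 51, 76, 31, 60, 47],
}

leaders = {}
for i, category in enumerate(CATEGORIES):
    best = max(row[i] for row in SCORES.values())
    leaders[category] = sorted(m for m, row in SCORES.items() if row[i] == best)

for category, models in leaders.items():
    print(f"{category}: {', '.join(models)}")
```

Five different models appear as a category leader, including a tie in Healthcare between Kimi K2.5 and Qwen 3.5 Plus.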

**Figure 2:** Radar chart showing model performance profiles across 10 industry categories (E0). Each model has a distinct shape, indicating different occupational specializations.

Environmental Robustness

Even with only 2 fault events of 2 rounds each, performance drops substantially: the average completion rate falls from 67.5% (E0) to 53.4% (E2), a 14.1-point decline. Gemini 3.1 Pro and MiniMax M2.7 share the highest robustness score (0.87), while Kimi K2.5 shows the lowest (0.63).

Model	E0	E1	E2	E3	Rob.
Gemini 3.1 Pro	72.3	73.3	63.1	65.2	0.87
MiniMax M2.7	53.9	52.9	47.1	46.9	0.87
GPT-5.2	79.6	75.9	70.4	67.0	0.84
GLM-5	62.6	59.4	52.6	47.4	0.76
Claude Opus 4.6	71.5	68.1	53.9	63.9	0.75
DeepSeek V3.2	69.6	59.9	56.0	51.6	0.74
Qwen 3.5 Plus	69.9	61.0	51.6	54.2	0.74
Claude Sonnet 4.6	64.4	62.8	45.0	52.9	0.70
Kimi K2.5	64.1	50.0	40.6	40.1	0.63
Average	67.5	62.6	53.4	54.4	0.77

Bar chart comparing E0, E1, E2, E3 completion rates across models — **Figure 3:** Completion rates under clean (E0) and fault-injected (E1–E3) environments. Implicit faults (E2, red) cause the largest drops.

The Robustness Score explained

The Robustness Score (R) measures how well an agent maintains performance under adverse conditions. It is calculated as: R = min(CR_E1, CR_E2, CR_E3) / CR_E0, where CR is the completion rate under each fault condition. A score of 1.0 means no degradation at all (unlikely in practice), while a low score like 0.63 means the agent loses over a third of its capability when things go wrong.

Why use the minimum across fault types? Because a system is only as reliable as its weakest link. An agent that handles explicit errors perfectly (E1) but collapses under implicit faults (E2) is not truly robust—the robustness score captures this worst-case perspective.
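The formula is easy to check against the table above. The snippet below recomputes R for each of the nine models from the reported completion rates and also counts how many models score lower under E2 than under E3; only the variable names are introduced here.

```python
RATES = {  # model: (E0, E1, E2, E3) completion rates in percent, from the table above
    "Gemini 3.1 Pro":    (72.3, 73.3, 63.1, 65.2),
    "MiniMax M2.7":      (53.9, 52.9, 47.1, 46.9),
    "GPT-5.2":           (79.6, 75.9, 70.4, 67.0),
    "GLM-5":             (62.6, 59.4, 52.6, 47.4),
    "Claude Opus 4.6":   (71.5, 68.1, 53.9, 63.9),
    "DeepSeek V3.2":     (69.6, 59.9, 56.0, 51.6),
    "Qwen 3.5 Plus":     (69.9, 61.0, 51.6, 54.2),
    "Claude Sonnet 4.6": (64.4, 62.8, 45.0, 52.9),
    "Kimi K2.5":         (64.1, 50.0, 40.6, 40.1),
}


def robustness(e0: float, e1: float, e2: float, e3: float) -> float:
    """R = min(CR_E1, CR_E2, CR_E3) / CR_E0."""
    return min(e1, e2, e3) / e0


scores = {model: round(robustness(*rates), 2) for model, rates in RATES.items()}
worse_under_e2 = [m for m, (_, _, e2, e3) in RATES.items() if e2 < e3]

for model, r in scores.items():
    print(f"{model}: {r:.2f}")
print(len(worse_under_e2), "models score lower under E2 than E3")
```

The recomputed values match the Rob. column, and the worst case is E2 for some models and E3 for others, which is why the minimum is taken per model.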

On average, implicit faults (E2, 53.4%) are harder than both explicit faults (E1, 62.6%) and mixed faults (E3, 54.4%). Counter-intuitively, 4 of the 9 models score lower under purely implicit faults (E2) than under mixed faults (E3). Explicit errors (timeouts, HTTP 500) provide unambiguous failure signals that prompt a retry, while implicit faults (truncated data, missing fields) require the agent to independently detect that something is wrong, a fundamentally harder capability.

Line charts showing fault sensitivity analysis with varying fault count and duration — **Figure 4:** Fault parameter ablation under E3 mixed faults. Performance declines as both fault count and fault duration increase. Claude Opus 4.6 degrades more gracefully than Qwen 3.5+.

Scaling & Reasoning Analysis

Model Scaling

Larger models consistently outperform smaller counterparts within every model family. The performance gaps range from 11.0 points (Gemini 3.1 Pro vs. Flash-Lite) to 0.3 points (Claude Opus 4.5 vs. Sonnet 4.5). The Claude 4.5 near-parity is notable, suggesting that generation's scaling benefits were minimal.

Bar chart comparing large vs small model variants — **Figure 5:** Large vs. small model variants within each family (E0). Gaps range from 0.3% to 11.0%.

Generational Progress

Claude Opus shows consistent generational improvement: 61.3% (v4) → 65.2% (v4.5) → 71.5% (v4.6), a total gain of +10.2 points. Sonnet shows a large jump from v4 to v4.5 (+11.5 points) but a slight regression from v4.5 to v4.6 (−0.5 points), possibly reflecting a trade-off between reasoning depth and execution efficiency in the 4.6 adaptive thinking architecture.

Generational progress tracks how the same model family improves across version releases. For Claude, the Opus line improved steadily (+10.2 points over three generations), showing that investment in core capabilities pays off. But the Sonnet line had a slight regression from v4.5 to v4.6, possibly because the new "adaptive thinking" architecture trades raw execution speed for deeper reasoning—a tradeoff that does not always help on these task-oriented benchmarks.

Reasoning Effort

Higher reasoning effort generally leads to better agent performance. GPT-5.2 exhibits a clear monotonic trend: scaling from none (54.7%) to xhigh (82.2%), a 27.5-point improvement. Claude Opus 4.6 shows a similar overall trend, with max effort (73.8%) outperforming low (70.2%) by 3.6 points. Deeper reasoning directly translates to better task execution on professional tasks.

Simulator Quality

A critical finding is that strong agents are not necessarily strong environment simulators. GPT-5.2 ranks #1 as an agent but produces the worst simulation quality. When using a sufficiently capable simulator, pairwise ranking agreement reaches 85.7% (Gemini Flash vs. Qwen 3.5 Plus), but drops to 75% when GPT-5.2 serves as the simulator.

Why "Strong Agent ≠ Strong Simulator" matters

This is one of the paper's most surprising findings. You might expect that the best AI model would also be the best at simulating environments. But GPT-5.2, which ranks #1 as an agent, produces the worst simulation quality. The paper identifies three distinct failure modes:

State fabrication: The simulator invents rooms, resources, or entities that do not exist in the specification. For example, GPT-5.2 as a simulator created two extra empty hospital rooms that were not in the scenario, leading the agent to use them and fail the rubric.
Entity omission: The simulator drops critical information from its responses. In one case, it omitted a database specialist from a roster query, making the correct escalation path impossible.
Rule fabrication: The simulator independently invents business rules (like a return window expiration) that were not in the task specification.

The practical takeaway: if you are building an LES-based evaluation system, do not assume your best model should also serve as the environment. Use a separate, validated model for simulation.

Heatmap showing pairwise ranking agreement across simulators — **Figure 8:** Pairwise ranking agreement across three environment simulators. GPT-5.2 as simulator produces the lowest agreement, suggesting it generates overly difficult or inconsistent environments.

This has important implications for LES-based evaluation: the simulator is part of the evaluation apparatus, not a neutral observer. Simulator selection must be carefully validated, and cross-simulator consistency checks are essential for reliable benchmarking.
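One plausible way to run such a cross-simulator consistency check is sketched below. The source does not give the exact definition of pairwise ranking agreement, so this version assumes it is the fraction of agent pairs that two simulators order the same way; the scores are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(scores_a: dict, scores_b: dict) -> float:
    """Fraction of agent pairs that two simulators rank in the same order.

    Assumed definition for illustration; a tie under either simulator
    counts as a disagreement.
    """
    agents = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(agents, 2))
    same = sum(
        (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y]) > 0
        for x, y in pairs
    )
    return same / len(pairs)


# Hypothetical completion rates for four agents under two simulators.
sim_a = {"m1": 80, "m2": 70, "m3": 60, "m4": 50}
sim_b = {"m1": 75, "m2": 55, "m3": 65, "m4": 40}
agreement = pairwise_agreement(sim_a, sim_b)
print(f"{agreement:.3f}")  # 0.833: only the (m2, m3) pair flips
```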

Industry Analysis

Industries vary dramatically in difficulty. Averaged across models, Business (70.1%) is the easiest category and Transportation (56.2%) the hardest, a gap of about 14 points. This reveals that industry context significantly affects agent performance, and single-domain benchmarks cannot capture this variation.

Horizontal bar chart showing industry difficulty rankings — **Figure 9:** Industry difficulty ranked by average performance across 14 models. Business is easiest, Transportation is hardest.
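The category averages can be recomputed from the E0 table. The source does not say which 14 of the 15 models Figure 9 covers; the sketch below simply compares the mean over all 15 rows with the mean over the 14 rows other than GPT-5.2. The latter reproduces the reported 70.1% and 56.2%, which is an observation from the table rather than something the source states.

```python
from statistics import mean

BIZ_TRANS = {  # model: (Business, Transportation) E0 scores from the results table
    "GPT-5.2": (86, 72), "Gemini 3.1 Pro": (73, 60), "Claude Opus 4.6": (78, 77),
    "Qwen 3.5 Plus": (70, 55), "DeepSeek V3.2": (78, 64), "Claude Opus 4.5": (76, 66),
    "Claude Sonnet 4.5": (70, 62), "Claude Sonnet 4.6": (71, 57), "Kimi K2.5": (62, 57),
    "GLM-5": (75, 55), "Claude Opus 4": (75, 51), "Gemini 3.1 FL": (70, 45),
    "Qwen 3.5 Flash": (60, 51), "MiniMax M2.7": (60, 40), "Claude Sonnet 4": (63, 47),
}


def category_means(rows: dict) -> tuple:
    """Mean Business and Transportation scores, rounded to one decimal."""
    biz = mean(b for b, _ in rows.values())
    trans = mean(t for _, t in rows.values())
    return round(biz, 1), round(trans, 1)


all_15 = category_means(BIZ_TRANS)
without_gpt = category_means({m: r for m, r in BIZ_TRANS.items() if m != "GPT-5.2"})
print(all_15)       # (71.1, 57.3)
print(without_gpt)  # (70.1, 56.2)
```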

Model-Industry Specialization

Different models excel in different industry profiles: Gemini 3.1 Pro excels in knowledge-intensive domains (Education 84%, Science 81%, Technology 78%). Claude Opus 4.6 excels in operational domains (Transportation 77%, Business 78%). Qwen 3.5 Plus excels in consumer-facing domains (Commerce 81%, Healthcare 81%). Organizations should select agent models based on their specific industry, not solely on aggregate rankings.

Practical implications for organizations

This finding has direct business value. If you are deploying AI agents in your organization, the results suggest you should not pick one model for everything. Instead:

For knowledge-heavy tasks (education, research, tech support): Consider Gemini 3.1 Pro, which excels at factual accuracy and structured knowledge retrieval.
For operational tasks (logistics, business operations, manufacturing): Consider Claude Opus 4.6, which excels at careful state tracking and multi-step procedural execution.
For consumer-facing tasks (e-commerce, healthcare intake, agriculture): Consider Qwen 3.5 Plus, which may benefit from diverse pre-training data.

The 14-point gap between the easiest (Business, 70.1%) and hardest (Transportation, 56.2%) industries also suggests that difficulty varies enormously by domain—so benchmarking on a single domain gives a misleading picture of real-world readiness.

Case Studies

The following case studies illustrate how OccuBench reveals specific agent capabilities and failure modes through realistic professional task scenarios.

Last-Mile Delivery Routing — Proactive Constraint Checking

A delivery agent must identify the highest-priority medical shipment and deliver it while maintaining battery above 15%. Claude Opus 4.6 (PASS): Recognized that 28% battery was risky, recharged before navigating, arrived with 82% battery. DeepSeek V3.2 (FAIL): Navigated immediately, battery dropped to 12.5%, violating the constraint. The recharge came too late.

Key insight: The critical differentiator is whether the agent proactively checks constraints before acting, rather than reactively fixing violations.
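A proactive check of this kind can be expressed as a simple planning rule. The function below is an illustrative sketch, not the agent's actual logic: the action names are hypothetical, and the 15.5-point trip cost is inferred from the failing run (28% down to 12.5%) under the assumption that the trip consumes a fixed amount of charge.

```python
MIN_BATTERY = 15.0  # percent, from the task constraint


def plan_route(battery: float, trip_cost: float, full_charge: float = 100.0) -> list[str]:
    """Return an action plan that never lets the battery fall below MIN_BATTERY."""
    if battery - trip_cost >= MIN_BATTERY:
        return ["navigate"]
    if full_charge - trip_cost < MIN_BATTERY:
        raise ValueError("trip is infeasible even on a full charge")
    return ["recharge", "navigate"]


plan = plan_route(battery=28.0, trip_cost=15.5)
print(plan)  # ['recharge', 'navigate']
```

The point of the sketch is the ordering: the constraint is evaluated against the projected battery level before the first action, not against the current level after the fact.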

Fish Farm Water Quality — Verification Gaps

An agent must detect thermal stratification in a fish farm and take corrective actions. The agent successfully profiled water quality, detected the problem (2.1°C gradient, low dissolved oxygen at bottom), activated mixing, and reduced feeding. However, it failed to re-check ammonia chemistry after corrections, claiming “ammonia remained low” without supporting evidence.

Key insight: Agents can execute correct actions but skip critical verification steps, making claims without evidence—a dangerous pattern in safety-critical domains.

Building Inspection — Regulatory Compliance Ordering

An agent must inspect a building's gas system following NFPA 99 compliance procedures. The agent performed brazing without valid permits, skipped permit renewal during the work, submitted final certification with the oxygen valve still closed, and renewed permits only after all work was complete—too late for compliance.

Key insight: Procedural ordering matters in regulated domains. The agent completed all required actions but in the wrong sequence, resulting in compliance failure.
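Ordering requirements like these can be checked mechanically over a trajectory. The sketch below is one possible rubric-style check, not the OccuBench verifier; the action names and prerequisite rules are hypothetical stand-ins for the steps described above.

```python
def ordering_violations(trajectory: list[str],
                        prerequisites: dict[str, list[str]]) -> list[str]:
    """List actions executed before all of their prerequisites had occurred."""
    seen = set()
    violations = []
    for action in trajectory:
        missing = [p for p in prerequisites.get(action, []) if p not in seen]
        if missing:
            violations.append(f"{action} before {', '.join(missing)}")
        seen.add(action)
    return violations


# Hypothetical rules: permits precede brazing; the valve is open before certification.
RULES = {
    "braze_joint": ["renew_permit"],
    "submit_certification": ["open_oxygen_valve"],
}
failed_run = ["braze_joint", "submit_certification", "open_oxygen_valve", "renew_permit"]
passing_run = ["renew_permit", "braze_joint", "open_oxygen_valve", "submit_certification"]

print(ordering_violations(failed_run, RULES))
print(ordering_violations(passing_run, RULES))  # []
```

Both runs contain the same four actions; only the sequence separates a compliant trajectory from a failing one.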

Fault Resilience: Explicit Faults (E1)

Under E1 fault injection on a public transit task, Kimi K2.5 stopped after encountering a single HTTP 500 error, completing only 1 of 4 required actions (RTPI suppression). It did not retry the failed call, resolve maintenance holds, or reassign the bus to its new route—demonstrating that some agents give up entirely instead of retrying when faced with explicit errors.
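The expected behavior under explicit faults is a bounded retry. A minimal sketch follows, assuming tool responses carry an HTTP-style `status` field; the `FlakyTool` class and the response contents are hypothetical.

```python
import time

TRANSIENT_STATUSES = {500, 503}


def call_with_retry(tool, max_attempts: int = 3, backoff_s: float = 0.0) -> dict:
    """Call `tool` until it returns a non-transient status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = tool()
        if response.get("status") not in TRANSIENT_STATUSES:
            return response
        if attempt < max_attempts:
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
    return response  # still failing: surface the last error to the caller


class FlakyTool:
    """Hypothetical tool that fails a fixed number of times, then recovers."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self) -> dict:
        self.calls += 1
        if self.calls <= self.failures:
            return {"status": 500, "error": "Internal Server Error"}
        return {"status": 200, "result": "maintenance hold resolved"}


tool = FlakyTool(failures=2)
result = call_with_retry(tool)
print(result["status"], tool.calls)  # 200 3
```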

Fault Resilience: Implicit Faults (E2)

Under E2 implicit fault injection, Kimi K2.5 received truncated property data (only 2 of 15 units returned). Instead of detecting the incomplete data and re-querying, it assumed all 15 units followed the same pattern as the 2 returned units. This led to an incorrect net operating income (NOI) figure of $362,000 and a wrong debt service coverage ratio (DSCR) assessment (PASS instead of FAIL), demonstrating the danger of accepting degraded data at face value.
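A completeness check before computing anything downstream would have caught this case. The sketch below assumes the agent knows the expected record count from the task context (15 units here); the `TruncatingSource` class and the record fields are hypothetical.

```python
class TruncatingSource:
    """Hypothetical tool that silently truncates its first response."""

    def __init__(self, units: list):
        self.units = units
        self.calls = 0

    def __call__(self) -> list:
        self.calls += 1
        return self.units[:2] if self.calls == 1 else list(self.units)


def fetch_complete(fetch, expected_count: int, max_attempts: int = 3) -> list:
    """Re-query until the response contains the expected number of records."""
    for _ in range(max_attempts):
        records = fetch()
        if len(records) >= expected_count:
            return records
    raise RuntimeError(
        f"got {len(records)} of {expected_count} records after {max_attempts} attempts"
    )


units = [{"unit": i, "monthly_rent": 1000 + 10 * i} for i in range(15)]
source = TruncatingSource(units)
records = fetch_complete(source, expected_count=15)
print(len(records), source.calls)  # 15 2
```

Raising an error when the data never becomes complete is deliberate: reporting that the inputs are unreliable is safer than extrapolating from 2 of 15 units.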

Discussion & Conclusion

Limitations

Simulation fidelity: Language Environment Simulators model domain logic rather than domain data. An LES understands that a drug interaction check should return contraindications, but the specific values are generated rather than retrieved from a real database. OccuBench evaluates an agent's decision-making process rather than its ability to handle exact real-world data values.

Simulator dependence: Evaluation results are tied to the specific simulator used during data synthesis. Tasks verified as solvable under one LES may become unsolvable under a different one, and agent rankings can shift when the simulator changes. The simulator is part of the evaluation apparatus, not a neutral observer.

Conclusion

OccuBench is the first benchmark systematically evaluating AI agents on real-world professional tasks across 100 scenarios, 65 specialized domains, and 10 industry categories. Through Language Environment Simulators, OccuBench makes the “untestable majority” of professional domains evaluable without any real environment infrastructure.

The evaluation of 15 frontier models reveals that: (1) no model dominates across all industries; (2) implicit environmental faults are harder than explicit and mixed faults; (3) scaling consistently improves performance; and (4) strong agents are not necessarily strong environment simulators. These findings have practical implications: organizations should select agent models based on their specific industry needs, invest in robustness testing beyond happy-path evaluation, and carefully validate simulator quality in LES-based benchmarks.