
OccuBench

Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho

Qwen Team, Alibaba Group · The Chinese University of Hong Kong

Key Findings

100 Professional Task Scenarios
10 Industry Categories
15 Frontier Models Evaluated
382 Evaluation Instances

Finding 1

No Single Model Dominates

Each model has a distinct capability profile across industries. GPT-5.2 leads overall at 79.6%, but Gemini 3.1 Pro tops Education (84%) and Claude Opus 4.6 excels in Transportation (77%). No single model is the best choice for every domain.

Finding 2

Implicit Faults Are Hardest

Implicit faults (truncated data, missing fields) hurt more than explicit errors such as HTTP 500s: average completion falls to 53.4% under implicit faults versus 62.6% under explicit ones, from a clean baseline of 67.5%. Without clear error signals, agents fail to detect degraded data and make decisions on incomplete information.

Finding 3

Scaling Consistently Helps

Larger models, newer generations, and higher reasoning effort all improve performance. GPT-5.2 gains a dramatic 27.5 points when scaling from minimal to maximum reasoning effort.

Finding 4

Strong Agents ≠ Strong Simulators

GPT-5.2 ranks #1 as an agent (79.6%) but produces the worst environment simulation quality. Simulator choice significantly affects evaluation rankings, with pairwise agreement as low as 75%.

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains—from emergency department triage to nuclear reactor safety monitoring to customs import processing—yet existing benchmarks can only evaluate agents in the few domains where public environments exist. OccuBench is a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. A multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection. An evaluation of 15 frontier models across 8 model families reveals that no single model dominates all industries, implicit faults are harder than explicit errors, scaling consistently improves performance, and strong agents are not necessarily strong environment simulators.

Introduction

AI agents are increasingly expected to perform professional work across diverse occupational domains: triaging emergency patients, auditing financial reports, scheduling factory production lines, responding to network intrusions, processing customs declarations, and coordinating wildfire evacuations. These represent the highest-value applications of AI agent technology, where autonomous decision-making through multi-step tool use can augment or replace costly human expertise. However, a fundamental evaluation gap exists: the professional domains where agents would deliver the most value are precisely the domains where no benchmarks exist.

  • Can an agent triage patients in an emergency department? No public environment exists.
  • Can an agent manage a nuclear reactor safety alert? No benchmark covers this.
  • Can an agent process customs import declarations? No API is available.
  • Can an agent control greenhouse irrigation based on sensor data? No testbed exists.

The Untestable Majority

The professional domains where AI agents are most needed—healthcare, finance, legal, manufacturing, energy, governance, and logistics—are bound to enterprise systems with no public access. This makes benchmark construction impossible with traditional approaches.

Prohibitive Scaling Cost

Even within covered domains, each benchmark is constrained by its environment implementation. Adding a new domain requires deploying and configuring entire web applications or APIs—costs that scale linearly with domain count.

No Robustness Evaluation

Real-world environments are noisy: APIs time out, data arrives incomplete, services degrade silently. Yet existing benchmarks evaluate agents exclusively on the “happy path,” missing a critical dimension of deployment readiness.

The OccuBench Approach

The key insight is that the environment itself can be simulated by an LLM. A Language Environment Simulator (LES) takes an environment configuration—system prompt, tool schema, initial state, and state description—and generates realistic tool responses based on the LLM's pre-trained knowledge of domain-specific operational logic.

Based on LESs, OccuBench covers 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, with 382 evaluation instances. It evaluates agents on both task completion (multi-step decision-making across industries) and environmental robustness (performance under explicit errors, implicit data degradation, and mixed faults).

Language Environment Simulator

The core innovation of OccuBench is the Language Environment Simulator (LES)—a function that simulates domain-specific environments through LLM-driven tool response generation. The LES is formally defined as:

$$(s_{t+1}, o_{t+1}) = f_e(s_t, a_t; c)$$
What does this formula mean?

Think of it like a turn-based game: the agent takes an action (a tool call like "check patient vitals"), the environment processes it and returns the next state plus an observation. The function $f_e$ is the Language Environment Simulator, an LLM that plays the role of the environment. Instead of building a real hospital system or factory, you describe the rules in a prompt, and the LLM generates realistic responses. For example, if the agent calls get_patient_vitals(room_2), the LES generates a plausible JSON response with heart rate, blood pressure, etc., following the rules defined in the configuration $c$.

Here, $c$ is the environment configuration (system prompt, tool schema, initial state, state description), $s_t$ is the latent environment state maintained implicitly through the LLM's context window, $a_t$ is the agent's action (a tool call), and $o_{t+1}$ is the observation returned to the agent. Unlike traditional world models that learn from data, LESs leverage pre-trained knowledge of domain-specific operational logic.
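The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name, the prompt layout, and the `fake_llm` stub (with its made-up vitals) are all assumptions introduced here so the example runs offline.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type: any callable that maps a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class LanguageEnvironmentSimulator:
    """Illustrative f_e: the state s_t lives only in the running history."""
    config: dict                                  # c: prompt, tools, initial state, descriptions
    llm: LLM                                      # stand-in for a real model call
    history: list = field(default_factory=list)

    def step(self, action: dict) -> dict:
        """Apply action a_t and return observation o_{t+1}."""
        prompt = json.dumps(
            {"config": self.config, "history": self.history, "action": action}
        )
        observation = json.loads(self.llm(prompt))
        # Appending to the history is what implicitly advances s_t to s_{t+1}.
        self.history.append({"action": action, "observation": observation})
        return observation


def fake_llm(prompt: str) -> str:
    """Canned reply with invented values; a real LES would call a model here."""
    return json.dumps({"room": "room_2", "heart_rate": 88, "blood_pressure": "120/80"})


les = LanguageEnvironmentSimulator(
    config={"system_prompt": "You simulate an emergency department.", "initial_state": {}},
    llm=fake_llm,
)
obs = les.step({"tool": "get_patient_vitals", "arguments": {"room": "room_2"}})
```

The point of the sketch is that no explicit state object is updated: the next state exists only as the history that the next prompt will contain.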

LES evaluation loop diagram showing the interaction between Agent LLM, Language Environment Simulator, Verifier, History, and Configuration components
Figure 1: LES evaluation loop. The agent issues tool calls, the LES generates observations conditioned on its configuration and conversation history. The trajectory is scored by a rubric-based verifier.

Environment Configuration

System Prompt

Defines the environment's behavioral rules, simulation logic, error handling protocols, and output format constraints. For example, a hotel revenue management environment specifies pricing rules, occupancy calculations, and metrics relationships.

Tool Schema

Defines the agent's action space as a set of callable functions with typed parameters and example outputs. Each environment contains 2–10 tools (median 5) reflecting realistic operational interfaces.

Initial State

A structured JSON object specifying the environment's starting conditions—for example, room inventory, patient queue, or network topology.

State Description

Semantic annotations for each state field, guiding the LLM to maintain causal consistency (e.g., “remaining inventory decreases after each booking”).
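The four configuration parts could be held in a small container like the one below. This is a sketch under stated assumptions: the class, its `validate` method, and the hotel tool names are hypothetical, and the 2 to 10 tool range is simply the range reported above, used here as a sanity check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConfig:
    system_prompt: str        # behavioral rules and output format
    tool_schema: list         # callable tools with typed parameters
    initial_state: dict       # structured starting conditions
    state_description: dict   # semantic annotation per state field

    def validate(self) -> None:
        # The reported environments expose between 2 and 10 tools.
        if not 2 <= len(self.tool_schema) <= 10:
            raise ValueError("expected between 2 and 10 tools")
        # Every state field should carry an annotation for causal consistency.
        missing = set(self.initial_state) - set(self.state_description)
        if missing:
            raise ValueError(f"state fields without a description: {sorted(missing)}")


# Hypothetical hotel revenue management environment.
hotel = EnvironmentConfig(
    system_prompt="Simulate a hotel revenue management system.",
    tool_schema=[
        {"name": "check_availability", "parameters": {"date": "string"}},
        {"name": "book_room", "parameters": {"date": "string", "room_type": "string"}},
    ],
    initial_state={"remaining_inventory": 40},
    state_description={"remaining_inventory": "decreases after each booking"},
)
hotel.validate()
```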

Why LLMs Work as Simulators

LLMs are effective environment simulators because: (1) Format priors—pre-training on API documentation provides strong priors for well-formatted tool responses. (2) Domain knowledge—LLMs encode operational logic for hundreds of professions. (3) Constraint enforcement—system prompts can impose state transition rules that maintain causal consistency.

Why is this approach powerful?

Traditional benchmarks for AI agents require building actual software environments—a real hospital management system, a real factory scheduling tool, a real customs processing API. This is why only a handful of domains (web browsing, coding) have proper benchmarks. The LES approach flips this: instead of engineering environments, you describe them in natural language. The LLM's pre-training on documentation for hundreds of industries means it already "knows" how an emergency department system should respond to a triage query, or how a logistics API should handle route optimization. This makes it possible to benchmark agents across 100 different professional domains—something that would cost millions of dollars with traditional approaches.

Multi-Agent Synthesis Pipeline

Each evaluation instance must satisfy four quality conditions: it must be solvable (a valid solution exists and is verified), verifiable (clear automated success criteria), discriminative (calibrated difficulty that distinguishes agent capabilities), and diverse (structural variation across instances).

✔ Solvable ✔ Verifiable ✔ Discriminative ✔ Diverse

For each scenario, the pipeline defines 16 non-overlapping sub-topics and constructs a professional reference document for each one, covering domain terminology, workflows, state variables, edge cases, and constraints. A multi-agent pipeline powered by Gemini-3-Flash then generates environment configurations, task instructions, tool definitions, solution plans, and verification rubrics. Quality filtering removes trivially easy (100% success), unsolvable (0% success), or invalid instances.
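The success-rate filter can be illustrated as follows. The source does not say how success rates are measured, so this sketch assumes each candidate is attempted several times and keeps only those that succeed on some but not all attempts; the function name and the instance names are hypothetical.

```python
def keep_instance(outcomes: list) -> bool:
    """Keep an instance only if some, but not all, trial runs succeed."""
    if not outcomes:
        return False  # no runs at all: treated here as an invalid instance
    success_rate = sum(outcomes) / len(outcomes)
    return 0.0 < success_rate < 1.0


# Made-up trial outcomes for three hypothetical candidate instances.
candidates = {
    "triage_01": [True, True, True, True],      # trivially easy, dropped
    "triage_02": [False, False, False, False],  # unsolvable, dropped
    "triage_03": [True, False, True, False],    # discriminative, kept
}
kept = [name for name, runs in candidates.items() if keep_instance(runs)]
```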

Why these four quality conditions matter

  • Solvable: Every test must have at least one correct solution—otherwise you cannot tell if an agent failed because it is incapable or because the test itself is impossible.
  • Verifiable: There must be clear, automated success criteria. In professional tasks, this is tricky—"good triage" is subjective, so rubrics must be carefully designed to make pass/fail unambiguous.
  • Discriminative: If every model scores 100% or 0%, the benchmark is useless. Tasks are calibrated so that the best models pass and weaker models fail, revealing meaningful differences.
  • Diverse: A benchmark with 100 variations of the same task type does not truly test breadth. Each instance must be structurally different—different tools, different domain knowledge, different failure modes.

OccuBench Benchmark

OccuBench covers 100 professional task scenarios across 10 industry categories and 65 specialized domains. Each scenario maps to a real human job role, ensuring evaluation results have direct practical relevance. After quality filtering, the benchmark contains 382 evaluation instances.

Category | # | Representative Scenarios
Business & Enterprise | 19 | Resume screening, expense auditing, AML review
Technology & IT | 16 | Linux ops, CI/CD recovery, intrusion response
Industrial & Engineering | 12 | Production scheduling, mine ventilation
Transportation & Logistics | 11 | Last-mile delivery, train dispatch
Commerce & Consumer | 9 | Dynamic pricing, hotel revenue mgmt.
Education & Culture | 8 | Adaptive curriculum, fact-checking
Healthcare & Life Sciences | 7 | Emergency triage, drug interaction screening
Public Service & Governance | 7 | Permit processing, wildfire evacuation
Agriculture & Environment | 7 | Irrigation control, crop disease diagnosis
Science & Research | 4 | Telescope scheduling, excavation planning

Environmental Fault Injection

E0

Clean

No faults injected. Baseline performance measurement. All data is synthesized in clean environments.

E1

Explicit Faults

The LES injects clearly visible error responses: HTTP 500, TimeoutError, ConnectionRefused, ServiceUnavailable. The agent knows the call failed. The correct behavior is to retry.

E2

Implicit Faults

The LES returns degraded responses with no error signal: truncated data, missing fields, incomplete lists, or stale cached values. The response appears superficially correct. The agent must detect the quality issue and re-query.

E3

Mixed

Approximately half explicit, half implicit faults. All faults are transient (retrying recovers normal results), parameterized by fault count and fault duration.

Understanding fault injection levels

Real production systems do not work perfectly all the time. APIs time out, databases return stale data, services degrade silently. OccuBench tests whether AI agents can handle these situations, not just the "happy path."

The key insight is the difference between explicit faults (E1) and implicit faults (E2). When you get an HTTP 500 error, you know something went wrong—the fix is simple: retry. But when an API returns only 2 out of 15 data records with no error message, how would you know data is missing? This is why E2 is harder—the agent must independently recognize that something is off. In a real deployment, imagine an agent processing insurance claims that silently receives truncated patient records. It might approve claims based on incomplete information, a much more dangerous failure mode than a visible crash.
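A transient fault of either kind can be sketched as a wrapper around a tool. This is an illustrative assumption about how such injection might look, not the benchmark's code: the class name, its parameters, and the `list_units` tool are hypothetical, and the sketch models a single fault event of a given duration rather than several.

```python
import copy


class TransientFaultInjector:
    """Wraps a tool and degrades one window of consecutive calls."""

    def __init__(self, tool, mode, start_call, duration):
        if mode not in ("explicit", "implicit"):
            raise ValueError("mode must be 'explicit' or 'implicit'")
        self.tool = tool
        self.mode = mode
        self.start_call = start_call  # 1-based index of the first faulty call
        self.duration = duration      # number of consecutive faulty calls
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        faulty = self.start_call <= self.calls < self.start_call + self.duration
        if not faulty:
            return self.tool(*args, **kwargs)
        if self.mode == "explicit":
            # E1 style: the failure is visible to the agent.
            return {"error": "HTTP 500 Internal Server Error"}
        # E2 style: silently drop about half of every list, with no error signal.
        degraded = copy.deepcopy(self.tool(*args, **kwargs))
        for key, value in degraded.items():
            if isinstance(value, list):
                degraded[key] = value[: max(1, len(value) // 2)]
        return degraded


def list_units():
    """Hypothetical tool returning 15 records."""
    return {"units": [f"unit_{i}" for i in range(1, 16)]}


flaky = TransientFaultInjector(list_units, mode="implicit", start_call=1, duration=2)
sizes = [len(flaky()["units"]) for _ in range(3)]
```

Because the fault is transient, the third call returns the full list again, which is why re-querying is the expected recovery for both fault types.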

Cross-Industry Evaluation Results

The table below presents E0 (clean environment) completion rates across 10 industry categories for all 15 models evaluated. All models use thinking mode with reasoning effort set to high. Bold values indicate the best score in each category.

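Model | Avg | Agri | Biz | Comm | Edu | Hlth | Ind | Pub | Sci | Tech | Trans
GPT-5.2 | 79.6 | 84 | 86 | 67 | 77 | 76 | 85 | 84 | 94 | 80 | 72
Gemini 3.1 Pro | 72.3 | 68 | 73 | 75 | 84 | 62 | 73 | 72 | 81 | 78 | 60
Claude Opus 4.6 | 71.5 | 74 | 78 | 53 | 75 | 76 | 73 | 68 | 62 | 68 | 77
Qwen 3.5 Plus | 69.9 | 77 | 70 | 81 | 56 | 81 | 71 | 76 | 69 | 74 | 55
DeepSeek V3.2 | 69.6 | 65 | 78 | 67 | 66 | 71 | 69 | 72 | 62 | 74 | 64
Claude Opus 4.5 | 65.2 | 58 | 76 | 56 | 62 | 52 | 65 | 72 | 56 | 68 | 66
Claude Sonnet 4.5 | 64.9 | 65 | 70 | 69 | 50 | 71 | 71 | 60 | 44 | 68 | 62
Claude Sonnet 4.6 | 64.4 | 58 | 71 | 64 | 69 | 67 | 64 | 64 | 69 | 64 | 57
Kimi K2.5 | 64.1 | 68 | 62 | 56 | 62 | 81 | 62 | 72 | 56 | 74 | 57
GLM-5 | 62.6 | 55 | 75 | 67 | 53 | 57 | 56 | 68 | 62 | 70 | 55
Claude Opus 4 | 61.3 | 52 | 75 | 50 | 53 | 57 | 58 | 76 | 81 | 66 | 51
Gemini 3.1 FL | 61.3 | 68 | 70 | 58 | 53 | 67 | 58 | 68 | 62 | 68 | 45
Qwen 3.5 Flash | 59.7 | 61 | 60 | 67 | 53 | 76 | 53 | 68 | 69 | 60 | 51
MiniMax M2.7 | 53.9 | 48 | 60 | 56 | 31 | 57 | 60 | 60 | 62 | 64 | 40
Claude Sonnet 4 | 53.4 | 35 | 63 | 61 | 38 | 57 | 51 | 76 | 31 | 60 | 47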

No single model dominates all industries. GPT-5.2 leads overall (79.6%) with the highest scores in Agriculture, Business, Industrial, and Science, but its Commerce score (67%) is far below Qwen 3.5 Plus (81%). Open-source models are highly competitive: Qwen 3.5 Plus (69.9%) and DeepSeek V3.2 (69.6%) outperform most Claude variants, challenging the assumption that closed-source models uniformly outperform open-source alternatives.

Figure 2: Radar chart showing model performance profiles across 10 industry categories (E0). Each model has a distinct shape, indicating different occupational specializations.

Environmental Robustness

Even with only 2 fault events of 2 rounds each, performance drops substantially: the average completion rate falls from 67.5% (E0) to 53.4% (E2), a 14.1-point decline. Gemini 3.1 Pro and MiniMax M2.7 share the highest robustness score (0.87), while Kimi K2.5 has the lowest (0.63).

Model | E0 | E1 | E2 | E3 | Rob.
Gemini 3.1 Pro | 72.3 | 73.3 | 63.1 | 65.2 | 0.87
MiniMax M2.7 | 53.9 | 52.9 | 47.1 | 46.9 | 0.87
GPT-5.2 | 79.6 | 75.9 | 70.4 | 67.0 | 0.84
GLM-5 | 62.6 | 59.4 | 52.6 | 47.4 | 0.76
Claude Opus 4.6 | 71.5 | 68.1 | 53.9 | 63.9 | 0.75
DeepSeek V3.2 | 69.6 | 59.9 | 56.0 | 51.6 | 0.74
Qwen 3.5 Plus | 69.9 | 61.0 | 51.6 | 54.2 | 0.74
Claude Sonnet 4.6 | 64.4 | 62.8 | 45.0 | 52.9 | 0.70
Kimi K2.5 | 64.1 | 50.0 | 40.6 | 40.1 | 0.63
Average | 67.5 | 62.6 | 53.4 | 54.4 | 0.77
Bar chart comparing E0, E1, E2, E3 completion rates across models
Figure 3: Completion rates under clean (E0) and fault-injected (E1–E3) environments. Implicit faults (E2, red) cause the largest drops.

The Robustness Score explained

The Robustness Score (R) measures how well an agent maintains performance under adverse conditions. It is calculated as $R = \min(CR_{E1}, CR_{E2}, CR_{E3}) / CR_{E0}$, where $CR$ is the completion rate under each condition. A score of 1.0 means no degradation at all (unlikely in practice), while a low score like 0.63 means the agent loses over a third of its capability when things go wrong.

Why use the minimum across fault types? Because a system is only as reliable as its weakest link. An agent that handles explicit errors perfectly (E1) but collapses under implicit faults (E2) is not truly robust—the robustness score captures this worst-case perspective.

Implicit faults are harder than both explicit and mixed faults. Counter-intuitively, 4 out of 9 models perform worse under E2 than E3. Explicit errors (timeouts, HTTP 500) provide unambiguous failure signals that prompt retry, while implicit faults (truncated data, missing fields) require the agent to independently detect that something is wrong—a fundamentally harder capability.
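Both the robustness score and the E2-versus-E3 count can be re-derived from the completion rates in the robustness table. The short script below is a worked check rather than anything from the paper; the variable and function names are chosen here for illustration.

```python
# (E0, E1, E2, E3) completion rates copied from the robustness table.
RESULTS = {
    "Gemini 3.1 Pro": (72.3, 73.3, 63.1, 65.2),
    "MiniMax M2.7": (53.9, 52.9, 47.1, 46.9),
    "GPT-5.2": (79.6, 75.9, 70.4, 67.0),
    "GLM-5": (62.6, 59.4, 52.6, 47.4),
    "Claude Opus 4.6": (71.5, 68.1, 53.9, 63.9),
    "DeepSeek V3.2": (69.6, 59.9, 56.0, 51.6),
    "Qwen 3.5 Plus": (69.9, 61.0, 51.6, 54.2),
    "Claude Sonnet 4.6": (64.4, 62.8, 45.0, 52.9),
    "Kimi K2.5": (64.1, 50.0, 40.6, 40.1),
}


def robustness(e0, e1, e2, e3):
    """R = min(CR_E1, CR_E2, CR_E3) / CR_E0, the worst case relative to clean."""
    return min(e1, e2, e3) / e0


scores = {model: round(robustness(*rates), 2) for model, rates in RESULTS.items()}

# Models for which implicit faults (E2) hurt more than mixed faults (E3).
e2_below_e3 = [model for model, (_, _, e2, e3) in RESULTS.items() if e2 < e3]
```

For Gemini 3.1 Pro the worst case is E2, so R = 63.1 / 72.3, which rounds to 0.87; for Kimi K2.5 the worst case is E3, so R = 40.1 / 64.1, which rounds to 0.63.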

Line charts showing fault sensitivity analysis with varying fault count and duration
Figure 4: Fault parameter ablation under E3 mixed faults. Performance declines as both fault count and fault duration increase. Claude Opus 4.6 degrades more gracefully than Qwen 3.5+.

Scaling & Reasoning Analysis

Model Scaling

Larger models consistently outperform their smaller counterparts within every model family. The performance gaps range from 11.0 points (Gemini Pro vs. Flash-Lite) down to 0.3 points (Claude 4.5 Opus vs. Sonnet). The near-parity within Claude 4.5 is notable and suggests that scaling benefits in that generation were minimal.

Bar chart comparing large vs small model variants
Figure 5: Large vs. small model variants within each family (E0). Gaps range from 0.3 to 11.0 points.

Generational Progress

Claude Opus shows consistent generational improvement: 61.3% (v4) → 65.2% (v4.5) → 71.5% (v4.6), a total gain of 10.2 points. Sonnet shows a large jump from v4 to v4.5 (+11.5 points) but a slight regression from v4.5 to v4.6 (−0.5 points), possibly reflecting a trade-off between reasoning depth and execution efficiency in the 4.6 adaptive thinking architecture.

Line chart showing Claude generational progress
Figure 6: Claude generational progress (E0). Opus improves with every generation, with its steepest gain from 4.5 to 4.6; Sonnet jumps from 4 to 4.5 and then dips slightly at 4.6.

Generational progress tracks how the same model family improves across version releases. For Claude, the Opus line improved steadily (+10.2 points over three generations), showing that investment in core capabilities pays off. But the Sonnet line regressed slightly from v4.5 to v4.6, possibly because the new "adaptive thinking" architecture trades raw execution speed for deeper reasoning, a tradeoff that does not always help on these task-oriented benchmarks.

Reasoning Effort

Higher reasoning effort generally leads to better agent performance. GPT-5.2 exhibits a clear monotonic trend: scaling from none (54.7%) to xhigh (82.2%), a 27.5-point improvement. Claude Opus 4.6 shows a similar overall trend, with max effort (73.8%) outperforming low (70.2%) by 3.6 points. Deeper reasoning directly translates to better task execution on professional tasks.

Line chart showing effect of reasoning effort on agent performance
Figure 7: Effect of reasoning effort on agent performance (E0). GPT-5.2 shows dramatic gains from minimal to maximum effort.

Simulator Quality

A critical finding is that strong agents are not necessarily strong environment simulators. GPT-5.2 ranks #1 as an agent but produces the worst simulation quality. When using a sufficiently capable simulator, pairwise ranking agreement reaches 85.7% (Gemini Flash vs. Qwen 3.5+), but drops to 75% when GPT-5.2 serves as the simulator.

Why "Strong Agent ≠ Strong Simulator" matters

This is one of the paper's most surprising findings. You might expect that the best AI model would also be the best at simulating environments. But GPT-5.2, which ranks #1 as an agent, produces the worst simulation quality. The paper identifies three distinct failure modes:

  • State fabrication: The simulator invents rooms, resources, or entities that do not exist in the specification. For example, GPT-5.2 as a simulator created two extra empty hospital rooms that were not in the scenario, leading the agent to use them and fail the rubric.
  • Entity omission: The simulator drops critical information from its responses. In one case, it omitted a database specialist from a roster query, making the correct escalation path impossible.
  • Rule fabrication: The simulator independently invents business rules (like a return window expiration) that were not in the task specification.

The practical takeaway: if you are building an LES-based evaluation system, do not assume your best model should also serve as the environment. Use a separate, validated model for simulation.

Heatmap showing pairwise ranking agreement across simulators
Figure 8: Pairwise ranking agreement across three environment simulators. GPT-5.2 as simulator produces the lowest agreement, suggesting it generates overly difficult or inconsistent environments.

This has important implications for LES-based evaluation: the simulator is part of the evaluation apparatus, not a neutral observer. Simulator selection must be carefully validated, and cross-simulator consistency checks are essential for reliable benchmarking.
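One plausible way to run such a cross-simulator consistency check is to compare, for every pair of agents, whether two simulators order them the same way. The source does not give the exact definition of pairwise ranking agreement, so the function below is an assumption, and the agent labels and scores are invented for illustration.

```python
from itertools import combinations


def pairwise_agreement(scores_a: dict, scores_b: dict) -> float:
    """Fraction of agent pairs that both simulators rank in the same order."""
    agents = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(agents, 2))
    if not pairs:
        raise ValueError("need at least two agents scored by both simulators")
    agree = sum(
        1
        for m, n in pairs
        # Same sign of the score difference means same relative order.
        # A tie under either simulator counts as a disagreement here.
        if (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n]) > 0
    )
    return agree / len(pairs)


# Made-up completion rates for four hypothetical agents under two simulators.
simulator_x = {"A": 70.0, "B": 60.0, "C": 50.0, "D": 40.0}
simulator_y = {"A": 65.0, "B": 68.0, "C": 45.0, "D": 30.0}
agreement = pairwise_agreement(simulator_x, simulator_y)
```

In this toy data only the A versus B pair flips, so five of the six pairs agree.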

Industry Analysis

Industries vary dramatically in difficulty. Business (70.1%) is the easiest category on average, while Transportation (56.2%) is the hardest—a 14-point gap. This reveals that industry context significantly affects agent performance, and single-domain benchmarks cannot capture this variation.
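The two category averages can be checked against the per-model Business and Transportation columns of the E0 results table. The script below is a worked check under one assumption that the source does not state: averaging over 14 models with GPT-5.2 left out reproduces the quoted 70.1% and 56.2%, whereas averaging over all 15 rows gives slightly higher values.

```python
# (Business, Transportation) E0 scores copied from the cross-industry table.
BIZ_TRANS = {
    "GPT-5.2": (86, 72),
    "Gemini 3.1 Pro": (73, 60),
    "Claude Opus 4.6": (78, 77),
    "Qwen 3.5 Plus": (70, 55),
    "DeepSeek V3.2": (78, 64),
    "Claude Opus 4.5": (76, 66),
    "Claude Sonnet 4.5": (70, 62),
    "Claude Sonnet 4.6": (71, 57),
    "Kimi K2.5": (62, 57),
    "GLM-5": (75, 55),
    "Claude Opus 4": (75, 51),
    "Gemini 3.1 FL": (70, 45),
    "Qwen 3.5 Flash": (60, 51),
    "MiniMax M2.7": (60, 40),
    "Claude Sonnet 4": (63, 47),
}


def category_means(rows: dict, exclude: tuple = ()) -> tuple:
    """Mean Business and Transportation scores over the models not excluded."""
    kept = [scores for model, scores in rows.items() if model not in exclude]
    biz = sum(b for b, _ in kept) / len(kept)
    trans = sum(t for _, t in kept) / len(kept)
    return round(biz, 1), round(trans, 1)


all_models = category_means(BIZ_TRANS)
without_gpt = category_means(BIZ_TRANS, exclude=("GPT-5.2",))  # assumed 14-model set
```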

Horizontal bar chart showing industry difficulty rankings
Figure 9: Industry difficulty ranked by average performance across 14 models. Business is easiest, Transportation is hardest.

Model-Industry Specialization

Different models excel in different industry profiles: Gemini 3.1 Pro excels in knowledge-intensive domains (Education 84%, Science 81%, Technology 78%). Claude Opus 4.6 excels in operational domains (Transportation 77%, Business 78%). Qwen 3.5 Plus excels in consumer-facing domains (Commerce 81%, Healthcare 81%). Organizations should select agent models based on their specific industry, not solely on aggregate rankings.

Practical implications for organizations

This finding has direct business value. If you are deploying AI agents in your organization, the results suggest you should not pick one model for everything. Instead:

  • For knowledge-heavy tasks (education, research, tech support): Consider Gemini 3.1 Pro, which excels at factual accuracy and structured knowledge retrieval.
  • For operational tasks (logistics, business operations, manufacturing): Consider Claude Opus 4.6, which excels at careful state tracking and multi-step procedural execution.
  • For consumer-facing tasks (e-commerce, healthcare intake, agriculture): Consider Qwen 3.5 Plus, which may benefit from diverse pre-training data.

The 14-point gap between the easiest (Business, 70.1%) and hardest (Transportation, 56.2%) industries also suggests that difficulty varies enormously by domain—so benchmarking on a single domain gives a misleading picture of real-world readiness.

Case Studies

The following case studies illustrate how OccuBench reveals specific agent capabilities and failure modes through realistic professional task scenarios.

Last-Mile Delivery Routing — Proactive Constraint Checking

A delivery agent must identify the highest-priority medical shipment and deliver it while maintaining battery above 15%. Claude Opus 4.6 (PASS): Recognized that 28% battery was risky, recharged before navigating, arrived with 82% battery. DeepSeek V3.2 (FAIL): Navigated immediately, battery dropped to 12.5%, violating the constraint. The recharge came too late.

Key insight: The critical differentiator is whether the agent proactively checks constraints before acting, rather than reactively fixing violations.
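The difference between proactive and reactive behavior can be sketched as a precondition check before the navigation step. The 15% floor and the 28% starting level come from the case study; the 15.5-point trip cost is inferred from the failed run (28% down to 12.5%), and the recharge level, function name, and step names are assumptions, so the arrival level is not meant to reproduce the reported 82%.

```python
MIN_BATTERY = 15.0  # constraint from the task: battery must stay above this level


def plan_delivery(battery: float, trip_cost: float, recharge_to: float = 100.0):
    """Return the planned steps and the predicted battery level on arrival."""
    steps = []
    # Proactive check: simulate the trip before committing to it.
    if battery - trip_cost <= MIN_BATTERY:
        steps.append("recharge")
        battery = recharge_to
    steps.append("navigate")
    return steps, battery - trip_cost


steps, arrival = plan_delivery(battery=28.0, trip_cost=15.5)
```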

Fish Farm Water Quality — Verification Gaps

An agent must detect thermal stratification in a fish farm and take corrective actions. The agent successfully profiled water quality, detected the problem (2.1°C gradient, low dissolved oxygen at bottom), activated mixing, and reduced feeding. However, it failed to re-check ammonia chemistry after corrections, claiming “ammonia remained low” without supporting evidence.

Key insight: Agents can execute correct actions but skip critical verification steps, making claims without evidence—a dangerous pattern in safety-critical domains.

Building Inspection — Regulatory Compliance Ordering

An agent must inspect a building's gas system following NFPA 99 compliance procedures. The agent performed brazing without valid permits, skipped permit renewal during the work, submitted final certification with the oxygen valve still closed, and renewed permits only after all work was complete—too late for compliance.

Key insight: Procedural ordering matters in regulated domains. The agent completed all required actions but in the wrong sequence, resulting in compliance failure.
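An ordering requirement like this one can be expressed as a simple check over the sequence of actions in a trajectory. The sketch below is illustrative only: the action names and the two ordering rules are hypothetical stand-ins for whatever the actual rubric specifies.

```python
def first_index(trajectory: list, action: str):
    """Position of the first occurrence of an action, or None if it never happens."""
    return trajectory.index(action) if action in trajectory else None


def happens_before(trajectory: list, earlier: str, later: str) -> bool:
    """True only if both actions occur and the first precedes the second."""
    i, j = first_index(trajectory, earlier), first_index(trajectory, later)
    return i is not None and j is not None and i < j


def compliant(trajectory: list) -> bool:
    # Hypothetical rules: permits before brazing, valve open before certification.
    return happens_before(trajectory, "renew_permit", "perform_brazing") and happens_before(
        trajectory, "open_oxygen_valve", "submit_certification"
    )


# Same four actions, two different orders.
failed_run = ["perform_brazing", "submit_certification", "renew_permit", "open_oxygen_valve"]
ordered_run = ["renew_permit", "perform_brazing", "open_oxygen_valve", "submit_certification"]
```

Both runs contain every required action, so a checklist-style rubric would pass both; only the order-aware check separates them.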

Fault Resilience: Explicit Faults (E1)

Under E1 fault injection on a public transit task, Kimi K2.5 stopped after encountering a single HTTP 500 error, completing only 1 of 4 required actions (RTPI suppression). It did not retry the failed call, resolve maintenance holds, or reassign the bus to its new route—demonstrating that some agents give up entirely instead of retrying when faced with explicit errors.

Fault Resilience: Implicit Faults (E2)

Under E2 implicit fault injection, Kimi K2.5 received truncated property data (only 2 of 15 units returned). Instead of detecting the incomplete data and re-querying, it assumed all 15 units followed the same pattern as the 2 sampled units. This led to an incorrect NOI calculation ($362,000 vs. actual) and a wrong DSCR assessment (PASS instead of FAIL)—demonstrating the danger of accepting degraded data at face value.
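The behavior the agent skipped, checking completeness and re-querying, can be sketched as follows. This assumes the response carries an expected record count (the `total_units` field is hypothetical); where no such field exists, the agent would need another source for the expected size, such as the task description.

```python
def is_complete(response: dict, list_key: str = "units", count_key: str = "total_units") -> bool:
    """Compare the number of returned records with the expected count."""
    return len(response[list_key]) >= response[count_key]


def fetch_complete(tool, max_attempts: int = 3) -> dict:
    """Re-query until the response is complete, since the faults are transient."""
    for _ in range(max_attempts):
        response = tool()
        if is_complete(response):
            return response
    raise RuntimeError("response still incomplete after retries")


class FlakyPropertyApi:
    """Hypothetical tool that returns 2 of 15 units on its first call only."""

    def __init__(self):
        self.calls = 0

    def __call__(self) -> dict:
        self.calls += 1
        units = [f"unit_{i}" for i in range(1, 16)]
        if self.calls == 1:
            units = units[:2]  # silent truncation, no error signal
        return {"total_units": 15, "units": units}


api = FlakyPropertyApi()
result = fetch_complete(api)
```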

Discussion & Conclusion

Limitations

Simulation fidelity: Language Environment Simulators model domain logic rather than domain data. An LES understands that a drug interaction check should return contraindications, but the specific values are generated rather than retrieved from a real database. OccuBench evaluates an agent's decision-making process rather than its ability to handle exact real-world data values.

Simulator dependence: Evaluation results are tied to the specific simulator used during data synthesis. Tasks verified as solvable under one LES may become unsolvable under a different one, and agent rankings can shift when the simulator changes. The simulator is part of the evaluation apparatus, not a neutral observer.

Conclusion

OccuBench is the first benchmark systematically evaluating AI agents on real-world professional tasks across 100 scenarios, 65 specialized domains, and 10 industry categories. Through Language Environment Simulators, OccuBench makes the “untestable majority” of professional domains evaluable without any real environment infrastructure.

The evaluation of 15 frontier models reveals that: (1) no model dominates across all industries; (2) implicit environmental faults are harder than explicit and mixed faults; (3) scaling consistently improves performance; and (4) strong agents are not necessarily strong environment simulators. These findings have practical implications: organizations should select agent models based on their specific industry needs, invest in robustness testing beyond happy-path evaluation, and carefully validate simulator quality in LES-based benchmarks.

Keywords

AI Agents · Benchmark · LLM · Language Environment Simulator · Professional Tasks · Robustness Evaluation · Multi-Agent Systems · Cross-Industry Evaluation
