Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Qwen Team, Alibaba Group · The Chinese University of Hong Kong
Each model has a distinct capability profile across industries. GPT-5.2 leads overall at 79.6%, but Gemini 3.1 Pro tops Education (84%) and Claude Opus 4.6 excels in Transportation (77%). No single model is the best choice for every domain.
Implicit faults (truncated data, missing fields) hurt more than explicit errors such as HTTP 500s: average completion falls to 53.4% under implicit faults versus 62.6% under explicit errors, from a 67.5% clean baseline. Without clear error signals, agents fail to detect degraded data and make decisions on incomplete information.
Larger models, newer generations, and higher reasoning effort all improve performance. GPT-5.2 gains a dramatic 27.5 points when scaling from minimal to maximum reasoning effort.
GPT-5.2 ranks #1 as an agent (79.6%) but produces the worst environment simulation quality. Simulator choice significantly affects evaluation rankings, with pairwise agreement as low as 75%.
AI agents are expected to perform professional work across hundreds of occupational domains—from emergency department triage to nuclear reactor safety monitoring to customs import processing—yet existing benchmarks can only evaluate agents in the few domains where public environments exist. OccuBench is a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. A multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection. An evaluation of 15 frontier models across 8 model families reveals that no single model dominates all industries, implicit faults are harder than explicit errors, scaling consistently improves performance, and strong agents are not necessarily strong environment simulators.
AI agents are increasingly expected to perform professional work across diverse occupational domains: triaging emergency patients, auditing financial reports, scheduling factory production lines, responding to network intrusions, processing customs declarations, and coordinating wildfire evacuations. These represent the highest-value applications of AI agent technology, where autonomous decision-making through multi-step tool use can augment or replace costly human expertise. However, a fundamental evaluation gap exists: the professional domains where agents would deliver the most value are precisely the domains where no benchmarks exist.
The professional domains where AI agents are most needed—healthcare, finance, legal, manufacturing, energy, governance, and logistics—are bound to enterprise systems with no public access. This makes benchmark construction impossible with traditional approaches.
Even within covered domains, each benchmark is constrained by its environment implementation. Adding a new domain requires deploying and configuring entire web applications or APIs—costs that scale linearly with domain count.
Real-world environments are noisy: APIs time out, data arrives incomplete, services degrade silently. Yet existing benchmarks evaluate agents exclusively on the “happy path,” missing a critical dimension of deployment readiness.
The key insight is that the environment itself can be simulated by an LLM. A Language Environment Simulator (LES) takes an environment configuration—system prompt, tool schema, initial state, and state description—and generates realistic tool responses based on the LLM's pre-trained knowledge of domain-specific operational logic.
Based on LESs, OccuBench covers 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, with 382 evaluation instances. It evaluates agents on both task completion (multi-step decision-making across industries) and environmental robustness (performance under explicit errors, implicit data degradation, and mixed faults).
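The core innovation of OccuBench is the Language Environment Simulator (LES), a function that simulates domain-specific environments through LLM-driven tool response generation. Formally, the LES maps the environment configuration, the current latent state, and the agent's action to the next observation: $o_{t+1} = \mathrm{LES}(c, s_t, a_t)$.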
Here, $c$ is the environment configuration (system prompt, tool schema, initial state, state description), $s_t$ is the latent environment state maintained implicitly through the LLM's context window, $a_t$ is the agent's action (a tool call), and $o_{t+1}$ is the observation returned to the agent. Unlike traditional world models that learn from data, LESs leverage pre-trained knowledge of domain-specific operational logic.
For example, when the agent calls get_patient_vitals(room_2), the LES generates a plausible JSON response with heart rate, blood pressure, etc., following the rules defined in the configuration $c$.
System prompt: Defines the environment's behavioral rules, simulation logic, error handling protocols, and output format constraints. For example, a hotel revenue management environment specifies pricing rules, occupancy calculations, and metrics relationships.
Tool schema: Defines the agent's action space as a set of callable functions with typed parameters and example outputs. Each environment contains 2–10 tools (median 5) reflecting realistic operational interfaces.
Initial state: A structured JSON object specifying the environment's starting conditions, for example room inventory, patient queue, or network topology.
State description: Semantic annotations for each state field, guiding the LLM to maintain causal consistency (e.g., “remaining inventory decreases after each booking”).
LLMs are effective environment simulators because: (1) Format priors—pre-training on API documentation provides strong priors for well-formatted tool responses. (2) Domain knowledge—LLMs encode operational logic for hundreds of professions. (3) Constraint enforcement—system prompts can impose state transition rules that maintain causal consistency.
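A minimal sketch of how an LES step could be wired up, assuming the simulator is driven by a single text-in, text-out model call. The class names, the prompt layout, and the `fake_llm` stand-in are all hypothetical and are not taken from the paper; the sketch only illustrates how the four configuration components and the interaction history (which carries the latent state) feed each tool response.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EnvConfig:
    """The four components of an environment configuration c."""
    system_prompt: str
    tool_schema: list
    initial_state: dict
    state_description: dict


@dataclass
class LanguageEnvironmentSimulator:
    config: EnvConfig
    llm: Callable[[str], str]  # hypothetical interface: prompt text in, JSON text out
    history: list = field(default_factory=list)

    def step(self, tool_name: str, arguments: dict) -> dict:
        """Map an agent action a_t to an observation o_{t+1}."""
        action = {"tool": tool_name, "arguments": arguments}
        prompt = json.dumps({
            "system_prompt": self.config.system_prompt,
            "tool_schema": self.config.tool_schema,
            "initial_state": self.config.initial_state,
            "state_description": self.config.state_description,
            "history": self.history,  # the latent state lives in this context
            "action": action,
        })
        observation = json.loads(self.llm(prompt))
        self.history.append({"action": action, "observation": observation})
        return observation


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned vitals record."""
    request = json.loads(prompt)
    room = request["action"]["arguments"]["room"]
    return json.dumps({"room": room, "heart_rate": 88, "blood_pressure": "128/82"})


config = EnvConfig(
    system_prompt="You simulate an emergency department information system.",
    tool_schema=[{"name": "get_patient_vitals", "parameters": {"room": "string"}}],
    initial_state={"rooms": {"room_2": {"occupied": True}}},
    state_description={"rooms": "Occupancy per room; vitals stay consistent across calls."},
)
les = LanguageEnvironmentSimulator(config, fake_llm)
obs = les.step("get_patient_vitals", {"room": "room_2"})
```

Swapping `fake_llm` for a real model client would leave the rest of the loop unchanged, which is the property that makes adding a new domain a matter of writing a new configuration.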
Traditional benchmarks for AI agents require building actual software environments—a real hospital management system, a real factory scheduling tool, a real customs processing API. This is why only a handful of domains (web browsing, coding) have proper benchmarks. The LES approach flips this: instead of engineering environments, you describe them in natural language. The LLM's pre-training on documentation for hundreds of industries means it already "knows" how an emergency department system should respond to a triage query, or how a logistics API should handle route optimization. This makes it possible to benchmark agents across 100 different professional domains—something that would cost millions of dollars with traditional approaches.
Each evaluation instance must satisfy four quality conditions: it must be solvable (a valid solution exists and is verified), verifiable (clear automated success criteria), discriminative (calibrated difficulty that distinguishes agent capabilities), and diverse (structural variation across instances).
The pipeline employs 16 non-overlapping sub-topics per scenario and constructs a professional reference document for each, covering domain terminology, workflows, state variables, edge cases, and constraints. A multi-agent pipeline powered by Gemini-3-Flash generates environment configurations, task instructions, tool definitions, solution plans, and verification rubrics. Quality filtering removes trivially easy (100% success), unsolvable (0% success), or invalid instances.
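The quality filter described above can be sketched as follows, assuming each candidate instance has been attempted several times and each attempt is recorded as pass or fail. The `Candidate` type and its field names are illustrative, not the pipeline's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    instance_id: str
    outcomes: list      # one bool per trial run; True means the rubric passed
    valid: bool = True  # False if the configuration or rubric is malformed


def success_rate(candidate: Candidate) -> float:
    return sum(candidate.outcomes) / len(candidate.outcomes)


def keep(candidate: Candidate) -> bool:
    """Drop invalid, trivially easy (100%), and unsolvable (0%) instances."""
    if not candidate.valid or not candidate.outcomes:
        return False
    return 0.0 < success_rate(candidate) < 1.0


candidates = [
    Candidate("too_easy", [True, True, True, True]),
    Candidate("unsolved", [False, False, False, False]),
    Candidate("discriminative", [True, False, True, False]),
    Candidate("broken_rubric", [True, False], valid=False),
]
kept = [c.instance_id for c in candidates if keep(c)]
```

Only instances with a success rate strictly between 0 and 1 survive, which is what keeps the benchmark both solvable and discriminative.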
OccuBench covers 100 professional task scenarios across 10 industry categories and 65 specialized domains. Each scenario maps to a real human job role, ensuring evaluation results have direct practical relevance. After quality filtering, the benchmark contains 382 evaluation instances.
| Category | # | Representative Scenarios |
|---|---|---|
| Business & Enterprise | 19 | Resume screening, expense auditing, AML review |
| Technology & IT | 16 | Linux ops, CI/CD recovery, intrusion response |
| Industrial & Engineering | 12 | Production scheduling, mine ventilation |
| Transportation & Logistics | 11 | Last-mile delivery, train dispatch |
| Commerce & Consumer | 9 | Dynamic pricing, hotel revenue mgmt. |
| Education & Culture | 8 | Adaptive curriculum, fact-checking |
| Healthcare & Life Sciences | 7 | Emergency triage, drug interaction screening |
| Public Service & Governance | 7 | Permit processing, wildfire evacuation |
| Agriculture & Environment | 7 | Irrigation control, crop disease diagnosis |
| Science & Research | 4 | Telescope scheduling, excavation planning |
E0 (clean): No faults injected. Baseline performance measurement. All data is synthesized in clean environments.
E1 (explicit faults): The LES injects clearly visible error responses: HTTP 500, TimeoutError, ConnectionRefused, ServiceUnavailable. The agent knows the call failed. The correct behavior is to retry.
E2 (implicit faults): The LES returns degraded responses with no error signal: truncated data, missing fields, incomplete lists, or stale cached values. The response appears superficially correct. The agent must detect the quality issue and re-query.
E3 (mixed faults): Approximately half explicit, half implicit faults. All faults are transient (retrying recovers normal results), parameterized by fault count and fault duration.
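As an illustration of transient fault injection parameterized by fault count and duration, the sketch below wraps an ordinary tool function. In OccuBench the faults are produced by the LES itself, so this wrapper, its `fault_starts` parameter, and the `list_units` tool are assumptions made purely for the example.

```python
class FaultInjector:
    """Wraps a tool and corrupts selected calls; all other calls pass through."""

    def __init__(self, tool, fault_starts, duration, mode):
        if mode not in ("explicit", "implicit"):
            raise ValueError("mode must be 'explicit' or 'implicit'")
        self.tool = tool
        self.mode = mode
        self.calls = 0
        # Each fault event covers `duration` consecutive calls.
        self.faulty_calls = {s + k for s in fault_starts for k in range(duration)}

    def __call__(self, *args, **kwargs):
        index = self.calls
        self.calls += 1
        if index not in self.faulty_calls:
            return self.tool(*args, **kwargs)
        if self.mode == "explicit":
            # Visible failure: the agent can tell the call did not succeed.
            return {"error": "HTTP 500 Internal Server Error"}
        # Implicit failure: looks like success, but every list is truncated.
        result = self.tool(*args, **kwargs)
        return {
            key: (value[: max(1, len(value) // 2)] if isinstance(value, list) else value)
            for key, value in result.items()
        }


def list_units():
    """Hypothetical tool returning 15 records."""
    return {"units": [f"unit_{i}" for i in range(1, 16)]}


# Two fault events of two calls each, mirroring the "2 events, 2 rounds" setting.
flaky = FaultInjector(list_units, fault_starts=[0, 3], duration=2, mode="implicit")
sizes = [len(flaky()["units"]) for _ in range(6)]

failing = FaultInjector(list_units, fault_starts=[0], duration=1, mode="explicit")
first_response = failing()
second_response = failing()
```

A mixed (E3) setting could be approximated by assigning a mode per fault event instead of one mode per wrapper.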
Real production systems do not work perfectly all the time. APIs time out, databases return stale data, services degrade silently. OccuBench tests whether AI agents can handle these situations, not just the "happy path."
The key insight is the difference between explicit faults (E1) and implicit faults (E2). When you get an HTTP 500 error, you know something went wrong—the fix is simple: retry. But when an API returns only 2 out of 15 data records with no error message, how would you know data is missing? This is why E2 is harder—the agent must independently recognize that something is off. In a real deployment, imagine an agent processing insurance claims that silently receives truncated patient records. It might approve claims based on incomplete information, a much more dangerous failure mode than a visible crash.
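One defensive pattern against implicit faults, sketched under the assumption that the agent knows how many records to expect (for instance from an earlier count query). The helper name and its parameters are hypothetical.

```python
def fetch_complete(fetch, expected_count, max_attempts=3):
    """Re-query until the response holds at least the expected number of records."""
    for _ in range(max_attempts):
        records = fetch()
        if len(records) >= expected_count:
            return records
    raise RuntimeError(
        f"response still incomplete after {max_attempts} attempts "
        f"(expected {expected_count} records)"
    )


# First response is silently truncated to 2 of 15 records; the retry is complete.
responses = iter([
    ["unit_1", "unit_2"],
    [f"unit_{i}" for i in range(1, 16)],
])
units = fetch_complete(lambda: next(responses), expected_count=15)
```

The check is cheap, but it only works when an independent expectation exists, which is exactly why implicit faults are harder than explicit ones.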
The table below presents E0 (clean environment) completion rates across 10 industry categories for all 15 models evaluated. All models use thinking mode with reasoning effort set to high. Bold values indicate the best score in each category.
| Model | Avg | Agri | Biz | Comm | Edu | Hlth | Ind | Pub | Sci | Tech | Trans |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.2 | 79.6 | **84** | **86** | 67 | 77 | 76 | **85** | **84** | **94** | **80** | 72 |
| Gemini 3.1 Pro | 72.3 | 68 | 73 | 75 | **84** | 62 | 73 | 72 | 81 | 78 | 60 |
| Claude Opus 4.6 | 71.5 | 74 | 78 | 53 | 75 | 76 | 73 | 68 | 62 | 68 | **77** |
| Qwen 3.5 Plus | 69.9 | 77 | 70 | **81** | 56 | **81** | 71 | 76 | 69 | 74 | 55 |
| DeepSeek V3.2 | 69.6 | 65 | 78 | 67 | 66 | 71 | 69 | 72 | 62 | 74 | 64 |
| Claude Opus 4.5 | 65.2 | 58 | 76 | 56 | 62 | 52 | 65 | 72 | 56 | 68 | 66 |
| Claude Sonnet 4.5 | 64.9 | 65 | 70 | 69 | 50 | 71 | 71 | 60 | 44 | 68 | 62 |
| Claude Sonnet 4.6 | 64.4 | 58 | 71 | 64 | 69 | 67 | 64 | 64 | 69 | 64 | 57 |
| Kimi K2.5 | 64.1 | 68 | 62 | 56 | 62 | **81** | 62 | 72 | 56 | 74 | 57 |
| GLM-5 | 62.6 | 55 | 75 | 67 | 53 | 57 | 56 | 68 | 62 | 70 | 55 |
| Claude Opus 4 | 61.3 | 52 | 75 | 50 | 53 | 57 | 58 | 76 | 81 | 66 | 51 |
| Gemini 3.1 FL | 61.3 | 68 | 70 | 58 | 53 | 67 | 58 | 68 | 62 | 68 | 45 |
| Qwen 3.5 Flash | 59.7 | 61 | 60 | 67 | 53 | 76 | 53 | 68 | 69 | 60 | 51 |
| MiniMax M2.7 | 53.9 | 48 | 60 | 56 | 31 | 57 | 60 | 60 | 62 | 64 | 40 |
| Claude Sonnet 4 | 53.4 | 35 | 63 | 61 | 38 | 57 | 51 | 76 | 31 | 60 | 47 |
No single model dominates all industries. GPT-5.2 leads overall (79.6%) with the highest scores in Agriculture, Business, Industrial, Public Service, Science, and Technology, but its Commerce score (67%) is far below Qwen 3.5 Plus (81%). Open-source models are highly competitive: Qwen 3.5 Plus (69.9%) and DeepSeek V3.2 (69.6%) outperform most Claude variants, challenging the assumption that closed-source models uniformly outperform open-source alternatives.
Even with only 2 fault events of 2 rounds each, performance drops substantially: the average completion rate falls from 67.5% (E0) to 53.4% (E2), a 14.1-point decline. Gemini 3.1 Pro and MiniMax M2.7 achieve the highest robustness score (0.87), while Kimi K2.5 shows the lowest (0.63).
| Model | E0 | E1 | E2 | E3 | Rob. |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 72.3 | 73.3 | 63.1 | 65.2 | 0.87 |
| MiniMax M2.7 | 53.9 | 52.9 | 47.1 | 46.9 | 0.87 |
| GPT-5.2 | 79.6 | 75.9 | 70.4 | 67.0 | 0.84 |
| GLM-5 | 62.6 | 59.4 | 52.6 | 47.4 | 0.76 |
| Claude Opus 4.6 | 71.5 | 68.1 | 53.9 | 63.9 | 0.75 |
| DeepSeek V3.2 | 69.6 | 59.9 | 56.0 | 51.6 | 0.74 |
| Qwen 3.5 Plus | 69.9 | 61.0 | 51.6 | 54.2 | 0.74 |
| Claude Sonnet 4.6 | 64.4 | 62.8 | 45.0 | 52.9 | 0.70 |
| Kimi K2.5 | 64.1 | 50.0 | 40.6 | 40.1 | 0.63 |
| Average | 67.5 | 62.6 | 53.4 | 54.4 | 0.77 |
The Robustness Score (R) measures how well an agent maintains performance under adverse conditions. It is calculated as $R = \min(\mathrm{CR}_{E1}, \mathrm{CR}_{E2}, \mathrm{CR}_{E3}) / \mathrm{CR}_{E0}$, where $\mathrm{CR}$ is the completion rate under each fault condition. A score of 1.0 means no degradation at all (unlikely in practice), while a low score like 0.63 means the agent loses over a third of its capability when things go wrong.
Why use the minimum across fault types? Because a system is only as reliable as its weakest link. An agent that handles explicit errors perfectly (E1) but collapses under implicit faults (E2) is not truly robust—the robustness score captures this worst-case perspective.
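The robustness score can be checked directly against the table. The snippet below recomputes it for four of the listed models using the published completion rates; only the function and variable names are introduced here.

```python
def robustness(cr_e0, cr_e1, cr_e2, cr_e3):
    """Worst-case completion rate under faults, relative to the clean baseline."""
    return min(cr_e1, cr_e2, cr_e3) / cr_e0


# (E0, E1, E2, E3) completion rates copied from the robustness table.
completion_rates = {
    "Gemini 3.1 Pro": (72.3, 73.3, 63.1, 65.2),
    "GPT-5.2": (79.6, 75.9, 70.4, 67.0),
    "Claude Opus 4.6": (71.5, 68.1, 53.9, 63.9),
    "Kimi K2.5": (64.1, 50.0, 40.6, 40.1),
}
scores = {model: round(robustness(*rates), 2) for model, rates in completion_rates.items()}
```

For Gemini 3.1 Pro the minimum is the E2 rate, so R = 63.1 / 72.3, which rounds to 0.87; for Kimi K2.5 the minimum is the E3 rate, giving 40.1 / 64.1, which rounds to 0.63.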
Implicit faults are harder than both explicit and mixed faults. Counter-intuitively, 4 out of 9 models perform worse under E2 than E3. Explicit errors (timeouts, HTTP 500) provide unambiguous failure signals that prompt retry, while implicit faults (truncated data, missing fields) require the agent to independently detect that something is wrong—a fundamentally harder capability.
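The "4 out of 9" count can be reproduced from the E2 and E3 columns of the table; the variable names below are introduced only for this check.

```python
# (E2, E3) completion rates copied from the robustness table.
e2_e3 = {
    "Gemini 3.1 Pro": (63.1, 65.2),
    "MiniMax M2.7": (47.1, 46.9),
    "GPT-5.2": (70.4, 67.0),
    "GLM-5": (52.6, 47.4),
    "Claude Opus 4.6": (53.9, 63.9),
    "DeepSeek V3.2": (56.0, 51.6),
    "Qwen 3.5 Plus": (51.6, 54.2),
    "Claude Sonnet 4.6": (45.0, 52.9),
    "Kimi K2.5": (40.6, 40.1),
}
# Models whose implicit-fault score is below their mixed-fault score.
worse_under_e2 = [model for model, (e2, e3) in e2_e3.items() if e2 < e3]
```

The four models are Gemini 3.1 Pro, Claude Opus 4.6, Qwen 3.5 Plus, and Claude Sonnet 4.6; for the remaining five, mixed faults are at least as damaging as implicit ones.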
Larger models consistently outperform smaller counterparts within every model family. The performance gaps range from 11.0 points (Gemini Pro vs. Flash-Lite) to 0.3 points (Claude 4.5 Opus vs. Sonnet). The Claude 4.5 near-parity is notable, suggesting that generation's scaling benefits were minimal.
Claude Opus shows consistent generational improvement: 61.3% (v4) → 65.2% (v4.5) → 71.5% (v4.6), a total gain of +10.2 points. Sonnet shows a large jump from v4 to v4.5 (+11.5 points) but slight regression from v4.5 to v4.6 (−0.5 points), possibly reflecting a trade-off between reasoning depth and execution efficiency in the 4.6 adaptive thinking architecture.
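The size and generation gaps quoted above follow from the Avg column of the main results table. The pairing of larger and smaller models below is an assumption made for this check (in particular, treating "Gemini 3.1 FL" as the Flash-Lite counterpart and Qwen 3.5 Flash as the smaller Qwen model).

```python
# Average E0 completion rates copied from the main results table.
e0 = {
    "Gemini 3.1 Pro": 72.3, "Gemini 3.1 FL": 61.3,
    "Claude Opus 4.6": 71.5, "Claude Sonnet 4.6": 64.4,
    "Claude Opus 4.5": 65.2, "Claude Sonnet 4.5": 64.9,
    "Claude Opus 4": 61.3, "Claude Sonnet 4": 53.4,
    "Qwen 3.5 Plus": 69.9, "Qwen 3.5 Flash": 59.7,
}

# Assumed (larger, smaller) pairs within each family.
size_pairs = [
    ("Gemini 3.1 Pro", "Gemini 3.1 FL"),
    ("Claude Opus 4.6", "Claude Sonnet 4.6"),
    ("Claude Opus 4.5", "Claude Sonnet 4.5"),
    ("Claude Opus 4", "Claude Sonnet 4"),
    ("Qwen 3.5 Plus", "Qwen 3.5 Flash"),
]
size_gaps = {f"{big} vs {small}": round(e0[big] - e0[small], 1) for big, small in size_pairs}

# Generation-over-generation changes, in points.
opus_total_gain = round(e0["Claude Opus 4.6"] - e0["Claude Opus 4"], 1)
sonnet_steps = [
    round(e0["Claude Sonnet 4.5"] - e0["Claude Sonnet 4"], 1),
    round(e0["Claude Sonnet 4.6"] - e0["Claude Sonnet 4.5"], 1),
]
```

Every size gap is positive, with the largest at 11.0 points and the smallest at 0.3 points, matching the range stated in the text.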
Higher reasoning effort generally leads to better agent performance. GPT-5.2 exhibits a clear monotonic trend, improving from 54.7% at the none setting to 82.2% at xhigh, a 27.5-point gain. Claude Opus 4.6 shows a similar overall trend, with max effort (73.8%) outperforming low (70.2%) by 3.6 points. On these professional tasks, deeper reasoning tends to translate into better task execution.
A critical finding is that strong agents are not necessarily strong environment simulators. GPT-5.2 ranks #1 as an agent but produces the worst simulation quality. When using a sufficiently capable simulator, pairwise ranking agreement reaches 85.7% (Gemini Flash vs. Qwen 3.5 Plus), but drops to 75% when GPT-5.2 serves as the simulator.
This is one of the paper's most surprising findings. You might expect that the best AI model would also be the best at simulating environments. But GPT-5.2, which ranks #1 as an agent, produces the worst simulation quality. The paper identifies three distinct failure modes.
The practical takeaway: if you are building an LES-based evaluation system, do not assume your best model should also serve as the environment. Use a separate, validated model for simulation.
This has important implications for LES-based evaluation: the simulator is part of the evaluation apparatus, not a neutral observer. Simulator selection must be carefully validated, and cross-simulator consistency checks are essential for reliable benchmarking.
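A cross-simulator consistency check could be sketched as pairwise ranking agreement: the fraction of model pairs that two simulators order the same way. The exact definition used in the paper is not given here, so this formulation is an assumption, and the scores below are made-up placeholders rather than reported results.

```python
from itertools import combinations


def pairwise_agreement(scores_a, scores_b):
    """Fraction of model pairs ranked in the same order under both simulators.

    Ties under either simulator count as disagreement in this sketch.
    """
    models = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(models, 2))
    agreeing = sum(
        1
        for first, second in pairs
        if (scores_a[first] - scores_a[second]) * (scores_b[first] - scores_b[second]) > 0
    )
    return agreeing / len(pairs)


# Hypothetical completion rates for four agents under two simulators.
under_simulator_x = {"agent_a": 80.0, "agent_b": 70.0, "agent_c": 60.0, "agent_d": 50.0}
under_simulator_y = {"agent_a": 78.0, "agent_b": 62.0, "agent_c": 66.0, "agent_d": 40.0}
agreement = pairwise_agreement(under_simulator_x, under_simulator_y)
```

Here only the agent_b and agent_c pair flips, so five of the six pairs agree. Running such a check across several candidate simulators is one way to decide whether a simulator is stable enough to trust.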
Industries vary dramatically in difficulty. Business (70.1%) is the easiest category on average, while Transportation (56.2%) is the hardest—a 14-point gap. This reveals that industry context significantly affects agent performance, and single-domain benchmarks cannot capture this variation.
Different models excel in different industry profiles: Gemini 3.1 Pro excels in knowledge-intensive domains (Education 84%, Science 81%, Technology 78%). Claude Opus 4.6 excels in operational domains (Transportation 77%, Business 78%). Qwen 3.5 Plus excels in consumer-facing domains (Commerce 81%, Healthcare 81%). Organizations should select agent models based on their specific industry, not solely on aggregate rankings.
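This finding has direct business value. If you are deploying AI agents in your organization, the results suggest you should not pick one model for everything. Instead, match the model to the industry: Gemini 3.1 Pro for knowledge-intensive work, Claude Opus 4.6 for operational work, and Qwen 3.5 Plus for consumer-facing work.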
The 14-point gap between the easiest (Business, 70.1%) and hardest (Transportation, 56.2%) industries also suggests that difficulty varies enormously by domain—so benchmarking on a single domain gives a misleading picture of real-world readiness.
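Selecting a model per industry amounts to a column-wise maximum over the results table. The sketch below does this for four models and four categories, with scores copied from the table; the dictionary layout is only an illustration.

```python
# E0 completion rates (percent) copied from the main results table.
scores = {
    "GPT-5.2": {"Education": 77, "Transportation": 72, "Commerce": 67, "Science": 94},
    "Gemini 3.1 Pro": {"Education": 84, "Transportation": 60, "Commerce": 75, "Science": 81},
    "Claude Opus 4.6": {"Education": 75, "Transportation": 77, "Commerce": 53, "Science": 62},
    "Qwen 3.5 Plus": {"Education": 56, "Transportation": 55, "Commerce": 81, "Science": 69},
}

industries = ["Education", "Transportation", "Commerce", "Science"]
# For each industry, pick the model with the highest completion rate.
best_by_industry = {
    industry: max(scores, key=lambda model: scores[model][industry])
    for industry in industries
}
```

The overall leader wins only one of these four categories, which is the practical argument for per-industry selection over aggregate rankings.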
The following case studies illustrate how OccuBench reveals specific agent capabilities and failure modes through realistic professional task scenarios.
A delivery agent must identify the highest-priority medical shipment and deliver it while maintaining battery above 15%. Claude Opus 4.6 (PASS): Recognized that 28% battery was risky, recharged before navigating, arrived with 82% battery. DeepSeek V3.2 (FAIL): Navigated immediately, battery dropped to 12.5%, violating the constraint. The recharge came too late.
Key insight: The critical differentiator is whether the agent proactively checks constraints before acting, rather than reactively fixing violations.
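The proactive check can be sketched as a feasibility test run before committing to navigation. The trip cost of 15.5 points is inferred from the failing run (28% dropping to 12.5%), and the full-charge level of 100% is an assumption, so the final battery value here is illustrative and need not match the reported run.

```python
def plan_delivery(battery, trip_cost, min_battery, full_charge=100.0):
    """Return the action sequence and the projected battery level on arrival."""
    actions = []
    # Check the constraint against the projected level, before acting.
    if battery - trip_cost < min_battery:
        actions.append("recharge")
        battery = full_charge
    if battery - trip_cost < min_battery:
        raise ValueError("trip is infeasible even on a full charge")
    actions.append("navigate")
    return actions, battery - trip_cost


actions, arrival_battery = plan_delivery(battery=28.0, trip_cost=15.5, min_battery=15.0)
```

With these inputs the projected arrival level without recharging would be 12.5%, below the 15% floor, so the plan recharges first. A reactive agent discovers the same violation only after it has already happened.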
An agent must detect thermal stratification in a fish farm and take corrective actions. The agent successfully profiled water quality, detected the problem (2.1°C gradient, low dissolved oxygen at bottom), activated mixing, and reduced feeding. However, it failed to re-check ammonia chemistry after corrections, claiming “ammonia remained low” without supporting evidence.
Key insight: Agents can execute correct actions but skip critical verification steps, making claims without evidence—a dangerous pattern in safety-critical domains.
An agent must inspect a building's gas system following NFPA 99 compliance procedures. The agent performed brazing without valid permits, skipped permit renewal during the work, submitted final certification with the oxygen valve still closed, and renewed permits only after all work was complete—too late for compliance.
Key insight: Procedural ordering matters in regulated domains. The agent completed all required actions but in the wrong sequence, resulting in compliance failure.
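A verifier for procedural ordering could compare the agent's action trace against precedence rules. The action names and rules below are hypothetical stand-ins for the scenario's real tools and rubric.

```python
def ordering_violations(actions, must_precede):
    """Return the (before, after) rules that the action trace breaks."""
    first_seen = {}
    for index, action in enumerate(actions):
        first_seen.setdefault(action, index)

    violations = []
    for before, after in must_precede:
        if after not in first_seen:
            continue  # the dependent action never ran, so the rule is not triggered
        if before not in first_seen or first_seen[before] > first_seen[after]:
            violations.append((before, after))
    return violations


# Hypothetical trace resembling the failure described above.
trace = ["perform_brazing", "submit_certification", "renew_permit"]
rules = [
    ("renew_permit", "perform_brazing"),
    ("open_oxygen_valve", "submit_certification"),
]
broken = ordering_violations(trace, rules)
```

Both rules are broken in this trace: the permit is renewed only after brazing, and certification is submitted without the valve ever being opened.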
Under E1 fault injection on a public transit task, Kimi K2.5 stopped after encountering a single HTTP 500 error, completing only 1 of 4 required actions (RTPI suppression). It did not retry the failed call, resolve maintenance holds, or reassign the bus to its new route—demonstrating that some agents give up entirely instead of retrying when faced with explicit errors.
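The expected behavior under explicit faults is a bounded retry. The sketch below assumes failures are reported as a dictionary with an `error` key, which is a convention chosen for this example rather than the benchmark's actual response format.

```python
def call_with_retry(tool, max_attempts=3):
    """Call a tool again after explicit errors, up to a fixed number of attempts."""
    last_error = None
    for _ in range(max_attempts):
        response = tool()
        if "error" not in response:
            return response
        last_error = response["error"]
    raise RuntimeError(f"tool still failing after {max_attempts} attempts: {last_error}")


# A transient fault: the first call fails, the second succeeds.
responses = iter([
    {"error": "HTTP 500 Internal Server Error"},
    {"status": "ok", "holds_resolved": 2},
])
result = call_with_retry(lambda: next(responses))
```

Because the injected faults are transient, a single retry is enough here; stopping at the first error, as in the case above, leaves the remaining required actions undone.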
Under E2 implicit fault injection, Kimi K2.5 received truncated property data (only 2 of 15 units returned). Instead of detecting the incomplete data and re-querying, it assumed all 15 units followed the same pattern as the 2 sampled units. This led to an incorrect net operating income (NOI) calculation ($362,000 vs. actual) and a wrong debt service coverage ratio (DSCR) assessment (PASS instead of FAIL), demonstrating the danger of accepting degraded data at face value.
Simulation fidelity: Language Environment Simulators model domain logic rather than domain data. An LES understands that a drug interaction check should return contraindications, but the specific values are generated rather than retrieved from a real database. OccuBench evaluates an agent's decision-making process rather than its ability to handle exact real-world data values.
Simulator dependence: Evaluation results are tied to the specific simulator used during data synthesis. Tasks verified as solvable under one LES may become unsolvable under a different one, and agent rankings can shift when the simulator changes. The simulator is part of the evaluation apparatus, not a neutral observer.
OccuBench is the first benchmark systematically evaluating AI agents on real-world professional tasks across 100 scenarios, 65 specialized domains, and 10 industry categories. Through Language Environment Simulators, OccuBench makes the “untestable majority” of professional domains evaluable without any real environment infrastructure.
The evaluation of 15 frontier models reveals that: (1) no model dominates across all industries; (2) implicit environmental faults are harder than explicit and mixed faults; (3) scaling consistently improves performance; and (4) strong agents are not necessarily strong environment simulators. These findings have practical implications: organizations should select agent models based on their specific industry needs, invest in robustness testing beyond happy-path evaluation, and carefully validate simulator quality in LES-based benchmarks.