
CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar

ServiceNow · University of Waterloo · Mila · Université de Montréal · McGill University · University of Oxford · National University of Singapore

Project Page →
  • 55 hours: 6 million frames of continuous 30fps expert video across 10,000 tasks
  • 87: professional desktop applications across 12 diverse categories
  • 497 words: average length of per-step multi-layered reasoning annotations
  • 3.6M: UI element annotations across 56K screenshots in GroundCUA

The CUA-Suite Ecosystem

Figure 1: CUA-Suite Overview. Human GUI trajectories are recorded across desktop platforms, expert-verified, and annotated with keyframes, bounding boxes, and interaction logs. The suite comprises UI-Vision, GroundCUA, and VideoCUA.

What is a "Computer-Use Agent" (CUA)?

A Computer-Use Agent is an AI system that can operate a desktop computer by observing the screen (as pixels or parsed UI elements) and performing actions like clicking, typing, dragging, and scrolling. Unlike web APIs or code execution, CUAs interact with software visually — the same way a human does — making them applicable to any application without needing special integrations. The core challenges are: understanding what's currently on screen, knowing what action achieves the goal, and precisely localizing where to click on a complex interface.

VideoCUA
  • ~10,000 human-demonstrated tasks
  • 55 hours of continuous 30fps screen recordings
  • Kinematic cursor traces with millisecond precision
  • Multi-layered reasoning annotations (observation, thought, action, reflection)
  • Format-compatible with OpenCUA and ScaleCUA
GroundCUA
  • 56K annotated screenshots
  • 3.6 million UI element annotations
  • Human-verified bounding boxes at pixel precision
  • 8 semantic categories for 50% of elements
  • 700K instruction-tuning dataset for grounding
UI-Vision
  • 450 high-quality task demonstrations
  • Element Grounding: localize UI elements from text
  • Layout Grounding: identify functionally related groups
  • Action Prediction: predict next correct action
  • Multi-faceted diagnosis of agent failures

How VideoCUA Compares

Why do existing datasets fall short for desktop CUAs?

Most CUA research has focused on web browsers and mobile apps because those environments are easy to instrument automatically. Web DOM provides ground-truth element locations; mobile accessibility trees provide semantic labels. Professional desktop applications (Blender, FreeCAD, Krita, QGIS) use custom-drawn widgets with no accessibility metadata — the model must understand the UI purely from pixels. Existing datasets either cover the wrong platforms, use only still screenshots (missing the temporal trajectory of multi-step tasks), or are synthetically generated (introducing noise and unrealistic interaction patterns). VideoCUA addresses all three gaps simultaneously.

Existing GUI trajectory datasets face critical limitations: web and mobile datasets like Mind2Web and AITW lack desktop coverage; screenshot-based datasets like ScaleCUA and OpenCUA miss temporal dynamics by capturing only final click coordinates; and synthesized datasets suffer from noise inherent to automated generation. VideoCUA is the only dataset that simultaneously provides continuous 30fps video, desktop focus, human curation, and rich multi-layered chain-of-thought annotations at scale—more than 2.5× the largest existing open dataset.


Table 2: Comparison of VideoCUA with existing GUI trajectory and agent datasets. VideoCUA is the only dataset providing continuous 30fps video for professional desktop applications with long multi-layered CoT annotations.

Evaluation Results

Element Grounding (UI-Vision)
Table 1: Element Grounding Performance on UI-Vision

MAI-UI-32B achieves 47.7% average accuracy, leading the 16 evaluated models. While the Basic and Functional categories approach 60%, the Spatial split remains stubbornly difficult (at most 26.9%), indicating that spatial reasoning is a major hurdle.

What is "Element Grounding" and the @50px metric?

Element Grounding means: given a text description of a UI element ("the save button", "the layers panel scroll bar"), predict the pixel coordinates of that element on the screen. The @50px metric counts a prediction as correct only if it falls within 50 pixels of the ground-truth center. This threshold is meaningful because desktop UI elements are often small — a toolbar button might be 24×24 pixels. Getting within 50px means you're close enough to actually click the right thing. The "Spatial" category (e.g., "the third icon from the left in the top toolbar") is hardest because it requires counting and relative spatial reasoning, not just matching visual appearance to a text label.
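The @50px criterion is straightforward to compute: a prediction counts as a hit if its Euclidean distance to the ground-truth center is at most 50 pixels. A minimal sketch (the function names are our own, not from the benchmark's released code):

```python
import math

def at_50px(pred, gt, threshold_px=50):
    """True if the predicted point lies within `threshold_px`
    of the ground-truth element center (the @50px criterion)."""
    return math.dist(pred, gt) <= threshold_px

def grounding_accuracy(preds, gts, threshold_px=50):
    """Fraction of predictions that satisfy the @50px criterion."""
    hits = sum(at_50px(p, g, threshold_px) for p, g in zip(preds, gts))
    return hits / len(preds)

preds = [(100, 200), (400, 405), (10, 990)]
gts   = [(130, 240), (400, 400), (600, 500)]
print(grounding_accuracy(preds, gts))  # 2 of 3 within 50px -> 0.666...
```

Note that the first example is exactly 50px away (a 30-40-50 right triangle) and still counts as a hit, since the threshold is inclusive.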

Scaling yields consistent gains: OpenCUA improves by 7.6 points from 7B to 72B. Pairing PhiGround-7B with an o3 planner adds 9.0 points, showing that reasoned instructions mitigate execution errors.

Action Prediction (VideoCUA)
Table 3: Action Prediction Results

OpenCUA-32B achieves 37.7% @50px success rate (vs 16.5% for 7B) across 256 sampled tasks spanning 87 applications.

Human evaluation reveals a critical asymmetry: action correctness reaches 85.9% while grounding correctness is only 52.4%—models frequently identify the correct action type but fail to precisely localize the target UI element.

The action vs. grounding asymmetry explained

This asymmetry is a key finding of the paper. "Action correctness" (85.9%) means the model understands what to do: "click the color picker". "Grounding correctness" (52.4%) means the model knows where to click: the exact pixel location. The gap (33.5 points) reveals that current models have strong semantic understanding of task intent but weak spatial precision. Crucially, if you click the wrong element — even with the right intent — the task fails. This suggests that the core bottleneck for professional desktop CUAs is not reasoning but precise visual localization, pointing to the value of dense pixel-level annotation data like GroundCUA's 3.6M bounding boxes.
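The asymmetry can be made concrete with a toy tally: a step succeeds only when both the action type and the grounding are correct, so stepwise success is bounded above by the weaker of the two. The step records below are illustrative, not drawn from the paper's evaluation set.

```python
# Toy step records: (predicted action type, true type, pixel error of the click).
steps = [
    ("click", "click", 12),   # right action, well grounded
    ("click", "click", 180),  # right action, wrong element
    ("type",  "type",  30),   # right action, well grounded
    ("click", "drag",  20),   # wrong action type entirely
]

action_ok    = [pred == true for pred, true, _ in steps]
grounding_ok = [err <= 50 for _, _, err in steps]
step_ok      = [a and g for a, g in zip(action_ok, grounding_ok)]

print(sum(action_ok) / len(steps))     # 0.75
print(sum(grounding_ok) / len(steps))  # 0.75
print(sum(step_ok) / len(steps))       # 0.5 -- bounded by both factors
```

In the paper's numbers, grounding (52.4%) is the weaker factor, so it caps end-to-end stepwise success regardless of how well the model understands intent.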

Representative Prediction Failures

Figure 2: Representative Prediction Failures showing Krita, FreeCAD, Inkscape, and OBS Studio errors

Figure 2 illustrates a common failure pattern: models struggle to disambiguate visually similar interactive elements distributed across complex, multi-panel interfaces. In Krita, the model targets the Layers panel instead of the tool sidebar (cross-panel confusion). In FreeCAD, it confuses the toolbar with the model tree. These failures are characteristic of professional desktop applications—precisely the domain where existing training data is scarcest.

87 Applications Across 12 Categories

CUA-Suite prioritizes open-source applications with permissive licenses across 12 categories, from software development (VS Code, Eclipse, PyCharm) to content creation (Blender, Inkscape, Krita) and finance (GnuCash, Frappe Books). These applications mirror closed-source counterparts, ensuring broad applicability.


Table 4: Categories of desktop applications and their corresponding applications in CUA-Suite.

Dramatic Performance Variance

Per-application performance varies 20× depending on interface complexity:

@50px success rates range from 3.6% on Darktable (creative tools) to 73.3% on OnlyOffice (web-like apps).

Applications with specialized visual interfaces—creative tools (Darktable, Krita), canvas-based tools (FreeCAD, QGIS), and media applications (Kodi)—exhibit the lowest success rates. Web-like applications (browsers, spreadsheets, IDEs) with standard toolbar arrangements better align with existing model training distributions.

Multi-layered Reasoning Trajectories

VideoCUA enriches raw video recordings with dense trajectory annotations using a multi-layered reasoning synthesis pipeline. For each keyframe in a task trajectory, four complementary annotation layers are generated, averaging 497 words per step:

Why are chain-of-thought annotations at 497 words/step useful for training?

When you train a model to predict actions, the simplest label is just the click coordinates. But models trained this way learn to pattern-match without understanding why. By providing a 497-word annotation that describes what the agent observes on screen, what it's thinking, what action it's taking, and what it expects to happen next, you give the model a rich supervision signal. During training, the model learns to produce (or implicitly "think") this reasoning chain before predicting the action. This technique — called "chain-of-thought distillation" — is the same approach that has dramatically improved LLM reasoning. Applied to GUI agents, it teaches the model to reason about interface state before acting, reducing random-click failures.

Observation
Detailed description of the current screen state, identifying relevant UI elements and their spatial arrangement.
Thought
Reasoning chain connecting the high-level task goal to the immediate action choice.
Action
Intended action described in natural language grounded to visual elements rather than raw coordinates.
Reflection
Analysis of the outcome, enabling self-correction signals for training.

Example: Krita Digital Art Task

The following screenshots show a trajectory from a Krita task: “Draw a circle shape, fill green color.” Each step captures the screen state alongside the four annotation layers.

Implications & Future Directions

Both automated and human evaluations converge on the same conclusion: current foundation action models struggle substantially with professional desktop applications, achieving only 37.7% @50px and 57.6% human-verified stepwise accuracy. The wide per-application variance confirms that the core difficulty lies in the diverse visual vocabularies and interaction patterns of professional desktop software, where existing training data is scarce.

VideoCUA directly targets this domain gap through domain coverage (87 professional applications), video scale (55 hours of continuous 30fps recordings), annotation density (~497 words per step), and action diversity (drags, fine-grained mouse control) that web-centric datasets underrepresent.

Generalist Screen Parsing
Dense human-verified bounding-box annotations for training robust desktop screen parsers that cover canvas-based and custom-drawn widgets.
Continuous Spatial Control
Kinematic cursor trajectories preserving human movement priors (Fitts’s Law) for learning continuous mouse movement policies from visual feedback.

What is Fitts's Law and why does it matter for CUA training?

Fitts's Law is a predictive model of human movement: the time to move to a target depends on the distance to the target and its size (T = a + b·log2(2D/W)). When humans use a mouse, cursor paths are not straight lines — they decelerate as they approach the target, overshoot and self-correct, and move faster over open space. These kinematic properties encode implicit knowledge about target size and location. If a model learns from human cursor trajectories (rather than just click coordinates), it implicitly learns Fitts's Law — knowing that a "slow, careful" trajectory is heading toward a small or precise target. This could enable CUAs to generalize mouse movement to new interface layouts without explicit coordinate training.
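The law's qualitative prediction is easy to verify numerically: movement time grows with distance and shrinks with target width. The coefficients `a` and `b` below are illustrative placeholders, not fitted values.

```python
import math

def fitts_time(distance_px: float, width_px: float,
               a: float = 0.1, b: float = 0.15) -> float:
    """Predicted movement time (seconds) under Fitts's original form:
    T = a + b * log2(2D / W). Coefficients a, b are illustrative."""
    return a + b * math.log2(2 * distance_px / width_px)

# A small, distant target (24px toolbar button, 800px away) takes longer
# than a large, nearby one (200px panel, 300px away).
small_far  = fitts_time(800, 24)
large_near = fitts_time(300, 200)
print(small_far > large_near)  # True
```

A cursor-movement policy trained on human trajectories would internalize exactly this tradeoff: slow, decelerating approaches signal a small or precise target.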

Visual World Models
Dense 30fps state-action-next-state triplets for action-conditioned video generation and visual lookahead planning.
Video-Based Reward Modeling
Continuous expert video recordings as positive demonstrations for training fine-grained, step-wise reward models.

All data, benchmarks, and models are publicly released. · arXiv:2603.24440

View Services Contact Us