Workspace-Bench

Workspace-Bench is a benchmark for evaluating AI agents on workspace tasks with large-scale file dependencies. It is built to study a capability we call Workspace Learning: whether an agent can identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a real worker's workspace.

Unlike benchmarks that place all information directly in the prompt or provide a small bundle of task-specific files, Workspace-Bench evaluates agents in realistic workspaces where they must independently explore directories, locate relevant evidence, understand cross-file relations, and produce correct deliverables.

[Figure: Workspace-Bench framework overview]

What is Workspace Learning?

Workspace Learning is the ability of an AI agent to:

  1. Identify explicit and implicit dependencies among files in a workspace
  2. Reason over heterogeneous data formats (documents, spreadsheets, code, images, etc.)
  3. Exploit cross-file relationships to complete multi-step tasks
  4. Update existing files while preserving consistency across the workspace
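The first of these steps can be made concrete with a small sketch. The heuristic below (a mention of one file's name inside another file counts as a dependency edge) is purely illustrative and is not how Workspace-Bench actually constructs its dependency graphs:

```python
from pathlib import Path
from collections import defaultdict

def build_dependency_graph(workspace: str) -> dict[str, set[str]]:
    """Illustrative sketch: treat a mention of one file's name inside
    another file as an explicit dependency edge. This is a hypothetical
    heuristic for exposition, not the benchmark's construction."""
    root = Path(workspace)
    files = [p for p in root.rglob("*") if p.is_file()]
    names = {p.name for p in files}
    graph: dict[str, set[str]] = defaultdict(set)
    for p in files:
        try:
            text = p.read_text(errors="ignore")
        except (OSError, UnicodeDecodeError):
            # Skip binary or unreadable files
            continue
        for name in names:
            if name != p.name and name in text:
                graph[str(p.relative_to(root))].add(name)
    return dict(graph)
```

An agent with Workspace Learning capability must go well beyond such string matching: implicit dependencies (e.g., a spreadsheet whose totals a report summarizes without naming the file) leave no textual trace to grep for.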

Key Statistics

  • 5 realistic worker profiles
  • 74 file types across heterogeneous environments
  • 20,476 files, with workspaces up to 20GB
  • 388 tasks, each with an explicit file dependency graph
  • 7,399 fine-grained rubrics for evaluation
  • Workspace-Bench-Lite: a 100-task subset reducing cost by ~70%
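To make the relationship between tasks, dependency graphs, and rubrics concrete, here is a hypothetical task record. All field names, values, and the scoring helper are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical task record; field names and values are illustrative
# assumptions, not Workspace-Bench's actual data format.
task = {
    "task_id": "finance-0042",
    "profile": "financial_analyst",
    "instruction": "Update the Q3 summary to reflect the revised budget.",
    "dependency_graph": {
        "nodes": ["q3_summary.docx", "budget_v2.xlsx", "notes/meeting.md"],
        "edges": [
            ("q3_summary.docx", "budget_v2.xlsx"),
            ("q3_summary.docx", "notes/meeting.md"),
        ],
    },
    "rubrics": [
        {"id": "r1", "criterion": "Totals match budget_v2.xlsx", "weight": 2},
        {"id": "r2", "criterion": "Original formatting preserved", "weight": 1},
    ],
}

def rubric_score(results: dict[str, bool], rubrics: list[dict]) -> float:
    """Weighted fraction of rubric criteria satisfied (illustrative)."""
    total = sum(r["weight"] for r in rubrics)
    earned = sum(r["weight"] for r in rubrics if results.get(r["id"], False))
    return earned / total if total else 0.0
```

Fine-grained rubrics like these allow partial credit per task rather than a single pass/fail signal, which is what makes 7,399 rubrics across 388 tasks meaningful.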

The SWE-bench Connection

Workspace-Bench shares design philosophy with SWE-bench: both evaluate agents on real-world tasks with concrete success criteria. While SWE-bench focuses on resolving GitHub issues through code patches, Workspace-Bench focuses on cross-file reasoning and production of heterogeneous deliverables in realistic workplace environments.

Next Steps

  • Quick Start — Run your first evaluation in minutes
  • Dataset — Understand the task distribution and formats
  • Evaluation — Learn how to run full benchmarks and interpret results