Skip to content

Dataset

Workspace-Bench contains realistic workspace tasks designed to evaluate an agent's ability to understand and manipulate large-scale file dependencies.

Overview

Dataset distribution

Worker Profiles

Tasks are organized around 5 realistic worker profiles, each with distinct workspace environments and file types:

Profile Role Typical Files
Operations Manager (yunying) Event planning, process management Docs, spreadsheets, presentations
Logistics Manager (houqin) Supply chain, vendor coordination Contracts, schedules, budgets
AI Product Manager (chanpin) Product specs, user research PRDs, mockups, roadmaps
Researcher (research) Academic writing, data analysis Papers, notebooks, datasets
Backend Developer (kaifa) API design, database schemas Code, configs, SQL

Task Structure

Each task consists of:

  • metadata.json — Task description, expected outputs, and rubrics
  • data/ — Input files that populate the workspace
  • File Dependency Graph — Explicit from -> to relationships between files
  • Rubrics — Fine-grained evaluation criteria (7,399 total across all tasks)

Metadata Format

{
  "absolute_id": 100,
  "persona": "Logistics Manager",
  "task": "Integrate the contents of four files and organize a complete onsite_hosting_execution_manual.doc...",
  "task_diff": "medium",
  "output_files": ["onsite_hosting_execution_manual.doc"],
  "rubrics": [
    "In onsite_hosting_execution_manual.doc, is the hosting content for the warm-up and opening section complete..."
  ],
  "rubric_types": ["Process Evaluation", "Outcome Evaluation"],
  "file_dep_graph": [
    {"from": "host_script_1.docx", "to": "onsite_hosting_execution_manual.doc"}
  ],
  "data_manifest": [
    {"filename": "host_script_1.docx", "stored_relpath": "data/a60fb401fab41412_host_script_1.docx"}
  ]
}

Dataset Splits

Workspace-Bench-Lite

A curated 100-task subset that preserves the full benchmark's distribution across personas, difficulties, and file types while reducing evaluation cost by approximately 70%.

python3 scripts/download_hf_assets.py --lite --workspaces

Full Workspace-Bench

The complete 388-task dataset with all workspaces.

python3 scripts/download_hf_assets.py --full

File Types

Workspace-Bench spans 74 file types, including but not limited to:

  • Documents: .doc, .docx, .pdf, .md, .txt
  • Spreadsheets: .xls, .xlsx, .csv
  • Presentations: .ppt, .pptx
  • Code: .py, .js, .sql, .yaml, .json
  • Images: .png, .jpg, .webp
  • Archives: .zip

Workspace Scale

  • Total files: 20,476
  • Max workspace size: up to 20GB
  • Tasks per persona: ~60-100
  • Average files per task: ~50-200

Accessing the Datasets

Datasets are hosted on Hugging Face:

You can also load them programmatically:

from datasets import load_dataset

lite = load_dataset("Workspace-Bench/Workspace-Bench-Lite", split="test")