Quick Start Guide

This guide will help you get started with Workspace-Bench, from installation to running your first evaluation.

Prerequisites

Docker
Python 3
API credentials for the agent you want to run
Node.js (≥ 18, only for the visualization dashboard)
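Before going further, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of Workspace-Bench itself: the helper name `find_missing` is hypothetical, and the check only looks for the executables; it does not verify the Node.js version (≥ 18) or your API credentials.

```python
import shutil

# Executables the guide relies on; "node" is only needed for the dashboard.
REQUIRED_TOOLS = ["docker", "python3"]
OPTIONAL_TOOLS = ["node"]


def find_missing(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    missing_optional = find_missing(OPTIONAL_TOOLS)
    if missing:
        print("Missing required tools:", ", ".join(missing))
    if missing_optional:
        print("Missing optional tools (dashboard only):", ", ".join(missing_optional))
    if not missing and not missing_optional:
        print("All tools found on PATH.")
```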

Setup

First, clone the repository and prepare your environment:

git clone https://github.com/OpenDataBox/Workspace-Bench.git
cd Workspace-Bench/evaluation
cp .env.example .env

Fill in .env with your API credentials before running an evaluation. For the default smoke command below, set KIMIK25_BASE_URL and KIMIK25_API_KEY.
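A quick way to catch an incomplete .env before starting a run is to scan it for the two variables named above. This is a rough sketch under stated assumptions: it assumes .env uses plain `KEY=VALUE` lines with optional `#` comments (it does not handle `export` prefixes or multi-line values), and the helpers `parse_env` and `missing_keys` are hypothetical names, not part of the repository.

```python
from pathlib import Path

# Variables the guide names for the default smoke command.
REQUIRED_KEYS = ("KIMIK25_BASE_URL", "KIMIK25_API_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or set to an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # run from Workspace-Bench/evaluation
    if not env_path.exists():
        print("No .env found; run `cp .env.example .env` first.")
    else:
        missing = missing_keys(env_path.read_text())
        print("Missing or empty:", ", ".join(missing) if missing else "none")
```

If you use a different provider, pass its variable names as `required`; the names to use are listed in .env.example.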

Supported Providers

Workspace-Bench supports multiple model providers. See .env.example for the full list of environment variables.

Download Data

Download the Lite task set and workspace files:

python3 scripts/download_hf_assets.py --lite --workspaces

This will populate evaluation/tasks_lite/ with task metadata and evaluation/filesys/ with the corresponding workspace files.
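To confirm the download finished, you can check that both directories exist and are non-empty. This is only a sketch: it assumes you run it from Workspace-Bench/evaluation, it does not validate the contents of either directory, and `is_populated` is a hypothetical helper name.

```python
from pathlib import Path


def is_populated(directory):
    """True if `directory` exists and contains at least one entry."""
    path = Path(directory)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    # Paths are relative to Workspace-Bench/evaluation.
    for name in ("tasks_lite", "filesys"):
        status = "ok" if is_populated(name) else "missing or empty"
        print(f"{name}/: {status}")
```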

Build Environment

Build the Docker image and bootstrap the evaluation environment:

docker compose -f docker/docker-compose.yaml build
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/bootstrap.sh

Run One Task (Smoke Test)

Run a single-task smoke evaluation with the Codex harness:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset smoke

From the evaluation/ directory, check the report:

python3 scripts/assert_agent_runner_report.py \
  output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json

The expected output is:

[ok] output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json: 1/1 passed
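If you want to consume this summary line in a script (for example in CI), one option is to parse it with a regular expression. The pattern below is inferred only from the single example line above, so treat it as an assumption about the format rather than a documented contract; `parse_summary` is a hypothetical helper.

```python
import re

# Pattern inferred from the one expected-output line shown in this guide.
SUMMARY = re.compile(
    r"^\[(?P<status>\w+)\] (?P<path>.+): (?P<passed>\d+)/(?P<total>\d+) passed$"
)


def parse_summary(line):
    """Return status, path, and pass counts from a summary line, or None."""
    match = SUMMARY.match(line.strip())
    if match is None:
        return None
    return {
        "status": match.group("status"),
        "path": match.group("path"),
        "passed": int(match.group("passed")),
        "total": int(match.group("total")),
    }


if __name__ == "__main__":
    line = "[ok] output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json: 1/1 passed"
    print(parse_summary(line))
```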

Task outputs and logs are written to:

evaluation/output/Codex--Kimi-K2.5--Smoke/

Run Workspace-Bench-Lite

Run the 100-task Lite benchmark:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset lite
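The smoke and Lite commands differ only in the `--dataset` value, so a small wrapper can build either one. This is a sketch, not a tool shipped with the repository: `benchmark_command` is a hypothetical helper, and it uses only the flags and values that appear in this guide (`--harness codex`, `--model kimi-k2.5`, `--dataset smoke` or `lite`).

```python
import shlex

COMPOSE_FILE = "docker/docker-compose.yaml"
RUN_SCRIPT = "/workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh"


def benchmark_command(dataset, harness="codex", model="kimi-k2.5"):
    """Build the docker compose argument list for one benchmark run."""
    return [
        "docker", "compose", "-f", COMPOSE_FILE,
        "run", "--rm", "workspace-bench",
        "bash", RUN_SCRIPT,
        "--harness", harness,
        "--model", model,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is launched by accident.
    for dataset in ("smoke", "lite"):
        print(shlex.join(benchmark_command(dataset)))
```

To actually launch a run from Workspace-Bench/evaluation, pass the list to `subprocess.run(benchmark_command("lite"), check=True)`.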

Visualize Results

After running evaluations, start the visualization dashboard:

cd ../viz
npm install
npm run dev

The dashboard is served at http://localhost:5173 and automatically discovers results under evaluation/output/.
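If the page does not load, a simple TCP probe can tell you whether anything is listening on the dashboard port. This is a minimal sketch: the port 5173 is taken from the URL above, `port_open` is a hypothetical helper, and a successful connection only shows that some process is listening there, not that it is the dashboard.

```python
import socket


def port_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("dashboard reachable:", port_open("localhost", 5173))
```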

Next Steps

Dataset: Learn about task formats and the Lite vs. Full splits
Evaluation: Explore advanced evaluation options and multiple harnesses