# Quick Start Guide
This guide will help you get started with Workspace-Bench, from installation to running your first evaluation.
## Prerequisites
- Docker
- Python 3
- API credentials for the agent you want to run
- Node.js (≥ 18, only for the visualization dashboard)
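You can quickly confirm the toolchain is in place before proceeding (optional; these are just the standard version flags):

```bash
docker --version    # Docker Engine / CLI
python3 --version   # any recent Python 3.x
node --version      # v18 or newer; only needed for the dashboard
```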
## Setup
First, clone the repository and prepare your environment:
```bash
git clone https://github.com/OpenDataBox/Workspace-Bench.git
cd Workspace-Bench/evaluation
cp .env.example .env
```
Fill `.env` with your API credentials before running an evaluation. For the default smoke command below, set `KIMIK25_BASE_URL` and `KIMIK25_API_KEY`.
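For the smoke test, that means at minimum the following two entries (the variable names come from this guide; the values below are placeholders, substitute your real endpoint and key):

```bash
# .env — placeholder values only; replace with your actual credentials
KIMIK25_BASE_URL=https://api.example.com/v1   # your provider's endpoint (example URL)
KIMIK25_API_KEY=sk-your-key-here              # your API key (placeholder)
```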
> **Supported Providers:** Workspace-Bench supports multiple model providers. See the `.env.example` file for the full list of environment variables.
## Download Data
Download the Lite task set and workspace files. This populates `evaluation/tasks_lite/` with task metadata and `evaluation/filesys/` with the corresponding workspace files.
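Once the download finishes, a quick optional sanity check (plain shell, run from the `evaluation/` directory you changed into earlier):

```bash
# Confirm the data landed where expected (run from evaluation/)
ls tasks_lite/ | head   # task metadata
ls filesys/ | head      # workspace files
```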
## Build Environment
Build the Docker image and bootstrap the evaluation environment:
```bash
docker compose -f docker/docker-compose.yaml build
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
    bash /workspace/Workspace-Bench/evaluation/docker/bootstrap.sh
```
## Run One Task (Smoke Test)
Run a single-task smoke evaluation with the Codex harness:
```bash
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
    bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
    --harness codex \
    --model kimi-k2.5 \
    --dataset smoke
```
Check the report:
```bash
python3 scripts/assert_agent_runner_report.py \
    output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json
```
Task outputs and logs are written to `output/Codex--Kimi-K2.5--Smoke/`.
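To eyeball the raw report (optional; uses Python's built-in JSON pretty-printer):

```bash
# Pretty-print the smoke-test report
python3 -m json.tool output/Codex--Kimi-K2.5--Smoke/agent_runner_report.json
```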
## Run Workspace-Bench-Lite
Run the 100-task Lite benchmark:
```bash
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
    bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
    --harness codex \
    --model kimi-k2.5 \
    --dataset lite
```
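You can validate the Lite report the same way as the smoke run. Note that the output directory name below is inferred from the smoke run's `Codex--Kimi-K2.5--Smoke` naming pattern, so verify it against your actual `output/` contents:

```bash
# Directory name is an assumption based on the smoke run's naming pattern
python3 scripts/assert_agent_runner_report.py \
    output/Codex--Kimi-K2.5--Lite/agent_runner_report.json
```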
## Visualize Results
After running evaluations, you can start the visualization dashboard.
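The exact start command isn't included in this excerpt. Since the dashboard requires Node.js and serves on port 5173 (Vite's default), a standard Vite workflow is a reasonable sketch; the `dashboard/` directory name is hypothetical, so check the repository layout:

```bash
# Assumed Vite-style workflow — the dashboard/ path is a guess
cd dashboard
npm install   # install dashboard dependencies
npm run dev   # start the dev server (Vite defaults to port 5173)
```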
The dashboard will be available at http://localhost:5173 and automatically discovers results under `evaluation/output/`.
## Next Steps
- Dataset — Learn about task formats and the Lite vs Full splits
- Evaluation — Explore advanced evaluation options and multiple harnesses