Evaluation Guide¶
This guide explains how to evaluate AI agents on Workspace-Bench tasks.
Overview¶
Workspace-Bench evaluates agents by placing them in realistic workspace environments, providing a task description, and measuring their ability to produce correct outputs against fine-grained rubrics. The evaluation supports multiple agent harnesses and can be run via Docker for reproducibility.
Supported Harnesses¶
| Harness | Description | API Compatibility |
|---|---|---|
| `codex` | OpenAI Codex / Responses API | OpenAI Responses → Chat Completions adapter |
| `openclaw` | OpenClaw agent harness | OpenAI-compatible Chat Completions |
| `deepagent` | DeepAgents harness (LangChain) | OpenAI-compatible |
| `claudecode` | Claude Code harness | Anthropic API |
Supported Models¶
Common model aliases include:
- `gpt-5.4`
- `gemini-3.1-pro`
- `kimi-k2.5`
- `glm-5.1`
- `minimax-m2.7`
- `grok-4.3`
- `qwen-3.6`
For a custom provider, add `--model-id`, `--model-name`, and `--env-prefix` to the run command (see Custom Providers below).
Running Evaluations¶
Basic Evaluation on Lite¶
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness codex \
--model kimi-k2.5 \
--dataset lite
Evaluation on the Full Benchmark¶
python3 scripts/download_hf_assets.py --full
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness codex \
--model kimi-k2.5 \
--dataset full
Using Different Harnesses¶
# OpenClaw + GLM
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness openclaw \
--model glm-5.1 \
--dataset lite
# DeepAgent + MiniMax
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness deepagent \
--model minimax-m2.7 \
--dataset lite
Evaluation Outputs¶
Completed runs are stored under `evaluation/output/`, with one directory per run and one subdirectory per task.
Each task directory contains:
- `metadata.json` — Task definition
- `agent.json` — Execution trace, token usage, and status
- `output/` — Files produced by the agent
- `rubrics_judge--{model}.json` — Rubric evaluation results
- `dependency_graph--{model}.json` — Extracted I/O dependency graph
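Put together, a single task's directory looks roughly like this (the run and task directory names below are placeholders, and the judge model name depends on your configuration):

evaluation/output/<run-dir>/<task-dir>/
├── metadata.json
├── agent.json
├── output/
├── rubrics_judge--{model}.json
└── dependency_graph--{model}.json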
Interpreting Results¶
The `agent_runner_report.json` file at the run root contains:
{
"summary": {
"total": 100,
"passed": 67,
"failed": 20,
"error": 8,
"timeout": 5
},
"cases": [...]
}
In this report, a task is marked passed if the agent run completed and produced output files; final correctness is determined by the rubric judgment described in the next section.
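As a quick check, the summary block can be turned into a pass rate with a few lines of Python. The snippet below is a minimal sketch; `<run-dir>` is a placeholder for the actual run directory under `evaluation/output/`.

# Sketch: print a pass rate from agent_runner_report.json
import json
from pathlib import Path

report = json.loads(Path("evaluation/output/<run-dir>/agent_runner_report.json").read_text())
summary = report["summary"]
print(f"passed {summary['passed']}/{summary['total']} "
      f"(failed={summary['failed']}, error={summary['error']}, timeout={summary['timeout']})")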
Rubric Judgment¶
Rubric files contain per-criterion evaluations:
{
"rubrics": [
{
"index": 0,
"rubric": "Is the output format correct?",
"passed": true,
"confidence": 0.95,
"evidence": "File output.docx contains properly formatted sections..."
}
],
"summary": {
"total": 7,
"passed": 5,
"failed": 2
}
}
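To aggregate rubric results across every task in a run, one option is to glob for the judge files and sum the per-task summaries. This is a rough sketch that assumes the directory layout described above; adjust the run path to match your setup.

# Sketch: sum rubric pass counts across all tasks in a run
import json
from pathlib import Path

run_dir = Path("evaluation/output/<run-dir>")  # placeholder run directory
total = passed = 0
for rubric_file in run_dir.rglob("rubrics_judge--*.json"):
    summary = json.loads(rubric_file.read_text())["summary"]
    total += summary["total"]
    passed += summary["passed"]

print(f"rubric criteria passed: {passed}/{total}")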
Advanced Options¶
Custom Providers¶
For models not in the predefined alias list:
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
--harness codex \
--model-id my-model \
--model-name My-Model \
--env-prefix MYMODEL \
--dataset lite
Ensure `MYMODEL_BASE_URL` and `MYMODEL_API_KEY` are set in `.env`.
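For illustration, the corresponding `.env` entries might look like this (both values are placeholders for your provider's endpoint and key):

# .env
MYMODEL_BASE_URL=https://api.example.com/v1
MYMODEL_API_KEY=sk-your-key-here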
Running Without Docker¶
You can also run evaluations directly if you have the dependencies installed:
cd evaluation
python3 -m pip install -r requirements.txt  # if available
python3 src/agent_runner.py --run-config runs/my_config.yaml
Next Steps¶
- Visualization — Browse results in the web dashboard