
Evaluation Guide

This guide explains how to evaluate AI agents on Workspace-Bench tasks.

Overview

Workspace-Bench evaluates agents by placing them in realistic workspace environments, providing a task description, and measuring their ability to produce correct outputs against fine-grained rubrics. The evaluation supports multiple agent harnesses and can be run via Docker for reproducibility.

Supported Harnesses

Harness      Description                       API Compatibility
-----------  --------------------------------  -------------------------------------------
codex        OpenAI Codex / Responses API      OpenAI Responses → Chat Completions adapter
openclaw     OpenClaw agent harness            OpenAI-compatible Chat Completions
deepagent    DeepAgents harness (LangChain)    OpenAI-compatible
claudecode   Claude Code harness               Anthropic API

Supported Models

Common model aliases include:

  • gpt-5.4
  • gemini-3.1-pro
  • kimi-k2.5
  • glm-5.1
  • minimax-m2.7
  • grok-4.3
  • qwen-3.6

For a custom provider, pass --model-id, --model-name, and --env-prefix to the run command (see Custom Providers below).

Running Evaluations

Basic Evaluation on Lite

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset lite

Evaluation on the Full Benchmark

python3 scripts/download_hf_assets.py --full

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model kimi-k2.5 \
  --dataset full

Using Different Harnesses

# OpenClaw + GLM
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness openclaw \
  --model glm-5.1 \
  --dataset lite

# DeepAgent + MiniMax
docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness deepagent \
  --model minimax-m2.7 \
  --dataset lite

Evaluation Outputs

Completed runs are stored under evaluation/output/ with the naming convention:

{Harness}--{Model}--{Dataset}/

Each task directory contains the following files (an example run layout follows the list):

  • metadata.json — Task definition
  • agent.json — Execution trace, token usage, and status
  • output/ — Files produced by the agent
  • rubrics_judge--{model}.json — Rubric evaluation results
  • dependency_graph--{model}.json — Extracted I/O dependency graph
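
Putting the naming convention and file list together, a completed run might be laid out as follows. The run name, task directory names, and judge-model suffix here are illustrative placeholders, not fixed values:

evaluation/output/
  codex--kimi-k2.5--lite/
    agent_runner_report.json
    task_0001/
      metadata.json
      agent.json
      output/
      rubrics_judge--gpt-5.4.json
      dependency_graph--gpt-5.4.json
    task_0002/
      ...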

Interpreting Results

The agent_runner_report.json at the run root contains:

{
  "summary": {
    "total": 100,
    "passed": 67,
    "failed": 20,
    "error": 8,
    "timeout": 5
  },
  "cases": [...]
}

A task is marked passed in this report if the agent successfully produced output files; final correctness is determined separately by rubric judgment.
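
As a quick sanity check, a minimal script along these lines can summarize a run from the report. The run directory name in the path is an example; point it at your own run:

import json
from pathlib import Path

# Path to a completed run's report; adjust the run directory name to match yours.
report_path = Path("evaluation/output/codex--kimi-k2.5--lite/agent_runner_report.json")

report = json.loads(report_path.read_text())
summary = report["summary"]

# "passed" here only means the agent produced output files;
# rubric judgment determines final correctness.
total = summary["total"]
for key in ("passed", "failed", "error", "timeout"):
    count = summary.get(key, 0)
    print(f"{key:>8}: {count:4d} ({count / total:.1%})")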

Rubric Judgment

Rubric files contain per-criterion evaluations:

{
  "rubrics": [
    {
      "index": 0,
      "rubric": "Is the output format correct?",
      "passed": true,
      "confidence": 0.95,
      "evidence": "File output.docx contains properly formatted sections..."
    }
  ],
  "summary": {
    "total": 7,
    "passed": 5,
    "failed": 2
  }
}
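
To aggregate rubric results across a whole run, a sketch like the following walks the run directory and tallies per-criterion outcomes. The run directory, glob pattern, and judge-model name are assumptions based on the file naming described above:

import json
from pathlib import Path

# Run directory and judge model are illustrative; match them to your own run.
run_dir = Path("evaluation/output/codex--kimi-k2.5--lite")
judge_model = "gpt-5.4"

passed = failed = 0
for rubric_file in run_dir.glob(f"*/rubrics_judge--{judge_model}.json"):
    data = json.loads(rubric_file.read_text())
    for criterion in data["rubrics"]:
        if criterion["passed"]:
            passed += 1
        else:
            failed += 1

total = passed + failed
if total:
    print(f"Criteria passed: {passed}/{total} ({passed / total:.1%})")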

Advanced Options

Custom Providers

For models not in the predefined alias list:

docker compose -f docker/docker-compose.yaml run --rm workspace-bench \
  bash /workspace/Workspace-Bench/evaluation/docker/run-benchmark.sh \
  --harness codex \
  --model-id my-model \
  --model-name My-Model \
  --env-prefix MYMODEL \
  --dataset lite

Ensure MYMODEL_BASE_URL and MYMODEL_API_KEY are set in .env.
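
For example, .env might contain entries like the following; the URL and key are placeholders for your provider's values:

MYMODEL_BASE_URL=https://api.example.com/v1
MYMODEL_API_KEY=sk-...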

Running Without Docker

You can also run evaluations directly if you have the dependencies installed:

cd evaluation
python3 -m pip install -r requirements.txt  # if available
python3 src/agent_runner.py --run-config runs/my_config.yaml

Next Steps