Execution-based agent benchmark

Stop rewarding agents for saying done.

HermesBench measures whether tool-using agents can inspect fixtures, run commands, produce artifacts, pass deterministic checks, and leave auditable evidence — without leaking the private tasks that make leaderboards meaningful.


45 tasks/templates

30 public/dev tasks

0% mock false-done rate

CI green on GitHub

Why this exists

Static benchmarks are benchmaxxed.

Public prompts leak into training data, benchmark harnesses become optimization targets, and simple Q&A scores stop predicting real agent reliability. HermesBench keeps public tasks useful for development while reserving private/fresh/anchor packs for official scoring.

Execution first

Deterministic, artifact, and test-based checks beat subjective grading wherever possible.
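As an illustration of what a deterministic artifact check can look like, here is a minimal sketch. The function name `check_artifact` and the choice of SHA-256 are assumptions made for this example, not HermesBench's actual check API.

```python
import hashlib
import tempfile
from pathlib import Path


def check_artifact(path: Path, expected_sha256: str) -> bool:
    """Hypothetical check: pass only if the file exists and its bytes
    hash to the expected SHA-256 digest."""
    if not path.is_file():
        return False
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected_sha256


# Usage: write a throwaway artifact, then check it against a known digest.
with tempfile.TemporaryDirectory() as workdir:
    artifact = Path(workdir) / "report.txt"
    artifact.write_bytes(b"ok\n")
    expected = hashlib.sha256(b"ok\n").hexdigest()

    present_ok = check_artifact(artifact, expected)
    missing_ok = check_artifact(Path(workdir) / "missing.txt", expected)
```

The same inputs always give the same verdict, which is the property that separates this kind of check from subjective grading.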

False-done aware

Claiming success without artifacts is tracked as a measurable failure mode.
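One way to turn this failure mode into a number is sketched below. The `TaskOutcome` fields and the choice of denominator (all attempted tasks) are assumptions for illustration; the benchmark's own definition may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    # Hypothetical record; field names are illustrative only.
    task_id: str
    claimed_done: bool   # the agent reported success
    checks_passed: bool  # deterministic checks on the artifacts passed


def false_done_rate(outcomes: list[TaskOutcome]) -> float:
    """Fraction of tasks where the agent claimed success but checks failed."""
    if not outcomes:
        return 0.0
    false_done = sum(1 for o in outcomes if o.claimed_done and not o.checks_passed)
    return false_done / len(outcomes)


outcomes = [
    TaskOutcome("t1", claimed_done=True, checks_passed=True),
    TaskOutcome("t2", claimed_done=True, checks_passed=False),   # false done
    TaskOutcome("t3", claimed_done=False, checks_passed=False),  # honest failure
    TaskOutcome("t4", claimed_done=True, checks_passed=True),
]
rate = false_done_rate(outcomes)
```

An honest failure (`t3`) does not count against the agent here; only a success claim that the checks contradict does.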

Auditable

Result JSON preserves task evidence, wall time, tool calls, cost fields, timeouts, and category scores.
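A result record along these lines could be modelled as follows. This is a sketch only: the class name, field names, and units are assumptions chosen to mirror the list above, not the actual HermesBench result schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ResultRecord:
    # Hypothetical schema; the real result JSON may use different keys.
    task_id: str
    passed: bool
    wall_time_s: float
    tool_calls: int
    cost_usd: Optional[float]
    timed_out: bool
    evidence: list[str] = field(default_factory=list)
    category_scores: dict[str, float] = field(default_factory=dict)


record = ResultRecord(
    task_id="demo-001",
    passed=True,
    wall_time_s=12.5,
    tool_calls=3,
    cost_usd=0.0,
    timed_out=False,
    evidence=["artifacts/report.txt"],
    category_scores={"files": 1.0, "shell": 0.5},
)

# Serialize with sorted keys so the output is stable across runs.
payload = json.dumps(asdict(record), sort_keys=True, indent=2)

# Round-trip to confirm nothing is lost when the record is read back.
restored = ResultRecord(**json.loads(payload))
```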

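# Check that the task specs are well-formed
uv run hermesbench validate-tasks
# Run the mock agent against the public-dev suite
uv run hermesbench run \
  --agent mock \
  --suite public-dev \
  --output-dir /tmp/hermesbench-results
# Score the result JSON files written by the run
uv run hermesbench score \
  /tmp/hermesbench-results/*.json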

Task model

Designed for Hermes-style tool use.

Tasks are Markdown + YAML specs with fixtures, expected artifacts, deterministic checks, hidden-check notes, contamination risk, freshness windows, safety notes, and human-baseline fields.

Demo data · unofficial

Leaderboard scaffold

This is static demo data until official private/fresh/anchor runs are launched. Self-submitted public/dev results are useful, but not comparable to official hidden-pack runs.

Rank	Agent	Model	Score	Pass@1	False done	Official

Audit trail

Result detail


Submissions

Unofficial first. Official only with private packs.

Public uploads are unofficial by default. Official runs require maintainer-controlled task packs, run manifests, score hashes, and archived evidence.
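To make the idea of a score hash concrete, here is one possible construction: hash a canonical JSON encoding of the run manifest together with the scores, so that any later change to either is detectable. The function name, the manifest keys, and the use of SHA-256 are assumptions for this sketch, not the official-run procedure.

```python
import hashlib
import json


def score_hash(manifest: dict, scores: dict) -> str:
    """Hypothetical score hash: SHA-256 over canonical JSON of manifest + scores."""
    canonical = json.dumps(
        {"manifest": manifest, "scores": scores},
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content in a different key order yields the same digest.
hash_a = score_hash({"suite": "public-dev", "agent": "mock"}, {"overall": 0.8})
hash_b = score_hash({"agent": "mock", "suite": "public-dev"}, {"overall": 0.8})

# Any change to the scores yields a different digest.
hash_c = score_hash({"suite": "public-dev", "agent": "mock"}, {"overall": 0.9})
```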

Official-run policy →

uv run hermesbench serve-api \
  --host 127.0.0.1 \
  --port 8787

POST /v1/results
GET  /v1/leaderboard
GET  /health
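A client for the local API might build its upload request as sketched below. The host, port, and `/v1/results` path come from the commands above; the JSON payload shape and the `build_result_request` helper are assumptions, since the request schema is not specified here.

```python
import json
import urllib.request

# Matches the --host and --port values passed to serve-api above.
BASE_URL = "http://127.0.0.1:8787"


def build_result_request(result: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/results with a JSON body."""
    body = json.dumps(result).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/results",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; the real API may require other fields.
request = build_result_request({"task_id": "demo-001", "passed": True})

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the example runnable without a server and makes the payload easy to inspect before upload.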