Execution-based agent benchmark

Stop rewarding agents for saying done.

HermesBench measures whether tool-using agents can inspect fixtures, run commands, produce artifacts, pass deterministic checks, and leave auditable evidence — without leaking the private tasks that make leaderboards meaningful.

45 tasks/templates
30 public/dev tasks
0% mock false-done rate
CI green on GitHub

Why this exists

Static benchmarks are benchmaxxed.

Public prompts leak into training data, benchmark harnesses become optimization targets, and simple Q&A scores stop predicting real agent reliability. HermesBench keeps public tasks useful for development while reserving private/fresh/anchor packs for official scoring.

Execution first

Deterministic, artifact-based, and test-based checks beat subjective grading wherever possible.

False-done aware

Claiming success without artifacts is tracked as a measurable failure mode.

Auditable

Result JSON preserves task evidence, wall time, tool calls, cost fields, timeouts, and category scores.
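
As a rough illustration of the false-done and audit ideas above, the sketch below models a single task result and computes a false-done rate. Every name in it (TaskResult, its fields, is_false_done, false_done_rate) is hypothetical and not taken from the HermesBench codebase, and dividing by the total number of tasks is an assumption about how the rate is defined.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """One task outcome. All field names are hypothetical, not the real schema."""
    task_id: str
    claimed_done: bool      # the agent reported success
    checks_passed: bool     # the deterministic checks actually passed
    artifacts: list[str] = field(default_factory=list)  # paths to produced evidence
    wall_time_s: float = 0.0
    tool_calls: int = 0
    timed_out: bool = False


def is_false_done(result: TaskResult) -> bool:
    """True when the agent claimed success without passing checks and leaving artifacts."""
    verified = bool(result.checks_passed and result.artifacts)
    return result.claimed_done and not verified


def false_done_rate(results: list[TaskResult]) -> float:
    """Share of all tasks that ended in a false done (assumed denominator: all tasks)."""
    if not results:
        return 0.0
    return sum(is_false_done(r) for r in results) / len(results)


results = [
    TaskResult("t1", claimed_done=True, checks_passed=True, artifacts=["out/report.json"]),
    TaskResult("t2", claimed_done=True, checks_passed=False),   # claimed, but checks failed
    TaskResult("t3", claimed_done=False, checks_passed=False),  # honest failure
]
rate = false_done_rate(results)
```

Under these assumptions an honest failure (t3) does not count against the agent; only an unverified success claim (t2) does.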

uv run hermesbench validate-tasks

uv run hermesbench run \
  --agent mock \
  --suite public-dev \
  --output-dir /tmp/hermesbench-results

uv run hermesbench score \
  /tmp/hermesbench-results/*.json

Task model

Designed for Hermes-style tool use.

Tasks are Markdown + YAML specs with fixtures, expected artifacts, deterministic checks, hidden-check notes, contamination risk, freshness windows, safety notes, and human-baseline fields.
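
To make the task model more concrete, here is a minimal sketch of a completeness check over an already-parsed task spec. The key names in REQUIRED_FIELDS are guesses derived from the list above, not the actual HermesBench schema, and the function assumes the YAML has already been loaded into a plain dict.

```python
# Hypothetical key names, one per field mentioned in the task model description.
REQUIRED_FIELDS = (
    "fixtures",
    "expected_artifacts",
    "checks",
    "hidden_check_notes",
    "contamination_risk",
    "freshness_window",
    "safety_notes",
    "human_baseline",
)


def missing_fields(spec: dict) -> list[str]:
    """Return the required keys that are absent from a parsed task spec."""
    return [name for name in REQUIRED_FIELDS if name not in spec]


# An illustrative, incomplete spec: only three of the assumed keys are present.
example_spec = {
    "fixtures": ["fixtures/repo.tar.gz"],
    "expected_artifacts": ["out/summary.md"],
    "checks": [{"type": "file_exists", "path": "out/summary.md"}],
}
problems = missing_fields(example_spec)
```

A check like this is the kind of thing a validation step could run before a task is accepted into a suite; what the real validate-tasks command checks is not specified here.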

Demo data · unofficial

Leaderboard scaffold

This is static demo data until official private/fresh/anchor runs are launched. Self-submitted public/dev results are useful, but not comparable to official hidden-pack runs.

Rank | Agent | Model | Score | Pass@1 | False done | Official

Audit trail

Result detail

Submissions

Unofficial first. Official only with private packs.

Public uploads are unofficial by default. Official runs require maintainer-controlled task packs, run manifests, score hashes, and archived evidence.
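
One way score hashes and run manifests could work is sketched below: each result is serialized to canonical JSON and hashed with SHA-256, and the manifest records one hash per task. This is an assumption about the mechanism, not the documented HermesBench format, and the names score_hash, build_manifest, and the manifest keys are hypothetical.

```python
import hashlib
import json


def score_hash(result: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(suite: str, agent: str, results: list[dict]) -> dict:
    """Hypothetical run manifest: suite, agent, and one hash per task result."""
    return {
        "suite": suite,
        "agent": agent,
        "result_hashes": {r["task_id"]: score_hash(r) for r in results},
    }


original = {"task_id": "demo-task", "score": 1.0, "passed": True}
reordered = {"passed": True, "score": 1.0, "task_id": "demo-task"}
tampered = {"task_id": "demo-task", "score": 0.5, "passed": True}

manifest = build_manifest("public-dev", "mock", [original])
```

With a manifest like this, anyone holding the archived result files can recompute the hashes and detect edits made after the run.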

Official-run policy →

uv run hermesbench serve-api \
  --host 127.0.0.1 \
  --port 8787

POST /v1/results
GET /v1/leaderboard
GET /health
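
A client for the endpoints above might look like the following sketch, which uses only the Python standard library. The three paths come from the list above; that the server accepts and returns JSON, and the shape of the example payload, are assumptions. The helper names build_result_request and fetch_json are hypothetical.

```python
import json
import urllib.request


def build_result_request(base_url: str, result: dict) -> urllib.request.Request:
    """Prepare a POST to /v1/results with a JSON body (payload shape is assumed)."""
    body = json.dumps(result).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/results",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(base_url: str, path: str, timeout: float = 5.0) -> object:
    """GET a path such as /health or /v1/leaderboard and decode the body as JSON."""
    url = f"{base_url.rstrip('/')}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server; sending it would be
# urllib.request.urlopen(request) once serve-api is running locally.
request = build_result_request(
    "http://127.0.0.1:8787",
    {"task_id": "demo-task", "passed": True},  # illustrative payload only
)
```

Separating request construction from sending keeps the sketch checkable without a running server.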