Stop rewarding agents for saying done.
HermesBench measures whether tool-using agents can inspect fixtures, run commands, produce artifacts, pass deterministic checks, and leave auditable evidence, all without leaking the private tasks that make leaderboards meaningful.
Static benchmarks are benchmaxxed.
Public prompts leak into training data, benchmark harnesses become optimization targets, and simple Q&A scores stop predicting real agent reliability. HermesBench keeps public tasks useful for development while reserving private/fresh/anchor packs for official scoring.
Execution first
Deterministic, artifact, and test-based checks beat subjective grading wherever possible.
False-done aware
Claiming success without artifacts is tracked as a measurable failure mode.
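One way to picture the false-done check is the minimal sketch below. It assumes a task lists the artifact paths it expects and that the agent reports a boolean success claim. `TaskOutcome`, `missing_artifacts`, and `is_false_done` are hypothetical names for illustration, not HermesBench's actual internals.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskOutcome:
    # Hypothetical record; the field names are assumptions, not the real schema.
    claimed_done: bool        # the agent said it finished successfully
    expected_artifacts: list  # relative paths the task requires
    workdir: Path             # directory the agent worked in


def missing_artifacts(outcome: TaskOutcome) -> list:
    """Return the expected artifacts that are absent or empty in the workdir."""
    missing = []
    for rel in outcome.expected_artifacts:
        path = outcome.workdir / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


def is_false_done(outcome: TaskOutcome) -> bool:
    """True when the agent claimed success but required evidence is missing."""
    return outcome.claimed_done and bool(missing_artifacts(outcome))
```

Under this sketch, an agent that fails without claiming success counts as an ordinary failure, while a success claim with missing artifacts is counted separately.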
Auditable
Result JSON preserves task evidence, wall time, tool calls, cost fields, timeouts, and category scores.
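As a rough illustration of what such a result file could hold, the sketch below serializes per-task records and derives category scores as pass rates. Every field name, and the choice of pass rate as the category score, is an assumption made for this example. The real result schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TaskResult:
    # All field names are hypothetical, chosen to mirror the prose above.
    task_id: str
    category: str
    passed: bool
    false_done: bool
    wall_time_s: float
    tool_calls: int
    timed_out: bool
    cost_usd: Optional[float] = None
    evidence: list = field(default_factory=list)  # e.g. artifact paths, check logs


def category_scores(results):
    """Pass rate per category, as a fraction between 0 and 1."""
    totals = {}
    for r in results:
        passed, count = totals.get(r.category, (0, 0))
        totals[r.category] = (passed + int(r.passed), count + 1)
    return {cat: passed / count for cat, (passed, count) in totals.items()}


def to_result_json(agent, results):
    """Serialize one run into a single JSON document with stable key order."""
    payload = {
        "agent": agent,
        "category_scores": category_scores(results),
        "tasks": [asdict(r) for r in results],
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```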
uv run hermesbench validate-tasks
uv run hermesbench run \
--agent mock \
--suite public-dev \
--output-dir /tmp/hermesbench-results
uv run hermesbench score \
/tmp/hermesbench-results/*.json
Designed for Hermes-style tool use.
Tasks are Markdown + YAML specs with fixtures, expected artifacts, deterministic checks, hidden-check notes, contamination risk, freshness windows, safety notes, and human-baseline fields.
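To make the spec fields concrete, here is a small sketch of a completeness check over a task's YAML metadata once it has been parsed into a dictionary. The key names are guesses derived from the list above, and this is not a description of what `hermesbench validate-tasks` actually does.

```python
# Hypothetical key names, inferred from the prose; the real spec may use others.
REQUIRED_FIELDS = (
    "fixtures",
    "expected_artifacts",
    "checks",
    "hidden_check_notes",
    "contamination_risk",
    "freshness_window",
    "safety_notes",
    "human_baseline",
)


def validate_task_spec(spec: dict) -> list:
    """Return a list of human-readable problems; an empty list means the spec looks complete."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in spec]
    # A task with no deterministic checks cannot be scored by execution.
    if "checks" in spec and not spec["checks"]:
        errors.append("checks must list at least one deterministic check")
    return errors
```

Returning a list of problems instead of raising on the first one lets a validator report everything wrong with a task in a single pass.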
Leaderboard scaffold
This is static demo data until official private/fresh/anchor runs are launched. Self-submitted public/dev results are useful, but not comparable to official hidden-pack runs.
| Rank | Agent | Model | Score | Pass@1 | False done | Official |
|---|---|---|---|---|---|---|
Unofficial first. Official only with private packs.
Public uploads are unofficial by default. Official runs require maintainer-controlled task packs, run manifests, score hashes, and archived evidence.
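A score hash could, for example, be a SHA-256 digest over a canonical JSON encoding of the result, so that anyone holding the archived evidence can recompute it. The sketch below shows that idea. The function name and the canonicalization rules are assumptions, not the project's documented scheme.

```python
import hashlib
import json


def score_hash(result: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(
        result,
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # fixed separators, no incidental whitespace
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing the separators means two result files with the same content produce the same digest regardless of how they were written.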
uv run hermesbench serve-api \
--host 127.0.0.1 \
--port 8787
POST /v1/results
GET /v1/leaderboard
GET /health
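The routes above can be exercised with a small standard-library client like the sketch below. The host, port, and paths come from the commands above. The JSON request body and the assumption that every route answers with JSON are guesses, so adjust them to the real API.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8787"  # matches the serve-api flags shown above


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a request for one of the documented routes; the payload shape is an assumption."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def send(req: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the body, assuming the server replies with JSON."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage against a locally running server:
#   send(build_request("GET", "/health"))
#   send(build_request("GET", "/v1/leaderboard"))
#   send(build_request("POST", "/v1/results", {"agent": "mock"}))
```

Keeping request construction separate from sending makes the client easy to test without a running server.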