/ terminal-bench · april 2026 batch
Agentic data that actually runs.
Expert-authored terminal tasks with Docker-reproducible verifiers and reference solutions. Every sample boots in a real container, and every pass/fail verdict comes from a concrete test — no LLM-as-judge.
6
Expert-authored samples
April 2026 batch
3,500+
Indexed source files
every byte reproducible
18
Tests per sample · avg
deterministic pytest
100%
Docker-verified
no LLM-as-judge
/ what is terminal bench
LLM evals miss the terminal. Agents live there.
Benchmarks that score agents on static code snippets or single-turn QA miss the actual job: executing commands, reading output, editing files, and recovering from errors — at the shell prompt, across long-running sessions.
Terminal Bench samples are expert-authored tasks with a deterministic verifier. They boot the agent into a real Docker environment, hand it instruction.md, and then score the final state of the filesystem — not the LLM's self-report.
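The scoring step above can be sketched in a few lines: the verdict is the verifier's exit code over the final filesystem, never the model's self-report. The helper name and paths below are assumptions for illustration, not the shipped harness.

```python
# Minimal sketch of the scoring step; `score` and its layout are illustrative.
import subprocess
import sys
import tempfile
from pathlib import Path

def score(workspace: str) -> str:
    """Run the deterministic verifier in `workspace`; map its exit code to a verdict."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", "tests/"],
        cwd=workspace,
        capture_output=True,
        text=True,
    )
    return "PASS" if result.returncode == 0 else "FAIL"

# Dry run against a throwaway workspace with one trivially passing test.
with tempfile.TemporaryDirectory() as tmp:
    tests = Path(tmp) / "tests"
    tests.mkdir()
    (tests / "test_smoke.py").write_text("def test_ok():\n    assert True\n")
    verdict = score(tmp)

print(verdict)
```

With pytest installed, the dry run prints PASS; without it, the subprocess exits nonzero and the verdict is FAIL — either way the decision comes from an exit code, not a judgment call.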
/ task lifecycle
From prompt to verdict, reproducibly.
step 01
Spec declared
task.toml pins the environment: CPUs, memory, timeouts, difficulty, author. Every run reproduces the same box.
# task.toml cpus = 4 mem = 2G agent_timeout = 600s
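Expanding the compact preview, a fuller task.toml might look like the sketch below. Only the fields named on this page (CPUs, memory, timeouts, difficulty, author, tags, verifier timeout) are grounded; the exact key names and layout are assumptions, not the shipped schema.

```toml
# Hypothetical task.toml sketch; exact schema is an assumption.
author = "jane.doe"            # provenance
difficulty = "hard"            # calibration band
tags = ["asyncio", "debugging"]
cpus = 4
mem = "2G"
agent_timeout = "600s"
verifier_timeout = "120s"
```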
step 02
Task issued
The agent receives instruction.md plus a clean repo tree. No hidden context, no test leakage.
$ cat instruction.md # Fix asyncio REPL leaking tasks on Ctrl-C…
step 03
Container spawns
A fresh Docker environment boots with exactly the CPU, memory, and timeout caps declared in the spec.
→ docker run 4cpu 2G timeout=600s → /workspace ready
step 04
Agent acts
The agent edits source, runs commands, inspects output. Every action flows through a real shell, every shell line is recorded.
✎ patch src/repl.py $ python -m pytest -q tests/
step 05
Tests run
A deterministic pytest harness executes against the final filesystem. Assertions are concrete — no LLM-as-judge, no subjective scoring.
$ pytest -q ✓ 18 passed 0 failed 142s
step 06
Artifacts delivered
PASS/FAIL plus the full trace: commands, diffs, stdout, timings. Everything is versioned and auditable.
result: PASS log.jsonl · diff.patch · trace.tar
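The trace can be pictured as JSON Lines in the spirit of log.jsonl above — one record per shell action. The field names here are illustrative assumptions, not the shipped schema:

```python
# One record per shell action; field names are illustrative, not the real schema.
import json
import time

def trace_record(cmd: str, exit_code: int, stdout: str) -> str:
    """Serialize a single shell action as one JSONL line."""
    return json.dumps({
        "ts": round(time.time(), 3),   # when the command ran
        "cmd": cmd,                    # the exact shell line
        "exit": exit_code,             # its exit status
        "stdout": stdout,              # captured output
    })

line = trace_record("python -m pytest -q tests/", 0, "18 passed")
print(line)
```

Because each action is one self-describing line, the trace can be streamed, diffed, and audited with ordinary tools.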
/ anatomy of a sample
Four artifacts. Every sample.
file
task.toml
Metadata: author, difficulty, tags, time estimates, resource caps, verifier timeout.
file
instruction.md
Full narrative task brief with code fragments, constraints, and examples — the exact prompt the agent sees.
file
solution/solve.sh
Reference expert solution — a bash script that produces a passing run against the verifier.
file
tests/
Deterministic pytest harness that decides PASS/FAIL; every assertion is specific and inspectable.
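As a sketch of that assertion style, the checks below inspect the final workspace directly. The file layout mirrors this page's example task; the function names and real harness contents are assumptions.

```python
# Hypothetical checks in the style of tests/: each assertion inspects the final
# filesystem. File names mirror this page's example task; real harnesses differ.
import tempfile
from pathlib import Path

def check_patch_applied(workspace: Path) -> None:
    src = (workspace / "src" / "repl.py").read_text()
    assert "loop.run_forever()" not in src, "REPL still blocks on Ctrl-C"

def check_no_stray_backups(workspace: Path) -> None:
    assert not list(workspace.glob("**/*.orig")), "leftover editor backup files"

# Dry run against a throwaway stand-in for the container's workspace.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "src").mkdir()
    (ws / "src" / "repl.py").write_text("await _drain_and_close(loop)\n")
    check_patch_applied(ws)
    check_no_stray_backups(ws)
    verdict = "PASS"

print(verdict)
```

Every check is a plain boolean over files on disk — anyone can rerun it and inspect exactly which assertion failed.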
/ real artifact
One expert solution. Readable preview.
This compact preview of solve.sh shows how the tbrain-asyncio-repl-lifecycle sample is patched and verified. Full artifacts are available in the showcase.
10
lines of bash
0
llm judges involved
#!/usr/bin/env bash
# Compact expert solution preview.
set -euo pipefail
# Patch the REPL to cancel outstanding tasks on Ctrl-C.
sed -i 's/loop.run_forever()/await _drain_and_close(loop)/' src/repl.py
# Run the deterministic verifier.
python -m pytest -q tests/test_outputs.py
/ why terminal bench
We build the hardest tasks your agents haven't seen yet.
Difficulty engineering
Each task is designed to fall inside the expert-yet-solvable band — hard enough to separate strong agents from weak ones, tractable enough that human experts can verify outcomes deterministically.
Verification pipeline
Every sample ships with a reproducible test harness that boots a fresh Docker environment, runs the agent-produced solution, and asserts concrete outcomes — no LLM-as-judge handwaving.
Environment design
Real-world software stacks: Linux, Python, Node, bash, systemd-style services, database fixtures. Agents operate in a terminal, not a toy sandbox.
Task strategy
A deliberate spread across debugging, devops, library work, and API integration — calibrated to surface capability gaps rather than piling on tasks of one flavor.
Expert authoring
Each sample is authored by a PhD or senior engineer in its domain and peer-reviewed. Instructions are written to be precise and unambiguous under terminal-only observation.
Delivery guarantees
Full provenance: author, review notes, time estimates for expert vs. junior, resource caps. Every artifact is versioned; every change is auditable.
/ how to access
Two paths in. Zero guesswork.
path one
Have a passcode?
If our sales team has sent you a TB-XXXX-XXXX passcode, enter it to open the current batch.
path two
Request access
Tell us about your team and intended use. Our sales team reviews each request and usually responds within one business day.