/ terminal-bench · april 2026 batch

Agentic data that actually runs.

Expert-authored terminal tasks with Docker-reproducible verifiers and reference solutions. Every sample boots in a real container, every pass/fail is a concrete test — no LLM-as-judge.

6

Expert-authored samples

April 2026 batch

3,500+

Indexed source files

every byte reproducible

18

Tests per sample · avg

deterministic pytest

100%

Docker-verified

no LLM-as-judge

/ what is terminal bench

LLM evals miss the terminal. Agents live there.

Benchmarks that score agents on static code snippets or single-turn QA miss the actual job: executing commands, reading output, editing files, and recovering from errors — at the shell prompt, across long-running sessions.

Terminal Bench samples are expert-authored tasks with a deterministic verifier. They boot the agent into a real Docker environment, hand it instruction.md, and then score the final state of the filesystem — not the LLM's self-report.
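
In shell terms, one scored run reduces to the sketch below. This is a minimal illustration built on stock Docker commands; the image name, paths, and PASS/FAIL plumbing are assumptions, not the actual harness.

# Minimal sketch of one scored run (image and paths illustrative)
cid=$(docker run -d --rm tb-sample:latest sleep infinity)
docker cp instruction.md "$cid:/workspace/instruction.md"
# ...agent session runs here, over a real shell, fully recorded...
docker exec "$cid" python -m pytest -q /workspace/tests \
  && echo "result: PASS" || echo "result: FAIL"
docker stop "$cid"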

/ task lifecycle

From prompt to verdict, reproducibly.

step 01

Spec declared

task.toml pins the environment: CPUs, memory, timeouts, difficulty, author. Every run reproduces the same box.

# task.toml
cpus = 4
mem = "2G"
agent_timeout = "600s"
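
The preview above is compact. A fuller spec, with illustrative values for the fields this page names (author, difficulty, tags, caps, timeouts), might read:

# task.toml (illustrative values, not a shipped sample)
author = "j.doe"
difficulty = "hard"
tags = ["asyncio", "debugging"]
cpus = 4
mem = "2G"
agent_timeout = "600s"
verifier_timeout = "300s"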

step 02

Task issued

The agent receives instruction.md plus a clean repo tree. No hidden context, no test leakage.

$ cat instruction.md
# Fix asyncio REPL leaking tasks on Ctrl-C…

step 03

Container spawns

A fresh Docker environment boots with exactly the CPU, memory, and timeout caps declared in the spec.

→ docker run  4cpu  2G  timeout=600s
→ /workspace  ready
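
Concretely, the spawn is an ordinary docker run with the declared caps, roughly as below; the image name is illustrative, and the 600s budget is enforced by the harness rather than by Docker itself.

$ docker run --rm --cpus=4 --memory=2g \
    -v "$PWD/workspace:/workspace" -w /workspace \
    tb-sample:latest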

step 04

Agent acts

The agent edits source, runs commands, inspects output. Every action flows through a real shell, every shell line is recorded.

✎ patch src/repl.py
$ python -m pytest -q tests/

step 05

Tests run

A deterministic pytest harness executes against the final filesystem. Assertions are concrete — no LLM-as-judge, no subjective scoring.

$ pytest -q
✓ 18 passed  0 failed  142s

step 06

Artifacts delivered

PASS/FAIL plus the full trace: commands, diffs, stdout, timings. Everything is versioned and auditable.

result: PASS
log.jsonl · diff.patch · trace.tar
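
The trace itself is plain JSONL, one event per line. A single entry might look like the line below; the field names are illustrative, not a schema guarantee.

{"ts":"2026-04-02T14:02:17Z","type":"exec","cmd":"python -m pytest -q tests/","exit":0,"dur_s":142}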

/ anatomy of a sample

Four artifacts. Every sample.

file

task.toml

Metadata: author, difficulty, tags, time estimates, resource caps, verifier timeout.

file

instruction.md

Full narrative task brief with code fragments, constraints, and examples — the exact prompt the agent sees.

file

solution/solve.sh

Reference expert solution — a bash script that produces a passing run against the verifier.

file

tests/

Deterministic pytest harness that decides PASS/FAIL; every assertion is specific and inspectable.
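
As a sketch of that style (the file name follows the convention shown in the preview below; the test body, module, and flag are hypothetical, not from a shipped sample):

# tests/test_outputs.py: hypothetical shape of one verifier test
import subprocess

def test_repl_exits_cleanly_on_ctrl_c():
    # Deterministic and inspectable: run the patched program and
    # assert on observable output, never on a judge's opinion.
    result = subprocess.run(
        ["python", "-m", "repl", "--selftest"],
        capture_output=True, text=True, timeout=60,
    )
    assert result.returncode == 0
    assert "0 pending tasks" in result.stdout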

/ real artifact

One expert solution. Readable preview.

This compact preview of solve.sh shows how the tbrain-asyncio-repl-lifecycle task is patched and verified. The full artifacts remain available in the showcase.

10

lines of bash

0

llm judges involved

solution/solve.sh
bash
#!/usr/bin/env bash
# Compact expert solution preview.
set -euo pipefail

# Patch the REPL to cancel outstanding tasks on Ctrl-C.
sed -i 's/loop\.run_forever()/await _drain_and_close(loop)/' src/repl.py

# Run the deterministic verifier.
python -m pytest -q tests/test_outputs.py

/ why terminal bench

We build the hardest tasks your agents haven't seen yet.

01

Difficulty engineering

Each task is designed to fall inside the expert-yet-solvable band — hard enough to separate strong agents from weak ones, tractable enough that human experts can verify outcomes deterministically.

02

Verification pipeline

Every sample ships with a reproducible test harness that boots a fresh Docker environment, runs the agent-produced solution, and asserts concrete outcomes — no LLM-as-judge handwaving.

03

Environment design

Real-world software stacks: Linux, Python, Node, bash, systemd-style services, database fixtures. Agents operate in a terminal, not a toy sandbox.

04

Task strategy

A deliberate spread across debugging, devops, library work, and API integration — calibrated to surface capability gaps rather than piling on tasks of one flavor.

05

Expert authoring

Each sample is authored by a PhD or senior engineer in its domain and peer-reviewed. Instructions are written to be precise and unambiguous under terminal-only observation.

06

Delivery guarantees

Full provenance: author, review notes, expert and junior time estimates, resource caps. Every artifact is versioned; every change is auditable.

/ how to access

Two paths in. Zero guesswork.

path one

Have a passcode?

If our sales team has sent you a TB-XXXX-XXXX passcode, enter it to open the current batch.

path two

Request access

Tell us about your team and intended use. Our sales team reviews each request and usually responds within one business day.