/ terminal-bench · april 2026 batch

Agentic data that actually runs.

Expert-authored terminal tasks with Docker-reproducible verifiers and reference solutions. Every sample boots in a real container, every pass/fail is a concrete test — no LLM-as-judge.

6

Expert-authored samples

April 2026 batch

3,500+

Indexed source files

every byte reproducible

18

Tests per sample · avg

deterministic pytest

100%

Docker-verified

no LLM-as-judge

/ what is terminal bench

LLM evals miss the terminal. Agents live there.

Benchmarks that score agents on static code snippets or single-turn QA miss the actual job: executing commands, reading output, editing files, and recovering from errors — at the shell prompt, across long-running sessions.

Terminal Bench samples are expert-authored tasks with a deterministic verifier. They boot the agent into a real Docker environment, hand it instruction.md, and then score the final state of the filesystem — not the LLM's self-report.
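
In shell terms, one scored run reduces to the sketch below. This is a minimal illustration built on stock Docker commands; the image name, paths, and PASS/FAIL plumbing are assumptions, not the actual harness.

# Minimal sketch of one scored run (image and paths illustrative)
cid=$(docker run -d --rm tb-sample:latest sleep infinity)
docker cp instruction.md "$cid:/workspace/instruction.md"
# ...agent session runs here, over a real shell, fully recorded...
docker exec "$cid" python -m pytest -q /workspace/tests \
  && echo "result: PASS" || echo "result: FAIL"
docker stop "$cid"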

/ task lifecycle

From prompt to verdict, reproducibly.

step 01

Spec declared

task.toml pins the environment: CPUs, memory, timeouts, difficulty, author. Every run reproduces the same box.

# task.toml
cpus = 4
mem = "2G"
agent_timeout = "600s"
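
The preview above is compact. A fuller spec, with illustrative values for the fields this page names (author, difficulty, tags, caps, timeouts), might read:

# task.toml (illustrative values, not a shipped sample)
author = "j.doe"
difficulty = "hard"
tags = ["asyncio", "debugging"]
cpus = 4
mem = "2G"
agent_timeout = "600s"
verifier_timeout = "300s"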

step 02

Task issued

The agent receives instruction.md plus a clean repo tree. No hidden context, no test leakage.

$ cat instruction.md
# Fix asyncio REPL leaking tasks on Ctrl-C…

step 03

Container spawns

A fresh Docker environment boots with exactly the CPU, memory, and timeout caps declared in the spec.

→ docker run  4cpu  2G  timeout=600s
→ /workspace  ready
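
Concretely, the spawn is an ordinary docker run with the declared caps, roughly as below; the image name is illustrative, and the 600s budget is enforced by the harness rather than by Docker itself.

$ docker run --rm --cpus=4 --memory=2g \
    -v "$PWD/workspace:/workspace" -w /workspace \
    tb-sample:latest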

step 04

Agent acts

The agent edits source, runs commands, inspects output. Every action flows through a real shell, every shell line is recorded.

✎ patch src/repl.py
$ python -m pytest -q tests/

step 05

Tests run

A deterministic pytest harness executes against the final filesystem. Assertions are concrete — no LLM-as-judge, no subjective scoring.

$ pytest -q
✓ 18 passed  0 failed  142s

step 06

Artifacts delivered

PASS/FAIL plus the full trace: commands, diffs, stdout, timings. Everything is versioned and auditable.

result: PASS
log.jsonl · diff.patch · trace.tar
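
The trace itself is plain JSONL, one event per line. A single entry might look like the line below; the field names are illustrative, not a schema guarantee.

{"ts":"2026-04-02T14:02:17Z","type":"exec","cmd":"python -m pytest -q tests/","exit":0,"dur_s":142}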

/ anatomy of a sample

Four artifacts. Every sample.

file

task.toml

Metadata: author, difficulty, tags, time estimates, resource caps, verifier timeout.

file

instruction.md

Full narrative task brief with code fragments, constraints, and examples — the exact prompt the agent sees.

file

solution/solve.sh

Reference expert solution — a bash script that produces a passing run against the verifier.

file

tests/

Deterministic pytest harness that decides PASS/FAIL; every assertion is specific and inspectable.
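
As a sketch of that style (the file name follows the convention shown in the preview below; the test body, module, and flag are hypothetical, not from a shipped sample):

# tests/test_outputs.py: hypothetical shape of one verifier test
import subprocess

def test_repl_exits_cleanly_on_ctrl_c():
    # Deterministic and inspectable: run the patched program and
    # assert on observable output, never on a judge's opinion.
    result = subprocess.run(
        ["python", "-m", "repl", "--selftest"],
        capture_output=True, text=True, timeout=60,
    )
    assert result.returncode == 0
    assert "0 pending tasks" in result.stdout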

/ real artifact

One expert solution. Readable preview.

This compact preview of solve.sh shows how the tbrain-asyncio-repl-lifecycle task is patched and verified. The full artifacts remain available in the showcase.

10

lines of bash

0

llm judges involved

solution/solve.sh
bash
#!/usr/bin/env bash
# Compact expert solution preview.
set -euo pipefail

# Patch the REPL to cancel outstanding tasks on Ctrl-C.
sed -i 's/loop\.run_forever()/await _drain_and_close(loop)/' src/repl.py

# Run the deterministic verifier.
python -m pytest -q tests/test_outputs.py

/ why terminal bench

We build the hardest tasks your agents haven't seen yet.

01

Difficulty engineering

Each task is designed to fall inside the expert-yet-solvable band — hard enough to separate strong agents from weak ones, tractable enough that human experts can verify outcomes deterministically.

02

Verification pipeline

Every sample ships with a reproducible test harness that boots a fresh Docker environment, runs the agent-produced solution, and asserts concrete outcomes — no LLM-as-judge handwaving.

03

Environment design

Real-world software stacks: Linux, Python, Node, bash, systemd-style services, database fixtures. Agents operate in a terminal, not a toy sandbox.

04

Task strategy

A deliberate spread across debugging, devops, library work, and API integration — calibrated to surface capability gaps rather than piling on tasks of one flavor.

05

Expert authoring

Each sample is authored by a PhD or senior engineer in its domain and peer-reviewed. Instructions are written to be precise and unambiguous under terminal-only observation.

06

Delivery guarantees

Full provenance: author, review notes, expert and junior time estimates, resource caps. Every artifact is versioned; every change is auditable.

/ how to access

Two paths in. Zero guesswork.

path one

Have a passcode?

If our sales team has sent you a TB-XXXX-XXXX passcode, enter it to open the current batch.

path two

Request access

Tell us about your team and intended use. Our sales team reviews each request and usually responds within one business day.