The challenge
The customer was scaling a multimodal model into seven scientific domains and needed a partner who could keep pace not just on volume, but on consistency across text, image, and audio modalities. They had previously been burned by vendors whose pass rates dropped sharply once headcount grew.
Our approach
Pod-of-pods structure
Rather than one large pool of annotators, we ran the program as seven domain pods (one per scientific area) with a central review layer. Each pod reported to a senior expert from that field; the central layer enforced cross-pod consistency.
Calibrated growth
The team grew to roughly 600 expert makers over four months. Every new annotator worked through the same calibration set as the founding cohort, so quality held steady as headcount climbed.
LLM-assisted pre-labelling
For high-volume image and audio prompts, we used model-assisted pre-labelling with human-in-the-loop verification, so reviewer time went to edge cases rather than copy-paste work.
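A minimal sketch of what confidence-gated routing in a pipeline like this can look like. This is illustrative only, not the program's actual implementation: the `PreLabel` structure, the 0.9 threshold, and the `route` function are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class PreLabel:
    item_id: str
    label: str
    confidence: float  # model's self-reported confidence, 0.0-1.0

# Illustrative threshold: pre-labels below it are queued for a human reviewer.
REVIEW_THRESHOLD = 0.9

def route(prelabels):
    """Split model pre-labels into auto-accepted and human-review queues."""
    auto, review = [], []
    for p in prelabels:
        (auto if p.confidence >= REVIEW_THRESHOLD else review).append(p)
    return auto, review

batch = [
    PreLabel("img-001", "spectrogram", 0.97),
    PreLabel("img-002", "micrograph", 0.62),
]
auto, review = route(batch)
```

In a setup like this, reviewers only ever see the `review` queue, which is what keeps their time focused on genuinely ambiguous items.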
Outcome
- 48,000 high-quality visual prompts delivered across seven scientific domains.
- ~600 vetted expert makers active by month four.
- 90% sustained pass rate on the customer's hold-out evaluation.
- Full ramp from zero to delivery in four months.
What made it work
The pod-of-pods structure meant that scaling did not dilute domain expertise. The customer was able to hand us a new domain mid-program without losing speed in the existing six.