Calibrate Human + AI Evaluation at Scale

GANDALF helps you build heterogeneous evaluation teams, calibrate AI annotators against human judgments, and track agreement until convergence.

Why GANDALF?

AI-Assisted Calibration

Autonomously generate and refine annotation prompts. The AI iterates on disagreements until its judgments converge with those of your human evaluators.
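A minimal sketch of what such a calibration loop might look like, in Python. The `annotate` and `refine` callables stand in for an LLM annotation call and a prompt rewriter; they are illustrative placeholders, not part of a published GANDALF API:

```python
from typing import Callable, Sequence

def calibrate(
    prompt: str,
    items: Sequence[str],
    human_labels: Sequence[str],
    annotate: Callable[[str, str], str],  # (prompt, item) -> label, e.g. an LLM call
    refine: Callable[[str, list], str],   # (prompt, disagreements) -> improved prompt
    target: float = 0.9,
    max_rounds: int = 10,
) -> tuple[str, float]:
    """Refine an annotation prompt until AI/human agreement reaches
    `target` or `max_rounds` is exhausted."""
    rate = 0.0
    for _ in range(max_rounds):
        ai_labels = [annotate(prompt, item) for item in items]
        rate = sum(a == h for a, h in zip(ai_labels, human_labels)) / len(items)
        if rate >= target:
            break
        # Feed only the disagreeing examples back into the prompt rewriter.
        disagreements = [
            (item, human, ai)
            for item, human, ai in zip(items, human_labels, ai_labels)
            if human != ai
        ]
        prompt = refine(prompt, disagreements)
    return prompt, rate
```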

Heterogeneous Teams

Assign tasks to a single evaluator, a subset of the team, or everyone. Support multi-annotator workflows with flexible team structures.
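One way to model this kind of flexible assignment is a task record whose assignee list may name one evaluator, several, or nobody (meaning the whole team). This is a hypothetical data structure for illustration, not GANDALF's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    item_id: str
    assignees: list[str] = field(default_factory=list)  # empty = whole team

TEAM = ["alice", "bob", "carol"]

def assigned_to(task: Task, evaluator: str) -> bool:
    """An empty assignee list routes the task to every evaluator."""
    return not task.assignees or evaluator in task.assignees

tasks = [
    Task("q1", ["alice"]),         # single evaluator
    Task("q2", ["alice", "bob"]),  # multiple evaluators
    Task("q3"),                    # all evaluators
]
for t in tasks:
    print(t.item_id, "->", [e for e in TEAM if assigned_to(t, e)])
```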

Agreement Tracking

Per-question agreement rates, Cohen's Kappa, and side-by-side disagreement viewers let you pinpoint where AI and humans diverge.
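Cohen's kappa corrects raw agreement for chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected if each annotator labeled at random from their own label marginals. A self-contained sketch of the computation:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)  # expected
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Example: AI vs. human labels on five items.
print(cohens_kappa(["yes", "no", "yes", "yes", "no"],
                   ["yes", "no", "no", "yes", "no"]))  # ~0.62
```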

How It Works

1. Upload: Import data from Google Sheets, CSV, or JSON
2. Guidelines: Define evaluation criteria and scoring rubrics
3. Calibrate AI: Generate prompts, run batches, refine from disagreements
4. Evaluate: Teams annotate with AI pre-fill or full automation (see the end-to-end sketch below)
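Putting the four steps together, a compact end-to-end sketch might look like the following. The file names (`items.csv` with a `text` column, `prefill.json`), the rubric shape, and the `ai_label` stub are all assumptions for illustration, not GANDALF's real interface:

```python
import csv
import json

# 1. Upload: load items from a CSV export (Google Sheets can export CSV).
with open("items.csv", newline="") as f:
    items = [row["text"] for row in csv.DictReader(f)]

# 2. Guidelines: a scoring rubric, here a simple label set plus a criterion.
rubric = {
    "labels": ["pass", "fail"],
    "criteria": "Answer is factually correct and cites a source.",
}

# 3. Calibrate AI: build an annotation prompt from the rubric
#    (refinement loop omitted; see the calibration sketch above).
prompt = (
    f"Label each item as one of {rubric['labels']} "
    f"using this criterion: {rubric['criteria']}"
)

# 4. Evaluate: pre-fill AI labels for human review, or run fully automated.
def ai_label(prompt: str, item: str) -> str:
    """Placeholder for a model call; swap in a real LLM client here."""
    return "pass"

prefill = [{"item": it, "ai_label": ai_label(prompt, it)} for it in items]
with open("prefill.json", "w") as out:
    json.dump(prefill, out, indent=2)
```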