Calibrate Human + AI Evaluation at Scale
GANDALF helps you build heterogeneous evaluation teams, calibrate AI annotators against human judgments, and track agreement until convergence.
Why GANDALF?
AI-Assisted Calibration
Autonomously generate and refine annotation prompts. The AI iterates on disagreements until it converges with your human evaluators.
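In outline, that loop might look like the sketch below. The three callables are hypothetical placeholders standing in for GANDALF's internals, not its actual API; you would plug in real prompt generation, model batch calls, and refinement logic.

```python
# Minimal sketch of the calibration loop described above. The callables
# (generate_prompt, run_ai_batch, refine_prompt) are hypothetical
# placeholders, not GANDALF's real API.

def calibrate(examples, human_labels,
              generate_prompt, run_ai_batch, refine_prompt,
              target=0.85, max_rounds=5):
    """Iterate on the annotation prompt until AI/human agreement converges."""
    prompt = generate_prompt()
    agreement = 0.0
    for _ in range(max_rounds):
        ai_labels = run_ai_batch(prompt, examples)
        matches = [h == a for h, a in zip(human_labels, ai_labels)]
        agreement = sum(matches) / len(matches)
        if agreement >= target:  # converged with the human evaluators
            break
        # Collect the cases where AI and humans diverge and revise from them
        disagreements = [
            (ex, h, a)
            for ex, h, a, ok in zip(examples, human_labels, ai_labels, matches)
            if not ok
        ]
        prompt = refine_prompt(prompt, disagreements)
    return prompt, agreement
```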
Heterogeneous Teams
Assign tasks to one, several, or all evaluators, and support multi-annotator workflows with flexible team structures.
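A flexible assignment model can be as simple as a task-to-evaluators mapping. The structure below is an illustrative sketch, not GANDALF's data model.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One item to annotate, with its assigned evaluators."""
    item_id: str
    assignees: list[str] = field(default_factory=list)  # empty = everyone

team = ["alice", "bob", "carol"]

tasks = [
    Task("q1", ["alice"]),         # single evaluator
    Task("q2", ["alice", "bob"]),  # multiple evaluators
    Task("q3"),                    # all evaluators
]

for t in tasks:
    print(t.item_id, "->", t.assignees or team)
```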
Agreement Tracking
Per-question agreement rates, Cohen's Kappa, and side-by-side disagreement viewers let you pinpoint where AI and humans diverge.
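Per-question agreement and Cohen's kappa are standard metrics, so you can reproduce them yourself. The snippet below is illustrative of the math, not GANDALF's internal code, and uses scikit-learn's `cohen_kappa_score` on toy labels.

```python
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

# Toy annotations: (question_id, human_label, ai_label)
annotations = [
    ("q1", "pass", "pass"),
    ("q1", "fail", "pass"),
    ("q2", "pass", "pass"),
    ("q2", "fail", "fail"),
]

human = [h for _, h, _ in annotations]
ai = [a for _, _, a in annotations]

# Chance-corrected agreement across all annotations
print("Cohen's kappa:", cohen_kappa_score(human, ai))

# Raw agreement rate per question
per_q = defaultdict(list)
for qid, h, a in annotations:
    per_q[qid].append(h == a)
for qid, matches in per_q.items():
    print(qid, "agreement:", sum(matches) / len(matches))
```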
How It Works
1. Upload: Import data from Google Sheets, CSV, or JSON.
2. Guidelines: Define evaluation criteria and scoring rubrics.
3. Calibrate AI: Generate prompts, run batches, and refine from disagreements.
4. Evaluate: Teams annotate with AI pre-fill or full automation (see the end-to-end sketch below).
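Putting the four steps together, an end-to-end run might look like the sketch below. The CSV data is inlined and `ai_annotate` is a trivial stand-in for the calibrated AI annotator; none of this is GANDALF's real API.

```python
import csv
import io

# 1. Upload: parse items from a CSV export (inlined here for the example)
raw = "id,text\nq1,The sky is green.\nq2,Water boils at 100 C at sea level.\n"
items = list(csv.DictReader(io.StringIO(raw)))

# 2. Guidelines: define a scoring rubric
rubric = {1: "incorrect", 2: "partially correct", 3: "correct"}

# 3. Calibrate AI: placeholder for a prompted model refined against
#    human labels (see the calibration loop sketched earlier)
def ai_annotate(item):
    return 1 if "green" in item["text"] else 3

# 4. Evaluate: the AI pre-fills a score, humans confirm or override
for item in items:
    prefill = ai_annotate(item)
    print(f"{item['id']}: AI suggests {prefill} ({rubric[prefill]}) -- awaiting human review")
```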