Calibrate Human + AI Evaluation at Scale

GANDALF helps you build heterogeneous evaluation teams, calibrate AI annotators against human judgments, and track agreement until convergence.

Why GANDALF?

AI-Assisted Calibration

Autonomously generate and refine annotation prompts. The AI iterates on disagreements until its judgments converge with those of your human evaluators.
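A minimal sketch of what such a calibration loop might look like, in Python. The `annotate` and `refine` callables stand in for an LLM annotation call and a prompt rewriter; they are illustrative placeholders, not part of a published GANDALF API:

```python
from typing import Callable, Sequence

def calibrate(
    prompt: str,
    items: Sequence[str],
    human_labels: Sequence[str],
    annotate: Callable[[str, str], str],  # (prompt, item) -> label, e.g. an LLM call
    refine: Callable[[str, list], str],   # (prompt, disagreements) -> improved prompt
    target: float = 0.9,
    max_rounds: int = 10,
) -> tuple[str, float]:
    """Refine an annotation prompt until AI/human agreement reaches
    `target` or `max_rounds` is exhausted."""
    rate = 0.0
    for _ in range(max_rounds):
        ai_labels = [annotate(prompt, item) for item in items]
        rate = sum(a == h for a, h in zip(ai_labels, human_labels)) / len(items)
        if rate >= target:
            break
        # Feed only the disagreeing examples back into the prompt rewriter.
        disagreements = [
            (item, human, ai)
            for item, human, ai in zip(items, human_labels, ai_labels)
            if human != ai
        ]
        prompt = refine(prompt, disagreements)
    return prompt, rate
```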

Heterogeneous Teams

Assign tasks to a single evaluator, a subset of the team, or everyone. Support multi-annotator workflows with flexible team structures.
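One way to model this kind of flexible assignment is a task record whose assignee list may name one evaluator, several, or nobody (meaning the whole team). This is a hypothetical data structure for illustration, not GANDALF's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    item_id: str
    assignees: list[str] = field(default_factory=list)  # empty = whole team

TEAM = ["alice", "bob", "carol"]

def assigned_to(task: Task, evaluator: str) -> bool:
    """An empty assignee list routes the task to every evaluator."""
    return not task.assignees or evaluator in task.assignees

tasks = [
    Task("q1", ["alice"]),         # single evaluator
    Task("q2", ["alice", "bob"]),  # multiple evaluators
    Task("q3"),                    # all evaluators
]
for t in tasks:
    print(t.item_id, "->", [e for e in TEAM if assigned_to(t, e)])
```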

Agreement Tracking

Per-question agreement rates, Cohen's Kappa, and side-by-side disagreement viewers let you pinpoint where AI and humans diverge.
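Cohen's kappa corrects raw agreement for chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected if each annotator labeled at random from their own label marginals. A self-contained sketch of the computation:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)  # expected
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Example: AI vs. human labels on five items.
print(cohens_kappa(["yes", "no", "yes", "yes", "no"],
                   ["yes", "no", "no", "yes", "no"]))  # ~0.62
```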

How It Works

1. Upload: Import data from Google Sheets, CSV, or JSON
2. Guidelines: Define evaluation criteria and scoring rubrics
3. Calibrate AI: Generate prompts, run batches, refine from disagreements
4. Evaluate: Teams annotate with AI pre-fill or full automation (see the end-to-end sketch below)
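Putting the four steps together, a compact end-to-end sketch might look like the following. The file names (`items.csv` with a `text` column, `prefill.json`), the rubric shape, and the `ai_label` stub are all assumptions for illustration, not GANDALF's real interface:

```python
import csv
import json

# 1. Upload: load items from a CSV export (Google Sheets can export CSV).
with open("items.csv", newline="") as f:
    items = [row["text"] for row in csv.DictReader(f)]

# 2. Guidelines: a scoring rubric, here a simple label set plus a criterion.
rubric = {
    "labels": ["pass", "fail"],
    "criteria": "Answer is factually correct and cites a source.",
}

# 3. Calibrate AI: build an annotation prompt from the rubric
#    (refinement loop omitted; see the calibration sketch above).
prompt = (
    f"Label each item as one of {rubric['labels']} "
    f"using this criterion: {rubric['criteria']}"
)

# 4. Evaluate: pre-fill AI labels for human review, or run fully automated.
def ai_label(prompt: str, item: str) -> str:
    """Placeholder for a model call; swap in a real LLM client here."""
    return "pass"

prefill = [{"item": it, "ai_label": ai_label(prompt, it)} for it in items]
with open("prefill.json", "w") as out:
    json.dump(prefill, out, indent=2)
```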