Evaluating AI Output
For individual contributors (ICs) and managers reviewing AI-generated work, whether their own, their team's, or a vendor's. Seven chapters: the fluency illusion (Microsoft FAccT 2024, Anthropic sycophancy, Stanford HAI 2026); accuracy vs. usefulness as separate tests; three hallucination patterns (confident fabrication, plausible detail, stale fact) anchored in NIST AI 600-1 and the Vectara HHEM leaderboard; citation evaluation with the Mata v. Avianca anchor, 1,353+ court cases, Sixth Circuit $30K sanctions, and Deloitte Australia's AU$290K refund; demographic and regional bias with named cases (Bloomberg study, EEOC iTutorGroup, Workday Mobley class action); the verification habit grounded in Lally's 66-day study and BJ Fogg's B=MAP formula; and the close: your one-page verification playbook with three never-skip checks, one escalation rule, and the Friday review.
Chapters: 7
Duration: ~45 min
Level: Intermediate
Certification: No
Course Content
Why AI evaluation is harder than it looks
The fluency illusion is real. A Microsoft FAccT 2024 study (404 participants), the Anthropic sycophancy paper, and Stanford HAI 2026 all point the same way: confident writing convinces us regardless of accuracy.
Accuracy vs. usefulness
Two different tests. Output can be accurate and useless, or useful and wrong. The rule: test first for whichever failure is harder to recover from. The Deloitte Australia anchor.
Spotting hallucinations in 3 patterns
Confident fabrication. Plausible detail. Stale fact. NIST taxonomy, OpenAI "why models hallucinate", Vectara HHEM — reasoning models actually hallucinate MORE on long-form text.
Evaluating sources and citations
1,353+ court cases through 2026. Mata v. Avianca to the Sixth Circuit $30K sanctions. Deloitte AU$290K refund. The Nature 72% fake-citation finding. The 3-step citation check.
Spotting bias in outputs
Bloomberg resume study (11% top-rank for Black women), EEOC iTutorGroup $365K settlement, Workday Mobley class action, MMLU-ProX 30-point Swahili gap. Demographic + regional patterns.
Building your verification habit
Lally's 66-day median to automaticity. Moore's overprecision research. BJ Fogg's B=MAP formula. The 5-minute three-step routine that survives week three.
Making it stick: your verification playbook
Three never-skip checks. One escalation rule. The Friday review. Fill in the playbook builder, download the markdown, print it, pin it where you can see it.