
Interview debrief summary with Claude

Difficulty: beginner
Setup time: 30 min
For: recruiter · hiring-manager · talent-acquisition
Category: Recruiting & TA

A Claude Skill that takes a candidate’s full panel — every interviewer’s structured scorecard, optional BrightHire or Metaview transcripts, and the role rubric — and produces an evidence-grounded debrief brief the panel reads before the synchronous debrief meeting. The brief surfaces aggregate signal per rubric dimension, areas of agreement and disagreement, the specific decision-points the panel needs to resolve, and follow-up questions when signal is thin. It deliberately does not emit a hire/no-hire recommendation — that is the panel’s job, and treating it otherwise puts the workflow inside the EU AI Act Annex III high-risk regime and most US state hiring-AI statutes.

The downstream effect: debriefs become 30-minute discussions of the actual disagreements rather than 90-minute reviews of who scored what.

When to use

Run the skill when all of the following are true:

  • A full interview loop has concluded for the candidate, with at least 3 distinct interviewers covering the role rubric.
  • Every interviewer has submitted a structured scorecard against the rubric (free-text-only scorecards fail the input check in step 1 of the skill — see apps/web/public/artifacts/interview-debrief-summary-skill/SKILL.md).
  • The synchronous debrief meeting is at least 2 hours away. The brief is meant to be read in advance, not skimmed in the meeting.
  • The role has a structured rubric matching the shape in apps/web/public/artifacts/interview-debrief-summary-skill/references/1-interview-rubric-template.md — every dimension has a 1-5 anchor table, every anchor has a behavioral description.

When NOT to use

The skill is the wrong tool for several adjacent jobs:

  • Auto-deciding hire/no-hire. The brief never emits a final decision. It emits decision-points for the panel. Auto-deciding triggers EU AI Act Annex III obligations, NYC LL 144’s bias-audit requirement, the consent requirements of the Illinois AI Video Interview Act (AIVIA), and MD HB 1202’s notification rules. The skill is built to fall outside that regime; wiring it into auto-decision logic walks it back in.
  • Sending feedback to candidates without recruiter review. The brief is internal-only. Synthesized rationale text uses internal-panel phrasing that becomes evidence in a discrimination claim if surfaced to the candidate verbatim.
  • Replacing the panel-debrief conversation. The brief is the input to the discussion, not a substitute. “The brief shows consensus, so let’s skip the debrief” is exactly the failure mode the rules in references/3-disagreement-escalation.md are designed to guard against — frictionless consensus is itself a calibration concern.
  • Single-interviewer loops. Below 3 interviewers, panel synthesis is not meaningful. Use a single-interviewer feedback workflow.
  • Transcripts without consent. Two-party-consent jurisdictions (CA, FL, IL, MD, MA, MT, NH, PA, WA) make this a hard halt. Do not pass BrightHire or Metaview transcripts unless the candidate consented to recording at interview start.
  • Calibration sessions on rubric-itself questions. When the panel is debating the rubric (not the candidate), the brief’s per-dimension synthesis is noise. Run the calibration session separately, then re-run the brief once the rubric is stable.

Setup

The artifact bundle lives at apps/web/public/artifacts/interview-debrief-summary-skill/. It contains:

  • SKILL.md — the Claude Skill definition with frontmatter, when-to-invoke rules, the six-step method, the literal output format, and the watch-out / guard pairs.
  • references/1-interview-rubric-template.md — the structured rubric shape the skill validates inputs against.
  • references/2-debrief-brief-format.md — the literal Markdown format the brief is written in.
  • references/3-disagreement-escalation.md — the deterministic decision-point rules (range, bar-raiser veto, HM-vs-panel divergence, single-no-among-yes, coverage gap, under-evidenced cluster).

To stand up the workflow:

  1. Drop the bundle into your Claude Code skills directory. Place interview-debrief-summary-skill/ under your project’s .claude/skills/ (or your team’s shared skills location).
  2. Replace the rubric template with your role-specific rubric. Edit references/1-interview-rubric-template.md per role — every dimension needs a 1-5 anchor table with behavioral descriptions. Keep the dimension count between 4 and 7. Below 4, the panel cannot triangulate; above 7, scorecards get filled out as a chore and evidence quality degrades.
  3. Wire the scorecard export. Configure your ATS export so the skill can read structured scorecards. Ashby, Greenhouse, and Lever each expose scorecard JSON via API; the skill expects an array of {interviewer_id, interviewer_role, dimension_scores, evidence_notes} per the Inputs block in SKILL.md (a minimal sketch of that shape appears after this list).
  4. Test on a known candidate. Run on a candidate where the panel has already debriefed and made a decision. Compare the brief’s decision-points to the actual debrief’s discussion topics. If the brief surfaces topics the panel did not discuss (or misses topics the panel did discuss), tune the rubric — not the prompt — first.
  5. Set up the audit log directory. The skill appends a per-run line to audit/<YYYY-MM>.jsonl containing rubric SHA, interviewer count, decision-point count, and timestamp. No candidate PII in the audit line. The log is what makes the workflow defensible under NYC LL 144 / EU AI Act questioning.
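
A minimal sketch of the two data shapes steps 3 and 5 refer to, written as Python literals. The field names follow the descriptions on this page ({interviewer_id, interviewer_role, dimension_scores, evidence_notes} and the audit-line fields); the IDs, scores, notes, and file name are hypothetical, and the Inputs block in SKILL.md is authoritative.

```python
import json

# Hypothetical scorecard array in the shape step 3 describes.
# One entry per interviewer; the skill requires at least 3 to proceed.
scorecards = [
    {
        "interviewer_id": "u_482",                  # placeholder ATS user id
        "interviewer_role": "Bar-raiser",           # e.g. HM | Peer | XFN | Bar-raiser
        "dimension_scores": {"Technical depth": 2, "Collaboration": 4},
        "evidence_notes": {
            "Technical depth": "Could not explain the trade-off behind their own schema choice.",
            "Collaboration": "Described a disagreement with a PM and named the resolution they drove.",
        },
    },
]

# Hypothetical per-run audit line (step 5): rubric SHA, interviewer count,
# decision-point count, timestamp -- and no candidate PII.
audit_line = {
    "rubric_sha": "3f9a2c1d",
    "interviewer_count": 5,
    "decision_point_count": 2,
    "timestamp": "2025-06-12T14:03:00Z",
}
with open("audit/2025-06.jsonl", "a") as f:
    f.write(json.dumps(audit_line) + "\n")
```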

What the skill actually does

The six-step method runs in this order, and the order is load-bearing:

  1. Validate the rubric and inputs. Halts on free-text-only rubrics, on fewer than 3 interviewers, on dimensions covered by fewer than 2 interviewers, on evidence_notes strings under 20 characters. Halting rather than warning is intentional — a brief generated on partial inputs becomes the panel’s mental anchor.
  2. Aggregate per dimension (deterministic). Computes mean, range, standard deviation, and per-interviewer-role breakdown. The LLM has not seen any scorecard at this point.
  3. Identify decision-points (deterministic). Applies the six rules in references/3-disagreement-escalation.md. Decision-points are based on the structured signal, not on what the LLM judges to read as disagreement (a minimal sketch of these two deterministic steps appears after this list).
  4. Synthesize per dimension. The LLM produces a two-to-three-sentence synthesis per dimension, citing evidence_notes strings verbatim in quotation marks. Paraphrasing is where bias enters; the skill forbids it. When transcripts are available, the synthesis cites the timestamp range. “Insufficient signal — recommend follow-up” is a first-class output, distinct from “no recommendation” — the absence of evidence on a dimension is information the panel needs.
  5. Calibration check. Compares the candidate’s score distribution against the rolling mean of the last 5 same-role debriefs. Findings appear in a “Calibration note” block at the end of the brief, never inline per dimension. Intent: frame the conversation, not adjust scores.
  6. Write the brief and stop. Writes to briefs/<candidate_id>-<YYYYMMDD>.md. Appends one line to the audit log. Does not call any “send to candidate”, “post to Slack”, or “update ATS stage” endpoint. The brief is internal until the recruiter and hiring manager decide what to do.
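
A minimal sketch of the deterministic core of steps 2 and 3, assuming the scorecard shape sketched under Setup. The thresholds (a score range of 2, a bar-raiser 2+ points below the panel mean, a tight cluster with evidence notes averaging under 30 characters) are the ones this page names; the authoritative rule definitions live in references/3-disagreement-escalation.md, and the function names here are illustrative.

```python
from statistics import mean, pstdev

def aggregate_dimension(scorecards, dimension):
    """Step 2: deterministic per-dimension aggregate; no LLM involvement."""
    scored = [s for s in scorecards if dimension in s["dimension_scores"]]
    scores = [s["dimension_scores"][dimension] for s in scored]
    return {
        "scores": scores,
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "stdev": pstdev(scores),
        "by_role": {s["interviewer_role"]: s["dimension_scores"][dimension] for s in scored},
        "notes": [s["evidence_notes"].get(dimension, "") for s in scored],
    }

def decision_points(agg, dimension):
    """Step 3: three of the six escalation rules, restated from this page."""
    points = []

    # Wide score range on one dimension (R1-style).
    if agg["range"] >= 2:
        points.append(f"{dimension}: score range of {agg['range']} across the panel")

    # Bar-raiser 2+ points below the panel mean (R2-style) -- never averaged away.
    bar_raiser = agg["by_role"].get("Bar-raiser")
    if bar_raiser is not None and agg["mean"] - bar_raiser >= 2:
        points.append(f"{dimension}: bar-raiser scored {bar_raiser} vs panel mean {agg['mean']:.1f}")

    # Under-evidenced cluster (R6-style): scores agree but the evidence is thin.
    clustered = len(agg["scores"]) >= 3 and agg["range"] <= 1
    thin = agg["notes"] and mean(len(n) for n in agg["notes"]) < 30
    if clustered and thin:
        points.append(f"{dimension}: RECOMMEND FOLLOW-UP, scores cluster but evidence is thin")

    return points
```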

The output format is fixed (see apps/web/public/artifacts/interview-debrief-summary-skill/references/2-debrief-brief-format.md) and intentionally has no “Recommendation” section — only “Aggregate signal”, “Per-dimension synthesis”, “Decision-points for the panel”, “Follow-up questions”, “Calibration note”, and “Appendix — per-interviewer evidence”. A reader who tries to read off a hire decision finds the structure pushes them back to discussion.

Cost reality

A typical brief for a 5-interviewer loop with 5 rubric dimensions and no transcripts attached lands at roughly 18-25k input tokens (rubric + scorecards + evidence notes + the three reference files) and 4-6k output tokens. With Claude Sonnet at current API pricing, that’s about $0.10-$0.15 per debrief. With transcripts attached (typical 30-minute interview transcript: 7-10k tokens each), a 5-interviewer loop pushes to $0.40-$0.70 per debrief.
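
The no-transcript figure is easy to sanity-check. The sketch below assumes Sonnet list pricing of $3 per million input tokens and $15 per million output tokens; check current pricing before treating the result as anything more than an order-of-magnitude estimate.

```python
# Mid-range token counts from the paragraph above, no transcripts attached.
input_tokens, output_tokens = 22_000, 5_000
price_per_m_in, price_per_m_out = 3.00, 15.00      # assumed Sonnet list pricing, USD per 1M tokens
cost = input_tokens / 1e6 * price_per_m_in + output_tokens / 1e6 * price_per_m_out
print(f"~${cost:.2f} per debrief")                 # ≈ $0.14, inside the $0.10-$0.15 range
```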

The time-saved math is the load-bearing number: a typical 5-interviewer debrief meeting runs 60-90 minutes, of which 30-50 minutes is “what did each of us see” round-robin before any actual decision-discussion happens. The brief replaces the round-robin. Recruiters running this skill at one of our reference orgs report debrief meetings averaging 28 minutes (down from 75 minutes) for loops where the brief was distributed at least 4 hours in advance.

That’s roughly 45 minutes saved per debrief, across (typically) 5 interviewers — about 3.75 person-hours of meeting time per loop, at a cost of well under a dollar.

Success metric

The metric to watch move: median debrief meeting length in calendar minutes for loops where the brief was distributed at least 4 hours in advance. Pull from your calendar tooling (or from Ashby interview-scheduling history) and segment into “with brief” vs “without brief” cohorts. Target trajectory: 60-90 minute median in the no-brief cohort drops to 25-40 minute median in the with-brief cohort over the first 4-6 weeks.
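
One way to compute the cohort split, assuming you can export debrief meeting records with a duration and how far in advance the brief went out; the field names below are hypothetical and will differ per calendar or ATS export.

```python
from statistics import median

# Hypothetical export: one record per debrief meeting.
debriefs = [
    {"minutes": 78, "brief_sent_hours_ahead": 0},   # no brief distributed
    {"minutes": 85, "brief_sent_hours_ahead": 1},   # brief sent too late to count
    {"minutes": 31, "brief_sent_hours_ahead": 6},
    {"minutes": 26, "brief_sent_hours_ahead": 5},
]

with_brief = [d["minutes"] for d in debriefs if d["brief_sent_hours_ahead"] >= 4]
no_brief   = [d["minutes"] for d in debriefs if d["brief_sent_hours_ahead"] < 4]
print(f"with brief: {median(with_brief)} min · without: {median(no_brief)} min")
```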

Counter-metric to watch in parallel: post-hire regret rate at 6 months in the with-brief cohort vs the no-brief cohort. If debriefs got faster but regret rate ticked up, the brief is letting disagreements get averaged away rather than surfacing them — tighten the disagreement-escalation rules in references/3-disagreement-escalation.md (typically: lower the range threshold from 2 to 1.5, or add an “any score under 3” trigger for the relevant dimension).
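
If the counter-metric does move the wrong way, the tightened check looks roughly like this. The rule itself lives as prose in references/3-disagreement-escalation.md; the threshold and function name here are illustrative.

```python
RANGE_THRESHOLD = 1.5   # lowered from 2, per the paragraph above

def tightened_range_rule(dimension, scores):
    """Escalate on a 1.5-point spread, or on any single score under 3."""
    points = []
    if max(scores) - min(scores) >= RANGE_THRESHOLD:
        points.append(f"{dimension}: score range of {max(scores) - min(scores)}")
    if any(s < 3 for s in scores):
        points.append(f"{dimension}: at least one score under 3")
    return points
```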

vs alternatives

  • Ashby’s built-in debrief features. Ashby aggregates scorecards in a dashboard view and computes a panel-mean. It does not produce a written synthesis, does not surface decision-points by rule, and does not differentiate “consensus at 4.0” from “under-evidenced cluster at 4.0”. Use Ashby’s view as the data source the skill reads from, not as a substitute for the brief.
  • Greenhouse Scorecards aggregation. Greenhouse rolls scorecards into a hire-or-no-hire tally per interviewer plus a panel recommendation aggregate. The aggregate is the failure mode the skill is designed against — it nudges panels toward score-arithmetic-as-decision and obscures bar-raiser vetoes that get averaged into a “thumbs up” overall.
  • Manual recruiter notes. A recruiter reading every scorecard and writing a one-paragraph “themes for the debrief” email is the status quo at most teams. It captures the recruiter’s read of the loop, which is valuable, but it scales linearly with recruiter time and tends to pattern-match toward “what the HM probably wants” over many iterations. The skill is consistent across recruiters and surfaces structural disagreements (R3 — HM-vs-panel divergence) that a recruiter writing the brief themselves rarely flags.
  • Doing nothing. The default — everyone shows up to the debrief with their own notes and the discussion runs round-robin. Works fine for low-volume teams (under 10 hires per quarter). At higher volume, the round-robin is the bottleneck and debrief quality degrades as fatigue accumulates.

Watch-outs

  • Bias from one strong opinion (anchoring on the first scorecard read). Guard: step 2 aggregates deterministically across all interviewers before the LLM sees any single scorecard. Step 3’s R3 rule (HM-vs-panel divergence) explicitly surfaces single-strong-opinion divergence as a decision-point. The synthesis attributes evidence by interviewer role (HM, Peer, XFN, Bar-raiser) rather than by name in the per-dimension blocks, which keeps the brief from drifting toward the most senior interviewer’s read.
  • False consensus on under-evidenced dimensions. Guard: evidence_notes minimum-length check in step 1 (under 20 chars fails). R6 (under-evidenced cluster) in step 3 surfaces dimensions where 3+ scores cluster within 1 point but average evidence note is under 30 characters as RECOMMEND FOLLOW-UP, not as agreement. This is the most common silent failure of free-form debriefs.
  • Score-arithmetic-as-decision (treating mean above 3.5 as “hire”). Guard: the brief never emits a hire/no-hire recommendation. The output format intentionally has no “Recommendation” block — only decision-points and follow-ups. A reader who tries to read off a decision finds the structure pushes them back to discussion.
  • Bar-raiser veto silently overridden. Guard: R2 in step 3 surfaces any bar-raiser score 2+ below the panel mean as a decision-point automatically. The brief cannot be generated in a state where a bar-raiser dissent is averaged away — even if the rest of the panel is unanimous.
  • Demographic patterns leaking into synthesis. Guard: the synthesis cites evidence_notes strings verbatim rather than paraphrasing, which prevents the LLM from rewriting an observation into language that telegraphs a protected-class inference. If a passed-in evidence_note itself contains protected-class proxies (name origin, age inference, parental-status inference, “culture fit” without behavioral anchors), the skill halts in step 1 and surfaces the offending note for re-write before continuing.
  • Calibration note overinterpreted as a verdict. Guard: the calibration block is appended at the end of the brief, never inline per dimension. The block uses the language “within tolerance” or “outside tolerance — discuss” rather than suggesting an action, and the calibration check skips entirely if fewer than 5 prior same-role debriefs are loaded (a minimal sketch of the check follows this list).
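
A minimal sketch of the calibration check as step 5 describes it: skip when fewer than 5 prior same-role debriefs are loaded, compare against the rolling mean of the last 5, and emit language rather than an action. The 0.5 tolerance is a placeholder, not a value taken from the skill.

```python
from statistics import mean

def calibration_note(candidate_mean, prior_same_role_means, tolerance=0.5):
    """Step 5 restated: returns a note for the end-of-brief block, never a score adjustment."""
    if len(prior_same_role_means) < 5:
        return None                                  # skip: fewer than 5 prior same-role debriefs
    rolling = mean(prior_same_role_means[-5:])       # rolling mean of the last 5 loops
    delta = candidate_mean - rolling
    status = "within tolerance" if abs(delta) <= tolerance else "outside tolerance, discuss"
    return {"rolling_mean": round(rolling, 2), "delta": round(delta, 2), "status": status}
```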

Stack

  • AI provider: Claude (Sonnet for the synthesis step; Opus for first-run rubric validation if the rubric is ambiguous).
  • ATS: Ashby, Greenhouse, or Lever — the scorecard data source.
  • Optional transcripts: BrightHire or Metaview, with documented two-party-consent capture at interview start.
  • Where it fits: see structured interviewing for the rubric design discipline this skill assumes is already in place. The skill cannot rescue an unstructured interview process — it can only synthesize the signal a structured process produces.
  • Policy framing: see AI policy for legal teams for the Tier-A enterprise-AI handling required for candidate-data inputs (transcripts in particular are sensitive personal data under GDPR and most US state privacy regimes).
