claude-skill

Take-home assessment evaluator with Claude

Difficulty

Advanced

Setup time

40min

For

recruiter · hiring-manager · technical-screener

Recruiting & TA

A Claude Skill that scores a candidate's take-home submission against a rubric written by the hiring team, with line-level citations from the submitted code or documents, and produces a structured evaluation report. It never emits an automatic pass or fail. The hiring panel uses the report to anchor the live debrief; the actual hire/no-hire decision is made in the panel discussion, not in the report. It replaces 60-90 minutes per panelist of unstructured "I read this Saturday morning and thought it was okay?" with a structured 15-25 minute review per panelist plus a 30-minute calibrated debrief.

When to use

The role uses a take-home assessment as a step in the loop. Structured interviewing is a prerequisite: without a written rubric, the skill has nothing to score against.
You want consistent scoring across panelists. Take-home reviews are notoriously inconsistent because each panelist reads at a different time with a different level of attention; the rubric-anchored report is the leveling artifact.
The take-home is a coding task, a system-design write-up, a written task (PRD draft, sales-call mock write-up), or an integration build that produces inspectable artifacts.

When NOT to use

Auto-pass / auto-fail in the loop. The skill produces a scored report. The hire decision is made in the panel debrief. Wiring the report's aggregate score to a stage transition triggers the same NYC Local Law 144 / EU AI Act exposure as automatic rejection at screening.
Live coding interviews. That is a different workflow (live observation of the process, not artifact evaluation). The interview-debrief workflow handles that case.
Take-homes that require more than 4 hours of candidate work. Long take-homes are themselves a candidate-experience anti-pattern; the skill does not fix that.
Submissions where the candidate did not sign the AI-use disclosure. The rubric scoring is calibrated against a specific AI-use policy (e.g. "AI tools allowed for syntax help, not for solution generation"); without the disclosure, the skill cannot calibrate its AI-use signal detection.
Plagiarism detection as the primary use. The skill flags suspicious patterns (verbatim matches with public repos, generic AI-generated boilerplate) but is not a forensic plagiarism tool. Use a dedicated tool if you need defensible plagiarism findings.

Setup

Drop in the bundle. Place apps/web/public/artifacts/take-home-evaluator-claude-skill/SKILL.md in your Claude Code skills directory.
Author the rubric. For each take-home, write a JSON rubric with the dimensions you actually score on (correctness, code quality, decision-making documented in comments / README, error handling, test coverage). Write anchors for each dimension at 1-5. The template is in references/1-take-home-rubric-template.md.
Configure the AI-use policy. The skill's prompt tells Claude explicitly which AI use was permitted ("syntax help only", "AI tools allowed throughout", "no AI tools", and so on). The setting must match the disclosure language in the take-home brief.
Set the panelist distribution mode. Choose either single-panelist mode (one report per submission) or per-panelist mode (each panelist gets the same submission and generates their own evaluation, and the skill aggregates the cross-panelist deltas). Per-panelist mode catches scoring drift but multiplies the model cost by the number of panelists.
Dry-run on a closed take-home. Score a take-home from a candidate who was hired (or not) last quarter. Compare the skill's dimension scores with the panel's actual scores. Adjust the rubric anchors if the skill weights dimensions differently.

What the skill actually does

Six steps. The order matters: deterministic checks (compile, run, file structure) happen before the LLM scores anything, because letting the model score a submission that does not run produces a confident report about a broken artifact.

Validate the submission shape. Check that all deliverables named in the take-home brief are present (e.g. README.md, source files, test files). Missing deliverables are flagged in the report; the dimensions that depend on them are NOT scored.
Run deterministic checks. Compile the code. Run the test suite the candidate wrote. Capture the output. These are the auditable, reproducible results; the LLM does not re-litigate them.
Score per rubric dimension. For each dimension in the rubric, score 1-5 with verbatim citations from the candidate's submission (file path + line range + the code or text). Citations are required; without a citation the score falls back to the rubric's 1-anchor. The citation requirement keeps the model anchored in the actual submission instead of giving generic feedback.
Detect AI-use signal against the policy. Run pattern checks against the disclosed AI-use policy. Verbatim matches with public AI-generated boilerplate, suspiciously consistent style across files of varying complexity, or generic comments that do not engage with the problem-specific decisions all surface as ai-use-signal notes: not as a violation, only as a signal for the panel to discuss against the disclosed policy.
Compute the aggregate WITHOUT a hire/no-hire recommendation. Sum the dimension scores. Output the aggregate as a number. Do NOT translate the aggregate into a recommendation. The skill explicitly returns "report; no decision", not "pass / fail".
Emit the per-panelist or aggregated report. In single-panelist mode the report goes to the calling panelist. In per-panelist mode the skill aggregates across panelists, shows per-dimension cross-panelist deltas (and which panelist saw what differently), and emits a debrief-ready report.
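
The cross-panelist deltas in the last step can be sketched as follows. This is a minimal illustration, assuming each panelist's report reduces to a mapping from dimension id to a 1-5 score; the function name `cross_panelist_deltas`, the output keys, and the panelist ids are hypothetical and not part of the skill bundle.

```python
from statistics import mean

def cross_panelist_deltas(scores_by_panelist):
    """Return, per dimension, the panel mean, the spread, and each panelist's delta.

    scores_by_panelist: {panelist_id: {dimension_id: score}} (assumed shape).
    """
    dimensions = sorted({d for scores in scores_by_panelist.values() for d in scores})
    deltas = {}
    for dim in dimensions:
        # Only panelists who scored this dimension contribute to its mean.
        scored = {p: s[dim] for p, s in scores_by_panelist.items() if dim in s}
        panel_mean = mean(scored.values())
        deltas[dim] = {
            "mean": panel_mean,
            "spread": max(scored.values()) - min(scored.values()),
            "delta_by_panelist": {p: v - panel_mean for p, v in scored.items()},
        }
    return deltas

panel = {
    "panelist-a": {"correctness": 4, "error_handling": 2},
    "panelist-b": {"correctness": 4, "error_handling": 4},
    "panelist-c": {"correctness": 3, "error_handling": 3},
}
report = cross_panelist_deltas(panel)
print(report["error_handling"]["delta_by_panelist"])
```

A large spread on one dimension is a prompt for the debrief ("what did you see that I did not?"), not something the sketch resolves on its own.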

Cost reality

Per take-home submission, with Claude Sonnet 4.6:

LLM tokens: typically 15-30k input tokens (rubric + submission code/text + skill instructions) and 3-5k output tokens (per-dimension scored report). Roughly $0.15-0.25 per submission in single-panelist mode. Per-panelist mode (3-4 panelists) multiplies linearly.
CI / sandbox cost: running the candidate's test suite costs whatever your CI normally costs; usually negligible. Sandboxed execution (recommended; never run candidate code on a panelist's laptop) costs whatever your sandboxed-runner provider charges.
Panelist time: this is the win. A panelist's first pass over a take-home takes 60-90 minutes when done well, less when done badly. Reviewing the skill's report and noting agree/disagree per dimension takes 15-25 minutes. Aggregate panel time saved per take-home: 2-3 panelist-hours.
Setup time: 40 minutes, once per take-home format, for the rubric and the AI-use policy mapping. Reuse across roles in the same family is high.

Success metric

Track three things per take-home cycle:

Cross-panelist score variance: variance across the panelists' per-dimension scores. The skill should compress the variance (panelists anchored on the same rubric and the same citations) without forcing artificial agreement. Variance below ~0.5 (on a 5-point scale) suggests that panelists are rubber-stamping the skill's report; above ~1.5 suggests that the rubric anchors are too vague to differentiate submissions.
Hire vs. no-hire correlation with the skill aggregate: over a quarter, does the panel's hire decision correlate with the skill's aggregate? It should be positive but NOT 1.0. If it is 1.0, the panel is delegating automatically (the failure mode the skill is designed against); if it is 0, the rubric or the skill is not aligned with what the panel actually values.
Take-home debrief duration: wall-clock time from "all panelists have submitted reviews" to "decision recorded". It should drop from 1-2 days to under 4 hours, because the report is a shared anchor.
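
The variance thresholds for the first metric can be checked with a few lines of Python. This is a sketch under stated assumptions: the text does not say whether sample or population variance is meant, so sample variance is used here, and the function name and the returned labels are hypothetical.

```python
from statistics import variance

LOW, HIGH = 0.5, 1.5  # the approximate thresholds named in the text

def classify_variance(scores):
    """Classify the spread of one dimension's scores across panelists."""
    if len(scores) < 2:
        return "insufficient data"
    v = variance(scores)  # sample variance: an assumption, the text does not specify
    if v < LOW:
        return "possible rubber-stamping"
    if v > HIGH:
        return "anchors may be too vague"
    return "healthy"

for panel_scores in ([4, 4, 4], [2, 3, 4], [1, 3, 5]):
    print(panel_scores, classify_variance(panel_scores))
```

With only 3-4 panelists the variance estimate is noisy, so it is more useful tracked across many take-homes than read off a single submission.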

vs. alternatives

vs. CodeSignal Coding Reports / HackerRank automated grading. These products run candidate code against the platform's test cases and emit a score. Choose them when your take-home is structured, well-defined input to well-defined output (LeetCode style). Choose the skill when the take-home is a build (write a small system, design an API, write a PRD), where the rubric is the score and the score is the rubric. The two are complementary; CodeSignal can serve as the input to the skill's run-tests step.
vs. hand-graded take-homes. Hand grading is right for the highest-stakes roles (founding engineer, principal IC), where the panel's narrative judgment is the deliverable. The skill earns its setup cost on the 80% of take-homes where consistent rubric application is what is missing.
vs. ChatGPT-style "review this code". Generic chat gives generic feedback. The skill is structurally different: it requires verbatim citations, runs deterministic checks first, and refuses to produce a hire/no-hire recommendation.
vs. no take-home (live-only loops). A reasonable choice for senior roles where references and live rounds carry the load. The skill is irrelevant if the loop has no take-home.

Watch-outs

Auto-pass / auto-fail drift. Guard: the skill's output ends with the per-dimension scores, the aggregate, and the AI-use signal notes. There is no "pass" or "fail" string. The schema explicitly omits a recommendation field.
Generic-feedback hallucination. Guard: every dimension score requires a verbatim citation (file path + line range + content). Scores without citations fall back to 1.
Bias inheritance from the rubric. Guard: the rubric is upstream of this skill. Run the rubric through the Diversity Slate Auditor framing: does the rubric score dimensions with known disparate impact (e.g. "uses obscure idioms", which often correlates with bootcamp vs. computer-science-degree background)?
AI-use detection false positive. Guard: AI-use signals are surfaced as notes, not as violations. The panel reviews them against the disclosed policy. Auto-flagging them as violations would be the wrong reading; legitimate use of AI tools (within the policy) is increasingly the norm.
Sandboxing failure with candidate code. Guard: the skill explicitly recommends sandboxed execution and warns when the calling environment runs the test suite directly on the panelist's machine. Never run unreviewed candidate code on a machine with access to company secrets.
Submission-size explosion. Guard: if the submission exceeds ~50k LOC, the skill warns that the scoring will be partial and asks the panelist to identify the parts it should focus on. Take-homes that produce 50k LOC are themselves a sign that the brief was wrong.

Stack

The skill bundle lives at apps/web/public/artifacts/take-home-evaluator-claude-skill/ and contains:

SKILL.md: the skill definition
references/1-take-home-rubric-template.md: fillable rubric template
references/2-ai-use-policy-mapping.md: how the disclosed policy maps to the skill's pattern checks

Tools the workflow assumes: Claude (the model). Optional: CodeSignal or HackerRank for the deterministic-checks leg; Ashby for the candidate record. Sandboxed execution is the recruiter's / hiring manager's choice (Docker container, GitHub Actions, etc.).

Files in this artifact

---
name: take-home-evaluator
description: Score a take-home submission against a rubric, with verbatim citations from the candidate's code or text, plus deterministic checks (compile, run tests). Output is a structured report with per-dimension scores and an aggregate — never a hire/no-hire recommendation. Detects AI-use signals against the disclosed policy as notes for the panel debrief.
---

# Take-home assessment evaluator

## When to invoke

Use this skill when a panelist has a candidate's take-home submission and wants a rubric-anchored evaluation report to bring to the panel debrief. Take the submission directory plus the role rubric as input and return a structured Markdown report.

Do NOT invoke this skill for:

- **Auto-pass / auto-fail in the loop.** This skill produces a scored report. The hire decision happens in the panel debrief, not in the report.
- **Live coding interviews.** Different workflow.
- **Submissions where the candidate did not sign the AI-use disclosure.** The skill calibrates against a specific use-of-AI policy; without the disclosure, there is nothing to calibrate against.
- **Plagiarism forensics.** The skill flags suspicious patterns but is not a defensible plagiarism tool.

## Inputs

- Required: `submission_dir` — path to the candidate's submission directory.
- Required: `rubric` — path to the take-home rubric file. See `references/1-take-home-rubric-template.md` for the shape.
- Required: `ai_use_policy` — string identifying the disclosed policy. One of `none-allowed`, `syntax-help-only`, `ai-tools-allowed`. The skill's pattern checks are calibrated against the policy, not against an absolute "AI-generated" detector.
- Optional: `panelist_id` — if running per-panelist mode, identify the panelist for cross-panelist aggregation.
- Optional: `sandboxed` — boolean. If `false`, the skill warns about running candidate code on the local machine and asks for confirmation before proceeding to step 2.
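
For a calling environment that wants to validate these inputs before invoking the skill, the following is a minimal sketch. The class name `EvaluatorInputs`, the `needs_confirmation` property, and the default of `sandboxed=False` are assumptions for illustration; the list above does not state a default.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

ALLOWED_POLICIES = ("none-allowed", "syntax-help-only", "ai-tools-allowed")

@dataclass(frozen=True)
class EvaluatorInputs:
    submission_dir: Path
    rubric: Path
    ai_use_policy: str
    panelist_id: Optional[str] = None  # set only in per-panelist mode
    sandboxed: bool = False  # assumed default: treat the environment as unsandboxed

    def __post_init__(self):
        # Reject any policy string outside the three documented values.
        if self.ai_use_policy not in ALLOWED_POLICIES:
            raise ValueError(
                f"ai_use_policy must be one of {ALLOWED_POLICIES}, got {self.ai_use_policy!r}"
            )

    @property
    def needs_confirmation(self) -> bool:
        """True when step 2 must pause for the panelist's confirmation."""
        return not self.sandboxed

inputs = EvaluatorInputs(Path("submission"), Path("rubric.json"), "syntax-help-only")
print(inputs.needs_confirmation)
```

Defaulting to the unsandboxed assumption means a caller has to assert sandboxing explicitly, which keeps the warning path as the fallback.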

## Reference files

- `references/1-take-home-rubric-template.md` — the rubric shape the skill expects.
- `references/2-ai-use-policy-mapping.md` — how each `ai_use_policy` value maps to the pattern checks in step 4.

## Method

Six steps.

### 1. Validate the submission shape

Walk `submission_dir`. Compare against the deliverables named in the take-home brief (the rubric file's `expected_deliverables` field). For each missing deliverable, add to a `missing_deliverables` array on the report. Do NOT score dimensions that depend on missing files; their score is `not assessed` rather than `1`.
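
The deliverables walk can be sketched with `pathlib` globbing. This is an illustration rather than the skill's actual implementation: the function name `find_missing_deliverables` is hypothetical, and it assumes `expected_deliverables` holds glob patterns relative to the submission root, as in the rubric template.

```python
import tempfile
from pathlib import Path

def find_missing_deliverables(submission_dir, expected_deliverables):
    """Return the rubric's glob patterns that match no file in the submission."""
    root = Path(submission_dir)
    missing = []
    for pattern in expected_deliverables:
        # Path.glob treats "**" as "this directory and all subdirectories".
        if not any(p.is_file() for p in root.glob(pattern)):
            missing.append(pattern)
    return missing

# Demo on a throwaway directory that has a README and one source file only.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "README.md").write_text("# Notes\n", encoding="utf-8")
    (Path(tmp) / "src").mkdir()
    (Path(tmp) / "src" / "main.rs").write_text("fn main() {}\n", encoding="utf-8")
    missing = find_missing_deliverables(
        tmp, ["README.md", "src/**/*.rs", "tests/**/*.rs", "Cargo.toml"]
    )
print(missing)
```

Dimensions that depend on a pattern in `missing` would then be reported as `not assessed` instead of being scored.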

### 2. Run deterministic checks

If the rubric defines `build_commands`, use those. Otherwise, if the submission has a build/test command in `package.json`, `Makefile`, `pyproject.toml`, or `Cargo.toml`, use that:

- Compile / install dependencies in a sandboxed environment. If the calling environment is not sandboxed (per `sandboxed: false`), warn and stop until the panelist confirms.
- Run the candidate's test suite. Capture pass/fail counts.
- Run linters / formatters in check mode (do NOT modify the candidate's code). Capture findings.

Record the deterministic results in a `deterministic_checks` block on the report. These are auditable — the LLM does not re-litigate them in step 3.
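
A sketch of how one deterministic check might be run and recorded, including the stop-until-confirmed guard. The function name `run_check`, its parameters, and the result keys are hypothetical; the demo commands use the Python interpreter as a stand-in for a real build or test command, which would typically be split into an argument list with `shlex.split`.

```python
import subprocess
import sys

def run_check(name, command, sandboxed, confirmed=False, timeout=600):
    """Run one build/test/lint command and capture its outcome without interpreting it."""
    if not sandboxed and not confirmed:
        # Mirrors the "warn and stop until the panelist confirms" rule.
        return {"name": name, "status": "blocked",
                "reason": "unsandboxed environment; panelist confirmation required"}
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return {
        "name": name,
        "status": "passed" if proc.returncode == 0 else "failed",
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }

ok = run_check("test", [sys.executable, "-c", "print('2 passed')"], sandboxed=True)
failed = run_check("lint", [sys.executable, "-c", "import sys; sys.exit(3)"], sandboxed=True)
blocked = run_check("test", [sys.executable, "-c", "print('never runs')"], sandboxed=False)
print(ok["status"], failed["status"], blocked["status"])
```

Passing the command as an argument list (no shell) avoids shell interpretation of anything that originates in the candidate's repository.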

### 3. Score per rubric dimension

For each dimension in the rubric:

- Read the rubric anchors (1-5).
- Find evidence in the submission. Evidence is a verbatim string from the code or text, with file path + line range.
- If you cannot find verbatim evidence for a score above 1, the score is 1 — no inference, no generic feedback.
- Tag each cited line with which dimension it supports.
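
The citation rule lends itself to a mechanical check after scoring. The following is a minimal sketch, assuming a citation is recorded as file path, 1-indexed inclusive line range, and quoted text; `verify_citation` and `effective_score` are hypothetical helper names.

```python
import tempfile
from pathlib import Path

def verify_citation(submission_dir, file_path, start_line, end_line, quoted_text):
    """True only if quoted_text appears verbatim inside the cited line range."""
    target = Path(submission_dir) / file_path
    if not target.is_file():
        return False
    lines = target.read_text(encoding="utf-8").splitlines()
    if start_line < 1 or end_line < start_line or end_line > len(lines):
        return False
    cited = "\n".join(lines[start_line - 1:end_line])
    return quoted_text.strip() != "" and quoted_text in cited

def effective_score(proposed_score, citation_verified):
    """Fall back to 1 when a score above 1 has no verified citation."""
    return proposed_score if proposed_score == 1 or citation_verified else 1

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "router.rs").write_text(
        "fn a() {}\nfn retry() {\n    backoff();\n}\n", encoding="utf-8"
    )
    good = verify_citation(tmp, "src/router.rs", 2, 4, "fn retry() {\n    backoff();")
    bad = verify_citation(tmp, "src/router.rs", 1, 1, "backoff();")
print(effective_score(4, good), effective_score(4, bad))
```

A check like this catches paraphrased or misattributed evidence before the report reaches the panel.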

### 4. Detect AI-use signal against the policy

Run pattern checks per `references/2-ai-use-policy-mapping.md`:

- **Verbatim public matches** — does any chunk of the submission match a public AI-generated boilerplate exactly? Surface as `signal: verbatim_public_match`.
- **Style consistency vs. complexity** — is the style suspiciously consistent across files of varying complexity? Real candidates' style varies with the difficulty of the section. Surface as `signal: uniform_style`.
- **Generic comments without engagement** — comments that explain what the code does without engaging with the problem-specific decisions are a common AI-generated tell. Surface as `signal: generic_comments`.

Surface signals as NOTES, never as VIOLATIONS. The panel reviews against the disclosed policy. If `ai_use_policy: ai-tools-allowed`, the signals are informational only. If `ai_use_policy: none-allowed`, the panel discusses whether the signals warrant follow-up — the skill does not decide.
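
One way to keep signals from ever being rendered as violations is to fix the severity in the data structure itself. This is a sketch under assumptions: the function name `build_signal_section` and the output keys are hypothetical, and the framing strings are shortened from the policy-mapping reference.

```python
PANEL_FRAMING = {
    "none-allowed": "Signals suggest possible AI use. Discuss with the candidate.",
    "syntax-help-only": "AI use was permitted within bounds. Signals indicate where "
                        "the candidate may have crossed the bounds.",
    "ai-tools-allowed": "AI use was permitted. The signals are informational.",
}

def build_signal_section(policy, signals):
    """Assemble the AI-use section of the report; every entry is a note."""
    if policy not in PANEL_FRAMING:
        raise ValueError(f"unknown ai_use_policy: {policy!r}")
    return {
        "disclosed_policy": policy,
        "framing": PANEL_FRAMING[policy],
        # Severity is hardcoded: the skill surfaces notes and never decides.
        "notes": [{"signal": s, "severity": "note"} for s in signals],
        "informational_only": policy == "ai-tools-allowed",
    }

section = build_signal_section("syntax-help-only", ["uniform_style"])
print(section["notes"])
```

Because there is no code path that produces any other severity, a downstream consumer cannot receive a "violation" from this section.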

### 5. Compute aggregate

Sum the per-dimension scores. Surface the aggregate as a number AND the per-dimension breakdown.

Do NOT translate the aggregate into a recommendation. The skill's schema explicitly omits a `recommendation` field. The report ends after the aggregate and the AI-use signals section.
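
The aggregation is a plain sum. The sketch below shows one way to compute it while leaving out any recommendation; the function name and output keys are hypothetical, and excluding `not assessed` dimensions from both the sum and the maximum is an assumption, since this step does not say how they are treated.

```python
NOT_ASSESSED = "not assessed"

def compute_aggregate(dimension_scores, scale_max=5):
    """Sum the assessed per-dimension scores; deliberately returns no verdict."""
    assessed = {d: s for d, s in dimension_scores.items() if s != NOT_ASSESSED}
    return {
        "per_dimension": dict(dimension_scores),
        "aggregate": sum(assessed.values()),
        # Assumption: unassessed dimensions shrink the maximum as well as the sum.
        "max_possible": scale_max * len(assessed),
        "not_assessed": sorted(d for d in dimension_scores if d not in assessed),
    }

scores = {
    "correctness": 4,
    "code_quality": 3,
    "decision_documentation": 4,
    "error_handling": 2,
    "test_coverage": 4,
}
result = compute_aggregate(scores)
print(f'{result["aggregate"]}/{result["max_possible"]}')  # 17/25
```

The returned mapping has no `recommendation` key, so there is nothing for a downstream automation to wire to a stage transition.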

### 6. Emit report

Write the report to `report.md` in the submission directory or to stdout (depending on the calling environment). In per-panelist mode, write to `report-<panelist_id>.md`.

## Output format

```markdown
# Take-home evaluation — {Candidate name} — {Role}

Submission: `{submission_dir}` · Rubric: `{rubric_path}` (SHA `{short}`)
Generated: {ISO timestamp} · Skill v1.0 · Model: claude-sonnet-4-6
Panelist: {panelist_id or "single-panelist mode"}
AI-use policy: {ai_use_policy}

## Deterministic checks

- **Build:** {passed | failed | not applicable}
- **Tests:** {N/M passed} ({test command})
- **Linter:** {N findings}
- **Missing deliverables:** {list or "none"}

## Per-dimension scores

### Correctness — 4/5

> Evidence: `src/router.rs:42-57` — handles the request-routing edge case for retries with exponential backoff and jitter, including the cap at max 60s. Anchor 4 ("handles the named edge cases with explicit code paths") matches.

Counter-evidence (would have been 5): retry budget is hardcoded at 5 attempts; the rubric's anchor 5 names "configurable retry budget."

### Code quality — 3/5

> Evidence: `src/router.rs` — file is 800 lines with no module split. Anchor 3 ("readable but lacks structural decomposition") matches.

### Decision-making documented — 4/5

> Evidence: `README.md:25-40` — explains the choice of exponential-vs-fixed backoff with a reference to the failure mode it mitigates. Anchor 4 matches.

### Error handling — 2/5

> Evidence: `src/router.rs:120` — catches and re-raises the network error without distinguishing between transient and permanent failures. Anchor 2 ("error paths exist but do not differentiate") matches.

### Test coverage — 4/5

> Evidence: `tests/router_test.rs` — covers the happy path, three retry scenarios, and the timeout. Missing: the network-partition test the rubric anchor 5 names.

## Aggregate

17/25.

This is the per-dimension sum. The skill does NOT translate this into a hire/no-hire recommendation. The panel debrief is where the decision happens.

## AI-use signal notes

Disclosed policy: **syntax-help-only**.

- ⚠️ `signal: uniform_style` — `src/cache.rs` and `src/router.rs` use the same comment style and naming idioms despite the different complexity. May warrant a follow-up question in the panel debrief.
- ✓ No `verbatim_public_match` signals.
- ✓ No `generic_comments` signals beyond the documented threshold.

The panel discusses these against the disclosed policy. The skill does not decide.
```

## Watch-outs

- **Auto-pass / auto-fail drift.** *Guard:* the report ends after the aggregate and the AI-use signal notes. No recommendation field. The aggregate is a sum, not a verdict.
- **Generic-feedback hallucination.** *Guard:* every dimension score requires a verbatim citation (file path + line range + content). A score without one falls back to 1.
- **AI-use false positive.** *Guard:* signals are notes, not violations. The panel decides against the disclosed policy.
- **Unsandboxed candidate code.** *Guard:* skill warns before running in non-sandboxed environments.
- **Bias inheritance.** *Guard:* the rubric is upstream of the skill. Audit the rubric separately if the dimensions correlate with disparate impact.

# Take-home rubric template

The take-home evaluator scores a submission against this rubric shape. Copy the JSON below to your role's rubric file (one per take-home format) and fill in every field. The skill reads the rubric; without it, scoring has nothing to anchor against.

A complete rubric takes 30-90 minutes to author per take-home format. Reuse across roles in the same family is high — a senior-backend take-home rubric is largely the same across companies once you've written it once.

## JSON shape

```json
{
"take_home_id": "senior-backend-router-rewrite-v3",
"version": "2026-04-15",
"expected_deliverables": [
"README.md",
"src/**/*.rs",
"tests/**/*.rs",
"Cargo.toml"
],
"build_commands": {
"build": "cargo build --release",
"test": "cargo test --all",
"lint": "cargo clippy -- -D warnings"
},
"ai_use_policy_match": "syntax-help-only",
"dimensions": [
{
"id": "correctness",
"label": "Correctness",
"anchors": {
"1": "Compiles but does not pass the candidate's own tests, or does not handle the named happy path.",
"2": "Handles the happy path; ignores the named edge cases (retries, partial failure).",
"3": "Handles the happy path and the obvious edge cases; misses the subtle ones (clock skew, partition recovery).",
"4": "Handles the named edge cases with explicit code paths; minor gaps acceptable.",
"5": "Handles all named edge cases AND demonstrates a configurable retry budget / timeout structure that the rubric explicitly calls for."
}
},
{
"id": "code_quality",
"label": "Code quality and structural decomposition",
"anchors": {
"1": "Single file, no decomposition; difficult to read.",
"2": "Decomposed but the decomposition does not follow domain boundaries.",
"3": "Readable but lacks structural decomposition that would scale past prototype.",
"4": "Clear module boundaries that follow the domain; idiomatic for the language.",
"5": "All of 4, plus the structural choices are documented in the README with the alternatives considered."
}
},
{
"id": "decision_documentation",
"label": "Decision-making documented",
"anchors": {
"1": "No README, or the README only repeats the take-home brief.",
"2": "README describes what was built without naming the engineering choices.",
"3": "README names some choices without naming the alternatives.",
"4": "README names the choices AND explains why each was picked over the named alternatives.",
"5": "All of 4, plus the README cites the failure modes each choice mitigates."
}
},
{
"id": "error_handling",
"label": "Error handling",
"anchors": {
"1": "Errors are caught and silently swallowed.",
"2": "Error paths exist but do not differentiate between transient and permanent failures.",
"3": "Differentiates transient vs. permanent; lacks structured error types.",
"4": "Structured error types; retry policy is explicit per error class.",
"5": "All of 4, plus error paths have explicit observability (logging / metrics / traces) named in the code."
}
},
{
"id": "test_coverage",
"label": "Test coverage",
"anchors": {
"1": "No tests, or tests do not run.",
"2": "Tests cover the happy path only.",
"3": "Tests cover the happy path and one or two edge cases.",
"4": "Tests cover the happy path and multiple edge cases (timeout, retry, partial failure).",
"5": "All of 4, plus the network-partition test the rubric explicitly calls for."
}
}
],
"rubric_fairness_check": {
"no_bootcamp_vs_cs_proxies": "Anchors must score on observable behavior in the submission, not on idioms that proxy for educational background. 'Uses obscure language idioms' is forbidden as a positive signal.",
"no_native-english-only_proxies": "Anchors must NOT score on README writing fluency beyond the level required to communicate the engineering decisions.",
"documented_in_brief": "The take-home brief shared with the candidate must describe the rubric dimensions and approximate weighting. Surprise dimensions are unfair."
}
}
```

## Per-field notes

- `take_home_id` — stable identifier for the take-home format. Reused across candidates for the same role family.
- `version` — semver or date. Bumped when the rubric is edited; the skill captures the version in the report so re-scoring against an edited rubric is visible.
- `expected_deliverables` — globs the skill walks against the submission. Missing deliverables surface in the report.
- `build_commands` — the skill runs these in step 2 (deterministic checks). Sandboxed execution required.
- `ai_use_policy_match` — should match the disclosure language in the take-home brief. Mismatch means the candidate's policy understanding doesn't match what the skill calibrates against.
- `dimensions` — array. Each dimension has an `id`, a `label`, and 5 anchor strings. Anchors should be observable behavior, not adjectives.
- `rubric_fairness_check` — three named fairness checks the skill confirms before scoring. If the rubric anchors violate any of these, the skill emits a warning and asks the rubric author to revise. (The skill does not refuse to score on a fairness-check violation, because the rubric is upstream and revising it is the right intervention. But it surfaces the issue.)
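
Before handing a rubric to the skill, its shape can be sanity-checked mechanically. The following is a minimal sketch: `rubric_problems` is a hypothetical helper, and treating every top-level field as required is an assumption based on the "fill in every field" instruction above. It checks shape only and says nothing about whether the anchors are fair or observable.

```python
import json

REQUIRED_FIELDS = (
    "take_home_id", "version", "expected_deliverables", "build_commands",
    "ai_use_policy_match", "dimensions", "rubric_fairness_check",
)
ANCHOR_KEYS = ["1", "2", "3", "4", "5"]

def rubric_problems(rubric):
    """Return human-readable shape problems; an empty list means none were found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rubric]
    seen = set()
    for dim in rubric.get("dimensions", []):
        dim_id = dim.get("id", "<missing id>")
        if dim_id in seen:
            problems.append(f"duplicate dimension id: {dim_id}")
        seen.add(dim_id)
        if "label" not in dim:
            problems.append(f"{dim_id}: missing label")
        anchors = dim.get("anchors", {})
        if sorted(anchors) != ANCHOR_KEYS:
            problems.append(f"{dim_id}: anchors must be keyed '1' through '5'")
        elif not all(isinstance(a, str) and a.strip() for a in anchors.values()):
            problems.append(f"{dim_id}: every anchor needs non-empty text")
    return problems

draft = json.loads("""
{
  "take_home_id": "example-v1",
  "version": "2026-04-15",
  "expected_deliverables": ["README.md"],
  "build_commands": {"test": "cargo test --all"},
  "ai_use_policy_match": "syntax-help-only",
  "dimensions": [
    {"id": "correctness", "label": "Correctness",
     "anchors": {"1": "a", "2": "b", "3": "c", "4": "d"}}
  ]
}
""")
problems = rubric_problems(draft)
print(problems)
```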

## Authoring a new dimension

To add a dimension to an existing rubric:

1. Pick observable behavior, not adjectives. "Has good error handling" is not a dimension; "error paths differentiate transient vs. permanent failure" is.
2. Write the 5 anchors as five distinct observable behaviors, each strictly more demanding than the last.
3. Test the dimension on a known submission. Can you score it from the anchors alone, without the original code in your head? If not, the anchors are too vague.
4. Bump the rubric version.

## Authoring a new rubric (for a net-new take-home)

1. Start from the take-home brief. What does the brief tell the candidate to deliver? Those are the `expected_deliverables`.
2. What is the brief asking the candidate to demonstrate? Those become the `dimensions`. Aim for 4-6 dimensions; more than 6 and the panelist can't hold them.
3. Write the 1-anchor first (the floor: what does a minimal-effort submission look like?), then the 5-anchor (the ceiling: what does the strongest submission look like?), then fill in 2-4 between them.
4. Write the brief and the rubric in parallel. Anchors that don't show up in the brief are surprise dimensions; anchors in the brief that don't show up in the rubric are unscoreable promises.
5. Run the rubric on a known submission (a prior hire's submission, anonymized). Does it score them where you'd expect?

# AI-use policy mapping

The take-home evaluator runs pattern checks calibrated against the disclosed AI-use policy. The same submission produces different signal interpretation under different policies; this file documents the mapping.

The intent is to surface signals for the panel debrief, not to make a determination. AI-use detection is well known to be unreliable as a forensic tool; the right framing is "discuss with the candidate against the disclosed policy."

## Policy values

### `none-allowed`

The take-home brief told the candidate: "Do not use AI tools (Claude, ChatGPT, Copilot, Cursor, etc.) at any point during this assessment."

Pattern checks:

- **Verbatim public matches** — surface as `signal: verbatim_public_match` if any chunk ≥3 lines matches a known public AI-generated boilerplate exactly.
- **Style consistency vs. complexity** — surface as `signal: uniform_style` if comment style and naming idioms are suspiciously consistent across files of varying complexity.
- **Generic comments without engagement** — surface as `signal: generic_comments` if comments explain what the code does without engaging with the problem-specific decisions, beyond a per-file threshold (default: >40% of comments are generic).
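
The "chunk ≥3 lines" check can be illustrated with a sliding window over non-blank lines. This is a sketch only: how the skill obtains a corpus of known boilerplate is not specified here, so `boilerplate_texts` is a hypothetical input, and ignoring blank lines and leading/trailing whitespace is an assumption about what "exactly" should tolerate.

```python
def _content_lines(text):
    """Return (line_number, stripped_line) pairs for non-blank lines, 1-indexed."""
    return [(i, ln.strip()) for i, ln in enumerate(text.splitlines(), start=1) if ln.strip()]

def verbatim_matches(submission_text, boilerplate_texts, min_lines=3):
    """Return (start_line, end_line) for each window of min_lines matching known boilerplate."""
    known = set()
    for text in boilerplate_texts:
        lines = [ln for _, ln in _content_lines(text)]
        for i in range(len(lines) - min_lines + 1):
            known.add(tuple(lines[i:i + min_lines]))
    sub = _content_lines(submission_text)
    hits = []
    for i in range(len(sub) - min_lines + 1):
        window = sub[i:i + min_lines]
        if tuple(ln for _, ln in window) in known:
            hits.append((window[0][0], window[-1][0]))
    return hits

boilerplate = "def add(a, b):\n    # Add two numbers\n    return a + b\n"
submission = (
    "import os\n\ndef add(a, b):\n    # Add two numbers\n    return a + b\n\n"
    "print(add(1, 2))\n"
)
hits = verbatim_matches(submission, [boilerplate])
print(hits)  # [(3, 5)]
```

Each hit would surface as a `signal: verbatim_public_match` note with its line range, for the panel to discuss.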

Panel debrief framing: "Signals suggest possible AI use. Discuss with the candidate. The skill does not determine; the panel does, against the disclosed policy."

### `syntax-help-only`

The take-home brief told the candidate: "AI tools are allowed for syntax help (looking up the right method name, checking a regex, formatting). They are NOT allowed for solution generation (asking 'how would I implement this?', 'write me the function for X')."

Pattern checks: same as `none-allowed`, but with different framing. AI-generated boilerplate at the function level is a signal worth discussing; AI-completed identifier names are not.

Panel debrief framing: "AI use was permitted within bounds. Signals indicate where the candidate may have crossed the bounds. Discuss with the candidate."

### `ai-tools-allowed`

The take-home brief told the candidate: "AI tools are allowed throughout. Tell us what tools you used and how, in the README."

Pattern checks: still run, but signals are surfaced as informational only.

Panel debrief framing: "AI use was permitted. The signals are informational. The panel evaluates the SUBMISSION against the rubric; the question is what the candidate built and why, not which tools they used."

In the `ai-tools-allowed` policy, the panel should be ready to evaluate "the candidate's prompting and tool-use judgment" as a positive dimension if the rubric calls it out — many roles in 2026 explicitly want to see how candidates work with AI tools.

## What the skill does NOT do

- **Run a third-party AI-detection model** (GPTZero, Originality, etc.). Those tools have well-documented false-positive rates that climb to 30-50% on technical writing. Their findings do not survive a panel debrief.
- **Output a confidence score for "this was AI-generated."** No such confidence can be honestly assigned; the patterns are signals, not proof.
- **Block the report on signals.** Signals appear in the report. The panel decides. If `none-allowed` and the signals are strong, the panel typically schedules a follow-up conversation rather than auto-rejecting.

## Calibration

The skill's pattern thresholds (e.g. "40% generic comments") are tunable per take-home format. If your team's take-home format produces a lot of boilerplate naturally (e.g. it asks for a CRUD API), the threshold should be raised; if your format produces little boilerplate (e.g. it asks for a custom algorithm), the threshold should be lowered.

The defaults are calibrated against general-engineering take-homes. Tune in `config.json` per take-home format. Document the tuning in the rubric file's notes section.
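
The effect of tuning the threshold can be seen in a small sketch. Deciding whether a single comment is generic is a judgment the model makes; the code below only applies the ratio once those judgments exist. The function name and the boolean-list input are hypothetical, and no particular `config.json` layout is implied.

```python
def generic_comment_signal(generic_flags, threshold=0.40):
    """Return (ratio, fired) for one file.

    generic_flags: one boolean per comment, True when the comment was judged generic.
    The signal fires only when the ratio is strictly above the threshold.
    """
    if not generic_flags:
        return 0.0, False
    ratio = sum(generic_flags) / len(generic_flags)
    return ratio, ratio > threshold

flags = [True] * 5 + [False] * 5  # half of the comments were judged generic
default_result = generic_comment_signal(flags)
crud_result = generic_comment_signal(flags, threshold=0.60)  # raised for a CRUD-style format
print(default_result, crud_result)
```

The same file fires the signal under the default and stays quiet under the raised threshold, which is the behavior the calibration guidance above is after.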

## Why surfaced as notes, not as a verdict

1. **False-positive cost is asymmetric.** Auto-rejecting a candidate based on an AI-use signal that turns out to be wrong damages the firm's brand and risks a discrimination claim if the signal correlates with disparate impact. Surfacing for discussion costs nothing.
2. **The disclosed policy is the contract.** What matters is whether the candidate followed the policy they were told about. The signal helps the panel ask; it does not answer.
3. **AI-use detection is unreliable.** Even the best-known detectors have unacceptable error rates. The skill does not pretend otherwise.
4. **Hire decisions involve more than this one signal.** A candidate with a strong submission and a possible AI-use signal under `syntax-help-only` is a candidate to talk to, not a candidate to drop.