A Claude Skill that scores a candidate’s take-home submission against a rubric the hiring team wrote, with line-by-line citations from the submitted code or documents, and produces a structured evaluation report — never auto-passes or auto-fails. The hiring panel uses the report to anchor the live debrief; the actual hire/no-hire decision happens in the panel discussion, not in the report. Replaces the 60-90 minutes per panelist of disorganized “I read this on Saturday morning and I think it was OK?” with a structured 15-minute review per panelist plus a 30-minute calibrated debrief.
When to use
The role uses a take-home assessment as a step in the loop (structured interviewing prerequisite — without a written rubric the skill has nothing to score against).
You want consistent scoring across panelists. Take-home reviews are notoriously inconsistent because each panelist reads at a different time with a different attention level; the rubric-anchored report is the leveling artifact.
The take-home is a coding exercise, system-design write-up, written exercise (PRD draft, sales-call mock-write-up), or an integration-build that produces inspectable artifacts.
When NOT to use
Auto-pass / auto-fail in the loop. The skill produces a scored report. The hire decision happens in the panel debrief. Wiring the report’s aggregate score to a stage transition triggers the same NYC LL 144 / EU AI Act exposure as auto-rejection in screening.
Live coding interviews. Different workflow (live observation of process, not artifact evaluation). The interview-debrief workflow handles that case.
Take-homes longer than 4 hours of candidate work. Long take-homes are themselves a candidate experience anti-pattern; the skill won’t fix that.
Submissions where the candidate didn’t sign the AI-use disclosure. The rubric scoring is calibrated against a specific use-of-AI policy (e.g. “AI tools allowed for syntax help, not for solution generation”); without the disclosure, the skill can’t calibrate the “AI-only signal” detection.
Plagiarism detection as a primary use. The skill flags suspicious patterns (verbatim public-repo matches, generic AI-generated boilerplate) but is not a forensic plagiarism tool. Use a dedicated tool for that if you need defensible plagiarism findings.
Setup
Drop the bundle. Place apps/web/public/artifacts/take-home-evaluator-claude-skill/SKILL.md into your Claude Code skills directory.
Author the rubric. Per take-home, write a JSON rubric with the dimensions you actually score on (correctness, code quality, decision-making documented in comments / README, error handling, test coverage). Anchors per dimension at 1-5. The template lives in references/1-take-home-rubric-template.md.
Configure AI-use policy. The skill’s prompt explicitly tells Claude what AI use was permitted (“syntax help only,” “AI tools allowed throughout,” “no AI tools,” etc.). The setting maps to the disclosure language in the take-home brief — they must match.
Set the panelist-distribution mode. Either single-panelist mode (one report per submission) or per-panelist mode (each panelist gets the same submission, generates their own evaluation, and the skill aggregates the cross-panelist deltas). Per-panelist mode catches scoring drift but multiplies the model cost by the number of panelists.
Dry-run on a closed take-home. Score a take-home from a candidate hired (or not) last quarter. Compare the skill’s per-dimension scores to the panel’s actual scores. Tune the rubric anchors if the skill weighs dimensions differently.
What the skill actually does
Six steps. The order matters: deterministic checks (compile, run, file structure) happen before the LLM scores anything, because letting the model score a non-running submission produces a confident report on a broken artifact.
Validate the submission shape. Check that all the deliverables named in the take-home brief exist (e.g. README.md, source files, test files). Missing deliverables → flag in the report; do NOT score those dimensions.
Run deterministic checks. Compile the code. Run the test suite the candidate wrote. Capture the output. These are the auditable, reproducible outcomes — the LLM does not re-litigate them.
Score per rubric dimension. For each dimension in the rubric, score 1-5 with verbatim citations from the candidate’s submission (file path + line range + the code or text). Citations are required; without a citation, the score defaults to the rubric’s 1 anchor. The citation requirement keeps the model grounded in the actual submission rather than generic feedback.
Detect AI-use signal against the policy. Run pattern checks against the disclosed AI-use policy. Verbatim matches with public AI-generated boilerplate, suspiciously consistent style across files of varying complexity, or generic comments without engagement with the problem-specific decisions all surface as ai-use-signal notes — not as a violation, just as a signal for the panel to discuss against the disclosed policy.
Compute aggregate WITHOUT a hire/no-hire recommendation. Sum the per-dimension scores. Surface the aggregate as a number. Do NOT translate the aggregate into a recommendation. The skill explicitly returns “report; not a decision” rather than “pass / fail.”
Emit per-panelist or aggregated report. In single-panelist mode, the report goes to the calling panelist. In per-panelist mode, the skill aggregates across panelists, surfaces per-dimension cross-panelist deltas (and which panelist saw what differently), and emits a debrief-ready report.
Cost reality
Per take-home submission, on Claude Sonnet 4.6:
LLM tokens — typically 15-30k input (rubric + submission code/text + skill instructions) and 3-5k output (per-dimension scored report). Roughly $0.15-0.25 per submission in single-panelist mode. Per-panelist mode (3-4 panelists) multiplies linearly.
CI / sandbox cost — running the candidate’s test suite costs whatever your CI normally costs; usually negligible. Sandboxed execution (recommended — never run candidate code on the panel laptop) costs whatever your sandboxed-runner provider charges.
Panelist time — the win. A panelist’s first-pass review of a take-home is 60-90 minutes when done well, less when done poorly. Reviewing the skill’s report and noting agree/disagree per dimension is 15-25 minutes. Aggregate panel time saved per take-home: 2-3 panelist hours.
Setup time — 30-90 minutes once for the rubric and AI-use-policy mapping per take-home format (the rubric is the bulk of it; see the template's authoring notes). Reuse across roles in the same family is high.
Success metric
Track three things per take-home cycle:
Cross-panelist score variance — variance across panelists’ per-dimension scores. The skill should compress variance (panelists anchored on the same rubric and the same citations) without forcing artificial agreement. Variance below ~0.5 (on a 5-point scale) suggests panelists are rubber-stamping the skill’s report; above ~1.5 suggests the rubric anchors are too vague for the take-home to discriminate.
Hire-vs-no-hire correlation with skill aggregate — over a quarter, does the panel’s hire decision correlate with the skill’s aggregate? Should be positive but NOT 1.0; if it’s 1.0, the panel is auto-deferring (which is the failure mode the skill is designed against), and if it’s 0, the rubric or the skill is misaligned with what the panel actually values.
Take-home debrief duration — wall-clock from “all panelists submitted reviews” to “decision recorded.” Should drop from 1-2 days to under 4 hours, because the report is a shared anchor.
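A minimal sketch of the first metric, cross-panelist score variance, assuming per-panelist scores are collected as a mapping of panelist id to per-dimension scores; the panelist names and values below are illustrative, not real candidate data.

```python
# Minimal sketch: per-dimension variance across panelists.
# Assumes scores arrive as {panelist_id: {dimension_id: score}}.
from statistics import pvariance

def cross_panelist_variance(scores: dict[str, dict[str, int]]) -> dict[str, float]:
    dimensions = {d for per_dim in scores.values() for d in per_dim}
    return {
        dim: pvariance([per_dim[dim] for per_dim in scores.values() if dim in per_dim])
        for dim in dimensions
    }

panel = {
    "panelist_a": {"correctness": 4, "error_handling": 2},
    "panelist_b": {"correctness": 5, "error_handling": 2},
    "panelist_c": {"correctness": 4, "error_handling": 3},
}
print(cross_panelist_variance(panel))
# Readings below ~0.5 everywhere suggest rubber-stamping; above ~1.5, vague anchors.
```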
vs alternatives
vs CodeSignal Coding Reports / HackerRank automated grading. Those products run candidate code against the platform's test cases and emit a score. Pick them if your take-home maps well-defined inputs to well-defined outputs (LeetCode-style). Pick the skill if the take-home is a build (write a small system, design an API, write a PRD), where the score is a rubric judgment rather than a test-case pass rate. The two are complementary; CodeSignal can be the input to the skill's run-tests step.
vs hand-graded take-homes. Hand-grading is right for the highest-stakes hires (founding engineer, principal IC) where the panel’s narrative judgment is the deliverable. The skill earns its setup cost on the 80% of take-homes where consistent rubric application is what’s missing.
vs ChatGPT-style “review this code.” Generic chat returns generic feedback. The skill is structurally different: it requires verbatim citations, runs deterministic checks first, and refuses to author a hire/no-hire recommendation.
vs no take-home (live-only loops). A reasonable choice for senior roles where references and live rounds carry the load. The skill is irrelevant if the loop has no take-home.
Watch-outs
Auto-pass / auto-fail drift. Guard: the skill's output ends with the per-dimension scores and the aggregate. There is no "pass" or "fail" string. The schema explicitly omits a recommendation field.
Generic feedback hallucination. Guard: every dimension score requires a verbatim citation (file path + line range + content). Scores without citations default to 1.
Bias inheritance from the rubric. Guard: the rubric is upstream of this skill. Run the rubric through the diversity slate auditor framing — does the rubric score on dimensions that have known disparate impact (e.g. "uses obscure idioms," which often correlates with bootcamp vs. CS-program background)?
AI-use detection false positive. Guard: AI-use signals are surfaced as notes, not violations. The panel reviews against the disclosed policy. Auto-flagging as a violation would be the wrong reading; legitimate use of AI tools (within the policy) is increasingly the norm.
Sandboxing failure on candidate code. Guard: the skill explicitly recommends sandboxed execution and warns if the calling environment runs the test suite directly on the panel machine. Never run unreviewed candidate code on a machine with access to firm secrets.
Submission-size blowup. Guard: if the submission exceeds ~50K LOC, the skill warns that scoring will be partial and prompts the panelist to identify the parts to focus on. Take-homes that produce 50K LOC are themselves a sign the brief was wrong.
Stack
The skill bundle lives at apps/web/public/artifacts/take-home-evaluator-claude-skill/ and contains:
SKILL.md — the skill definition (when to invoke, inputs, method, output format)
references/1-take-home-rubric-template.md — the rubric shape the skill expects
references/2-ai-use-policy-mapping.md — how the disclosed policy maps to the skill's pattern checks
Tools the workflow assumes you use: Claude (the model). Optional: CodeSignal or HackerRank for the deterministic-check leg; Ashby for the candidate record. Sandboxed execution is the recruiter / hiring-manager’s choice (Docker containers, GitHub Actions, etc.).
---
name: take-home-evaluator
description: Score a take-home submission against a rubric, with verbatim citations from the candidate's code or text, plus deterministic checks (compile, run tests). Output is a structured report with per-dimension scores and an aggregate — never a hire/no-hire recommendation. Detects AI-use signals against the disclosed policy as notes for the panel debrief.
---
# Take-home assessment evaluator
## When to invoke
Use this skill when a panelist has a candidate's take-home submission and wants a rubric-anchored evaluation report to bring to the panel debrief. Take the submission directory plus the role rubric as input and return a structured Markdown report.
Do NOT invoke this skill for:
- **Auto-pass / auto-fail in the loop.** This skill produces a scored report. The hire decision happens in the panel debrief, not in the report.
- **Live coding interviews.** Different workflow.
- **Submissions where the candidate did not sign the AI-use disclosure.** The skill calibrates against a specific use-of-AI policy; without the disclosure, there is nothing to calibrate against.
- **Plagiarism forensics.** The skill flags suspicious patterns but is not a defensible plagiarism tool.
## Inputs
- Required: `submission_dir` — path to the candidate's submission directory.
- Required: `rubric` — path to the take-home rubric file. See `references/1-take-home-rubric-template.md` for the shape.
- Required: `ai_use_policy` — string identifying the disclosed policy. One of `none-allowed`, `syntax-help-only`, `ai-tools-allowed`. The skill's pattern checks are calibrated against the policy, not against an absolute "AI-generated" detector.
- Optional: `panelist_id` — if running per-panelist mode, identify the panelist for cross-panelist aggregation.
- Optional: `sandboxed` — boolean. If `false`, the skill warns about running candidate code on the local machine and asks for confirmation before proceeding to step 2.
## Reference files
- `references/1-take-home-rubric-template.md` — the rubric shape the skill expects.
- `references/2-ai-use-policy-mapping.md` — how each `ai_use_policy` value maps to the pattern checks in step 4.
## Method
Six steps.
### 1. Validate the submission shape
Walk `submission_dir`. Compare against the deliverables named in the take-home brief (the rubric file's `expected_deliverables` field). For each missing deliverable, add to a `missing_deliverables` array on the report. Do NOT score dimensions that depend on missing files; their score is `not assessed` rather than `1`.
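A minimal sketch of this step, assuming the rubric JSON shape in `references/1-take-home-rubric-template.md` (an `expected_deliverables` list of globs); the function name is illustrative.

```python
# Sketch of step 1: walk the submission and report missing deliverables.
import json
from pathlib import Path

def validate_submission_shape(submission_dir: str, rubric_path: str) -> dict:
    rubric = json.loads(Path(rubric_path).read_text())
    root = Path(submission_dir)
    missing = [
        pattern
        for pattern in rubric["expected_deliverables"]
        if not any(root.glob(pattern))  # Path.glob handles ** patterns
    ]
    # Dimensions that depend on a missing deliverable are marked "not assessed",
    # not scored 1; that decision happens when the scores are assembled.
    return {"missing_deliverables": missing}
```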
### 2. Run deterministic checks
If the submission has a build/test command in `package.json`, `Makefile`, `pyproject.toml`, or `Cargo.toml`:
- Compile / install dependencies in a sandboxed environment. If the calling environment is not sandboxed (per `sandboxed: false`), warn and stop until the panelist confirms.
- Run the candidate's test suite. Capture pass/fail counts.
- Run linters / formatters in check mode (do NOT modify the candidate's code). Capture findings.
Record the deterministic results in a `deterministic_checks` block on the report. These are auditable — the LLM does not re-litigate them in step 3.
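A minimal sketch of the deterministic leg, assuming Docker as the sandbox and a rubric `build_commands` block like the template's; the image name, resource caps, and timeout are illustrative choices, not part of the skill.

```python
# Sketch of step 2: run the rubric's build/test/lint commands inside a
# throwaway container. Image, limits, and timeout are assumptions.
import json
import subprocess
from pathlib import Path

def run_deterministic_checks(submission_dir: str, rubric_path: str) -> dict:
    rubric = json.loads(Path(rubric_path).read_text())
    results = {}
    for name, cmd in rubric.get("build_commands", {}).items():
        docker_cmd = [
            "docker", "run", "--rm",
            "--network", "none",              # candidate code gets no network
            "--memory", "2g", "--cpus", "2",  # illustrative resource caps
            "-v", f"{Path(submission_dir).resolve()}:/work",
            "-w", "/work",
            "rust:1.80",                      # illustrative image for a Cargo project
            "sh", "-c", cmd,
        ]
        try:
            proc = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=600)
            results[name] = {
                "exit_code": proc.returncode,
                "stdout_tail": proc.stdout[-2000:],
                "stderr_tail": proc.stderr[-2000:],
            }
        except subprocess.TimeoutExpired:
            results[name] = {"exit_code": None, "timed_out": True}
    return results
```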
### 3. Score per rubric dimension
For each dimension in the rubric:
- Read the rubric anchors (1-5).
- Find evidence in the submission. Evidence is a verbatim string from the code or text, with file path + line range.
- If you cannot find verbatim evidence for a score above 1, the score is 1 — no inference, no generic feedback.
- Tag each cited line with which dimension it supports.
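A minimal sketch of the score record this step produces, assuming the citation-required rule above; the field names are illustrative, not a published schema.

```python
# Sketch of the per-dimension record: no verbatim citation, no score above 1.
from dataclasses import dataclass, field

@dataclass
class Citation:
    path: str        # e.g. "src/router.rs"
    line_range: str  # e.g. "42-57"
    excerpt: str     # verbatim text copied from the submission

@dataclass
class DimensionScore:
    dimension_id: str
    score: int = 1                                   # the rubric's 1 anchor is the default
    citations: list[Citation] = field(default_factory=list)

    def finalize(self) -> "DimensionScore":
        if not self.citations:
            self.score = 1   # no inference, no generic feedback
        return self
```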
### 4. Detect AI-use signal against the policy
Run pattern checks per `references/2-ai-use-policy-mapping.md`:
- **Verbatim public matches** — does any chunk of the submission match known public AI-generated boilerplate verbatim? Surface as `signal: verbatim_public_match`.
- **Style consistency vs. complexity** — is the style suspiciously consistent across files of varying complexity? Real candidates' style varies with the difficulty of the section. Surface as `signal: uniform_style`.
- **Generic comments without engagement** — comments that explain what the code does without engaging with the problem-specific decisions are a common AI-generated tell. Surface as `signal: generic_comments`.
Surface signals as NOTES, never as VIOLATIONS. The panel reviews against the disclosed policy. If `ai_use_policy: ai-tools-allowed`, the signals are informational only. If `ai_use_policy: none-allowed`, the panel discusses whether the signals warrant follow-up — the skill does not decide.
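A minimal sketch of how the signals might be packaged for the report, assuming the three signal ids above; the framing strings are illustrative, and nothing here ever escalates a signal to a violation.

```python
# Sketch of step 4 output: every signal is a note with policy-specific
# debrief framing. The framing text is an assumption, not skill copy.
FRAMING = {
    "none-allowed": "Discuss with the candidate against the no-AI policy; the panel decides.",
    "syntax-help-only": "Discuss whether the signal falls outside syntax help.",
    "ai-tools-allowed": "Informational only; AI use was permitted.",
}

def frame_signals(signals: list[dict], ai_use_policy: str) -> list[dict]:
    framing = FRAMING[ai_use_policy]
    return [{**s, "kind": "note", "debrief_framing": framing} for s in signals]
```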
### 5. Compute aggregate
Sum the per-dimension scores. Surface the aggregate as a number AND the per-dimension breakdown.
Do NOT translate the aggregate into a recommendation. The skill's schema explicitly omits a `recommendation` field. The report ends after the aggregate and the AI-use signals section.
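A minimal sketch of the aggregate, assuming per-dimension scores arrive as a mapping with `None` for dimensions marked not assessed; note the deliberate absence of any recommendation key.

```python
# Sketch of step 5: sum the assessed dimensions; never emit a verdict.
def compute_aggregate(per_dimension: dict[str, int | None]) -> dict:
    assessed = {k: v for k, v in per_dimension.items() if v is not None}
    return {
        "per_dimension": per_dimension,
        "aggregate": sum(assessed.values()),
        "max_possible": 5 * len(assessed),
        # deliberately no "recommendation" field
    }

print(compute_aggregate({
    "correctness": 4, "code_quality": 3, "decision_documentation": 4,
    "error_handling": 2, "test_coverage": 4,
}))
# aggregate 17, max_possible 25 — matches the example report below
```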
### 6. Emit report
Write the report to `report.md` in the submission directory or to stdout (depending on the calling environment). In per-panelist mode, write to `report-<panelist_id>.md`.
## Output format
```markdown
# Take-home evaluation — {Candidate name} — {Role}
Submission: `{submission_dir}` · Rubric: `{rubric_path}` (SHA `{short}`)
Generated: {ISO timestamp} · Skill v1.0 · Model: claude-sonnet-4-6
Panelist: {panelist_id or "single-panelist mode"}
AI-use policy: {ai_use_policy}
## Deterministic checks
- **Build:** {passed | failed | not applicable}
- **Tests:** {N/M passed} ({test command})
- **Linter:** {N findings}
- **Missing deliverables:** {list or "none"}
## Per-dimension scores
### Correctness — 4/5
> Evidence: `src/router.rs:42-57` — handles the request-routing edge case for retries with exponential backoff and jitter, including the cap at max 60s. Anchor 4 ("handles the named edge cases with explicit code paths") matches.
Counter-evidence (would have been 5): retry budget is hardcoded at 5 attempts; the rubric's anchor 5 names "configurable retry budget."
### Code quality — 3/5
> Evidence: `src/router.rs` — file is 800 lines with no module split. Anchor 3 ("readable but lacks structural decomposition") matches.
### Decision-making documented — 4/5
> Evidence: `README.md:25-40` — explains the choice of exponential-vs-fixed backoff with a reference to the failure mode it mitigates. Anchor 4 matches.
### Error handling — 2/5
> Evidence: `src/router.rs:120` — catches and re-raises the network error without distinguishing between transient and permanent failures. Anchor 2 ("error paths exist but do not differentiate") matches.
### Test coverage — 4/5
> Evidence: `tests/router_test.rs` — covers the happy path, three retry scenarios, and the timeout. Missing: the network-partition test the rubric anchor 5 names.
## Aggregate
17/25.
This is the per-dimension sum. The skill does NOT translate this into a hire/no-hire recommendation. The panel debrief is where the decision happens.
## AI-use signal notes
Disclosed policy: **syntax-help-only**.
- ⚠️ `signal: uniform_style` — `src/cache.rs` and `src/router.rs` use the same comment style and naming idioms despite the different complexity. May warrant a follow-up question in the panel debrief.
- ✓ No `verbatim_public_match` signals.
- ✓ No `generic_comments` signals beyond the documented threshold.
The panel discusses these against the disclosed policy. The skill does not decide.
```
## Watch-outs
- **Auto-pass / auto-fail drift.** *Guard:* the report ends after the aggregate. No recommendation field. The aggregate is a sum, not a verdict.
- **Generic-feedback hallucination.** *Guard:* every dimension score requires verbatim citation (file path + line range + content).
- **AI-use false positive.** *Guard:* signals are notes, not violations. The panel decides against the disclosed policy.
- **Unsandboxed candidate code.** *Guard:* skill warns before running in non-sandboxed environments.
- **Bias inheritance.** *Guard:* the rubric is upstream of the skill. Audit the rubric separately if the dimensions correlate with disparate impact.
# Take-home rubric template
The take-home evaluator scores a submission against this rubric shape. Copy the JSON below to your role's rubric file (one per take-home format) and fill in every field. The skill reads the rubric; without it, scoring has nothing to anchor against.
A complete rubric takes 30-90 minutes to author per take-home format. Reuse across roles in the same family is high — a senior-backend take-home rubric is largely the same across companies once you've written it once.
## JSON shape
```json
{
"take_home_id": "senior-backend-router-rewrite-v3",
"version": "2026-04-15",
"expected_deliverables": [
"README.md",
"src/**/*.rs",
"tests/**/*.rs",
"Cargo.toml"
],
"build_commands": {
"build": "cargo build --release",
"test": "cargo test --all",
"lint": "cargo clippy -- -D warnings"
},
"ai_use_policy_match": "syntax-help-only",
"dimensions": [
{
"id": "correctness",
"label": "Correctness",
"anchors": {
"1": "Compiles but does not pass the candidate's own tests, or does not handle the named happy path.",
"2": "Handles the happy path; ignores the named edge cases (retries, partial failure).",
"3": "Handles the happy path and the obvious edge cases; misses the subtle ones (clock skew, partition recovery).",
"4": "Handles the named edge cases with explicit code paths; minor gaps acceptable.",
"5": "Handles all named edge cases AND demonstrates a configurable retry budget / timeout structure that the rubric explicitly calls for."
}
},
{
"id": "code_quality",
"label": "Code quality and structural decomposition",
"anchors": {
"1": "Single file, no decomposition; difficult to read.",
"2": "Decomposed but the decomposition does not follow domain boundaries.",
"3": "Readable but lacks structural decomposition that would scale past prototype.",
"4": "Clear module boundaries that follow the domain; idiomatic for the language.",
"5": "All of 4, plus the structural choices are documented in the README with the alternatives considered."
}
},
{
"id": "decision_documentation",
"label": "Decision-making documented",
"anchors": {
"1": "No README, or the README only repeats the take-home brief.",
"2": "README describes what was built without naming the engineering choices.",
"3": "README names some choices without naming the alternatives.",
"4": "README names the choices AND explains why each was picked over the named alternatives.",
"5": "All of 4, plus the README cites the failure modes each choice mitigates."
}
},
{
"id": "error_handling",
"label": "Error handling",
"anchors": {
"1": "Errors are caught and silently swallowed.",
"2": "Error paths exist but do not differentiate between transient and permanent failures.",
"3": "Differentiates transient vs. permanent; lacks structured error types.",
"4": "Structured error types; retry policy is explicit per error class.",
"5": "All of 4, plus error paths have explicit observability (logging / metrics / traces) named in the code."
}
},
{
"id": "test_coverage",
"label": "Test coverage",
"anchors": {
"1": "No tests, or tests do not run.",
"2": "Tests cover the happy path only.",
"3": "Tests cover the happy path and one or two edge cases.",
"4": "Tests cover the happy path and multiple edge cases (timeout, retry, partial failure).",
"5": "All of 4, plus the network-partition test the rubric explicitly calls for."
}
}
],
"rubric_fairness_check": {
"no_bootcamp_vs_cs_proxies": "Anchors must score on observable behavior in the submission, not on idioms that proxy for educational background. 'Uses obscure language idioms' is forbidden as a positive signal.",
"no_native-english-only_proxies": "Anchors must NOT score on README writing fluency beyond the level required to communicate the engineering decisions.",
"documented_in_brief": "The take-home brief shared with the candidate must describe the rubric dimensions and approximate weighting. Surprise dimensions are unfair."
}
}
```
## Per-field notes
- `take_home_id` — stable identifier for the take-home format. Reused across candidates for the same role family.
- `version` — semver or date. Bumped when the rubric is edited; the skill captures the version in the report so re-scoring against an edited rubric is visible.
- `expected_deliverables` — globs the skill walks against the submission. Missing deliverables surface in the report.
- `build_commands` — the skill runs these in step 2 (deterministic checks). Sandboxed execution required.
- `ai_use_policy_match` — should match the disclosure language in the take-home brief. Mismatch means the candidate's policy understanding doesn't match what the skill calibrates against.
- `dimensions` — array. Each dimension has an `id`, a `label`, and 5 anchor strings. Anchors should be observable behavior, not adjectives.
- `rubric_fairness_check` — three named fairness checks the skill confirms before scoring. If the rubric anchors violate any of these, the skill emits a warning and asks the rubric author to revise. (The skill does not refuse to score on a fairness-check violation, because the rubric is upstream and revising it is the right intervention. But it surfaces the issue.)
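A minimal sketch of a pre-scoring rubric check, assuming the JSON shape above; consistent with the note on `rubric_fairness_check`, it warns rather than refuses to score.

```python
# Sketch: sanity-check a rubric file before scoring. Warns, never blocks.
import json
from pathlib import Path

def check_rubric(rubric_path: str) -> list[str]:
    rubric = json.loads(Path(rubric_path).read_text())
    warnings = []
    for dim in rubric.get("dimensions", []):
        if set(dim.get("anchors", {})) != {"1", "2", "3", "4", "5"}:
            warnings.append(f"dimension '{dim.get('id')}' does not have all five anchors")
    if len(rubric.get("rubric_fairness_check", {})) < 3:
        warnings.append("rubric_fairness_check should name all three fairness checks")
    for key in ("take_home_id", "version", "expected_deliverables", "build_commands"):
        if key not in rubric:
            warnings.append(f"missing top-level field: {key}")
    return warnings
```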
## Authoring a new dimension
To add a dimension to an existing rubric:
1. Pick observable behavior, not adjectives. "Has good error handling" is not a dimension; "error paths differentiate transient vs. permanent failure" is.
2. Write the 5 anchors as five distinct observable behaviors, each strictly more demanding than the last.
3. Test the dimension on a known submission. Can you score it from the anchors alone, without the original code in your head? If not, the anchors are too vague.
4. Bump the rubric version.
## Authoring a new rubric (for a net-new take-home)
1. Start from the take-home brief. What does the brief tell the candidate to deliver? Those are the `expected_deliverables`.
2. What is the brief asking the candidate to demonstrate? Those become the `dimensions`. Aim for 4-6 dimensions; more than 6 and the panelist can't hold them.
3. Write the 1-anchor first (the floor: what does a minimal-effort submission look like?), then the 5-anchor (the ceiling: what does the strongest submission look like?), then fill in anchors 2-4 between them.
4. Write the brief and the rubric in parallel. Anchors that don't show up in the brief are surprise dimensions; anchors in the brief that don't show up in the rubric are unscoreable promises.
5. Run the rubric on a known submission (a prior hire's submission, anonymized). Does it score them where you'd expect?
# AI-use policy mapping
The take-home evaluator runs pattern checks calibrated against the disclosed AI-use policy. The same submission produces different signal interpretation under different policies; this file documents the mapping.
The intent: surface signals to the panel debrief, not to make a determination. AI-use detection is well-known to be unreliable as a forensic tool; the right framing is "discuss with the candidate against the disclosed policy."
## Policy values
### `none-allowed`
The take-home brief told the candidate: "Do not use AI tools (Claude, ChatGPT, Copilot, Cursor, etc.) at any point during this assessment."
Pattern checks:
- **Verbatim public matches** — surface as `signal: verbatim_public_match` if any chunk of ≥3 lines matches known public AI-generated boilerplate verbatim.
- **Style consistency vs. complexity** — surface as `signal: uniform_style` if comment style and naming idioms are suspiciously consistent across files of varying complexity.
- **Generic comments without engagement** — surface as `signal: generic_comments` if comments explain what the code does without engaging with the problem-specific decisions, beyond a per-file threshold (default: >40% of comments are generic).
Panel debrief framing: "Signals suggest possible AI use. Discuss with the candidate. The skill does not determine; the panel does, against the disclosed policy."
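A minimal sketch of the generic-comments check, assuming the default per-file threshold above; the classifier is left abstract because in the skill that judgment comes from the model, not from a regex.

```python
# Sketch: flag files where more than `threshold` of comments are generic.
# `is_generic` is a placeholder callable; the skill uses model judgment here.
from typing import Callable

def generic_comment_signals(comments_by_file: dict[str, list[str]],
                            is_generic: Callable[[str], bool],
                            threshold: float = 0.40) -> list[dict]:
    signals = []
    for path, comments in comments_by_file.items():
        if not comments:
            continue
        ratio = sum(1 for c in comments if is_generic(c)) / len(comments)
        if ratio > threshold:
            signals.append({"signal": "generic_comments", "file": path,
                            "ratio": round(ratio, 2)})
    return signals
```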
### `syntax-help-only`
The take-home brief told the candidate: "AI tools are allowed for syntax help (looking up the right method name, checking a regex, formatting). They are NOT allowed for solution generation (asking 'how would I implement this?', 'write me the function for X')."
Pattern checks: same as `none-allowed`, but with different framing. AI-generated boilerplate at the function level is a signal worth discussing; AI-completed identifier names are not.
Panel debrief framing: "AI use was permitted within bounds. Signals indicate where the candidate may have crossed the bounds. Discuss with the candidate."
### `ai-tools-allowed`
The take-home brief told the candidate: "AI tools are allowed throughout. Tell us what tools you used and how, in the README."
Pattern checks: still run, but signals are surfaced as informational only.
Panel debrief framing: "AI use was permitted. The signals are informational. The panel evaluates the SUBMISSION against the rubric; the question is what the candidate built and why, not which tools they used."
In the `ai-tools-allowed` policy, the panel should be ready to evaluate "the candidate's prompting and tool-use judgment" as a positive dimension if the rubric calls it out — many roles in 2026 explicitly want to see how candidates work with AI tools.
## What the skill does NOT do
- **Run a third-party AI-detection model** (GPTZero, Originality, etc.). Those tools have well-documented false-positive rates that climb to 30-50% on technical writing. Their findings do not survive a panel debrief.
- **Output a confidence score for "this was AI-generated."** No such confidence can be honestly assigned; the patterns are signals, not proof.
- **Block the report on signals.** Signals appear in the report. The panel decides. If `none-allowed` and the signals are strong, the panel typically schedules a follow-up conversation rather than auto-rejecting.
## Calibration
The skill's pattern thresholds (e.g. "40% generic comments") are tunable per take-home format. If your team's take-home format produces a lot of boilerplate naturally (e.g. it asks for a CRUD API), the threshold should be raised; if your format produces little boilerplate (e.g. it asks for a custom algorithm), the threshold should be lowered.
The defaults are calibrated against general-engineering take-homes. Tune in `config.json` per take-home format. Document the tuning in the rubric file's notes section.
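A minimal sketch of how per-format tuning might be loaded, assuming a `config.json` next to the take-home's rubric; the key names are illustrative, and only the 40% comment threshold and the three-line verbatim-match floor come from this document.

```python
# Sketch: defaults plus per-format overrides from config.json (assumed layout).
import json
from pathlib import Path

DEFAULTS = {
    "generic_comments_ratio": 0.40,   # documented default
    "verbatim_match_min_lines": 3,    # from the none-allowed mapping
}

def load_thresholds(take_home_dir: str) -> dict:
    config_path = Path(take_home_dir) / "config.json"
    overrides = json.loads(config_path.read_text()) if config_path.exists() else {}
    return {**DEFAULTS, **overrides}
```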
## Why surfaced as notes, not as a verdict
1. **False-positive cost is asymmetric.** Auto-rejecting a candidate based on an AI-use signal that turns out to be wrong damages the firm's brand and risks a discrimination claim if the signal correlates with disparate impact. Surfacing for discussion costs nothing.
2. **The disclosed policy is the contract.** What matters is whether the candidate followed the policy they were told about. The signal helps the panel ask; it does not answer.
3. **AI-use detection is unreliable.** Even the best-known detectors have unacceptable error rates. The skill does not pretend otherwise.
4. **Hire decisions involve more than this one signal.** A candidate with a strong submission and a possible AI-use signal under `syntax-help-only` is a candidate to talk to, not a candidate to drop.