claude-skill

Claudeによるテイクホーム課題評価

Difficulty

中級

Setup time

40min

For

recruiter · hiring-manager · technical-screener

Recruiting & TA

Stack

採用チームが作成した評価基準に対して候補者のテイクホーム提出物をスコアリングし、提出されたコードや文書からの行単位の引用を含む構造化された評価レポートを作成するClaude Skillです。自動合格や自動不合格にすることはありません。採用パネルはレポートをライブデブリーフの錨として使用し、実際の採用・不採用の決定はパネルのディスカッションで行われます。パネリスト1人あたり60〜90分の非組織的な作業を、パネリスト1人あたり15分の構造化レビューと30分のキャリブレーションされたデブリーフに置き換えます。

使うべき場面

ロールがループのステップとしてテイクホーム課題を使用している場合（構造化面接が前提条件です）。
パネリスト間で一貫したスコアリングが必要な場合。テイクホームのレビューは悪名高いほど不一致です。
テイクホームがコーディング演習、システム設計の書き起こし、書面演習、または検査可能なアーティファクトを生成する統合ビルドである場合。

使ってはいけない場面

ループでの自動合格/自動不合格。 スキルはスコアリングされたレポートを生成します。採用決定はパネルデブリーフで行われます。レポートの集計スコアをステージ移行に接続することは、NYC LL 144 / EU AI Actのエクスポージャーを引き起こします。
ライブコーディング面接。 異なるワークフロー（アーティファクト評価ではなくプロセスのライブ観察）。
4時間以上の候補者の作業のテイクホーム。 長いテイクホームはそれ自体候補者体験のアンチパターンです。
候補者がAI使用開示にサインしなかった提出物。 評価基準のスコアリングは特定のAI使用ポリシーに対してキャリブレーションされています。
主要な用途としての盗作検出。 スキルは不審なパターンをフラグしますが、法的な盗作ツールではありません。

セットアップ

バンドルをドロップする。 apps/web/public/artifacts/take-home-evaluator-claude-skill/SKILL.mdをClaude Codeのスキルディレクトリに配置します。
評価基準を作成する。 テイクホームごとに、実際にスコアリングする次元を含むJSON評価基準を作成します。テンプレートはreferences/1-take-home-rubric-template.mdにあります。
AI使用ポリシーを設定する。 設定はテイクホームブリーフの開示言語と一致しなければなりません。
パネリスト配布モードを設定する。 シングルパネリストモードまたはパネリストごとモード。
クローズしたテイクホームでドライランを実施する。 先四半期に採用（または不採用）にした候補者のテイクホームをスコアリングします。

スキルの実際の動作

6つのステップ。決定論的チェックはLLMがスコアリングする前に行われます。

提出物の形状を検証する。 テイクホームブリーフで指定されたすべての成果物が存在することを確認します。
決定論的チェックを実行する。 コードをコンパイルします。候補者が作成したテストスイートを実行します。
評価基準次元ごとにスコアリングする。 候補者の提出物からの逐語的引用を含む1〜5のスコアを付けます。引用なしの場合、スコアは評価基準の1アンカーにデフォルトします。
ポリシーに対するAI使用シグナルを検出する。 ai-use-signalノートとして表面化されます（違反としてではありません）。
採用/不採用推薦なしに集計を計算する。 スキルは「レポート。決定ではありません」を返します。
パネリストごとまたは集計レポートを出力する。

コスト試算

Claude Sonnet 4.6でのテイクホーム提出物1件あたり：

LLMトークン — 通常15〜30kの入力と3〜5kの出力。シングルパネリストモードで1提出物あたり約$0.15〜$0.25。
パネリストの時間 — パネリストの最初の読み取りは60〜90分です。スキルのレポートをレビューするのは15〜25分です。テイクホームごとの節約：2〜3パネリスト時間。
セットアップ時間 — テイクホーム形式ごとに40分。

成功指標

テイクホームサイクルごとに3つのことを追跡します：

パネリスト間のスコアバリアンス — ~0.5以下は過度なゴム印押し。~1.5以上は評価基準アンカーが曖昧すぎます。
採用対不採用とスキル集計の相関 — 正だが1.0ではない。1.0の場合、パネルは自動委任しています。
テイクホームデブリーフの時間 — 4時間未満に短縮されるべきです。

代替手段との比較

CodeSignalコーディングレポート / HackerRankの自動採点との比較。 明確に定義された入力から出力のテイクホームには適しています。評価基準がスコアであるビルドには本スキルを選択します。
手動採点のテイクホームとの比較。 手動採点は最高リスクの採用に適しています。一貫した評価基準の適用が欠けているテイクホームの80%にスキルが価値を発揮します。
ChatGPTスタイルの「このコードをレビューしてください」との比較。 汎用のチャットは汎用のフィードバックを返します。スキルは逐語的引用を必要とし、採用/不採用推薦を拒否します。

注意点

自動合格/自動不合格のドリフト。 対策： スキーマには推薦フィールドが明示的に欠落しています。
汎用フィードバックの幻覚。 対策： 引用なしの場合、スコアは1にデフォルトします。
評価基準からのバイアスの継承。 対策： 評価基準はこのスキルの上流にあります。既知の格差的影響を持つ次元でスコアリングしていないか確認します。
AI使用検出の偽陽性。 対策： AI使用シグナルはノートとして表面化されます（違反としてではありません）。
候補者コードのサンドボックス化の失敗。 対策： 企業の秘密にアクセスできるマシンで未レビューの候補者コードを実行しないでください。

スタック

スキルバンドルはapps/web/public/artifacts/take-home-evaluator-claude-skill/にあります：

SKILL.md — スキル定義
references/1-take-home-rubric-template.md — 記入可能な評価基準テンプレート
references/2-ai-use-policy-mapping.md — 開示されたポリシーのパターンチェックへのマッピング

使用するツール：Claude。オプション：CodeSignalまたはHackerRank、Ashby。

関連：構造化面接、行動面接、候補者体験、採用の質。

GitHubでこのページを編集

Files in this artifact

Download all (.zip)

---
name: take-home-evaluator
description: Score a take-home submission against a rubric, with verbatim citations from the candidate's code or text, plus deterministic checks (compile, run tests). Output is a structured report with per-dimension scores and an aggregate — never a hire/no-hire recommendation. Detects AI-use signals against the disclosed policy as notes for the panel debrief.
---

# Take-home assessment evaluator

## When to invoke

Use this skill when a panelist has a candidate's take-home submission and wants a rubric-anchored evaluation report to bring to the panel debrief. Take the submission directory plus the role rubric as input and return a structured Markdown report.

Do NOT invoke this skill for:

- **Auto-pass / auto-fail in the loop.** This skill produces a scored report. The hire decision happens in the panel debrief, not in the report.
- **Live coding interviews.** Different workflow.
- **Submissions where the candidate did not sign the AI-use disclosure.** The skill calibrates against a specific use-of-AI policy; without the disclosure, there is nothing to calibrate against.
- **Plagiarism forensics.** The skill flags suspicious patterns but is not a defensible plagiarism tool.

## Inputs

- Required: `submission_dir` — path to the candidate's submission directory.
- Required: `rubric` — path to the take-home rubric file. See `references/1-take-home-rubric-template.md` for the shape.
- Required: `ai_use_policy` — string identifying the disclosed policy. One of `none-allowed`, `syntax-help-only`, `ai-tools-allowed`. The skill's pattern checks are calibrated against the policy, not against an absolute "AI-generated" detector.
- Optional: `panelist_id` — if running per-panelist mode, identify the panelist for cross-panelist aggregation.
- Optional: `sandboxed` — boolean. If `false`, the skill warns about running candidate code on the local machine and asks for confirmation before proceeding to step 2.

## Reference files

- `references/1-take-home-rubric-template.md` — the rubric shape the skill expects.
- `references/2-ai-use-policy-mapping.md` — how each `ai_use_policy` value maps to the pattern checks in step 4.

## Method

Six steps.

### 1. Validate the submission shape

Walk `submission_dir`. Compare against the deliverables named in the take-home brief (the rubric file's `expected_deliverables` field). For each missing deliverable, add to a `missing_deliverables` array on the report. Do NOT score dimensions that depend on missing files; their score is `not assessed` rather than `1`.

### 2. Run deterministic checks

If the submission has a build/test command in `package.json`, `Makefile`, `pyproject.toml`, or `Cargo.toml`:

- Compile / install dependencies in a sandboxed environment. If the calling environment is not sandboxed (per `sandboxed: false`), warn and stop until the panelist confirms.
- Run the candidate's test suite. Capture pass/fail counts.
- Run linters / formatters in check mode (do NOT modify the candidate's code). Capture findings.

Record the deterministic results in a `deterministic_checks` block on the report. These are auditable — the LLM does not re-litigate them in step 3.

### 3. Score per rubric dimension

For each dimension in the rubric:

- Read the rubric anchors (1-5).
- Find evidence in the submission. Evidence is a verbatim string from the code or text, with file path + line range.
- If you cannot find verbatim evidence for a score above 1, the score is 1 — no inference, no generic feedback.
- Tag each cited line with which dimension it supports.

### 4. Detect AI-use signal against the policy

Run pattern checks per `references/2-ai-use-policy-mapping.md`:

- **Verbatim public matches** — does any chunk of the submission match a public AI-generated boilerplate exactly? Surface as `signal: verbatim_public_match`.
- **Style consistency vs. complexity** — is the style suspiciously consistent across files of varying complexity? Real candidates' style varies with the difficulty of the section. Surface as `signal: uniform_style`.
- **Generic comments without engagement** — comments that explain what the code does without engaging with the problem-specific decisions are a common AI-generated tell. Surface as `signal: generic_comments`.

Surface signals as NOTES, never as VIOLATIONS. The panel reviews against the disclosed policy. If `ai_use_policy: ai-tools-allowed`, the signals are informational only. If `ai_use_policy: none-allowed`, the panel discusses whether the signals warrant follow-up — the skill does not decide.

### 5. Compute aggregate

Sum the per-dimension scores. Surface the aggregate as a number AND the per-dimension breakdown.

Do NOT translate the aggregate into a recommendation. The skill's schema explicitly omits a `recommendation` field. The report ends after the aggregate and the AI-use signals section.

### 6. Emit report

Write the report to `report.md` in the submission directory or to stdout (depending on the calling environment). In per-panelist mode, write to `report-<panelist_id>.md`.

## Output format

```markdown
# Take-home evaluation — {Candidate name} — {Role}

Submission: `{submission_dir}` · Rubric: `{rubric_path}` (SHA `{short}`)
Generated: {ISO timestamp} · Skill v1.0 · Model: claude-sonnet-4-6
Panelist: {panelist_id or "single-panelist mode"}
AI-use policy: {ai_use_policy}

## Deterministic checks

- **Build:** {passed | failed | not applicable}
- **Tests:** {N/M passed} ({test command})
- **Linter:** {N findings}
- **Missing deliverables:** {list or "none"}

## Per-dimension scores

### Correctness — 4/5

> Evidence: `src/router.rs:42-57` — handles the request-routing edge case for retries with exponential backoff and jitter, including the cap at max 60s. Anchor 4 ("handles the named edge cases with explicit code paths") matches.

Counter-evidence (would have been 5): retry budget is hardcoded at 5 attempts; the rubric's anchor 5 names "configurable retry budget."

### Code quality — 3/5

> Evidence: `src/router.rs` — file is 800 lines with no module split. Anchor 3 ("readable but lacks structural decomposition") matches.

### Decision-making documented — 4/5

> Evidence: `README.md:25-40` — explains the choice of exponential-vs-fixed backoff with a reference to the failure mode it mitigates. Anchor 4 matches.

### Error handling — 2/5

> Evidence: `src/router.rs:120` — catches and re-raises the network error without distinguishing between transient and permanent failures. Anchor 2 ("error paths exist but do not differentiate") matches.

### Test coverage — 4/5

> Evidence: `tests/router_test.rs` — covers the happy path, three retry scenarios, and the timeout. Missing: the network-partition test the rubric anchor 5 names.

## Aggregate

17/25.

This is the per-dimension sum. The skill does NOT translate this into a hire/no-hire recommendation. The panel debrief is where the decision happens.

## AI-use signal notes

Disclosed policy: **syntax-help-only**.

- ⚠️ `signal: uniform_style` — `src/cache.rs` and `src/router.rs` use the same comment style and naming idioms despite the different complexity. May warrant a follow-up question in the panel debrief.
- ✓ No `verbatim_public_match` signals.
- ✓ No `generic_comments` signals beyond the documented threshold.

The panel discusses these against the disclosed policy. The skill does not decide.
```

## Watch-outs

- **Auto-pass / auto-fail drift.** *Guard:* the report ends after the aggregate. No recommendation field. The aggregate is a sum, not a verdict.
- **Generic-feedback hallucination.** *Guard:* every dimension score requires verbatim citation (file path + line range + content).
- **AI-use false positive.** *Guard:* signals are notes, not violations. The panel decides against the disclosed policy.
- **Unsandboxed candidate code.** *Guard:* skill warns before running in non-sandboxed environments.
- **Bias inheritance.** *Guard:* the rubric is upstream of the skill. Audit the rubric separately if the dimensions correlate with disparate impact.

# Take-home rubric template

The take-home evaluator scores a submission against this rubric shape. Copy the JSON below to your role's rubric file (one per take-home format) and fill in every field. The skill reads the rubric; without it, scoring has nothing to anchor against.

A complete rubric takes 30-90 minutes to author per take-home format. Reuse across roles in the same family is high — a senior-backend take-home rubric is largely the same across companies once you've written it once.

## JSON shape

```json
{
"take_home_id": "senior-backend-router-rewrite-v3",
"version": "2026-04-15",
"expected_deliverables": [
"README.md",
"src/**/*.rs",
"tests/**/*.rs",
"Cargo.toml"
],
"build_commands": {
"build": "cargo build --release",
"test": "cargo test --all",
"lint": "cargo clippy -- -D warnings"
},
"ai_use_policy_match": "syntax-help-only",
"dimensions": [
{
"id": "correctness",
"label": "Correctness",
"anchors": {
"1": "Compiles but does not pass the candidate's own tests, or does not handle the named happy path.",
"2": "Handles the happy path; ignores the named edge cases (retries, partial failure).",
"3": "Handles the happy path and the obvious edge cases; misses the subtle ones (clock skew, partition recovery).",
"4": "Handles the named edge cases with explicit code paths; minor gaps acceptable.",
"5": "Handles all named edge cases AND demonstrates a configurable retry budget / timeout structure that the rubric explicitly calls for."
}
},
{
"id": "code_quality",
"label": "Code quality and structural decomposition",
"anchors": {
"1": "Single file, no decomposition; difficult to read.",
"2": "Decomposed but the decomposition does not follow domain boundaries.",
"3": "Readable but lacks structural decomposition that would scale past prototype.",
"4": "Clear module boundaries that follow the domain; idiomatic for the language.",
"5": "All of 4, plus the structural choices are documented in the README with the alternatives considered."
}
},
{
"id": "decision_documentation",
"label": "Decision-making documented",
"anchors": {
"1": "No README, or the README only repeats the take-home brief.",
"2": "README describes what was built without naming the engineering choices.",
"3": "README names some choices without naming the alternatives.",
"4": "README names the choices AND explains why each was picked over the named alternatives.",
"5": "All of 4, plus the README cites the failure modes each choice mitigates."
}
},
{
"id": "error_handling",
"label": "Error handling",
"anchors": {
"1": "Errors are caught and silently swallowed.",
"2": "Error paths exist but do not differentiate between transient and permanent failures.",
"3": "Differentiates transient vs. permanent; lacks structured error types.",
"4": "Structured error types; retry policy is explicit per error class.",
"5": "All of 4, plus error paths have explicit observability (logging / metrics / traces) named in the code."
}
},
{
"id": "test_coverage",
"label": "Test coverage",
"anchors": {
"1": "No tests, or tests do not run.",
"2": "Tests cover the happy path only.",
"3": "Tests cover the happy path and one or two edge cases.",
"4": "Tests cover the happy path and multiple edge cases (timeout, retry, partial failure).",
"5": "All of 4, plus the network-partition test the rubric explicitly calls for."
}
}
],
"rubric_fairness_check": {
"no_bootcamp_vs_cs_proxies": "Anchors must score on observable behavior in the submission, not on idioms that proxy for educational background. 'Uses obscure language idioms' is forbidden as a positive signal.",
"no_native-english-only_proxies": "Anchors must NOT score on README writing fluency beyond the level required to communicate the engineering decisions.",
"documented_in_brief": "The take-home brief shared with the candidate must describe the rubric dimensions and approximate weighting. Surprise dimensions are unfair."
}
}
```

## Per-field notes

- `take_home_id` — stable identifier for the take-home format. Reused across candidates for the same role family.
- `version` — semver or date. Bumped when the rubric is edited; the skill captures the version in the report so re-scoring against an edited rubric is visible.
- `expected_deliverables` — globs the skill walks against the submission. Missing deliverables surface in the report.
- `build_commands` — the skill runs these in step 2 (deterministic checks). Sandboxed execution required.
- `ai_use_policy_match` — should match the disclosure language in the take-home brief. Mismatch means the candidate's policy understanding doesn't match what the skill calibrates against.
- `dimensions` — array. Each dimension has an `id`, a `label`, and 5 anchor strings. Anchors should be observable behavior, not adjectives.
- `rubric_fairness_check` — three named fairness checks the skill confirms before scoring. If the rubric anchors violate any of these, the skill emits a warning and asks the rubric author to revise. (The skill does not refuse to score on a fairness-check violation, because the rubric is upstream and revising it is the right intervention. But it surfaces the issue.)

## Authoring a new dimension

To add a dimension to an existing rubric:

1. Pick observable behavior, not adjectives. "Has good error handling" is not a dimension; "error paths differentiate transient vs. permanent failure" is.
2. Write the 5 anchors as five distinct observable behaviors, each strictly more demanding than the last.
3. Test the dimension on a known submission. Can you score it from the anchors alone, without the original code in your head? If not, the anchors are too vague.
4. Bump the rubric version.

## Authoring a new rubric (for a net-new take-home)

1. Start from the take-home brief. What does the brief tell the candidate to deliver? Those are the `expected_deliverables`.
2. What is the brief asking the candidate to demonstrate? Those become the `dimensions`. Aim for 4-6 dimensions; more than 6 and the panelist can't hold them.
3. Write the 1-anchor first (the floor: what does an unsubmitted-effort look like?), then the 5-anchor (the ceiling: what does the strongest submission look like?), then fill 2-4 between.
4. Write the brief and the rubric in parallel. Anchors that don't show up in the brief are surprise dimensions; anchors in the brief that don't show up in the rubric are unscoreable promises.
5. Run the rubric on a known submission (a prior hire's submission, anonymized). Does it score them where you'd expect?

# AI-use policy mapping

The take-home evaluator runs pattern checks calibrated against the disclosed AI-use policy. The same submission produces different signal interpretation under different policies; this file documents the mapping.

The intent: surface signals to the panel debrief, not to make a determination. AI-use detection is well-known to be unreliable as a forensic tool; the right framing is "discuss with the candidate against the disclosed policy."

## Policy values

### `none-allowed`

The take-home brief told the candidate: "Do not use AI tools (Claude, ChatGPT, Copilot, Cursor, etc.) at any point during this assessment."

Pattern checks:

- **Verbatim public matches** — surface as `signal: verbatim_public_match` if any chunk ≥3 lines matches a known public AI-generated boilerplate exactly.
- **Style consistency vs. complexity** — surface as `signal: uniform_style` if comment style and naming idioms are suspiciously consistent across files of varying complexity.
- **Generic comments without engagement** — surface as `signal: generic_comments` if comments explain what the code does without engaging with the problem-specific decisions, beyond a per-file threshold (default: >40% of comments are generic).

Panel debrief framing: "Signals suggest possible AI use. Discuss with the candidate. The skill does not determine; the panel does, against the disclosed policy."

### `syntax-help-only`

The take-home brief told the candidate: "AI tools are allowed for syntax help (looking up the right method name, checking a regex, formatting). They are NOT allowed for solution generation (asking 'how would I implement this?', 'write me the function for X')."

Pattern checks: same as `none-allowed`, but with different framing. AI-generated boilerplate at the function level is a signal worth discussing; AI-completed identifier names are not.

Panel debrief framing: "AI use was permitted within bounds. Signals indicate where the candidate may have crossed the bounds. Discuss with the candidate."

### `ai-tools-allowed`

The take-home brief told the candidate: "AI tools are allowed throughout. Tell us what tools you used and how, in the README."

Pattern checks: still run, but signals are surfaced as informational only.

Panel debrief framing: "AI use was permitted. The signals are informational. The panel evaluates the SUBMISSION against the rubric; the question is what the candidate built and why, not which tools they used."

In the `ai-tools-allowed` policy, the panel should be ready to evaluate "the candidate's prompting and tool-use judgment" as a positive dimension if the rubric calls it out — many roles in 2026 explicitly want to see how candidates work with AI tools.

## What the skill does NOT do

- **Run a third-party AI-detection model** (GPTZero, Originality, etc.). Those tools have well-documented false-positive rates that climb to 30-50% on technical writing. Their findings do not survive a panel debrief.
- **Output a confidence score for "this was AI-generated."** No such confidence can be honestly assigned; the patterns are signals, not proof.
- **Block the report on signals.** Signals appear in the report. The panel decides. If `none-allowed` and the signals are strong, the panel typically schedules a follow-up conversation rather than auto-rejecting.

## Calibration

The skill's pattern thresholds (e.g. "40% generic comments") are tunable per take-home format. If your team's take-home format produces a lot of boilerplate naturally (e.g. it asks for a CRUD API), the threshold should be raised; if your format produces little boilerplate (e.g. it asks for a custom algorithm), the threshold should be lowered.

The defaults are calibrated against general-engineering take-homes. Tune in `config.json` per take-home format. Document the tuning in the rubric file's notes section.

## Why surfaced as notes, not as a verdict

1. **False-positive cost is asymmetric.** Auto-rejecting a candidate based on an AI-use signal that turns out to be wrong damages the firm's brand and risks a discrimination claim if the signal correlates with disparate impact. Surfacing for discussion costs nothing.
2. **The disclosed policy is the contract.** What matters is whether the candidate followed the policy they were told about. The signal helps the panel ask; it does not answer.
3. **AI-use detection is unreliable.** Even the best-known detectors have unacceptable error rates. The skill does not pretend otherwise.
4. **Hire decisions involve more than this one signal.** A candidate with a strong submission and a possible AI-use signal under `syntax-help-only` is a candidate to talk to, not a candidate to drop.