claude-skill

Clause extraction for any contract with Claude

Difficulty

beginner

Setup time

20min

For

legal-ops · in-house-counsel · paralegal · contract-manager

Legal Ops

Stack

A Claude Skill that takes a single executed contract — .docx or text-layer .pdf — and emits a citation-grounded JSON record with the clauses your CLM actually keys on: governing law, liability cap, indemnification, term, auto-renewal, termination triggers, payment terms, IP ownership, confidentiality term, plus any custom fields you configure (data residency, MFN, change-of-control, assignment). Every extracted value carries a verbatim excerpt, a {page, char_span} citation, and a confidence score, so the downstream reviewer can verify in seconds rather than re-reading the contract.

This page covers when to run it, when explicitly not to, what it costs, and the named failure modes you should size for before pointing it at a production repository.

When to use

Reach for the skill when you have a structured-output need against contracts that have already cleared privilege:

CLM data backfill. You inherited a flat-file repository (Box, SharePoint, network drive) and need to populate Ironclad or Agiloft metadata fields without burning a paralegal quarter.
Clause library construction. You want every “liability cap” clause across the portfolio so the clause library reflects what you actually agreed to, not the playbook’s stated position.
Diligence. You have 48 hours to surface change-of-control, assignment, and most-favored-customer clauses across a target’s contract set before a deal closes.
Renewal triage. You need to flag every contract auto-renewing in the next 90 days with the notice-period-days field populated.

The artifact bundle lives at apps/web/public/artifacts/clause-extraction-claude-skill/ and ships:

SKILL.md — the Skill definition with method, output format, and watch-outs
references/1-clause-taxonomy.md — the clauses to extract per contract type, with headings and synonyms
references/2-output-schema.json — the JSON Schema every record validates against (pin this to a version)
references/3-citation-format.md — citation grammar and the rules for “not present” / “could not extract” fallbacks

When NOT to use

The skill is narrow on purpose. Refuse the invocation in any of these cases.

Privileged drafts in active negotiation. Most legal teams’ AI policies (and the AI policy template we recommend) draw a hard line at in-flight negotiation drafts — particularly outside-counsel redlines and attorney work product. This skill is for executed or near-final contracts that have already cleared the privilege question. If you’re not sure whether a document has cleared, the answer is no.
Anything via non-Tier-A AI vendors. Run only against your firm’s approved Tier-A endpoint (Anthropic API direct, or your enterprise Claude tenant). Never the consumer chatbot. Never a browser plugin. Never an unvetted SaaS wrapper that promises “Claude under the hood.” Sending a contract through a Tier-B vendor is a privilege-leak vector — refuse the invocation rather than route around the AI policy. The Skill itself hard-codes an endpoint allowlist; if you’re running it inside Claude Code or Claude.ai with your enterprise tenant, you’re fine.
Drafting or redlining. This skill reads only. For redlining, use the separate contract-redline skill.
Legal interpretation. The output is text + citation. Whether a 12-month liability cap is “good enough” given the deal context is a judgment call that stays with counsel.

Setup

Drop the bundle into ~/.claude/skills/ (Claude Code) or upload the references/ directory and SKILL.md to a Claude.ai project.
Replace the contents of references/1-clause-taxonomy.md with your firm’s actual taxonomy. The default taxonomy has the common MSA clauses; most firms add 5-10 custom fields (data residency by jurisdiction, change-of-control carveouts, non-solicit term, MFN scope).
Pin references/2-output-schema.json to a version. Bump extractor_version in the schema and in the Skill on every taxonomy change so downstream consumers can detect drift.
Run on a known contract — pick one whose clause values you already have in the CLM. Compare the extracted JSON against the CLM record. Iterate on taxonomy synonyms until you match.
Run at scale. The Skill is per-contract; orchestrate the batch in n8n, a shell loop, or your CLM’s intake hook.

What the skill actually does

Four steps, in order.

Text extraction with layout preservation. .docx is parsed via the docx XML; .pdf via pdfplumber so page numbers and bounding-box character spans survive. If the PDF has no text layer (scanned image), the Skill aborts with error: "ocr_required" rather than emitting empty text. Routing scanned PDFs to OCR is a separate upstream concern; this Skill does not OCR, because silently producing a “clean” empty extraction from a scan is worse than failing loud.
Citation-grounded extraction, one pass per clause. For each clause in the taxonomy: find candidate paragraphs by heading + synonym match, pass only those candidates (not the whole contract) to Claude with the clause definition, and require back the value, a verbatim ≤ 280-char excerpt, the {page, char_span} citation, and a high | medium | low confidence score. Any excerpt not byte-identical to a substring of the source paragraphs is rejected — this is the hallucination guard, and it is non-negotiable. Per-clause prompts (vs one mega-prompt) let you retry only the failures, cap each call’s input tokens, and isolate hallucination to a single field instead of the whole record.
Schema validation against the pinned output-schema.json. Validation errors land in the output’s errors array. The Skill does not silently coerce types.
“Not present” fallback. When a clause is not located, emit value: null, status: "not_present", note: "Searched headings: [...]". Do not guess. CLM backfill pipelines treat null + status:not_present as confirmed-absent (file the contract without that field) and null + status:error as needs-rerun (do not file). Conflating the two corrupts CLM data over time.

Cost reality

At 2026 Claude pricing — call it ~$3/M input tokens and ~$15/M output tokens for the cost-effective model used inside the Skill — the cost is dominated by input tokens, and input tokens are dominated by the length of the candidate paragraphs (because the Skill never sends the full contract, only the matched paragraphs per clause).

Rough numbers per contract:

Short contract (5 pages, ~3K input tokens across all per-clause calls, ~500 output tokens): ~$0.02 per contract.
Standard MSA (20 pages, ~12K input tokens, ~1K output tokens): ~$0.05 per contract.
Long enterprise MSA with schedules (60 pages, ~35K input tokens, ~2K output tokens): ~$0.13 per contract.

At a typical mid-market in-house team running ~200 new and inherited contracts a month through the pipeline, that’s $10–$30/month in token spend. The cost is rounding error against a paralegal hour. Where it stops being rounding error is the 50,000-contract diligence project — at $0.05 each, that’s $2,500, which is still cheap, but worth budgeting up front rather than discovering on the credit card statement.

The non-token cost: every extraction with confidence: medium | low (and a 10% sample of high) needs human review. Plan for ~30 seconds per record at medium and ~2 minutes at low. The Skill is faster than a paralegal, not free.

Success metric

Two metrics worth instrumenting from day one.

Extraction accuracy on a labeled set. Build a 50-contract gold set with manual extractions. Measure precision and recall per clause. Target: ≥ 95% precision on the required clauses (governing_law, liability_cap, term_length_months, auto_renewal). Below that, the false positives poison the CLM and reviewers learn to ignore the field. Recall matters less — not_present is a load-bearing answer, and a missed clause routes to human review anyway.
Time per contract, end to end. Including the human review pass on flagged records. Target for a 20-page MSA: under 4 minutes wall-clock, vs. 20-30 minutes for full manual extraction. If you’re not seeing 5×, the human-review queue is too aggressive — tighten the confidence thresholds.

vs alternatives

vs Ironclad native AI clause extraction. Ironclad’s built-in extraction is excellent if every contract you care about lives in Ironclad. It struggles when you’re backfilling from outside Ironclad (the import path is clunky) and when you want custom clauses beyond Ironclad’s templated set. This Skill runs against any file on disk and uses your taxonomy. If you live entirely in Ironclad, use their native extraction; if you’re feeding multiple destinations or doing diligence on a non-Ironclad repository, this Skill is the better fit.
vs Kira Systems. Kira is the enterprise-grade incumbent — high accuracy, deep template library, expensive (six figures), long sales cycle, requires training data per custom clause. If you’re a BigLaw firm doing M&A diligence at scale, Kira earns its price. If you’re a 50-person legal-ops team backfilling a few thousand inherited MSAs, Kira is overkill and this Skill is two orders of magnitude cheaper for the accuracy you need.
vs manual paralegal review. The honest comparison. A paralegal extracting 10 clauses from a 20-page MSA takes 20-30 minutes and gets ≥ 99% accuracy on the easy clauses (governing law, term) and ~90% on the hard ones (liability cap structure, indemnification carveouts). This Skill does it in under a minute at ~$0.05, hits ~95% on easy and ~85% on hard, and routes the rest to a human via the confidence flag. The right move for most teams is hybrid: Skill on every contract, paralegal on the flagged records.

Watch-outs

Privilege leak via Tier-B vendor. Routing a privileged document through a non-approved AI endpoint can waive privilege. Guard: the Skill checks a hard-coded endpoint allowlist (api.anthropic.com plus your enterprise tenant) at startup and refuses to run if the configured endpoint isn’t on it. Document the allowlist owner in your AI policy.
OCR-induced text gaps on scanned PDFs. A scanned-image PDF with no OCR layer extracts as empty pages; without a guard, the Skill would report most clauses not_present and look like a clean run. Guard: step 1 detects pages with < 50 extracted characters and aborts with ocr_required rather than emitting a misleading record. Route the contract through OCR upstream and re-run.
Hallucinated clauses. Models will helpfully invent a “termination for convenience” clause that doesn’t exist if asked. Guard: the byte-identical excerpt-substring check in step 2 — any excerpt not literally present in the source paragraphs is rejected and the clause records status: "error", error: "excerpt_not_grounded". There is no high-confidence hallucination path by construction.
Schema drift across contract versions. A taxonomy update that changes liability_cap from a string to a {type, amount, period} object silently breaks every downstream consumer. Guard: pin extractor_version in references/2-output-schema.json and bump it on every taxonomy or schema change. Downstream consumers key on version, not on a stability assumption.
Defined-term resolution. “As set forth in Schedule A” returns the reference, not the value. Guard: the Skill detects as set forth in / as defined in and emits confidence: medium with note: "cross-reference, manual resolution required". Naive auto-resolution is worse than the honest flag.
Not legal advice. Extraction is mechanical. Whether a 12-month cap is acceptable for this deal is judgment that stays with counsel.

Stack

Claude — text extraction orchestration, citation-grounded clause extraction, schema validation
Ironclad (optional) — primary CLM destination for the extracted records. See also alternatives-to-ironclad and the best CLM platforms comparison if you’re still picking one.
CLM background — what CLM is and where extraction fits.

Edit this page on GitHub

Files in this artifact

Download all (.zip)

---
name: clause-extraction
description: Extract a fixed set of contract clauses from a single .pdf or .docx and emit citation-grounded JSON with page/span references. Use after intake to backfill CLM metadata, build a clause library, or surface change-of-control / liability terms during diligence.
---

# Clause extraction

## When to invoke

Invoke this skill per contract, after the document has been ingested and you need a structured clause record (governing law, liability cap, term, auto-renewal, indemnification, payment terms, IP ownership, confidentiality term, termination triggers, plus any custom clauses you configure).

Typical callers:

- CLM backfill — populating Ironclad / Agiloft / DealHub metadata for a legacy contract repository
- Diligence — surfacing change-of-control, assignment, MFN clauses on a target company's contract set before deal close
- Clause library — building a corpus of "what we actually agreed to" across a portfolio so the playbook reflects reality

Do NOT invoke this skill for:

- **Privileged drafts in active negotiation** — per AI policy in most legal teams, in-flight negotiation drafts (especially with outside counsel redlines) do not get sent to AI tooling. This skill is for executed or near-final contracts that have already cleared privilege.
- **Anything via non-Tier-A AI vendors.** Run only against the firm-approved Tier-A model endpoint (Anthropic API or your enterprise Claude tenant). A general-purpose chatbot, browser plugin, or unvetted SaaS wrapper is a privilege-leak vector — refuse the invocation rather than route around the AI policy.
- Drafting or redlining clauses (this skill reads only)
- Interpreting legal effect (the output is text + citation; legal judgment stays with counsel)

## Inputs

- Required: `contract_path` — absolute path to a `.pdf` or `.docx`. PDFs must be text-based or pre-OCR'd; scanned-image PDFs without an OCR layer are rejected at step 1.
- Required: `taxonomy` — path to `references/clause-taxonomy.md` (or a custom taxonomy keyed by contract type). Defines the clauses to look for and the expected value type (string, number, boolean, enum).
- Required: `output_schema` — path to `references/output-schema.json`. The JSON Schema the output must validate against. Schema drift across contract versions is the #1 source of downstream pipeline breakage; pinning the schema per run guards against it.
- Optional: `contract_type` — `msa | sow | nda | dpa | order_form`. Selects the clause subset from the taxonomy. Defaults to `msa`.
- Optional: `custom_clauses` — array of additional clause names to look for beyond the taxonomy defaults (e.g. `data_residency_clause`, `most_favored_customer_clause`).

## Reference files

Read these from `references/` before processing. They are templates — replace the placeholder content with your firm's real taxonomy and schema before running on production contracts.

- `references/clause-taxonomy.md` — clause definitions per contract type, with the value type, required/optional flag, and synonym phrases the extraction step matches against
- `references/output-schema.json` — the JSON Schema every emitted record must validate against
- `references/citation-format.md` — citation grammar (page + span anchor) and the rules for "not present" / "could not extract" fallbacks

## Method

Run these steps in order. Do not parallelize — later steps depend on the artifacts produced by earlier ones.

### 1. Text extraction with layout preservation

For `.docx`: parse via the docx XML and emit a flat text stream with paragraph indices and section headings preserved.

For `.pdf`: use a text-layer extractor (pdfplumber or pdfminer.six) that preserves page numbers and bounding-box character spans. If the PDF has no text layer (scanned image), abort with `error: "ocr_required"` rather than silently emitting empty text. Routing a scanned PDF to OCR is a separate upstream concern; this skill does not OCR.

The output of step 1 is a list of `{page, paragraph_index, char_span, text}` records. Every later citation references these coordinates.

### 2. Citation-grounded extraction (one pass per clause)

For each clause in the taxonomy:

1. Locate candidate paragraphs by heading match (e.g. "Governing Law", "Term") and synonym phrase match (e.g. "shall be governed by", "initial term of this Agreement").
2. Pass the candidate paragraphs (not the full contract) to Claude with the clause definition and ask for: the value, the verbatim source excerpt (≤ 280 chars), the `{page, char_span}` citation, and a `confidence` score (`high | medium | low`).
3. **Reject any extracted excerpt that is not byte-identical to a substring of the source paragraphs.** This is the hallucination guard — if the model returns text not actually in the contract, drop the extraction and record `value: null, error: "excerpt_not_grounded"`.

Why one pass per clause and not a single mega-prompt: per-clause prompts let you retry only the failures, cap each call's input tokens (cheaper, faster), and isolate hallucination failures to a single field instead of the whole record.

### 3. Schema validation

Validate the assembled record against `output-schema.json`. If validation fails, emit the validation error in the output's `errors` array. Do not silently coerce types.

### 4. "Not present" fallback

If a clause is not located in step 2 (no candidate paragraphs above confidence threshold), emit `value: null, status: "not_present", note: "Searched headings: [...]; no matching paragraphs found."` Do not guess. "Not present" is a load-bearing answer; CLM backfill pipelines treat `null + status:not_present` differently from `null + error:*`.

## Output format

Always emit a single JSON object per contract. Soft constraints below are enforced by `references/output-schema.json`.

```json
{
"contract_file": "vendor_msa_2026.pdf",
"contract_type": "msa",
"extracted_at": "2026-05-03T14:22:00Z",
"extractor_version": "clause-extraction@2026.05",
"clauses": {
"governing_law": {
"value": "Delaware",
"excerpt": "This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflict of laws principles.",
"citation": { "page": 14, "char_span": [1820, 1980] },
"confidence": "high",
"status": "extracted"
},
"liability_cap": {
"value": "12 months fees",
"excerpt": "In no event shall either party's aggregate liability exceed the fees paid by Customer in the twelve (12) months preceding the event giving rise to the claim.",
"citation": { "page": 18, "char_span": [220, 410] },
"confidence": "high",
"status": "extracted"
},
"auto_renewal": {
"value": true,
"renewal_term_months": 12,
"notice_period_days": 90,
"excerpt": "This Agreement shall automatically renew for successive 12-month terms unless either party provides 90 days' written notice of non-renewal.",
"citation": { "page": 3, "char_span": [50, 230] },
"confidence": "high",
"status": "extracted"
},
"most_favored_customer_clause": {
"value": null,
"status": "not_present",
"note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; no matching paragraphs found."
}
},
"errors": []
}
```

## Watch-outs

- **Privilege leak via Tier-B vendor.** Routing a privileged or attorney-work-product document through a non-approved AI endpoint can waive privilege. Guard: hard-coded allowlist of model endpoints (`ALLOWED_ENDPOINTS = ["api.anthropic.com", "<your-enterprise-tenant>"]`) checked at skill startup. Refuse to run if the configured endpoint is not on the list. Document the allowlist owner in your AI policy.
- **OCR-induced text gaps on scanned PDFs.** If step 1 silently emits empty pages from a scanned image PDF, the skill will report many clauses as `not_present` and look like a clean extraction. Guard: step 1 detects pages with < 50 extracted characters and aborts with `ocr_required` rather than producing a misleading "clean" record.
- **Hallucinated clauses.** Models will helpfully invent a "termination for convenience" clause that doesn't exist if asked. Guard: byte-identical excerpt-substring check in step 2 — any excerpt not literally present in the source paragraphs is rejected. Pair with `confidence: low` flagging for human review on the rest.
- **Schema drift across contract versions.** A taxonomy update that changes `liability_cap` from a string to a structured `{type, amount, period}` silently breaks every downstream consumer. Guard: pin `extractor_version` in the output and bump it on every taxonomy or schema change. Downstream consumers key on version, not on the assumption that the schema is stable.
- **Defined-term resolution.** When a clause says "as set forth in Schedule A" the excerpt is the reference, not the value. Guard: detect the substring "as set forth in" / "as defined in" and emit `confidence: medium, note: "cross-reference, manual resolution required"` rather than treating the reference as the answer.
- **Heading-light contracts.** Contracts without clear section headings (older or short-form) extract less reliably. Guard: when fewer than 60% of expected headings match in step 2, mark the whole record `confidence: medium` and note `"heading_density: low"` so downstream QA routes it to human review.

# Clause taxonomy — TEMPLATE

> Replace this template's contents with your firm's actual clause taxonomy
> per contract type. The clause-extraction skill reads this file on every
> run; without your real taxonomy, extractions will use the generic defaults
> below and miss the clauses your CLM cares about.

The skill keys on `clause_id`. Every clause record in the output JSON uses the `clause_id` as the property name. Adding a clause means: add an entry here AND add the matching property to `output-schema.json`.

## Convention

For each clause:

- `clause_id` — snake_case identifier used as the JSON key
- `value_type` — `string | number | boolean | enum | structured`
- `required` — `true | false` (drives `not_present` vs hard error in validation)
- `headings` — list of section heading strings the locator matches against
- `synonyms` — list of phrase substrings the locator falls back to when no heading matches
- `value_hint` — what the extractor should pull (e.g. "the named jurisdiction state or country", "12-month-fees / 24-month-fees / unlimited / other")

## MSA defaults

### governing_law

- value_type: `string`
- required: true
- headings: `["Governing Law", "Choice of Law", "Applicable Law"]`
- synonyms: `["shall be governed by", "construed in accordance with the laws of"]`
- value_hint: the named jurisdiction (state, province, or country)

### liability_cap

- value_type: `structured` — `{ type: "fees_period" | "fixed_amount" | "unlimited" | "other", amount?: number, period_months?: number, currency?: string }`
- required: true
- headings: `["Limitation of Liability", "Liability Cap", "Cap on Liability"]`
- synonyms: `["aggregate liability shall not exceed", "in no event shall either party's liability exceed"]`
- value_hint: extract the cap amount or formula. Distinguish indirect-damages exclusions (do NOT extract those here) from the cap itself.

### indemnification

- value_type: `structured` — `{ ip_indemnity: boolean, mutual: boolean, carveouts: string[] }`
- required: true
- headings: `["Indemnification", "Indemnity"]`
- synonyms: `["shall defend, indemnify and hold harmless"]`
- value_hint: pull the IP indemnity boolean and the carveouts list (e.g. combinations, modifications, open source).

### term_length_months

- value_type: `number`
- required: true
- headings: `["Term", "Term and Termination"]`
- synonyms: `["initial term of this Agreement", "shall commence on the Effective Date"]`
- value_hint: convert years to months (3-year term → 36).

### auto_renewal

- value_type: `structured` — `{ enabled: boolean, renewal_term_months?: number, notice_period_days?: number }`
- required: true
- headings: `["Renewal", "Term and Termination"]`
- synonyms: `["shall automatically renew", "evergreen", "successive renewal terms"]`

### termination_triggers

- value_type: `structured` — `{ for_convenience: { allowed: boolean, notice_days?: number }, for_cause: { material_breach_cure_days?: number }, for_insolvency: boolean }`
- required: true
- headings: `["Termination"]`
- synonyms: `["may terminate this Agreement", "for material breach"]`

### payment_terms

- value_type: `structured` — `{ net_days: number, currency: string, late_fee_apr?: number }`
- required: true
- headings: `["Payment", "Fees and Payment", "Invoicing"]`
- synonyms: `["payable within", "net thirty (30) days"]`

### ip_ownership

- value_type: `enum` — `vendor | customer | joint | work_for_hire | other`
- required: true
- headings: `["Intellectual Property", "Ownership", "IP Rights"]`
- synonyms: `["all right, title and interest"]`

### confidentiality_term_months

- value_type: `number`
- required: true
- headings: `["Confidentiality", "Non-Disclosure"]`
- synonyms: `["confidentiality obligations shall survive", "for a period of"]`
- value_hint: convert years to months. If trade-secret carveout is "in perpetuity", emit `-1` and set `confidence: medium`.

## NDA defaults

(Replace with your NDA-specific taxonomy. Typical: `term_months`, `survival_period_months`, `permitted_purposes`, `residual_rights`, `return_or_destroy`.)

## DPA defaults

(Replace with your DPA-specific taxonomy. Typical: `data_residency`, `subprocessor_consent`, `audit_rights`, `breach_notification_hours`, `sccs_module_used`.)

## Custom clauses (firm-specific)

Add your firm-specific clauses here. Examples to consider:

- `change_of_control_clause` — boolean + carveouts
- `most_favored_customer_clause` — boolean + scope
- `data_residency_clause` — enum of jurisdictions
- `assignment_restriction` — enum: `no_restriction | consent_required | prohibited`
- `non_solicit_term_months` — number

## Last edited

{YYYY-MM-DD} — bump on every taxonomy change. The extractor records this date in `extractor_version` so downstream consumers can detect schema drift.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://ooligo.io/schemas/clause-extraction/v1.json",
  "title": "Clause extraction record",
  "description": "Output of the clause-extraction skill, one record per contract. Replace this template with your firm's pinned schema. Bump $id on every breaking change so downstream consumers can key on version.",
  "type": "object",
  "required": ["contract_file", "contract_type", "extracted_at", "extractor_version", "clauses", "errors"],
  "additionalProperties": false,
  "properties": {
    "contract_file": { "type": "string", "minLength": 1 },
    "contract_type": { "type": "string", "enum": ["msa", "sow", "nda", "dpa", "order_form"] },
    "extracted_at": { "type": "string", "format": "date-time" },
    "extractor_version": { "type": "string", "pattern": "^clause-extraction@[0-9]{4}\\.[0-9]{2}$" },
    "source_metadata": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "page_count": { "type": "integer", "minimum": 1 },
        "heading_density": { "type": "string", "enum": ["high", "medium", "low"] },
        "ocr_layer_present": { "type": "boolean" }
      }
    },
    "clauses": {
      "type": "object",
      "description": "One property per clause_id from the taxonomy. Use $defs/clauseRecord for each.",
      "additionalProperties": { "$ref": "#/$defs/clauseRecord" }
    },
    "errors": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["code", "message"],
        "properties": {
          "code": { "type": "string", "enum": ["ocr_required", "schema_validation_failed", "endpoint_not_allowed", "excerpt_not_grounded", "taxonomy_mismatch"] },
          "message": { "type": "string" },
          "clause_id": { "type": "string" }
        }
      }
    }
  },
  "$defs": {
    "clauseRecord": {
      "type": "object",
      "required": ["status"],
      "additionalProperties": true,
      "properties": {
        "status": { "type": "string", "enum": ["extracted", "not_present", "error"] },
        "value": {
          "description": "Type depends on the clause's value_type in the taxonomy. Null when status != extracted.",
          "anyOf": [
            { "type": "string" },
            { "type": "number" },
            { "type": "boolean" },
            { "type": "object" },
            { "type": "null" }
          ]
        },
        "excerpt": {
          "type": "string",
          "maxLength": 280,
          "description": "Verbatim substring of the source. MUST be byte-identical to text in the source paragraphs — the hallucination guard rejects any excerpt not literally present."
        },
        "citation": {
          "type": "object",
          "required": ["page", "char_span"],
          "additionalProperties": false,
          "properties": {
            "page": { "type": "integer", "minimum": 1 },
            "char_span": {
              "type": "array",
              "minItems": 2,
              "maxItems": 2,
              "items": { "type": "integer", "minimum": 0 },
              "description": "[start_char_offset, end_char_offset] within the page's extracted text."
            }
          }
        },
        "confidence": { "type": "string", "enum": ["high", "medium", "low"] },
        "note": { "type": "string" },
        "error": { "type": "string" }
      },
      "allOf": [
        {
          "if": { "properties": { "status": { "const": "extracted" } } },
          "then": { "required": ["value", "excerpt", "citation", "confidence"] }
        },
        {
          "if": { "properties": { "status": { "const": "not_present" } } },
          "then": { "required": ["note"] }
        },
        {
          "if": { "properties": { "status": { "const": "error" } } },
          "then": { "required": ["error"] }
        }
      ]
    }
  }
}

# Citation format — TEMPLATE

> The clause-extraction skill emits a citation on every extracted clause so
> the downstream reviewer can verify the extraction in seconds rather than
> re-reading the contract. Without a usable citation grammar, extractions
> are unfalsifiable — and unfalsifiable extractions become silent CLM data
> rot. This file pins the format and the fallback rules.

## Citation grammar

A citation is a structured object, not a string. The skill emits:

```json
{
  "page": 14,
  "char_span": [1820, 1980]
}
```

- `page` — 1-indexed page number in the source PDF, or paragraph cluster index for `.docx` (since `.docx` has no fixed pagination).
- `char_span` — `[start, end]` character offsets within the page's extracted text, where `start` is inclusive and `end` is exclusive.

The `excerpt` field on the clause record is the verbatim substring at that span. The skill enforces that `page_text[char_span[0]:char_span[1]] == excerpt`. If the assertion fails, the extraction is rejected and the clause is recorded with `status: "error", error: "excerpt_not_grounded"`.

## Why structured, not "p. 14, ¶ 3"

Free-text citations like "p. 14, paragraph 3" cannot be machine-verified. A reviewer cannot click them. A pipeline cannot diff them across re-runs. A regression test cannot assert "the citation moved by exactly N characters when we re-ran extraction after taxonomy v2." Structured citations make every extraction reproducible and reviewable.

## Reviewer UX expectations

Downstream tooling (CLM, review queue, audit log) is expected to render the citation as a deep link into the source PDF page with the excerpt highlighted. Without that affordance, reviewers fall back to ctrl-F on the excerpt — which works, but doubles review time.

Recommended renderer behavior:

- Display the excerpt with the citation page badge inline
- On click, open the source PDF at the cited page with the excerpt highlighted (PDF.js supports `#highlight=<text>`)
- Show `confidence` as a colored chip: high = green, medium = amber, low = red

## "Not present" — the load-bearing answer

When a clause is not located, the citation is omitted and `status: "not_present"` is set with a `note` field documenting the search:

```json
{
  "value": null,
  "status": "not_present",
  "note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; searched synonyms: ['most favored', 'no less favorable']; no matching paragraphs found."
}
```

This is intentionally explicit. CLM backfill pipelines treat a `null` with `status: "not_present"` as confirmed-absent (file the contract without that field) and a `null` with `status: "error"` as needs-rerun (do not file). Conflating the two corrupts CLM data over time.

## Cross-reference handling

When the matched paragraph is a pointer ("as set forth in Schedule A"), the skill emits:

```json
{
  "value": "see Schedule A",
  "excerpt": "Liability shall be limited as set forth in Schedule A.",
  "citation": { "page": 18, "char_span": [220, 274] },
  "confidence": "medium",
  "note": "cross-reference; manual resolution required"
}
```

Resolving cross-references is out of scope. The skill could chase the reference into Schedule A, but the failure modes (mis-numbered schedules, amendments overriding the schedule, partially-resolved chains) make naive resolution worse than an honest "needs human" flag.

## Confidence calibration

| Confidence | Meaning | Reviewer action |
|---|---|---|
| `high` | Heading match + synonym match + clean excerpt grounded in source | Spot-check 10% sample; trust the rest |
| `medium` | Synonym match without heading, OR cross-reference, OR low heading density on the contract overall | Review every record |
| `low` | Multiple candidate paragraphs and the model picked one with weak signal, OR excerpt is ≥ 200 chars | Review every record before filing |

The skill MUST NOT emit `high` for a record that did not pass the byte-identical excerpt check. There is no "high-confidence hallucination" case — by construction.

## Last edited

{YYYY-MM-DD}