claude-skill

Extracción de cláusulas para cualquier contrato con Claude

Dificultad

principiante

Tiempo de setup

20min

Para

legal-ops · in-house-counsel · paralegal · contract-manager

Legal Ops

Stack

Un Claude Skill que toma un único contrato firmado — .docx o .pdf con capa de texto — y emite un registro JSON anclado con citas que contiene las cláusulas que tu CLM realmente indexa: ley aplicable, límite de responsabilidad, indemnización, plazo, renovación automática, causales de terminación, condiciones de pago, propiedad intelectual, plazo de confidencialidad, además de cualquier campo personalizado que configures (residencia de datos, MFN, cambio de control, cesión). Cada valor extraído incluye un extracto literal, una cita {page, char_span} y un puntaje de confianza, de modo que el revisor downstream pueda verificar en segundos en lugar de releer el contrato.

Esta página cubre cuándo ejecutarlo, cuándo explícitamente no hacerlo, cuánto cuesta y los modos de falla nombrados que debes dimensionar antes de apuntarlo a un repositorio en producción.

Cuándo usarlo

Recurre al skill cuando tengas una necesidad de salida estructurada contra contratos que ya hayan superado la etapa de privilegio:

Backfill de datos en CLM. Heredaste un repositorio de archivos planos (Box, SharePoint, unidad de red) y necesitas poblar los campos de metadatos de Ironclad o Agiloft sin quemar un trimestre de paralegal.
Construcción de biblioteca de cláusulas. Quieres cada cláusula de “límite de responsabilidad” en todo el portafolio para que la biblioteca de cláusulas refleje lo que realmente acordaste, no la posición declarada del playbook.
Diligencia. Tienes 48 horas para identificar cláusulas de cambio de control, cesión y cliente más favorecido en el conjunto de contratos de un objetivo antes de cerrar una operación.
Triaje de renovaciones. Necesitas marcar cada contrato con renovación automática en los próximos 90 días, con el campo notice-period-days completado.

El bundle del artefacto vive en apps/web/public/artifacts/clause-extraction-claude-skill/ e incluye:

SKILL.md — la definición del Skill con método, formato de salida y advertencias
references/1-clause-taxonomy.md — las cláusulas a extraer por tipo de contrato, con encabezados y sinónimos
references/2-output-schema.json — el JSON Schema contra el cual valida cada registro (fíjalo a una versión)
references/3-citation-format.md — la gramática de citas y las reglas para los fallbacks de “no presente” / “no se pudo extraer”

Cuándo NO usarlo

El skill es estrecho a propósito. Rechaza la invocación en cualquiera de estos casos.

Borradores privilegiados en negociación activa. La política de IA de la mayoría de los equipos legales (y la plantilla de política de IA que recomendamos) traza una línea dura ante borradores en negociación — particularmente redlines de outside counsel y producto de trabajo del abogado. Este skill es para contratos firmados o casi finales que ya hayan superado la cuestión del privilegio. Si no estás seguro de si un documento ha superado esa etapa, la respuesta es no.
Cualquier cosa vía proveedores de IA que no sean Tier-A. Ejecuta solo contra el endpoint Tier-A aprobado por tu firma (Anthropic API directa, o tu tenant empresarial de Claude). Nunca el chatbot de consumo. Nunca un plugin de navegador. Nunca un wrapper SaaS no auditado que prometa “Claude por debajo”. Enviar un contrato a través de un proveedor Tier-B es un vector de fuga de privilegio — rechaza la invocación en lugar de saltarte la política de IA. El propio Skill codifica una allowlist de endpoints; si lo ejecutas dentro de Claude Code o Claude.ai con tu tenant empresarial, estás bien.
Redacción o redlining. Este skill solo lee. Para redlining, usa el skill separado de contract-redline.
Interpretación legal. La salida es texto + cita. Si un límite de responsabilidad de 12 meses es “suficientemente bueno” dado el contexto del trato es una decisión de criterio que se queda con el abogado.

Setup

Coloca el bundle en ~/.claude/skills/ (Claude Code) o sube el directorio references/ y SKILL.md a un proyecto en Claude.ai.
Reemplaza el contenido de references/1-clause-taxonomy.md con la taxonomía real de tu firma. La taxonomía por defecto trae las cláusulas comunes de MSA; la mayoría de las firmas agrega 5 a 10 campos personalizados (residencia de datos por jurisdicción, carveouts de cambio de control, plazo de no captación, alcance de MFN).
Fija references/2-output-schema.json a una versión. Sube extractor_version en el schema y en el Skill en cada cambio de taxonomía para que los consumidores downstream puedan detectar drift.
Ejecuta sobre un contrato conocido — elige uno cuyos valores de cláusula ya tengas en el CLM. Compara el JSON extraído contra el registro del CLM. Itera sobre los sinónimos de la taxonomía hasta que coincida.
Ejecuta a escala. El Skill es por contrato; orquesta el batch en n8n, en un loop de shell, o en el hook de ingesta de tu CLM.

Qué hace realmente el skill

Cuatro pasos, en orden.

Extracción de texto preservando layout. El .docx se parsea vía el XML de docx; el .pdf vía pdfplumber para que sobrevivan los números de página y los character spans con bounding box. Si el PDF no tiene capa de texto (imagen escaneada), el Skill aborta con error: "ocr_required" en lugar de emitir texto vacío. Enrutar PDFs escaneados a OCR es una preocupación upstream separada; este Skill no hace OCR, porque producir silenciosamente una extracción “limpia” vacía a partir de un escaneo es peor que fallar de forma ruidosa.
Extracción anclada con citas, una pasada por cláusula. Para cada cláusula en la taxonomía: encuentra párrafos candidatos por coincidencia de encabezado + sinónimos, pasa solo esos candidatos (no el contrato completo) a Claude con la definición de la cláusula, y exige de vuelta el valor, un extracto literal de ≤ 280 caracteres, la cita {page, char_span} y un puntaje de confianza high | medium | low. Cualquier extracto que no sea byte-idéntico a un substring de los párrafos fuente es rechazado — esta es la guardia anti-alucinación, y es no negociable. Los prompts por cláusula (vs un mega-prompt) te permiten reintentar solo las fallas, capar los tokens de entrada por llamada y aislar la alucinación a un solo campo en lugar del registro completo.
Validación de schema contra el output-schema.json fijado. Los errores de validación aterrizan en el array errors de la salida. El Skill no coacciona tipos en silencio.
Fallback de “no presente”. Cuando una cláusula no se localiza, emite value: null, status: "not_present", note: "Searched headings: [...]". No adivines. Los pipelines de backfill de CLM tratan null + status:not_present como ausencia confirmada (archivar el contrato sin ese campo) y null + status:error como necesita re-ejecución (no archivar). Confundir los dos corrompe los datos del CLM con el tiempo.

Realidad de costos

Con el pricing de Claude en 2026 — pongamos ~$3/M de tokens de entrada y ~$15/M de tokens de salida para el modelo cost-effective usado dentro del Skill — el costo está dominado por los tokens de entrada, y los tokens de entrada están dominados por la longitud de los párrafos candidatos (porque el Skill nunca envía el contrato completo, solo los párrafos coincidentes por cláusula).

Números aproximados por contrato:

Contrato corto (5 páginas, ~3K tokens de entrada en todas las llamadas por cláusula, ~500 tokens de salida): ~$0,02 por contrato.
MSA estándar (20 páginas, ~12K tokens de entrada, ~1K tokens de salida): ~$0,05 por contrato.
MSA enterprise largo con anexos (60 páginas, ~35K tokens de entrada, ~2K tokens de salida): ~$0,13 por contrato.

En un equipo in-house típico de mid-market que pasa ~200 contratos nuevos y heredados al mes por el pipeline, eso son $10–$30/mes en gasto de tokens. El costo es un error de redondeo frente a una hora de paralegal. Donde deja de ser error de redondeo es en el proyecto de diligencia de 50.000 contratos — a $0,05 cada uno, son $2.500, que sigue siendo barato, pero vale la pena presupuestarlo de antemano y no descubrirlo en el extracto de la tarjeta de crédito.

El costo no-token: cada extracción con confidence: medium | low (y un sample del 10% de high) necesita revisión humana. Planifica ~30 segundos por registro en medium y ~2 minutos en low. El Skill es más rápido que un paralegal, no gratis.

Métrica de éxito

Dos métricas que vale la pena instrumentar desde el día uno.

Precisión de extracción sobre un set etiquetado. Construye un gold set de 50 contratos con extracciones manuales. Mide precisión y recall por cláusula. Objetivo: ≥ 95% de precisión en las cláusulas requeridas (governing_law, liability_cap, term_length_months, auto_renewal). Por debajo de eso, los falsos positivos envenenan el CLM y los revisores aprenden a ignorar el campo. El recall importa menos — not_present es una respuesta load-bearing, y una cláusula no detectada se enruta a revisión humana de todas formas.
Tiempo por contrato, end to end. Incluyendo la pasada de revisión humana sobre los registros marcados. Objetivo para un MSA de 20 páginas: menos de 4 minutos de wall-clock, vs. 20-30 minutos de extracción manual completa. Si no estás viendo 5×, la cola de revisión humana es demasiado agresiva — ajusta los umbrales de confianza.

vs alternativas

vs la extracción nativa de cláusulas con IA de Ironclad. La extracción incorporada de Ironclad es excelente si todos los contratos que te importan viven en Ironclad. Tropieza cuando estás haciendo backfill desde fuera de Ironclad (la ruta de import es torpe) y cuando quieres cláusulas personalizadas más allá del set templado de Ironclad. Este Skill corre contra cualquier archivo en disco y usa tu taxonomía. Si vives enteramente en Ironclad, usa su extracción nativa; si estás alimentando múltiples destinos o haciendo diligencia sobre un repositorio no-Ironclad, este Skill es el mejor encaje.
vs Kira Systems. Kira es el incumbente de grado enterprise — alta precisión, biblioteca profunda de templates, caro (seis cifras), ciclo de venta largo, requiere datos de entrenamiento por cláusula personalizada. Si eres una firma BigLaw haciendo diligencia M&A a escala, Kira se gana su precio. Si eres un equipo de legal-ops de 50 personas haciendo backfill de unos pocos miles de MSAs heredados, Kira es overkill y este Skill es dos órdenes de magnitud más barato para la precisión que necesitas.
vs revisión manual de paralegal. La comparación honesta. Un paralegal extrayendo 10 cláusulas de un MSA de 20 páginas tarda 20-30 minutos y obtiene ≥ 99% de precisión en las cláusulas fáciles (ley aplicable, plazo) y ~90% en las difíciles (estructura del límite de responsabilidad, carveouts de indemnización). Este Skill lo hace en menos de un minuto a ~$0,05, llega a ~95% en fáciles y ~85% en difíciles, y enruta el resto a un humano vía la flag de confianza. La movida correcta para la mayoría de los equipos es híbrida: Skill en cada contrato, paralegal en los registros marcados.

Watch-outs

Fuga de privilegio vía proveedor Tier-B. Enrutar un documento privilegiado a través de un endpoint de IA no aprobado puede renunciar al privilegio. Guard: el Skill verifica una allowlist de endpoints codificada (api.anthropic.com más tu tenant empresarial) al inicio y se rehúsa a correr si el endpoint configurado no está en ella. Documenta al dueño de la allowlist en tu política de IA.
Vacíos de texto inducidos por OCR en PDFs escaneados. Un PDF de imagen escaneada sin capa OCR se extrae como páginas vacías; sin una guardia, el Skill reportaría la mayoría de las cláusulas como not_present y parecería una corrida limpia. Guard: el paso 1 detecta páginas con < 50 caracteres extraídos y aborta con ocr_required en lugar de emitir un registro engañoso. Enruta el contrato por OCR upstream y vuelve a ejecutar.
Cláusulas alucinadas. Los modelos amablemente inventarán una cláusula de “terminación por conveniencia” que no existe si se les pide. Guard: la verificación de extracto byte-idéntico-a-substring del paso 2 — cualquier extracto no presente literalmente en los párrafos fuente es rechazado y la cláusula registra status: "error", error: "excerpt_not_grounded". No hay ruta de alucinación de alta confianza por construcción.
Drift de schema entre versiones de contrato. Una actualización de taxonomía que cambia liability_cap de string a un objeto {type, amount, period} rompe silenciosamente a cada consumidor downstream. Guard: fija extractor_version en references/2-output-schema.json y súbelo en cada cambio de taxonomía o schema. Los consumidores downstream se basan en la versión, no en una asunción de estabilidad.
Resolución de términos definidos. “As set forth in Schedule A” devuelve la referencia, no el valor. Guard: el Skill detecta as set forth in / as defined in y emite confidence: medium con note: "cross-reference, manual resolution required". La auto-resolución ingenua es peor que la flag honesta.
No es asesoría legal. La extracción es mecánica. Si un cap de 12 meses es aceptable para este trato es criterio que se queda con el abogado.

Stack

Claude — orquestación de extracción de texto, extracción de cláusulas anclada con citas, validación de schema
Ironclad (opcional) — destino CLM principal para los registros extraídos. Ver también alternatives-to-ironclad y la comparación de best CLM platforms si todavía estás eligiendo uno.
Contexto de CLM — qué es CLM y dónde encaja la extracción.

Editar esta página en GitHub

Archivos de este artefacto

Descargar todo (.zip)

---
name: clause-extraction
description: Extract a fixed set of contract clauses from a single .pdf or .docx and emit citation-grounded JSON with page/span references. Use after intake to backfill CLM metadata, build a clause library, or surface change-of-control / liability terms during diligence.
---

# Clause extraction

## When to invoke

Invoke this skill per contract, after the document has been ingested and you need a structured clause record (governing law, liability cap, term, auto-renewal, indemnification, payment terms, IP ownership, confidentiality term, termination triggers, plus any custom clauses you configure).

Typical callers:

- CLM backfill — populating Ironclad / Agiloft / DealHub metadata for a legacy contract repository
- Diligence — surfacing change-of-control, assignment, MFN clauses on a target company's contract set before deal close
- Clause library — building a corpus of "what we actually agreed to" across a portfolio so the playbook reflects reality

Do NOT invoke this skill for:

- **Privileged drafts in active negotiation** — per AI policy in most legal teams, in-flight negotiation drafts (especially with outside counsel redlines) do not get sent to AI tooling. This skill is for executed or near-final contracts that have already cleared privilege.
- **Anything via non-Tier-A AI vendors.** Run only against the firm-approved Tier-A model endpoint (Anthropic API or your enterprise Claude tenant). A general-purpose chatbot, browser plugin, or unvetted SaaS wrapper is a privilege-leak vector — refuse the invocation rather than route around the AI policy.
- Drafting or redlining clauses (this skill reads only)
- Interpreting legal effect (the output is text + citation; legal judgment stays with counsel)

## Inputs

- Required: `contract_path` — absolute path to a `.pdf` or `.docx`. PDFs must be text-based or pre-OCR'd; scanned-image PDFs without an OCR layer are rejected at step 1.
- Required: `taxonomy` — path to `references/clause-taxonomy.md` (or a custom taxonomy keyed by contract type). Defines the clauses to look for and the expected value type (string, number, boolean, enum).
- Required: `output_schema` — path to `references/output-schema.json`. The JSON Schema the output must validate against. Schema drift across contract versions is the #1 source of downstream pipeline breakage; pinning the schema per run guards against it.
- Optional: `contract_type` — `msa | sow | nda | dpa | order_form`. Selects the clause subset from the taxonomy. Defaults to `msa`.
- Optional: `custom_clauses` — array of additional clause names to look for beyond the taxonomy defaults (e.g. `data_residency_clause`, `most_favored_customer_clause`).

## Reference files

Read these from `references/` before processing. They are templates — replace the placeholder content with your firm's real taxonomy and schema before running on production contracts.

- `references/clause-taxonomy.md` — clause definitions per contract type, with the value type, required/optional flag, and synonym phrases the extraction step matches against
- `references/output-schema.json` — the JSON Schema every emitted record must validate against
- `references/citation-format.md` — citation grammar (page + span anchor) and the rules for "not present" / "could not extract" fallbacks

## Method

Run these steps in order. Do not parallelize — later steps depend on the artifacts produced by earlier ones.

### 1. Text extraction with layout preservation

For `.docx`: parse via the docx XML and emit a flat text stream with paragraph indices and section headings preserved.

For `.pdf`: use a text-layer extractor (pdfplumber or pdfminer.six) that preserves page numbers and bounding-box character spans. If the PDF has no text layer (scanned image), abort with `error: "ocr_required"` rather than silently emitting empty text. Routing a scanned PDF to OCR is a separate upstream concern; this skill does not OCR.

The output of step 1 is a list of `{page, paragraph_index, char_span, text}` records. Every later citation references these coordinates.

### 2. Citation-grounded extraction (one pass per clause)

For each clause in the taxonomy:

1. Locate candidate paragraphs by heading match (e.g. "Governing Law", "Term") and synonym phrase match (e.g. "shall be governed by", "initial term of this Agreement").
2. Pass the candidate paragraphs (not the full contract) to Claude with the clause definition and ask for: the value, the verbatim source excerpt (≤ 280 chars), the `{page, char_span}` citation, and a `confidence` score (`high | medium | low`).
3. **Reject any extracted excerpt that is not byte-identical to a substring of the source paragraphs.** This is the hallucination guard — if the model returns text not actually in the contract, drop the extraction and record `value: null, error: "excerpt_not_grounded"`.

Why one pass per clause and not a single mega-prompt: per-clause prompts let you retry only the failures, cap each call's input tokens (cheaper, faster), and isolate hallucination failures to a single field instead of the whole record.

### 3. Schema validation

Validate the assembled record against `output-schema.json`. If validation fails, emit the validation error in the output's `errors` array. Do not silently coerce types.

### 4. "Not present" fallback

If a clause is not located in step 2 (no candidate paragraphs above confidence threshold), emit `value: null, status: "not_present", note: "Searched headings: [...]; no matching paragraphs found."` Do not guess. "Not present" is a load-bearing answer; CLM backfill pipelines treat `null + status:not_present` differently from `null + error:*`.

## Output format

Always emit a single JSON object per contract. Soft constraints below are enforced by `references/output-schema.json`.

```json
{
"contract_file": "vendor_msa_2026.pdf",
"contract_type": "msa",
"extracted_at": "2026-05-03T14:22:00Z",
"extractor_version": "clause-extraction@2026.05",
"clauses": {
"governing_law": {
"value": "Delaware",
"excerpt": "This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflict of laws principles.",
"citation": { "page": 14, "char_span": [1820, 1980] },
"confidence": "high",
"status": "extracted"
},
"liability_cap": {
"value": "12 months fees",
"excerpt": "In no event shall either party's aggregate liability exceed the fees paid by Customer in the twelve (12) months preceding the event giving rise to the claim.",
"citation": { "page": 18, "char_span": [220, 410] },
"confidence": "high",
"status": "extracted"
},
"auto_renewal": {
"value": true,
"renewal_term_months": 12,
"notice_period_days": 90,
"excerpt": "This Agreement shall automatically renew for successive 12-month terms unless either party provides 90 days' written notice of non-renewal.",
"citation": { "page": 3, "char_span": [50, 230] },
"confidence": "high",
"status": "extracted"
},
"most_favored_customer_clause": {
"value": null,
"status": "not_present",
"note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; no matching paragraphs found."
}
},
"errors": []
}
```

## Watch-outs

- **Privilege leak via Tier-B vendor.** Routing a privileged or attorney-work-product document through a non-approved AI endpoint can waive privilege. Guard: hard-coded allowlist of model endpoints (`ALLOWED_ENDPOINTS = ["api.anthropic.com", "<your-enterprise-tenant>"]`) checked at skill startup. Refuse to run if the configured endpoint is not on the list. Document the allowlist owner in your AI policy.
- **OCR-induced text gaps on scanned PDFs.** If step 1 silently emits empty pages from a scanned image PDF, the skill will report many clauses as `not_present` and look like a clean extraction. Guard: step 1 detects pages with < 50 extracted characters and aborts with `ocr_required` rather than producing a misleading "clean" record.
- **Hallucinated clauses.** Models will helpfully invent a "termination for convenience" clause that doesn't exist if asked. Guard: byte-identical excerpt-substring check in step 2 — any excerpt not literally present in the source paragraphs is rejected. Pair with `confidence: low` flagging for human review on the rest.
- **Schema drift across contract versions.** A taxonomy update that changes `liability_cap` from a string to a structured `{type, amount, period}` silently breaks every downstream consumer. Guard: pin `extractor_version` in the output and bump it on every taxonomy or schema change. Downstream consumers key on version, not on the assumption that the schema is stable.
- **Defined-term resolution.** When a clause says "as set forth in Schedule A" the excerpt is the reference, not the value. Guard: detect the substring "as set forth in" / "as defined in" and emit `confidence: medium, note: "cross-reference, manual resolution required"` rather than treating the reference as the answer.
- **Heading-light contracts.** Contracts without clear section headings (older or short-form) extract less reliably. Guard: when fewer than 60% of expected headings match in step 2, mark the whole record `confidence: medium` and note `"heading_density: low"` so downstream QA routes it to human review.

# Clause taxonomy — TEMPLATE

> Replace this template's contents with your firm's actual clause taxonomy
> per contract type. The clause-extraction skill reads this file on every
> run; without your real taxonomy, extractions will use the generic defaults
> below and miss the clauses your CLM cares about.

The skill keys on `clause_id`. Every clause record in the output JSON uses the `clause_id` as the property name. Adding a clause means: add an entry here AND add the matching property to `output-schema.json`.

## Convention

For each clause:

- `clause_id` — snake_case identifier used as the JSON key
- `value_type` — `string | number | boolean | enum | structured`
- `required` — `true | false` (drives `not_present` vs hard error in validation)
- `headings` — list of section heading strings the locator matches against
- `synonyms` — list of phrase substrings the locator falls back to when no heading matches
- `value_hint` — what the extractor should pull (e.g. "the named jurisdiction state or country", "12-month-fees / 24-month-fees / unlimited / other")

## MSA defaults

### governing_law

- value_type: `string`
- required: true
- headings: `["Governing Law", "Choice of Law", "Applicable Law"]`
- synonyms: `["shall be governed by", "construed in accordance with the laws of"]`
- value_hint: the named jurisdiction (state, province, or country)

### liability_cap

- value_type: `structured` — `{ type: "fees_period" | "fixed_amount" | "unlimited" | "other", amount?: number, period_months?: number, currency?: string }`
- required: true
- headings: `["Limitation of Liability", "Liability Cap", "Cap on Liability"]`
- synonyms: `["aggregate liability shall not exceed", "in no event shall either party's liability exceed"]`
- value_hint: extract the cap amount or formula. Distinguish indirect-damages exclusions (do NOT extract those here) from the cap itself.

### indemnification

- value_type: `structured` — `{ ip_indemnity: boolean, mutual: boolean, carveouts: string[] }`
- required: true
- headings: `["Indemnification", "Indemnity"]`
- synonyms: `["shall defend, indemnify and hold harmless"]`
- value_hint: pull the IP indemnity boolean and the carveouts list (e.g. combinations, modifications, open source).

### term_length_months

- value_type: `number`
- required: true
- headings: `["Term", "Term and Termination"]`
- synonyms: `["initial term of this Agreement", "shall commence on the Effective Date"]`
- value_hint: convert years to months (3-year term → 36).

### auto_renewal

- value_type: `structured` — `{ enabled: boolean, renewal_term_months?: number, notice_period_days?: number }`
- required: true
- headings: `["Renewal", "Term and Termination"]`
- synonyms: `["shall automatically renew", "evergreen", "successive renewal terms"]`

### termination_triggers

- value_type: `structured` — `{ for_convenience: { allowed: boolean, notice_days?: number }, for_cause: { material_breach_cure_days?: number }, for_insolvency: boolean }`
- required: true
- headings: `["Termination"]`
- synonyms: `["may terminate this Agreement", "for material breach"]`

### payment_terms

- value_type: `structured` — `{ net_days: number, currency: string, late_fee_apr?: number }`
- required: true
- headings: `["Payment", "Fees and Payment", "Invoicing"]`
- synonyms: `["payable within", "net thirty (30) days"]`

### ip_ownership

- value_type: `enum` — `vendor | customer | joint | work_for_hire | other`
- required: true
- headings: `["Intellectual Property", "Ownership", "IP Rights"]`
- synonyms: `["all right, title and interest"]`

### confidentiality_term_months

- value_type: `number`
- required: true
- headings: `["Confidentiality", "Non-Disclosure"]`
- synonyms: `["confidentiality obligations shall survive", "for a period of"]`
- value_hint: convert years to months. If trade-secret carveout is "in perpetuity", emit `-1` and set `confidence: medium`.

## NDA defaults

(Replace with your NDA-specific taxonomy. Typical: `term_months`, `survival_period_months`, `permitted_purposes`, `residual_rights`, `return_or_destroy`.)

## DPA defaults

(Replace with your DPA-specific taxonomy. Typical: `data_residency`, `subprocessor_consent`, `audit_rights`, `breach_notification_hours`, `sccs_module_used`.)

## Custom clauses (firm-specific)

Add your firm-specific clauses here. Examples to consider:

- `change_of_control_clause` — boolean + carveouts
- `most_favored_customer_clause` — boolean + scope
- `data_residency_clause` — enum of jurisdictions
- `assignment_restriction` — enum: `no_restriction | consent_required | prohibited`
- `non_solicit_term_months` — number

## Last edited

{YYYY-MM-DD} — bump on every taxonomy change. The extractor records this date in `extractor_version` so downstream consumers can detect schema drift.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://ooligo.io/schemas/clause-extraction/v1.json",
  "title": "Clause extraction record",
  "description": "Output of the clause-extraction skill, one record per contract. Replace this template with your firm's pinned schema. Bump $id on every breaking change so downstream consumers can key on version.",
  "type": "object",
  "required": ["contract_file", "contract_type", "extracted_at", "extractor_version", "clauses", "errors"],
  "additionalProperties": false,
  "properties": {
    "contract_file": { "type": "string", "minLength": 1 },
    "contract_type": { "type": "string", "enum": ["msa", "sow", "nda", "dpa", "order_form"] },
    "extracted_at": { "type": "string", "format": "date-time" },
    "extractor_version": { "type": "string", "pattern": "^clause-extraction@[0-9]{4}\\.[0-9]{2}$" },
    "source_metadata": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "page_count": { "type": "integer", "minimum": 1 },
        "heading_density": { "type": "string", "enum": ["high", "medium", "low"] },
        "ocr_layer_present": { "type": "boolean" }
      }
    },
    "clauses": {
      "type": "object",
      "description": "One property per clause_id from the taxonomy. Use $defs/clauseRecord for each.",
      "additionalProperties": { "$ref": "#/$defs/clauseRecord" }
    },
    "errors": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["code", "message"],
        "properties": {
          "code": { "type": "string", "enum": ["ocr_required", "schema_validation_failed", "endpoint_not_allowed", "excerpt_not_grounded", "taxonomy_mismatch"] },
          "message": { "type": "string" },
          "clause_id": { "type": "string" }
        }
      }
    }
  },
  "$defs": {
    "clauseRecord": {
      "type": "object",
      "required": ["status"],
      "additionalProperties": true,
      "properties": {
        "status": { "type": "string", "enum": ["extracted", "not_present", "error"] },
        "value": {
          "description": "Type depends on the clause's value_type in the taxonomy. Null when status != extracted.",
          "anyOf": [
            { "type": "string" },
            { "type": "number" },
            { "type": "boolean" },
            { "type": "object" },
            { "type": "null" }
          ]
        },
        "excerpt": {
          "type": "string",
          "maxLength": 280,
          "description": "Verbatim substring of the source. MUST be byte-identical to text in the source paragraphs — the hallucination guard rejects any excerpt not literally present."
        },
        "citation": {
          "type": "object",
          "required": ["page", "char_span"],
          "additionalProperties": false,
          "properties": {
            "page": { "type": "integer", "minimum": 1 },
            "char_span": {
              "type": "array",
              "minItems": 2,
              "maxItems": 2,
              "items": { "type": "integer", "minimum": 0 },
              "description": "[start_char_offset, end_char_offset] within the page's extracted text."
            }
          }
        },
        "confidence": { "type": "string", "enum": ["high", "medium", "low"] },
        "note": { "type": "string" },
        "error": { "type": "string" }
      },
      "allOf": [
        {
          "if": { "properties": { "status": { "const": "extracted" } } },
          "then": { "required": ["value", "excerpt", "citation", "confidence"] }
        },
        {
          "if": { "properties": { "status": { "const": "not_present" } } },
          "then": { "required": ["note"] }
        },
        {
          "if": { "properties": { "status": { "const": "error" } } },
          "then": { "required": ["error"] }
        }
      ]
    }
  }
}

# Citation format — TEMPLATE

> The clause-extraction skill emits a citation on every extracted clause so
> the downstream reviewer can verify the extraction in seconds rather than
> re-reading the contract. Without a usable citation grammar, extractions
> are unfalsifiable — and unfalsifiable extractions become silent CLM data
> rot. This file pins the format and the fallback rules.

## Citation grammar

A citation is a structured object, not a string. The skill emits:

```json
{
  "page": 14,
  "char_span": [1820, 1980]
}
```

- `page` — 1-indexed page number in the source PDF, or paragraph cluster index for `.docx` (since `.docx` has no fixed pagination).
- `char_span` — `[start, end]` character offsets within the page's extracted text, where `start` is inclusive and `end` is exclusive.

The `excerpt` field on the clause record is the verbatim substring at that span. The skill enforces that `page_text[char_span[0]:char_span[1]] == excerpt`. If the assertion fails, the extraction is rejected and the clause is recorded with `status: "error", error: "excerpt_not_grounded"`.

## Why structured, not "p. 14, ¶ 3"

Free-text citations like "p. 14, paragraph 3" cannot be machine-verified. A reviewer cannot click them. A pipeline cannot diff them across re-runs. A regression test cannot assert "the citation moved by exactly N characters when we re-ran extraction after taxonomy v2." Structured citations make every extraction reproducible and reviewable.

## Reviewer UX expectations

Downstream tooling (CLM, review queue, audit log) is expected to render the citation as a deep link into the source PDF page with the excerpt highlighted. Without that affordance, reviewers fall back to ctrl-F on the excerpt — which works, but doubles review time.

Recommended renderer behavior:

- Display the excerpt with the citation page badge inline
- On click, open the source PDF at the cited page with the excerpt highlighted (PDF.js supports `#highlight=<text>`)
- Show `confidence` as a colored chip: high = green, medium = amber, low = red

## "Not present" — the load-bearing answer

When a clause is not located, the citation is omitted and `status: "not_present"` is set with a `note` field documenting the search:

```json
{
  "value": null,
  "status": "not_present",
  "note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; searched synonyms: ['most favored', 'no less favorable']; no matching paragraphs found."
}
```

This is intentionally explicit. CLM backfill pipelines treat a `null` with `status: "not_present"` as confirmed-absent (file the contract without that field) and a `null` with `status: "error"` as needs-rerun (do not file). Conflating the two corrupts CLM data over time.

## Cross-reference handling

When the matched paragraph is a pointer ("as set forth in Schedule A"), the skill emits:

```json
{
  "value": "see Schedule A",
  "excerpt": "Liability shall be limited as set forth in Schedule A.",
  "citation": { "page": 18, "char_span": [220, 274] },
  "confidence": "medium",
  "note": "cross-reference; manual resolution required"
}
```

Resolving cross-references is out of scope. The skill could chase the reference into Schedule A, but the failure modes (mis-numbered schedules, amendments overriding the schedule, partially-resolved chains) make naive resolution worse than an honest "needs human" flag.

## Confidence calibration

| Confidence | Meaning | Reviewer action |
|---|---|---|
| `high` | Heading match + synonym match + clean excerpt grounded in source | Spot-check 10% sample; trust the rest |
| `medium` | Synonym match without heading, OR cross-reference, OR low heading density on the contract overall | Review every record |
| `low` | Multiple candidate paragraphs and the model picked one with weak signal, OR excerpt is ≥ 200 chars | Review every record before filing |

The skill MUST NOT emit `high` for a record that did not pass the byte-identical excerpt check. There is no "high-confidence hallucination" case — by construction.

## Last edited

{YYYY-MM-DD}