claude-skill

Claude SkillによるSalesforceデータクリーンアップ

Difficulty

上級

Setup time

90min

For

revops

RevOps

Stack

Salesforceをスキャンして、レポートを静かに歪める「データのゴミ」を検出するClaude Skillです。重複アカウント、孤立コンタクト、不正リード、形式不正な電話番号、アカウント-コンタクトの不一致、ファネル定義に違反するステージ値を検出し、オペレーターが書き込み前に承認するCSVとして修正案を提案します。スキルはドライランと人間の承認なしに書き込みを行わず、適用されたすべての変更はカスタム監査オブジェクトに記録されて元に戻せます。

フルバンドルはにあります。3つの参照ファイルがオペレーターの記入可能なスキャフォールドとして機能します。マッチキーと類似度しきい値のための、ステージごとの必須フィールドセットのための、マージで勝つレコードを選択するための重みのための。

使うべき場面

基礎となるオブジェクトデータがチームがクリーニングできる速さよりも速く劣化し、レポートが信頼できなくなったときに使用します。具体的なトリガー：ボードのARR数字とCROのパイプラインビューが数パーセント以上乖離している、SDRチームが3つの異なるアカウントレコードで同じ見込み客に当たっていると不満を言っている、コンタクトが間違ったアカウントに存在するためマーケティングアトリビューションダッシュボードが重複カウントしている、アカウントの4分の1でファーモグラフィックフィールドが欠落しているため年次ICPの再セグメンテーションがブロックされている。

このスキルは、既存のデデュープツールが棚卸し効果を生み出した場合にも適切な選択肢です。RevOpsがライセンスを持っているが、誰も提案を信頼して行動しない場合です。スキルの差別化要因は、すべての提案されたマージが発動した決定論的キーと選択を動かしたサバイバー選択シグナルを引用した1ペアごとの理由行とともに送られることです。

使ってはいけない場面

リードキャプチャ時のリアルタイムデデュープゲートが必要な場合。スキルはチャンクでスキャンするバッチツールであり、同期検証ルールではありません。

スキルに書き込みを自動適用させる必要がある場合。設計上、自動モードはありません。すべての修正は、が行に触れる前にオペレーターがをマークしなければならないドライランCSVを通過します。

GDPRまたはCCPAの消去権リクエストに対応している場合。プラットフォームの文書化されたPII消去フローを使用します。

ゴミ箱をバイパスするハード削除が必要な場合。スキルにはハード削除のコードパスがありません。

書き込みスコープのトークンを使用して本番環境に対して最初の実行を行う場合。スキルが書き込み認証情報を受け入れる前に、2回の読み取り専用スキャンサイクルが必要です。

セットアップ

バンドルをにドロップします。スキルローダーはとディレクトリを自動的にピックアップします。
を読み取り専用のConnected Appトークンに、をサンドボックスエンドポイントに設定します。スキルはデフォルトでを使用します。
3つの参照ファイルの内容をチームの実際のルールに置き換えます。テンプレートはスキャフォールドです。
の「Method, step 5」に記載されたフィールド形状を使用して、カスタムSObjectをプロビジョニングします。
最初のディスカバリースキャンを実行します。。

スキルの実際の動作

スキルは5つのステップを順番に実行します。ディスカバリースキャンはBulk APIでチャンク単位で各スコープのSObjectを取得します。100kアカウントのorgに対する単一のRESTクエリはガバナー制限に達するためです。

デデュープは2パスのハイブリッドを使用します。パス1は決定論的（小文字化ドメイン、E.164正規化電話番号、NFKD正規化名前）。単一の強いキーでの完全一致は信頼度でプロポーザルCSVに進みます。パス2はClaude意味類似度比較ですが、弱い決定論的シグナルをすでに共有する候補ペアのみに限定されます。

サバイバー選択は複合スコアを使用します：活動最新性に0.4の重み、添付コンタクト数に0.3、Opportunityヒストリーに0.2、がインテグレーションユーザーかどうかに0.1。

ドライランは、、、、、、、および列を含むCSVを出力します。

コスト試算

100kアカウント、500kコンタクトのorgに対するディスカバリースキャンは約20分の壁時計時間で実行され、約3〜5ドルのClaude APIトークンを消費します。

ヘッドカウントの計算が本当の話です。上記規模での典型的なRevOpsクリーンアップスプリントは四半期ごとにアナリスト1人の2〜3週間を要します。スキルはそれを四半期ごとに約2日に圧縮します。

成功指標

スキャンごとに1つの数値を監視します：信頼度の提案のうちオペレーターが最初のレビューで承認する割合。最初の実行では通常50%未満。3〜4回のサイクル内で80%以上に達するべきです。

代替手段との比較

DemandToolsはこの分野の現役です。成熟した、決定論的な、GUI駆動のツールです。チームがすでにDemandToolsを使用していてルールセットが成熟している場合はそこに留まります。

Cloudingoはより近い比較点です。あいまいマッチングとレビューして適用ワークフローを持ちます。スキルのエッジは1ペアごとの理由行とgitでデデュープルールをバージョン管理できる参照ファイルモデルです。

注意点

最初の実行で書き込みアクセスを付与することが最も一般的な失敗です。対策：スキルは監査ログに事前ドライラン行がゼロの場合にを拒否します。

アカウントマージが間違ったレコードにカスケードすることが最もコストのかかる失敗です。対策：サンドボックスリハーサルとなしに行のを拒否します。

担当者へのコミュニケーションなしにマージすることが文化的な失敗です。対策：スキルはすべての適用実行と共にを出力します。

スタック

Salesforce — 真実のソースと書き込みのターゲット。カスタムオブジェクトが可逆的な監査ログを保持します
Claude — 意味類似度パスを実行し、1ペアごとの理由行を出力します
Bulk API — 読み取りと書き込みの両方に使用

GitHubでこのページを編集

Files in this artifact

Download all (.zip)

---
name: salesforce-data-cleanup
description: Find and propose fixes for Salesforce data-quality issues — duplicate accounts, orphan contacts, junk leads, malformed phones, mismatched account-contact relationships, and stage values that violate funnel definitions. Always proposes; never writes without explicit dry-run review and approval.
---

# Salesforce data cleanup

## When to invoke

Whenever you need to surface and propose fixes for Salesforce data-quality issues that are quietly distorting reporting: duplicate accounts and contacts, orphan records (contacts with no account, opportunities with no contact roles), junk leads (test records, role-address contacts, free-mail in B2B), malformed phones and websites, account-contact mismatches (contact email domain does not match account website domain), and stage values that violate the funnel definition (e.g. "Closed Won" with no Close Date, "Negotiation" with zero opportunity contact roles).

The Skill takes an object scope, a dedup-rules file, a stage-definition file, and an ICP rubric. It produces a discovery scan, then per-issue dry-runs as CSVs the operator approves before any write.

Do NOT invoke this skill for:

- Auto-applying writes without a dry-run + human approval. Every fix goes through the dry-run CSV gate. There is no "auto" mode.
- First runs against production with write access. The first two scan cycles must use a read-only API token. Write credentials only after the operator has reviewed the false-positive rate.
- Hard-deletes. The Skill never bypasses the recycle bin. Deletes are soft by default; permanent purges are out of scope and a deliberate manual decision.
- Bulk merges in production without a sandbox rehearsal. Account merges cascade to opportunities, activities, and contact roles — sandbox first.
- Data residency / PII purge requests under GDPR or CCPA. Use the platform's documented right-to-erasure flow, not this Skill.

## Inputs

- Required: `scope` — comma-separated SObject names to scan (e.g. `Account,Contact,Lead,Opportunity`). Default: all four.
- Required: `dedup_rules_path` — path to a `references/dedup-rules.md` style file naming the deterministic match keys (domain, normalized phone, normalized name) and the fuzzy-match thresholds.
- Required: `stage_definitions_path` — path to a `references/stage-definitions.md` style file naming the required field set per Opportunity stage.
- Required: `icp_rubric_path` — path to a `references/icp-rubric.md` style file used to evaluate whether orphan or low-signal records are worth keeping.
- Optional: `survivor_ranking_path` — path to a `references/survivor-ranking.md` style file with weights for activity recency, contact count, opportunity history, and last-modified-by user trust. Default weights apply if absent.
- Optional: `sandbox` — boolean, default `true`. Forces the Skill to talk to the configured sandbox endpoint, not production.
- Optional: `chunk_size` — integer, default `10000`. Bulk API query chunk size.

## Reference files

Always read the following from `references/` before scanning. Without them, the discovery output is generic and the dry-run proposals will not match the operator's actual definitions.

- `references/dedup-rules.md` — deterministic and fuzzy match rules (replace the template with your real rules)
- `references/stage-definitions.md` — required fields per Opportunity stage (replace with your actual funnel definition)
- `references/survivor-ranking.md` — weights for choosing the survivor in a merge proposal (replace with your actual priorities)

## Method

Run these five sub-tasks in order. Do not parallelize: later steps depend on the survivor-selection context produced by earlier steps.

### 1. Discovery scan

Pull each in-scope SObject via the Bulk API in chunks (default 10k rows). Bulk API, not REST query, because a full Account scan on a real org is 100k+ rows and REST chops on governor limits. Chunked because a single 1M-row pull blows past memory and the job times out at the org's configured Bulk API batch ceiling.

For each chunk, classify rows into issue buckets:

- `dedup_account` / `dedup_contact` / `dedup_lead`
- `orphan_contact` (no AccountId), `orphan_opp` (no contact roles)
- `junk_lead` (free-mail in B2B, role addresses, test patterns)
- `format_violation_phone` / `format_violation_website` (regex-fail)
- `mismatch_account_contact` (contact email domain != account website)
- `stage_violation` (required field per stage missing)

Output: a one-page scan summary with per-bucket counts and confidence band.

### 2. Dedup pass — deterministic + semantic hybrid

For dedup buckets the Skill runs a two-pass pipeline. Pass one is pure deterministic regex/normalization: lowercased email domain, E.164-normalized phone, NFKD-normalized + lowercased + suffix-stripped name. Exact matches on either domain or normalized phone go straight to the proposal CSV.

Pass two is Claude semantic similarity, applied only to candidate pairs that share at least one weak deterministic signal (same first 6 digits of phone, same first token of name, same parent-domain TLD). Hybrid because pure regex misses "Acme, Inc." vs "Acme Incorporated — APAC", and pure semantic similarity is too expensive to run pairwise across N^2 records and produces too many false positives on common names. The narrow-then-rank approach holds the per-scan token spend to under $5 for a 100k-Account org.

Confidence bands: `high` (deterministic match on 2+ keys), `medium` (deterministic on 1 key + semantic similarity ≥ 0.85), `low` (semantic only). The dry-run CSV ships only `high` and `medium` by default.

### 3. Survivor selection for merges

For each duplicate pair, propose a survivor using a weighted score:

- Activity recency (last 90 days of Tasks + Events): weight 0.4
- Contact count attached: weight 0.3
- Opportunity history (count + total Amount): weight 0.2
- LastModifiedBy not equal to integration user: weight 0.1

The Skill ranks both records, proposes the higher score as survivor, and emits a per-pair rationale line in the dry-run CSV. RevOps reviews and overrides via a `survivor_override` column before approving.

This composite is used because no single signal is reliable: most-recent modification often points at an integration backfill, contact count favors old crusty records, and Opportunity Amount alone discards the active relationship. The composite tracks "where the team is actually working."

### 4. Dry-run CSV per issue

For the issue the operator picks, emit a CSV with one row per proposed change. Columns: `Id`, `Object`, `Operation` (`merge` / `update` / `soft_delete`), `Field`, `Old_Value`, `New_Value`, `Confidence`, `Survivor_Id` (merges only), `Rationale`, `Approve` (operator-set, default blank).

The Skill writes nothing until a CSV with `Approve=Y` rows is passed back into `apply_fix`. Rows without `Approve=Y` are skipped, including blanks. This is the gate.

### 5. Apply with audit

`apply_fix(issue_id, csv_path)` reads the approved CSV and posts the writes in chunks via the Bulk API. Every write is logged to a custom SObject `Cleanup_Audit__c` with: `Operation`, `Target_Id`, `Field`, `Old_Value_JSON`, `New_Value_JSON`, `Survivor_Id`, `Run_Id`, `Approver_User`, `Timestamp`. The audit record makes the change reversible: a `revert(run_id)` companion script reads the audit log and re-applies `Old_Value_JSON`.

## Output format

The discovery scan returns a Markdown report and writes one dry-run CSV per issue picked. Literal example:

```markdown
# Salesforce data-cleanup scan — 2026-05-03

Scope: Account, Contact, Lead, Opportunity
Sandbox: true
Run ID: run_2026-05-03_a8c1

## Summary by issue type

| Issue | Count | High conf | Medium conf | Low conf |
|----------------------------|------:|----------:|------------:|---------:|
| dedup_account | 1,284 | 412 | 736 | 136 |
| dedup_contact | 8,902 | 3,118 | 4,901 | 883 |
| orphan_contact | 1,071 | 1,071 | - | - |
| junk_lead | 942 | 942 | - | - |
| format_violation_phone | 4,820 | 4,820 | - | - |
| mismatch_account_contact | 612 | 612 | - | - |
| stage_violation | 217 | 217 | - | - |

Pick one to dry-run: `dry_run_fix(issue_id="dedup_account")`
```

Per-issue dry-run CSV (first three lines):

```csv
Id,Object,Operation,Field,Old_Value,New_Value,Confidence,Survivor_Id,Rationale,Approve
0011x00000ABCD1,Account,merge,,,,high,0011x00000WXYZ9,"E.164 phone match + same primary domain; survivor has 14 activities in last 90d vs 0",
0031x00000EFGH2,Contact,update,Email,john@oldmail.example,john@newmail.example,high,,Domain matches account website,
00Q1x00000IJKL3,Lead,soft_delete,,,,high,,Free-mail (gmail) in B2B segment + no activity 18mo,
```

Audit log entry shape (one row in `Cleanup_Audit__c` per write):

```json
{
"Operation": "merge",
"Target_Id": "0011x00000ABCD1",
"Field": null,
"Old_Value_JSON": "{\"Name\":\"Acme, Inc.\",\"Phone\":\"+1-415-555-0100\"}",
"New_Value_JSON": "{\"merged_into\":\"0011x00000WXYZ9\"}",
"Survivor_Id": "0011x00000WXYZ9",
"Run_Id": "run_2026-05-03_a8c1",
"Approver_User": "revops.lead@example.com",
"Timestamp": "2026-05-03T17:42:11Z"
}
```

## Watch-outs

- **Granting write access on the first run.** The first scan will surface false positives because it is also surfacing flaws in the dedup ruleset. Guard: the Skill refuses to run `apply_fix` if the configured token has write scope and `Cleanup_Audit__c` shows zero prior `Operation=dry_run` rows for the run's scope. Two read-only cycles minimum.
- **Account merge cascading to wrong records.** Account merges propagate to all child Opportunities, Contacts, Tasks, and Events. A wrong survivor destroys the wrong history. Guard: `apply_fix` for `dedup_account` rows refuses to run unless `sandbox=true` has been set on at least one prior run with the same `Run_Id` prefix in the last 14 days, and the operator has set `--rehearsed=true` explicitly.
- **Hard-deletes bypassing recycle bin.** Operators sometimes ask the Skill to "really delete" — to skip the recycle bin and free storage. Guard: the Skill has no hard-delete code path. `soft_delete` is the only delete operation; recycle-bin emptying is a manual platform action.
- **Reps waking up to merged accounts.** A clean dedup run with no comms burns trust faster than the bad data did. Guard: the Skill emits a `change_brief.md` alongside every applied run, listing the merge map (loser-Id → survivor-Id, owner email, count of moved opps) ready to paste into a Slack channel before reps log in.

# Dedup rules — TEMPLATE

> Replace this template's contents with your team's actual matching rules.
> The Skill reads this file on every run; without your real rules, the
> dedup proposals will be generic and produce a high false-positive rate
> on the first scan.

## Object scope

Which SObjects this ruleset applies to. The Skill runs the rules per object; rules can differ across Account / Contact / Lead.

- [ ] Account
- [ ] Contact
- [ ] Lead

## Deterministic match keys

Pass one of the dedup pipeline. Two records that match on any single key listed here are surfaced as `high` confidence dedup candidates.

### Account

| Key                       | Normalization                                  | Match strength |
|---------------------------|------------------------------------------------|----------------|
| Primary domain            | lowercase, strip `www.`, strip TLD aliases     | strong         |
| Billing phone             | E.164 normalize, drop extension                | strong         |
| Stripped name + country   | NFKD, lowercase, drop suffixes (Inc/LLC/GmbH)  | medium         |
| D-U-N-S number            | exact                                          | strong         |

### Contact

| Key                       | Normalization                                  | Match strength |
|---------------------------|------------------------------------------------|----------------|
| Email                     | lowercase, strip `+tag`, strip dots in gmail   | strong         |
| Mobile phone              | E.164 normalize                                | strong         |
| Full name + AccountId     | NFKD, lowercase, drop punctuation              | medium         |

### Lead

| Key                       | Normalization                                  | Match strength |
|---------------------------|------------------------------------------------|----------------|
| Email                     | lowercase, strip `+tag`                        | strong         |
| Phone                     | E.164 normalize                                | strong         |

## Fuzzy / semantic similarity threshold

Pass two of the pipeline. Only candidates that share at least one weak signal (same first 6 digits of phone, same first token of name, same parent-domain TLD) are submitted for semantic comparison.

| SObject | Cosine similarity threshold | Notes                                    |
|---------|-----------------------------|------------------------------------------|
| Account | 0.85                        | Lower threshold misses APAC/EU branches  |
| Contact | 0.90                        | Higher because false positives expensive |
| Lead    | 0.80                        | Lower because cleanup tolerates churn    |

## Disqualifiers — never propose merge

Even if rules match, do not propose merge if any of the following holds:

- Records belong to different `RecordType` (e.g. Customer vs Partner)
- Records are owned by different BU `Sales_Region__c` values
- One record has an active Opportunity in `Negotiation` or later — defer to rep review, do not auto-propose
- Records carry a `Do_Not_Merge__c = TRUE` flag set by an admin

## Last edited

{YYYY-MM-DD}

# Stage definitions — TEMPLATE

> Replace this template's contents with your team's actual Opportunity
> stage definition. The Skill uses this to flag rows in the
> `stage_violation` bucket — opportunities sitting in a stage without
> the fields that stage requires.

## Funnel stages

List every active value of the Opportunity `StageName` picklist, in funnel order. Mark which stages are open vs closed-won vs closed-lost.

| Order | StageName             | Type        | Probability % |
|------:|-----------------------|-------------|---------------:|
| 1     | Prospecting           | open        |              5 |
| 2     | Qualification         | open        |             15 |
| 3     | Discovery             | open        |             30 |
| 4     | Solutioning           | open        |             50 |
| 5     | Negotiation           | open        |             70 |
| 6     | Verbal Commit         | open        |             90 |
| 7     | Closed Won            | closed-won  |            100 |
| 8     | Closed Lost           | closed-lost |              0 |

## Required fields per stage

A row sitting in stage N must have all fields listed for N (and all earlier stages) populated. The Skill flags rows that violate this as `stage_violation` candidates.

### Qualification

- `AccountId` not null
- `Primary_Contact__c` not null
- `Lead_Source` not null

### Discovery

- All of Qualification, plus:
- At least 1 OpportunityContactRole with `IsPrimary = TRUE`
- `Pain_Point__c` not null
- `Compelling_Event__c` not null

### Solutioning

- All of Discovery, plus:
- `Decision_Criteria__c` not null
- `Decision_Process__c` not null
- At least 2 OpportunityContactRoles

### Negotiation

- All of Solutioning, plus:
- `Economic_Buyer_Identified__c = TRUE`
- `Mutual_Action_Plan_URL__c` not null

### Verbal Commit

- All of Negotiation, plus:
- `Procurement_Status__c` not null
- `Close_Date` within the next 45 days

### Closed Won

- All of Verbal Commit, plus:
- `Close_Date` not in the future
- `Amount` not null and greater than zero
- `Contract_URL__c` not null
- At least 1 `Quote` record with `Status = "Accepted"`

### Closed Lost

- `Close_Date` not in the future
- `Loss_Reason__c` not null
- `Competitor__c` not null when `Loss_Reason__c = "Competitor"`

## Skip-stage rules

Stages that may be legitimately skipped (the Skill does not flag a violation if the row passes through them):

- `Solutioning` may be skipped for inbound deals under $10k ACV
- `Verbal Commit` may be skipped for self-serve conversions

## Last edited

{YYYY-MM-DD}

# Survivor ranking — TEMPLATE

> Replace this template's contents with your team's actual weighting.
> The Skill uses these weights when proposing which record survives a
> merge. Defaults below reflect the bias "where the team is actually
> working today" rather than "who has the most history."

## Composite score

For each record in a duplicate pair, the Skill computes:

```
score = 0.40 * activity_recency_signal
      + 0.30 * contact_count_signal
      + 0.20 * opportunity_history_signal
      + 0.10 * not_integration_user_signal
```

Higher score wins survivor status. Ties default to the record with the lower (older) Salesforce Id, which is conventionally the canonical record.

## Signal definitions

### activity_recency_signal

- 1.0 if Tasks + Events in the last 30 days exist
- 0.7 if in the last 90 days
- 0.3 if in the last 365 days
- 0.0 otherwise

### contact_count_signal

- Normalized count of active Contacts attached to the Account. `min(count, 20) / 20`. Capped to avoid one record's bloat dominating.

### opportunity_history_signal

- 0.5 weight on count of Opportunities with `IsClosed = FALSE`
- 0.5 weight on log10(1 + sum of `Amount` across all Opportunities)
- Both halves min-max normalized across the candidate pair only

### not_integration_user_signal

- 1.0 if `LastModifiedById` is not in the integration-user allowlist
- 0.0 otherwise

The integration-user allowlist (set by RevOps admin):

- `dataloader@example.com.invalid`
- `marketo-sync@example.com.invalid`
- `outreach-sync@example.com.invalid`
- {add yours}

## Override columns

Even after the Skill proposes a survivor, RevOps can override per-row in the dry-run CSV. The columns the Skill respects on read-back:

- `survivor_override` — set to the Id you want as survivor
- `do_not_merge` — set to `Y` to drop this row from `apply_fix`
- `defer_to_owner_review` — set to `Y` to route to the rep instead

## Disqualifiers — never auto-propose a survivor

Skip survivor proposal entirely if any of:

- Both records have an open Opportunity in `Negotiation` or later — too much risk of breaking deal motion mid-flight
- Records have different Account Owners and one is in their first 30 days on the team (avoid trampling new-hire pipelines)
- One record carries a `Strategic_Account__c = TRUE` flag — kick to a RevOps human

## Last edited

{YYYY-MM-DD}