podium-contact-dedup
Deduplicate Podium contacts in production and survive the six data-quality failures that come with it.
Provided by Plugin
podium-pack
Claude Code skill pack for Podium (10 production-engineer skills)
Installation
This skill is included in the podium-pack plugin:
/plugin install podium-pack@claude-code-plugins-plus
Instructions
Podium Contact Dedup
Overview
Deduplicate Podium contacts in production and operate the dedup pipeline at scale. This is not a one-shot cleanup script — it is the data-quality layer your integration runs continuously to keep the contact corpus sane while messages, calls, webchats, and reviews keep mutating it. Run it once and Sydney's "0412 345 678" walk-in stops creating a fifth contact next to the four that already exist as +61 412 345 678, (04) 1234-5678, +61412345678, and 0412345678.
The six production failures this skill prevents:
- **Phone format inconsistency** — `+61 412 345 678`, `0412 345 678`, `(04) 1234-5678`, and `+61412345678` are all the same phone but produce four contacts. Operators paste numbers from a CRM, a phone screen, a written form, and a stored fragment; Podium dedups on exact string match, so all four survive and the next caller appears as a fifth.
- **Merge API ordering loses fields** — Podium's merge endpoint takes a `primary` and a `duplicate`; whichever you pick as `primary` keeps its own fields, and the other's fields are discarded. Pick the wrong record (newer but emptier) as primary and the older, richer record's name, tags, and conversation links vanish silently.
- **Opt-out flag lost on merge** — the duplicate had `marketing_opt_out=true`, the primary had `marketing_opt_out=false`; a naive merge keeps the primary's flag and re-enables marketing for a person who explicitly opted out. This is a compliance incident (TCPA, GDPR Article 21, ACMA Spam Act) and a trust incident — the customer opted out, and you marketed at them anyway.
- **Soft-delete vs hard-delete semantic confusion** — Podium's `DELETE /contacts/{uid}` is reversible; the record is hidden, not destroyed. Treat it as terminal and you ship a "contact reappeared after we deleted them" support ticket every time an admin restores a contact via the UI. Hard-delete (purge) is a separate, irreversible endpoint with different scopes.
- **Duplicate detection across locations** — the same phone calls Sydney AND Burleigh Heads, two contacts are created (one per location), and per-location dedup misses it entirely. Cross-location dedup needs a separate routine keyed by the phone natural key across the union of contacts in every `location_uid`, not just within one.
- **Merge conflicts on simultaneous edits** — two operators (or one operator plus one automated job) merge overlapping clusters at the same time; the second merge's `primary` may have already been merged into another record, the API silently merges into a now-stale target, and one operator's intent is dropped without surfacing the conflict.
Prerequisites
- Python 3.10+ with the `phonenumbers` library (`pip install phonenumbers`)
- A working `podium-auth` integration (this skill calls Podium with an authenticated client)
- Read scope: `contacts.read`. Write scope: `contacts.write`. (`contacts.delete` only if hard-purge is in scope.)
- A local SQLite database for the natural-key index and merge state file (default `./podium-dedup.sqlite`)
- A default region for E.164 parsing (`AU` for Australian deployments, `US` for US — set per-tenant)
Instructions
Build in this order. Each section neutralizes one production failure mode.
1. E.164 normalization with natural-key emission (neutralizes phone-format inconsistency)
Every contact's phone is parsed by the phonenumbers library into E.164 form, which then serves directly as a stable "natural key" suitable for an index lookup. Same human-readable phone → same key, regardless of input formatting.
```python
import phonenumbers
from phonenumbers import NumberParseException

def normalize_phone(raw: str, default_region: str = "AU") -> dict:
    """Return {e164, national, country, natural_key, valid} for any input format."""
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except NumberParseException as e:
        return {"valid": False, "reason": f"parse_failed: {e}"}
    if not phonenumbers.is_valid_number(parsed):
        return {"valid": False, "reason": "not_a_valid_number"}
    e164 = phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    national = phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.NATIONAL)
    return {
        "valid": True,
        "e164": e164,          # +61412345678
        "national": national,  # 0412 345 678
        "country": phonenumbers.region_code_for_number(parsed),
        "natural_key": e164,   # the E.164 IS the natural key — no further hashing needed
    }
```
The natural key is the E.164 string itself. Hashing it adds nothing — E.164 is already canonical and bounded in length. Use the E.164 directly as the SQLite primary key on the natural-key index.
2. SQLite-backed natural-key index (neutralizes O(N²) duplicate scans)
A naive dedup scans every pair of contacts — O(N²) on a 50k-contact corpus is hours. Instead, build a `(natural_key → [contact_uid, ...])` index in SQLite once; duplicate detection is then O(N) over the index.
```sql
CREATE TABLE IF NOT EXISTS contact_index (
    contact_uid TEXT PRIMARY KEY,
    location_uid TEXT NOT NULL,
    natural_key TEXT NOT NULL,           -- E.164
    raw_phone TEXT,
    name TEXT,
    field_count INTEGER NOT NULL DEFAULT 0,
    marketing_opt_out INTEGER NOT NULL DEFAULT 0,
    sms_opt_out INTEGER NOT NULL DEFAULT 0,
    email_opt_out INTEGER NOT NULL DEFAULT 0,
    deleted_at_podium TEXT,              -- ISO8601, NULL if live
    updated_at_podium TEXT NOT NULL,
    indexed_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_natural_key ON contact_index(natural_key);
CREATE INDEX IF NOT EXISTS idx_natural_key_per_location ON contact_index(natural_key, location_uid);
```
The field_count column is precomputed at index time so the merge orchestrator picks the richer record as primary without re-fetching every record.
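The O(N) duplicate scan over this index is a single grouped query. A minimal sketch with an in-memory database and hypothetical contact uids (the real pipeline uses the full schema above):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE contact_index (
        contact_uid TEXT PRIMARY KEY,
        location_uid TEXT NOT NULL,
        natural_key TEXT NOT NULL,
        field_count INTEGER NOT NULL DEFAULT 0
    )
""")
db.execute("CREATE INDEX idx_natural_key ON contact_index(natural_key)")

# Hypothetical rows: three contacts sharing one phone, plus one singleton
db.executemany("INSERT INTO contact_index VALUES (?, ?, ?, ?)", [
    ("ct_1", "loc_a", "+61412345678", 5),
    ("ct_2", "loc_a", "+61412345678", 2),
    ("ct_3", "loc_a", "+61412345678", 3),
    ("ct_4", "loc_a", "+61400000000", 4),
])

# One pass over the index: every natural_key held by more than one contact
clusters = db.execute("""
    SELECT natural_key, GROUP_CONCAT(contact_uid) AS uids
    FROM contact_index
    GROUP BY natural_key
    HAVING COUNT(*) > 1
""").fetchall()
print(clusters)  # one duplicate cluster, for +61412345678
```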
3. Duplicate detection with confidence scoring (neutralizes blind merges)
A cluster is a set of contacts sharing the same natural_key. Within a cluster, each pair gets a confidence score in [0.0, 1.0]:
| Factor | Weight |
|---|---|
| Same E.164 (always true within a cluster) | 0.60 — required floor |
| Same name (case-insensitive, normalized) | +0.20 |
| Same email | +0.15 |
| Overlapping tags | +0.05 |
Only clusters with all pairwise scores >= 0.80 auto-merge by default. Lower-scored clusters surface for human review.
```python
def cluster_confidence(a: dict, b: dict) -> float:
    score = 0.60  # same natural_key by construction
    if a.get("name") and a["name"].strip().lower() == (b.get("name") or "").strip().lower():
        score += 0.20
    if a.get("email") and a["email"].strip().lower() == (b.get("email") or "").strip().lower():
        score += 0.15
    a_tags, b_tags = set(a.get("tags") or []), set(b.get("tags") or [])
    if a_tags & b_tags:
        score += 0.05
    return round(min(score, 1.0), 4)
```
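A quick check of the weights against the thresholds — the function is repeated here so the snippet runs standalone, and the contact dicts are hypothetical:

```python
def cluster_confidence(a: dict, b: dict) -> float:
    score = 0.60  # same natural_key by construction
    if a.get("name") and a["name"].strip().lower() == (b.get("name") or "").strip().lower():
        score += 0.20
    if a.get("email") and a["email"].strip().lower() == (b.get("email") or "").strip().lower():
        score += 0.15
    a_tags, b_tags = set(a.get("tags") or []), set(b.get("tags") or [])
    if a_tags & b_tags:
        score += 0.05
    return round(min(score, 1.0), 4)

# Same phone, same name (case-insensitive), no email match, one shared tag:
# 0.60 + 0.20 + 0.05 = 0.85 — above the 0.80 auto-merge threshold
a = {"name": "Sydney Nguyen", "email": "syd@example.com", "tags": ["walk-in"]}
b = {"name": "sydney nguyen", "email": None, "tags": ["walk-in", "vip"]}
print(cluster_confidence(a, b))  # 0.85
```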
4. Merge orchestrator with primary selection (neutralizes lost richer record)
For each auto-mergeable cluster, pick the primary by this deterministic rule, in order:
1. Most fields populated (highest `field_count`) — the richer record wins.
2. Most recently updated (`updated_at_podium`) — break ties toward fresher state.
3. Lowest `contact_uid` (lexical) — final, deterministic, reproducible tiebreak.
Every other contact in the cluster is a duplicate to be merged INTO the primary. Never trust caller-supplied ordering — always compute primary inside the orchestrator.
```python
def select_primary(cluster: list[dict]) -> dict:
    # Stable multi-key sort, applied last-tiebreak-first. ISO8601 timestamps
    # compare correctly as strings, so no datetime parsing is needed.
    ordered = sorted(cluster, key=lambda c: c["contact_uid"])                      # lowest uid first (lexical)
    ordered = sorted(ordered, key=lambda c: c["updated_at_podium"], reverse=True)  # freshest first
    ordered = sorted(ordered, key=lambda c: c["field_count"], reverse=True)        # richest first
    return ordered[0]
```
5. Opt-out preservation by union (neutralizes compliance re-enable)
The strongest setting wins, always. If any record in the cluster has `marketing_opt_out=true`, the merged record has `marketing_opt_out=true`. The same holds for `sms_opt_out` and `email_opt_out`. This rule is non-negotiable, and it is why every cluster's opt-out state is computed BEFORE the merge API call, then forced via `PATCH /contacts/{primary_uid}` immediately after the merge completes.
```python
def union_opt_outs(cluster: list[dict]) -> dict:
    return {
        "marketing_opt_out": any(c.get("marketing_opt_out") for c in cluster),
        "sms_opt_out": any(c.get("sms_opt_out") for c in cluster),
        "email_opt_out": any(c.get("email_opt_out") for c in cluster),
    }
```
The merge-then-patch sequence:
1. Compute `opt_outs = union_opt_outs(cluster)` BEFORE any API call.
2. Call Podium's merge API: `POST /contacts/{primary_uid}/merge` with `{"duplicate_uids": [...]}`.
3. Immediately `PATCH /contacts/{primary_uid}` with `opt_outs` to overwrite whatever Podium's merge left there.
Do not rely on Podium's merge to preserve opt-outs. The PATCH is the canonical source of truth for the final state.
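The merge-then-patch sequence can be sketched as a pure request plan — build the calls first, then hand them to whatever HTTP client `podium-auth` provides. `plan_merge` and the uids are illustrative, not part of the Podium API:

```python
def plan_merge(primary_uid: str, duplicate_uids: list[str], cluster: list[dict]) -> list[tuple]:
    """Return the ordered (method, path, body) calls for one cluster merge."""
    # Union computed BEFORE any API call — the strongest opt-out setting wins
    opt_outs = {
        key: any(c.get(key) for c in cluster)
        for key in ("marketing_opt_out", "sms_opt_out", "email_opt_out")
    }
    return [
        ("POST", f"/contacts/{primary_uid}/merge", {"duplicate_uids": duplicate_uids}),
        ("PATCH", f"/contacts/{primary_uid}", opt_outs),  # canonical final state
    ]

cluster = [
    {"uid": "ct_1", "marketing_opt_out": False},
    {"uid": "ct_2", "marketing_opt_out": True},  # the duplicate's opt-out must survive
]
calls = plan_merge("ct_1", ["ct_2"], cluster)
print(calls[1])  # the PATCH carries marketing_opt_out=True
```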
6. Soft-delete vs hard-delete handling (neutralizes "contact reappeared")
Podium's `DELETE /contacts/{uid}` is soft delete — the record sets `deleted_at` and disappears from default list endpoints but remains restorable via the Podium UI. Treat it as a state change, not a destruction.
The dedup pipeline never hard-deletes. It always:
1. Calls `POST /contacts/{primary_uid}/merge` — Podium soft-deletes the duplicates and links their conversation history to the primary.
2. Records the operation in the local audit log with `operation=merge, soft_delete=true, restorable=true`.
3. If a human admin restores a soft-deleted duplicate via the Podium UI, the next dedup run sees it again and re-merges it. This is by design — the audit log surfaces the loop so a human can decide whether the restore was intentional.
Hard-delete (`/contacts/{uid}?hard=true` — separate scope, separate endpoint) is reserved for compliance erasure requests (GDPR right-to-be-forgotten, CCPA delete request) and runs through a different skill, not this one.
7. Cross-location dedup (neutralizes the Sydney + Burleigh Heads case)
Per-location dedup misses the case where the same phone exists as two separate contacts in two different locations. The cross-location scan runs after per-location dedup completes:
```python
import itertools
import sqlite3

def cross_location_clusters(db: sqlite3.Connection) -> list[list[dict]]:
    """Clusters of contacts sharing a natural_key across DIFFERENT location_uids.

    Assumes db.row_factory = sqlite3.Row so rows are addressable by column name.
    """
    rows = db.execute("""
        SELECT contact_uid, location_uid, natural_key, field_count, updated_at_podium
        FROM contact_index
        WHERE deleted_at_podium IS NULL
          AND natural_key IN (
              SELECT natural_key FROM contact_index
              WHERE deleted_at_podium IS NULL
              GROUP BY natural_key
              HAVING COUNT(DISTINCT location_uid) > 1
          )
        ORDER BY natural_key
    """).fetchall()
    # Rows arrive ordered by natural_key, so one groupby pass yields the clusters
    return [
        [dict(row) for row in group]
        for _, group in itertools.groupby(rows, key=lambda r: r["natural_key"])
    ]
```
Cross-location merges have a different policy: by default they DO NOT auto-merge, because a person may legitimately be a customer of two separate franchises. They surface for human review with both location names attached. The auto-merge threshold can be raised per-deployment when the operator confirms locations represent the same business entity (e.g., two co-located retail floors).
8. Idempotent merge with state file (neutralizes mid-run crash)
Every cluster operation is recorded in merge_state BEFORE the API call and confirmed AFTER. A crash mid-merge leaves a pending row; the next run sees it, queries Podium for the current state of the primary, and either confirms done or retries.
```sql
CREATE TABLE IF NOT EXISTS merge_state (
    cluster_id TEXT PRIMARY KEY,     -- hash of sorted contact_uids
    natural_key TEXT NOT NULL,
    primary_uid TEXT NOT NULL,
    duplicate_uids TEXT NOT NULL,    -- JSON array
    status TEXT NOT NULL,            -- pending | merging | merged | patched | failed
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    started_at TEXT NOT NULL,
    completed_at TEXT
);
```
State transitions: `pending → merging → merged → patched`. Only `patched` is terminal-success. A run resumes from any non-terminal state by re-checking the primary in Podium.
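The resume decision can be sketched as a pure function over one `merge_state` row; the action names are illustrative, not part of any API:

```python
def resume_action(status: str, attempts: int, max_attempts: int = 4) -> str:
    """Decide what a restarted run does with one merge_state row (sketch)."""
    if status == "patched":
        return "skip"            # terminal success — nothing to do
    if attempts >= max_attempts:
        return "mark_failed"     # give up and surface in the audit log
    if status in ("pending", "failed"):
        return "start_merge"     # never reached (or never completed) the API call
    if status == "merging":
        return "verify_primary"  # crashed mid-call: ask Podium whether the merge landed
    if status == "merged":
        return "patch_opt_outs"  # merge landed but the opt-out PATCH never fired
    raise ValueError(f"unknown status: {status}")

print(resume_action("merged", 1))  # patch_opt_outs
```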
9. Conflict detection on simultaneous edits (neutralizes the race)
Before each merge API call, the orchestrator re-fetches each duplicate and verifies `updated_at_podium` matches the indexed value. If a duplicate has been updated since the index was built (another operator merged it into a different record, an admin edited it, an inbound message arrived), the orchestrator:

1. Aborts the merge for this cluster.
2. Logs `conflict_detected` to the audit log with the stale indexed `updated_at` vs the current live `updated_at`.
3. Marks the cluster `reindex_required` — the next run rebuilds the index for this `natural_key` and re-evaluates.
This is fail-stop, not fail-silent. A simultaneous-merge race surfaces in the audit log, not in the customer's marketing inbox.
Error Handling
Troubleshoot failures using the table below — each row maps a wire-level symptom to its root cause and the action. For deeper debugging, `audit-log.jsonl` records every cluster's pre- and post-merge state, and the `merge_state` SQLite table is the resumable source of truth for any in-flight operation.
| HTTP Status | Podium Error | Root Cause | Action |
|---|---|---|---|
| 400 Bad Request | `invalid_duplicate_uid` | A duplicate_uid does not exist or is already soft-deleted | Re-fetch and re-evaluate the cluster — the duplicate may have been merged elsewhere |
| 404 Not Found | `contact_not_found` | Primary uid no longer exists (hard-deleted between index and merge) | Skip the cluster; the data is gone, no recovery needed |
| 409 Conflict | `merge_in_progress` | Another merge is already operating on one of these contacts | Wait with exponential backoff; another orchestrator instance is mid-merge |
| 422 Unprocessable | `cross_location_merge_blocked` | Primary and duplicate are in different location_uids and tenant policy forbids the merge | Surface to the human review queue; do not retry |
| 429 Too Many Requests | `rate_limited` | Burst merge load tripped Podium's per-tenant limit | Honor `Retry-After`; the downstream skill is podium-rate-limit-survival |
| 500/502/503 | `server_error` | Podium-side transient failure | Exponential backoff with jitter, max 4 attempts; keep the cluster `pending` |
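The table collapses naturally into a dispatcher inside the orchestrator's retry loop. A sketch with illustrative action names:

```python
def next_move(status: int, attempts: int, max_attempts: int = 4) -> str:
    """Map a merge-call HTTP status to the orchestrator's next move (sketch of the table)."""
    if 200 <= status < 300:
        return "ok"
    if status == 400:
        return "reevaluate_cluster"   # duplicate gone — re-fetch and re-score
    if status == 404:
        return "skip_cluster"         # primary hard-deleted, nothing to recover
    if status == 409:
        return "backoff_wait"         # another merge in flight on these contacts
    if status == 422:
        return "human_review"         # cross-location policy block — never retry
    if status == 429:
        return "honor_retry_after"
    if status >= 500:
        return "backoff_retry" if attempts < max_attempts else "mark_failed"
    return "mark_failed"

print(next_move(409, 1))  # backoff_wait
```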
Examples
Normalize a single phone number from CLI
```shell
python3 scripts/phone_normalize.py --phone "0412 345 678" --region AU --output json
```
Output:
```json
{
  "valid": true,
  "e164": "+61412345678",
  "national": "0412 345 678",
  "country": "AU",
  "natural_key": "+61412345678"
}
```
Build the natural-key index and find duplicate clusters
```shell
# 1. Pull all contacts from a location and populate the SQLite index
python3 scripts/find_duplicates.py \
  --location-uid loc_abc123 \
  --db ./podium-dedup.sqlite \
  --token-env PODIUM_ACCESS_TOKEN \
  --output json
# Output: clusters of length >= 2, each with confidence score and suggested primary
```
Dry-run a merge of one cluster
```shell
python3 scripts/merge_contacts.py \
  --cluster-id cl_7f3a... \
  --db ./podium-dedup.sqlite \
  --token-env PODIUM_ACCESS_TOKEN \
  --dry-run
```
Dry-run prints the planned operation — `primary_uid`, `duplicate_uids`, the opt-out union, and the exact API calls that would fire — without contacting Podium.
Execute the merge for real
```shell
python3 scripts/merge_contacts.py \
  --cluster-id cl_7f3a... \
  --db ./podium-dedup.sqlite \
  --token-env PODIUM_ACCESS_TOKEN
```
Scan for cross-location duplicates after per-location runs complete
```shell
python3 scripts/cross_location_dedup.py \
  --db ./podium-dedup.sqlite \
  --output review-queue.json
```
The output is a human-review queue, not an auto-merge plan — cross-location merges require operator confirmation by default.
Output
- E.164 normalization function with natural-key emission, validated by the `phonenumbers` library
- SQLite-backed natural-key index keyed on `natural_key` and `(natural_key, location_uid)`
- Duplicate detection emitting clusters with confidence scores (0.60 floor, 0.80 auto-merge threshold)
- Merge orchestrator with deterministic primary selection (`field_count` > `updated_at` > uid)
- Opt-out preservation via union-then-PATCH after every merge
- Cross-location duplicate scanner producing a human-review queue
- Idempotent merge state file (resumable after crash)
- Conflict detection via `updated_at_podium` re-check before each merge
Resources
- Podium API docs — Contacts
- Podium API docs — Contact merge
- phonenumbers library (Google libphonenumber port)
- E.164 spec (ITU-T Recommendation E.164)
- config/settings.yaml — region defaults, confidence thresholds, opt-out policy
- references/errors.md — ERR_DEDUP_* codes with cause + solution
- references/examples.md — 10 worked examples (single phone, cluster, cross-location, resume)
- references/implementation.md — Node port, libphonenumber wiring, audit log schema
- scripts/phone_normalize.py — CLI: normalize a phone to E.164 + natural key
- scripts/find_duplicates.py — CLI: scan and emit cluster proposals with scores
- scripts/merge_contacts.py — CLI: merge a cluster (primary + dupes) with `--dry-run`
- scripts/cross_location_dedup.py — CLI: cross-location scan for human review