deduplication

Skill from dadbodgeoff/drift

What it does

Deduplicates multi-source data by semantic similarity, selecting canonical items using reputation scoring and hash-based grouping.

Installation

Install the npm package:

npm install -g driftdetect

To install the latest version explicitly:

npm install -g driftdetect@latest

Install the MCP server package:

npm install -g driftdetect-mcp

Claude Desktop Configuration

Add this to your claude_desktop_config.json:

{
  "mcpServers": {
    "drift": {
      "command": "driftdetect-mcp"
    }
  }
...

5 installs

Added Feb 4, 2026

Skill Details

SKILL.md

Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.

Overview

# Event Deduplication

Canonical selection with reputation scoring and hash-based grouping for multi-source data.

When to Use This Skill

Aggregating data from multiple sources (news, events, products)
Handling the same content when it appears from different outlets or sources
Picking the "best" version from a set of duplicates
Tracking deduplication metrics for optimization

Core Concepts

Simple URL deduplication isn't enough. A production pipeline also needs:

Grouping by semantic similarity (same story, different outlets)
Canonical selection (pick the "best" version)
Reputation scoring (prefer authoritative sources)
Both ID-based and content-based deduplication

Two modes:

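ID-based: When sources provide unique IDs, keep the "best" version whenever two items share an ID
Content-based: Group by a normalized content key (title plus publication date) as an approximation of semantic similarity, then select a canonical item from each group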

Implementation

TypeScript

```typescript

import { createHash } from 'crypto';

interface DeduplicationResult<T> {
  items: T[];
  originalCount: number;
  dedupedCount: number;
  reductionPercent: number;
  duplicateGroups?: number;
}

// ============================================
// ID-Based Deduplication
// ============================================

function deduplicateById<T extends { id: string }>(
  items: T[],
  preferFn: (existing: T, candidate: T) => T
): DeduplicationResult<T> {
  const seen = new Map<string, T>();

  for (const item of items) {
    const existing = seen.get(item.id);
    if (existing) {
      // ID collision: let the caller decide which version to keep
      seen.set(item.id, preferFn(existing, item));
    } else {
      seen.set(item.id, item);
    }
  }

  const dedupedItems = Array.from(seen.values());
  const reductionPercent = items.length > 0
    ? Math.round((1 - dedupedItems.length / items.length) * 100)
    : 0;

  return {
    items: dedupedItems,
    originalCount: items.length,
    dedupedCount: dedupedItems.length,
    reductionPercent,
  };
}

// ============================================
// Content-Based Deduplication
// ============================================

interface Article {
  title: string;
  url: string;
  domain: string;
  publishedAt: string;
  tone?: number;
}

/**
 * Generate deduplication key from content
 * Groups by: normalized title (first 50 characters) + publication date
 */
function generateDedupKey(article: Article): string {
  const normalizedTitle = article.title
    .toLowerCase()
    .replace(/[^\w\s]/g, '') // strip punctuation
    .trim()
    .slice(0, 50);

  // Keep only the YYYYMMDD part of the timestamp
  const dateStr = article.publishedAt?.slice(0, 10).replace(/-/g, '') || 'unknown';

  return `${normalizedTitle}|${dateStr}`;
}

/**
 * Generate unique ID from URL
 */
function generateEventId(url: string): string {
  // MD5 is used here only as a short fingerprint, not for security
  return createHash('md5').update(url).digest('hex').slice(0, 12);
}

/**
 * Source reputation scoring
 */
function getReputationScore(domain: string): number {
  // Note: includes() is a substring match, so subdomains match as well

  // Tier 1: Wire services and major international
  const tier1 = ['reuters.com', 'apnews.com', 'bbc.com', 'bbc.co.uk',
    'aljazeera.com', 'france24.com', 'dw.com'];
  if (tier1.some(r => domain.includes(r))) return 100;

  // Tier 2: Major newspapers
  const tier2 = ['nytimes.com', 'washingtonpost.com', 'theguardian.com',
    'ft.com', 'economist.com', 'wsj.com'];
  if (tier2.some(r => domain.includes(r))) return 75;

  // Tier 3: Regional/national
  const tier3 = ['cnn.com', 'foxnews.com', 'nbcnews.com', 'abcnews.go.com'];
  if (tier3.some(r => domain.includes(r))) return 50;

  return 10;
}

/**
 * Select canonical article from duplicate group
 * Score = source reputation + absolute tone; ties keep the earlier item
 */
function selectCanonical<T extends Article>(
  group: { item: T; source: string }[]
): { item: T; source: string } {
  return group.reduce((best, current) => {
    const bestScore = getReputationScore(best.item.domain) +
      Math.abs(best.item.tone || 0);
    const currentScore = getReputationScore(current.item.domain) +
      Math.abs(current.item.tone || 0);
    return currentScore > bestScore ? current : best;
  });
}

/**
 * Deduplicate articles from multiple sources
 */
function deduplicateArticles<T extends Article>(
  sourceResults: { sourceName: string; articles: T[] }[]
): DeduplicationResult<T & { source: string }> {
  const groups = new Map<string, { item: T; source: string }[]>();
  let totalArticles = 0;

  // Group articles by dedup key
  for (const { sourceName, articles } of sourceResults) {
    for (const article of articles) {
      totalArticles++;
      const key = generateDedupKey(article);
      if (!groups.has(key)) {
        groups.set(key, []);
      }
      groups.get(key)!.push({ item: article, source: sourceName });
    }
  }

  // Select canonical article from each group
  const items: (T & { source: string })[] = [];
  for (const group of groups.values()) {
    const canonical = selectCanonical(group);
    items.push({ ...canonical.item, source: canonical.source });
  }

  const reductionPercent = totalArticles > 0
    ? Math.round((1 - items.length / totalArticles) * 100)
    : 0;

  console.log(`[Dedup] ${totalArticles} → ${items.length} (${reductionPercent}% reduction)`);

  return {
    items,
    originalCount: totalArticles,
    dedupedCount: items.length,
    reductionPercent,
    duplicateGroups: groups.size, // number of distinct dedup keys
  };
}

```
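As a rough illustration of how the dedup key groups the same story across outlets, the sketch below re-declares a minimal key function that mirrors `generateDedupKey` and applies it to three headlines. The names `KeyedArticle` and `dedupKey`, and all sample titles and dates, are invented for this example.

```typescript
// Minimal stand-in for generateDedupKey; the sample data below is hypothetical.
interface KeyedArticle {
  title: string;
  publishedAt: string;
}

function dedupKey(article: KeyedArticle): string {
  const normalizedTitle = article.title
    .toLowerCase()
    .replace(/[^\w\s]/g, '') // drop punctuation
    .trim()
    .slice(0, 50);
  const dateStr = article.publishedAt.slice(0, 10).replace(/-/g, '') || 'unknown';
  return `${normalizedTitle}|${dateStr}`;
}

const fromOutletA: KeyedArticle = {
  title: 'Storm hits coast: thousands evacuated!',
  publishedAt: '2024-03-01T08:15:00Z',
};
const fromOutletB: KeyedArticle = {
  title: 'STORM HITS COAST, THOUSANDS EVACUATED',
  publishedAt: '2024-03-01T11:40:00Z',
};
const reworded: KeyedArticle = {
  title: 'Thousands evacuated as storm hits coast',
  publishedAt: '2024-03-01T12:00:00Z',
};

const keyA = dedupKey(fromOutletA);
const keyB = dedupKey(fromOutletB);
const keyC = dedupKey(reworded);

console.log(keyA);          // storm hits coast thousands evacuated|20240301
console.log(keyA === keyB); // true: case and punctuation differences are normalized away
console.log(keyA === keyC); // false: a reworded headline produces a different key
```

The last case shows the limit of this kind of key: it catches case and punctuation variants of the same headline on the same day, but not a rewritten headline for the same story.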

Usage Examples

ID-Based Deduplication

```typescript

// fetchEvents is assumed to return items with `id`, an optional `lat`, and `sentiment`
const events = await fetchEvents();

const result = deduplicateById(events, (existing, candidate) => {
  // Prefer events with coordinates (null check, so a latitude of 0 still counts)
  if (existing.lat == null && candidate.lat != null) return candidate;

  // Prefer higher sentiment magnitude
  if (Math.abs(candidate.sentiment) > Math.abs(existing.sentiment)) {
    return candidate;
  }

  return existing;
});

console.log(`Reduced ${result.reductionPercent}% duplicates`);

```
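The `reductionPercent` value logged above is a rounded percentage of items removed. The standalone sketch below repeats that arithmetic on a few made-up counts; the helper name `reductionPercent` as a function is an assumption for illustration only.

```typescript
// Same rounding as the reductionPercent calculation in the implementation:
// share of items removed, as a whole-number percentage.
function reductionPercent(originalCount: number, dedupedCount: number): number {
  return originalCount > 0
    ? Math.round((1 - dedupedCount / originalCount) * 100)
    : 0;
}

console.log(reductionPercent(12, 9)); // 25
console.log(reductionPercent(3, 2));  // 33
console.log(reductionPercent(0, 0));  // 0 (empty input is guarded against division by zero)
```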

Multi-Source Aggregation

```typescript

const results = await Promise.all([
  fetchFromSourceA(),
  fetchFromSourceB(),
  fetchFromSourceC(),
]);

const { items, reductionPercent } = deduplicateArticles([
  { sourceName: 'source-a', articles: results[0] },
  { sourceName: 'source-b', articles: results[1] },
  { sourceName: 'source-c', articles: results[2] },
]);

// items now contains canonical articles with source attribution

```

Best Practices

Semantic grouping - Group by normalized content, not just URL
Reputation scoring - Prefer authoritative sources as canonical
Best version selection - When IDs collide, keep the version with the most data
Reduction tracking - Log how much deduplication helped
Source attribution - Track which source the canonical came from
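One way to read "keep the version with the most data" is to count populated fields and keep the fuller record. The sketch below is a minimal take on that idea, not the repository's method; `countPopulatedFields`, `preferMoreComplete`, and the `GeoRecord` shape are hypothetical names introduced here.

```typescript
// Hypothetical record shape used only for this example.
interface GeoRecord {
  id: string;
  title: string;
  lat: number | null;
  lon: number | null;
}

// Counts fields that carry a usable value (not undefined, null, or empty string).
function countPopulatedFields(record: object): number {
  return Object.values(record).filter(
    (value) => value !== undefined && value !== null && value !== ''
  ).length;
}

// Returns whichever record has more populated fields; ties keep the existing one.
function preferMoreComplete<T extends object>(existing: T, candidate: T): T {
  return countPopulatedFields(candidate) > countPopulatedFields(existing)
    ? candidate
    : existing;
}

const sparse: GeoRecord = { id: 'evt-1', title: 'Flood warning', lat: null, lon: null };
const rich: GeoRecord = { id: 'evt-1', title: 'Flood warning', lat: 51.5, lon: -0.12 };

const kept = preferMoreComplete(sparse, rich);
console.log(kept === rich); // true: the record with coordinates wins
```

A function with this signature could be passed as the `preferFn` argument of `deduplicateById`.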

Common Mistakes

Simple URL deduplication (misses same story from different outlets)
Random selection from duplicates (loses the quality signal)
No normalization (case/punctuation differences create false negatives)
Not tracking reduction metrics (can't optimize)
Hardcoded source lists (make them configurable)
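To avoid the hardcoded-list mistake, the reputation table can be passed in as configuration. The sketch below is one possible shape, assuming a plain hostname-to-score map; `ReputationConfig` and `createReputationScorer` are hypothetical names, and the scores are example values. It also matches on the hostname or its subdomains rather than on an arbitrary substring.

```typescript
// Hypothetical config shape: hostname -> score. Values are examples only.
type ReputationConfig = Record<string, number>;

function createReputationScorer(
  config: ReputationConfig,
  defaultScore = 10
): (domain: string) => number {
  const entries = Object.entries(config);

  return (domain: string): number => {
    const host = domain.toLowerCase().replace(/^www\./, '');
    let best: number | null = null;

    for (const [suffix, score] of entries) {
      // Match the exact host or one of its subdomains, not any substring.
      if (host === suffix || host.endsWith(`.${suffix}`)) {
        best = best === null ? score : Math.max(best, score);
      }
    }

    // Unknown sources fall back to the default score.
    return best ?? defaultScore;
  };
}

const scoreDomain = createReputationScorer({
  'reuters.com': 100,
  'nytimes.com': 75,
});

console.log(scoreDomain('www.reuters.com')); // 100
console.log(scoreDomain('uk.reuters.com'));  // 100
console.log(scoreDomain('notreuters.com'));  // 10
```

The config object could be loaded from a file or environment-specific settings, so tiers can be adjusted without a code change.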

Related Patterns

batch-processing - Process deduplicated items efficiently
validation-quarantine - Validate before deduplication
checkpoint-resume - Track which files have been deduplicated

More from this repository (10)

feature-flags (Skill)

Enables controlled feature rollouts, A/B testing, and selective feature access through configurable flags for gradual deployment and user targeting.

design-tokens (Skill)

Generates a comprehensive, type-safe design token system with WCAG AA color compliance and multi-framework support for consistent visual design.

file-uploads (Skill)

Securely validates, scans, and processes file uploads with multi-stage checks, malware detection, and race condition prevention.

ai-coaching (Skill)

Guides users through articulating creative intent by extracting structured parameters and detecting conversation readiness.

environment-config (Skill)

Validates and centralizes environment variables with type safety, fail-fast startup checks, and multi-environment support.

community-feed (Skill)

Generates efficient social feed with cursor pagination, trending algorithms, and engagement tracking for infinite scroll experiences.

cloud-storage (Skill)

Enables secure, multi-tenant cloud file storage with signed URLs, direct uploads, and visibility control for user-uploaded assets.

email-service (Skill)

Simplifies email sending, templating, and tracking with robust SMTP integration and support for multiple email providers and transactional workflows.

error-sanitization (Skill)

Sanitizes error messages by logging full details server-side while exposing only generic, safe messages to prevent sensitive information leakage.

batch-processing (Skill)

Optimizes database operations by collecting and batching independent records, improving throughput by 30-40% with built-in fallback processing.