🎯

nemo-curator

🎯Skill

from ovachiever/droid-tings

VibeIndex|
What it does

Accelerates GPU-powered data curation for LLM training by performing fuzzy deduplication, quality filtering, semantic deduplication, and PII redaction across text, image, video, and audio datasets.

Part of ovachiever/droid-tings (370 items)

Installation

Clone the repository:
git clone https://github.com/ovachiever/droid-tings.git
Extracted from docs: ovachiever/droid-tings
16 installs
Added Feb 4, 2026

Skill Details

SKILL.md

GPU-accelerated data curation for LLM training. Supports text, image, video, and audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use it for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.

Overview

# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

When to use NeMo Curator

Use NeMo Curator when:

  • Preparing LLM training data from web scrapes (Common Crawl)
  • Deduplicating large corpora quickly (16× faster than CPU)
  • Curating multi-modal datasets (text, images, video, audio)
  • Filtering low-quality or toxic content
  • Scaling data processing across a GPU cluster

Performance:

  • 16× faster fuzzy deduplication (8TB RedPajama v2)
  • 40% lower TCO vs CPU alternatives
  • Near-linear scaling across GPU nodes

Use alternatives instead:

  • datatrove: CPU-based, open-source data processing
  • dolma: Allen AI's data toolkit
  • Ray Data: General ML data processing (no curation focus)

Quick start

Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

Basic text curation pipeline

```python
import pandas as pd
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates

# Load data (two documents are long enough to survive the filter below)
df = pd.DataFrame({
    "text": [
        "A good document with enough words to pass the filter",
        "Bad doc",
        "An excellent text that is also long enough to keep",
    ]
})
dataset = DocumentDataset(df)

# Quality filtering: keep documents with more than 5 words
def quality_score(doc):
    return len(doc["text"].split()) > 5

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```

Data curation pipeline

Stage 1: Quality filtering

```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Three of the 30+ heuristic filters, each wrapped in ScoreFilter
# so it can be applied to a dataset

# Word count filter
dataset = ScoreFilter(WordCountFilter(min_words=50, max_words=100000))(dataset)

# Remove repetitive content
dataset = ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))(dataset)

# URL ratio filter
dataset = ScoreFilter(UrlRatioFilter(max_url_ratio=0.2))(dataset)
```
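To make the three thresholds above concrete, here is a minimal pure-Python sketch of what such heuristics might compute. It is not NeMo Curator's implementation: the helper names (`word_count`, `repeated_line_fraction`, `url_ratio`, `keep_document`) and the exact definitions of each ratio are assumptions chosen for illustration.

```python
import re

# Hypothetical URL pattern; real filters may use a stricter expression.
URL_PATTERN = re.compile(r"https?://\S+")


def word_count(text: str) -> int:
    """Number of whitespace-separated tokens."""
    return len(text.split())


def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def url_ratio(text: str) -> float:
    """Fraction of characters that belong to URLs."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in URL_PATTERN.findall(text))
    return url_chars / len(text)


def keep_document(text, min_words=50, max_words=100_000,
                  max_repeated=0.3, max_url=0.2):
    """Apply the three checks with the thresholds used in Stage 1."""
    return (
        min_words <= word_count(text) <= max_words
        and repeated_line_fraction(text) <= max_repeated
        and url_ratio(text) <= max_url
    )
```

Each check is a cheap per-document statistic, which is why heuristic filtering usually runs before the more expensive deduplication and classifier stages.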

Stage 2: Deduplication

Exact deduplication:

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```
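Exact deduplication generally comes down to hashing each document's text and keeping one document per hash. The sketch below shows that idea on an in-memory list. The `exact_dedup` helper and the choice to keep the first occurrence are assumptions for illustration, not the library's code path.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each distinct MD5 of the text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},
    {"id": 3, "text": "something else"},
]
unique_docs = exact_dedup(docs)
```

Because only byte-identical texts collide, a single changed character is enough to defeat this stage, which is what motivates the fuzzy method that follows.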

Fuzzy deduplication (16× faster on GPU):

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)

deduped = fuzzy_dedup(dataset)
```
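The following sketch illustrates the MinHash + LSH idea behind those parameters in plain Python. It assumes that 260 hashes split over 20 buckets means 13 hashes per bucket, and that documents are shingled into character 5-grams. Both are assumptions made for this example, and every helper name here is hypothetical.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash derived from MD5."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(text: str, num_hashes: int = 260) -> list:
    """One minimum per seeded hash function."""
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b) -> float:
    """Share of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signature, num_buckets: int = 20):
    """Split the signature into equal-sized bands (13 values each here)."""
    rows = len(signature) // num_buckets
    return [tuple(signature[i * rows:(i + 1) * rows]) for i in range(num_buckets)]


def is_candidate_pair(sig_a, sig_b, num_buckets: int = 20) -> bool:
    """Two documents are candidates if any band matches exactly."""
    return any(
        a == b
        for a, b in zip(lsh_buckets(sig_a, num_buckets), lsh_buckets(sig_b, num_buckets))
    )


def candidate_probability(similarity, num_buckets=20, hashes_per_bucket=13):
    """Chance that a pair with the given Jaccard similarity shares a band."""
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets
```

More hashes per bucket make a match stricter, while more buckets give a pair more chances to collide. `candidate_probability` lets you see where the resulting S-curve rises for a given split before committing to a full run.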

Semantic deduplication:

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)

deduped = semantic_dedup(dataset)
```
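As a rough illustration of what a cosine similarity threshold of 0.8 means, the sketch below greedily drops any document whose embedding is too close to one already kept. The toy two-dimensional vectors stand in for real model embeddings. The greedy all-pairs loop is an assumption made for clarity, since production systems typically cluster embeddings first to avoid quadratic comparisons.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings: dict, threshold: float = 0.8) -> list:
    """Return the ids to keep, visiting documents in insertion order."""
    kept = []
    for doc_id, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept


# Toy embeddings: "a" and "b" point almost the same way, "c" is orthogonal.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],
    "c": [0.0, 1.0],
}
kept_ids = semantic_dedup(toy_embeddings)
```

Lowering the threshold removes more paraphrases at the risk of discarding documents that are merely on the same topic.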

Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)
```
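The difference between the two actions can be sketched with plain regular expressions. This example assumes "replace" substitutes an entity placeholder and "redact" removes the matched text entirely. The patterns and the `scrub` helper are hypothetical, cover only emails and phone numbers, and are far simpler than a real PII recognizer (names and locations need an NER model).

```python
import re

# Deliberately simple, illustrative patterns.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

PATTERNS = (("EMAIL_ADDRESS", EMAIL), ("PHONE_NUMBER", PHONE))


def scrub(text: str, action: str = "replace") -> str:
    """Replace matches with <ENTITY> placeholders, or remove them."""
    if action not in ("replace", "redact"):
        raise ValueError(f"unknown action: {action}")
    for label, pattern in PATTERNS:
        replacement = f"<{label}>" if action == "replace" else ""
        text = pattern.sub(replacement, text)
    return text


sample = "Mail jane@example.com or call +1 555-123-4567."
replaced = scrub(sample, action="replace")
redacted_text = scrub(sample, action="redact")
```

Placeholders keep sentence structure intact for downstream training, which is one reason "replace" is often preferred over outright removal.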

Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```

GPU acceleration

GPU vs CPU performance

| Operation | CPU | GPU | Speedup |
|-----------|-----|-----|---------|
| Fuzzy dedup (8TB) | 120 hours (256 cores) | 7.5 hours (8× A100) | 16× |
| Exact dedup (1TB) | 8 hours (64 cores) | 0.5 hours (4× A100) | 16× |
| Quality filtering (100GB) | 2 hours (32 cores) | 0.2 hours (2× A100) | 10× |

Multi-GPU scaling

```python
from nemo_curator import get_client
from nemo_curator.modules import FuzzyDuplicates

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs (dataset is a DocumentDataset loaded earlier)
deduped = FuzzyDuplicates(...)(dataset)
```

Multi-modal curation

Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```

Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```
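Word error rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below computes it with a standard dynamic-programming Levenshtein distance. The `word_error_rate` helper is hypothetical, and it assumes a reference transcript is available for each clip.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


def passes_wer(reference: str, hypothesis: str, max_wer: float = 0.3) -> bool:
    """Keep a sample only if its WER is at or below the threshold."""
    return word_error_rate(reference, hypothesis) <= max_wer
```

With `max_wer=0.3`, a ten-word reference tolerates up to three word-level errors before the sample is dropped.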

Common patterns

Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.modifiers import PIIRedactor
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Each stage is a callable that takes a dataset and returns a dataset.
# Filters are wrapped in ScoreFilter and modifiers in Modify.
pipeline = [
    # 1. Quality filtering
    ScoreFilter(WordCountFilter(min_words=100, max_words=50000)),
    ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.2)),
    ScoreFilter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3)),
    ScoreFilter(UrlRatioFilter(max_url_ratio=0.3)),
    # 2. Language filtering
    ScoreFilter(LanguageIdentificationFilter(target_languages=["en"])),
    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),
    # 4. PII redaction
    Modify(PIIRedactor()),
    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute the stages in order
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```

Distributed processing

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import FuzzyDuplicates

# Multi-GPU cluster: connect a Dask client to the local CUDA cluster
cluster = LocalCUDACluster(n_workers=8)
client = Client(cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```

Performance benchmarks

Fuzzy deduplication (8TB RedPajama v2)

  • CPU (256 cores): 120 hours
  • GPU (8× A100): 7.5 hours
  • Speedup: 16×

Exact deduplication (1TB)

  • CPU (64 cores): 8 hours
  • GPU (4× A100): 0.5 hours
  • Speedup: 16×

Quality filtering (100GB)

  • CPU (32 cores): 2 hours
  • GPU (2× A100): 0.2 hours
  • Speedup: 10×

Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

  • Cost: $3.60/hour × 10 = $36/hour
  • Time for 8TB: 120 hours
  • Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

  • Cost: $32.77/hour × 2 = $65.54/hour
  • Time for 8TB: 7.5 hours
  • Total: $491.55

Savings: 89% reduction ($3,828 saved)
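
The arithmetic behind these totals can be reproduced in a few lines, which also makes it easy to substitute current instance prices. The hourly rates and run times are the figures quoted above, and `run_cost` is a hypothetical helper.

```python
def run_cost(hourly_rate: float, instances: int, hours: float) -> float:
    """Total cost of running a fixed-size fleet for a given duration."""
    return hourly_rate * instances * hours


cpu_total = run_cost(hourly_rate=3.60, instances=10, hours=120)   # c5.18xlarge × 10
gpu_total = run_cost(hourly_rate=32.77, instances=2, hours=7.5)   # p4d.24xlarge × 2

savings = cpu_total - gpu_total
savings_pct = 100 * savings / cpu_total
speedup = 120 / 7.5

print(f"CPU: ${cpu_total:,.2f}  GPU: ${gpu_total:,.2f}")
print(f"Savings: ${savings:,.2f} ({savings_pct:.0f}%), speedup {speedup:.0f}x")
```

The GPU fleet costs more per hour, so the saving comes entirely from the shorter run time. If the speedup on your workload is lower than 16×, the advantage shrinks accordingly.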

Supported data formats

  • Input: Parquet, JSONL, CSV
  • Output: Parquet (recommended), JSONL
  • WebDataset: TAR archives for multi-modal

Use cases

Production deployments:

  • NVIDIA used NeMo Curator to prepare Nemotron-4 training data
  • Open-source datasets curated: RedPajama v2, The Pile

References

  • [Filtering Guide](references/filtering.md) - 30+ quality filters, heuristics
  • [Deduplication Guide](references/deduplication.md) - Exact, fuzzy, semantic methods

Resources

  • GitHub: https://github.com/NVIDIA/NeMo-Curator (500+ stars)
  • Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
  • Version: 0.4.0+
  • License: Apache 2.0