clip-aware-embeddings
Skill from erichowens/some_claude_skills
Performs semantic image-text matching using CLIP embeddings for zero-shot classification, image search, and similarity tasks.
Installation
```bash
npx skills add https://github.com/erichowens/some_claude_skills --skill clip-aware-embeddings
```
Skill Details
Semantic image-text matching with CLIP and alternatives. Use for image search, zero-shot classification, similarity matching. NOT for counting objects, fine-grained classification (celebrities, car models), spatial reasoning, or compositional queries. Activate on "CLIP", "embeddings", "image similarity", "semantic search", "zero-shot classification", "image-text matching".
Overview
# CLIP-Aware Image Embeddings
Smart image-text matching that knows when CLIP works and when to use alternatives.
MCP Integrations
| MCP | Purpose |
|-----|---------|
| Firecrawl | Research latest CLIP alternatives and benchmarks |
| Hugging Face (if configured) | Access model cards and documentation |
Quick Decision Tree
```
Your task:
├─ Semantic search ("find beach images") → CLIP ✅
├─ Zero-shot classification (broad categories) → CLIP ✅
├─ Counting objects → DETR, Faster R-CNN
├─ Fine-grained ID (celebrities, car models) → Specialized model
├─ Spatial relations ("cat left of dog") → GQA, SWIG
└─ Compositional ("red car AND blue truck") → DCSMs, PC-CLIP
```
When to Use This Skill
✅ Use for:
- Semantic image search
- Broad category classification
- Image similarity matching
- Zero-shot tasks on new categories
❌ Do NOT use for:
- Counting objects in images
- Fine-grained classification
- Spatial understanding
- Attribute binding
- Negation handling
Installation
```bash
pip install transformers pillow torch sentence-transformers --break-system-packages
```
Validation: Run `python scripts/validate_setup.py`
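If the bundled script isn't available in your environment, a minimal manual sanity check is to confirm the core packages import and a small CLIP checkpoint loads. This snippet is an illustrative stand-in, not the shipped `scripts/validate_setup.py`:
```python
# Quick manual sanity check (illustrative; not the shipped validate_setup.py)
import torch, transformers, PIL, sentence_transformers
from transformers import CLIPModel

print("torch", torch.__version__, "| transformers", transformers.__version__)
# Loading a small checkpoint confirms weights can be downloaded and instantiated
CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
print("CLIP loads OK | CUDA available:", torch.cuda.is_available())
```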
Basic Usage
Image Search
```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed images
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# Embed the text query
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# L2-normalize, then rank images by cosine similarity to the query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).softmax(dim=0)
```
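To turn the similarity scores into search results, rank the images by score. A short follow-up sketch using `torch.topk` (the `k=3` cutoff is an arbitrary choice):
```python
# Keep the best-matching images for the query
top = torch.topk(similarity.squeeze(-1), k=3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"img{idx}.jpg  score={score:.3f}")
```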
Common Anti-Patterns
Anti-Pattern 1: "CLIP for Everything"
❌ Wrong:
```python
# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results
```
Why wrong: CLIP collapses the whole image into a single global vector, discarding the per-instance information needed for counting. It cannot count reliably.
✅ Right:
```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Detect objects (the image path is illustrative)
image = Image.open("street.jpg")
with torch.no_grad():
    outputs = model(**processor(images=image, return_tensors="pt"))

# Convert raw outputs into labeled boxes, then filter for cars and count
detections = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.9)[0]
count = sum(1 for l in detections["labels"] if model.config.id2label[l.item()] == "car")
```
How to detect: If query contains "how many", "count", or numeric questions → Use object detection
---
Anti-Pattern 2: Fine-Grained Classification
❌ Wrong:
```python
# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID
```
Why wrong: CLIP was trained on coarse web categories. Fine-grained distinctions (faces, car models, flower species) require specialized models.
✅ Right:
```python
# Use a fine-tuned image classification model
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # then fine-tune on a celebrity dataset
)
# Or use a dedicated face-recognition model: ArcFace, CosFace
```
How to detect: If query asks to distinguish between similar items in same category → Use specialized model
---
Anti-Pattern 3: Spatial Understanding
❌ Wrong:
```python
# CLIP cannot understand spatial relationships
prompts = [
"cat to the left of dog",
"cat to the right of dog"
]
# Will give nearly identical scores
```
Why wrong: CLIP embeddings lose spatial topology. "Left" and "right" are treated as bag-of-words.
✅ Right:
```python
# Use a spatial reasoning model
# Examples: GQA-trained models, Visual Genome relationship models, SWIG
# NOTE: illustrative pseudocode -- substitute the spatial-relation model you actually use
from swig_model import SpatialRelationModel

model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.
```
How to detect: If query contains directional words (left, right, above, under, next to) → Use spatial model
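If no dedicated spatial-relation model is at hand, one workable fallback is to combine an off-the-shelf detector with simple box geometry. A minimal sketch, assuming both objects are in DETR's COCO label set and appear once each (the image path and threshold are illustrative):
```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("cat_and_dog.jpg")  # illustrative path
with torch.no_grad():
    outputs = model(**processor(images=image, return_tensors="pt"))
det = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.8)[0]

# Horizontal center of the first detected box for a given label
def center_x(name):
    for label, box in zip(det["labels"], det["boxes"]):
        if model.config.id2label[label.item()] == name:
            return (box[0] + box[2]).item() / 2  # boxes are (xmin, ymin, xmax, ymax)
    return None

cat_x, dog_x = center_x("cat"), center_x("dog")
if cat_x is not None and dog_x is not None:
    print("cat is", "left of" if cat_x < dog_x else "right of", "dog")
```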
---
Anti-Pattern 4: Attribute Binding
❌ Wrong:
```python
prompts = [
"red car and blue truck",
"blue car and red truck"
]
# CLIP often gives similar scores for both
```
Why wrong: CLIP cannot bind attributes to objects. It sees "red, blue, car, truck" as a bag of concepts.
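You can observe the failure directly by scoring both captions against the same image with the standard CLIP forward pass. A minimal sketch (the image path is illustrative, and exact numbers will vary):
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

prompts = ["red car and blue truck", "blue car and red truck"]
inputs = processor(text=prompts, images=Image.open("scene.jpg"),  # illustrative path
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# The two probabilities are typically close -- CLIP does not bind color to object
print(dict(zip(prompts, probs.tolist())))
```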
✅ Right: use PC-CLIP or DCSMs:
```python
# PC-CLIP: fine-tuned for pairwise comparisons
# NOTE: illustrative pseudocode -- package and checkpoint names depend on the released implementation
from pc_clip import PCCLIPModel

model = PCCLIPModel.from_pretrained("pc-clip-vit-l")
# Or use DCSMs (Dense Cosine Similarity Maps), which preserve patch/token structure
```
How to detect: If query has multiple objects with different attributes → Use compositional model
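The "How to detect" heuristics above can be folded into a simple pre-flight router. A minimal sketch; the keyword lists and the default fall-through to CLIP are assumptions, not the logic of the shipped `scripts/validate_clip_usage.py` (the fine-grained case is hard to catch from keywords alone and is omitted here):
```python
# Route a query to a model family based on crude keyword heuristics (illustrative)
SPATIAL = ("left of", "right of", "above", "below", "under", "next to")
ATTRIBUTES = ("red", "blue", "green", "yellow", "black", "white")

def route_query(query: str) -> str:
    q = query.lower()
    if "how many" in q or "count" in q:
        return "object detection (e.g. DETR)"
    if any(word in q for word in SPATIAL):
        return "spatial-relation model"
    # crude compositionality check: several attribute words joined by "and"
    if " and " in q and sum(word in q for word in ATTRIBUTES) >= 2:
        return "compositional model (PC-CLIP / DCSMs)"
    return "CLIP"

print(route_query("how many cars are parked outside?"))  # object detection
print(route_query("a red car and a blue truck"))          # compositional model
print(route_query("photos of a beach at sunset"))         # CLIP
```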
---
Evolution Timeline
2021: CLIP Released
- Revolutionary: zero-shot, 400M image-text pairs
- Widely adopted for everything
- Limitations not yet understood
2022-2023: Limitations Discovered
- Cannot count objects
- Poor at fine-grained classification
- Fails spatial reasoning
- Can't bind attributes
2024: Alternatives Emerge
- DCSMs: Preserve patch/token topology
- PC-CLIP: Trained on pairwise comparisons
- SpLiCE: Sparse interpretable embeddings
2025: Current Best Practices
- Use CLIP for what it's good at
- Task-specific models for limitations
- Compositional models for complex queries
LLM Mistake: LLMs trained on 2021-2023 data tend to suggest CLIP for everything because these limitations weren't yet widely documented. This skill corrects that.
---
Validation Script
Before using CLIP, check if it's appropriate:
```bash
python scripts/validate_clip_usage.py \
--query "your query here" \
--check-all
```
Returns:
- ✅ CLIP is appropriate
- ❌ Use alternative (with suggestion)
Task-Specific Guidance
Image Search (CLIP ✅)
```python
# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts
```
Zero-Shot Classification (CLIP ✅)
```python
# Good: Broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
```
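A concrete version of the broad-category case, using the standard CLIP scoring path (the prompt phrasing and image path are illustrative choices):
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

categories = ["indoor", "outdoor", "nature", "urban"]
prompts = [f"a photo of {c} scenery" for c in categories]  # simple prompt template

inputs = processor(text=prompts, images=Image.open("photo.jpg"),  # illustrative path
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# Predicted broad category plus the full probability distribution
print(categories[probs.argmax().item()], probs.tolist())
```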
Object Counting (CLIP ❌)
```python
# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md
```
Fine-Grained Classification (CLIP ❌)
```python
# Use specialized models
# See /references/fine_grained_models.md
```
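As a lighter-weight alternative to full fine-tuning, DINOv2 features (see the model table below) work well for fine-grained retrieval or nearest-neighbor classification. A minimal sketch using the Hugging Face checkpoint facebook/dinov2-base; the two image paths are illustrative:
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

def embed(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feat = model(**inputs).last_hidden_state[:, 0]  # CLS-token embedding
    return feat / feat.norm(dim=-1, keepdim=True)

# Compare two images of visually similar classes (paths are illustrative)
sim = (embed("car_model_a.jpg") @ embed("car_model_b.jpg").T).item()
print(f"cosine similarity: {sim:.3f}")
```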
Spatial Reasoning (CLIP ❌)
```python
# Use spatial relation models
# See /references/spatial_models.md
```
---
Troubleshooting
Issue: CLIP gives unexpected results
Check:
- Is this a counting task? → Use object detection
- Fine-grained classification? → Use specialized model
- Spatial query? → Use spatial model
- Multiple objects with attributes? → Use compositional model
Validation:
```bash
python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"
```
Issue: Low similarity scores
Possible causes:
- Query too specific (CLIP works better with broad concepts)
- Fine-grained task (not CLIP's strength)
- Need to adjust threshold
Solution: Try broader query or use alternative model
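One quick way to test the "query too specific" hypothesis is to score the same image against a specific and a broad phrasing and compare. A small sketch (paths and phrasings are illustrative):
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

queries = [
    "a 2019 Subaru Outback parked on gravel",  # specific: often scores poorly
    "a car parked outdoors",                   # broad: usually a stronger match
]
inputs = processor(text=queries, images=Image.open("photo.jpg"),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]
print(dict(zip(queries, logits.tolist())))
```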
---
Model Selection Guide
| Model | Best For | Avoid For |
|-------|----------|-----------|
| CLIP ViT-L/14 | Semantic search, broad categories | Counting, fine-grained, spatial |
| DETR | Object detection, counting | Semantic similarity |
| DINOv2 | Fine-grained features | Text-image matching |
| PC-CLIP | Attribute binding, comparisons | General embedding |
| DCSMs | Compositional reasoning | Simple similarity |
Performance Notes
CLIP models:
- ViT-B/32: Fast, lower quality
- ViT-L/14: Balanced (recommended)
- ViT-g-14: Highest quality, slower
Inference time (single image, CPU):
- ViT-B/32: ~100ms
- ViT-L/14: ~300ms
- ViT-g-14: ~1000ms
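The figures above are rough CPU numbers and vary with hardware and batch size; a short sketch for measuring latency on your own machine (the model name, image path, and iteration count are illustrative):
```python
import time
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # illustrative path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    model.get_image_features(**inputs)  # warm-up pass
    start = time.perf_counter()
    for _ in range(10):
        model.get_image_features(**inputs)
    print(f"avg image embed time: {(time.perf_counter() - start) / 10 * 1000:.0f} ms")
```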
Further Reading
- /references/clip_limitations.md - Detailed analysis of CLIP's failures
- /references/alternatives.md - When to use what model
- /references/compositional_reasoning.md - DCSMs and PC-CLIP deep dive
- /scripts/validate_clip_usage.py - Pre-flight validation tool
- /scripts/diagnose_clip_issue.py - Debug unexpected results
---
See CHANGELOG.md for version history.