🎯 clip

Skill from ovachiever/droid-tings

What it does

Enables zero-shot image classification and cross-modal retrieval by understanding images through natural language descriptions.

📦 Part of ovachiever/droid-tings (370 items)

Installation

pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm

📖 Extracted from docs: ovachiever/droid-tings
17 installs · Added Feb 4, 2026

Skill Details

SKILL.md

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Overview

# CLIP - Contrastive Language-Image Pre-Training

OpenAI's model that understands images through natural language supervision.

When to use CLIP

Use when:

  • Zero-shot image classification (no training data needed)
  • Image-text similarity/matching
  • Semantic image search
  • Content moderation (detect NSFW, violence)
  • Visual question answering
  • Cross-modal retrieval (image→text, text→image)

Metrics:

  • 25,300+ GitHub stars
  • Trained on 400M image-text pairs
  • Matches ResNet-50 on ImageNet (zero-shot)
  • MIT License

Use alternatives instead:

  • BLIP-2: Better captioning
  • LLaVA: Vision-language chat
  • Segment Anything: Image segmentation

Quick start

Installation

```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

Zero-shot classification

```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
labels = ["a dog", "a cat", "a bird", "a car"]
text = clip.tokenize(labels).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity (the model's forward pass returns scaled logits)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```

Available models

```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
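
The hard-coded list above can drift from what your installed package actually ships; `clip.available_models()` (part of the official API) returns the loadable checkpoint names at runtime, so the output shown in the comment is only an example.

```python
import clip

# Ask the installed package which checkpoints it can load
print(clip.available_models())
# e.g. ['RN50', 'RN101', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', ...]
```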

Image-text similarity

```python
# Compute embeddings (assumes a single image and a single tokenized text prompt)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (.item() requires exactly one image-text pair)
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```

Semantic image search

```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```

Content moderation

```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
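
In practice you usually want a flag/allow decision rather than a single top category. A minimal follow-on sketch, reusing the `probs` and `categories` from the snippet above; the 0.5 threshold is illustrative and should be tuned on your own labeled data.

```python
# Flag the image if the combined probability of the unsafe categories crosses a threshold
unsafe_categories = {"not safe for work", "violent content", "graphic content"}
unsafe_prob = sum(
    probs[0, i].item() for i, c in enumerate(categories) if c in unsafe_categories
)
if unsafe_prob > 0.5:  # illustrative threshold, not a recommended value
    print(f"Flag for review (unsafe probability {unsafe_prob:.2%})")
```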

Batch processing

```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```
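
The similarity matrix converts directly into per-image predictions. A short follow-on sketch, reusing the `similarities` and `texts` tensors defined just above:

```python
# Pick the best-matching text for each image
best = similarities.argmax(dim=-1)  # shape (10,), index of the top text per image
for i, text_idx in enumerate(best.tolist()):
    score = similarities[i, text_idx].item()
    print(f"img{i}.jpg -> {texts[text_idx]} (cosine {score:.3f})")
```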

Integration with vector databases

```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings (reuses image_paths and image_embeddings from the search example)
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))

results = collection.query(
    query_embeddings=[text_embedding[0].cpu().numpy().tolist()],
    n_results=5
)
```
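
To use the matches, read them back out of the query result. A minimal sketch, assuming Chroma's default query output (lists of ids, metadatas, and distances, one inner list per query embedding):

```python
# Walk the matches for the first (and only) query embedding
for img_id, meta, dist in zip(
    results["ids"][0], results["metadatas"][0], results["distances"][0]
):
    print(f"{meta['path']} (id={img_id}): distance {dist:.3f}")
```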

Best practices

  1. Use ViT-B/32 for most cases - Good balance of speed and quality
  2. Normalize embeddings - Required for cosine similarity
  3. Batch processing - More efficient than encoding one item at a time
  4. Cache embeddings - Expensive to recompute (see the sketch after this list)
  5. Use descriptive labels - Better zero-shot performance
  6. GPU recommended - 10-50× faster
  7. Preprocess images - Use the provided preprocess function
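
A minimal sketch of practice 4, assuming the `model`, `preprocess`, and `device` objects from the quick start; the helper name and cache layout are illustrative, not part of the CLIP API.

```python
from pathlib import Path

import torch
from PIL import Image

def cached_image_embedding(img_path: str, cache_dir: str = "clip_cache") -> torch.Tensor:
    """Encode an image once, then reuse the embedding saved on disk (illustrative helper)."""
    # Cache key is the file name only; hash the full path instead if names can collide
    cache_file = Path(cache_dir) / (Path(img_path).stem + ".pt")
    if cache_file.exists():
        return torch.load(cache_file)

    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = model.encode_image(image)
        emb /= emb.norm(dim=-1, keepdim=True)

    cache_file.parent.mkdir(parents=True, exist_ok=True)
    torch.save(emb.cpu(), cache_file)
    return emb.cpu()
```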

Performance

| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |

Limitations

  1. Not for fine-grained tasks - Best for broad categories
  2. Requires descriptive text - Vague labels perform poorly (see the prompt-template sketch below)
  3. Biased on web data - May have dataset biases
  4. No bounding boxes - Whole image only
  5. Limited spatial understanding - Position/counting weak
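
A common mitigation for limitations 1 and 2 (and best practice 5) is prompt ensembling, as used in the original CLIP paper's zero-shot evaluation: wrap bare class names in several descriptive templates and average the normalized text embeddings. A minimal sketch, assuming the `model` and `device` from the quick start; the template strings are illustrative.

```python
labels = ["dog", "cat", "bird", "car"]
templates = ["a photo of a {}", "a close-up photo of a {}", "a blurry photo of a {}"]

with torch.no_grad():
    label_features = []
    for label in labels:
        # Encode every template for this label and average the normalized embeddings
        tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
        feats = model.encode_text(tokens)
        feats /= feats.norm(dim=-1, keepdim=True)
        label_features.append(feats.mean(dim=0))
    label_features = torch.stack(label_features)
    label_features /= label_features.norm(dim=-1, keepdim=True)

# Score an image exactly as in zero-shot classification:
# probs = (image_features @ label_features.T).softmax(dim=-1)
```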

Resources

  • GitHub: https://github.com/openai/CLIP ⭐ 25,300+
  • Paper: https://arxiv.org/abs/2103.00020
  • Colab: https://colab.research.google.com/github/openai/clip/
  • License: MIT