anndata

Skill from ovachiever/droid-tings

What it does

Manages and manipulates annotated data matrices, especially for single-cell genomics, with efficient storage and multi-dimensional metadata handling.

Part of

ovachiever/droid-tings (370 items)

Installation

Clone the repository:

git clone https://github.com/ovachiever/droid-tings.git

Extracted from docs: ovachiever/droid-tings

Added Feb 4, 2026

Skill Details

SKILL.md

This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.

# AnnData

Overview

AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

When to Use This Skill

Use this skill when:

  • Creating, reading, or writing AnnData objects
  • Working with h5ad, zarr, or other genomics data formats
  • Performing single-cell RNA-seq analysis
  • Managing large datasets with sparse matrices or backed mode
  • Concatenating multiple datasets or experimental batches
  • Subsetting, filtering, or transforming annotated data
  • Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

Installation

```bash
# Basic installation
uv pip install anndata

# With optional dependencies (quoted so the extras don't trip shell globbing)
uv pip install "anndata[dev,test,doc]"
```

Quick Start

Creating an AnnData object

```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)
```

Reading data

```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read in backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')

# 10X HDF5 files are read via scanpy, which returns an AnnData object
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```

Writing data

```python
# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```

Basic operations

```python
# Subset by condition
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```

Core Capabilities

1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

See: references/data_structure.md for comprehensive information on:

  • Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
  • Creating AnnData objects from various sources
  • Accessing and manipulating data components
  • Memory-efficient practices

2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

See: references/io_operations.md for details on:

  • Native formats (h5ad, zarr)
  • Alternative formats (CSV, MTX, Loom, 10X, Excel)
  • Backed mode for large datasets
  • Remote data access
  • Format conversion
  • Performance optimization

Common commands:

```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format (stored genes x cells, so transpose to cells x genes)
adata = ad.read_mtx('matrix.mtx').T
```

3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See: references/concatenation.md for comprehensive coverage of:

  • Basic concatenation (axis=0 for observations, axis=1 for variables)
  • Join types (inner, outer)
  • Merge strategies (same, unique, first, only)
  • Tracking data sources with labels
  • Lazy concatenation (AnnCollection)
  • On-disk concatenation for large datasets

Common commands:

```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation
from anndata.experimental import AnnCollection

# AnnCollection expects AnnData objects (e.g. opened in backed mode), not file paths
adatas = [ad.read_h5ad(p, backed='r') for p in ['data1.h5ad', 'data2.h5ad']]
collection = AnnCollection(adatas, join_obs='outer', label='dataset')
```

4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See: references/manipulation.md for detailed guidance on:

  • Subsetting (by indices, names, boolean masks, metadata conditions)
  • Transposition
  • Copying (full copies vs views)
  • Renaming (observations, variables, categories)
  • Type conversions (strings to categoricals, sparse/dense)
  • Adding/removing data components
  • Reordering
  • Quality control filtering

Common commands:

```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]         # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```

5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See: references/best_practices.md for guidelines on:

  • Memory management (sparse matrices, categoricals, backed mode)
  • Views vs copies
  • Data storage optimization
  • Performance optimization
  • Working with raw data
  • Metadata management
  • Reproducibility
  • Error handling
  • Integration with other tools
  • Common pitfalls and solutions

Key recommendations:

```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```

Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

Scanpy (Single-cell analysis)

```python
import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```

Muon (Multimodal data)

```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```

PyTorch integration

```python
from anndata.experimental import AnnLoader

# Create a DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X
    # Train model on this mini-batch
```

Common Workflows

Single-cell RNA-seq analysis

```python
import anndata as ad
import numpy as np
import scanpy as sc

# 1. Load data (10X HDF5 files are read via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control (ravel: sums over a sparse X return a 2-D matrix)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000].copy()  # materialize the view before in-place ops

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']].copy()

# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```

Batch integration

```python
# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```

Working with large datasets

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (returns a view; no data loaded yet)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load the filtered subset into memory
adata_subset = high_quality.to_memory()

# Process the subset (process() is a placeholder for your analysis)
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)
```

Troubleshooting

Out of memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```

Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

Index alignment issues

Always align external data on index:

```python
# Wrong: relies on row order, which may not match adata's observations
adata.obs['new_col'] = external_data['values']

# Correct: align explicitly on the observation index
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

Additional Resources

  • Official documentation: https://anndata.readthedocs.io/
  • Scanpy tutorials: https://scanpy.readthedocs.io/
  • Scverse ecosystem: https://scverse.org/
  • GitHub repository: https://github.com/scverse/anndata