datamol

What it does

Simplifies molecular cheminformatics by providing a Pythonic, user-friendly wrapper around RDKit for efficient molecular processing and analysis.

Part of ovachiever/droid-tings (370 items)

Installation

Clone the repository:

```bash
git clone https://github.com/ovachiever/droid-tings.git
```

Extracted from docs: ovachiever/droid-tings

Skill Details

SKILL.md

"Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery: SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly."

# Datamol Cheminformatics Skill

Overview

Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native rdkit.Chem.Mol instances, ensuring full compatibility with the RDKit ecosystem.

Key capabilities:

  • Molecular format conversion (SMILES, SELFIES, InChI)
  • Structure standardization and sanitization
  • Molecular descriptors and fingerprints
  • 3D conformer generation and analysis
  • Clustering and diversity selection
  • Scaffold and fragment analysis
  • Chemical reaction application
  • Visualization and alignment
  • Batch processing with parallelization
  • Cloud storage support via fsspec
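
Because every molecule really is a plain rdkit.Chem.Mol, datamol output can be handed straight to RDKit. A minimal interoperability sketch (the Descriptors and MolToSmiles calls are standard RDKit API, used here purely for illustration):

```python
import datamol as dm
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = dm.to_mol("c1ccccc1O")  # phenol, returned as a plain rdkit.Chem.Mol
assert isinstance(mol, Chem.Mol)

# RDKit functions accept the datamol-created molecule directly
print(Descriptors.MolWt(mol))   # molecular weight via RDKit
print(Chem.MolToSmiles(mol))    # RDKit's own SMILES writer works unchanged
```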

Installation and Setup

Guide users to install datamol:

```bash
uv pip install datamol
```

Import convention:

```python
import datamol as dm
```

Core Workflows

1. Basic Molecule Handling

Creating molecules from SMILES:

```python
import datamol as dm

# Single molecule
mol = dm.to_mol("CCO")  # Ethanol

# From a list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]

# Error handling: invalid SMILES return None
mol = dm.to_mol("invalid_smiles")
if mol is None:
    print("Failed to parse SMILES")
```

Converting molecules to SMILES:

```python
# Canonical SMILES
smiles = dm.to_smiles(mol)

# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)

# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
```

Standardization and sanitization (always recommend for user-provided molecules):

```python
# Sanitize a molecule
mol = dm.sanitize_mol(mol)

# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True,
)

# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
```

2. Reading and Writing Molecular Files

Refer to references/io_module.md for comprehensive I/O documentation.

Reading files:

```python
# SDF files (most common in chemistry)
df = dm.read_sdf("compounds.sdf", mol_column='mol')

# SMILES files
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')

# CSV with a SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")

# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")

# Universal reader (auto-detects the format)
df = dm.open_df("file.sdf")  # Works with .sdf, .csv, .xlsx, .parquet, .json
```

Writing files:

```python
# Save as SDF
dm.to_sdf(mols, "output.sdf")

# Or from a DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")

# Save as a SMILES file
dm.to_smi(mols, "output.smi")

# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
```

Remote file support (S3, GCS, HTTP):

```python
# Read from cloud storage
df = dm.read_sdf("s3://bucket/compounds.sdf")
df = dm.read_csv("https://example.com/data.csv")

# Write to cloud storage
dm.to_sdf(mols, "s3://bucket/output.sdf")
```

3. Molecular Descriptors and Properties

Refer to references/descriptors_viz.md for detailed descriptor documentation.

Computing descriptors for a single molecule:

```python
# Get the standard descriptor set
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
#           'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
```

Batch descriptor computation (recommended for datasets):

```python
# Compute descriptors for all molecules in parallel
desc_df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1,      # Use all CPU cores
    progress=True,  # Show a progress bar
)
```

Specific descriptors:

```python
# Aromaticity
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)

# Stereochemistry
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)

# Flexibility
n_rigid = dm.descriptors.n_rigid_bonds(mol)
```

Drug-likeness filtering (Lipinski's Rule of Five):

```python
# Filter compounds by Lipinski's Rule of Five
def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    return (
        desc['mw'] <= 500
        and desc['logp'] <= 5
        and desc['hbd'] <= 5
        and desc['hba'] <= 10
    )

druglike_mols = [mol for mol in mols if is_druglike(mol)]
```

4. Molecular Fingerprints and Similarity

Generating fingerprints:

```python
# ECFP (Extended-Connectivity Fingerprint, the default)
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)

# Other fingerprint types
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
```

Similarity calculations:

```python
# Pairwise distances within a set
distance_matrix = dm.pdist(mols, n_jobs=-1)

# Distances between two sets
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)

# Find the most similar molecules
from scipy.spatial.distance import squareform

dist_matrix = squareform(dm.pdist(mols))
# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
nearest_idx = dist_matrix.argsort(axis=1)[:, 1]  # each molecule's nearest neighbour (column 0 is the molecule itself)
```

5. Clustering and Diversity Selection

Refer to references/core_api.md for clustering details.

Butina clustering:

```python
# Cluster molecules by structural similarity
clusters = dm.cluster_mols(
    mols,
    cutoff=0.2,  # Tanimoto distance threshold (0 = identical, 1 = completely different)
    n_jobs=-1,   # Parallel processing
)

# Each cluster is a list of molecule indices
for i, cluster in enumerate(clusters):
    print(f"Cluster {i}: {len(cluster)} molecules")
    cluster_mols = [mols[idx] for idx in cluster]
```

Important: Butina clustering builds a full distance matrix, so it is suitable for roughly 1,000 molecules but not for 10,000+.

Diversity selection:

```python
# Pick a diverse subset
diverse_mols = dm.pick_diverse(
    mols,
    npick=100,  # Select 100 diverse molecules
)

# Pick cluster centroids
centroids = dm.pick_centroids(
    mols,
    npick=50,  # Select 50 representative molecules
)
```

6. Scaffold Analysis

Refer to references/fragments_scaffolds.md for complete scaffold documentation.

Extracting Murcko scaffolds:

```python
# Get the Bemis-Murcko scaffold (core structure)
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
```

Scaffold-based analysis:

```python
# Group compounds by scaffold
from collections import Counter

scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Count scaffold frequency
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)

# Create a scaffold-to-molecules mapping
scaffold_groups = {}
for mol, scaf_smi in zip(mols, scaffold_smiles):
    if scaf_smi not in scaffold_groups:
        scaffold_groups[scaf_smi] = []
    scaffold_groups[scaf_smi].append(mol)
```

Scaffold-based train/test splitting (for ML):

```python
# Ensure train and test sets have different scaffolds
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
    if scaf not in scaffold_to_mols:
        scaffold_to_mols[scaf] = []
    scaffold_to_mols[scaf].append(mol)

# Split scaffolds into train/test
import random

scaffolds = list(scaffold_to_mols.keys())
random.shuffle(scaffolds)
split_idx = int(0.8 * len(scaffolds))
train_scaffolds = scaffolds[:split_idx]
test_scaffolds = scaffolds[split_idx:]

# Get the molecules for each split
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
```

7. Molecular Fragmentation

Refer to references/fragments_scaffolds.md for fragmentation details.

BRICS fragmentation (16 bond types):

```python
# Fragment a molecule
fragments = dm.fragment.brics(mol)
# Returns a set of fragment SMILES with attachment points like '[1*]CCN'
```

RECAP fragmentation (11 bond types):

```python
fragments = dm.fragment.recap(mol)
```

Fragment analysis:

```python
# Find common fragments across a compound library
from collections import Counter

all_fragments = []
for mol in mols:
    frags = dm.fragment.brics(mol)
    all_fragments.extend(frags)

fragment_counts = Counter(all_fragments)
common_frags = fragment_counts.most_common(20)

# Fragment-based scoring
def fragment_score(mol, reference_fragments):
    mol_frags = dm.fragment.brics(mol)
    overlap = mol_frags.intersection(reference_fragments)
    return len(overlap) / len(mol_frags) if mol_frags else 0
```

8. 3D Conformer Generation

Refer to references/conformers_module.md for detailed conformer documentation.

Generating conformers:

```python
# Generate 3D conformers
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,            # Number to generate (auto if None)
    rms_cutoff=0.5,        # Filter similar conformers (Ångströms)
    minimize_energy=True,  # Minimize with the UFF force field
    method='ETKDGv3',      # Embedding method (recommended)
)

# Access conformers
n_conformers = mol_3d.GetNumConformers()
conf = mol_3d.GetConformer(0)    # Get the first conformer
positions = conf.GetPositions()  # Nx3 array of atom coordinates
```

Conformer clustering:

```python
# Cluster conformers by RMSD
clusters = dm.conformers.cluster(
    mol_3d,
    rms_cutoff=1.0,
    centroids=False,
)

# Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
```

SASA calculation:

```python
# Calculate the solvent accessible surface area
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)

# Access SASA from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```

9. Visualization

Refer to references/descriptors_viz.md for visualization documentation.

Basic molecule grid:

```python
# Visualize molecules in a grid
dm.viz.to_image(
    mols[:20],
    legends=[dm.to_smiles(m) for m in mols[:20]],
    n_cols=5,
    mol_size=(300, 300),
)

# Save to a file
dm.viz.to_image(mols, outfile="molecules.png")

# SVG for publications
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
```

Aligned visualization (for SAR analysis):

```python
# Align molecules by their common substructure
dm.viz.to_image(
    similar_mols,
    align=True,  # Enable MCS alignment
    legends=activity_labels,
    n_cols=4,
)
```

Highlighting substructures:

```python
# Highlight specific atoms and bonds
dm.viz.to_image(
    mol,
    highlight_atom=[0, 1, 2, 3],  # Atom indices
    highlight_bond=[0, 1, 2],     # Bond indices
)
```

Conformer visualization:

```python
# Display multiple conformers
dm.viz.conformers(
    mol_3d,
    n_confs=10,
    align_conf=True,
    n_cols=3,
)
```

10. Chemical Reactions

Refer to references/reactions_data.md for reactions documentation.

Applying reactions:

```python
from rdkit.Chem import rdChemReactions

# Define a reaction from SMARTS
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)

# Apply it to a molecule
reactant = dm.to_mol("CC(=O)O")  # Acetic acid
product = dm.reactions.apply_reaction(
    rxn,
    (reactant,),
    sanitize=True,
)

# Convert to SMILES
product_smiles = dm.to_smiles(product)
```

Batch reaction application:

```python
# Apply the reaction to a library
products = []
for mol in reactant_mols:
    try:
        prod = dm.reactions.apply_reaction(rxn, (mol,))
        if prod is not None:
            products.append(prod)
    except Exception as e:
        print(f"Reaction failed: {e}")
```

Parallelization

Datamol includes built-in parallelization for many operations. Use the n_jobs parameter:

  • n_jobs=1: Sequential (no parallelization)
  • n_jobs=-1: Use all available CPU cores
  • n_jobs=4: Use 4 cores

Functions supporting parallelization:

  • dm.read_sdf(..., n_jobs=-1)
  • dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)
  • dm.cluster_mols(..., n_jobs=-1)
  • dm.pdist(..., n_jobs=-1)
  • dm.conformers.sasa(..., n_jobs=-1)

Progress bars: many batch operations also accept a progress=True parameter.
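
For custom per-molecule functions, datamol also provides a generic dm.parallelized helper. A minimal sketch, assuming the commonly documented call pattern (the function first, then the list of inputs, plus n_jobs and progress):

```python
import datamol as dm

def heavy_atom_count(mol):
    # Any cheap per-molecule function works here
    return mol.GetNumHeavyAtoms()

# Apply the function across the list in parallel, with a progress bar
counts = dm.parallelized(heavy_atom_count, mols, n_jobs=-1, progress=True)
```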

Common Workflows and Patterns

Complete Pipeline: Data Loading β†’ Filtering β†’ Analysis

```python
import datamol as dm
import pandas as pd

# 1. Load molecules
df = dm.read_sdf("compounds.sdf")

# 2. Standardize, drop failures, and reset the index so it lines up with the descriptor table
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
df = df[df['mol'].notna()].reset_index(drop=True)

# 3. Compute descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(
    df['mol'].tolist(),
    n_jobs=-1,
    progress=True,
)

# 4. Filter by drug-likeness
druglike = (
    (desc_df['mw'] <= 500)
    & (desc_df['logp'] <= 5)
    & (desc_df['hbd'] <= 5)
    & (desc_df['hba'] <= 10)
)
filtered_df = df[druglike]

# 5. Cluster and select a diverse subset
diverse_mols = dm.pick_diverse(
    filtered_df['mol'].tolist(),
    npick=100,
)

# 6. Visualize the results
dm.viz.to_image(
    diverse_mols,
    legends=[dm.to_smiles(m) for m in diverse_mols],
    outfile="diverse_compounds.png",
    n_cols=10,
)
```

Structure-Activity Relationship (SAR) Analysis

```python
# Group by scaffold
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Create a DataFrame with activities
sar_df = pd.DataFrame({
    'mol': mols,
    'scaffold': scaffold_smiles,
    'activity': activities,  # User-provided activity data
})

# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
    if len(group) >= 3:  # Need multiple examples
        print(f"\nScaffold: {scaffold}")
        print(f"Count: {len(group)}")
        print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")

        # Visualize with activities as legends
        dm.viz.to_image(
            group['mol'].tolist(),
            legends=[f"Activity: {act:.2f}" for act in group['activity']],
            align=True,  # Align by common substructure
        )
```

Virtual Screening Pipeline

```python
import numpy as np

# 1. Generate fingerprints for the query actives and the library
#    (dm.cdist below fingerprints the molecules internally; keep these if you want to reuse them elsewhere)
query_fps = [dm.to_fp(mol) for mol in query_actives]
library_fps = [dm.to_fp(mol) for mol in library_mols]

# 2. Calculate distances between the query set and the library
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)

# 3. Find the closest matches (minimum distance to any query)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances  # Convert distance to similarity

# 4. Rank and select the top hits
top_indices = np.argsort(similarities)[::-1][:100]  # Top 100
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]

# 5. Visualize the hits
dm.viz.to_image(
    top_hits[:20],
    legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
    outfile="screening_hits.png",
)
```

Reference Documentation

For detailed API documentation, consult these reference files:

  • references/core_api.md: Core namespace functions (conversions, standardization, fingerprints, clustering)
  • references/io_module.md: File I/O operations (read/write SDF, CSV, Excel, remote files)
  • references/conformers_module.md: 3D conformer generation, clustering, SASA calculations
  • references/descriptors_viz.md: Molecular descriptors and visualization functions
  • references/fragments_scaffolds.md: Scaffold extraction, BRICS/RECAP fragmentation
  • references/reactions_data.md: Chemical reactions and toy datasets

Best Practices

  1. Always standardize molecules from external sources:

```python
mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
```

  2. Check for None values after molecule parsing:

```python
mol = dm.to_mol(smiles)
if mol is None:
    ...  # Handle the invalid SMILES here
```

  3. Use parallel processing for large datasets:

```python
result = dm.operation(..., n_jobs=-1, progress=True)  # illustrative pattern; substitute any batch helper
```

  4. Leverage fsspec for cloud storage:

```python
df = dm.read_sdf("s3://bucket/compounds.sdf")
```

  5. Use appropriate fingerprints for similarity:

- ECFP (Morgan): General purpose, structural similarity

- MACCS: Fast, smaller feature space

- Atom pairs: Considers atom pairs and distances

  6. Consider scale limitations:

- Butina clustering: ~1,000 molecules (full distance matrix)

- For larger datasets: Use diversity selection or hierarchical methods

  7. Scaffold splitting for ML: ensure proper train/test separation by scaffold.
  8. Align molecules when visualizing SAR series.

Error Handling

```python
# Safe molecule creation
def safe_to_mol(smiles):
    try:
        mol = dm.to_mol(smiles)
        if mol is not None:
            mol = dm.standardize_mol(mol)
        return mol
    except Exception as e:
        print(f"Failed to process {smiles}: {e}")
        return None

# Safe batch processing
valid_mols = []
for smiles in smiles_list:
    mol = safe_to_mol(smiles)
    if mol is not None:
        valid_mols.append(mol)
```

Integration with Machine Learning

```python
import numpy as np

# Feature generation from fingerprints
X = np.array([dm.to_fp(mol) for mol in mols])

# Or from descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values

# Train a model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, y_target)

# Predict
predictions = model.predict(X_test)
```

Troubleshooting

Issue: Molecule parsing fails

  • Solution: Use dm.standardize_smiles() first or try dm.fix_mol()

Issue: Memory errors with clustering

  • Solution: Use dm.pick_diverse() instead of full clustering for large sets

Issue: Slow conformer generation

  • Solution: Reduce n_confs or increase rms_cutoff to generate fewer conformers

Issue: Remote file access fails

  • Solution: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)
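
A quick way to verify that the needed backend is importable in the current environment (fsspec.filesystem is fsspec's own API, shown here only as an illustrative check):

```python
import fsspec

# Raises ImportError with an install hint if the s3 backend (s3fs) is missing
fs = fsspec.filesystem("s3")
```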

Additional Resources

  • Datamol Documentation: https://docs.datamol.io/
  • RDKit Documentation: https://www.rdkit.org/docs/
  • GitHub Repository: https://github.com/datamol-io/datamol