pubchem-database

🎯Skill

from ovachiever/droid-tings

What it does

Queries PubChem's chemical database to search, retrieve, and analyze molecular properties, structures, and bioactivity data across 110M+ compounds.

Part of

ovachiever/droid-tings (370 items)

pubchem-database

Installation

Clone repository
git clone https://github.com/ovachiever/droid-tings.git
Extracted from docs: ovachiever/droid-tings
Added Feb 4, 2026

Skill Details

SKILL.md

"Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics."

# PubChem Database

Overview

PubChem is the world's largest freely available chemical database, with 110M+ compounds and 270M+ bioactivities. Query chemical structures by name, CID, or SMILES; retrieve molecular properties; perform similarity and substructure searches; and access bioactivity data using the PUG-REST API and PubChemPy.

When to Use This Skill

This skill should be used when:

  • Searching for chemical compounds by name, structure (SMILES/InChI), or molecular formula
  • Retrieving molecular properties (MW, LogP, TPSA, hydrogen bonding descriptors)
  • Performing similarity searches to find structurally related compounds
  • Conducting substructure searches for specific chemical motifs
  • Accessing bioactivity data from screening assays
  • Converting between chemical identifier formats (CID, SMILES, InChI)
  • Batch processing multiple compounds for drug-likeness screening or property analysis

Core Capabilities

1. Chemical Structure Search

Search for compounds using multiple identifier types:

By Chemical Name:

```python

import pubchempy as pcp

compounds = pcp.get_compounds('aspirin', 'name')

compound = compounds[0]

```

By CID (Compound ID):

```python

compound = pcp.Compound.from_cid(2244) # Aspirin

```

By SMILES:

```python

compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]

```

By InChI:

```python

compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]

```

By Molecular Formula:

```python

compounds = pcp.get_compounds('C9H8O4', 'formula')

# Returns all compounds matching this formula

```

2. Property Retrieval

Retrieve molecular properties for compounds using either high-level or low-level approaches:

Using PubChemPy (Recommended):

```python

import pubchempy as pcp

# Get compound object with all properties

compound = pcp.get_compounds('caffeine', 'name')[0]

# Access individual properties

molecular_formula = compound.molecular_formula

molecular_weight = compound.molecular_weight

iupac_name = compound.iupac_name

smiles = compound.canonical_smiles

inchi = compound.inchi

xlogp = compound.xlogp # Partition coefficient

tpsa = compound.tpsa # Topological polar surface area

```

Get Specific Properties:

```python
# Request only specific properties
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
```

Batch Property Retrieval:

```python
import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []

for name in compound_names:
    # One request per name; each call returns a list of dictionaries
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)

df = pd.DataFrame(all_properties)
```

Available Properties: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see references/api_reference.md for complete list).
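
The same property names can also be requested from PUG-REST directly. The following is a minimal sketch, assuming the usual `compound/<namespace>/<identifiers>/property/<names>/JSON` URL pattern; the helper name `build_property_url` is hypothetical and is not part of PubChemPy or of this skill's scripts. It only builds the URL and sends no request.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(identifiers, namespace, properties, output="JSON"):
    """Build a PUG-REST property URL (hypothetical helper, for illustration)."""
    if isinstance(identifiers, (str, int)):
        identifiers = [identifiers]
    # Identifiers are comma-separated; names may contain spaces, so encode them.
    id_part = ",".join(quote(str(i), safe="") for i in identifiers)
    prop_part = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/{namespace}/{id_part}/property/{prop_part}/{output}"


# Two CIDs in one request (comma-separated lists are intended for CIDs here)
url = build_property_url([2244, 3672], "cid", ["MolecularWeight", "XLogP"])
print(url)
```

Passing the resulting URL to `requests.get` would then return the selected properties for both CIDs in a single response, which is one way to cut down the number of requests.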

3. Similarity Search

Find structurally similar compounds using Tanimoto similarity:

```python
import pubchempy as pcp

# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles

# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,  # Similarity threshold (0-100)
    MaxRecords=50
)

# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f"  MW: {compound.molecular_weight}")
```

Note: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
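
To make the `Threshold` value more concrete: the Tanimoto coefficient is the number of fingerprint bits two compounds share divided by the number of bits set in either. The sketch below uses made-up bit positions rather than real PubChem fingerprints, and the function name `tanimoto` is only illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Toy fingerprints (made-up bit positions, not real PubChem fingerprints)
query = {1, 4, 7, 9, 12}
candidate = {1, 4, 7, 10, 12, 15}

score = tanimoto(query, candidate)  # 4 shared bits / 7 bits in total
print(f"Tanimoto: {score:.2f}")
print("Passes Threshold=85:", score * 100 >= 85)
```

With these toy inputs the candidate would fall well below a threshold of 85, so a search with that setting would not return it.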

4. Substructure Search

Find compounds containing a specific structural motif:

```python
import pubchempy as pcp

# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'

matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

print(f"Found {len(matches)} compounds containing pyridine")
```

Common Substructures:

  • Benzene ring: c1ccccc1
  • Pyridine: c1ccncc1
  • Phenol: c1ccc(O)cc1
  • Carboxylic acid: C(=O)O

5. Format Conversion

Convert between different chemical structure formats:

```python
import pubchempy as pcp

compound = pcp.get_compounds('aspirin', 'name')[0]

# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid

# Download structure files
# Argument order is: output format, file path, identifier, namespace
pcp.download('SDF', 'aspirin.sdf', 'aspirin', 'name', overwrite=True)
pcp.download('JSON', 'aspirin.json', 2244, 'cid', overwrite=True)
```

6. Structure Visualization

Generate 2D structure images:

```python
import pubchempy as pcp
import requests

# Download compound structure as PNG
# Argument order is: output format, file path, identifier, namespace
pcp.download('PNG', 'caffeine.png', 'caffeine', 'name', overwrite=True)

# Using direct URL (via requests)
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
response.raise_for_status()  # avoid writing an error page to disk

with open('structure.png', 'wb') as f:
    f.write(response.content)
```

7. Synonym Retrieval

Get all known names and synonyms for a compound:

```python
import pubchempy as pcp

synonyms_data = pcp.get_synonyms('aspirin', 'name')

if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']
    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]:  # First 10
        print(f"  - {syn}")
```

8. Bioactivity Data Access

Retrieve biological activity data from assays:

```python
import requests

# Get bioassay summary for a compound
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    # Process bioassay information
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
```

For more complex bioactivity queries, use the scripts/bioactivity_query.py helper script which provides:

  • Bioassay summaries with activity outcome filtering
  • Assay target identification
  • Search for compounds by biological target
  • Active compound lists for specific assays
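
As an illustration of activity outcome filtering, the sketch below tallies outcomes from an assay-summary style table. It assumes the response lists column names under `Table` → `Columns` → `Column` and each row's values under `Row[i]` → `Cell`, with a column named `Activity Outcome`; the function name `count_outcomes` and the sample data are hypothetical, and the helper script may work differently.

```python
from collections import Counter


def count_outcomes(data, column="Activity Outcome"):
    """Tally one column of an assay-summary style table (assumed layout)."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    if column not in columns:
        return Counter()  # column missing: nothing to count
    idx = columns.index(column)
    return Counter(row["Cell"][idx] for row in table.get("Row", []))


# Hand-written sample in the assumed layout (not real PubChem data)
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "CID", "Activity Outcome"]},
        "Row": [
            {"Cell": ["1", "2244", "Active"]},
            {"Cell": ["2", "2244", "Inactive"]},
            {"Cell": ["3", "2244", "Inactive"]},
        ],
    }
}

counts = count_outcomes(sample)
print(dict(counts))
```

Looking the column up by name rather than by a fixed position keeps the sketch working if the column order in the response differs from the sample.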

9. Comprehensive Compound Annotations

Access detailed compound information through PUG-View:

```python
import requests

cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)

if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
```

Get Specific Section:

```python

# Get only drug information (spaces in the heading are URL-encoded as '+')
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug+and+Medication+Information"

```
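
PUG-View responses are deeply nested, so it can help to list the available headings before requesting one. The sketch below assumes the response keeps its sections under `Record` → `Section`, each with a `TOCHeading` and optional nested `Section` list; the function name `list_headings` and the sample record are hypothetical.

```python
def list_headings(sections, depth=0, max_depth=1):
    """Collect TOCHeading values from nested PUG-View style sections."""
    headings = []
    for section in sections:
        headings.append("  " * depth + section.get("TOCHeading", "?"))
        if depth < max_depth:
            # Recurse into subsections, indenting one level further
            headings.extend(
                list_headings(section.get("Section", []), depth + 1, max_depth)
            )
    return headings


# Hand-written sample in the assumed layout (not real PubChem data)
sample = {
    "Record": {
        "RecordTitle": "Aspirin",
        "Section": [
            {"TOCHeading": "Names and Identifiers",
             "Section": [{"TOCHeading": "Computed Descriptors"}]},
            {"TOCHeading": "Safety and Hazards"},
        ],
    }
}

for line in list_headings(sample["Record"]["Section"]):
    print(line)
```

With a real response, `annotations["Record"]["Section"]` would take the place of the sample, and any printed top-level heading could then be used as the `heading` value in the URL.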

Installation Requirements

Install PubChemPy for Python-based access:

```bash

uv pip install pubchempy

```

For direct API access and bioactivity queries:

```bash

uv pip install requests

```

Optional for data analysis:

```bash

uv pip install pandas

```

Helper Scripts

This skill includes Python scripts for common PubChem tasks:

scripts/compound_search.py

Provides utility functions for searching and retrieving compound information:

Key Functions:

  • search_by_name(name, max_results=10): Search compounds by name
  • search_by_smiles(smiles): Search by SMILES string
  • get_compound_by_cid(cid): Retrieve compound by CID
  • get_compound_properties(identifier, namespace, properties): Get specific properties
  • similarity_search(smiles, threshold, max_records): Perform similarity search
  • substructure_search(smiles, max_records): Perform substructure search
  • get_synonyms(identifier, namespace): Get all synonyms
  • batch_search(identifiers, namespace, properties): Batch search multiple compounds
  • download_structure(identifier, namespace, format, filename): Download structures
  • print_compound_info(compound): Print formatted compound information

Usage:

```python

from scripts.compound_search import search_by_name, get_compound_properties

# Search for a compound

compounds = search_by_name('ibuprofen')

# Get specific properties

props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])

```

scripts/bioactivity_query.py

Provides functions for retrieving biological activity data:

Key Functions:

  • get_bioassay_summary(cid): Get bioassay summary for compound
  • get_compound_bioactivities(cid, activity_outcome): Get filtered bioactivities
  • get_assay_description(aid): Get detailed assay information
  • get_assay_targets(aid): Get biological targets for assay
  • search_assays_by_target(target_name, max_results): Find assays by target
  • get_active_compounds_in_assay(aid, max_results): Get active compounds
  • get_compound_annotations(cid, section): Get PUG-View annotations
  • summarize_bioactivities(cid): Generate bioactivity summary statistics
  • find_compounds_by_bioactivity(target, threshold, max_compounds): Find compounds by target

Usage:

```python

from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities

# Get bioactivity summary

summary = summarize_bioactivities(2244) # Aspirin

print(f"Total assays: {summary['total_assays']}")

print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")

```

API Rate Limits and Best Practices

Rate Limits:

  • Maximum 5 requests per second
  • Maximum 400 requests per minute
  • Maximum 300 seconds running time per minute

Best Practices:

  1. Use CIDs for repeated queries: CIDs are more efficient than names or structures
  2. Cache results locally: Store frequently accessed data
  3. Batch requests: Combine multiple queries when possible
  4. Implement delays: Add 0.2-0.3 second delays between requests
  5. Handle errors gracefully: Check for HTTP errors and missing data
  6. Use PubChemPy: Higher-level abstraction handles many edge cases
  7. Leverage asynchronous pattern: For large similarity/substructure searches
  8. Specify MaxRecords: Limit results to avoid timeouts
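
The delay recommendation can be sketched as a small throttle that keeps calls a minimum interval apart. This is a minimal illustration, not part of PubChemPy or this skill's scripts: the `Throttle` class and its parameters are hypothetical, and the loop only marks where a real request would go.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval=0.25, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to keep calls min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.25)  # at most about 4 requests per second
for cid in [2244, 3672, 1983]:
    throttle.wait()
    # the actual request (e.g. pcp.Compound.from_cid(cid)) would go here
    print(f"ready to request CID {cid}")
```

Measuring elapsed time instead of always sleeping a fixed amount means slow requests do not add unnecessary extra delay.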

Error Handling:

```python
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError

try:
    compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request format")
except TimeoutError:
    print("Request timed out - try reducing scope")
except IndexError:
    # get_compounds returned an empty list
    print("No results returned")
```

Common Workflows

Workflow 1: Chemical Identifier Conversion Pipeline

Convert between different chemical identifiers:

```python
import pubchempy as pcp

# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]

# Extract all identifier formats
identifiers = {
    'CID': compound.cid,
    'Name': compound.iupac_name,
    'SMILES': compound.canonical_smiles,
    'InChI': compound.inchi,
    'InChIKey': compound.inchikey,
    'Formula': compound.molecular_formula
}
```

Workflow 2: Drug-Like Property Screening

Screen compounds using Lipinski's Rule of Five:

```python
import pubchempy as pcp

def check_drug_likeness(compound_name):
    compound = pcp.get_compounds(compound_name, 'name')[0]
    # Some PubChemPy versions return the molecular weight as a string
    mw = float(compound.molecular_weight)
    # Lipinski's Rule of Five (None means the property was unavailable)
    rules = {
        'MW <= 500': mw <= 500,
        # Compare against None so an XLogP of 0 is not treated as missing
        'LogP <= 5': compound.xlogp <= 5 if compound.xlogp is not None else None,
        'HBD <= 5': compound.h_bond_donor_count <= 5,
        'HBA <= 10': compound.h_bond_acceptor_count <= 10
    }
    violations = sum(1 for v in rules.values() if v is False)
    return rules, violations

rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
```

Workflow 3: Finding Similar Drug Candidates

Identify structurally similar compounds to a known drug:

```python
import pubchempy as pcp

# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles

# Find similar compounds
similar = pcp.get_compounds(
    reference_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,
    MaxRecords=20
)

# Filter by drug-like properties
candidates = []
for comp in similar:
    if comp.molecular_weight is None or comp.xlogp is None:
        continue  # skip compounds with missing properties
    # float() guards against the weight being returned as a string
    if 200 <= float(comp.molecular_weight) <= 600 and -1 <= comp.xlogp <= 5:
        candidates.append(comp)

print(f"Found {len(candidates)} drug-like candidates")
```

Workflow 4: Batch Compound Property Comparison

Compare properties across multiple compounds:

```python
import pubchempy as pcp
import pandas as pd

compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']
properties_list = []

for name in compound_list:
    try:
        compound = pcp.get_compounds(name, 'name')[0]
        properties_list.append({
            'Name': name,
            'CID': compound.cid,
            'Formula': compound.molecular_formula,
            'MW': compound.molecular_weight,
            'LogP': compound.xlogp,
            'TPSA': compound.tpsa,
            'HBD': compound.h_bond_donor_count,
            'HBA': compound.h_bond_acceptor_count
        })
    except Exception as e:
        print(f"Error processing {name}: {e}")

df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
```

Workflow 5: Substructure-Based Virtual Screening

Screen for compounds containing specific pharmacophores:

```python
import pubchempy as pcp

# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'

# Search for compounds containing this substructure
hits = pcp.get_compounds(
    pharmacophore_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

# Further filter by properties
# float() guards against the weight being returned as a string
filtered_hits = [
    comp for comp in hits
    if comp.molecular_weight and float(comp.molecular_weight) < 500
]

print(f"Found {len(filtered_hits)} compounds with desired substructure")
```

Reference Documentation

For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult references/api_reference.md. This comprehensive reference includes:

  • Complete PUG-REST API endpoint documentation
  • Full list of available molecular properties
  • Asynchronous request handling patterns
  • PubChemPy API reference
  • PUG-View API for annotations
  • Common workflows and use cases
  • Links to official PubChem documentation

Troubleshooting

Compound Not Found:

  • Try alternative names or synonyms
  • Use CID if known
  • Check spelling and chemical name format

Timeout Errors:

  • Reduce MaxRecords parameter
  • Add delays between requests
  • Use CIDs instead of names for faster queries
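
Transient timeouts can also be retried with increasing delays. The following is a generic sketch rather than a PubChemPy feature: the `retry` helper and its parameters are hypothetical, and a stand-in function simulates the failing call. With PubChemPy, `retry_on` would presumably be set to the library's own `TimeoutError`.

```python
import time


def retry(func, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff (hypothetical helper)."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...


# Usage sketch with a stand-in function that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry(flaky, base_delay=0.01, retry_on=(TimeoutError,))
print(result, "after", calls["n"], "calls")
```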

Empty Property Values:

  • Not all properties are available for all compounds
  • Check if property exists before accessing: if compound.xlogp is not None: (a plain truthiness test would also skip a valid value of 0)
  • Some properties only available for certain compound types
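
One way to handle missing values consistently is a small range check that distinguishes "missing" from "out of range". The helper name `within` is hypothetical; the `float()` conversion is an assumption that some property values may arrive as strings.

```python
def within(value, low=None, high=None):
    """Return True/False for a range check, or None when the value is missing."""
    if value is None:
        return None
    value = float(value)  # some property values may arrive as strings
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


print(within(0.0, high=5))         # True: zero is a valid XLogP, not "missing"
print(within(None, high=5))        # None: property not available
print(within("180.16", high=500))  # True: string input is converted first
```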

Rate Limit Exceeded:

  • Implement delays (0.2-0.3 seconds) between requests
  • Use batch operations where possible
  • Consider caching results locally
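
Local caching can be as simple as a JSON file keyed by identifier. The sketch below is one possible approach, not part of this skill's scripts: `cached_lookup` is a hypothetical helper, and a stand-in `fake_fetch` function replaces the real PubChem call so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def cached_lookup(key, fetch, cache_path):
    """Return fetch(key), storing results in a JSON file (hypothetical helper)."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if key not in cache:
        cache[key] = fetch(key)  # only call fetch on a cache miss
        path.write_text(json.dumps(cache, indent=2))
    return cache[key]


# Usage sketch with a stand-in fetch function instead of a real PubChem call
calls = []


def fake_fetch(name):
    calls.append(name)
    return {"name": name, "note": "placeholder record"}


cache_file = Path(tempfile.mkdtemp()) / "pubchem_cache.json"
cached_lookup("aspirin", fake_fetch, cache_file)
cached_lookup("aspirin", fake_fetch, cache_file)  # served from the file
print(f"fetch was called {len(calls)} time(s)")
```

The fetched values need to be JSON-serializable, so in practice this suits dictionaries such as those returned by property lookups better than full compound objects.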

Similarity/Substructure Search Hangs:

  • These are asynchronous operations that may take 15-30 seconds
  • PubChemPy handles polling automatically
  • Reduce MaxRecords if timing out

Additional Resources

  • PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
  • PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
  • PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
  • PubChemPy Documentation: https://pubchempy.readthedocs.io/
  • PubChemPy GitHub: https://github.com/mcs07/PubChemPy