pytdc

from ovachiever/droid-tings

What it does

Provides AI-ready drug discovery datasets, benchmarks, and molecular oracles for therapeutic machine learning and pharmacological prediction tasks.

Part of

ovachiever/droid-tings(370 items)

Installation

Clone the repository:
git clone https://github.com/ovachiever/droid-tings.git
Added Feb 4, 2026

Skill Details

SKILL.md

"Therapeutics Data Commons. AI-ready drug discovery datasets (ADME, toxicity, DTI), benchmarks, scaffold splits, molecular oracles, for therapeutic ML and pharmacological prediction."

# PyTDC (Therapeutics Data Commons)

Overview

PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery and development. Access curated datasets spanning the entire therapeutics pipeline with standardized evaluation metrics and meaningful data splits, organized into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions, DDI), and generation (molecule generation, retrosynthesis).

When to Use This Skill

This skill should be used when:

  • Working with drug discovery or therapeutic ML datasets
  • Benchmarking machine learning models on standardized pharmaceutical tasks
  • Predicting molecular properties (ADME, toxicity, bioactivity)
  • Predicting drug-target or drug-drug interactions
  • Generating novel molecules with desired properties
  • Accessing curated datasets with proper train/test splits (scaffold, cold-split)
  • Using molecular oracles for property optimization

Installation & Setup

Install PyTDC using pip (the commands below go through uv; plain `pip install PyTDC` works the same way):

```bash

uv pip install PyTDC

```

To upgrade to the latest version:

```bash

uv pip install PyTDC --upgrade

```

Core dependencies (automatically installed):

  • numpy, pandas, tqdm, seaborn, scikit-learn, fuzzywuzzy

Additional packages are installed automatically as needed for specific features.

Quick Start

The basic pattern for accessing any TDC dataset follows this structure:

```python
from tdc.<problem> import <Task>

data = <Task>(name='<Dataset>')
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])
df = data.get_data(format='df')
```

Where:

  • <problem>: One of single_pred, multi_pred, or generation
  • <Task>: Specific task category (e.g., ADME, DTI, MolGen)
  • <Dataset>: Dataset name within that task
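The three-part pattern (problem type, task class, dataset name) also lends itself to configuration-driven loading. The sketch below resolves a task class from strings with the standard library's `importlib`. The helper name `resolve_task_class` is hypothetical and not part of PyTDC; the final lines are left commented out because they need PyTDC installed and would download data.

```python
import importlib


def resolve_task_class(problem: str, task: str, package: str = "tdc"):
    """Return the attribute `task` from the module `<package>.<problem>`."""
    module = importlib.import_module(f"{package}.{problem}")
    return getattr(module, task)


# Hypothetical usage with PyTDC installed:
# ADME = resolve_task_class("single_pred", "ADME")
# data = ADME(name="Caco2_Wang")
```

This keeps the problem, task, and dataset names in a config file or command-line arguments instead of hard-coded imports.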

Example - Loading ADME data:

```python

from tdc.single_pred import ADME

data = ADME(name='Caco2_Wang')

split = data.get_split(method='scaffold')

# Returns dict with 'train', 'valid', 'test' DataFrames

```

Single-Instance Prediction Tasks

Single-instance prediction involves forecasting properties of individual biomedical entities (molecules, proteins, etc.).

Available Task Categories

#### 1. ADME (Absorption, Distribution, Metabolism, Excretion)

Predict pharmacokinetic properties of drug molecules.

```python

from tdc.single_pred import ADME

data = ADME(name='Caco2_Wang') # Intestinal permeability

# Other datasets: HIA_Hou, Bioavailability_Ma, Lipophilicity_AstraZeneca, etc.

```

Common ADME datasets:

  • Caco2 - Intestinal permeability
  • HIA - Human intestinal absorption
  • Bioavailability - Oral bioavailability
  • Lipophilicity - Octanol-water partition coefficient
  • Solubility - Aqueous solubility
  • BBB - Blood-brain barrier penetration
  • CYP - Cytochrome P450 metabolism

#### 2. Toxicity (Tox)

Predict toxicity and adverse effects of compounds.

```python

from tdc.single_pred import Tox

data = Tox(name='hERG') # Cardiotoxicity

# Other datasets: AMES, DILI, Carcinogens_Lagunin, etc.

```

Common toxicity datasets:

  • hERG - Cardiac toxicity
  • AMES - Mutagenicity
  • DILI - Drug-induced liver injury
  • Carcinogens - Carcinogenicity
  • ClinTox - Clinical trial toxicity

#### 3. HTS (High-Throughput Screening)

Bioactivity predictions from screening data.

```python

from tdc.single_pred import HTS

data = HTS(name='SARSCoV2_Vitro_Touret')

```

#### 4. QM (Quantum Mechanics)

Quantum mechanical properties of molecules.

```python

from tdc.single_pred import QM

data = QM(name='QM7')

```

#### 5. Other Single Prediction Tasks

  • Yields: Chemical reaction yield prediction
  • Epitope: Epitope prediction for biologics
  • Develop: Development-stage predictions
  • CRISPROutcome: Gene editing outcome prediction

Data Format

Single prediction datasets typically return DataFrames with columns:

  • Drug_ID or Compound_ID: Unique identifier
  • Drug or X: SMILES string or molecular representation
  • Y: Target label (continuous or binary)
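Because Y can be continuous or binary, it is often worth checking which kind a dataset holds before choosing a model and metric. The following is a minimal sketch in plain Python, assuming the labels have already been pulled out of the Y column into a list. The function name `infer_task_type` is illustrative and not a TDC utility.

```python
def infer_task_type(labels):
    """Return 'binary' if every label is 0 or 1, otherwise 'regression'."""
    values = set(labels)
    if not values:
        raise ValueError("labels must not be empty")
    return "binary" if values <= {0, 1} else "regression"


# Toy labels for illustration only
binary_labels = [0, 1, 1, 0]
continuous_labels = [-5.2, -4.8, -6.1]

print(infer_task_type(binary_labels))
print(infer_task_type(continuous_labels))
```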

Multi-Instance Prediction Tasks

Multi-instance prediction involves forecasting properties of interactions between multiple biomedical entities.

Available Task Categories

#### 1. DTI (Drug-Target Interaction)

Predict binding affinity between drugs and protein targets.

```python

from tdc.multi_pred import DTI

data = DTI(name='BindingDB_Kd')

split = data.get_split()

```

Available datasets:

  • BindingDB_Kd - Dissociation constant (52,284 pairs)
  • BindingDB_IC50 - Half-maximal inhibitory concentration (991,486 pairs)
  • BindingDB_Ki - Inhibition constant (375,032 pairs)
  • DAVIS, KIBA - Kinase binding datasets

Data format: Drug_ID, Target_ID, Drug (SMILES), Target (sequence), Y (binding affinity)

#### 2. DDI (Drug-Drug Interaction)

Predict interactions between drug pairs.

```python

from tdc.multi_pred import DDI

data = DDI(name='DrugBank')

split = data.get_split()

```

Multi-class classification task predicting interaction types. Dataset contains 191,808 DDI pairs with 1,706 drugs.

#### 3. PPI (Protein-Protein Interaction)

Predict protein-protein interactions.

```python

from tdc.multi_pred import PPI

data = PPI(name='HuRI')

```

#### 4. Other Multi-Prediction Tasks

  • GDA: Gene-disease associations
  • DrugRes: Drug resistance prediction
  • DrugSyn: Drug synergy prediction
  • PeptideMHC: Peptide-MHC binding
  • AntibodyAff: Antibody affinity prediction
  • MTI: miRNA-target interactions
  • Catalyst: Catalyst prediction
  • TrialOutcome: Clinical trial outcome prediction

Generation Tasks

Generation tasks involve creating novel biomedical entities with desired properties.

1. Molecular Generation (MolGen)

Generate diverse, novel molecules with desirable chemical properties.

```python

from tdc.generation import MolGen

data = MolGen(name='ChEMBL_V29')

split = data.get_split()

```

Use with oracles to optimize for specific properties:

```python

from tdc import Oracle

oracle = Oracle(name='GSK3B')

score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O') # Evaluate SMILES

```

See references/oracles.md for all available oracle functions.

2. Retrosynthesis (RetroSyn)

Predict reactants needed to synthesize a target molecule.

```python

from tdc.generation import RetroSyn

data = RetroSyn(name='USPTO')

split = data.get_split()

```

The dataset contains 1,939,253 reactions from the USPTO database.

3. Paired Molecule Generation

Generate molecule pairs (e.g., prodrug-drug pairs).

```python

from tdc.generation import PairMolGen

data = PairMolGen(name='Prodrug')

```

For detailed oracle documentation and molecular generation workflows, refer to references/oracles.md and scripts/molecular_generation.py.

Benchmark Groups

Benchmark groups provide curated collections of related datasets for systematic model evaluation.

ADMET Benchmark Group

```python
from tdc.benchmark_group import admet_group

group = admet_group(path='data/')
predictions_list = []

for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get('Caco2_Wang')
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)

    # Train `model` on train/valid here (user implements)
    y_pred_test = model.predict(test['Drug'])
    predictions_list.append({name: y_pred_test})

# Evaluate across the required 5 seeds
results = group.evaluate_many(predictions_list)
```

The ADMET benchmark group includes 22 datasets covering absorption, distribution, metabolism, excretion, and toxicity.
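A five-seed protocol implies reporting a mean and a spread, not a single number. The sketch below shows that aggregation with the standard library. The scores are made up for illustration, and the use of the population standard deviation is an assumption here; the benchmark group's own evaluation may compute the spread differently.

```python
from statistics import fmean, pstdev


def summarize_seeds(scores_by_seed):
    """Collapse {seed: score} into (mean, population standard deviation)."""
    scores = list(scores_by_seed.values())
    if len(scores) < 2:
        raise ValueError("need at least two seeds to report a spread")
    return fmean(scores), pstdev(scores)


# Made-up MAE values for seeds 1-5, for illustration only
mae_by_seed = {1: 0.40, 2: 0.42, 3: 0.38, 4: 0.41, 5: 0.39}
mean_mae, std_mae = summarize_seeds(mae_by_seed)
print(f"MAE: {mean_mae:.3f} +/- {std_mae:.3f}")
```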

Other Benchmark Groups

Available benchmark groups include collections for:

  • ADMET properties
  • Drug-target interactions
  • Drug combination prediction
  • And more specialized therapeutic tasks

For benchmark evaluation workflows, see scripts/benchmark_evaluation.py.

Data Functions

TDC provides comprehensive data processing utilities organized into four categories.

1. Dataset Splits

Retrieve train/validation/test partitions with various strategies:

```python
# Scaffold split (molecules sharing a scaffold stay in the same partition)
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])

# Random split
split = data.get_split(method='random', seed=42, frac=[0.8, 0.1, 0.1])

# Cold split (for DTI/DDI tasks)
split = data.get_split(method='cold_drug', seed=1)    # Unseen drugs in test
split = data.get_split(method='cold_target', seed=1)  # Unseen targets in test
```

Available split strategies:

  • random: Random shuffling
  • scaffold: Scaffold-based, so test molecules are structurally distinct from training molecules
  • cold_drug, cold_target, cold_drug_target: For DTI tasks
  • temporal: Time-based splits for temporal datasets
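Scaffold and cold splits share one property: rows that share a grouping key (a scaffold, a drug ID, a target ID) never end up in different partitions. The sketch below illustrates that idea in plain Python. It is not TDC's implementation; the function `group_split` and the scaffold keys are hypothetical, and computing real scaffolds would need a cheminformatics toolkit.

```python
import random
from collections import defaultdict


def group_split(group_keys, frac=(0.7, 0.1, 0.2), seed=1):
    """Split row indices so that no group key spans two partitions."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("frac must sum to 1")
    groups = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)

    keys = sorted(groups)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(keys)

    total = len(group_keys)
    train_cap = frac[0] * total
    valid_cap = (frac[0] + frac[1]) * total
    split = {"train": [], "valid": [], "test": []}
    assigned = 0
    for key in keys:
        # Whole groups are assigned at once, so sizes only approximate frac
        if assigned < train_cap:
            part = "train"
        elif assigned < valid_cap:
            part = "valid"
        else:
            part = "test"
        split[part].extend(groups[key])
        assigned += len(groups[key])
    return split


# Hypothetical scaffold label per molecule (one entry per dataset row)
scaffold_keys = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
split = group_split(scaffold_keys, seed=1)
print({part: len(rows) for part, rows in split.items()})
```

Passing drug IDs instead of scaffold labels gives the cold-drug behaviour: every drug in the test partition is unseen during training.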

2. Model Evaluation

Use standardized metrics for evaluation:

```python
from tdc import Evaluator

# y_true and y_pred are the ground-truth labels and model outputs (user supplies)

# For binary classification
evaluator = Evaluator(name='ROC-AUC')
score = evaluator(y_true, y_pred)

# For regression
evaluator = Evaluator(name='RMSE')
score = evaluator(y_true, y_pred)
```

Available metrics: ROC-AUC, PR-AUC, F1, Accuracy, RMSE, MAE, R2, Spearman, Pearson, and more.
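For intuition about the two most common regression metrics, here is a plain-Python sketch of what MAE and RMSE compute. These are the textbook definitions, written out as a reference point, not TDC's code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE does."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Toy values: three perfect predictions and one that is off by 2
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

With a single large error, RMSE comes out twice as high as MAE on this toy data, which is why the two metrics can rank models differently.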

3. Data Processing

TDC provides 11 key processing utilities. Molecule format conversion is one example:

```python

from tdc.chem_utils import MolConvert

# Molecule format conversion

converter = MolConvert(src='SMILES', dst='PyG')

pyg_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

```

Processing utilities include:

  • Molecule format conversion (SMILES, SELFIES, PyG, DGL, ECFP, etc.)
  • Molecule filters (PAINS, drug-likeness)
  • Label binarization and unit conversion
  • Data balancing (over/under-sampling)
  • Negative sampling for pair data
  • Graph transformation
  • Entity retrieval (CID to SMILES, UniProt to sequence)

For comprehensive utilities documentation, see references/utilities.md.
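Negative sampling, one of the utilities listed above, can be illustrated without TDC: draw pairs that do not appear among the observed positives. The sketch below is a simplified stand-in with a hypothetical function name. It enumerates all candidate pairs, so it only suits small examples, and an unobserved pair is not guaranteed to be a true non-interaction.

```python
import random


def sample_negative_pairs(positive_pairs, n_samples, seed=0):
    """Draw (drug, target) pairs that are absent from the observed positives."""
    positives = set(positive_pairs)
    drugs = sorted({drug for drug, _ in positives})
    targets = sorted({target for _, target in positives})
    candidates = [
        (drug, target)
        for drug in drugs
        for target in targets
        if (drug, target) not in positives
    ]
    if n_samples > len(candidates):
        raise ValueError("not enough unobserved pairs to sample from")
    return random.Random(seed).sample(candidates, n_samples)


# Toy interaction list with made-up identifiers
positive_pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D3", "T3")]
negatives = sample_negative_pairs(positive_pairs, n_samples=3)
print(negatives)
```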

4. Molecule Generation Oracles

TDC provides 17+ oracle functions for molecular optimization:

```python
from tdc import Oracle

# Score a single molecule
oracle = Oracle(name='DRD2')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

# Score several molecules in one call (replace the placeholders with real SMILES strings)
oracle = Oracle(name='JNK3')
scores = oracle(['SMILES1', 'SMILES2', 'SMILES3'])
```

For complete oracle documentation, see references/oracles.md.
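Optimization loops tend to call an oracle many times, often on repeated molecules, so it can help to cache scores and cap the number of distinct evaluations. The wrapper below is a hedged sketch: `BudgetedOracle` and `toy_score` are hypothetical names, and the toy scorer merely stands in for any callable that maps a SMILES string to a number.

```python
class BudgetedOracle:
    """Wrap a scoring callable, cache results, and cap distinct evaluations."""

    def __init__(self, score_fn, max_calls):
        self.score_fn = score_fn
        self.max_calls = max_calls
        self.cache = {}

    def __call__(self, smiles):
        if smiles not in self.cache:
            if len(self.cache) >= self.max_calls:
                raise RuntimeError("oracle budget exhausted")
            self.cache[smiles] = self.score_fn(smiles)
        return self.cache[smiles]


def toy_score(smiles):
    """Stand-in scorer for illustration only: fraction of 'C' characters."""
    return smiles.count("C") / len(smiles)


budgeted = BudgetedOracle(toy_score, max_calls=2)
first = budgeted("CCO")
repeat = budgeted("CCO")  # served from the cache, does not use budget
second = budgeted("CCN")
print(first, second, len(budgeted.cache))
```

Counting only distinct molecules against the budget makes comparisons between generation methods less sensitive to how often each one revisits the same candidate.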

Advanced Features

Retrieve Available Datasets

```python

from tdc.utils import retrieve_dataset_names

# Get all ADME datasets

adme_datasets = retrieve_dataset_names('ADME')

# Get all DTI datasets

dti_datasets = retrieve_dataset_names('DTI')

```

Label Transformations

```python

# Get label mapping

label_map = data.get_label_map(name='DrugBank')

# Convert labels

from tdc.chem_utils import label_transform

transformed = label_transform(y, from_unit='nM', to_unit='p')

```
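The nM-to-p conversion referenced above is the usual negative base-10 logarithm of the molar concentration, so 1 nM maps to 9 and 1 µM maps to 6. The sketch below writes that arithmetic out in plain Python. The function names are illustrative, and TDC's own transform may differ in details such as adding a small offset to avoid taking the logarithm of zero.

```python
import math


def nm_to_p(value_nm):
    """Convert a concentration in nanomolar to the -log10(molar) scale."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # -log10(value_nm * 1e-9) simplifies to 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def p_to_nm(value_p):
    """Inverse conversion: -log10(molar) scale back to nanomolar."""
    return 10 ** (9.0 - value_p)


for nanomolar in (1.0, 1000.0, 50.0):
    print(nanomolar, "nM ->", round(nm_to_p(nanomolar), 3))
```

Working on the p scale compresses affinities that span several orders of magnitude into a range that regression models handle more comfortably.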

Database Queries

```python

from tdc.utils import cid2smiles, uniprot2seq

# Convert PubChem CID to SMILES

smiles = cid2smiles(2244)

# Convert UniProt ID to amino acid sequence

sequence = uniprot2seq('P12345')

```

Common Workflows

Workflow 1: Train a Single Prediction Model

See scripts/load_and_split_data.py for a complete example:

```python

from tdc.single_pred import ADME

from tdc import Evaluator

# Load data

data = ADME(name='Caco2_Wang')

split = data.get_split(method='scaffold', seed=42)

train, valid, test = split['train'], split['valid'], split['test']

# Train model (user implements)

# model.fit(train['Drug'], train['Y'])

# Evaluate

evaluator = Evaluator(name='MAE')

# score = evaluator(test['Y'], predictions)

```

Workflow 2: Benchmark Evaluation

See scripts/benchmark_evaluation.py for a complete example with multiple seeds and proper evaluation protocol.

Workflow 3: Molecular Generation with Oracles

See scripts/molecular_generation.py for an example of goal-directed generation using oracle functions.

Resources

This skill includes bundled resources for common TDC workflows:

scripts/

  • load_and_split_data.py: Template for loading and splitting TDC datasets with various strategies
  • benchmark_evaluation.py: Template for running benchmark group evaluations with proper 5-seed protocol
  • molecular_generation.py: Template for molecular generation using oracle functions

references/

  • datasets.md: Comprehensive catalog of all available datasets organized by task type
  • oracles.md: Complete documentation of all 17+ molecule generation oracles
  • utilities.md: Detailed guide to data processing, splitting, and evaluation utilities

Additional Resources

  • Official Website: https://tdcommons.ai
  • Documentation: https://tdc.readthedocs.io
  • GitHub: https://github.com/mims-harvard/TDC
  • Paper: NeurIPS 2021 - "Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development"