arboreto

Skill

from ovachiever/droid-tings

What it does

Infers gene regulatory networks from transcriptomics data using scalable machine learning algorithms, identifying transcription factor-target gene relationships across large datasets.

Part of

ovachiever/droid-tings(370 items)

Added Feb 4, 2026

Skill Details

SKILL.md

Infer gene regulatory networks (GRNs) from gene expression data using scalable algorithms (GRNBoost2, GENIE3). Use when analyzing transcriptomics data (bulk RNA-seq, single-cell RNA-seq) to identify transcription factor-target gene relationships and regulatory interactions. Supports distributed computation for large-scale datasets.

# Arboreto

Overview

Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.

Core capability: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).

Quick Start

Install arboreto:

```bash
uv pip install arboreto
```

Basic GRN inference:

```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (observations as rows, genes as columns)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results (TF, target, importance)
    network.to_csv('network.tsv', sep='\t', index=False, header=False)
```

Critical: Always wrap the calling code in an if __name__ == '__main__': guard, because Dask spawns new processes, which can re-import the script.

Core Capabilities

1. Basic GRN Inference

Standard GRN inference workflows cover:

  • Input data preparation (Pandas DataFrame or NumPy array)
  • Running inference with GRNBoost2 or GENIE3
  • Filtering by transcription factors
  • Output format and interpretation

See: references/basic_inference.md

For standard inference tasks, use the ready-to-run script scripts/basic_grn_inference.py:

```bash
python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
```
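The input-preparation step can be sketched with pandas alone. Arboreto expects observations as rows and genes as columns, while many expression files are stored the other way round. The sketch below assumes a genes-by-samples table with gene names in the first column; the helper name `to_observations_by_genes` and the toy values are illustrative and not part of arboreto.

```python
import io

import pandas as pd


def to_observations_by_genes(genes_by_samples: pd.DataFrame) -> pd.DataFrame:
    """Transpose a genes x samples table into samples x genes (hypothetical helper)."""
    matrix = genes_by_samples.T
    # Gene names end up as column labels; keep them as strings.
    matrix.columns = matrix.columns.astype(str)
    # Force numeric values; non-numeric cells raise instead of passing silently.
    return matrix.astype(float)


# Toy genes x samples table standing in for a real TSV file (made-up values).
raw_tsv = "gene\ts1\ts2\ts3\nTF1\t1.0\t2.0\t3.0\nG1\t0.5\t0.0\t4.0\n"
genes_by_samples = pd.read_csv(io.StringIO(raw_tsv), sep="\t", index_col=0)

expression_matrix = to_observations_by_genes(genes_by_samples)
print(expression_matrix.shape)          # (3, 2): 3 samples, 2 genes
print(list(expression_matrix.columns))  # ['TF1', 'G1']
```

The resulting DataFrame has the layout that the `expression_data` argument is described as taking in this document.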

2. Algorithm Selection

Arboreto provides two algorithms:

GRNBoost2 (Recommended):

  • Fast gradient boosting-based inference
  • Optimized for large datasets (10k+ observations)
  • Default choice for most analyses

GENIE3:

  • Random Forest-based inference
  • Original multiple regression approach
  • Use for comparison or validation

Quick comparison:

```python
from arboreto.algo import grnboost2, genie3

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)
```

For detailed algorithm comparison, parameters, and selection guidance: references/algorithms.md

3. Distributed Computing

Scale inference from local multi-core to cluster environments:

Local (default) - Uses all available cores automatically:

```python
network = grnboost2(expression_data=matrix)
```

Custom local client - Control resources:

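```python
from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

# Local cluster with 10 workers; memory_limit applies per worker
local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

# Shut down the client and cluster when done
client.close()
local_cluster.close()
```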

Cluster computing - Connect to remote Dask scheduler:

```python
from distributed import Client

client = Client('tcp://scheduler:8786')

network = grnboost2(expression_data=matrix, client_or_address=client)
```

For cluster setup, performance optimization, and large-scale workflows: references/distributed_computing.md

Installation

```bash
uv pip install arboreto
```

Dependencies: scipy, scikit-learn, numpy, pandas, dask, distributed

Common Use Cases

Single-Cell RNA-seq Analysis

```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer cell-type-specific regulatory network
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter high-confidence links (0.5 is an example cutoff; tune per dataset)
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)
```

Bulk RNA-seq with TF Filtering

```python
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )

    network.to_csv('tf_target_network.tsv', sep='\t', index=False)
```

Comparative Analysis (Multiple Conditions)

```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
```
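Once per-condition networks exist, one simple way to compare them is to treat each network as a set of (TF, target) pairs and check which links are shared or condition-specific. The sketch below assumes the TF, target, importance column layout and uses toy DataFrames in place of the per-condition output files; `link_set` is a hypothetical helper, not an arboreto function.

```python
import pandas as pd


def link_set(network: pd.DataFrame) -> set:
    """Return the network's links as a set of (TF, target) pairs (hypothetical helper)."""
    return set(zip(network["TF"], network["target"]))


# Toy networks standing in for two per-condition results (made-up scores).
control = pd.DataFrame(
    {"TF": ["TF1", "TF2"], "target": ["G1", "G2"], "importance": [0.9, 0.4]}
)
treated = pd.DataFrame(
    {"TF": ["TF1", "TF3"], "target": ["G1", "G2"], "importance": [0.8, 0.6]}
)

shared = link_set(control) & link_set(treated)
gained = link_set(treated) - link_set(control)
lost = link_set(control) - link_set(treated)

print(sorted(shared))  # [('TF1', 'G1')]
print(sorted(gained))  # [('TF3', 'G2')]
print(sorted(lost))    # [('TF2', 'G2')]
```

Comparing unfiltered networks may be dominated by low-importance links, so filtering each network first is probably worthwhile.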

Output Interpretation

Arboreto returns a DataFrame with regulatory links:

| Column | Description |
|--------|-------------|
| TF | Transcription factor (regulator) |
| target | Target gene |
| importance | Regulatory importance score (higher = stronger) |

Filtering strategy:

  • Top N links per target gene
  • Importance threshold (e.g., > 0.5)
  • Statistical significance testing (permutation tests)
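The first two strategies can be sketched with pandas on any DataFrame that has the TF, target, and importance columns. The helper names, toy scores, and cut-offs below are illustrative assumptions; the scale of importance scores can vary between datasets and algorithms, so treat any fixed threshold as a value to tune.

```python
import pandas as pd


def top_n_per_target(network: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the n highest-importance links for each target gene."""
    ranked = network.sort_values("importance", ascending=False)
    return ranked.groupby("target", sort=False).head(n).reset_index(drop=True)


def above_threshold(network: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep links whose importance exceeds the threshold."""
    return network[network["importance"] > threshold].reset_index(drop=True)


# Toy network with the TF / target / importance columns (made-up scores).
network = pd.DataFrame(
    {
        "TF": ["TF1", "TF2", "TF3", "TF1", "TF2"],
        "target": ["G1", "G1", "G1", "G2", "G2"],
        "importance": [0.9, 0.4, 0.1, 0.7, 0.2],
    }
)

top2 = top_n_per_target(network, n=2)
strong = above_threshold(network, threshold=0.5)
print(top2)
print(strong)
```

Top-N filtering keeps every target gene represented, while a global threshold can drop weakly regulated targets entirely; which is preferable depends on the downstream analysis.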

Integration with pySCENIC

Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:

```python
# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2

network = grnboost2(expression_data=sc_data, tf_names=tf_list)

# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)
```

Reproducibility

Always set a seed for reproducible results:

```python
network = grnboost2(expression_data=matrix, seed=777)
```

Run multiple seeds for robustness analysis:

```python
from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

if __name__ == '__main__':
    # Reuse one client across all runs
    client = Client(LocalCluster())

    seeds = [42, 123, 777]
    networks = []
    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    client.close()

    # Combine networks and filter consensus links
    # (analyze_consensus is a placeholder for a user-defined helper; it is not defined here)
    consensus = analyze_consensus(networks)
```
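`analyze_consensus` is not defined in the snippet above. One way such a helper might look, assuming each network has TF, target, and importance columns: keep the links that appear in at least `min_runs` networks and average their importance. The `min_runs` parameter and the averaging rule are assumptions made for illustration.

```python
import pandas as pd


def analyze_consensus(networks, min_runs=2):
    """Hypothetical helper: keep links found in at least min_runs networks."""
    # Stack all runs, tagging each row with the run it came from.
    stacked = pd.concat(
        [net.assign(run=i) for i, net in enumerate(networks)], ignore_index=True
    )
    # Count the runs supporting each link and average its importance.
    summary = stacked.groupby(["TF", "target"], as_index=False).agg(
        n_runs=("run", "nunique"), mean_importance=("importance", "mean")
    )
    consensus = summary[summary["n_runs"] >= min_runs]
    return consensus.sort_values("mean_importance", ascending=False).reset_index(drop=True)


# Toy results from three seeds (made-up scores).
run_a = pd.DataFrame({"TF": ["TF1", "TF2"], "target": ["G1", "G1"], "importance": [0.8, 0.3]})
run_b = pd.DataFrame({"TF": ["TF1", "TF3"], "target": ["G1", "G2"], "importance": [0.6, 0.5]})
run_c = pd.DataFrame({"TF": ["TF1", "TF2"], "target": ["G1", "G1"], "importance": [0.7, 0.1]})

consensus = analyze_consensus([run_a, run_b, run_c], min_runs=2)
print(consensus)
```

In this toy case the TF3 link is dropped because only one run supports it. Rank-based aggregation would be a reasonable alternative to averaging raw scores.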

Troubleshooting

Memory errors: Reduce the dataset by filtering out low-variance genes, or use distributed computing.

Slow performance: Use GRNBoost2 instead of GENIE3, pass a distributed client, and restrict the TF list.

Dask errors: Ensure the if __name__ == '__main__': guard is present in scripts.

Empty results: Check the data format (genes as columns) and verify that TF names match gene names.
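Two of these checks can be run before inference. The sketch below assumes an observations-by-genes DataFrame; `check_tf_overlap` and `drop_low_variance_genes` are hypothetical helpers, and the variance cut-off is an arbitrary illustration.

```python
import pandas as pd


def check_tf_overlap(expression: pd.DataFrame, tf_names: list) -> list:
    """Return the TF names that actually appear as gene columns."""
    genes = set(expression.columns)
    return [tf for tf in tf_names if tf in genes]


def drop_low_variance_genes(expression: pd.DataFrame, min_variance: float) -> pd.DataFrame:
    """Drop gene columns whose variance across observations is below the cut-off."""
    variances = expression.var(axis=0)
    return expression.loc[:, variances >= min_variance]


# Toy observations x genes matrix (made-up values).
expression = pd.DataFrame(
    {
        "TF1": [1.0, 2.0, 3.0],
        "G1": [5.0, 5.0, 5.0],  # constant gene, variance 0
        "G2": [0.0, 4.0, 8.0],
    }
)

matched = check_tf_overlap(expression, ["TF1", "tf2", "TFX"])
filtered = drop_low_variance_genes(expression, min_variance=0.5)

print(matched)                 # ['TF1']
print(list(filtered.columns))  # ['TF1', 'G2']
```

An empty or very short `matched` list points to a naming mismatch (case, gene symbols versus IDs) or a transposed matrix.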