Gtars is organized into specialized modules, each focused on a specific genomic analysis task:
1. Overlap Detection and IGD Indexing
Efficiently detect overlaps between genomic intervals using the Integrated Genome Database (IGD) data structure.
When to use:
- Finding overlapping regulatory elements
- Variant annotation
- Comparing ChIP-seq peaks
- Identifying shared genomic features
Quick example:
```python
import gtars
# Build IGD index and query overlaps
igd = gtars.igd.build_index("regions.bed")
overlaps = igd.query("chr1", 1000, 2000)
```
See references/overlap.md for comprehensive overlap detection documentation.
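To make the idea of an overlap query concrete, here is a minimal pure-Python sketch. It is not the IGD data structure and does not use the gtars API; the function name `find_overlaps` and the sample regions are hypothetical, and intervals are assumed to be BED-style half-open `(start, end)` tuples on a single chromosome, sorted by start.

```python
from bisect import bisect_left


def find_overlaps(intervals, start, end):
    """Return intervals overlapping the half-open query [start, end).

    Illustrative helper only (not part of gtars). `intervals` must be a
    list of (start, end) tuples for one chromosome, sorted by start.
    """
    starts = [s for s, _ in intervals]
    # Any interval that starts at or after `end` cannot overlap the query.
    upper = bisect_left(starts, end)
    # Of the remaining candidates, keep those that end after the query start.
    return [(s, e) for s, e in intervals[:upper] if e > start]


# Hypothetical regions on chr1
regions = [(100, 500), (900, 1200), (1500, 2500), (3000, 4000)]
print(find_overlaps(regions, 1000, 2000))  # [(900, 1200), (1500, 2500)]
```

The binary search only prunes intervals that start too late; the remaining candidates are still scanned linearly. An indexed structure such as IGD exists to avoid that scan on large datasets.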
2. Coverage Track Generation
Generate coverage tracks from sequencing data with the uniwig module.
When to use:
- ATAC-seq accessibility profiles
- ChIP-seq coverage visualization
- RNA-seq read coverage
- Differential coverage analysis
Quick example:
```bash
# Generate BigWig coverage track
gtars uniwig generate --input fragments.bed --output coverage.bw --format bigwig
```
See references/coverage.md for detailed coverage analysis workflows.
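As a rough illustration of what a coverage track contains, the sketch below computes per-base depth from a list of fragments using a difference array. It is an assumption-laden toy, not how uniwig is implemented: the function name `coverage` is hypothetical, fragments are half-open `(start, end)` pairs on one chromosome, and every fragment is assumed to lie within `chrom_length`.

```python
def coverage(fragments, chrom_length):
    """Per-base depth for half-open (start, end) fragments on one chromosome.

    Illustrative helper only (not part of gtars).
    """
    # Difference array: +1 where a fragment starts, -1 where it ends.
    diff = [0] * (chrom_length + 1)
    for start, end in fragments:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the differences gives the depth at each position.
    depth = []
    running = 0
    for delta in diff[:chrom_length]:
        running += delta
        depth.append(running)
    return depth


print(coverage([(0, 4), (2, 6), (5, 8)], 10))
# [1, 1, 2, 2, 1, 2, 1, 1, 0, 0]
```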
3. Genomic Tokenization
Convert genomic regions into discrete tokens for machine learning, particularly deep learning models trained on genomic data.
When to use:
- Preprocessing for genomic ML models
- Integration with geniml library
- Creating position encodings
- Training transformer models on genomic sequences
Quick example:
```python
from gtars.tokenizers import TreeTokenizer
tokenizer = TreeTokenizer.from_bed_file("training_regions.bed")
token = tokenizer.tokenize("chr1", 1000, 2000)
```
See references/tokenizers.md for tokenization documentation.
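One common way to tokenize a region is to map it onto a fixed vocabulary ("universe") of regions and emit the indices of the entries it overlaps. The sketch below shows that idea in plain Python. It is an assumption about the general approach, not a description of how `TreeTokenizer` works internally, and the `universe` list and `tokenize` function are hypothetical.

```python
# Hypothetical vocabulary: each entry's list index serves as its token ID.
universe = [
    ("chr1", 0, 1000),
    ("chr1", 1000, 2000),
    ("chr1", 2000, 3000),
    ("chr2", 0, 1000),
]


def tokenize(chrom, start, end, vocab=universe):
    """Return the IDs of vocabulary regions overlapping [start, end).

    Illustrative helper only (not part of gtars).
    """
    return [
        token_id
        for token_id, (v_chrom, v_start, v_end) in enumerate(vocab)
        if v_chrom == chrom and v_start < end and start < v_end
    ]


print(tokenize("chr1", 1500, 2500))  # [1, 2]
```

A query that overlaps no vocabulary region yields an empty list here; real tokenizers typically reserve a special token for that case.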
4. Reference Sequence Management
Handle reference genome sequences and compute digests following the GA4GH refget protocol.
When to use:
- Validating reference genome integrity
- Extracting specific genomic sequences
- Computing sequence digests
- Cross-reference comparisons
Quick example:
```python
import gtars

# Load reference and extract sequences
store = gtars.RefgetStore.from_fasta("hg38.fa")
sequence = store.get_subsequence("chr1", 1000, 2000)
```
See references/refget.md for reference sequence operations.
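For orientation, the sequence digest used by GA4GH refget can be sketched with the Python standard library alone: hash the sequence with SHA-512, keep the first 24 bytes, and encode them as URL-safe base64 (the `sha512t24u` scheme). This sketch does not call gtars, and uppercasing the input before hashing is an assumption about normalization that should be checked against the specification.

```python
import base64
import hashlib


def sha512t24u(sequence: str) -> str:
    """Truncated SHA-512 digest, base64url-encoded (32 characters).

    Illustrative helper only (not part of gtars).
    """
    # Assumption: sequences are normalized to uppercase before hashing.
    raw = hashlib.sha512(sequence.upper().encode("ascii")).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(raw[:24]).decode("ascii")


digest = sha512t24u("ACGT")
print(digest, len(digest))
```

Because the digest depends only on sequence content, two FASTA files that name a chromosome differently but contain identical bases produce the same digest. That property is what makes it useful for integrity checks and cross-reference comparisons.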
5. Fragment Processing
Split and analyze fragment files, which is particularly useful for single-cell genomics data.
When to use:
- Processing single-cell ATAC-seq data
- Splitting fragments by cell barcodes
- Cluster-based fragment analysis
- Fragment quality control
Quick example:
```bash
# Split fragments by clusters
gtars fragsplit cluster-split --input fragments.tsv --clusters clusters.txt --output-dir ./by_cluster/
```
See references/cli.md for fragment processing commands.
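Conceptually, splitting by cluster means grouping fragment records by the cluster assigned to each cell barcode. The sketch below illustrates that grouping in plain Python. It makes no claim about how `gtars fragsplit` works internally: the function name `split_by_cluster` is hypothetical, and the input is assumed to be tab-separated fragment lines with the barcode in the fourth column (chrom, start, end, barcode, count).

```python
from collections import defaultdict


def split_by_cluster(fragment_lines, barcode_to_cluster):
    """Group fragment lines by the cluster of their cell barcode.

    Illustrative helper only (not part of gtars). Fragments whose
    barcode has no cluster assignment are skipped.
    """
    groups = defaultdict(list)
    for line in fragment_lines:
        fields = line.rstrip("\n").split("\t")
        barcode = fields[3]  # assumed column layout: chrom, start, end, barcode, count
        cluster = barcode_to_cluster.get(barcode)
        if cluster is not None:
            groups[cluster].append(line)
    return dict(groups)


# Hypothetical fragments and barcode-to-cluster assignments
lines = [
    "chr1\t100\t250\tAAAC\t1\n",
    "chr1\t300\t480\tTTTG\t2\n",
    "chr2\t50\t120\tAAAC\t1\n",
    "chr2\t10\t90\tGGGA\t1\n",
]
clusters = {"AAAC": "cluster_0", "TTTG": "cluster_1"}

groups = split_by_cluster(lines, clusters)
for name, members in sorted(groups.items()):
    print(name, len(members))
```

Each group could then be written to its own file, one per cluster, which is the shape of output the `--output-dir` option above suggests.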
6. Fragment Scoring
Score fragment overlaps against reference datasets.
When to use:
- Evaluating fragment enrichment
- Comparing experimental data to references
- Quality metrics computation
- Batch scoring across samples
Quick example:
```bash
# Score fragments against reference
gtars scoring score --fragments fragments.bed --reference reference.bed --output scores.txt
```