tooluniverse-sequence-retrieval

from mims-harvard/tooluniverse

What it does

Retrieves biological sequences from NCBI and ENA with precise gene disambiguation, accession handling, and comprehensive sequence metadata.

Installation

Quick Install (npx)

npx skills add mims-harvard/ToolUniverse

pip install (Python package)

pip install tooluniverse

Claude CLI (add the MCP server)

claude mcp add --transport stdio tooluniverse -- tooluniverse-smcp-stdio --compact-mode

pip install (minimal installation)

pip install tooluniverse[client] # Minimal installation

Server Configuration (MCP server configuration block)

{
  "mcpServers": {
    "tooluniverse": {
      "command": "uvx",
      "args": ...

Added: Feb 4, 2026

Skill Details

SKILL.md

Retrieves biological sequences (DNA, RNA, protein) from NCBI and ENA with gene disambiguation, accession type handling, and comprehensive sequence profiles. Creates detailed reports with sequence metadata, cross-database references, and download options. Use when users need nucleotide sequences, protein sequences, genome data, or mention GenBank, RefSeq, EMBL accessions.

Overview

# Biological Sequence Retrieval

Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.

Workflow Overview

```
Phase 0: Clarify (if needed)
    ↓
Phase 1: Disambiguate Gene/Organism
    ↓
Phase 2: Search & Retrieve (Internal)
    ↓
Phase 3: Report Sequence Profile
```

---

Phase 0: Clarification (When Needed)

Ask the user ONLY if:

- The gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
- The sequence type is unclear (mRNA, genomic, protein?)
- The strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)

Skip clarification for:

- Specific accession numbers (NC_, NM_, U*, etc.)
- Clear organism + gene combinations
- Complete genome requests with organism specified

---

Phase 1: Gene/Organism Disambiguation

1.1 Resolve Identifiers

```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Placeholders: user_provided_accession, user_provided_gene_and_organism,
# organism and gene are filled in from the user's request.

# Strategy depends on input type
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )
```

1.2 Accession Type Decision Tree

CRITICAL: Accession prefix determines which tools to use.

| Prefix | Type | Use With |
|--------|------|----------|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| NZ_* | RefSeq genome (RefSeq copy of a GenBank assembly) | NCBI only |
| U, M, K, X | GenBank | NCBI or ENA |
| CP | GenBank genome | NCBI or ENA |
| EMBL format | EMBL | ENA preferred |
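The decision table above can be sketched as a small prefix classifier. This is only an illustration: `classify_accession` and its return labels are hypothetical names, not part of ToolUniverse, and the sketch assumes that RefSeq accessions are exactly those that start with two uppercase letters followed by an underscore.

```python
import re

# Assumption: RefSeq accessions look like "NC_000913.3" or "NZ_CP012345.1",
# i.e. two letters, an underscore, an alphanumeric body and an optional version.
_REFSEQ_RE = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def classify_accession(accession: str) -> dict:
    """Return the accession family and the tool families that can serve it.

    Hypothetical helper for illustration; not a ToolUniverse function.
    """
    acc = accession.strip().upper()
    if _REFSEQ_RE.match(acc):
        # RefSeq records are NCBI-only according to the table above.
        return {"accession": acc, "family": "RefSeq", "tools": ["NCBI"]}
    # Everything else is treated here as a GenBank/EMBL accession.
    return {"accession": acc, "family": "GenBank/EMBL", "tools": ["NCBI", "ENA"]}
```

Calling `classify_accession("U00096.3")` would report both NCBI and ENA as usable, while `classify_accession("NC_000913.3")` would report NCBI only.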

1.3 Identity Resolution Checklist

- [ ] Organism confirmed (scientific name)
- [ ] Gene symbol/name identified
- [ ] Sequence type determined (genomic/mRNA/protein)
- [ ] Strain specified (if relevant)
- [ ] Accession prefix identified → tool selection

---

Phase 2: Data Retrieval (Internal)

Retrieve silently. Do NOT narrate the search process.

2.1 Search for Sequences

```python
# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,      # Optional
    keywords=keywords,  # Optional
    seq_type=seq_type,  # complete_genome, mrna, refseq
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)
```

2.2 Retrieve Sequence Data

```python
# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)
```
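Once the GenBank-format record is available as plain text, the feature counts used later in the report can be derived from its feature table. The following is a minimal sketch under two assumptions: the flat-file text has already been extracted from the tool result into a string (the exact return shape of `NCBI_get_sequence` is not shown here), and `count_features` is a hypothetical helper name.

```python
import re
from collections import Counter

# In a GenBank flat file, a feature line starts with exactly 5 spaces,
# followed by the feature key (CDS, tRNA, ...) and then the location.
# Qualifier lines are indented further, so they do not match.
_FEATURE_RE = re.compile(r"^ {5}(\S+)\s+\S")


def count_features(genbank_text: str) -> Counter:
    """Count feature keys in the FEATURES table of a GenBank flat file."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if in_features and line[:1] not in (" ", ""):
            # A new top-level section (e.g. ORIGIN) ends the feature table.
            break
        if in_features:
            match = _FEATURE_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

The resulting counter can be read directly, for example `count_features(text)["CDS"]` for the number of annotated coding sequences.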

2.3 ENA Alternative (for GenBank/EMBL accessions)

```python
# Only for non-RefSeq accessions!
# RefSeq prefixes are two letters followed by an underscore.
REFSEQ_PREFIXES = (
    "AC_", "AP_", "NC_", "NG_", "NM_", "NP_", "NR_", "NT_",
    "NW_", "NZ_", "WP_", "XM_", "XP_", "XR_", "YP_",
)

if not accession.startswith(REFSEQ_PREFIXES):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)
    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)
```

Fallback Chains

| Primary | Fallback | Notes |
|---------|----------|-------|
| NCBI_get_sequence | ENA tools (GenBank/EMBL accessions only) | Use when NCBI is unavailable |
| ena_get_entry | NCBI_get_sequence | ENA doesn't have RefSeq |
| NCBI_search_nucleotide | Try broader keywords | Use when there are no results |

Critical Rule: Never try ENA tools with RefSeq accessions (NC_, NM_, etc.) - they will return 404 errors.
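The fallback chain and the critical rule can be combined in one small wrapper. This is a sketch, not ToolUniverse code: `fetch_with_fallback` and `is_refseq` are hypothetical names, the two fetchers are passed in as plain callables (in practice they might wrap `NCBI_get_sequence` and `ena_get_sequence_fasta`), and it assumes a failed fetch surfaces as a raised exception, which may not match how the real tools report errors.

```python
from typing import Any, Callable, Tuple


def is_refseq(accession: str) -> bool:
    """Assumption: RefSeq accessions start with two letters and an underscore."""
    return len(accession) > 3 and accession[:2].isalpha() and accession[2] == "_"


def fetch_with_fallback(
    accession: str,
    fetch_ncbi: Callable[[str], Any],
    fetch_ena: Callable[[str], Any],
) -> Tuple[str, Any]:
    """Try NCBI first; fall back to ENA only for non-RefSeq accessions."""
    try:
        return "NCBI", fetch_ncbi(accession)
    except Exception as ncbi_error:
        if is_refseq(accession):
            # Never send RefSeq accessions to ENA: report the NCBI failure instead.
            raise RuntimeError(
                f"NCBI failed for RefSeq accession {accession}; no ENA fallback"
            ) from ncbi_error
        return "ENA", fetch_ena(accession)
```

Returning the source name alongside the data keeps the "Database" field of the final report accurate when the fallback was used.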

---

Phase 3: Report Sequence Profile

Output Structure

Present as a Sequence Profile Report. Hide search process.

```markdown

# Sequence Profile: [Gene/Organism]

Search Summary

Query: [gene] in [organism]
Database: NCBI Nucleotide
Results: [N] sequences found

---

Primary Sequence

[Accession]: [Definition/Title]

| Attribute | Value |
|-----------|-------|
| Accession | [accession] |
| Type | RefSeq / GenBank |
| Organism | [scientific name] |
| Strain | [strain if applicable] |
| Length | [X,XXX bp / aa] |
| Molecule | DNA / mRNA / Protein |
| Topology | Linear / Circular |

Curation Level: ●●● RefSeq (curated) / ●●○ GenBank (submitted) / ●○○ Third-party

Sequence Statistics

| Statistic | Value |
|-----------|-------|
| Length | [X,XXX] bp |
| GC Content | [XX.X]% |
| Genes | [N] (if genome) |
| CDS | [N] (if annotated) |

Sequence Preview

```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```

Annotations Summary (from GenBank format)

| Feature | Count | Examples |
|---------|-------|----------|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |

---

Alternative Sequences

Ranked by relevance and curation level:

| Accession | Type | Length | Description | ENA Compatible |
|-----------|------|--------|-------------|----------------|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | ✗ |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | ✓ |
| CP001509.3 | GenBank | 4.6 Mb | E. coli BL21(DE3) | ✓ |

---

Cross-Database References

| Database | Accession | Link |
|----------|-----------|------|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |

---

Download Options

Formats Available

| Format | Description | Use Case |
|--------|-------------|----------|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |

Direct Commands

```python
# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)
```

---

Related Sequences

Other Strains/Isolates

| Accession | Strain | Similarity | Notes |
|-----------|--------|------------|-------|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |

Protein Products (if applicable)

| Protein Accession | Product Name | Length |
|-------------------|--------------|--------|
| [NP_*] | [protein name] | [X] aa |

---

Retrieved: [date]

Database: NCBI Nucleotide

```
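The Length, GC Content and Sequence Preview fields of the template above can be derived from the FASTA text alone. The sketch below assumes the FASTA record is already available as a string; `parse_fasta`, `gc_content` and `preview` are hypothetical helper names, and the GC percentage is computed over all characters of the sequence, so ambiguous bases such as N lower the value.

```python
from typing import Tuple


def parse_fasta(fasta_text: str) -> Tuple[str, str]:
    """Return (header, sequence) for the first record in a FASTA string."""
    header = None
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:].strip()
        else:
            chunks.append(line)
    return header or "", "".join(chunks).upper()


def gc_content(sequence: str) -> float:
    """Percentage of G and C characters, rounded to one decimal place."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return round(100.0 * gc / len(sequence), 1)


def preview(sequence: str, width: int = 60, max_lines: int = 2) -> str:
    """Wrap the sequence and truncate it after max_lines lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    shown = lines[:max_lines]
    if len(lines) > max_lines:
        shown.append("... [truncated, full sequence in download]")
    return "\n".join(shown)
```

For very large records such as complete genomes, only the preview should be embedded in the report; the length and GC content remain cheap to compute in a single pass.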

---

Curation Level Tiers (Aligned with Evidence Grading)

Sequence Curation Levels

| Tier | Symbol | Accession Prefix | Description | Evidence Equivalent |
|------|--------|------------------|-------------|---------------------|
| Third Party | ○○○○ | TPA_ | Third-party annotation | ★☆☆ |

Data Reliability Mapping

| Data Type | Reliability | Notes |
|-----------|-------------|-------|
| RefSeq curated sequence | ★★★ | Gold standard for reference |
| RefSeq annotations | ★★★ | Validated gene models |
| GenBank sequence | ★★☆ | Submitted, generally reliable |
| GenBank annotations | ★☆☆ | Submitter-provided, verify |
| Predicted genes (XM_) | ★★☆ | Computational, may lack validation |
| Genome assembly | ★★★-★☆☆ | Depends on assembly quality |

Include in report:

```markdown
Curation Level: ●●●● RefSeq Reference (★★★)

- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use

Data Reliability Note:

- Sequence: ★★★ (experimentally derived)
- Gene annotations: ★★★ (curated models)
- Variant annotations: ★★☆ (computational)
```

---

Completeness Checklist

Every sequence report MUST include:

Per Sequence (Required)

- [ ] Accession number
- [ ] Organism (scientific name)
- [ ] Sequence type (DNA/RNA/protein)
- [ ] Length
- [ ] Curation level
- [ ] Database source

Search Summary (Required)

- [ ] Query parameters
- [ ] Number of results
- [ ] Ranking rationale

Include Even If Limited

- [ ] Alternative sequences (or "Only one sequence found")
- [ ] Cross-database references (or "No cross-references available")
- [ ] Download instructions
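The per-sequence part of this checklist lends itself to a mechanical check before a report is emitted. The following is a minimal sketch; the dictionary keys in `REQUIRED_FIELDS` and the helper `missing_fields` are hypothetical names chosen for illustration, not a schema defined by ToolUniverse.

```python
from typing import Dict, List

# Assumed field names, mirroring the "Per Sequence (Required)" checklist.
REQUIRED_FIELDS = (
    "accession",
    "organism",
    "sequence_type",
    "length",
    "curation_level",
    "database",
)


def missing_fields(record: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a report record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if record.get(field) is None or record.get(field) == ""
    ]
```

An empty list means the record satisfies the required per-sequence items; anything else names the fields that still need to be filled in.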

---

Common Use Cases

Reference Genome

User: "Get E. coli K-12 complete genome"

```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)

# Return NC_000913.3 (RefSeq reference)
```

Gene Sequence

User: "Find human BRCA1 mRNA"

```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)
```

Specific Accession

User: "Get sequence for NC_045512.2"

→ Direct retrieval with full metadata

Strain Comparison

User: "Compare E. coli K-12 and O157:H7 genomes"

→ Search both strains, provide comparison table
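One possible shape for the comparison table is sketched below. It assumes each strain has already been retrieved and summarised into a dictionary; the keys (`strain`, `accession`, `length`, `gc`) and the helper `comparison_table` are hypothetical, and the values in the usage line are placeholders rather than real measurements.

```python
from typing import Dict, List


def comparison_table(records: List[Dict[str, object]]) -> str:
    """Render per-strain summaries as a markdown table."""
    header = "| Strain | Accession | Length (bp) | GC Content |"
    separator = "|--------|-----------|-------------|------------|"
    rows = [
        f"| {r['strain']} | {r['accession']} | {r['length']:,} | {r['gc']:.1f}% |"
        for r in records
    ]
    return "\n".join([header, separator, *rows])


# Placeholder values for illustration only.
print(comparison_table([
    {"strain": "strain-A", "accession": "ACC_A", "length": 1234567, "gc": 50.0},
    {"strain": "strain-B", "accession": "ACC_B", "length": 7654321, "gc": 49.5},
]))
```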

---

Error Handling

| Error | Response |
|-------|----------|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |
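The "wait briefly" advice for persistent rate limits can be sketched as a caller-side retry with exponential backoff. This is an illustration only: `call_with_retry` is a hypothetical helper, and it assumes a rate-limit failure surfaces as a raised exception, which may differ from how the tools actually report it.

```python
import time
from typing import Any, Callable


def call_with_retry(
    func: Callable[[], Any],
    *,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call func, waiting base_delay * 2**n seconds between failed attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                # Back off before the next try: 1s, 2s, 4s, ... by default.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A call would be wrapped as `call_with_retry(lambda: tu.tools.NCBI_get_sequence(...))`; the `sleep` parameter exists so the delay can be replaced in tests.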

---

Tool Reference

NCBI Tools (All Accessions)

| Tool | Purpose |
|------|---------|
| NCBI_search_nucleotide | Search by gene/organism |
| NCBI_fetch_accessions | Convert UIDs to accessions |
| NCBI_get_sequence | Retrieve sequence data |

ENA Tools (GenBank/EMBL Only)

| Tool | Purpose |
|------|---------|
| ena_get_entry | Entry metadata |
| ena_get_sequence_fasta | FASTA sequence |
| ena_get_entry_summary | Summary info |

---

Search Parameters Reference

NCBI_search_nucleotide

| Parameter | Description | Example |
|-----------|-------------|---------|
| operation | Always "search" | "search" |
| organism | Scientific name | "Homo sapiens" |
| gene | Gene symbol | "BRCA1" |
| strain | Specific strain | "K-12" |
| keywords | Free text | "complete genome" |
| seq_type | Sequence type | "complete_genome", "mrna", "refseq" |
| limit | Max results | 10 |

NCBI_get_sequence

| Parameter | Description | Example |
|-----------|-------------|---------|
| operation | Always "fetch_sequence" | "fetch_sequence" |
| accession | Accession number | "NC_000913.3" |
| format | Output format | "fasta", "genbank" |
