article-extractor

Skill

from ryanhudson/tapestry-skills-for-claude-code

What it does

Extracts clean, readable text from web articles and blog posts by removing navigation, ads, and clutter.

Added Feb 4, 2026

Skill Details

SKILL.md

This skill should be used when the user wants to "download article", "extract article", "save blog post", "get article text", or provides a web URL and asks to extract the main content without ads, navigation, or clutter. Saves clean, readable text from web articles and blog posts.

Overview

# Article Extractor

Extract main content from web articles and blog posts, removing navigation, ads, and clutter. Saves clean, readable text.

Prerequisites

This skill requires [UV](https://docs.astral.sh/uv/) for dependency management. Run from the tapestry-skills project root.

Workflow

```
URL → Validate → Detect Tool → Extract Content → Sanitize Filename → Save
```

Tools (in priority order):

1. reader (Mozilla Readability) - best for most articles
2. trafilatura - excellent for blogs/news (included in dependencies)
3. Fallback (tapestry-extract-html) - works without external tools

Security Requirements

All security utilities are available via UV from the project root.

URL Validation

```bash
# Run security validation (checks protocol, blocks SSRF, etc.)
uv run tapestry-validate-url "$URL" || exit 1
```
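
To illustrate the kind of checks the comment above mentions (protocol allow-listing and blocking internal hosts), here is a minimal sketch. It is not the implementation of `tapestry-validate-url`: the function name `is_probably_safe_url` and the exact host patterns are assumptions made for this example. It inspects only the URL string, so it does not resolve DNS and cannot catch a public hostname that points at a private address.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; NOT the real tapestry-validate-url.
# Returns 0 when the URL string looks safe to fetch, 1 otherwise.
is_probably_safe_url() {
    local url="$1" host

    # Accept only plain web schemes (the match is case-sensitive in this sketch).
    case "$url" in
        http://*|https://*) ;;
        *) return 1 ;;
    esac

    host="${url#*://}"        # drop the scheme
    host="${host%%[/?#]*}"    # drop path, query, and fragment
    host="${host##*@}"        # drop userinfo such as user:pass@

    # Reject every IPv6 literal rather than trying to classify it.
    case "$host" in
        \[*) return 1 ;;
    esac

    host="${host%%:*}"        # drop the port
    host=$(printf '%s' "$host" | tr '[:upper:]' '[:lower:]')
    [ -n "$host" ] || return 1

    # Block loopback, RFC 1918 private ranges, and link-local addresses.
    # These patterns are deliberately broad and may over-block unusual hostnames.
    case "$host" in
        localhost|*.localhost|0.0.0.0) return 1 ;;
        127.*|10.*|192.168.*|169.254.*) return 1 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 1 ;;
    esac
    return 0
}
```

A caller could use it as `is_probably_safe_url "$URL" || exit 1`, but the project utility remains the supported check.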

Filename Sanitization

```bash
# Use tapestry sanitization utility
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
```
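
For readers who want a feel for what sanitizing a title involves, the sketch below shows one conservative approach. It is an assumption-laden illustration, not the behavior of `tapestry-sanitize-filename`: the function name `sanitize_filename_sketch`, the allowed character set, the 100-character limit, and the `article` fallback are all choices made for this example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; NOT the real tapestry-sanitize-filename.
sanitize_filename_sketch() {
    local name="$1"

    # Replace every byte outside a small allow-list with a hyphen and
    # squeeze runs of hyphens. Non-ASCII characters are replaced too.
    name=$(printf '%s' "$name" | LC_ALL=C tr -cs 'A-Za-z0-9._-' '-')

    # Strip leading and trailing dots and hyphens, which also removes
    # "../" style prefixes and avoids hidden-file names.
    name=$(printf '%s' "$name" | sed -e 's/^[-.]*//' -e 's/[-.]*$//')

    # Keep the name comfortably below common filesystem limits.
    name=$(printf '%s' "$name" | cut -c1-100)

    # Fall back to a fixed name when nothing usable is left.
    printf '%s\n' "${name:-article}"
}
```

For example, `sanitize_filename_sketch "Hello, World!"` prints `Hello-World`, and a title consisting only of slashes falls back to `article`.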

Step 1: Check Available Tools

```bash
if command -v reader &> /dev/null; then
    TOOL="reader"
else
    # trafilatura is included in project dependencies
    TOOL="trafilatura"
fi

echo "Using: $TOOL"
```

Optional: Install reader (npm)

For best results, install reader separately:

```bash
npm install -g @aspect/readability-cli
# or
npm install -g reader-cli
```

Step 2: Extract Content

Using reader

```bash
TEMP_FILE=$(mktemp)
# Single quotes defer expansion until the trap fires
trap 'rm -f "$TEMP_FILE"' EXIT

reader "$URL" > "$TEMP_FILE"

# Get title (first line in markdown format)
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
```

Using trafilatura (via UV)

```bash
TEMP_FILE=$(mktemp)
# Single quotes defer expansion until the trap fires
trap 'rm -f "$TEMP_FILE"' EXIT

# Get content
uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"

# Get title from metadata; fall back to "Article" if it cannot be read
TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \
    python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")
```

Fallback Method (tapestry-extract-html)

Use the built-in HTML extractor when other tools aren't available:

```bash
# Extract content using tapestry's HTML extractor
uv run tapestry-extract-html "$URL" --output article.txt
```

Or get title and content separately:

```bash
TEMP_FILE=$(mktemp)
# Single quotes defer expansion until the trap fires
trap 'rm -f "$TEMP_FILE"' EXIT

# Fetch and extract
uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"

# Title is on first line (after "# ")
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
```
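
The `head`/`sed` pipeline above yields an empty title when the output file is empty or its first line is blank, whereas the trafilatura path falls back to `Article`. The sketch below shows one way to apply the same fallback here. The helper name `title_from_first_line` is hypothetical, and stripping any number of leading `#` characters is an assumption about the output format rather than documented behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a title from the first line of a file,
# falling back to "Article" when no usable title is found.
title_from_first_line() {
    local file="$1" title

    # Drop a leading markdown heading marker ("# ", "## ", ...) and
    # any trailing whitespace from the first line.
    title=$(head -n 1 "$file" 2>/dev/null \
        | sed -e 's/^#\{1,\} *//' -e 's/[[:space:]]*$//')

    printf '%s\n' "${title:-Article}"
}
```

It could be used as `TITLE=$(title_from_first_line "$TEMP_FILE")` in place of the pipeline.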

Step 3: Save with Clean Filename

```bash
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
CONTENT_FILE="${SAFE_TITLE}.txt"

# Verify content was extracted
if [ ! -s "$TEMP_FILE" ]; then
    echo "Error: No content extracted" >&2
    exit 1
fi

mv "$TEMP_FILE" "$CONTENT_FILE"
# The temp file has been moved, so clear the cleanup trap
trap - EXIT

WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')

echo "Extracted: $TITLE"
echo "Saved to: $CONTENT_FILE"
echo "Words: $WORD_COUNT"
```

Complete Workflow Script

```bash
#!/bin/bash
set -e

URL="$1"

# Validate URL
uv run tapestry-validate-url "$URL" || exit 1

# Detect tool
if command -v reader &> /dev/null; then
    TOOL="reader"
else
    TOOL="trafilatura"
fi

echo "Extracting with: $TOOL"

# Create temp file
TEMP_FILE=$(mktemp)
trap 'rm -f "$TEMP_FILE"' EXIT

# Extract based on tool.
# "|| true" keeps set -e from aborting before the fallback can run.
case $TOOL in
    reader)
        reader "$URL" > "$TEMP_FILE" || true
        TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
        ;;
    trafilatura)
        uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE" || true
        TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \
            python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")
        ;;
esac

# If extraction failed, try fallback
if [ ! -s "$TEMP_FILE" ]; then
    echo "Primary extraction failed, trying fallback..."
    uv run tapestry-extract-html "$URL" --output "$TEMP_FILE" || true
    TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
fi

# Verify extraction
if [ ! -s "$TEMP_FILE" ]; then
    echo "Error: No content extracted. Site may require authentication." >&2
    exit 1
fi

# Save with clean filename
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
CONTENT_FILE="${SAFE_TITLE}.txt"

mv "$TEMP_FILE" "$CONTENT_FILE"
trap - EXIT

# Show results
WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')

echo ""
echo "Extracted: $TITLE"
echo "Saved to: $CONTENT_FILE"
echo "Words: $WORD_COUNT"
echo ""
echo "Preview:"
head -n 10 "$CONTENT_FILE"
```

Error Handling

| Issue | Solution |
|-------|----------|
| UV not installed | Install with `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| Invalid URL | Reject with clear message |
| Internal URL (SSRF) | Block localhost/private IPs |
| Paywall/login required | Inform user, cannot extract |
| Empty extraction | Try fallback method, inform user |
| Timeout | Fallback uses 30s timeout |
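
The first row of the table can be checked up front, before any extraction starts. The sketch below is one possible way to do that. The helper name `require_cmd` and the exit status 127 are assumptions for this example and are not part of the tapestry utilities.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail with an actionable hint when a command is missing.
require_cmd() {
    local cmd="$1" hint="$2"

    # command -v succeeds only if the command can be found.
    if command -v "$cmd" > /dev/null 2>&1; then
        return 0
    fi

    echo "Error: '$cmd' is not installed. $hint" >&2
    return 127
}
```

A script could then begin with `require_cmd uv "Install with: curl -LsSf https://astral.sh/uv/install.sh | sh" || exit 1`, so a missing UV produces the install hint instead of a bare "command not found".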

What Gets Extracted

Included:

- Article title
- Author (if available)
- Main text content
- Section headings

Removed:

- Navigation menus
- Ads and promotions
- Newsletter signups
- Related articles
- Comment sections
- Social buttons
- Cookie notices

Tool Comparison

| Tool | Strengths | Availability |
|------|-----------|--------------|
| reader | Best overall, Firefox algorithm | npm install separately |
| trafilatura | News/blogs, multi-language | Included in dependencies |
| tapestry-extract-html | No external dependencies | Built-in fallback |

Dependencies

All dependencies are managed via UV and pyproject.toml:

- trafilatura: Article extraction (pinned version)
- tapestry-extract-html: Built-in fallback extractor

Optional (install separately):

- reader: Mozilla Readability CLI (`npm install -g reader-cli`)

Security Reference

For complete security guidelines: ../shared/references/security-guidelines.md
