🎯

article-extractor

🎯Skill

from ryanhudson/tapestry-skills-for-claude-code

VibeIndex|
What it does

Extracts clean, readable text from web articles and blog posts by removing navigation, ads, and clutter.

πŸ“¦

Part of

ryanhudson/tapestry-skills-for-claude-code(4 items)

article-extractor

Installation

uv runRun with uv
uv run tapestry-validate-url "$URL" || exit 1
npm installInstall npm package
npm install -g @aspect/readability-cli
npm installInstall npm package
npm install -g reader-cli
uv runRun with uv
uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"
uv runRun with uv
uv run tapestry-extract-html "$URL" --output article.txt

+ 1 more commands

πŸ“– Extracted from docs: ryanhudson/tapestry-skills-for-claude-code
4Installs
-
AddedFeb 4, 2026

Skill Details

SKILL.md

This skill should be used when the user wants to "download article", "extract article", "save blog post", "get article text", or provides a web URL and asks to extract the main content without ads, navigation, or clutter. Saves clean, readable text from web articles and blog posts.

Overview

# Article Extractor

Extract main content from web articles and blog posts, removing navigation, ads, and clutter. Saves clean, readable text.

Prerequisites

This skill requires [UV](https://docs.astral.sh/uv/) for dependency management. Run from the tapestry-skills project root.

Workflow

```

URL β†’ Validate β†’ Detect Tool β†’ Extract Content β†’ Sanitize Filename β†’ Save

```

Tools (in priority order):

  1. reader (Mozilla Readability) - best for most articles
  2. trafilatura - excellent for blogs/news (included in dependencies)
  3. Fallback (tapestry-extract-html) - works without external tools

Security Requirements

All security utilities are available via UV from the project root.

URL Validation

```bash

# Run security validation (checks protocol, blocks SSRF, etc.)

uv run tapestry-validate-url "$URL" || exit 1

```

Filename Sanitization

```bash

# Use tapestry sanitization utility

SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")

```

Step 1: Check Available Tools

```bash

if command -v reader &> /dev/null; then

TOOL="reader"

else

# trafilatura is included in project dependencies

TOOL="trafilatura"

fi

echo "Using: $TOOL"

```

Optional: Install reader (npm)

For best results, install reader separately:

```bash

npm install -g @aspect/readability-cli

# or

npm install -g reader-cli

```

Step 2: Extract Content

Using reader

```bash

TEMP_FILE=$(mktemp)

trap "rm -f '$TEMP_FILE'" EXIT

reader "$URL" > "$TEMP_FILE"

# Get title (first line in markdown format)

TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')

```

Using trafilatura (via UV)

```bash

TEMP_FILE=$(mktemp)

trap "rm -f '$TEMP_FILE'" EXIT

# Get content

uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"

# Get title from metadata

TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \

python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")

```

Fallback Method (tapestry-extract-html)

Use the built-in HTML extractor when other tools aren't available:

```bash

# Extract content using tapestry's HTML extractor

uv run tapestry-extract-html "$URL" --output article.txt

```

Or get title and content separately:

```bash

TEMP_FILE=$(mktemp)

trap "rm -f '$TEMP_FILE'" EXIT

# Fetch and extract

uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"

# Title is on first line (after "# ")

TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')

```

Step 3: Save with Clean Filename

```bash

SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")

CONTENT_FILE="${SAFE_TITLE}.txt"

# Verify content was extracted

if [ ! -s "$TEMP_FILE" ]; then

echo "Error: No content extracted"

exit 1

fi

mv "$TEMP_FILE" "$CONTENT_FILE"

trap - EXIT

WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')

echo "Extracted: $TITLE"

echo "Saved to: $CONTENT_FILE"

echo "Words: $WORD_COUNT"

```

Complete Workflow Script

```bash

#!/bin/bash

set -e

URL="$1"

# Validate URL

uv run tapestry-validate-url "$URL" || exit 1

# Detect tool

if command -v reader &> /dev/null; then

TOOL="reader"

else

TOOL="trafilatura"

fi

echo "Extracting with: $TOOL"

# Create temp file

TEMP_FILE=$(mktemp)

trap "rm -f '$TEMP_FILE'" EXIT

# Extract based on tool

case $TOOL in

reader)

reader "$URL" > "$TEMP_FILE"

TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')

;;

trafilatura)

uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"

TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \

python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")

;;

esac

# If extraction failed, try fallback

if [ ! -s "$TEMP_FILE" ]; then

echo "Primary extraction failed, trying fallback..."

uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"

TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')

fi

# Verify extraction

if [ ! -s "$TEMP_FILE" ]; then

echo "Error: No content extracted. Site may require authentication."

exit 1

fi

# Save with clean filename

SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")

CONTENT_FILE="${SAFE_TITLE}.txt"

mv "$TEMP_FILE" "$CONTENT_FILE"

trap - EXIT

# Show results

WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')

echo ""

echo "Extracted: $TITLE"

echo "Saved to: $CONTENT_FILE"

echo "Words: $WORD_COUNT"

echo ""

echo "Preview:"

head -n 10 "$CONTENT_FILE"

```

Error Handling

| Issue | Solution |

|-------|----------|

| UV not installed | Install with curl -LsSf https://astral.sh/uv/install.sh \| sh |

| Invalid URL | Reject with clear message |

| Internal URL (SSRF) | Block localhost/private IPs |

| Paywall/login required | Inform user, cannot extract |

| Empty extraction | Try fallback method, inform user |

| Timeout | Fallback uses 30s timeout |

What Gets Extracted

Included:

  • Article title
  • Author (if available)
  • Main text content
  • Section headings

Removed:

  • Navigation menus
  • Ads and promotions
  • Newsletter signups
  • Related articles
  • Comment sections
  • Social buttons
  • Cookie notices

Tool Comparison

| Tool | Strengths | Availability |

|------|-----------|--------------|

| reader | Best overall, Firefox algorithm | npm install separately |

| trafilatura | News/blogs, multi-language | Included in dependencies |

| tapestry-extract-html | No external dependencies | Built-in fallback |

Dependencies

All dependencies are managed via UV and pyproject.toml:

  • trafilatura: Article extraction (pinned version)
  • tapestry-extract-html: Built-in fallback extractor

Optional (install separately):

  • reader: Mozilla Readability CLI (npm install -g reader-cli)

Security Reference

For complete security guidelines: ../shared/references/security-guidelines.md