# web-content-fetcher

Skill from shino369/claude-code-personal-workspace.


## Installation

Install the skill:

`npx skills add https://github.com/shino369/claude-code-personal-workspace --skill web-content-fetcher`

## Skill Details

### SKILL.md

Expert guidance for fetching and parsing web content from URLs, handling various content sizes, JavaScript-rendered sites, and working around tool limitations. Use when fetching articles, documentation, social media posts, or any web content for translation, analysis, or archiving.

# Web Content Fetcher

Expert knowledge for fetching and parsing web content, handling size limitations, JavaScript-rendered sites, and extracting clean article content from HTML.

## Overview

This skill provides battle-tested strategies for fetching web content in Claude Code, addressing critical challenges:

  1. WebFetch tool limitation: ~50KB max content size
  2. Read tool limitation: 256KB max file size per call
  3. JavaScript-rendered content: Twitter/X, React SPAs require special handling

Four-tier approach:

  - Tier 1: WebFetch for small content (< 50KB)
  - Tier 2: curl + Task agent for any size static content ⭐ DEFAULT
  - Tier 3: Scripts for edge cases (EUC-JP encoding, complex HTML)
  - Tier 4: Playwright for JavaScript-rendered sites (Twitter/X, React SPAs)

## Quick Tool Limitations

| Tool       | Max Size  | Best For                        | Limitation                       |
| ---------- | --------- | ------------------------------- | -------------------------------- |
| WebFetch   | ~50KB     | Small articles, first attempt   | Prompt length constraints        |
| Read       | 256KB     | Reading fetched files           | Large files need pagination      |
| curl       | Unlimited | Any size content, raw downloads | No parsing                       |
| Task agent | Unlimited | Extraction from any size HTML   | Handles pagination automatically |
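
When the size is unknown, a quick probe against these limits can suggest a starting tier. A minimal sketch, assuming the ~50KB WebFetch ceiling above (not every server sends Content-Length, so the fallback downloads and counts bytes):

```bash
# Probe the content size to pick a starting tier (sketch)
url="https://example.com/article"
size=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
if [ -z "$size" ]; then
  size=$(curl -s "$url" | wc -c)   # no Content-Length header: download and measure
fi
if [ "$size" -lt 50000 ]; then
  echo "Under ~50KB: try WebFetch (Tier 1)"
else
  echo "Larger content: use curl + Task agent (Tier 2)"
fi
```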

## Tiered Fetching Strategy

### Tier 1: Small Content - WebFetch Direct

Use when: Simple articles, API responses, first attempt on unknown content

Quick example:

```
# Illustrative WebFetch call: pass the URL plus an extraction prompt
WebFetch:
  url: "https://example.com/article"
  prompt: "Extract the main article content as markdown"
```

If it fails with a "Prompt too long" error → switch to Tier 2.

### Tier 2: Any Size Content - curl + Task Agent ⭐ RECOMMENDED

Use when: News articles, blog posts, WebFetch fails, any content of any size

Quick workflow:

```bash
# 1. Create directory
mkdir -p output/tasks/YYYYMMDD_taskname/original

# 2. Fetch HTML
curl -s "URL" > output/tasks/YYYYMMDD_taskname/original/raw_html.html

# 3. Extract with Task agent
# Prompt: "Read HTML at [path], extract article content, save to fetched_content.md"
```

Why default: Handles any size, AI-powered extraction, no script setup needed.

For detailed workflow: See [references/workflows.md](references/workflows.md)

### Tier 3: Script-Based Extraction - Edge Cases Only

Use when: Task agent struggles with complex HTML or special encoding requirements (EUC-JP)

Available scripts:

  - `scripts/extract_article.js` - Standard HTML with Mozilla Readability
  - `scripts/extract_eucjp.js` - Japanese EUC-JP encoded sites (4gamer.net)

Quick example:

```bash
node .claude/skills/web-content-fetcher/scripts/extract_article.js raw_html.html > fetched_content.md
```

Note: Try the Task agent (Tier 2) first; use scripts only when the Task agent explicitly fails.

For script details: See [references/advanced-fetching.md](references/advanced-fetching.md)

### Tier 4: JavaScript-Rendered Content - Playwright

Use when: Twitter/X, React SPAs, dynamic sites, static fetch returns empty/error

Quick example:

```bash
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
  "https://x.com/user/status/123456789" \
  --output fetched_content.md
```

Common sites: Twitter/X, Threads, Instagram, React/Vue/Angular SPAs, AJAX-loaded content

Image Extraction:

  - Automatically extracts images if they exist (Twitter/X, Threads, blogs, etc.)
  - Captures image URLs with alt text from the main content area
  - Filters out small images (icons/logos) automatically
  - Output includes a structured Media section with URLs for downloading (see the sketch below)
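
To download the listed images afterwards, a small loop over the URL lines is enough. A minimal sketch, assuming the `- URL: https://...` line format shown in the output example later in this document:

```bash
# Download every image listed in the Media section (sketch)
mkdir -p images
grep -oE 'URL: https?://[^[:space:]]+' fetched_content.md \
  | sed 's/^URL: //' \
  | while read -r url; do
      # -O keeps the remote filename, preserving the original extension
      (cd images && curl -sLO "$url")
    done
```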

Setup required (one-time):

```bash
cd .claude/skills/web-content-fetcher
pnpm add -D playwright --save-catalog-name=dev
pnpm exec playwright install chromium
```
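
To confirm the one-time setup worked, a version check is usually enough (a sketch; `--version` is the standard Playwright CLI flag):

```bash
pnpm exec playwright --version   # prints the installed Playwright version
```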

For Playwright details: See [references/advanced-fetching.md](references/advanced-fetching.md)

## Decision Tree

Use this flowchart to choose the right approach:

```
Need to fetch web content?
│
├─ JavaScript-rendered site (Twitter/X, React SPA, dynamic content)?
│   └─ Use Playwright script (Tier 4)
│       └─ See: references/advanced-fetching.md
│
├─ Size unknown or small expected content?
│   └─ Try WebFetch (Tier 1)
│       ├─ Success? → Done ✓
│       └─ "Prompt too long" error? → Use Tier 2
│
├─ Any other case (medium/large content, WebFetch failed)?
│   └─ Use curl + Task agent (Tier 2) ⭐ DEFAULT
│       ├─ Success? → Done ✓
│       └─ Task agent struggles? → Use Tier 3
│           └─ See: references/advanced-fetching.md
│
└─ Edge cases (Task agent fails, special encoding)?
    └─ Use curl + script (Tier 3)
        └─ See: references/advanced-fetching.md
```

## Standard Workflow Quick Reference

Most common pattern (works for 90% of cases):

```bash
# 1. Setup
mkdir -p output/tasks/20260111_taskname/original

# 2. Fetch
curl -s "URL" > output/tasks/20260111_taskname/original/raw_html.html

# 3. Extract (Task agent prompt)
# "Read HTML file at [path], extract main article content,
#  remove navigation/ads/sidebars/comments/footer,
#  save clean markdown to: [path]/fetched_content.md"

# 4. Use
# Read: output/tasks/20260111_taskname/original/fetched_content.md
```
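
If you repeat this pattern often, steps 1-2 fold naturally into a small shell helper. A minimal sketch (`fetch_page` is a hypothetical name, not one of the skill's scripts):

```bash
# Create the dated task directory, fetch the HTML, and print the saved path (sketch)
fetch_page() {
  local url="$1" task="$2"
  local dir="output/tasks/$(date +%Y%m%d)_${task}/original"
  mkdir -p "$dir"
  curl -s "$url" > "$dir/raw_html.html"
  echo "$dir/raw_html.html"
}

# Usage: fetch_page "https://example.com/article" taskname
```

The Task agent extraction (step 3) still happens in Claude Code itself, so only the fetch side is scripted here.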

For detailed workflows and patterns: See [references/workflows.md](references/workflows.md)

## Common Use Cases

Translation with images:

  1. Fetch content (Tier 2, or Tier 4 for JS sites)
  2. Extract to original/fetched_content.md in a readable format (includes a Media section with image URLs)
  3. Download images using the URLs from the Media section
  4. Pass to the /translate command

Analysis:

  1. Fetch content (Tier 2 or Tier 4)
  2. Extract to original/fetched_content.md in a readable format
  3. Read and analyze

Social media posts (Twitter/X, Threads, Instagram):

  1. Use Playwright (Tier 4) directly
  2. Automatically extracts text, images, and metadata
  3. Outputs to fetched_content.md with Media section in a readable format
  4. Download images from extracted URLs
  5. Use for translation/analysis

For all patterns: See [references/workflows.md](references/workflows.md)

## When Things Go Wrong

Common issues and quick fixes:

| Issue                      | Quick Fix                     | Details                                             |
| -------------------------- | ----------------------------- | --------------------------------------------------- |
| WebFetch "Prompt too long" | Switch to Tier 2              | [troubleshooting.md](references/troubleshooting.md) |
| Read tool file too large   | Use Task agent                | [troubleshooting.md](references/troubleshooting.md) |
| Garbled Japanese text      | EUC-JP encoding issue         | [troubleshooting.md](references/troubleshooting.md) |
| JavaScript required        | Use Playwright (Tier 4)       | [troubleshooting.md](references/troubleshooting.md) |
| Anti-bot protection        | Add user agent                | [troubleshooting.md](references/troubleshooting.md) |
| Authentication required    | Use curl with headers/cookies | [troubleshooting.md](references/troubleshooting.md) |
For complete troubleshooting guide: See [references/troubleshooting.md](references/troubleshooting.md)
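
For the last two rows of the table above, the quick fix usually amounts to extra curl flags. A hedged sketch (the header and cookie values are placeholders; adapt them to the target site):

```bash
# Anti-bot protection: send a browser-like User-Agent
curl -s -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36" \
  "https://example.com/article" > raw_html.html

# Authentication: pass cookies and/or headers (placeholder values)
curl -s -b "session=YOUR_SESSION_COOKIE" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  "https://example.com/private" > raw_html.html
```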

## Best Practices

  1. Always use task directories: output/tasks/YYYYMMDD_taskname/
  2. Default to Tier 2: curl + Task agent works for nearly all cases
  3. Keep raw HTML: Save to original/raw_html.html for reference (except Tier 4)
  4. Use Playwright for JS sites: Twitter/X, React SPAs, dynamic content
  5. Clean content format: Always save extracted content as markdown
  6. Descriptive naming: Use date prefix (YYYYMMDD_) and descriptive task names
  7. Try simple first: WebFetch → curl + Task agent → Scripts (only if needed)
  8. Task agent for extraction: Let AI handle pagination and parsing complexity

## Reference Files

This skill uses progressive disclosure for efficiency. Core information is in this file. Detailed guides are in reference files:

  - [references/workflows.md](references/workflows.md) - Standard workflows, common patterns, directory structure, content extraction priorities
  - [references/advanced-fetching.md](references/advanced-fetching.md) - Tier 3 script-based extraction, Tier 4 Playwright details, setup instructions
  - [references/troubleshooting.md](references/troubleshooting.md) - Common issues, solutions, quick reference commands

Read reference files when you need detailed guidance for specific scenarios.

## Quick Start Examples

Simple article:

```bash
mkdir -p output/tasks/20260111_article/original
curl -s "https://example.com/article" > output/tasks/20260111_article/original/raw_html.html
# Task agent: Extract to fetched_content.md
```

Twitter/X post:

```bash
mkdir -p output/tasks/20260111_twitter/original
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
  "https://x.com/user/status/123" \
  --output output/tasks/20260111_twitter/original/fetched_content.md
```

Threads post:

```bash
mkdir -p output/tasks/20260111_threads/original
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
  "https://www.threads.com/@user/post/ABC123" \
  --output output/tasks/20260111_threads/original/fetched_content.md
```

Output format with images (works for all sites):

```markdown
# Title

Main content text here...

## Media

1. Image description or alt text
   - URL: https://example.com/image1.jpg
2. Another image
   - URL: https://example.com/image2.jpg
```

Note:

  - Images are automatically extracted from the main content area
  - Small images (< 100x100px) like icons/logos are filtered out
  - Image formats (.jpg, .png, .webp, .gif, .svg) are all preserved in URLs
  - When downloading, respect the original file extension

Japanese site (potential encoding issue):

```bash
mkdir -p output/tasks/20260111_japanese/original
curl -s "https://4gamer.net/..." > output/tasks/20260111_japanese/original/raw_html.html
# Task agent with encoding detection, or use extract_eucjp.js if needed
```
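
If the Task agent reports garbled text, converting the file to UTF-8 before extraction may help. A minimal sketch using iconv (assumes the page really is EUC-JP; check the HTML's charset declaration first):

```bash
# Convert EUC-JP HTML to UTF-8 so downstream tools read it correctly (sketch)
iconv -f EUC-JP -t UTF-8 \
  output/tasks/20260111_japanese/original/raw_html.html \
  > output/tasks/20260111_japanese/original/raw_html_utf8.html
```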