web-content-fetcher
π―Skillfrom shino369/claude-code-personal-workspace
web-content-fetcher skill from shino369/claude-code-personal-workspace
Installation
npx skills add https://github.com/shino369/claude-code-personal-workspace --skill web-content-fetcherSkill Details
Expert guidance for fetching and parsing web content from URLs, handling various content sizes, JavaScript-rendered sites, and working around tool limitations. Use when fetching articles, documentation, social media posts, or any web content for translation, analysis, or archiving.
Overview
# Web Content Fetcher
Expert knowledge for fetching and parsing web content, handling size limitations, JavaScript-rendered sites, and extracting clean article content from HTML.
Overview
This skill provides battle-tested strategies for fetching web content in Claude Code, addressing critical challenges:
- WebFetch tool limitation: ~50KB max content size
- Read tool limitation: 256KB max file size per call
- JavaScript-rendered content: Twitter/X, React SPAs require special handling
Four-tier approach:
- Tier 1: WebFetch for small content (< 50KB)
- Tier 2: curl + Task agent for any size static content β DEFAULT
- Tier 3: Scripts for edge cases (EUC-JP encoding, complex HTML)
- Tier 4: Playwright for JavaScript-rendered sites (Twitter/X, React SPAs)
Quick Tool Limitations
| Tool | Max Size | Best For | Limitation |
| -------------- | --------- | ------------------------------- | -------------------------------- |
| WebFetch | ~50KB | Small articles, first attempt | Prompt length constraints |
| Read | 256KB | Reading fetched files | Large files need pagination |
| curl | Unlimited | Any size content, raw downloads | No parsing |
| Task agent | Unlimited | Extraction from any size HTML | Handles pagination automatically |
Tiered Fetching Strategy
Tier 1: Small Content - WebFetch Direct
Use when: Simple articles, API responses, first attempt on unknown content
Quick example:
```
WebFetch tool with url and extraction prompt
```
If fails: "Prompt too long" error β Switch to Tier 2
Tier 2: Any Size Content - curl + Task Agent β RECOMMENDED
Use when: News articles, blog posts, WebFetch fails, any content of any size
Quick workflow:
```bash
# 1. Create directory
mkdir -p output/tasks/YYYYMMDD_taskname/original
# 2. Fetch HTML
curl -s "URL" > output/tasks/YYYYMMDD_taskname/original/raw_html.html
# 3. Extract with Task agent
# Prompt: "Read HTML at [path], extract article content, save to fetched_content.md"
```
Why default: Handles any size, AI-powered extraction, no script setup needed.
For detailed workflow: See [references/workflows.md](references/workflows.md)
Tier 3: Script-Based Extraction - Edge Cases Only
Use when: Task agent struggles with complex HTML or special encoding requirements (EUC-JP)
Available scripts:
scripts/extract_article.js- Standard HTML with Mozilla Readabilityscripts/extract_eucjp.js- Japanese EUC-JP encoded sites (4gamer.net)
Quick example:
```bash
node .claude/skills/web-content-fetcher/scripts/extract_article.js raw_html.html > fetched_content.md
```
Note: Try Task agent (Tier 2) first. Scripts only when Task agent explicitly fails.
For script details: See [references/advanced-fetching.md](references/advanced-fetching.md)
Tier 4: JavaScript-Rendered Content - Playwright
Use when: Twitter/X, React SPAs, dynamic sites, static fetch returns empty/error
Quick example:
```bash
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
"https://x.com/user/status/123456789" \
--output fetched_content.md
```
Common sites: Twitter/X, Threads, Instagram, React/Vue/Angular SPAs, AJAX-loaded content
Image Extraction:
- Automatically extracts images if exist (Twitter/X, Threads, blogs, etc.)
- Captures image URLs with alt text from main content area
- Filters out small images (icons/logos) automatically
- Output includes structured Media section with URLs for downloading
Setup required (one-time):
```bash
cd .claude/skills/web-content-fetcher
pnpm add -D playwright --save-catalog-name=dev
pnpm exec playwright install chromium
```
For Playwright details: See [references/advanced-fetching.md](references/advanced-fetching.md)
Decision Tree
Use this flowchart to choose the right approach:
```
Need to fetch web content?
β
ββ JavaScript-rendered site (Twitter/X, React SPA, dynamic content)?
β ββ Use Playwright script (Tier 4)
β ββ See: references/advanced-fetching.md
β
ββ Size unknown or small expected content?
β ββ Try WebFetch (Tier 1)
β ββ Success? β Done β
β ββ "Prompt too long" error? β Use Tier 2
β
ββ Any other case (medium/large content, WebFetch failed)?
β ββ Use curl + Task agent (Tier 2) β DEFAULT
β ββ Success? β Done β
β ββ Task agent struggles? β Use Tier 3
β ββ See: references/advanced-fetching.md
β
ββ Edge cases (Task agent fails, special encoding)?
ββ Use curl + script (Tier 3)
ββ See: references/advanced-fetching.md
```
Standard Workflow Quick Reference
Most common pattern (works for 90% of cases):
```bash
# 1. Setup
mkdir -p output/tasks/20260111_taskname/original
# 2. Fetch
curl -s "URL" > output/tasks/20260111_taskname/original/raw_html.html
# 3. Extract (Task agent prompt)
"Read HTML file at [path], extract main article content,
remove navigation/ads/sidebars/comments/footer,
save clean markdown to: [path]/fetched_content.md"
# 4. Use
Read: output/tasks/20260111_taskname/original/fetched_content.md
```
For detailed workflows and patterns: See [references/workflows.md](references/workflows.md)
Common Use Cases
Translation with images:
- Fetch content (Tier 2 or Tier 4 for JS sites)
- Extract to
original/fetched_content.mdwith readable format (includes Media section with image URLs) - Download images using URLs from Media section
- Pass to
/translatecommand
Analysis:
- Fetch content (Tier 2 or Tier 4)
- Extract to
original/fetched_content.mdwith readable format - Read and analyze
Social media posts (Twitter/X, Threads, Instagram):
- Use Playwright (Tier 4) directly
- Automatically extracts text, images, and metadata
- Outputs to
fetched_content.mdwith Media section in a readable format - Download images from extracted URLs
- Use for translation/analysis
For all patterns: See [references/workflows.md](references/workflows.md)
When Things Go Wrong
Common issues and quick fixes:
| Issue | Quick Fix | Details |
| -------------------------- | ----------------------------- | --------------------------------------------------- |
| WebFetch "Prompt too long" | Switch to Tier 2 | [troubleshooting.md](references/troubleshooting.md) |
| Read tool file too large | Use Task agent | [troubleshooting.md](references/troubleshooting.md) |
| Garbled Japanese text | EUC-JP encoding issue | [troubleshooting.md](references/troubleshooting.md) |
| JavaScript required | Use Playwright (Tier 4) | [troubleshooting.md](references/troubleshooting.md) |
| Anti-bot protection | Add user agent | [troubleshooting.md](references/troubleshooting.md) |
| Authentication required | Use curl with headers/cookies | [troubleshooting.md](references/troubleshooting.md) |
For complete troubleshooting guide: See [references/troubleshooting.md](references/troubleshooting.md)
Best Practices
- Always use task directories:
output/tasks/YYYYMMDD_taskname/ - Default to Tier 2: curl + Task agent works for nearly all cases
- Keep raw HTML: Save to
original/raw_html.htmlfor reference (except Tier 4) - Use Playwright for JS sites: Twitter/X, React SPAs, dynamic content
- Clean content format: Always save extracted content as markdown
- Descriptive naming: Use date prefix (
YYYYMMDD_) and descriptive task names - Try simple first: WebFetch β curl + Task agent β Scripts (only if needed)
- Task agent for extraction: Let AI handle pagination and parsing complexity
Reference Files
This skill uses progressive disclosure for efficiency. Core information is in this file. Detailed guides are in reference files:
- [references/workflows.md](references/workflows.md) - Standard workflows, common patterns, directory structure, content extraction priorities
- [references/advanced-fetching.md](references/advanced-fetching.md) - Tier 3 script-based extraction, Tier 4 Playwright details, setup instructions
- [references/troubleshooting.md](references/troubleshooting.md) - Common issues, solutions, quick reference commands
Read reference files when you need detailed guidance for specific scenarios.
Quick Start Examples
Simple article:
```bash
mkdir -p output/tasks/20260111_article/original
curl -s "https://example.com/article" > output/tasks/20260111_article/original/raw_html.html
# Task agent: Extract to fetched_content.md
```
Twitter/X post:
```bash
mkdir -p output/tasks/20260111_twitter/original
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
"https://x.com/user/status/123" \
--output output/tasks/20260111_twitter/original/fetched_content.md
```
Threads post:
```bash
mkdir -p output/tasks/20260111_threads/original
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
"https://www.threads.com/@user/post/ABC123" \
--output output/tasks/20260111_threads/original/fetched_content.md
```
Output format with images (works for all sites):
```markdown
# Title
Main content text here...
Media
- Image description or alt text
- URL: https://example.com/image1.jpg
- Another image
- URL: https://example.com/image2.jpg
```
Note:
- Images are automatically extracted from the main content area
- Small images (< 100x100px) like icons/logos are filtered out
- Image formats:
.jpg,.png,.webp,.gif,.svg- all preserved in URLs - When downloading, respect the original file extension
Japanese site (potential encoding issue):
```bash
mkdir -p output/tasks/20260111_japanese/original
curl -s "https://4gamer.net/..." > output/tasks/20260111_japanese/original/raw_html.html
# Task agent with encoding detection, or use extract_eucjp.js if needed
```
More from this repository5
translation-expertise skill from shino369/claude-code-personal-workspace
document-writing skill from shino369/claude-code-personal-workspace
claude-code-development skill from shino369/claude-code-personal-workspace
javascript-testing skill from shino369/claude-code-personal-workspace
engineering-terminology skill from shino369/claude-code-personal-workspace