pseo-scale
Skill from lisbeth718/pseo-skills
Scales programmatic SEO systems to 10K-100K+ pages with database-backed architectures, incremental validation, and optimized content delivery.
Part of lisbeth718/pseo-skills (12 items)
Installation
`npx skills add lisbeth718/pseo-skills --skill pseo-scale`
Skill Details
Architect programmatic SEO systems for 10K-100K+ pages with database-backed data layers, data sufficiency gating, incremental validation, crawl budget management, content enrichment pipelines, and edge delivery. Use when scaling beyond 10K pages, when builds are OOMing, when Google is not indexing all pages, or when the current in-memory architecture has hit its limits.
Overview
# pSEO Scale Architecture
Architect pSEO systems that work at 10K-100K+ pages. The patterns in the other pseo-* skills are correct at 1K-10K. Beyond 10K, in-memory data layers, full-corpus validation, and single-deploy rollouts break down. This skill provides the architecture changes needed at scale.
Scale Tiers
| Tier | Pages | Data Layer | Validation | Rollout | Sitemap |
|------|-------|-----------|------------|---------|---------|
| Small | < 1K | JSON/files, in-memory | Full pairwise | Single deploy | Single file |
| Medium | 1K-10K | Files or DB, two-tier memory | Fingerprint-based | 2-4 week batches | Index + children |
| Large | 10K-50K | Database required | Incremental + sampling | Category-by-category | Index + chunked |
| Very Large | 50K-100K+ | Database + cache layer | Delta-only + periodic full | ISR + sitemap waves | Index + streaming |
This skill focuses on the Large and Very Large tiers. If the project is under 10K pages, the standard pseo-* skills are sufficient.
1. Database-Backed Data Layer
At 10K+ pages, JSON files and in-memory arrays stop working. The data layer must move to a database with proper indexing.
Why In-Memory Breaks
```
Pages    PageIndex in memory      Full content (if loaded)
1K       ~1MB                     ~100-500MB
10K      ~10MB                    ~1-5GB (OOM)
50K      ~50MB (borderline)       ~5-25GB (impossible)
100K     ~100MB                   ~10-50GB (impossible)
```
At 50K+, even holding all PageIndex records in memory is borderline. At 100K, getAllSlugs() returning an array of 100K objects takes ~100MB and seconds to deserialize. You need cursor-based iteration.
Database Requirements
Minimum schema:
```sql
CREATE TABLE pages (
  id SERIAL PRIMARY KEY,
  slug TEXT UNIQUE NOT NULL,
  canonical_path TEXT UNIQUE NOT NULL,
  title TEXT NOT NULL,
  h1 TEXT NOT NULL,
  meta_description TEXT NOT NULL,
  category TEXT NOT NULL,
  subcategory TEXT,
  status TEXT DEFAULT 'published',
  last_modified TIMESTAMPTZ NOT NULL,
  published_at TIMESTAMPTZ,
  -- Heavy fields (only loaded per-page)
  intro_text TEXT,
  body_content TEXT,
  faqs JSONB,
  related_slugs TEXT[],
  featured_image JSONB,
  -- Scale fields
  data_sufficiency_score REAL, -- see section 2
  content_hash TEXT,           -- for incremental validation
  last_validated TIMESTAMPTZ
);
-- Required indexes for pSEO queries
CREATE INDEX idx_pages_category ON pages(category);
CREATE INDEX idx_pages_status ON pages(status) WHERE status = 'published';
CREATE INDEX idx_pages_slug ON pages(slug);
CREATE INDEX idx_pages_last_modified ON pages(last_modified DESC);
CREATE INDEX idx_pages_sufficiency ON pages(data_sufficiency_score);
CREATE INDEX idx_pages_category_status ON pages(category, status);
```
Categories table:
```sql
CREATE TABLE categories (
  slug TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  description TEXT,
  parent_slug TEXT REFERENCES categories(slug),
  page_count INT DEFAULT 0,
  last_modified TIMESTAMPTZ
);
```
Redirects table:
```sql
CREATE TABLE redirects (
  source TEXT PRIMARY KEY,
  destination TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_redirects_destination ON redirects(destination);
```
Data Layer API at Scale
The pseo-data API contract changes at scale:
```typescript
// getAllSlugs() must support cursor-based iteration at 50K+
async function* getAllSlugsCursor(): AsyncGenerator<{ slug: string; category: string }> {
  let cursor: string | null = null;
  while (true) {
    const batch = await db.query(
      `SELECT slug, category FROM pages
        WHERE status = 'published'
        ${cursor ? "AND slug > $1" : ""}
        ORDER BY slug LIMIT 1000`,
      cursor ? [cursor] : []
    );
    if (batch.length === 0) break;
    for (const row of batch) yield row;
    cursor = batch[batch.length - 1].slug;
  }
}

// getPagesByCategory() must use DB pagination, not in-memory slicing
async function getPagesByCategory(
  category: string,
  opts?: { limit?: number; offset?: number }
): Promise<PageIndex[]> {
  return db.query(
    `SELECT slug, title, h1, meta_description, canonical_path, category, last_modified
       FROM pages
      WHERE category = $1 AND status = 'published'
      ORDER BY title
      LIMIT $2 OFFSET $3`,
    [category, opts?.limit ?? 50, opts?.offset ?? 0]
  );
}

// getRelatedPages() needs a precomputed index or efficient query
async function getRelatedPages(slug: string, limit = 5): Promise<PageIndex[]> {
  // Option A: Use related_slugs array from the page record
  // Option B: Same category + shared tags query
  // Option C: Precomputed relatedness table (best at 50K+)
  return db.query(
    `SELECT p.slug, p.title, p.h1, p.meta_description, p.canonical_path,
            p.category, p.last_modified
       FROM pages p
       JOIN pages source ON source.slug = $1
      WHERE p.category = source.category
        AND p.slug != $1
        AND p.status = 'published'
      ORDER BY p.last_modified DESC
      LIMIT $2`,
    [slug, limit]
  );
}
```
Connection Pooling
At build time with parallel page generation, you need connection pooling:
```typescript
import { Pool } from "pg";
const pool = new Pool({
  max: 10, // limit concurrent connections during build
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});
```
ORM alternative: If using Prisma or Drizzle, configure connection pool limits in the ORM config. Default pool sizes are often too high for build processes.
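For example, Drizzle's node-postgres driver can reuse the pool above, so the same connection cap applies to build-time ORM queries. This is a minimal sketch, not a required setup; Prisma instead reads its limit from the connection string.
```typescript
// Drizzle over node-postgres reuses the pg Pool, so the max-connection cap
// above also bounds build-time ORM queries (sketch; adjust for your setup).
import { drizzle } from "drizzle-orm/node-postgres";
import { Pool } from "pg";

const pool = new Pool({ max: 10, idleTimeoutMillis: 30000 });
export const orm = drizzle(pool);

// Prisma reads the limit from the connection string instead, e.g.
// DATABASE_URL="postgresql://...?connection_limit=10&pool_timeout=20"
```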
See references/database-patterns.md for full patterns by database type.
2. Data Sufficiency Gating
At 100K pages, many combinations will produce thin content. Gate page generation BEFORE build time; don't create pages that will fail quality checks.
Sufficiency Score
Compute a score per potential page based on available data:
```typescript
function computeSufficiencyScore(record: RawRecord): number {
  let score = 0;
  const weights = {
    hasTitle: 10,
    hasDescription: 10,       // > 50 chars
    hasBodyContent: 20,       // > 200 words
    hasFAQs: 15,              // >= 3 Q&A pairs
    hasUniqueAttributes: 15,  // >= 3 non-null structured attributes
    hasImage: 5,
    hasCategory: 10,
    hasNumericData: 10,       // stats, ratings, prices -> LLM citation signal
    hasSourceCitation: 5,     // data provenance for E-E-A-T
  };
  if (record.title?.length > 10) score += weights.hasTitle;
  if (record.description?.length > 50) score += weights.hasDescription;
  if (wordCount(record.bodyContent) > 200) score += weights.hasBodyContent;
  if (record.faqs?.length >= 3) score += weights.hasFAQs;
  // ... etc
  return score; // 0-100
}
```
Gating Thresholds
| Score | Action |
|-------|--------|
| 80-100 | Generate page – sufficient data |
| 60-79 | Generate page with enrichment flag – mark for content pipeline |
| 40-59 | Hold – do not generate until data is enriched |
| 0-39 | Reject – insufficient data, do not generate |
Store the score in the database (data_sufficiency_score column) so you can:
- Query how many pages are gated vs. ready
- Track enrichment progress over time
- Re-score after data enrichment
- Gate at build time: `WHERE data_sufficiency_score >= 60 AND status = 'published'` (see the sketch below)
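A build-time query that applies the gate might look like the following (a sketch reusing the `db.query` helper used elsewhere in this skill; `PageIndex` is the index-tier shape assumed from pseo-data):
```typescript
// Build-time page selection applies the sufficiency gate in SQL so gated pages
// never reach the generator (sketch; assumes the same db.query helper as above).
async function getBuildablePages(category: string): Promise<PageIndex[]> {
  return db.query(
    `SELECT slug, title, h1, meta_description, canonical_path, category, last_modified
       FROM pages
      WHERE category = $1
        AND status = 'published'
        AND data_sufficiency_score >= 60
      ORDER BY slug`,
    [category]
  );
}
```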
Combination Gating
For combination pages (service × city, product × use-case), both dimensions must have sufficient data:
```typescript
function gateCombination(dimA: RawRecord, dimB: RawRecord): boolean {
  // Both dimensions must independently clear a minimum bar
  const scoreA = computeSufficiencyScore(dimA);
  const scoreB = computeSufficiencyScore(dimB);
  // The combination itself must produce enough unique content
  // that it's not just "dimA text + dimB text" pasted together
  const combinationHasUniqueContent =
    dimA.attributes?.length >= 2 &&
    dimB.attributes?.length >= 2;
  return scoreA >= 40 && scoreB >= 40 && combinationHasUniqueContent;
}
```
This is the most important scale pattern. 500 services × 200 cities = 100K combinations, but maybe only 15K have enough data for a quality page. Generate only the 15K.
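A pre-build planning pass under those rules could look like this (sketch; `loadServices`, `loadCities`, and the `RawRecord` fields are hypothetical loaders and shapes):
```typescript
// Enumerate service x city combinations and keep only those that pass the gate
// (sketch; loadServices/loadCities and the RawRecord fields are hypothetical).
async function planCombinationPages(): Promise<Array<{ service: RawRecord; city: RawRecord }>> {
  const services = await loadServices();
  const cities = await loadCities();
  const planned: Array<{ service: RawRecord; city: RawRecord }> = [];

  for (const service of services) {
    for (const city of cities) {
      if (gateCombination(service, city)) planned.push({ service, city });
    }
  }

  console.log(
    `${planned.length} of ${services.length * cities.length} combinations pass the gate`
  );
  return planned; // only these are inserted into the pages table
}
```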
3. Content Enrichment Pipeline
At 100K pages, you can't manually write intros, FAQs, and descriptions. You need an automated enrichment pipeline with quality controls.
Pipeline Architecture
```
Raw data (DB/CMS/API)
        │
        ▼
Data sufficiency scoring ──▶ Reject/hold insufficient records
        │
        ▼
Automated enrichment ──▶ Generate intros, FAQs, summaries from structured data
        │
        ▼
Quality sampling ──▶ Human reviews a random sample (5-10%) per batch
        │
        ▼
Publish gate ──▶ Only records with score >= 60 and enrichment complete
        │
        ▼
Page generation
```
Enrichment Sources (non-LLM)
Before reaching for LLM generation, exhaust structured data enrichment:
- Template composition from multiple fields: Combine 3+ data fields into prose. Not "Best {service} in {city}" but structured sentences using attributes, stats, and category context.
- Aggregation: Roll up child data into parent summaries (category stats, comparison tables, top-N lists)
- Cross-referencing: Enrich records by joining data sources (product + reviews, service + location demographics, listing + category averages)
- FAQ generation from data patterns: Turn common attribute variations into Q&A pairs ("How much does X cost in Y?" → answer from price data), as in the sketch after this list
- Comparison data: Auto-generate comparison sections from sibling records in the same category
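As a concrete sketch of the FAQ-from-data pattern (the `PricedRecord` fields are hypothetical; the point is that the answer comes entirely from structured data):
```typescript
// Turn structured price data into a Q&A pair with no LLM involved
// (sketch; the PricedRecord fields are hypothetical).
interface PricedRecord {
  name: string;
  city: string;
  currency: string;
  priceLow?: number;
  priceHigh?: number;
}

function priceFaq(record: PricedRecord): { question: string; answer: string } | null {
  if (record.priceLow == null || record.priceHigh == null) return null; // no data, no FAQ
  return {
    question: `How much does ${record.name} cost in ${record.city}?`,
    answer:
      `${record.name} in ${record.city} typically costs between ` +
      `${record.currency}${record.priceLow} and ${record.currency}${record.priceHigh}, ` +
      `based on current listing data.`,
  };
}
```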
LLM-Assisted Enrichment
If structured enrichment is insufficient, LLM-assisted generation is acceptable under strict conditions:
- Never generate the entire page content – the LLM fills gaps in an otherwise data-driven page
- Always ground in real data – the LLM prompt includes the record's actual attributes, stats, and context
- Human review sampling – review 5-10% of LLM-generated content per batch before publishing
- Store the generation metadata – track which fields were LLM-generated vs. sourced from data (see the sketch after this list)
- Apply quality guard after enrichment – run pseo-quality-guard on enriched content before publishing
- Regenerate periodically – stale LLM content should be refreshed when underlying data changes
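A grounding sketch for the first two and fourth conditions (`callLLM` and the `enrichment_log` table are hypothetical; the prompt contains only facts already present in the record):
```typescript
// Grounded enrichment: the prompt only contains facts from the record, and the
// generated field is logged so it can be sampled for review and refreshed later
// (sketch; callLLM and the enrichment_log table are hypothetical).
async function enrichIntro(record: {
  slug: string;
  name: string;
  category: string;
  attributes: Record<string, string | number>;
}): Promise<void> {
  const facts = Object.entries(record.attributes)
    .map(([key, value]) => `- ${key}: ${value}`)
    .join("\n");

  const prompt =
    `Write a 2-3 sentence introduction for "${record.name}" (${record.category}).\n` +
    `Use ONLY these facts; do not invent numbers or claims:\n${facts}`;

  const intro = await callLLM(prompt); // hypothetical LLM client

  await db.query(
    `UPDATE pages SET intro_text = $1, last_modified = NOW() WHERE slug = $2`,
    [intro, record.slug]
  );
  await db.query(
    `INSERT INTO enrichment_log (slug, field, method, created_at)
     VALUES ($1, 'intro_text', 'llm', NOW())`,
    [record.slug]
  );
}
```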
Google's position (2025): AI-generated content is acceptable if it provides genuine value. The risk is not the generation method but the output quality. Scaled LLM generation that produces interchangeable pages will trigger the same penalties as template spam.
4. Incremental Validation
At 100K pages, full-corpus quality checks take hours. Switch to incremental validation.
Delta Validation
Only validate pages that changed since the last validation run:
```typescript
async function getChangedPages(since: Date): Promise<string[]> {
  const result = await db.query(
    `SELECT slug FROM pages
      WHERE status = 'published'
        AND (last_modified > $1 OR last_validated IS NULL)`,
    [since]
  );
  return result.map(r => r.slug);
}
```
Content hashing: Store a hash of each page's rendered content. On validation, only re-check pages whose hash changed:
```typescript
import { createHash } from "crypto";

function contentHash(page: BaseSEOContent): string {
  const content = `${page.title}|${page.h1}|${page.metaDescription}|${page.bodyContent}`;
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}
```
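Putting delta selection and content hashing together, a delta run can skip pages whose rendered content has not changed (sketch; `getPageContent` and `validatePage` are hypothetical stand-ins for your page loader and the pseo-quality-guard checks):
```typescript
// Delta validation: only re-check pages whose rendered content hash changed
// (sketch; getPageContent and validatePage are hypothetical stand-ins).
async function runDeltaValidation(since: Date): Promise<void> {
  const changed = await getChangedPages(since);
  for (const slug of changed) {
    // Hypothetical loader: returns the rendered fields plus the stored hash.
    const { content, storedHash } = await getPageContent(slug);
    const hash = contentHash(content);
    if (hash === storedHash) continue; // nothing to re-validate

    const issues = await validatePage(content); // thin-content / duplicate checks
    await db.query(
      `UPDATE pages SET content_hash = $1, last_validated = NOW() WHERE slug = $2`,
      [hash, slug]
    );
    if (issues.length > 0) console.warn(`${slug}: ${issues.join(", ")}`);
  }
}
```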
Periodic Full Scan
Run a complete validation weekly or before major releases. For the full scan at 100K:
- Parallelize: Run 4-8 validation workers, each processing a category partition
- Sample cross-category similarity: Don't compare all 100K × 100K. Compare each page against 50 random pages from other categories + all pages in the same category.
- Stream results to disk: Write validation results to a JSONL file, then aggregate. Don't accumulate all results in memory.
```typescript
// Parallel validation by category
const categories = await getAllCategories();
const workers = categories.map(cat =>
  validateCategory(cat.slug) // each worker handles one category
);
const results = await Promise.all(workers);
```
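Streaming results to disk (the third point above) can be as simple as appending one JSON line per page (sketch; the result shape is an assumption of this example):
```typescript
// Append one JSON line per validated page so results never accumulate in memory
// (sketch; the result shape is an assumption).
import { createWriteStream } from "fs";

function openResultsLog(path: string) {
  const stream = createWriteStream(path, { flags: "a" });
  return {
    write(result: { slug: string; issues: string[] }) {
      stream.write(JSON.stringify(result) + "\n");
    },
    close() {
      return new Promise<void>((resolve) => stream.end(resolve));
    },
  };
}
```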
Validation Budget
| Scale | Delta validation | Full validation |
|-------|-----------------|----------------|
| 10K | Minutes | 30-60 minutes |
| 50K | Minutes | 2-4 hours |
| 100K | Minutes | 4-8 hours |
Optimization: Pre-compute and store fingerprints (MinHash signatures) in the database. Validation then only compares fingerprints, not full content.
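One way to pre-compute those fingerprints is a standard MinHash signature over word shingles (a minimal sketch; 64 hash permutations and 5-word shingles are arbitrary defaults, and the signature could be stored in a JSONB column):
```typescript
// MinHash fingerprints let validation compare short signatures instead of full
// content (sketch; 64 permutations and 5-word shingles are arbitrary choices).
import { createHash } from "crypto";

const NUM_HASHES = 64;

function shingles(text: string, size = 5): string[] {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i + size <= words.length; i++) {
    out.push(words.slice(i, i + size).join(" "));
  }
  return out;
}

function hashWithSeed(value: string, seed: number): number {
  const digest = createHash("md5").update(`${seed}:${value}`).digest();
  return digest.readUInt32BE(0);
}

function minHashSignature(text: string): number[] {
  const signature = new Array(NUM_HASHES).fill(Number.MAX_SAFE_INTEGER);
  for (const gram of shingles(text)) {
    for (let i = 0; i < NUM_HASHES; i++) {
      const h = hashWithSeed(gram, i);
      if (h < signature[i]) signature[i] = h;
    }
  }
  return signature; // store per page, e.g. in a JSONB column
}

// Estimated Jaccard similarity = fraction of matching signature positions.
function estimatedSimilarity(a: number[], b: number[]): number {
  let matches = 0;
  for (let i = 0; i < NUM_HASHES; i++) if (a[i] === b[i]) matches++;
  return matches / NUM_HASHES;
}
```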
5. Crawl Budget Management
At 100K pages, Google won't crawl everything immediately. Crawl budget (the number of pages Google crawls per day) becomes a constraint.
Sitemap Submission Strategy
```
sitemap-index.xml
├── sitemap-category-a.xml (≤ 50,000 URLs)
├── sitemap-category-b.xml
├── sitemap-category-c.xml
└── ...
```
Submission cadence:
- Submit the sitemap index to both Google Search Console and Bing Webmaster Tools
- Update individual category sitemaps as pages are added
- Don't submit all 100K URLs at once; Google throttles crawling for new sites with sudden URL spikes
Programmatic sitemap submission:
```typescript
// Use Google Indexing API for high-priority pages (job postings, livestreams)
// For standard pages, rely on sitemap discovery + Search Console
// Batch sitemap updates by category
import { writeFile } from "fs/promises";

async function updateCategorySitemap(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const xml = generateSitemapXml(pages);
  await writeFile(`public/sitemap-${category}.xml`, xml);
}
```
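The `generateSitemapXml` helper above is not defined in this skill; a minimal version might look like this (sketch; `baseUrl` is assumed to be configured elsewhere, and escaping is kept deliberately simple):
```typescript
// Minimal serializer for the category sitemap above
// (sketch; baseUrl is assumed to be configured elsewhere).
function generateSitemapXml(
  pages: Array<{ canonical_path: string; last_modified: string }>
): string {
  const urls = pages
    .map(
      (p) =>
        `  <url>\n` +
        `    <loc>${baseUrl}${p.canonical_path}</loc>\n` +
        `    <lastmod>${new Date(p.last_modified).toISOString()}</lastmod>\n` +
        `  </url>`
    )
    .join("\n");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${urls}\n</urlset>`
  );
}
```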
Crawl Budget Optimization
- Prioritize high-value pages: Set `priority` in the sitemap to guide Googlebot (0.8 for hubs, 0.6 for pages with high sufficiency scores, 0.4 for the rest)
- Remove low-quality pages from sitemap: Pages with sufficiency score < 60 should not be in the sitemap
- Fix crawl waste: Ensure no soft 404s, no redirect chains, no parameter-based duplicates – all waste crawl budget
- Server response time: Keep TTFB < 500ms. Slow servers get less crawl budget.
- Monitor crawl stats: Check Google Search Console → Settings → Crawl stats weekly. If crawl rate drops, investigate.
Indexing Rate Expectations
Google does not guarantee indexing all pages. Realistic expectations:
| Site authority | Pages submitted | Likely indexed (6 months) |
|---------------|----------------|--------------------------|
| New site | 100K | 10-30% |
| Established (DR 30-50) | 100K | 40-70% |
| High authority (DR 60+) | 100K | 70-90% |
If indexing rate is low:
- Improve content quality on indexed pages first
- Earn more backlinks to category hubs
- Reduce total page count (prune thin pages) – a smaller, higher-quality corpus often indexes better than a large, mediocre one
- Ensure internal linking reaches every page within 3 clicks
6. Monitoring at Scale
At 100K pages, you can't manually check pages. Build automated monitoring.
Key Metrics to Track
| Metric | Source | Alert Threshold |
|--------|--------|----------------|
| Pages indexed | Google Search Console API | Drops > 5% week-over-week |
| Crawl rate | Google Search Console API | Drops > 20% |
| Crawl errors (5xx, 404) | Server logs, GSC | > 1% of total pages |
| CWV regressions | CrUX API or RUM | LCP > 4s or CLS > 0.25 on any template |
| Build duration | CI/CD logs | > 2x baseline |
| Build memory peak | CI/CD logs | > 80% of available memory |
| Page count by status | Database | Published count deviates from expected |
| Sufficiency score distribution | Database | > 20% of published pages below threshold |
Automated Monitoring Script
```typescript
// scripts/monitor-pseo.ts - run daily via cron or CI

// Helper: run a COUNT/AVG query and return the numeric value from the first row
async function scalar(sql: string): Promise<number> {
  const rows = await db.query(sql);
  return Number(Object.values(rows[0])[0]);
}

async function monitor() {
  const metrics = {
    totalPublished: await scalar("SELECT COUNT(*) FROM pages WHERE status = 'published'"),
    avgSufficiency: await scalar("SELECT AVG(data_sufficiency_score) FROM pages WHERE status = 'published'"),
    belowThreshold: await scalar("SELECT COUNT(*) FROM pages WHERE data_sufficiency_score < 60 AND status = 'published'"),
    recentlyModified: await scalar("SELECT COUNT(*) FROM pages WHERE last_modified > NOW() - INTERVAL '7 days'"),
    neverValidated: await scalar("SELECT COUNT(*) FROM pages WHERE last_validated IS NULL AND status = 'published'"),
    redirectCount: await scalar("SELECT COUNT(*) FROM redirects"),
    brokenRedirects: await scalar(
      "SELECT COUNT(*) FROM redirects r WHERE NOT EXISTS (SELECT 1 FROM pages p WHERE p.canonical_path = r.destination)"
    ),
  };
  // Output report or send to monitoring service
  console.log(JSON.stringify(metrics, null, 2));
  // Alert on critical conditions
  if (metrics.brokenRedirects > 0) console.error("ALERT: Broken redirects found");
  if (metrics.belowThreshold / metrics.totalPublished > 0.2) {
    console.error("ALERT: >20% of published pages below sufficiency threshold");
  }
}
```
Search Console API Integration
At 100K pages, manual Search Console checks are impractical. Use the API:
- Indexing status: Query the URL Inspection API in batches to check indexing status of new pages
- Performance data: Pull clicks, impressions, CTR by page template to identify underperforming page types (see the sketch after this list)
- Coverage issues: Monitor for "Crawled – currently not indexed" and "Discovered – currently not indexed" trends
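A sketch of pulling per-page performance data with the googleapis client (assumes a service account that has been granted access to the Search Console property; dates and row limits are placeholders):
```typescript
// Pull clicks/impressions per page so performance can be aggregated by template
// (sketch; assumes googleapis and a service account with property access).
import { google } from "googleapis";

async function fetchPagePerformance(siteUrl: string) {
  const auth = new google.auth.GoogleAuth({
    scopes: ["https://www.googleapis.com/auth/webmasters.readonly"],
  });
  const searchconsole = google.searchconsole({ version: "v1", auth });

  const res = await searchconsole.searchanalytics.query({
    siteUrl,
    requestBody: {
      startDate: "2025-01-01",
      endDate: "2025-01-31",
      dimensions: ["page"],
      rowLimit: 5000,
    },
  });
  return res.data.rows ?? []; // [{ keys: [url], clicks, impressions, ctr, position }]
}
```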
7. Edge and CDN Architecture
At 100K pages, the origin server can't handle all traffic directly.
Caching Strategy
```
Client → CDN Edge → Origin (Next.js/framework)
                         │
                    Cache Layer (Redis/edge KV)
                         │
                      Database
```
CDN configuration:
- Cache all pSEO pages at the edge with `s-maxage=86400, stale-while-revalidate=3600`
- Use ISR revalidation to refresh cached pages (not full rebuilds)
- Set longer TTLs for stable pages (30 days), shorter for dynamic data pages (1 day)
Cache invalidation:
- On data change → invalidate the specific page's cache via on-demand revalidation API (see the sketch after this list)
- On category change → invalidate all pages in the category
- On template change → purge the CDN for all pages of that template type
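With Next.js App Router, a minimal on-demand revalidation endpoint might look like this (sketch; the secret check and route location are assumptions of this example, not a required layout):
```typescript
// app/api/revalidate/route.ts - invalidate one page's ISR cache when its data
// changes (sketch; the REVALIDATE_SECRET check is an assumption of this example).
import { revalidatePath } from "next/cache";
import { NextRequest, NextResponse } from "next/server";

export async function POST(req: NextRequest) {
  const { path, secret } = await req.json();
  if (secret !== process.env.REVALIDATE_SECRET) {
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }
  revalidatePath(path); // e.g. "/plumbers/austin"
  return NextResponse.json({ revalidated: true, path });
}
```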
Edge rendering (if supported):
- Deploy to edge runtimes (Vercel Edge, Cloudflare Workers) for < 50ms TTFB globally
- Not all frameworks support edge rendering – check compatibility
- Edge functions have memory limits (~128MB) that constrain complex data operations
Database Connection from Edge
Edge functions can't maintain persistent database connections. Options:
- HTTP-based database (PlanetScale, Neon serverless driver, Supabase edge functions)
- Edge KV store (Cloudflare KV, Vercel KV) for index-tier data with database as source of truth
- Pre-generated JSON at build time for index-tier data, database only for full page content
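A sketch of the KV-first, database-fallback pattern for index-tier lookups (the `EdgeKV` interface and `sqlOverHttp` are generic placeholders, not a specific provider's API):
```typescript
// Index-tier lookup from the edge: try the KV cache first, fall back to an
// HTTP-based database, then backfill the cache (sketch; kv and sqlOverHttp are
// generic placeholders for Cloudflare KV / Vercel KV and a serverless SQL driver).
interface EdgeKV {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

async function getPageIndexEntry(kv: EdgeKV, slug: string) {
  const cached = await kv.get(`page:${slug}`);
  if (cached) return JSON.parse(cached);

  // sqlOverHttp is a placeholder for a serverless driver query
  const rows = await sqlOverHttp(
    "SELECT slug, title, canonical_path, category FROM pages WHERE slug = $1 AND status = 'published'",
    [slug]
  );
  if (rows.length === 0) return null;

  await kv.put(`page:${slug}`, JSON.stringify(rows[0]), { expirationTtl: 86400 });
  return rows[0];
}
```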
8. Scale-Specific Build Strategy
At 100K pages, the build process itself needs architecture.
Don't Build All Pages
```typescript
// At 100K, only pre-build the most important pages
export async function generateStaticParams() {
  // Pre-build: hub pages + top 1K pages by traffic/priority
  const hubs = await getAllCategories();
  const topPages = await db.query(
    `SELECT slug, category FROM pages
      WHERE status = 'published' AND data_sufficiency_score >= 80
      ORDER BY priority DESC LIMIT 1000`
  );
  return [
    ...hubs.map(h => ({ category: h.slug })),
    ...topPages.map(p => ({ category: p.category, slug: p.slug })),
  ];
}

// ISR handles the remaining 99K pages on first request
export const dynamicParams = true;
export const revalidate = 86400; // 24 hours
```
Build Time Budget
| Pages pre-built | Expected build time | Memory |
|----------------|-------------------|--------|
| 1K (hubs + top pages) | 5-15 minutes | 2-4GB |
| 5K | 15-45 minutes | 4-6GB |
| 10K | 30-90 minutes | 6-8GB |
| 100K (DON'T DO THIS) | 5-15 hours | 16GB+ |
Rule: Never pre-build more than 10K pages. Use ISR for everything else.
Warm-Up After Deploy
After deploying, the ISR cache is cold. The first visitor to each page triggers generation. For critical pages:
```typescript
// scripts/warm-cache.ts - run after deploy
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function warmCache() {
  const priorityPages = await db.query(
    `SELECT canonical_path FROM pages
      WHERE data_sufficiency_score >= 80 AND status = 'published'
      ORDER BY priority DESC LIMIT 5000`
  );
  // Hit each page to trigger ISR generation (rate-limited)
  for (const page of priorityPages) {
    await fetch(`${baseUrl}${page.canonical_path}`);
    await sleep(100); // 10 pages/second - don't DDoS yourself
  }
}
```
Checklist
- [ ] Database is the primary data store (not JSON files or in-memory arrays)
- [ ] Required indexes exist on slug, category, status, last_modified, sufficiency_score
- [ ] Data sufficiency scoring is implemented and stored per page
- [ ] Pages with score < 60 are gated from generation
- [ ] Combination pages are gated on both dimensions
- [ ] Content enrichment pipeline exists (structured data first, LLM-assisted only with review)
- [ ] Incremental validation is implemented (delta + periodic full scan)
- [ ] Content hashes are stored for change detection
- [ ] Sitemap is split by category with index file
- [ ] Sitemap excludes pages below sufficiency threshold
- [ ] Crawl budget is monitored via Search Console
- [ ] Monitoring script runs daily with alerts
- [ ] CDN caching is configured with appropriate TTLs
- [ ] ISR handles the long tail (only top pages pre-built)
- [ ] Cache warm-up script exists for post-deploy
- [ ] Connection pooling is configured for build-time queries
- [ ] No function loads > 10K full page records into memory
Relationship to Other Skills
- Extends: pseo-data (replaces in-memory patterns with database), pseo-performance (adds CDN/edge and scale-specific build strategy), pseo-quality-guard (adds incremental validation)
- Depends on: All content and structure skills must be in place before scaling
- Validated by: pseo-quality-guard (quality doesn't change – scale does)
- Works with: pseo-orchestrate (scale considerations at every phase)