1. Thin Content Detection
A page is thin if it:
- Has fewer than 300 words of unique text content (excluding navigation, footer, boilerplate)
- Is essentially a template with only 1-2 variable substitutions
- Contains no meaningful content beyond its metadata
- Has the same paragraph structure as other pages with only proper nouns changed
How to check:
- Extract text content from each rendered page (strip HTML tags, nav, footer)
- Count unique words per page
- Compare content across pages and compute similarity ratios
- Flag pages with < 300 unique words or > 80% similarity to another page
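The checks above can be sketched as follows, assuming page text has already been extracted and stripped of navigation and boilerplate (the helper names and the word-level similarity are illustrative choices; the 300-word and 0.8 thresholds come from the rules above):

```python
def unique_word_count(text):
    """Number of distinct words in a page's extracted text."""
    return len(set(text.lower().split()))

def jaccard(text_a, text_b):
    """Word-level Jaccard similarity between two pages' text."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def is_thin(text, other_texts, min_words=300, max_sim=0.8):
    """Flag a page that is short or near-identical to another page."""
    if unique_word_count(text) < min_words:
        return True
    return any(jaccard(text, other) > max_sim for other in other_texts)
```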
2. Duplicate Content Detection
Check for:
- Exact duplicates: Two pages with identical body content
- Near duplicates: Pages with > 80% text similarity (use Jaccard similarity on n-grams or cosine similarity)
- Title duplicates: Two pages with the same `<title>` tag
- Description duplicates: Two pages with the same meta description
- URL-based duplicates: Different URLs serving the same content (www vs non-www, trailing slash variants)
How to detect:
At small scale (< 200 pages), pairwise comparison is feasible:
```
For each pair of pages:
    similarity = |intersection(ngrams(page_a), ngrams(page_b))| / |union(ngrams(page_a), ngrams(page_b))|
    if similarity > 0.8: FLAG as near-duplicate
```
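The same loop as runnable Python, assuming `pages` maps each URL to its extracted text (word trigrams are an illustrative choice of n-gram):

```python
from itertools import combinations

def ngrams(text, n=3):
    """Set of word n-grams for one page's extracted text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicates(pages, threshold=0.8, n=3):
    """Pairwise Jaccard comparison; feasible below ~200 pages."""
    flagged = []
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        a, b = ngrams(text_a, n), ngrams(text_b, n)
        if a | b and len(a & b) / len(a | b) > threshold:
            flagged.append((url_a, url_b))
    return flagged
```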
At scale (200+ pages), pairwise is O(n²) and impractical. Use one of:
- MinHash / LSH: Hash n-gram sets into fixed-size signatures, use locality-sensitive hashing to find candidate pairs. Reduces comparisons from O(nΒ²) to near-linear.
- SimHash: Compute a fingerprint per page, compare fingerprints (Hamming distance). Pages with similar fingerprints are candidates.
- Sampling: Compare each page against a random sample of 50 others + all pages in the same category. Not exhaustive but catches the common cases.
- Template fingerprinting: Hash the non-variable parts of each page. If two pages share the same template fingerprint, flag them: they differ only in variable slots.
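A minimal SimHash sketch of the fingerprinting idea (the MD5-based hash, 64-bit width, and word trigrams are illustrative choices, not a prescribed implementation):

```python
import hashlib

def simhash(text, n=3, bits=64):
    """64-bit SimHash fingerprint over word n-grams of a page's text."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)] or words
    counts = [0] * bits
    for g in grams:
        # Deterministic 64-bit hash of each n-gram
        h = int.from_bytes(hashlib.md5(g.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    """Bits that differ; small distances mark candidate duplicates."""
    return bin(a ^ b).count("1")
```

Pages whose fingerprints sit within a few bits of each other become candidates for the exact pairwise comparison above.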
3. Keyword Cannibalization Detection
Cannibalization occurs when multiple pages target the same search query.
Detection method:
- Extract the primary keyword/intent from each page (from title, H1, and first paragraph)
- Group pages that share the same primary keyword
- Flag groups with 2+ pages targeting the same keyword
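The grouping step can be sketched as follows, assuming each page's primary keyword has already been extracted from its title, H1, and first paragraph:

```python
from collections import defaultdict

def cannibalization_groups(primary_keywords):
    """Group URLs by normalized primary keyword; return groups with 2+ pages.

    primary_keywords: dict of URL -> extracted primary keyword string.
    """
    groups = defaultdict(list)
    for url, keyword in primary_keywords.items():
        groups[keyword.strip().lower()].append(url)
    return {kw: urls for kw, urls in groups.items() if len(urls) >= 2}
```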
Resolution strategies:
- Merge thin pages targeting the same keyword into one comprehensive page
- Differentiate intent (informational vs. transactional vs. navigational)
- Use canonical tags to point duplicates to the primary page
- Adjust titles and H1s to target distinct long-tail variations
4. Metadata Quality Validation
Check every page's metadata for:
| Field | Validation Rule |
|-------|----------------|
| Title | Present, 30-70 chars, unique across all pages |
| Description | Present, 100-170 chars, unique across all pages |
| H1 | Present, exactly one per page, unique across all pages |
| Canonical | Present, absolute URL, self-referencing |
| OG:title | Present |
| OG:description | Present |
| OG:url | Present, matches canonical |
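A sketch of the title and description rules from the table, assuming each page's metadata has been parsed into a dict with `title` and `description` keys (the H1, canonical, and OG checks would follow the same pattern):

```python
def validate_metadata(pages):
    """Check presence, length, and cross-page uniqueness of title/description."""
    issues = []
    seen_titles, seen_descs = {}, {}
    for url, meta in pages.items():
        title = (meta.get("title") or "").strip()
        desc = (meta.get("description") or "").strip()
        if not 30 <= len(title) <= 70:
            issues.append((url, "title length out of 30-70 chars"))
        if not 100 <= len(desc) <= 170:
            issues.append((url, "description length out of 100-170 chars"))
        if title in seen_titles:
            issues.append((url, f"duplicate title (also on {seen_titles[title]})"))
        else:
            seen_titles[title] = url
        if desc in seen_descs:
            issues.append((url, f"duplicate description (also on {seen_descs[desc]})"))
        else:
            seen_descs[desc] = url
    return issues
```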
5. Schema Markup Validation
- Every page has at least BreadcrumbList schema
- Content pages have Article or appropriate type schema
- FAQ pages have FAQPage schema with valid Q&A pairs
- No schema has empty or placeholder values
- All URLs in schema are absolute
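Two of these rules, no empty or placeholder values and absolute URLs, can be sketched over parsed JSON-LD (the placeholder list and the URL-key heuristic are illustrative assumptions):

```python
PLACEHOLDERS = {"", "todo", "tbd", "lorem ipsum", "placeholder"}

def key_looks_like_url(path):
    """Heuristic: treat 'url', '@id', and 'item' keys as URL fields."""
    last = path.rsplit(".", 1)[-1].split("[")[0]
    return last in {"url", "@id", "item"}

def schema_issues(schema, path="$"):
    """Recursively flag empty/placeholder values and relative URLs in JSON-LD."""
    issues = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            issues += schema_issues(value, f"{path}.{key}")
    elif isinstance(schema, list):
        for i, item in enumerate(schema):
            issues += schema_issues(item, f"{path}[{i}]")
    elif isinstance(schema, str):
        if schema.strip().lower() in PLACEHOLDERS:
            issues.append(f"{path}: empty or placeholder value")
        elif key_looks_like_url(path) and not schema.startswith("http"):
            issues.append(f"{path}: URL is not absolute")
    return issues
```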
6. Internal Link Health
- No orphan pages (pages with zero inbound internal links)
- No broken internal links (href targets that return 404)
- Every page links back to its category hub
- Breadcrumbs are present and correct
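The orphan and broken-link checks reduce to a link graph, assuming internal hrefs have already been extracted per page (homepage and other externally-reachable entry points would need to be excluded from the orphan set in practice):

```python
def link_health(links):
    """links: dict of page URL -> set of internal hrefs it links out to.

    Returns (orphans, broken): pages with zero inbound internal links,
    and (source, target) pairs whose target is not a known page.
    """
    all_pages = set(links)
    linked_to = set().union(*links.values()) if links else set()
    orphans = all_pages - linked_to
    broken = {(src, dst) for src, targets in links.items()
              for dst in targets if dst not in all_pages}
    return orphans, broken
```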
7. Scaled Content Abuse Detection (Google 2025)
Google's 2025 updates (March, June, August, December) increasingly target programmatic pages that exist primarily to manipulate rankings. The method (AI, templates, human) is irrelevant; only intent and value matter.
Check for these specific patterns that trigger Google's SpamBrain system:
- Template repetitiveness ratio: Extract the boilerplate (shared HTML structure and text) from all pages of a type. If boilerplate is 60-80%, flag as warning; if > 80%, flag as critical risk for scaled content abuse.
- Variable-swap-only differentiation: If the only differences between pages are proper nouns (city names, product names, keywords), flag as extremely high risk. Google specifically called out "location pages that use the same template in dozens of cities."
- Filler content patterns: Detect generic introductory paragraphs ("In today's world...", "When it comes to...", "If you're looking for...") that add no information. These patterns are specifically targeted by the December 2025 "Needs Met" enforcement.
- Value-first test: Check if the primary content/answer appears within the first 200 words. Pages that bury value below filler are devalued.
- E-E-A-T signal presence: Check for author attribution, data sources, last-updated dates. Absence of all trust signals on pSEO pages is a risk factor.
- Publication velocity: If 500+ pages were published within a single day or week, flag for review. Gradual rollout is safer.
Severity:
- Template repetitiveness > 80%: Critical; will likely trigger a scaled content abuse penalty
- Template repetitiveness 60-80%: Warning; at risk, needs content enrichment
- No E-E-A-T signals on any page: Warning
- All pages published same day: Warning
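The template repetitiveness ratio can be approximated by measuring how much of each page's text is shared across all pages of the same type. This is a line-level approximation (a real implementation would compare DOM structure); the thresholds are the ones from the severity rules above:

```python
def boilerplate_ratio(texts):
    """Average share of each page's lines that appear on every page of the type."""
    line_sets = [set(t.splitlines()) for t in texts]
    shared = set.intersection(*line_sets)
    ratios = [len(shared & lines) / len(lines) if lines else 0.0
              for lines in line_sets]
    return sum(ratios) / len(ratios)

def severity(ratio):
    """Map a repetitiveness ratio onto the severity levels above."""
    if ratio > 0.8:
        return "critical"
    return "warning" if ratio >= 0.6 else "ok"
```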
8. Heading Hierarchy Validation
Check every page for correct heading structure:
- Exactly one `<h1>` per page
- No heading level skips (h1 → h3 without h2 is invalid)
- Headings follow a logical document outline (h1 > h2 > h3)
- No empty heading tags
- Heading content is meaningful (not generic like "Section 1")
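Assuming headings have been extracted in document order as (level, text) pairs, the structural rules can be checked as follows (the meaningfulness rule needs human or heuristic review and is not covered here):

```python
def heading_issues(headings):
    """headings: list of (level, text) in document order, e.g. [(1, 'Title'), (2, 'Intro')]."""
    issues = []
    h1_count = sum(1 for level, _ in headings if level == 1)
    if h1_count != 1:
        issues.append(f"expected exactly one h1, found {h1_count}")
    prev = 0
    for level, text in headings:
        if prev and level > prev + 1:
            issues.append(f"level skip: h{prev} -> h{level}")
        if not text.strip():
            issues.append(f"empty h{level}")
        prev = level
    return issues
```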
9. Robots and Indexation
- Pages intended for indexing have `index, follow` (or no robots tag)
- Thin or utility pages have `noindex`
- No important pages accidentally blocked by robots.txt
- Sitemap includes all indexable pages and excludes noindexed ones
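The sitemap rule can be cross-checked against each page's robots directive, assuming both have already been collected (robots.txt blocking is a separate check not covered here):

```python
def sitemap_issues(robots, sitemap_urls):
    """robots: dict of URL -> meta robots value ('' means no tag).

    sitemap_urls: set of URLs listed in the sitemap.
    """
    issues = []
    for url, directive in robots.items():
        noindexed = "noindex" in directive.lower()
        if noindexed and url in sitemap_urls:
            issues.append(f"{url}: noindexed but listed in sitemap")
        if not noindexed and url not in sitemap_urls:
            issues.append(f"{url}: indexable but missing from sitemap")
    return issues
```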