playwright-testing

Automates frontend testing across unit, integration, E2E, visual, and accessibility layers using Playwright and other testing frameworks.
# Frontend Testing
Unlock reliable confidence fast: enable safe refactors by choosing the right test layer, making the app observable, and eliminating nondeterminism so failures are actionable.
## Philosophy: Confidence Per Minute
Frontend tests fail for two reasons: the product is broken, or the test is lying. Your job is to maximize signal and minimize "test is lying".
Before writing a test, ask:
- What user risk am I covering (money, progression, auth, data loss, crashes)?
- What's the narrowest layer that catches this bug class (pure logic vs UI vs full browser)?
- What nondeterminism exists (time, RNG, async loading, network, animations, fonts, GPU)?
- What "ready" signal can I wait on besides
setTimeout? - What should a failure print/screenshot so it's diagnosable in CI?
Core principles:
- Test the contract, not the implementation: assert stable user-meaningful outcomes and public seams.
- Prefer determinism over retries: make time/RNG/network controllable; remove flake at the source.
- Observe like a debugger: console errors, network failures, screenshots, and state dumps on failure.
- One critical flow first: a reliable smoke test beats 50 flaky tests.
## Test Layer Decision Tree
Pick the cheapest layer that provides needed confidence:
| Layer | Speed | Use For |
|-------|-------|---------|
| Unit | Fastest | Pure functions, reducers, validators, math, pathfinding, deterministic simulation |
| Component | Medium | UI behavior with mocked IO (React Testing Library, Vue Testing Library) |
| E2E | Slowest | Critical user flows across routing, storage, real bundling/runtime |
| Visual | Specialized | Layout/pixel regressions; for canvas/WebGL, only after locking determinism |
## Quick Start: First Smoke Test
- Define 1 critical flow: "page loads → user can start → one key action works"
- Add a test seam to the app (see below)
- Choose runner: Playwright MCP for E2E, unit tests for logic
- Fail loudly: treat console errors and failed requests as test failures
- Stabilize: seed RNG, freeze time, fix viewport, disable animations
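A minimal sketch of that smoke test in Playwright Test, assuming the app runs at localhost:3000, exposes the `window.__TEST__` seam described below, and has an accessible Start button:

```javascript
// smoke.spec.js — one critical flow; console errors fail the test.
const { test, expect } = require('@playwright/test');

test('app loads and primary action works', async ({ page }) => {
  const errors = [];
  page.on('console', msg => {
    if (msg.type() === 'error') errors.push(msg.text()); // fail loudly later
  });

  await page.goto('http://localhost:3000/?test=1&seed=42'); // deterministic mode (assumed convention)
  await page.waitForFunction(() => window.__TEST__?.ready); // explicit readiness, no setTimeout

  await page.getByRole('button', { name: 'Start' }).click();
  const state = await page.evaluate(() => window.__TEST__.state());
  expect(state.scene).toBeTruthy(); // assert a user-meaningful outcome

  expect(errors).toEqual([]); // any console error fails the test
});
```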
## Concrete MCP Workflow: Testing a Game
Step-by-step sequence for testing a Phaser/canvas game:
```
- mcp__playwright__browser_navigate
  → http://localhost:3000?test=1&seed=42
- mcp__playwright__browser_evaluate
  → () => new Promise(r => { const c = () => window.__TEST__?.ready ? r(true) : setTimeout(c, 100); c(); })
  (Wait for game ready)
- mcp__playwright__browser_console_messages
  → level: "error"
  (Fail if any errors)
- mcp__playwright__browser_snapshot
  → Get UI state and refs
- mcp__playwright__browser_click
  → element: "Start Button", ref: [from snapshot]
- mcp__playwright__browser_evaluate
  → () => window.__TEST__.state()
  (Assert game state is correct)
- mcp__playwright__browser_press_key
  → key: "ArrowRight" (or WASD for movement)
- mcp__playwright__browser_evaluate
  → () => window.__TEST__.state().player.x
  (Verify movement happened)
- mcp__playwright__browser_take_screenshot
  → filename: "gameplay-state.png"
  (Visual evidence after deterministic setup)
```
## Recommended Test Seams
Add to the app for testability (read-only, stable, minimal):
```javascript
window.__TEST__ = {
  ready: false,   // true after first interactive frame
  seed: null,     // current RNG seed
  sceneKey: null, // current scene/route
  state: () => ({ // JSON-serializable snapshot
    scene: window.__TEST__.sceneKey,
    player: { x: player.x, y: player.y, hp: player.hp },
    score: gameState.score,
    entities: entities.map(e => ({ id: e.id, type: e.type, x: e.x, y: e.y }))
  }),
  commands: {     // optional mutation commands
    reset: () => {},
    seed: (n) => {},
    skipIntro: () => {}
  }
};
```
Rule: Expose IDs + essential fields, not raw Phaser/engine objects.
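How `ready` gets flipped depends on the engine. A sketch for Phaser, assuming `window.__TEST__` was created at boot (see references/phaser-canvas-testing.md for the full setup):

```javascript
class MainScene extends Phaser.Scene {
  create() {
    window.__TEST__.sceneKey = this.scene.key;
    // Mark ready only after the first update, i.e. one interactive frame.
    this.events.once(Phaser.Scenes.Events.UPDATE, () => {
      window.__TEST__.ready = true;
    });
  }
}
```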
## Anti-Patterns to Avoid
❌ **Testing the wrong layer**: E2E tests for pure logic
- Why tempting: "Let's just test everything through the browser"
- Better: Unit tests for logic; reserve E2E for integration contracts

❌ **Testing implementation details**: Asserting DOM structure/classnames
- Why tempting: Easy to assert what you can see in DevTools
- Better: Assert user-meaningful outputs (text, score, HP changes)

❌ **Sleep-driven tests**: "wait 2s then click"
- Why tempting: Simple and "works on my machine"
- Better: Wait on explicit readiness (DOM marker, `window.__TEST__.ready`)

❌ **Uncontrolled randomness**: RNG/time in assertions
- Why tempting: "The game uses random, so the test should too"
- Better: Seed RNG (`?seed=42`), freeze time, assert stable invariants (see the sketch after this list)

❌ **Pixel snapshots without determinism**: Canvas screenshots that flake
- Why tempting: "I'll catch visual bugs automatically"
- Better: Deterministic mode first; then screenshot at known stable frames

❌ **Retries as a strategy**: "Just bump retries to 3"
- Why tempting: Quick fix that makes CI green
- Better: Fix the flake source; retries hide real problems
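For the seeded-RNG fix above, a minimal sketch; mulberry32 is one common tiny PRNG, and the `?seed=` query param is an assumed convention, not part of this skill's bundled code:

```javascript
// Deterministic PRNG: the same seed always yields the same sequence.
function mulberry32(a) {
  return function () {
    a |= 0; a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const seed = Number(new URLSearchParams(location.search).get('seed') ?? Date.now());
const rng = mulberry32(seed);
window.__TEST__.seed = seed;
// Use rng() everywhere the game would otherwise call Math.random().
```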
## Debugging Failed Tests
When a test fails, gather evidence in this order:
- Console errors: `mcp__playwright__browser_console_messages({ level: "error" })`
- Network failures: `mcp__playwright__browser_network_requests()` → check for non-2xx
- Screenshot: `mcp__playwright__browser_take_screenshot()` → visual state at failure
- App state: `mcp__playwright__browser_evaluate({ function: "() => window.__TEST__.state()" })`
- Classify the flake (see references/flake-reduction.md):
  - Readiness? → add explicit wait
  - Timing? → control animation/physics
  - Environment? → lock viewport/DPR
  - Data? → isolate test data
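When running under Playwright Test (rather than MCP), this evidence gathering can be automated with a hook; a sketch, assuming the `window.__TEST__` seam exists:

```javascript
const { test } = require('@playwright/test');

// After every test, capture evidence only if the test did not pass as expected.
test.afterEach(async ({ page }, testInfo) => {
  if (testInfo.status !== testInfo.expectedStatus) {
    await page.screenshot({ path: testInfo.outputPath('failure.png') });
    const state = await page.evaluate(() => window.__TEST__?.state?.());
    console.log('App state at failure:', JSON.stringify(state, null, 2));
  }
});
```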
## Graduation Criteria: When Is Testing "Enough"?
Minimum viable test suite:
- [ ] 1 smoke test that proves the app loads and primary action works
- [ ] Test seam exists (`window.__TEST__` with ready flag and state)
- [ ] Deterministic mode for canvas/games (`?test=1` enables seeding)
- [ ] Console errors fail tests (no silent failures)
- [ ] CI runs tests on every push
Level up when:
- Critical paths (auth, payment, save/load) have dedicated E2E
- Unit tests cover complex logic (pathfinding, damage calc, state machines)
- Visual regression on key screens (menu, HUD) with locked determinism
## Visual Regression with imgdiff.py
For pixel comparison of screenshots:
```bash
# Compare baseline to current
python scripts/imgdiff.py baseline.png current.png --out diff.png
# Allow small tolerance (anti-aliasing differences)
python scripts/imgdiff.py baseline.png current.png --max-rms 2.0
```
Exit codes: 0 = match (within tolerance), 1 = different, 2 = error
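A sketch of wiring those exit codes into a Node-based check; the paths and tolerance are placeholders:

```javascript
const { spawnSync } = require('node:child_process');

const result = spawnSync('python', [
  'scripts/imgdiff.py', 'baseline.png', 'current.png',
  '--max-rms', '2.0', '--out', 'diff.png',
], { encoding: 'utf8' });

if (result.status === 2) throw new Error('imgdiff error: ' + result.stderr);
if (result.status === 1) throw new Error('Visual regression detected; see diff.png');
// status 0: images match within tolerance
```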
## UI Slicing Regressions (Nine-Slice / Ribbons / Bars)
Canvas UI issues (panel seams, segmented ribbons, invisible HUD fills) are best caught with a dedicated UI harness instead of the full gameplay flow.
- Build a simple `test.html`/scene that loads only the UI assets.
- Render raw slices next to assembled panels (multi-size), and include ribbons/bars with both "raw crop + scale" and "stitched multi-slice" views.
- Expose `window.__TEST__` with `.commands.showTest(n)` so Playwright can toggle each mode deterministically.
- Capture targeted screenshots (panels, ribbons, bars) and diff them in CI.

See references/phaser-canvas-testing.md for the deterministic setup + screenshot workflow.
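A sketch of driving such a harness from a Playwright script; the mode names and one-frame wait are assumptions, while `showTest(n)` is the seam described above:

```javascript
// Cycle through the harness's display modes and capture each for diffing.
const MODES = ['panels', 'ribbons', 'bars']; // hypothetical mode names

for (const [n, name] of MODES.entries()) {
  await page.evaluate(i => window.__TEST__.commands.showTest(i), n);
  await page.evaluate(() => new Promise(requestAnimationFrame)); // let one frame render
  await page.screenshot({ path: `ui-${name}.png` });
}
```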
## Variation Guidance
Adapt approach based on context:
- DOM app: Standard Playwright selectors, wait for text/elements
- Canvas game: Test seams mandatory, wait via `window.__TEST__.ready`
- Hybrid: DOM for menus, test seams for gameplay
- CI-only GPU: May need software rendering flags or skip visual tests
- UI slicing regressions: For nine-slice/ribbon/bar artifacts, prefer a small UI harness scene/page with deterministic modes and targeted screenshots (references/phaser-canvas-testing.md)
## Bundled Resources
Read these when needed:
- references/playwright-mcp-cheatsheet.md: Detailed MCP tool patterns
- references/phaser-canvas-testing.md: Deterministic mode for Phaser games
- references/flake-reduction.md: Flake classification and fixes
## Remember
You can make almost any frontend (including canvas/WebGL games) testable by adding a tiny, stable seam for readiness + state. One reliable smoke test is the foundation. Aim for tests that are boring to maintain: deterministic, explicit about readiness, and rich in failure evidence. The goal is confidence, not coverage numbers.