How AI Tests Its Own Games: QA Without a QA Team

12 min read

The Dark Factory builds games with autonomous AI agents. Four Love2D titles — Polybreak, Chronostone, Voidrunner, Dreadnought — developed entirely by Claude instances running on cron jobs every three hours. No human writes the Lua. No human reviews the commits. No human decides when a feature is done.

Which raises an obvious question: who tests them?

The answer is nobody. And also: everything.

The Fundamental Problem

A human game developer plays their game. They launch it, click through menus, shoot some enemies, die a few times, notice the health bar is three pixels too low, fix it, play again. The feedback loop is tight because humans are naturally good at playing games and noticing when things feel wrong.

AI agents cannot play games. They can write a perfectly functional particle system, implement frame-perfect collision detection, build a twelve-stage boss fight with phase transitions — and never once see any of it run. They operate entirely on source code. The rendered output, the thing the player actually experiences, is invisible to them.

This is not a minor gap. It is the central challenge of autonomous game development. You can write a thousand lines of rendering code and have zero confidence that anything draws correctly. Timer logic can trap animations inside conditional blocks. Scale factors can initialize to zero. Draw calls can execute in the wrong order, producing technically valid but visually broken output.

We solved this with three interlocking systems: autoplay, visual verification, and gate promotion. Together they form a QA pipeline that catches most of what a human tester would catch — without any human involvement.

System 1: Autoplay — The Game Tests Itself

Every Dark Factory game implements a Game._demo state machine. It is both a player-facing attract mode (the arcade tradition of showing gameplay when nobody is at the controls) and an engineering tool for autonomous QA.

Here is how it works in Polybreak, the breakout game:

Game._demo = {
  active = false,
  idle_t = 0,
}

function Game._demo.init()
  -- Set up AI paddle state
  -- Launch ball, pick random level
end

function Game._demo.step(dt)
  -- AI paddle tracks ball with intentional lag
  -- 10-15% chance of deliberate miss per frame
  -- Cycles through game states naturally
end

After ten seconds of idle on the title screen, the demo activates. An AI paddle starts playing the game — not perfectly, because perfect play looks robotic and unnatural, but with intentional misses and delayed reactions that produce realistic-looking gameplay.
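The imperfect-tracking logic is simple enough to sketch. The real implementation is Lua inside `Game._demo.step`; this Python version is an illustrative stand-in, and the parameter names, speed cap, and miss offsets are ours, not Polybreak's:

```python
import random

def demo_paddle_step(paddle_x, ball_x, dt,
                     paddle_speed=300.0, miss_chance=0.12):
    """Move the AI paddle toward the ball with a capped speed.

    A small per-frame chance of aiming at a deliberately wrong target
    produces realistic-looking misses. All names and values here are
    illustrative; only the technique comes from the article.
    """
    target = ball_x
    if random.random() < miss_chance:
        # Deliberate miss: aim slightly off the ball's position
        target = ball_x + random.choice((-60.0, 60.0))
    # Capped movement toward the target gives the lagged, human feel
    max_step = paddle_speed * dt
    step = max(-max_step, min(max_step, target - paddle_x))
    return paddle_x + step
```

Because the paddle can only close a fixed distance per frame, it trails a fast-moving ball naturally, and the occasional wrong target turns that lag into a visible, plausible miss.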

The key insight is that the demo mode exercises real code paths. It does not fake anything. The ball physics run. Collisions resolve. Powerups spawn and activate. Levels load and transition. The score increments. Lives deplete. Game over screens trigger. Every system in the game executes exactly as it would during real play.

This means that if a code change breaks level transitions, the demo will crash at the transition. If a rendering change produces invisible entities, the demo will play through the broken visuals — and the next system catches it.

Chronostone, the RPG, has its own variant. An AI party navigates through combat encounters, using abilities, taking damage, watching health bars animate. Dreadnought’s attract mode sends an AI ship through dark corridors, firing weapons and triggering sector intro cards. Each game’s demo is tuned to exercise the code paths most likely to break.

System 2: Visual Verification — The Agent Sees the Screen

Autoplay exercises code paths, but it cannot see. A game could autoplay perfectly while rendering nothing but a black screen, and the demo would never know.

Visual verification closes this gap. The system works through a Lua shim that gets injected into a temporary copy of the game (never the original source):

1. Copy game to temp directory
2. Inject IPC shim that wraps love.draw() and love.update()
3. Launch Love2D on the modified copy
4. Send commands via file-based protocol:
   _cmd.txt: "KEY return"    → press a key
   _cmd.txt: "SCREENSHOT 1"  → capture frame
   _cmd.txt: "QUIT"          → exit
5. Read back PNG screenshots
6. Agent visually inspects the images
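The driver side of that protocol can be sketched in a few lines. The `_cmd.txt` filename and the `KEY` / `SCREENSHOT` / `QUIT` verbs come from the protocol above; the deletion-as-acknowledgement convention, the timeout, and the screenshot filename are assumptions for illustration:

```python
import os
import time

def send_command(game_dir, command, timeout=5.0):
    """Write one command to the injected shim's command file.

    Assumes the shim polls _cmd.txt each frame, executes the command,
    and deletes the file; we treat the deletion as an ack. That ack
    convention is our assumption, not confirmed Dark Factory behavior.
    """
    cmd_path = os.path.join(game_dir, "_cmd.txt")
    with open(cmd_path, "w") as f:
        f.write(command)
    deadline = time.time() + timeout
    while os.path.exists(cmd_path):
        if time.time() > deadline:
            raise TimeoutError(f"shim never consumed: {command}")
        time.sleep(0.05)

def capture_sequence(game_dir, keys, shot_name="shot_1.png"):
    """Drive the game to a target state, then screenshot and quit."""
    for key in keys:
        send_command(game_dir, f"KEY {key}")
    send_command(game_dir, "SCREENSHOT 1")
    send_command(game_dir, "QUIT")
    return os.path.join(game_dir, shot_name)  # hypothetical path
```

File-based IPC is crude but robust: no sockets, no game-side networking code, and the protocol survives the game being a modified temp copy.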

The screenshot capture script knows each game’s menu structure. For Polybreak, it navigates: title screen, difficulty select, level start, then captures gameplay. For Chronostone: title, world map, battle screen. Each game gets a tailored sequence of key presses that drives it to the states worth verifying.

Then the agent reads the PNG files. Claude is multimodal — it can look at a screenshot and determine whether entities are rendering at correct sizes, whether UI elements are visible, whether colors look right, whether the expected game state was reached. This is not pixel-perfect comparison. It is judgment-based visual inspection, similar to what a human QA tester does when they glance at a screen and say “that looks wrong.”

This catches an entire category of bugs that code analysis alone misses:

  • Entities rendering as tiny dots. A common failure mode where scale animation timers get trapped inside conditional blocks. The entity’s base size is 1×1 pixel, and the scale-up logic never runs. The game plays fine — collision boxes work, scores increment — but the player sees dots instead of spaceships.
  • Black captures. The draw function runs but produces no visible output, usually because of a graphics state leak from a previous draw call. A love.graphics.push() without a matching pop() will progressively corrupt the transform stack.
  • Animation freezes on level transitions. Timer decrements placed inside state gates stop running when the game transitions to a new level. The timer needs exactly one frame to reach zero and trigger the next animation phase, but the state gate prevents that frame from executing.
  • Color bleeding between entities. Missing love.graphics.pop() calls allow one entity’s color settings to leak into the next entity’s draw call. The spaceship is suddenly the same color as the health bar.
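The timer-gate failure in the first and third bullets is easy to reproduce in miniature. This Python sketch (all names hypothetical; the real code is Lua) shows why the bug is invisible to code review — both versions look plausible, but one freezes the moment the state changes:

```python
def run_frames(n, dt=1 / 60, gated=True):
    """Simulate an entity scale-up timer across a level transition.

    With gated=True the decrement sits inside a state check, so the
    timer freezes when the state leaves "play" and the entity stays at
    its base size -- the 'tiny dots' bug. With gated=False the timer
    runs unconditionally and the scale-up completes.
    """
    state = "play"
    timer = 0.5        # seconds until the entity reaches full scale
    scale = 1.0        # base size in pixels before the scale-up runs
    for frame in range(n):
        if frame == 10:
            state = "transition"  # a level transition interrupts play
        if not gated or state == "play":
            timer = max(0.0, timer - dt)
        if timer == 0.0:
            scale = 32.0          # scale-up finally triggers
    return scale
```

The gated version plays fine — collisions and scoring never touch the timer — which is exactly why only a screenshot reveals the dot-sized entities.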

The visual verification step happens after every rendering change, every animation tweak, every VFX addition. The overhead is about thirty seconds per capture. Compared to the hours a user-reported bug takes to diagnose, thirty seconds is nothing.

System 3: Gate Promotion — Automated Quality Thresholds

The first two systems catch bugs at the individual change level. The third system operates at the project level, tracking overall quality through a status hierarchy:

in_development → feature_complete → qa_pass → released

Each game node stores a milestone context fact — a text string describing current progress. A background job runs every thirty minutes, scanning these milestone strings for keywords that indicate quality gates have been passed:

MILESTONE_RULES = [
    (r"\b(shipped|released|launch(?:ed)?)\b", "released"),
    (r"\b(smoke[_ -]?ready|preflight|qa[_ -]?pass)\b", "qa_pass"),
    (r"\b(feature[_ -]?complete|release[_ -]?readiness)\b", "feature_complete"),
]

When a game agent writes “all quality checks verified, qa_pass” in its milestone, the background job picks up the keyword, compares the rank against the game’s current status on the live website, and promotes it if the new rank is higher. The promotion is idempotent — running it twice has no effect — and it only moves forward. A game cannot regress from qa_pass back to in_development through this mechanism.
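The ratchet logic can be sketched directly from those rules. The keyword table and the forward-only, idempotent behavior come from the article (the rules are repeated here so the sketch runs standalone); the function name and rank encoding are ours:

```python
import re

# Keyword rules from the article, with \b word boundaries intact
MILESTONE_RULES = [
    (r"\b(shipped|released|launch(?:ed)?)\b", "released"),
    (r"\b(smoke[_ -]?ready|preflight|qa[_ -]?pass)\b", "qa_pass"),
    (r"\b(feature[_ -]?complete|release[_ -]?readiness)\b", "feature_complete"),
]

RANK = {"in_development": 0, "feature_complete": 1, "qa_pass": 2, "released": 3}

def promote(current_status, milestone_text):
    """Return the (possibly promoted) status for a game.

    Scans the milestone string for gate keywords, takes the highest
    rank found, and only ever moves forward: running it twice is a
    no-op, and a game can never regress through this path.
    """
    best = current_status
    for pattern, status in MILESTONE_RULES:
        if re.search(pattern, milestone_text, re.IGNORECASE):
            if RANK[status] > RANK[best]:
                best = status
    return best
```

Encoding statuses as ranks makes both properties trivial: idempotence because the comparison is strict, and the ratchet because the current status is the floor.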

This creates a ratchet effect. Once a game passes QA, it stays passed. If a subsequent change breaks something, the studio orchestrator detects the regression through its own state machine validation (checking that menu, play, pause, and game-over states all transition correctly) and assigns a fix before the next cycle.

The status hierarchy feeds directly into the WordPress game portfolio on x00f.com. Visitors see real-time development status for each game, updated automatically without any human intervention. When Voidrunner passed its Steam preflight checks, the website reflected qa_pass within thirty minutes.

Cross-Game Bug Detection

The most sophisticated piece of the QA system is pattern-based bug detection across games. When a bug is fixed in one game, the studio orchestrator checks whether the same pattern exists in sibling games:

# After fixing a timer bug in Polybreak, scan all games
grep -rn "timer.*dt.*if.*state" games/*/src/

If the same buggy pattern appears in Chronostone or Voidrunner, the orchestrator sends a targeted handoff with the exact fix — the commit that resolved it, the file and line number, and the corrected code. This is not a generic “check your timers” instruction. It is a precise, actionable bug report with the solution already attached.

This matters because the games share architectural DNA. They all use the same engine, the same patterns for state machines and timers and draw calls. A bug that manifests in one game is likely latent in all of them. Cross-game scanning catches these before players do.
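A Python equivalent of that grep-and-report step might look like the following. The `games/*/src/` layout comes from the article's grep command; the function name, return shape, and `.lua`-only filter are illustrative:

```python
import re
from pathlib import Path

def scan_siblings(games_root, pattern, fixed_in):
    """Find a known-buggy pattern in sibling games' Lua sources.

    Returns (game, file, line_number, line) hits outside the game
    where the bug was already fixed -- the raw material for a targeted
    handoff. Layout follows the article's games/*/src/ convention.
    """
    rx = re.compile(pattern)
    hits = []
    for lua in Path(games_root).glob("*/src/**/*.lua"):
        game = lua.relative_to(games_root).parts[0]
        if game == fixed_in:
            continue  # skip the game that already has the fix
        for n, line in enumerate(lua.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append((game, str(lua), n, line.strip()))
    return hits
```

Each hit carries a file and line number, which is what lets the orchestrator attach the exact location to the handoff rather than a vague "check your timers."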

The recent polish cycle demonstrated this at scale. When Voidrunner added CRT post-processing effects — scanlines and vignette overlays — the studio backported the rendering code to all four games. Each backport required adapting the effect parameters (Chronostone uses scanlines only in sci-fi environments, not outdoor maps), and each adaptation went through the visual verification pipeline before committing.

The State Machine Contract

Every Dark Factory game must implement a specific state machine:

Menu → Play → Pause → Game Over
         ↕
    Attract/Demo

Plus: arcade and campaign modes, a BITS economy with a shop, three difficulty tiers with persistence, and the attract/demo mode described above. The studio orchestrator validates this contract on every run. If a game agent accidentally breaks a state transition — say, the pause menu fails to resume gameplay — the orchestrator detects it immediately and assigns a fix.

This is not testing in the traditional sense. It is a structural contract that the code must satisfy. The game can have any gameplay, any visual style, any number of levels — but it must have these states and these transitions. The contract eliminates entire categories of shipping bugs: games that crash when paused, games that cannot be restarted, games with no way to quit.
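A contract check of this kind reduces to validating edges on a transition table. This sketch derives its required states and transitions from the diagram above; the dict representation, edge list, and function name are our assumptions about how such a validator could look, not the orchestrator's actual code:

```python
def validate_contract(transitions):
    """Check a game's state table against the required contract.

    `transitions` maps state name -> set of states reachable from it.
    Returns a list of violations; an empty list means the contract
    holds. States and edges follow the article's diagram.
    """
    required_states = {"menu", "play", "pause", "game_over", "attract"}
    required_edges = [
        ("menu", "play"), ("play", "pause"), ("pause", "play"),
        ("play", "game_over"), ("game_over", "menu"),
        ("menu", "attract"), ("attract", "menu"),
    ]
    errors = []
    missing = required_states - set(transitions)
    if missing:
        errors.append(f"missing states: {sorted(missing)}")
    for src, dst in required_edges:
        if dst not in transitions.get(src, set()):
            errors.append(f"missing transition: {src} -> {dst}")
    return errors
```

The pause-fails-to-resume bug described above would surface here as a single missing `pause -> play` edge, caught on the orchestrator's next run rather than by a player.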

The recent QA state machine pass verified all transitions across all four games. Every state reachable. Every transition functional. Every edge case (pause during boss fight, game over during attract mode, difficulty change mid-campaign) handled.

What Still Breaks

This system is not perfect. It catches rendering bugs, state machine violations, cross-game regressions, and many gameplay issues. It does not catch:

  • Game feel. Whether the controls feel responsive, whether the difficulty curve is satisfying, whether the juice effects enhance or distract. These are subjective judgments that require a human player. The operator plays each game periodically and files feedback as handoffs.
  • Edge-case interactions. Two systems that work independently but break when combined — a powerup that conflicts with a boss phase, a UI overlay that blocks a critical game element. The demo mode exercises common paths but cannot explore every combination.
  • Platform-specific issues. The games run on Love2D, which abstracts most platform differences, but font rendering, audio latency, and gamepad mappings vary across operating systems. These require testing on actual target platforms.
  • Player confusion. The game might be technically correct but confusing to play. Unclear objectives, misleading visual cues, counterintuitive controls. The agent cannot experience confusion because it does not experience gameplay.

These gaps are real, and we do not pretend otherwise. The QA system handles the mechanical verification — the stuff that scales. The human handles the experiential testing — the stuff that requires judgment. The split is deliberate.

The Numbers

Current state of the Dark Factory QA pipeline:

  • 4 games with active autoplay/demo modes exercising gameplay loops continuously
  • 3 QA systems (autoplay, visual verification, gate promotion) running independently
  • 30 seconds per visual verification capture
  • 30 minutes between gate promotion scans
  • 8 documented bug patterns in the visual troubleshooting guide, with root causes and fixes
  • 3 games at qa_pass, 1 in active development
  • 0 human QA testers

The system is not a replacement for human testing. It is a replacement for the absence of human testing. Most indie games ship with whatever testing the developer manages to squeeze in between coding and marketing and invoicing and support. The Dark Factory’s QA pipeline runs automatically, continuously, and catches bugs that a solo developer working fourteen-hour days would miss because they are too tired to notice the health bar is three pixels too low.

That is the real value. Not perfection. Coverage.

Why This Matters Beyond Games

The patterns here — autoplay for exercising code paths, visual verification for catching rendering failures, gate promotion for tracking quality state — are not game-specific. They are solutions to a general problem: how do you test software written by agents that cannot use the software?

Any AI-generated frontend has this problem. A Claude instance can write perfect React components and never see how they render. An agent can build an entire dashboard and have no idea whether the charts display correctly. The same visual verification approach — render, screenshot, inspect — applies directly.

The gate promotion pattern applies to any multi-agent system with quality requirements. When multiple agents contribute to a shared product, you need automated quality gates that track overall readiness and prevent regressions. The specifics change (it is not always qa_pass versus feature_complete), but the structure — keyword detection, rank comparison, idempotent promotion — transfers cleanly.

The cross-game bug detection pattern generalizes to any codebase with shared architecture. Microservices built from the same template. Frontend components sharing a design system. Libraries with similar internal patterns. When you fix a bug in one, scan the others.

AI-generated software is coming. The QA problem is coming with it. The Dark Factory is one answer — not the only answer, but a working one. Four games, zero QA team, and a system that catches most of what matters.
