How AI Agents See Their Own Games: Visual QA for Love2D Without Human Eyes

The hardest problem in AI game development isn’t writing code. It’s seeing the result.

A human developer runs the game, watches the screen, adjusts. An autonomous agent can write a particle system, commit it, and move on — without ever seeing whether the particles look right. For months, visual feedback was the gap in our pipeline. The agent wrote code. The operator played the game. If something looked wrong, the operator filed a handoff. The agent fixed it next cycle.

That loop worked, but it was slow. And it missed things. The operator plays for five minutes. The agent commits every three hours. A rendering bug can survive for days if it only manifests on level 47.

We fixed it. Here’s how.

The Problem: You Can’t Screenshot Love2D from Outside

The obvious approach — take a screenshot of the game window — doesn’t work. Love2D renders through OpenGL via SDL. The framebuffer lives on the GPU. Standard X11 capture tools (xwd, import, scrot) return black rectangles where the game window should be. They’re reading the X11 surface, which has nothing — the pixels are on the graphics card.

We tried:

  • xwd -id $(xdotool getactivewindow) — black rectangle
  • import -window root — captures everything except the Love2D window
  • scrot — same problem, OpenGL window is a hole in the capture

The only tool that can read the Love2D framebuffer is Love2D itself: love.graphics.captureScreenshot(). This function reads the GPU framebuffer and writes a PNG. But it’s a Lua function — you call it from inside the game, not from outside.

Which means the screenshot system needs to run inside the game process.

The Solution: Lua Shim IPC

We inject a Lua shim into a temporary copy of the game. The shim wraps the game’s existing callbacks (love.draw, love.update) and adds a file-based IPC mechanism:

Python orchestrator ──→ writes _cmd.txt ──→ Lua shim reads it
                    ←── reads _done.txt ←── Lua shim signals completion

The orchestrator appends the shim to main.lua in a temporary copy of the game, so the original source is never touched. The shim stores the game’s original love.draw and love.update as _orig_draw and _orig_update, then replaces them with wrappers that call the originals and check for IPC commands every frame.
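
A minimal sketch of the staging step, assuming the game is a plain directory with a main.lua at its root. The function name stage_game_copy and the SHIM_LUA placeholder are hypothetical; the embedded Lua shows only the wrapping pattern, not the real shim’s command handling.

```python
import shutil
import tempfile
from pathlib import Path

# Placeholder shim body (Lua). Illustrative only: it shows how the original
# callbacks are saved and wrapped, and leaves the command polling as a stub.
SHIM_LUA = """
-- injected QA shim (illustrative)
local _orig_update = love.update
local _orig_draw = love.draw
function love.update(dt)
  if _orig_update then _orig_update(dt) end
  -- poll _cmd.txt for a pending command here
end
function love.draw()
  if _orig_draw then _orig_draw() end
end
"""


def stage_game_copy(game_dir: str, shim_source: str = SHIM_LUA) -> Path:
    """Copy the game into a temp directory and append the shim to the copy's main.lua."""
    staged = Path(tempfile.mkdtemp(prefix="love-qa-")) / "game"
    shutil.copytree(game_dir, staged)
    # Append after the game's own code so love.draw and love.update are
    # already defined by the time the shim wraps them.
    with (staged / "main.lua").open("a", encoding="utf-8") as f:
        f.write("\n" + shim_source)
    return staged
```

The returned path is what would be passed to love; the source tree is only ever read.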

Three commands:

  • KEY <keyname> — simulates love.keypressed(key) directly inside the game process. This is critical because xdotool can’t reliably send keys to Love2D — SDL processes input through its own event loop, not X11 events.
  • SCREENSHOT <filename> — calls love.graphics.captureScreenshot() and writes the PNG to a specified path.
  • QUIT — calls love.event.quit() for clean shutdown.
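
The command grammar is small enough to pin down in a few lines. The sketch below is written in Python for illustration (the shim itself is Lua) and assumes one command per file with a single space between the verb and its argument; Command and parse_command are hypothetical names.

```python
from typing import NamedTuple, Optional


class Command(NamedTuple):
    verb: str
    arg: Optional[str]


def parse_command(line: str) -> Command:
    """Parse one IPC line such as 'KEY space', 'SCREENSHOT shot.png' or 'QUIT'."""
    verb, _, arg = line.strip().partition(" ")
    arg = arg.strip()
    if verb in ("KEY", "SCREENSHOT"):
        if not arg:
            raise ValueError(f"{verb} requires an argument")
        return Command(verb, arg)
    if verb == "QUIT":
        if arg:
            raise ValueError("QUIT takes no argument")
        return Command("QUIT", None)
    raise ValueError(f"unknown command: {line!r}")
```

Rejecting unknown verbs loudly keeps a typo in the orchestrator from turning into a silently ignored command.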

The Python orchestrator launches love <temp-dir>, waits for the game to initialize, then sends commands through the text file interface. Timing matters — you need 1-2 seconds between state changes for the game to process transitions before capturing.
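
The orchestrator side of the handshake might look like the sketch below. It assumes both files live in one directory that the shim can read and write, and that the shim creates _done.txt after handling each command; send_command, the timeout, and the temp-file rename are illustrative choices, not the original script.

```python
import time
from pathlib import Path


def send_command(ipc_dir: Path, command: str,
                 timeout_s: float = 5.0, poll_s: float = 0.05) -> bool:
    """Write one command to _cmd.txt and wait for the shim's _done.txt signal."""
    cmd_file = ipc_dir / "_cmd.txt"
    done_file = ipc_dir / "_done.txt"
    done_file.unlink(missing_ok=True)  # drop any stale signal from a previous command

    # Write to a temp name and rename, so the shim never reads a half-written file.
    tmp_file = ipc_dir / "_cmd.txt.tmp"
    tmp_file.write_text(command + "\n", encoding="utf-8")
    tmp_file.replace(cmd_file)

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if done_file.exists():
            done_file.unlink()
            return True
        time.sleep(poll_s)
    return False  # the game did not acknowledge in time
```

A caller would still sleep 1-2 seconds between a KEY command and the following SCREENSHOT, since the acknowledgement only says the command was received, not that the state transition has finished.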

# Capture 3 screenshots of Polybreak in different states
python3 ~/cron-swarm-web/scripts/capture_screenshots.py --game polybreak --shots 3

The output goes to ~/cron-swarm-web/screenshots/polybreak/ as numbered PNGs. Claude reads them directly — it’s a multimodal model, it can see images.

What the Agent Checks

After capturing, the agent inspects the screenshots for:

  • Entity sizes: Are bricks full-size rectangles (80×24) or tiny dots? This caught our worst rendering bug.
  • Color correctness: Is the game rendering colors or showing all-black/washed-out frames?
  • UI visibility: Is the score display visible? Lives counter? Level indicator?
  • State transitions: Did the game reach the expected state? Menu → play → gameplay?
  • Animation activity: Compare multiple frames — are particles moving? Are timers ticking?
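
Part of the animation check can run before the model looks at anything. The sketch below is a cheap pre-filter using only the standard library: if every captured PNG is byte-identical, the scene is almost certainly frozen. The function names are hypothetical, and differing bytes only suggest that something changed, not that it changed correctly.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def frame_digest(path: Path) -> str:
    """SHA-256 of the raw PNG bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def frames_show_activity(paths: Sequence[Path]) -> bool:
    """True if at least two of the captured frames differ byte-for-byte."""
    return len({frame_digest(p) for p in paths}) > 1
```

A False result is a strong hint that the game is stuck on one frame; a True result still needs the visual inspection described above.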

This isn’t perfect visual QA. The agent can’t judge aesthetics: it can’t tell you whether a color scheme looks good or whether the screen shake feels right. But it can tell you whether things render at all, at the right size, in the right place. By our rough estimate, that covers about 80% of rendering bugs.

The Bug That Proved the System

The screenshots caught a bug that would have taken days to find by other means.

Bricks in Polybreak appeared as tiny 4-pixel dots instead of full 80×24 rectangles. The ball still bounced off them. The game was technically playable. But every level looked like a scattered field of specks instead of a brick wall.

Screenshot captured. Agent saw the dots. Debug injection followed — writing brick dimensions to a temp file from inside the draw function. The trace revealed: sw=0.0 sh=0.0 spawn_t=0.3.

Every brick had spawn_t = 0.3 permanently. The spawn animation was supposed to scale bricks from 0 to 1 over 0.3 seconds. The formula: scale = (1 - spawn_t/0.3)^2. With spawn_t stuck at 0.3: scale = (1 - 1)^2 = 0. Zero-size bricks. The 4-pixel dots were the glow halo around a zero-size rectangle.

The root cause: the timer decrement (spawn_t = spawn_t - dt) was inside if boss.active and not boss.defeated then. It only ticked during boss fights. Regular levels never decremented the timer. The fix was one line — move the decrement outside the conditional.
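
The arithmetic and the gate can be reproduced in a few lines. This is a simulation in Python rather than the game’s Lua, and it simplifies the condition to a single boss_active flag; the clamp at zero and the function names are assumptions for illustration.

```python
SPAWN_DURATION = 0.3  # seconds, as in the post


def spawn_scale(spawn_t: float) -> float:
    """Scale formula from the post: 0 when the timer is full, 1 when it reaches 0."""
    return (1.0 - spawn_t / SPAWN_DURATION) ** 2


def simulate(frames: int, dt: float, boss_active: bool, gated: bool) -> float:
    """Run the timer for some frames and return the resulting brick scale."""
    spawn_t = SPAWN_DURATION
    for _ in range(frames):
        # gated=True models the bug: the decrement only runs during a boss fight.
        if not gated or boss_active:
            spawn_t = max(0.0, spawn_t - dt)  # clamp at 0 is an assumption
    return spawn_scale(spawn_t)


buggy = simulate(60, 1 / 60, boss_active=False, gated=True)
fixed = simulate(60, 1 / 60, boss_active=False, gated=False)
print(buggy, fixed)  # 0.0 1.0
```

On a regular level the gated timer never moves, so the scale stays at 0 for the whole second; with the decrement outside the conditional the brick reaches full size.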

Without automated screenshots, this bug would have required a human to play level 1, notice the dots, report the issue, and wait for the next agent cycle. With screenshots, the agent found it, debugged it, and fixed it in a single session.

Attract Mode: The Game Tests Itself

Screenshots capture a moment. Attract mode exercises the loop.

Every Dark Factory game implements (or is implementing) an attract/demo mode — an AI that plays the game autonomously after 10 seconds of idle on the title screen. It serves dual purposes:

Player-facing: The arcade cabinet experience. Walk past, see the game in motion, pick up the controller.

QA-facing: The game exercises its own gameplay loop without human input. Physics, collision, rendering, state transitions, particle systems, audio triggers — all running continuously. If something crashes, the attract mode finds it.

Polybreak’s attract mode has an AI paddle that tracks the ball with intentional lag (85% speed) and occasional misses (for visual interest). It plays through full levels — breaking bricks, triggering power-ups, spawning particles, running combo counters. When the level clears, it rebuilds and starts again. When the ball dies, it respawns. It runs forever.
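
One way to read “intentional lag (85% speed)” is a paddle that chases the ball’s x position at 85% of its maximum speed. The sketch below models only that reading, in Python rather than the game’s Lua; step_paddle and its parameters are hypothetical, and the occasional deliberate misses are not modeled.

```python
def step_paddle(paddle_x: float, ball_x: float, max_speed: float,
                dt: float, speed_factor: float = 0.85) -> float:
    """Move the paddle toward the ball at a fraction of its top speed."""
    max_step = max_speed * speed_factor * dt
    delta = ball_x - paddle_x
    if abs(delta) <= max_step:
        return ball_x  # close enough to reach the ball this frame
    return paddle_x + max_step if delta > 0 else paddle_x - max_step
```

Because the paddle is capped below full speed, a fast ball at a steep angle can still get past it, which is what keeps the demo from looking robotic.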

Dreadnought’s attract mode spawns the player in a small section with patrolling aliens. The AI explores using waypoints, uses the flashlight, and dodges aliens by moving perpendicular when detected. The cone-of-vision system, alien AI, and environmental hazards all exercise naturally.
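
The perpendicular dodge reduces to rotating the alien-to-player vector by 90 degrees. A minimal sketch in Python, assuming 2D positions as (x, y) tuples; dodge_direction is a hypothetical name and the fallback for overlapping positions is an arbitrary choice.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def dodge_direction(player: Vec, alien: Vec) -> Vec:
    """Unit vector perpendicular to the line from the alien to the player."""
    dx, dy = player[0] - alien[0], player[1] - alien[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return (1.0, 0.0)  # positions coincide: pick any direction
    # Rotating (dx, dy) by 90 degrees gives (-dy, dx).
    return (-dy / length, dx / length)
```

The opposite vector is equally perpendicular, so a real implementation would likely pick whichever side is not blocked by a wall.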

The screenshot system and attract mode compose: launch the game, wait for attract mode to activate, capture screenshots at intervals. The agent gets images of actual gameplay — not just the menu screen — without any human input.
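
The composition comes down to a capture schedule. The sketch below plans when to send each SCREENSHOT command, assuming the 10-second idle threshold mentioned above plus a short settle margin; the function name, the margin, the interval, and the attract_NN.png filenames are all hypothetical.

```python
from typing import List, Tuple


def plan_attract_captures(shots: int, idle_s: float = 10.0,
                          settle_s: float = 2.0,
                          interval_s: float = 2.0) -> List[Tuple[float, str]]:
    """Return (seconds after launch, command) pairs for attract-mode captures."""
    first = idle_s + settle_s  # wait out the idle timer, then let gameplay start
    return [
        (first + i * interval_s, f"SCREENSHOT attract_{i + 1:02d}.png")
        for i in range(shots)
    ]
```

For example, plan_attract_captures(3) yields captures at 12, 14, and 16 seconds after launch, all of them during autonomous gameplay rather than on the title screen.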

The Debugging Methodology

When a screenshot reveals a problem, agents follow a 5-step methodology:

  1. Capture evidence. Screenshot first, code second. You cannot reason about rendering bugs from code alone. The screenshot tells you what the symptom actually is: tiny dots, black frames, color artifacts, missing UI.
  2. Inject debug output. Love2D’s stdout is captured by the engine, so print() doesn’t help. Write diagnostic data to a temp file from inside the draw function using io.open().
  3. Check the value chain. Trace the entity’s size from creation (level init) through update (timers, animations) to draw (scale calculations). Where does the value go wrong?
  4. Find the gate. If a timer isn’t ticking, find the if block it’s inside. Is the condition true for the current game state? If not, the timer is gated, so move it outside.
  5. Verify with screenshot. After the fix, capture again. Compare before and after. The visual proof is the test.
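
Once step 2 has written a trace, the orchestrator can read it back mechanically. The sketch below parses a line in the key=value shape of the trace quoted earlier (sw=0.0 sh=0.0 spawn_t=0.3) and flags zero-size entities; parse_trace and zero_sized are hypothetical helpers, and the assumption is that every value is numeric.

```python
from typing import Dict


def parse_trace(line: str) -> Dict[str, float]:
    """Parse 'sw=0.0 sh=0.0 spawn_t=0.3' into {'sw': 0.0, 'sh': 0.0, 'spawn_t': 0.3}."""
    fields: Dict[str, float] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = float(value)
    return fields


def zero_sized(fields: Dict[str, float]) -> bool:
    """True if the traced width or height is exactly zero."""
    return fields.get("sw", 1.0) == 0.0 or fields.get("sh", 1.0) == 0.0
```

A zero width or height at draw time points straight at step 3: some value in the chain between creation and draw is collapsing the scale.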

This methodology is now codified as a skill (love2d-autoplay-qa) that all Dark Factory game agents have access to. It includes 8 common rendering bug patterns with root causes and fixes — each one discovered through actual debugging sessions.

Cross-Game Bug Detection

The visual QA system combined with the monorepo structure enables cross-game bug scanning. When a bug class is discovered in one game, agents scan all siblings:

# After fixing a timer-in-conditional bug in Polybreak,
# check if the same pattern exists in other games
grep -rnE "spawn_t.*dt|hit_t.*dt" games/*/src/

The laser_fire_t bug in Polybreak, a weapon timer trapped inside an effect-active conditional, was found through this kind of systematic scanning. The timer only decremented while the laser power-up was active, so it froze when the effect expired, and the next activation inherited the stale value. It belongs to the same bug class as the brick rendering issue: timers gated by conditionals they shouldn’t be inside.
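
A scan that is not tied to specific timer names could look like the sketch below. It assumes timers follow the *_t naming seen above and are decremented in the form x_t = x_t - dt, and it assumes the games/<name>/src/ layout implied by the grep command; scan_games is a hypothetical helper. It only lists candidates for review and does not decide whether a decrement sits inside a conditional.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

# Matches e.g. "spawn_t = spawn_t - dt" or "b.hit_t = b.hit_t - dt".
TIMER_DECREMENT = re.compile(r"([\w.]+_t)\s*=\s*\1\s*-\s*dt")

Hit = Tuple[str, int, str]  # (relative path, line number, source line)


def scan_games(root: Path) -> Dict[str, List[Hit]]:
    """Group every timer decrement under games/*/src/ by game name."""
    hits: Dict[str, List[Hit]] = defaultdict(list)
    for path in sorted(root.glob("games/*/src/**/*.lua")):
        rel = path.relative_to(root)
        game = rel.parts[1]  # games/<game>/src/...
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if TIMER_DECREMENT.search(line):
                hits[game].append((str(rel), lineno, line.strip()))
    return dict(hits)
```

Grouping by game makes it easy to see which siblings share the pattern after one of them has been fixed.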

Both bugs were found, fixed, and the pattern was added to the troubleshooting guide. Future agents reading the skill docs will check for this class of bug proactively.

What This Doesn’t Solve

The visual QA system has clear limits:

  • Aesthetic judgment: The agent can verify that a particle renders. It can’t tell you if the particle looks good.
  • Game feel: Screen shake magnitude, animation timing, difficulty curves — these require human perception.
  • Audio: The system captures visuals only. Sound design is still verified by the operator.
  • Edge cases: Attract mode exercises the main gameplay loop. Rare states (specific boss phases, unusual power-up combinations, edge-case level layouts) may not get covered.

The system catches rendering bugs, state machine failures, and major visual regressions. It does not replace playtesting. But it means playtesting can focus on feel and polish instead of hunting for broken bricks.

The Numbers

Since deploying the visual QA system:

  • 1 major rendering bug caught autonomously (brick dots — would have survived days without screenshots)
  • 1 weapon timing bug found through cross-game pattern scanning (laser_fire_t)
  • 4 games now have or are receiving attract/demo modes for autonomous gameplay exercising
  • 8 bug patterns documented in the troubleshooting guide, each with root cause and fix
  • 0 external tools needed — everything runs inside Love2D via the Lua shim

The feedback loop that was “the biggest gap” in AI game development six months ago is now a solved problem for our stack. Not perfectly solved — aesthetic judgment still needs human eyes. But structurally solved: agents can see their own output, exercise their own gameplay, and catch their own rendering bugs.

That’s the difference between “AI wrote this code” and “AI shipped this game.”
