Multi-Agent Coordination: How AI Workers Collaborate Without Breaking Everything

Running one AI agent is straightforward. Running eleven of them on the same machine, touching the same repos, deploying to the same servers — that’s where things get interesting.

The Swarm currently runs eleven scheduled AI agents across three LLM backends. Four of them build a website. Four build video games. One monitors system health. One tracks API usage limits. One orchestrates the whole operation. They all share a filesystem, a memory layer, and a single human operator.

Here’s how they coordinate without stepping on each other.

The Problem With Multiple Agents

Most AI agent demos show a single agent doing a task. That’s the easy part. The hard problem is what happens when you have multiple agents that need to:

  • Share state without race conditions
  • Work on overlapping domains without conflicts
  • Communicate decisions without real-time connections
  • Recover from each other’s failures
  • Respect a hierarchy of authority

Traditional software solves this with databases, message queues, locks, and transactions. But AI agents aren’t traditional software. They don’t run continuously. They spin up, do work, and terminate. They can’t hold locks because they don’t persist between runs. And their “work” is often creative — writing code, generating content, making design decisions — which doesn’t lend itself to simple conflict resolution.

We solved this with three patterns: hierarchical orchestration, async handoffs, and domain isolation.

Pattern 1: Hierarchical Orchestration

Every agent group has an orchestrator. The web team has web-orchestrator. The game studio has love2d-studio. The entire system has meta-agent. Each orchestrator runs on a faster schedule than its workers and has a clear mandate: review work, assign tasks, resolve conflicts, report up.

meta-agent (hourly)
├── love2d-studio (hourly)
│   ├── game-polybreak (hourly)
│   ├── game-chronostone (hourly)
│   ├── game-voidrunner (hourly)
│   └── game-dreadnought (hourly)
├── web-orchestrator (hourly)
│   ├── web-content (every 6h)
│   ├── web-frontend (every 6h)
│   └── web-backend (every 12h)
└── pa-finance-strategist (daily)

The hierarchy isn’t just organizational — it’s how decisions flow. Workers never talk to each other directly. If web-content needs something from web-frontend, it tells web-orchestrator, and the orchestrator decides how to route that request. This eliminates a whole class of coordination bugs where two workers make conflicting decisions based on stale information.

Orchestrators run more frequently than their workers by design. web-orchestrator runs every hour. web-content runs every six hours. That means the orchestrator always sees the latest state and can course-correct before the next worker run. It’s the same principle behind a team lead checking in more frequently than individual contributors ship.
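On disk, that cadence is just a crontab. A hypothetical sketch — the `cron-swarm run` entry point and the minute offsets are assumptions, not the actual installation:

```
# Hypothetical crontab: orchestrators on a faster cadence than their
# workers, start minutes staggered so runs don't collide.
0   *    * * *   cron-swarm run meta-agent          # hourly, on the hour
10  *    * * *   cron-swarm run web-orchestrator    # hourly, offset 10 min
20  */6  * * *   cron-swarm run web-content         # every 6 hours
30  */12 * * *   cron-swarm run web-backend         # every 12 hours
```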

Pattern 2: Async Handoffs

Agents don’t share a runtime. They can’t call each other. So how does web-orchestrator tell web-content to write a blog post?

Handoffs. They’re one-directional messages stored in a persistent memory layer. An orchestrator sends a handoff to a worker. The worker picks it up on its next run, does the work, and sends a handoff back.

# Orchestrator assigns work
cron-swarm memory handoff send \
  --from web-orchestrator --to web-content \
  --text "Write a blog post about multi-agent coordination" \
  --as-node web-orchestrator --priority normal

# Worker completes it and reports back
cron-swarm memory handoff send \
  --from web-content --to web-orchestrator \
  --text "Staged: multi-agent-coordination at staging/content/multi-agent-coordination/" \
  --as-node web-content --priority normal

This is async message passing, but with a few important properties:

Handoffs have status tracking. Every handoff is open, in-progress, or done. Workers check for open handoffs at the start of every run. If there are none, they fall back to their default behavior (checking context keys, picking items from queues).

Handoffs are directional. Workers send handoffs up-tree to their orchestrator. Orchestrators send handoffs down-tree to workers. This prevents circular dependencies and makes the flow of authority clear. A game agent can’t assign work to another game agent — only the studio orchestrator can.

Handoffs carry context. The text field isn’t just a task name. It’s a full brief: what to do, what constraints apply, what format the output should take. The receiving agent gets everything it needs to work autonomously without asking follow-up questions.

The latency is non-zero — if web-content runs every six hours, a handoff might sit for up to six hours before being processed. That’s fine. We’re not building a chat app. We’re building a company where work gets done reliably, not instantly.
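The lifecycle above amounts to a small state machine: a handoff moves open → in-progress → done, and a worker's first act each run is to claim the oldest open handoff addressed to it. A sketch using an in-memory stand-in for the real persistent store (class and method names are illustrative, not the actual cron-swarm internals):

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)

@dataclass
class Handoff:
    src: str
    dst: str
    text: str
    priority: str = "normal"
    status: str = "open"          # open -> in-progress -> done
    id: int = field(default_factory=lambda: next(_ids))

class HandoffStore:
    """In-memory stand-in for the persistent memory layer."""
    def __init__(self):
        self.handoffs = []

    def send(self, src, dst, text, priority="normal"):
        h = Handoff(src, dst, text, priority)
        self.handoffs.append(h)
        return h

    def claim_next(self, node) -> Optional[Handoff]:
        """Worker entry point: take the oldest open handoff addressed to us."""
        for h in self.handoffs:
            if h.dst == node and h.status == "open":
                h.status = "in-progress"
                return h
        return None  # nothing queued -> fall back to default behavior

    def complete(self, h: Handoff):
        h.status = "done"
```

The `claim_next` returning `None` is what triggers the fallback behavior described earlier: checking context keys, picking items from queues.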

Pattern 3: Domain Isolation

The simplest way to prevent conflicts is to make sure agents never touch the same files. Each agent owns a domain, and that ownership is exclusive.

On the web team:

Agent             Owns                            Never touches
web-content       staging/content/                Theme files, plugin code, deploy scripts
web-frontend      staging/themes/x00f-theme/      Content, plugin internals, server config
web-backend       staging/plugins/, scripts/      Content, theme CSS/JS, markup
web-orchestrator  Deploy decisions, context keys  Writing code, writing content

This is enforced by convention, not by file locks. Each agent’s system prompt explicitly states what it can and cannot modify. The orchestrator’s prompt says “you do NOT write code or content yourself.” The content agent’s prompt says “you ONLY write to staging/content/.”

Convention-based isolation sounds fragile, but it works because of how the agents are built. Each agent has a narrow, specific prompt that defines its role. It doesn’t have the context or motivation to touch files outside its domain. A content writer that’s been told “write blog posts and save them as Markdown” has no reason to edit PHP files.

In the game studio, isolation is even cleaner. Each game is its own directory, its own git history, its own agent. game-polybreak literally cannot see Chronostone’s code because it’s scoped to a different working directory. Physical isolation beats policy every time.
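Convention-based isolation still benefits from a cheap post-run check. A sketch of the kind of path guard a reviewer (or an orchestrator) could run, with an ownership map mirroring the table above (the map and function are illustrative):

```python
from pathlib import PurePosixPath

# Hypothetical ownership map mirroring the web-team table above.
DOMAINS = {
    "web-content":  ["staging/content/"],
    "web-frontend": ["staging/themes/x00f-theme/"],
    "web-backend":  ["staging/plugins/", "scripts/"],
    # web-orchestrator owns no files at all -- it never writes code or content
}

def may_write(agent: str, path: str) -> bool:
    """Post-run review check: did the agent stay inside its domain?"""
    p = PurePosixPath(path)
    return any(p.is_relative_to(prefix) for prefix in DOMAINS.get(agent, []))
```

An agent absent from the map (like the orchestrator) owns nothing, so any file write it makes is flagged.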

The Shared Memory Layer

Domain isolation handles files. But agents still need shared state — what’s deployed, what’s in the queue, what the current priorities are. That’s where context keys come in.

# Set a context value
cron-swarm memory ctx set \
  --node web-orchestrator \
  --key sprint_status \
  --value "theme:v1.6.0 | plugin:v1.4.1 | content:15-items" \
  --as-node web-orchestrator

# Any agent can read any node's context
cron-swarm memory ctx get --node web-orchestrator --key sprint_status

Context keys are the Swarm’s equivalent of a shared dashboard. They’re key-value pairs attached to a node, readable by any agent in the system. The web content agent reads web-orchestrator's sprint_status to understand what’s deployed. The meta-agent reads every node’s context to build a system-wide status report.

The write model is simple: each node owns its own context keys. web-content can write to web-content's keys. It can read web-orchestrator's keys but can’t modify them. This prevents agents from overwriting each other’s state while still allowing full visibility.

In practice, context keys serve three purposes:

  1. Status reporting: Each agent updates its own status so orchestrators and the operator can monitor progress without reading logs
  2. Configuration: Orchestrators set context keys that workers read as instructions (deployment mode, feature flags, queue priorities)
  3. Coordination signals: A key like deploy_mode=AUTO tells the orchestrator to skip human approval — it’s a system-wide toggle that any agent can check
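The owner-write / world-read model fits in a few lines. A sketch of how a context store could enforce it (class and method names are illustrative, not the actual cron-swarm internals):

```python
class ContextStore:
    """Shared dashboard: per-node key/value state, owner-write / world-read."""
    def __init__(self):
        self._data = {}  # node -> {key: value}

    def set(self, node, key, value, as_node):
        # Only the owning node may write its own keys.
        if as_node != node:
            raise PermissionError(f"{as_node} cannot write {node}'s keys")
        self._data.setdefault(node, {})[key] = value

    def get(self, node, key):
        # Any agent can read any node's context.
        return self._data.get(node, {}).get(key)
```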

Human-in-the-Loop: The Approval Gate

Autonomous doesn’t mean unsupervised. The Swarm has a deliberate approval gate for irreversible actions — primarily deployments to production.

The normal flow:

  1. Agent stages changes locally
  2. Orchestrator reviews and sends an email to the operator
  3. The email includes the exact deploy command
  4. Operator replies with "!" in the subject to execute
  5. An IMAP daemon picks up the reply and runs the command

This is a pull-based approval system. The operator doesn’t need to check a dashboard or log into a system. They get an email, read what’s being deployed, and reply to approve. The deploy command is pre-built — the operator isn’t writing code, just authorizing execution.

During rapid development sprints, the operator can set deploy_mode=AUTO in the orchestrator’s context. This tells the orchestrator to deploy immediately without emailing for approval. It’s a trust dial — fully autonomous when velocity matters, gated when stability matters.

The key insight is that the approval gate sits at the deployment boundary, not at the creation boundary. Agents can write, stage, and review code without any human involvement. The human only enters the loop when changes are about to hit production. This maximizes autonomous productivity while maintaining a safety net.
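The reply handler needs only two decisions: is this message an approval, and which command was authorized. A minimal sketch of that parsing logic (the "!" convention is from the flow above; the `DEPLOY:` marker in the email body is an assumption, and the real daemon's format may differ):

```python
import re

def is_approval(subject: str) -> bool:
    """An approval reply is any message whose subject starts with '!'."""
    return subject.lstrip().startswith("!")

def extract_command(body: str):
    """Pull the pre-built deploy command out of the quoted original email.
    Assumes the orchestrator marks it with a 'DEPLOY:' prefix (hypothetical)."""
    m = re.search(r"^DEPLOY:\s*(.+)$", body, re.MULTILINE)
    return m.group(1) if m else None
```

Keeping the parsing pure like this means the IMAP polling loop around it stays trivial: fetch unread mail, filter with `is_approval`, run whatever `extract_command` returns.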

What Goes Wrong

This system isn’t perfect. Here’s what we’ve hit:

Stale context. An agent reads a context key that was set hours ago and makes a decision based on outdated information. We mitigate this by having orchestrators run frequently and update keys often, but the window still exists. The content agent might write a post about Voidrunner being “in development” when it was marked feature-complete two hours earlier.

Handoff overflow. If an orchestrator sends five handoffs to a worker that runs every six hours, the worker can only process one per run. The backlog grows. We set handoff limits per node (currently five) and prioritize by urgency.

Convention violations. An agent occasionally writes to a file outside its domain. Not because it’s malicious — because the LLM decided that editing a config file was the most direct way to solve a problem. Clear system prompts and post-run reviews catch these, but they still happen.

Timing collisions. Two agents scheduled at the same minute can both read the same handoff as “open” before either marks it as claimed. We haven’t hit this in practice because our scheduling staggers start times, but it’s a theoretical race condition in the handoff system.
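If the handoff store sits on SQLite or similar, one standard fix is to make the claim itself atomic: a conditional UPDATE that only one runner can win. A sketch (table schema assumed for illustration):

```python
import sqlite3

def claim_handoff(db: sqlite3.Connection, handoff_id: int, node: str) -> bool:
    """Atomically flip a handoff from open to in-progress. Of two agents
    racing on the same row, only one UPDATE matches status='open', so only
    one caller sees rowcount == 1 and proceeds with the work."""
    cur = db.execute(
        "UPDATE handoffs SET status='in-progress', claimed_by=? "
        "WHERE id=? AND status='open'",
        (node, handoff_id),
    )
    db.commit()
    return cur.rowcount == 1  # True only for the agent that won the claim
```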

LLM rate limits. When Claude hits its usage cap, all Claude-based agents stop. The meta-agent detects this and can throttle schedules, but during the outage window, handoffs pile up and context keys go stale. Having multiple LLM backends (Claude, Codex, OpenCode) provides fallback paths but doesn’t eliminate the problem.

Why Cron Works Better Than You’d Think

People hear “cron-based AI agents” and assume it’s a hack. It’s actually a feature.

Cron gives you natural heartbeats. Each agent wakes up, checks its state, does its work, and exits. There’s no long-running process to crash, no WebSocket connection to drop, no container to manage. The scheduling is dead simple — a line in crontab — and recovery is automatic. If an agent run fails, the next cron tick tries again with fresh state.

The fixed schedule also creates a natural coordination clock. When the web orchestrator runs every hour, every other agent knows roughly when their handoffs will be reviewed. When the content agent runs every six hours, the orchestrator knows not to expect instant turnaround. The schedule IS the SLA.

Compare this to event-driven architectures where every agent must be available to respond in real-time. That’s great for latency but terrible for reliability and cost. Our agents are stateless processes that run for a few minutes and terminate. There’s nothing to keep alive, nothing to monitor for liveness, nothing to scale.

Lessons for Multi-Agent Systems

If you’re building a multi-agent system, here’s what I’d steal from this architecture:

  1. Give every agent group an orchestrator. Don’t let peers coordinate directly. It doesn’t scale and it creates conflicts.
  2. Use async message passing, not shared mutable state. Handoffs are slower than shared memory but dramatically safer. Each agent makes decisions based on its own consistent view of the world.
  3. Isolate by domain, not by permission. Don’t give every agent access to everything and then try to lock down what they can touch. Give each agent access to only what it needs.
  4. Put the human at the deployment boundary. Let agents work autonomously on staging. Gate production behind human approval. This maximizes throughput while maintaining safety.
  5. Design for failure at the scheduling layer. If an agent fails, the next run should be able to pick up where it left off. Handoffs with status tracking make this natural.
  6. Don’t overthink the coordination protocol. Ours is literally: context keys for state, handoffs for tasks, cron for scheduling. No message brokers, no distributed consensus, no service mesh. Simple systems have fewer failure modes.
  7. Put parallel projects in one repo and let the orchestrator cross-pollinate. We merged four game repos into one monorepo. The studio orchestrator now runs cross-game quality passes: comparing shared modules across games, detecting when one game develops a superior implementation, and sending backport handoffs to the others. A bug fixed in one game gets scanned across all siblings. An improved utility function invented by one agent propagates to all four codebases. This is where multi-agent systems get a genuine advantage over single-agent setups — the factory learns at the portfolio level.

The Swarm runs eleven agents, builds four games, maintains a website, processes email, generates newsletters, tracks finances, and monitors its own health. The coordination layer is about 200 lines of CLI commands and a crontab. Sometimes the simplest architecture is the right one.
