Building Multi-Agent Systems: Agent Management at Scale

How I built a production-grade agent management platform in a weekend, handling 6 agents and 177 tasks.

The Challenge

Running one AI agent is straightforward. Running six production agents with different responsibilities, memory strategies, and heartbeat schedules? That requires infrastructure.

This weekend, I built a comprehensive agent management platform that now handles:

Metric         Feb 15                  Feb 16
Total Tasks    169                     8
Subtasks       1,252 done, 10 failed   -
Total Events   3,341                   ~150

That's 177 tasks processed across the weekend with full telemetry, memory management, and control plane access.

The Agent Fleet

Six production agents are now running, each with a specific purpose:

1. oldham-intelligence

  • Purpose: Business intelligence for a client project
  • Memory: Isolated (separate memory namespace)
  • Heartbeat: Every 2 hours
  • Status: Production

2. oldham-lead-finder

  • Purpose: Automated lead discovery
  • Memory: Shared (accesses main context)
  • Heartbeat: On-demand
  • Status: Production

3. site-auditor

  • Purpose: Website analysis and SEO auditing
  • Memory: Shared
  • Heartbeat: On-demand
  • Status: Production

4. site-rebuilder

  • Purpose: Website recreation with modern stack
  • Memory: Shared
  • Heartbeat: On-demand
  • Status: Production

5. outreach-drafter

  • Purpose: Drafting personalized outreach emails
  • Memory: Shared
  • Heartbeat: On-demand
  • Status: Production

6. self-improver

  • Purpose: Autonomous system maintenance and bug fixing
  • Memory: Shared
  • Heartbeat: Daily at 00:17
  • Status: Production
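
A heartbeat here is just a scheduled wake-up on an interval or cron slot. As a rough sketch of the pattern (the run_heartbeat method and agent object are assumptions, not the platform's actual API):

import asyncio

async def heartbeat_loop(agent, interval_seconds: float) -> None:
    # Wake the agent on a fixed interval; what it does on each tick
    # (check tasks, refresh memory, report telemetry) is up to the agent.
    while True:
        await agent.run_heartbeat()
        await asyncio.sleep(interval_seconds)

# oldham-intelligence would tick every 2 hours:
# asyncio.create_task(heartbeat_loop(oldham_intelligence, 2 * 60 * 60))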

The Control Plane API

The key to managing all this is a unified control plane that exposes operations via a single tool:

from typing import Any  # used by the parameters annotation below


class ControlPlaneTool(Tool):
    """Access approvals, memory, and telemetry via one control-plane surface."""

    @property
    def parameters(self) -> dict[str, Any]:
        return {
            "type": "object",
            "properties": {
                "action": {
                    "type": "string",
                    "enum": [
                        "summary",
                        "telemetry",
                        "approvals_list",
                        "approvals_pending_outbound",
                        "approvals_get",
                        "approvals_events",
                        "approvals_approve",
                        "approvals_reject",
                        "memory_recall",
                        "memory_audit",
                        "memory_remember",
                        "memory_correct",
                        "memory_forget",
                    ],
                },
                # ... additional parameters
            },
            "required": ["action"],
        }

This gives any agent the ability to:

  • Query telemetry: See task success rates, error patterns, LLM call metrics
  • Manage approvals: Review and resolve pending outbound messages
  • Access memory: Read and write to the shared knowledge base
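
Under the hood, a single action parameter can fan out to one handler per enum value. A minimal dispatch sketch inside ControlPlaneTool (the run method and the service handler names are assumptions, not the confirmed implementation):

async def run(self, action: str, **kwargs: Any) -> dict[str, Any]:
    # Map each action to a ControlPlaneService handler; names illustrative.
    handlers = {
        "summary": self.service.summary,
        "telemetry": self.service.telemetry,
        "memory_recall": self.service.memory_recall,
        "approvals_approve": self.service.approvals_approve,
        # ... one entry per action in the enum above
    }
    handler = handlers.get(action)
    if handler is None:
        return {"error": f"unknown action: {action}"}
    return await handler(**kwargs)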

Memory Strategies

Not all agents need the same memory access:

Isolated (oldham-intelligence):

  • Separate namespace prevents cross-contamination
  • Client data stays in its own silo
  • Daily notes are independent

Shared (the other five agents):

  • Access to user preferences and system context
  • Can read from other agents' findings
  • Collaborative knowledge building

This is configured per-agent in their initialization:

self.context = ContextBuilder(workspace)
self.control_plane_service = ControlPlaneService(
    memory=self.context.memory,  # Shared or isolated
    telemetry=self.telemetry,
)
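
For an isolated agent like oldham-intelligence, the same wiring points at a dedicated namespace instead. A sketch, where the memory_namespace argument is an assumption about ContextBuilder's interface rather than its confirmed signature:

# Hypothetical isolated variant: a dedicated namespace keeps
# client data out of the shared knowledge base.
self.context = ContextBuilder(workspace, memory_namespace="oldham-intelligence")
self.control_plane_service = ControlPlaneService(
    memory=self.context.memory,  # isolated namespace
    telemetry=self.telemetry,
)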

Dashboard and Monitoring

Every agent exposes a /summary endpoint that returns:

{
  "tasks": {
    "total": 169,
    "done": 156,
    "failed": 13,
    "running": 0
  },
  "subtasks": {
    "total": 1262,
    "done": 1252,
    "failed": 10
  },
  "llm_calls": 1847,
  "events": 3341
}

This aggregates into a central dashboard showing fleet health at a glance.
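
The aggregation can be as simple as polling each agent's /summary and summing the counters. A minimal sketch against the payload above, using only the standard library (the agent URLs and ports are assumptions):

import json
from urllib.request import urlopen

# Hypothetical registry: one /summary endpoint per agent.
AGENT_URLS = [
    "http://localhost:8001",  # oldham-intelligence
    "http://localhost:8006",  # self-improver
    # ... one entry per agent
]

def fleet_summary() -> dict:
    # Sum task, failure, and event counts across the fleet.
    totals = {"tasks": 0, "failed": 0, "events": 0}
    for base_url in AGENT_URLS:
        with urlopen(f"{base_url}/summary") as resp:
            data = json.load(resp)
        totals["tasks"] += data["tasks"]["total"]
        totals["failed"] += data["tasks"]["failed"]
        totals["events"] += data["events"]
    return totals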

The Architecture

nanobot/
├── agents/
│   ├── oldham-intelligence/
│   │   ├── workspace/
│   │   └── memory/          # Isolated
│   ├── self-improver/
│   │   ├── workspace/
│   │   └── (shared memory)
│   └── ...
├── control_plane/
│   ├── service.py           # Unified API
│   └── dashboard.py         # Fleet monitoring
├── telemetry/
│   └── store.py             # SQLite event log
└── bus/
    └── queue.py             # Inter-agent messaging
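
To give a feel for the telemetry layer, an append-only SQLite event log needs very little schema. This is a sketch of the idea, not the actual store.py:

import json
import sqlite3
import time

conn = sqlite3.connect("telemetry.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        agent TEXT NOT NULL,
        kind TEXT NOT NULL,
        payload TEXT,
        ts REAL NOT NULL
    )
""")

def log_event(agent: str, kind: str, **payload) -> None:
    # Append one event row; arbitrary payload is stored as JSON text.
    conn.execute(
        "INSERT INTO events (agent, kind, payload, ts) VALUES (?, ?, ?, ?)",
        (agent, kind, json.dumps(payload), time.time()),
    )
    conn.commit()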

Scale Numbers

From the commit that brought this together:

1,692 lines added across:
  • nanobot/control_plane/service.py (new)
  • nanobot/agent/tools/control_plane.py (new)
  • nanobot/telemetry/store.py (enhanced)
  • Dashboard and sync utilities

The system now comfortably handles:

  • 6 concurrent agents
  • 169 tasks/day at peak
  • 3,341 events tracked
  • Sub-100ms control plane queries

Lessons from Scaling

  1. Unified control surface: One API for telemetry, memory, and approvals reduces complexity
  2. Memory isolation by default: Only share when necessary
  3. Heartbeat-driven agents: Scheduled tasks for routine operations
  4. On-demand agents: Spawn for specific tasks, shut down when done
  5. Telemetry everywhere: You can't optimize what you don't measure
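
Lesson 4's on-demand pattern is essentially spawn, run, exit. A sketch, where agent_factory, run_task, and shutdown are illustrative names rather than the platform's API:

async def run_on_demand(agent_factory, task) -> None:
    agent = agent_factory()          # spawn for this one task
    try:
        await agent.run_task(task)   # do the job
    finally:
        await agent.shutdown()       # release resources when done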

What's Next

The platform is stable, but there's room to grow:

  • Agent-to-agent communication: Let agents delegate to each other
  • Resource quotas: Prevent one agent from consuming all LLM budget
  • Priority scheduling: Critical agents get first access to resources
  • Auto-scaling: Spawn additional instances under load

For now, the system handles everything I throw at it: 177 tasks in a weekend without breaking a sweat.