Building Multi-Agent Systems: Agent Management at Scale

How I built a production-grade agent management platform in a weekend, handling 6 agents and 177 tasks.

The Challenge

Running one AI agent is straightforward. Running six production agents with different responsibilities, memory strategies, and heartbeat schedules? That requires infrastructure.

This weekend, I built a comprehensive agent management platform that now handles:

Metric         Feb 15                  Feb 16
Total Tasks    169                     8
Subtasks       1,252 done, 10 failed   -
Total Events   3,341                   ~150

That's 177 tasks processed across the weekend with full telemetry, memory management, and control plane access.

The Agent Fleet

Six production agents are now running, each with a specific purpose:

1. oldham-intelligence

  • Purpose: Business intelligence for a client project
  • Memory: Isolated (separate memory namespace)
  • Heartbeat: Every 2 hours
  • Status: Production

2. oldham-lead-finder

  • Purpose: Automated lead discovery
  • Memory: Shared (accesses main context)
  • Heartbeat: On-demand
  • Status: Production

3. site-auditor

  • Purpose: Website analysis and SEO auditing
  • Memory: Shared
  • Heartbeat: On-demand
  • Status: Production

4. site-rebuilder

  • Purpose: Website recreation with modern stack
  • Memory: Shared
  • Heartbeat: On-demand
  • Status: Production

5. outreach-drafter

  • Purpose: Drafting personalized outreach emails
  • Memory: Shared
  • Heartbeat: On-demand
  • Status: Production

6. self-improver

  • Purpose: Autonomous system maintenance and bug fixing
  • Memory: Shared
  • Heartbeat: Daily at 00:17
  • Status: Production
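
A heartbeat here is just a scheduled wake-up on an interval or cron slot. As a rough sketch of the pattern (the run_heartbeat method and agent object are assumptions, not the platform's actual API):

import asyncio

async def heartbeat_loop(agent, interval_seconds: float) -> None:
    # Wake the agent on a fixed interval; what it does on each tick
    # (check tasks, refresh memory, report telemetry) is up to the agent.
    while True:
        await agent.run_heartbeat()
        await asyncio.sleep(interval_seconds)

# oldham-intelligence would tick every 2 hours:
# asyncio.create_task(heartbeat_loop(oldham_intelligence, 2 * 60 * 60))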

The Control Plane API

The key to managing all this is a unified control plane that exposes operations via a single tool:

from typing import Any  # used by the parameters annotation below


class ControlPlaneTool(Tool):
    """Access approvals, memory, and telemetry via one control-plane surface."""

    @property
    def parameters(self) -> dict[str, Any]:
        return {
            "type": "object",
            "properties": {
                "action": {
                    "type": "string",
                    "enum": [
                        "summary",
                        "telemetry",
                        "approvals_list",
                        "approvals_pending_outbound",
                        "approvals_get",
                        "approvals_events",
                        "approvals_approve",
                        "approvals_reject",
                        "memory_recall",
                        "memory_audit",
                        "memory_remember",
                        "memory_correct",
                        "memory_forget",
                    ],
                },
                # ... additional parameters
            },
            "required": ["action"],
        }

This gives any agent the ability to:

  • Query telemetry: See task success rates, error patterns, LLM call metrics
  • Manage approvals: Review and resolve pending outbound messages
  • Access memory: Read and write to the shared knowledge base
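
Under the hood, a single action parameter can fan out to one handler per enum value. A minimal dispatch sketch inside ControlPlaneTool (the run method and the service handler names are assumptions, not the confirmed implementation):

async def run(self, action: str, **kwargs: Any) -> dict[str, Any]:
    # Map each action to a ControlPlaneService handler; names illustrative.
    handlers = {
        "summary": self.service.summary,
        "telemetry": self.service.telemetry,
        "memory_recall": self.service.memory_recall,
        "approvals_approve": self.service.approvals_approve,
        # ... one entry per action in the enum above
    }
    handler = handlers.get(action)
    if handler is None:
        return {"error": f"unknown action: {action}"}
    return await handler(**kwargs)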

Memory Strategies

Not all agents need the same memory access:

Isolated (oldham-intelligence):

  • Separate namespace prevents cross-contamination
  • Client data stays in its own silo
  • Daily notes are independent

Shared (the other five agents):

  • Access to user preferences and system context
  • Can read from other agents' findings
  • Collaborative knowledge building

This is configured per-agent in their initialization:

self.context = ContextBuilder(workspace)
self.control_plane_service = ControlPlaneService(
    memory=self.context.memory,  # Shared or isolated
    telemetry=self.telemetry,
)
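
For an isolated agent like oldham-intelligence, the same wiring points at a dedicated namespace instead. A sketch, where the memory_namespace argument is an assumption about ContextBuilder's interface rather than its confirmed signature:

# Hypothetical isolated variant: a dedicated namespace keeps
# client data out of the shared knowledge base.
self.context = ContextBuilder(workspace, memory_namespace="oldham-intelligence")
self.control_plane_service = ControlPlaneService(
    memory=self.context.memory,  # isolated namespace
    telemetry=self.telemetry,
)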

Dashboard and Monitoring

Every agent exposes a /summary endpoint that returns:

{
  "tasks": {
    "total": 169,
    "done": 156,
    "failed": 13,
    "running": 0
  },
  "subtasks": {
    "total": 1262,
    "done": 1252,
    "failed": 10
  },
  "llm_calls": 1847,
  "events": 3341
}

This aggregates into a central dashboard showing fleet health at a glance.
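
The aggregation can be as simple as polling each agent's /summary and summing the counters. A minimal sketch against the payload above, using only the standard library (the agent URLs and ports are assumptions):

import json
from urllib.request import urlopen

# Hypothetical registry: one /summary endpoint per agent.
AGENT_URLS = [
    "http://localhost:8001",  # oldham-intelligence
    "http://localhost:8006",  # self-improver
    # ... one entry per agent
]

def fleet_summary() -> dict:
    # Sum task, failure, and event counts across the fleet.
    totals = {"tasks": 0, "failed": 0, "events": 0}
    for base_url in AGENT_URLS:
        with urlopen(f"{base_url}/summary") as resp:
            data = json.load(resp)
        totals["tasks"] += data["tasks"]["total"]
        totals["failed"] += data["tasks"]["failed"]
        totals["events"] += data["events"]
    return totals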

The Architecture

nanobot/
├── agents/
│   ├── oldham-intelligence/
│   │   ├── workspace/
│   │   └── memory/          # Isolated
│   ├── self-improver/
│   │   ├── workspace/
│   │   └── (shared memory)
│   └── ...
├── control_plane/
│   ├── service.py           # Unified API
│   └── dashboard.py         # Fleet monitoring
├── telemetry/
│   └── store.py             # SQLite event log
└── bus/
    └── queue.py             # Inter-agent messaging
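
To give a feel for the telemetry layer, an append-only SQLite event log needs very little schema. This is a sketch of the idea, not the actual store.py:

import json
import sqlite3
import time

conn = sqlite3.connect("telemetry.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        agent TEXT NOT NULL,
        kind TEXT NOT NULL,
        payload TEXT,
        ts REAL NOT NULL
    )
""")

def log_event(agent: str, kind: str, **payload) -> None:
    # Append one event row; arbitrary payload is stored as JSON text.
    conn.execute(
        "INSERT INTO events (agent, kind, payload, ts) VALUES (?, ?, ?, ?)",
        (agent, kind, json.dumps(payload), time.time()),
    )
    conn.commit()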

Scale Numbers

From the commit that brought this together:

1,692 lines added across:
  • nanobot/control_plane/service.py (new)
  • nanobot/agent/tools/control_plane.py (new)
  • nanobot/telemetry/store.py (enhanced)
  • Dashboard and sync utilities

The system now comfortably handles:

  • 6 concurrent agents
  • 169 tasks/day at peak
  • 3,341 events tracked
  • Sub-100ms control plane queries

Lessons from Scaling

  1. Unified control surface: One API for telemetry, memory, and approvals reduces complexity
  2. Memory isolation by default: Only share when necessary
  3. Heartbeat-driven agents: Scheduled tasks for routine operations
  4. On-demand agents: Spawn for specific tasks, shut down when done
  5. Telemetry everywhere: You can't optimize what you don't measure
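
Lesson 4's on-demand pattern is essentially spawn, run, exit. A sketch, where agent_factory, run_task, and shutdown are illustrative names rather than the platform's API:

async def run_on_demand(agent_factory, task) -> None:
    agent = agent_factory()          # spawn for this one task
    try:
        await agent.run_task(task)   # do the job
    finally:
        await agent.shutdown()       # release resources when done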

What's Next

The platform is stable, but there's room to grow:

  • Agent-to-agent communication: Let agents delegate to each other
  • Resource quotas: Prevent one agent from consuming all LLM budget
  • Priority scheduling: Critical agents get first access to resources
  • Auto-scaling: Spawn additional instances under load

For now, the system handles everything I throw at it: 177 tasks in a weekend without breaking a sweat.