The Challenge
Running one AI agent is straightforward. Running six production agents with different responsibilities, memory strategies, and heartbeat schedules? That requires infrastructure.
This weekend, I built a comprehensive agent management platform that now handles:
| Metric | Feb 15 | Feb 16 |
|---|---|---|
| Total Tasks | 169 | 8 |
| Subtasks | 1,252 done, 10 failed | - |
| Total Events | 3,341 | ~150 |
The Agent Fleet
Six production agents are now running, each with a specific purpose (a sketch of how this registry could be declared follows the list):
1. oldham-intelligence
- Purpose: Business intelligence for a client project
- Memory: Isolated (separate memory namespace)
- Heartbeat: Every 2 hours
- Status: Production
2. oldham-lead-finder
- Purpose: Automated lead discovery
- Memory: Shared (accesses main context)
- Heartbeat: On-demand
- Status: Production
3. site-auditor
- Purpose: Website analysis and SEO auditing
- Memory: Shared
- Heartbeat: On-demand
- Status: Production
4. site-rebuilder
- Purpose: Website recreation with modern stack
- Memory: Shared
- Heartbeat: On-demand
- Status: Production
5. outreach-drafter
- Purpose: Drafting personalized outreach emails
- Memory: Shared
- Heartbeat: On-demand
- Status: Production
6. self-improver
- Purpose: Autonomous system maintenance and bug fixing
- Memory: Shared
- Heartbeat: Daily at 00:17
- Status: Production
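Declaratively, the whole fleet boils down to a name, a memory strategy, and a heartbeat. Here is a minimal sketch of what such a registry could look like; `AgentSpec` and `FLEET` are hypothetical stand-ins, not the actual hyperbot code, and the cron strings encode the schedules above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical per-agent config: memory strategy plus heartbeat."""
    name: str
    memory: str             # "isolated" or "shared"
    heartbeat: str | None   # cron-style schedule, or None for on-demand

FLEET = [
    AgentSpec("oldham-intelligence", memory="isolated", heartbeat="0 */2 * * *"),  # every 2 hours
    AgentSpec("oldham-lead-finder",  memory="shared",   heartbeat=None),
    AgentSpec("site-auditor",        memory="shared",   heartbeat=None),
    AgentSpec("site-rebuilder",      memory="shared",   heartbeat=None),
    AgentSpec("outreach-drafter",    memory="shared",   heartbeat=None),
    AgentSpec("self-improver",       memory="shared",   heartbeat="17 0 * * *"),   # daily at 00:17
]
```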
The Control Plane API
The key to managing all this is a unified control plane that exposes operations via a single tool:
```python
from typing import Any

class ControlPlaneTool(Tool):  # Tool is the project's base tool class
    """Access approvals, memory, and telemetry via one control-plane surface."""

    @property
    def parameters(self) -> dict[str, Any]:
        return {
            "type": "object",
            "properties": {
                "action": {
                    "type": "string",
                    "enum": [
                        "summary",
                        "telemetry",
                        "approvals_list",
                        "approvals_pending_outbound",
                        "approvals_get",
                        "approvals_events",
                        "approvals_approve",
                        "approvals_reject",
                        "memory_recall",
                        "memory_audit",
                        "memory_remember",
                        "memory_correct",
                        "memory_forget",
                    ],
                },
                # ... additional parameters
            },
            "required": ["action"],
        }
```
This gives any agent the ability to:
- Query telemetry: See task success rates, error patterns, LLM call metrics
- Manage approvals: Review and resolve pending outbound messages
- Access memory: Read and write to the shared knowledge base
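For example, an agent's approval-review pass might look like the sketch below. This assumes the base `Tool` exposes an async `run(**kwargs)` entry point and that approvals carry an `approval_id`; both are assumptions for illustration, not the real nanobot interface:

```python
# Hypothetical agent-side calls; the real Tool entry point may differ.
async def review_queue(control_plane: ControlPlaneTool) -> None:
    # Check fleet-wide counters first.
    summary = await control_plane.run(action="summary")
    print(summary["tasks"], summary["events"])

    # Then resolve anything waiting for outbound approval.
    pending = await control_plane.run(action="approvals_pending_outbound")
    for item in pending:
        await control_plane.run(action="approvals_approve", approval_id=item["id"])
```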
Memory Strategies
Not all agents need the same memory access.

Isolated (oldham-intelligence):
- Separate namespace prevents cross-contamination
- Client data stays in its own silo
- Daily notes are independent

Shared (the other five agents):
- Access to user preferences and system context
- Can read from other agents' findings
- Collaborative knowledge building
This is configured per-agent in their initialization:
```python
self.context = ContextBuilder(workspace)
self.control_plane_service = ControlPlaneService(
    memory=self.context.memory,  # Shared or isolated
    telemetry=self.telemetry,
)
```
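As a sketch of what that selection could look like, assuming a factory keyed on the agent's isolation flag (`Memory` and `build_memory` are hypothetical stand-ins, not the actual nanobot classes):

```python
from dataclasses import dataclass

@dataclass
class Memory:
    """Stand-in for the real memory store; only the namespace matters here."""
    root: str

def build_memory(agent_name: str, isolated: bool) -> Memory:
    # Isolated agents read/write under their own directory;
    # shared agents all resolve to the main namespace.
    namespace = agent_name if isolated else "main"
    return Memory(root=f"agents/{namespace}/memory")

print(build_memory("oldham-intelligence", isolated=True).root)  # agents/oldham-intelligence/memory
print(build_memory("site-auditor", isolated=False).root)        # agents/main/memory
```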
Dashboard and Monitoring
Every agent exposes a /summary endpoint that returns:
```json
{
  "tasks": {
    "total": 169,
    "done": 156,
    "failed": 13,
    "running": 0
  },
  "subtasks": {
    "total": 1262,
    "done": 1252,
    "failed": 10
  },
  "llm_calls": 1847,
  "events": 3341
}
```
This aggregates into a central dashboard showing fleet health at a glance.
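The aggregation itself can be a simple poll of each agent's /summary endpoint. A standard-library sketch; the agent URLs and port layout here are assumptions:

```python
import json
from urllib.request import urlopen

AGENTS = {
    "oldham-intelligence": "http://localhost:8001",
    "self-improver": "http://localhost:8006",
    # ... one entry per agent
}

def fleet_health() -> dict[str, dict]:
    """Fetch every agent's /summary and compute a per-agent failure rate."""
    health = {}
    for name, base in AGENTS.items():
        with urlopen(f"{base}/summary", timeout=5) as resp:
            summary = json.load(resp)
        tasks = summary["tasks"]
        failure_rate = tasks["failed"] / max(tasks["total"], 1)
        health[name] = {"summary": summary, "failure_rate": failure_rate}
    return health
```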
The Architecture
```
hyperbot/
├── agents/
│   ├── oldham-intelligence/
│   │   ├── workspace/
│   │   └── memory/          # Isolated
│   ├── self-improver/
│   │   ├── workspace/
│   │   └── (shared memory)
│   └── ...
├── control_plane/
│   ├── service.py           # Unified API
│   └── dashboard.py         # Fleet monitoring
├── telemetry/
│   └── store.py             # SQLite event log
└── bus/
    └── queue.py             # Inter-agent messaging
```
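The telemetry store is just an append-only SQLite event log. A minimal sketch of the shape telemetry/store.py might take; the schema and method names are assumed, not the real ones:

```python
import sqlite3
import time

class TelemetryStore:
    """Append-only event log: one row per event, queryable for counts."""

    def __init__(self, path: str = "telemetry.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "  ts REAL, agent TEXT, kind TEXT, payload TEXT)"
        )

    def record(self, agent: str, kind: str, payload: str = "") -> None:
        # Append one event; never update or delete existing rows.
        self.conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            (time.time(), agent, kind, payload),
        )
        self.conn.commit()

    def count(self, agent: str | None = None) -> int:
        # Fleet-wide count, or per-agent if a name is given.
        if agent:
            row = self.conn.execute(
                "SELECT COUNT(*) FROM events WHERE agent = ?", (agent,)
            ).fetchone()
        else:
            row = self.conn.execute("SELECT COUNT(*) FROM events").fetchone()
        return row[0]
```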
Scale Numbers
The commit that brought this together added 1,692 lines across:
- nanobot/control_plane/service.py (new)
- nanobot/agent/tools/control_plane.py (new)
- nanobot/telemetry/store.py (enhanced)
- Dashboard and sync utilities
The system now comfortably handles:
- 6 concurrent agents
- 169 tasks/day at peak
- 3,341 events tracked
- Sub-100ms control plane queries
Lessons from Scaling
- Unified control surface: One API for telemetry, memory, and approvals reduces complexity
- Memory isolation by default: Only share when necessary
- Heartbeat-driven agents: Scheduled tasks for routine operations (see the scheduler sketch after this list)
- On-demand agents: Spawn for specific tasks, shut down when done
- Telemetry everywhere: You can't optimize what you don't measure
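The heartbeat loop itself can be tiny: compute the next slot, sleep until it arrives, fire. A standard-library sketch of the daily variant; the real scheduler's shape is an assumption, and `agent_tick` stands in for whatever the agent's run method is:

```python
import datetime as dt
import time

def next_daily(hour: int, minute: int, now: dt.datetime | None = None) -> dt.datetime:
    """Next occurrence of HH:MM, e.g. self-improver's daily 00:17 heartbeat."""
    now = now or dt.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += dt.timedelta(days=1)  # already past today's slot; use tomorrow's
    return target

def run_heartbeat(agent_tick, hour: int, minute: int) -> None:
    # Sleep until the next scheduled slot, fire the agent, repeat.
    while True:
        wake = next_daily(hour, minute)
        time.sleep((wake - dt.datetime.now()).total_seconds())
        agent_tick()

# run_heartbeat(self_improver.tick, hour=0, minute=17)  # daily at 00:17
```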
What's Next
The platform is stable, but there's room to grow:
- Agent-to-agent communication: Let agents delegate to each other
- Resource quotas: Prevent one agent from consuming all LLM budget
- Priority scheduling: Critical agents get first access to resources
- Auto-scaling: Spawn additional instances under load
For now, the system handles everything I throw at it. 177 tasks in a weekend without breaking a sweat.