Building a multi-agent orchestration system
Last month, the Interaction Company of California launched Poke — their first product, an iMessage-based AI that proactively reminds you of meetings, emails that need your attention, and more. It's like speaking to your own digital secretary. That idea stuck with me, and I really wanted to build my own. I decided now was the time.
Design Principles
I wanted to create an "operating system for thought" where the user speaks with one Orchestration Agent (OA) — a single personality-driven AI that manages the conversation and delegates work to Execution Agents (EA). These EAs persist across tasks, maintaining memory and context.
Here is my initial design:
The core principles:
- Single user-facing agent: Only the Orchestrator talks to the user. All reasoning, planning, and tool usage is hidden behind this thread.
- Persistent execution agents: Each EA owns a thread of work (like handling an email chain). The OA spawns or reuses them, ensuring continuity.
- Tools as connectors: All side-effects (messages, jobs, reminders) are done through tools wrapping Convex actions with strict validation.
- Asynchronous job management: Long-running tasks run as jobs. Completion triggers the OA to resume, keeping the flow responsive.
- Background tasks via CRON: Reminders and scheduled actions run without blocking conversation.
- Personality and tone: The OA is witty, concise, uses “we/our” language, avoids corporate jargon, and confirms intent before acting in the real world.
Orchestration and Execution Agents
Orchestrator Flow
- Receive user message: The user message is stored in the OA thread. The `orchestrate` action retrieves the OA thread and the user's identity, then builds the LLM toolset (`sendMessageToUser`, `wait`, `draft`, `spawnJobsGroup`, `proposeReminder`, `cancelReminder`). It composes a system prompt with tool descriptions, current conversation context, and personality guidelines. It then calls `thread.generateText` to produce the next plan.
- Tool calls: The LLM may call one or more tools. For example, it could send a message to the user, propose a job group, or post a status and wait. Tool calls are implemented as internal mutations; the orchestrator passes the OA thread ID and user ID to these functions.
- Job creation and waiting: For tasks requiring external actions (drafting/sending emails, Gmail operations, reminders), the LLM calls `spawnJobsGroup`. This inserts jobs in the `jobs` table, ensures deduplication with deterministic job keys, schedules execution for each job, and posts a status message. A watchdog resume based on the default SLA (5 minutes) is also scheduled in the background for redundancy.
- Resume: When a job finishes or the watchdog fires, the `resume` action collects job results via `collectReady`, constructs a synthesizing prompt that includes the latest results and any new user messages, and calls `generateText` again. This loop continues until the LLM decides the conversation is complete and no further action is needed.
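The loop above can be reduced to a small sketch. This is a simplified model, not the real Convex code: `generate`, `dispatch`, and the `PlanStep` shape are hypothetical stand-ins for `thread.generateText` and the tool mutations.

```typescript
// Hypothetical shapes standing in for the real tool-call plumbing.
type ToolCall = { tool: string; args: Record<string, unknown> };
type PlanStep = { done: boolean; calls: ToolCall[] };

async function orchestrate(
  generate: (prompt: string) => Promise<PlanStep>,
  dispatch: (call: ToolCall) => Promise<string>,
  transcript: string[],
): Promise<string[]> {
  // Loop until the model signals the turn is complete.
  for (;;) {
    const step = await generate(transcript.join("\n"));
    for (const call of step.calls) {
      // Tool results are appended so the next generation can synthesize them.
      transcript.push(await dispatch(call));
    }
    if (step.done) return transcript;
  }
}
```

In the real system the "loop" is re-entered asynchronously via `resume` rather than held open, but the shape of the control flow is the same.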
Here is how I envisioned the fanout phase:
Execution Agents
- Thread creation and memory: Each EA corresponds to a Convex thread that persists across jobs. EAs maintain their own message history and can recall past interactions, enabling context continuity when reused. This prevents context bloat in the central OA thread.
- Job execution: `broker.runJob` executes a job by invoking the router to obtain the appropriate EA thread and roster ID. Based on the job type, it calls the corresponding action or external API:
  - `draft_email`: Calls the LLM via the EA thread to generate a draft email using recipient, context, subject, and body.
  - `send_email`: Sends the email via the Gmail API with idempotency; uses content hashing to avoid duplicate sends.
  - `gmail_search` and `gmail_read`: Use Google APIs to search or read emails, formatting results for the OA to synthesize.
  - Reminder jobs (created by `proposeReminder`) are handled separately by the reminder module.
- Completion and usage update: Once a job is completed, `completeJob` marks it as done, stores the results, updates the barrier, and, if a `rosterId` is attached, schedules `updateRosterUsage`. This increments usage counts and triggers semantic and temporal summarization of the agent's context.
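The barrier bookkeeping that `completeJob` performs boils down to a counter and a result list; the `Barrier` shape below is illustrative, not the actual `barriers` table schema.

```typescript
// Illustrative barrier record: total jobs in the group, how many have
// finished, and their results.
type Barrier = { total: number; done: number; results: string[] };

function completeJob(barrier: Barrier, result: string): boolean {
  barrier.done += 1;
  barrier.results.push(result);
  // When every job in the group has reported in, the OA's resume action fires.
  return barrier.done >= barrier.total;
}
```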
Once the jobs were spawned, we could then execute on them:
Tools and Jobs
Toolset
- `sendMessageToUser`: Posts an assistant message to the OA thread for all user-facing output.
- `draft`: Inserts a draft message produced by an EA into the OA thread.
- `wait`: Posts a status message and schedules a resume after a specified delay. It can also be tied to a barrier group for job groups.
- `spawnJobsGroup`: Creates a batch of jobs, deduplicates with content hashes, inserts them into the `jobs` table, schedules `runJob` for each, posts a status message, and sets a timeout. It requires properly structured args for each job type and validates them.
- `proposeReminder` and `cancelReminder`: Tools for scheduling and canceling reminders; they validate user preferences and handle confirmation flows.
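To illustrate the per-job-type validation that `spawnJobsGroup` performs, here is a minimal sketch; the required-field lists are assumptions for illustration, not the real schema.

```typescript
type JobType = "draft_email" | "send_email" | "gmail_search" | "gmail_read";

// Assumed required fields per job type (illustrative only).
const requiredArgs: Record<JobType, string[]> = {
  draft_email: ["recipient", "context"],
  send_email: ["recipient", "subject", "body"],
  gmail_search: ["query"],
  gmail_read: ["messageId"],
};

function validateJobArgs(
  jobType: JobType,
  args: Record<string, unknown>,
): string[] {
  // Return the list of missing required fields (empty means valid).
  return requiredArgs[jobType].filter(
    (k) => args[k] === undefined || args[k] === "",
  );
}
```

Rejecting malformed args before scheduling keeps bad jobs out of the `jobs` table entirely, rather than failing mid-execution.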
Job Lifecycle
- Job creation: Jobs are inserted into the `jobs` table with a `groupId`, `userId`, `jobType`, `args`, optional `targetKey`, `rosterId`, and `eaThreadId`, status `pending`, and a deterministic `jobKey` for deduplication. The group is tracked via the `barriers` table with `total` and `done` counts.
- Routing: For each job, `runJob` uses the router to classify the job and assign a `rosterId` and `eaThreadId`. Routing logic prioritizes hard bindings (email addresses), alias matches (common names), legacy target keys, recency fallback, or new creation.
- Execution: Jobs are executed using EAs or external actions. When complete, `completeJob` updates the barrier result list and triggers resume if all jobs are completed. If execution fails, error results are stored to inform the OA.
- Deduplication and idempotency: Email sends use content hashes to prevent duplicates. Job creation checks for existing job keys to avoid re-scheduling similar tasks.
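A deterministic `jobKey` could be derived along these lines (a sketch; the actual hashing scheme and fields may differ). Canonicalizing the args before hashing makes the key independent of property order, assuming flat args:

```typescript
import { createHash } from "node:crypto";

// Sketch: hash group, job type, and canonicalized args so that retries of
// the same logical job map to the same key and get deduplicated.
function jobKey(
  groupId: string,
  jobType: string,
  args: Record<string, unknown>,
): string {
  // Sorted-key replacer gives a stable serialization for flat args.
  const canonical = JSON.stringify(args, Object.keys(args).sort());
  return createHash("sha256")
    .update(`${groupId}:${jobType}:${canonical}`)
    .digest("hex")
    .slice(0, 16);
}
```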
Once a job was completed, the orchestration turn would resume:
Then, once the user confirmed, we could finally complete the process in two more phases:
Agent Roster and Semantic Context
Motivation and Features
The agent roster delivers reliable routing and context awareness, a solution I developed after first attempting to use targetKeys (keys assigned to entities within a task).
- Semantic descriptions: `generateSemanticDescription` extracts human-readable purposes from job arguments (e.g., "Searching emails from Tyler about camping") and stores them in the roster. This helps both debugging and user-facing display.
- Alias extraction: Names are extracted not only from `draft_email` recipients, but also from Gmail search queries (`from:`/`to:` patterns), enabling cross-job matching.
- Metadata and usage stats: The roster stores purpose, creation time, initial args, capabilities, alias list, message count, and last-used timestamp. After each job, usage statistics are updated and summarized, emphasizing purpose, usage count, capabilities, and last activity time.
- Deterministic routing: Every call to `classifyRoute` returns a `RouteResult` with a `rosterId` and `eaThreadId`. Fallback code in `broker.ts` was eliminated, simplifying OA logic and guaranteeing proper agent assignment.
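The routing priority described above (hard bindings, then aliases, then legacy keys, then recency, then creation) can be sketched as a fallback chain; the `RosterEntry` shape and `hint` fields here are illustrative stand-ins for the real router's inputs.

```typescript
type RosterEntry = {
  rosterId: string;
  emails: string[];
  aliases: string[];
  legacyKey?: string;
  lastUsed: number;
};

function classifyRoute(
  roster: RosterEntry[],
  hint: { email?: string; alias?: string; legacyKey?: string },
): string | "create_new" {
  // 1. Hard binding: exact email address match.
  const byEmail = roster.find((r) => hint.email && r.emails.includes(hint.email));
  if (byEmail) return byEmail.rosterId;
  // 2. Alias match: common names extracted from prior jobs.
  const byAlias = roster.find((r) => hint.alias && r.aliases.includes(hint.alias));
  if (byAlias) return byAlias.rosterId;
  // 3. Legacy target key, for agents created under the old scheme.
  const byKey = roster.find((r) => hint.legacyKey && r.legacyKey === hint.legacyKey);
  if (byKey) return byKey.rosterId;
  // 4. Recency fallback: most recently used agent, if any exist.
  if (roster.length > 0)
    return [...roster].sort((a, b) => b.lastUsed - a.lastUsed)[0].rosterId;
  // 5. Otherwise, signal that a new EA should be created.
  return "create_new";
}
```

Because every branch returns a result, the caller always gets an assignment, which is what let the `broker.ts` fallback code be removed.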
Benefits and Challenges
- Clearer context for the user: With a UI, we could show users which agents exist and what they were created for (e.g., "Tyler (Camping) - last used Oct 2") instead of generic canonical names like "gmail_search_agent:tyler@email.com".
- Smarter matching: Aliases and semantic descriptions enable reuse across job types and ensure related tasks use the same EA, improving continuity and reducing overhead.
- Removal of outdated mapping: This reduces complexity and ensures a single source of truth for agent routing.
I faced many challenges implementing the agent roster. My initial plan was to have the LLM generate a key for each EA. The problem was that keys were too specific and caused the LLM to create new EA threads for every job. Tasks still got completed (the OA retained context of deployed EAs), but it introduced latency and complexity.
Background Jobs and Reminder System
Requirements and Data Model
Users should be able to set reminders during a conversation ("remind me to submit the report tomorrow at 6 PM"), and the system should deliver email notifications at the correct time. To support this:
- `reminders`: Stores details about the reminder, including the title, message, `dueAt`, status, and metadata for deduplication.
- `user_preferences`: Each user must confirm their `timezone` and `notificationEmail` before creating reminders. Missing preferences trigger a UI prompt via the `PreferencesModal`.
- Duplicate detection: When a reminder is created, `assessDuplicateReminder` checks for similar reminders within a 30-minute window and decides whether to merge or create a new entry.
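The 30-minute dedup window can be sketched as a simple predicate; matching on a normalized title is an assumption about how similarity is judged here, not the actual `assessDuplicateReminder` logic.

```typescript
const WINDOW_MS = 30 * 60 * 1000; // 30-minute dedup window

type Reminder = { title: string; dueAt: number };

function isDuplicate(existing: Reminder[], incoming: Reminder): boolean {
  // Duplicate if an existing reminder has the same normalized title and a
  // due time within the window of the incoming one.
  return existing.some(
    (r) =>
      r.title.trim().toLowerCase() === incoming.title.trim().toLowerCase() &&
      Math.abs(r.dueAt - incoming.dueAt) <= WINDOW_MS,
  );
}
```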
Cron Scheduling and Email Delivery
- Cron sweep: A cron job runs every minute to call `sweepAndQueue`. It identifies due reminders, updates their status to `queued`, and schedules a `sendEmail` action for each. A daily cron archives reminders older than 30 days.
- Sending reminders: `sendEmail` uses the Resend API and a React email template to send a friendly reminder. It updates the reminder's status to `sent` or logs errors.
- UI prompts and confirmations: If preferences are missing, `proposeReminder` enqueues a `prompt_set_prefs` UI command and asks the user to confirm their timezone and email. Once preferences exist, the tool creates the reminder and posts a confirmation with a localized, human-readable date.
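The per-minute sweep reduces to a small selection function. In the real system the status change is a database patch and each queued reminder gets a scheduled `sendEmail` action, but the selection logic looks roughly like this:

```typescript
type Rem = { id: string; dueAt: number; status: "pending" | "queued" | "sent" };

// Pick reminders that are due and still pending, and mark them queued.
function sweepAndQueue(reminders: Rem[], now: number): string[] {
  const queued: string[] = [];
  for (const r of reminders) {
    if (r.status === "pending" && r.dueAt <= now) {
      r.status = "queued"; // real system: db patch + scheduled sendEmail
      queued.push(r.id);
    }
  }
  return queued;
}
```

The status transition to `queued` before sending matters: it prevents the next sweep, one minute later, from picking up the same reminder twice.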
User Interaction via Tools
- `proposeReminder`: Accepts a title, message, and `dueAtMs`. It requires two explicit confirmations (user intent and exact time) before being called. It checks preferences, handles deduplication, and confirms creation with the user.
- `cancelReminder`: Cancels a reminder by ID when the agent is confident which one to cancel, then confirms with the user.
Extensibility
The reminder mechanism can be extended to support additional tools like Slack messages, calendar events, or push notifications by implementing new send actions and preference fields. The modular design enables LLM-decided action: once a tool exists, it can be used by any process.
Integrations and Tooling
Gmail Integration
- Search and read: The Gmail integration allows users to search and read emails. `gmail_search` uses the API to query messages; results are formatted with IDs and snippets for easy selection. `gmail_read` reads a specific email and formats its content for the agent.
- Draft and send: Drafting uses LLM reasoning in the EA thread to produce a full email (subject, body, recipient) from the job arguments.
External Actions and Idempotency
- ID management: Each send job includes a `contentHash`, `eaThreadId`, and `targetKey`. Before sending, the system queries the `send_log` table to ensure the same content hasn't already been sent in the thread.
- Provider handling: Currently only Gmail is supported via the Google API with OAuth. Other providers could be added with similar actions and logs.
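The `send_log` check reduces to set membership on a thread-scoped content hash; this in-memory sketch stands in for the actual table query.

```typescript
// Stand-in for the send_log table: keys are "threadId:contentHash".
const sendLog = new Set<string>();

function shouldSend(eaThreadId: string, contentHash: string): boolean {
  const key = `${eaThreadId}:${contentHash}`;
  if (sendLog.has(key)) return false; // same content already sent in this thread
  sendLog.add(key);
  return true;
}
```

Scoping the key to the thread means identical content can still be sent legitimately from a different EA, while retries within one thread stay idempotent.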
Lessons Learned and Future Direction
Architectural Insights
- Centralized orchestrator simplifies UI and UX: Keeping all user interaction in the OA and isolating side-effects behind tools preserves a single persona and prevents the LLM from leaking internal reasoning. This makes fine-tuning persona and response style easier.
- Persistent agents improve context: Reusing EAs for related tasks reduces redundant prompting and provides memory of previous actions. The agent roster further improves this by tracking semantic context and alias usage for better routing. It also prevents context bloat inside the OA thread.
- Asynchronous design balances responsiveness and reliability: The barrier/job system lets long-running operations run without blocking the user. Tools like `wait` and `resume` maintain conversation flow while tasks complete in the background.
Challenges
- Latency: Job scheduling, router classification, and external APIs all introduce delays. Ensuring the OA kept the user informed was critical — often solved with status messages rather than shaving milliseconds.
- Complexity: The interplay between jobs, barriers, EAs, and routing adds many layers. Comprehensive debug logs and brokers were critical to observe how the LLM was reasoning and adjust accordingly.
- User interaction: Building a dynamic system prompt was key to ensuring the OA understood its role. Early on, it treated the thread like a standard LLM conversation instead of using tools. Prompting and design adjustments fixed this.
Future Enhancements
- Embedding-based routing: Use vector embeddings for semantic descriptions and conversation history to enable smarter reuse of agents across domains.
- Context summarization and updates: Extend semantic updates to summarize context after N messages or tasks, keeping descriptions current for better routing.
- Extended connectors: Integrate with more tools, like Slack, calendars, GitHub, or research systems. The router is domain-agnostic and can map new capabilities with minimal changes.
- Proactive suggestions: Leverage semantic context to suggest follow-ups or reminders ("We discussed a Q2 report, should I schedule a reminder to send it next week?").
Reflection
This was easily the most ambitious system I've ever built. At first, it was just a theory sketched out in Excalidraw. I didn't know if it would even be possible to get all the moving parts — orchestration, execution, persistence, semantic routing — to work together.
What I discovered:
- Simple primitives scale far. The entire system really boils down to two internal actions (send a message, wait for an agent to finish) and one external action (update the user). Everything else is just composition. That realization changed how I think about agentic systems.
- Debugging is design. Half the battle was building logs and brokers just to see how the LLM was “thinking.” Without transparency, the orchestration loop collapses into a sea of unknowns.
- Persistence is everything. A new EA every time technically “worked,” but it killed continuity. Building the agent roster with semantic contextualization was the turning point — suddenly agents felt alive, not disposable.
- Latency is a human problem, not just a technical one. The biggest challenge wasn't making the system faster; it was keeping the user informed. A witty status message can smooth over a 5-second delay better than shaving milliseconds off a job.
Where this leaves me:
This project was proof that multi-agent orchestration is practical. The architecture is modular, extensible, and already hints at real applications: from AI-powered email triage to proactive reminders to full productivity suites.
But more importantly, it reminded me why I build in the first place. When I finally got the system to search an email, draft a response, and send it off — autonomously, with continuity — I felt the same spark I did back at ShellHacks when we proved an AI tutor could adapt in real-time. Pure dopamine.
The frontier of AI isn't "smart chatbots," but systems that coordinate memory, reasoning, and tools into something that feels closer to a collaborator. The future of interfaces is still to be discovered, but this project was my first serious step into that frontier. And it won't be my last.
I plan to have an open-source version of the project available by the end of October. If you'd like to see more about the application itself, please check out the Project Page.
© 2025 Kevin Willoughby. All rights reserved.