Building a multi-agent orchestration system
Last month, the Interaction Company of California launched Poke — their first product, an iMessage-based AI that proactively reminds you of meetings, emails that need your attention, and more. It's like speaking to your own digital secretary. That idea stuck with me, and I really wanted to build my own. I decided now was the time.
Design Principles
I wanted to create an "operating system for thought" where the user speaks with one Orchestration Agent (OA) — a single personality-driven AI that manages the conversation and delegates work to Execution Agents (EA). These EAs persist across tasks, maintaining memory and context.
Here is my initial design:
The core principles:
- Single user-facing agent: Only the Orchestrator talks to the user. All reasoning, planning, and tool usage is hidden behind this thread.
- Persistent execution agents: Each EA owns a thread of work (like handling an email chain). The OA spawns or reuses them, ensuring continuity.
- Tools as connectors: All side-effects (messages, jobs, reminders) are done through tools wrapping Convex actions with strict validation.
- Asynchronous job management: Long-running tasks run as jobs. Completion triggers the OA to resume, keeping the flow responsive.
- Background tasks via CRON: Reminders and scheduled actions run without blocking conversation.
- Personality and tone: The OA is witty, concise, uses “we/our” language, avoids corporate jargon, and confirms intent before acting in the real world.
Orchestration and Execution Agents
Orchestrator Flow
- Receive user message: The user message is stored in the OA thread. The `orchestrate` action retrieves the OA thread and the user's identity, then builds the LLM toolset (`sendMessageToUser`, `wait`, `draft`, `spawnJobsGroup`, `proposeReminder`, `cancelReminder`). It composes a system prompt with tool descriptions, current conversation context, and personality guidelines. It then calls `thread.generateText` to produce the next plan.
- Tool calls: The LLM may call one or more tools. For example, it could send a message to the user, propose a job group, or post a status and wait. Tool calls are implemented as internal mutations; the orchestrator passes the OA thread ID and user ID to these functions.
- Job creation and waiting: For tasks requiring external actions (drafting/sending emails, Gmail operations, reminders), the LLM calls `spawnJobsGroup`. This inserts jobs in the `jobs` table, ensures deduplication with deterministic job keys, schedules execution for each job, and posts a status message. A watchdog resume based on the default SLA (5 minutes) is also scheduled in the background for redundancy.
- Resume: When a job finishes or the watchdog fires, the `resume` action collects job results via `collectReady`, constructs a synthesizing prompt that includes the latest results and any new user messages, and calls `generateText` again. This loop continues until the LLM decides the conversation is complete and no further action is needed.
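The loop above can be reduced to a small sketch. This is a simplified model, not the real Convex code: `generate`, `dispatch`, and the `PlanStep` shape are hypothetical stand-ins for `thread.generateText` and the tool mutations.

```typescript
// Hypothetical shapes standing in for the real tool-call plumbing.
type ToolCall = { tool: string; args: Record<string, unknown> };
type PlanStep = { done: boolean; calls: ToolCall[] };

async function orchestrate(
  generate: (prompt: string) => Promise<PlanStep>,
  dispatch: (call: ToolCall) => Promise<string>,
  transcript: string[],
): Promise<string[]> {
  // Loop until the model signals the turn is complete.
  for (;;) {
    const step = await generate(transcript.join("\n"));
    for (const call of step.calls) {
      // Tool results are appended so the next generation can synthesize them.
      transcript.push(await dispatch(call));
    }
    if (step.done) return transcript;
  }
}
```

In the real system the "loop" is re-entered asynchronously via `resume` rather than held open, but the shape of the control flow is the same.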
Here is how I envisioned the fanout phase:
Execution Agents
- Thread creation and memory: Each EA corresponds to a Convex thread that persists across jobs. EAs maintain their own message history and can recall past interactions, enabling context continuity when reused. This prevents context bloat in the central OA thread.
- Job execution: `broker.runJob` executes a job by invoking the router to obtain the appropriate EA thread and roster ID. Based on the job type, it calls the corresponding action or external API:
  - `draft_email`: Calls the LLM via the EA thread to generate a draft email using recipient, context, subject, and body.
  - `send_email`: Sends the email via the Gmail API with idempotency; uses content hashing to avoid duplicate sends.
  - `gmail_search` and `gmail_read`: Use Google APIs to search or read emails, formatting results for the OA to synthesize.
  - Reminder jobs (created by `proposeReminder`) are handled separately by the reminder module.
- Completion and usage update: Once a job is completed, `completeJob` marks it as done, stores the results, updates the barrier, and, if a `rosterId` is attached, schedules `updateRosterUsage`. This increments usage counts and triggers semantic and temporal summarization of the agent's context.
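The barrier bookkeeping that `completeJob` performs boils down to a counter and a result list; the `Barrier` shape below is illustrative, not the actual `barriers` table schema.

```typescript
// Illustrative barrier record: total jobs in the group, how many have
// finished, and their results.
type Barrier = { total: number; done: number; results: string[] };

function completeJob(barrier: Barrier, result: string): boolean {
  barrier.done += 1;
  barrier.results.push(result);
  // When every job in the group has reported in, the OA's resume action fires.
  return barrier.done >= barrier.total;
}
```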
Once the jobs were spawned, we could then execute on them:
Tools and Jobs
Toolset
- `sendMessageToUser`: Posts an assistant message to the OA thread for all user-facing output.
- `draft`: Inserts a draft message produced by an EA into the OA thread.
- `wait`: Posts a status message and schedules a resume after a specified delay. It can also be tied to a barrier group for job groups.
- `spawnJobsGroup`: Creates a batch of jobs, deduplicates with content hashes, inserts them into the `jobs` table, schedules `runJob` for each, posts a status message, and sets a timeout. It requires properly structured args for each job type and validates them.
- `proposeReminder` and `cancelReminder`: Tools for scheduling and canceling reminders; they validate user preferences and handle confirmation flows.
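To illustrate the per-job-type validation that `spawnJobsGroup` performs, here is a minimal sketch; the required-field lists are assumptions for illustration, not the real schema.

```typescript
type JobType = "draft_email" | "send_email" | "gmail_search" | "gmail_read";

// Assumed required fields per job type (illustrative only).
const requiredArgs: Record<JobType, string[]> = {
  draft_email: ["recipient", "context"],
  send_email: ["recipient", "subject", "body"],
  gmail_search: ["query"],
  gmail_read: ["messageId"],
};

function validateJobArgs(
  jobType: JobType,
  args: Record<string, unknown>,
): string[] {
  // Return the list of missing required fields (empty means valid).
  return requiredArgs[jobType].filter(
    (k) => args[k] === undefined || args[k] === "",
  );
}
```

Rejecting malformed args before scheduling keeps bad jobs out of the `jobs` table entirely, rather than failing mid-execution.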
Job Lifecycle
- Job creation: Jobs are inserted into the `jobs` table with a `groupId`, `userId`, `jobType`, `args`, optional `targetKey`, `rosterId`, and `eaThreadId`, status `pending`, and a deterministic `jobKey` for deduplication. The group is tracked via the `barriers` table with `total` and `done` counts.
- Routing: For each job, `runJob` uses the router to classify the job and assign a `rosterId` and `eaThreadId`. Routing logic prioritizes hard bindings (email addresses), alias matches (common names), legacy target keys, recency fallback, or new creation.
- Execution: Jobs are executed using EAs or external actions. When complete, `completeJob` updates the barrier result list and triggers resume if all jobs are completed. If execution fails, error results are stored to inform the OA.
- Deduplication and idempotency: Email sends use content hashes to prevent duplicates. Job creation checks for existing job keys to avoid re-scheduling similar tasks.
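A deterministic `jobKey` could be derived along these lines (a sketch; the actual hashing scheme and fields may differ). Canonicalizing the args before hashing makes the key independent of property order, assuming flat args:

```typescript
import { createHash } from "node:crypto";

// Sketch: hash group, job type, and canonicalized args so that retries of
// the same logical job map to the same key and get deduplicated.
function jobKey(
  groupId: string,
  jobType: string,
  args: Record<string, unknown>,
): string {
  // Sorted-key replacer gives a stable serialization for flat args.
  const canonical = JSON.stringify(args, Object.keys(args).sort());
  return createHash("sha256")
    .update(`${groupId}:${jobType}:${canonical}`)
    .digest("hex")
    .slice(0, 16);
}
```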
Once a job was completed, the orchestration turn would resume:
Then, once the user confirmed, we could finally complete the process in two more phases:
Agent Roster and Semantic Context
Motivation and Features
The agent roster delivers reliable routing and context awareness, a solution I developed after first attempting to use targetKeys (keys assigned to entities within a task).
- Semantic descriptions: `generateSemanticDescription` extracts human-readable purposes from job arguments (e.g., "Searching emails from Tyler about camping") and stores them in the roster. This helps both debugging and user-facing display.
- Alias extraction: Names are extracted not only from `draft_email` recipients, but also from Gmail search queries (`from:`/`to:` patterns), enabling cross-job matching.
- Metadata and usage stats: The roster stores purpose, creation time, initial args, capabilities, alias list, message count, and last-used timestamp. After each job, usage statistics are updated and summarized, emphasizing purpose, usage count, capabilities, and last activity time.
- Deterministic routing: Every call to `classifyRoute` returns a `RouteResult` with a `rosterId` and `eaThreadId`. Fallback code in `broker.ts` was eliminated, simplifying OA logic and guaranteeing proper agent assignment.
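The routing priority described above (hard bindings, then aliases, then legacy keys, then recency, then creation) can be sketched as a fallback chain; the `RosterEntry` shape and `hint` fields here are illustrative stand-ins for the real router's inputs.

```typescript
type RosterEntry = {
  rosterId: string;
  emails: string[];
  aliases: string[];
  legacyKey?: string;
  lastUsed: number;
};

function classifyRoute(
  roster: RosterEntry[],
  hint: { email?: string; alias?: string; legacyKey?: string },
): string | "create_new" {
  // 1. Hard binding: exact email address match.
  const byEmail = roster.find((r) => hint.email && r.emails.includes(hint.email));
  if (byEmail) return byEmail.rosterId;
  // 2. Alias match: common names extracted from prior jobs.
  const byAlias = roster.find((r) => hint.alias && r.aliases.includes(hint.alias));
  if (byAlias) return byAlias.rosterId;
  // 3. Legacy target key, for agents created under the old scheme.
  const byKey = roster.find((r) => hint.legacyKey && r.legacyKey === hint.legacyKey);
  if (byKey) return byKey.rosterId;
  // 4. Recency fallback: most recently used agent, if any exist.
  if (roster.length > 0)
    return [...roster].sort((a, b) => b.lastUsed - a.lastUsed)[0].rosterId;
  // 5. Otherwise, signal that a new EA should be created.
  return "create_new";
}
```

Because every branch returns a result, the caller always gets an assignment, which is what let the `broker.ts` fallback code be removed.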
Benefits and Challenges
- Clearer context for the user: With a UI, we could show users which agents exist and what they were created for (e.g., "Tyler (Camping) - last used Oct 2") instead of generic canonical names like "gmail_search_agent:tyler@email.com".
- Smarter matching: Aliases and semantic descriptions enable reuse across job types and ensure related tasks use the same EA, improving continuity and reducing overhead.
- Removal of outdated mapping: This reduces complexity and ensures a single source of truth for agent routing.
I faced many challenges implementing the agent roster. My initial plan was to have the LLM generate a key for each EA. The problem was that keys were too specific and caused the LLM to create new EA threads for every job. Tasks still got completed (the OA retained context of deployed EAs), but it introduced latency and complexity.
Background Jobs and Reminder System
Requirements and Data Model
Users should be able to set reminders during a conversation ("remind me to submit the report tomorrow at 6 PM"), and the system should deliver email notifications at the correct time. To support this:
- `reminders`: Stores details about the reminder, including the title, message, `dueAt`, status, and metadata for deduplication.
- `user_preferences`: Each user must confirm their `timezone` and `notificationEmail` before creating reminders. Missing preferences trigger a UI prompt via the `PreferencesModal`.
- Duplicate detection: When a reminder is created, `assessDuplicateReminder` checks for similar reminders within a 30-minute window and decides whether to merge or create a new entry.
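The 30-minute dedup window can be sketched as a simple predicate; matching on a normalized title is an assumption about how similarity is judged here, not the actual `assessDuplicateReminder` logic.

```typescript
const WINDOW_MS = 30 * 60 * 1000; // 30-minute dedup window

type Reminder = { title: string; dueAt: number };

function isDuplicate(existing: Reminder[], incoming: Reminder): boolean {
  // Duplicate if an existing reminder has the same normalized title and a
  // due time within the window of the incoming one.
  return existing.some(
    (r) =>
      r.title.trim().toLowerCase() === incoming.title.trim().toLowerCase() &&
      Math.abs(r.dueAt - incoming.dueAt) <= WINDOW_MS,
  );
}
```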
Cron Scheduling and Email Delivery
- Cron sweep: A cron job runs every minute to call `sweepAndQueue`. It identifies due reminders, updates their status to `queued`, and schedules a `sendEmail` action for each. A daily cron archives reminders older than 30 days.
- Sending reminders: `sendEmail` uses the Resend API and a React email template to send a friendly reminder. It updates the reminder's status to `sent` or logs errors.
- UI prompts and confirmations: If preferences are missing, `proposeReminder` enqueues a `prompt_set_prefs` UI command and asks the user to confirm their timezone and email. Once preferences exist, the tool creates the reminder and posts a confirmation with a localized, human-readable date.
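The per-minute sweep reduces to a small selection function. In the real system the status change is a database patch and each queued reminder gets a scheduled `sendEmail` action, but the selection logic looks roughly like this:

```typescript
type Rem = { id: string; dueAt: number; status: "pending" | "queued" | "sent" };

// Pick reminders that are due and still pending, and mark them queued.
function sweepAndQueue(reminders: Rem[], now: number): string[] {
  const queued: string[] = [];
  for (const r of reminders) {
    if (r.status === "pending" && r.dueAt <= now) {
      r.status = "queued"; // real system: db patch + scheduled sendEmail
      queued.push(r.id);
    }
  }
  return queued;
}
```

The status transition to `queued` before sending matters: it prevents the next sweep, one minute later, from picking up the same reminder twice.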
User Interaction via Tools
- `proposeReminder`: Accepts a title, message, and `dueAtMs`. It requires two explicit confirmations (user intent and exact time) before being called. It checks preferences, handles deduplication, and confirms creation with the user.
- `cancelReminder`: Cancels a reminder by ID when the agent is confident which one to cancel, then confirms with the user.
Extensibility
The reminder mechanism can be extended to support additional tools like Slack messages, calendar events, or push notifications by implementing new send actions and preference fields. The modular design enables LLM-decided action: once a tool exists, it can be used by any process.
Integrations and Tooling
Gmail Integration
- Search and read: The Gmail integration allows users to search and read emails. `gmail_search` uses the API to query messages; results are formatted with IDs and snippets for easy selection. `gmail_read` reads a specific email and formats its content for the agent.
- Draft and send: Drafting uses LLM reasoning in the EA thread to produce a full email (subject, body, recipient) from the job arguments.
External Actions and Idempotency
- ID management: Each send job includes a `contentHash`, `eaThreadId`, and `targetKey`. Before sending, the system queries the `send_log` table to ensure the same content hasn't already been sent in the thread.
- Provider handling: Currently only Gmail is supported via the Google API with OAuth. Other providers could be added with similar actions and logs.
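The `send_log` check reduces to set membership on a thread-scoped content hash; this in-memory sketch stands in for the actual table query.

```typescript
// Stand-in for the send_log table: keys are "threadId:contentHash".
const sendLog = new Set<string>();

function shouldSend(eaThreadId: string, contentHash: string): boolean {
  const key = `${eaThreadId}:${contentHash}`;
  if (sendLog.has(key)) return false; // same content already sent in this thread
  sendLog.add(key);
  return true;
}
```

Scoping the key to the thread means identical content can still be sent legitimately from a different EA, while retries within one thread stay idempotent.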
Lessons Learned and Future Direction
Architectural Insights
- Centralized orchestrator simplifies UI and UX: Keeping all user interaction in the OA and isolating side-effects behind tools preserves a single persona and prevents the LLM from leaking internal reasoning. This makes fine-tuning persona and response style easier.
- Persistent agents improve context: Reusing EAs for related tasks reduces redundant prompting and provides memory of previous actions. The agent roster further improves this by tracking semantic context and alias usage for better routing. It also prevents context bloat inside the OA thread.
- Asynchronous design balances responsiveness and reliability: The barrier/job system lets long-running operations run without blocking the user. Tools like `wait` and `resume` maintain conversation flow while tasks complete in the background.
Challenges
- Latency: Job scheduling, router classification, and external APIs all introduce delays. Ensuring the OA kept the user informed was critical — often solved with status messages rather than shaving milliseconds.
- Complexity: The interplay between jobs, barriers, EAs, and routing adds many layers. Comprehensive debug logs and brokers were critical to observe how the LLM was reasoning and adjust accordingly.
- User interaction: Building a dynamic system prompt was key to ensuring the OA understood its role. Early on, it treated the thread like a standard LLM conversation instead of using tools. Prompting and design adjustments fixed this.
Future Enhancements
- Embedding-based routing: Use vector embeddings for semantic descriptions and conversation history to enable smarter reuse of agents across domains.
- Context summarization and updates: Extend semantic updates to summarize context after N messages or tasks, keeping descriptions current for better routing.
- Extended connectors: Integrate with more tools, like Slack, calendars, GitHub, or research systems. The router is domain-agnostic and can map new capabilities with minimal changes.
- Proactive suggestions: Leverage semantic context to suggest follow-ups or reminders ("We discussed a Q2 report, should I schedule a reminder to send it next week?").
Reflection
This was easily the most ambitious system I've ever built. At first, it was just a theory sketched out in Excalidraw. I didn't know if it would even be possible to get all the moving parts — orchestration, execution, persistence, semantic routing — to work together.
What I discovered:
- Simple primitives scale far. The entire system really boils down to two internal actions (send a message, wait for an agent to finish) and one external action (update the user). Everything else is just composition. That realization changed how I think about agentic systems.
- Debugging is design. Half the battle was building logs and brokers just to see how the LLM was “thinking.” Without transparency, the orchestration loop collapses into a sea of unknowns.
- Persistence is everything. A new EA every time technically “worked,” but it killed continuity. Building the agent roster with semantic contextualization was the turning point — suddenly agents felt alive, not disposable.
- Latency is a human problem, not just a technical one. The biggest challenge wasn't making the system faster; it was keeping the user informed. A witty status message can smooth over a 5-second delay better than shaving milliseconds off a job.
Where this leaves me:
This project was proof that multi-agent orchestration is practical. The architecture is modular, extensible, and already hints at real applications: from AI-powered email triage to proactive reminders to full productivity suites.
But more importantly, it reminded me why I build in the first place. When I finally got the system to search an email, draft a response, and send it off — autonomously, with continuity — I felt the same spark I did back at ShellHacks when we proved an AI tutor could adapt in real-time. Pure dopamine.
The frontier of AI isn't "smart chatbots," but systems that coordinate memory, reasoning, and tools into something that feels closer to a collaborator. The future of interfaces is still to be discovered, but this project was my first serious step into that frontier. And it won't be my last.
I plan to have an open-source version of the project available by the end of October. If you'd like to see more about the application itself, please check out the Project Page.
© 2025 Kevin Willoughby. All rights reserved.