Cosmo voice calls — research + early plan
Findings, not yet a build brief.
executeClaudeCode() in src/agent.js is not modified beyond the bare minimum to (a) optionally enable token-level streaming for TTS and (b) accept a per-turn voice-mode system prompt overlay. No swap to Realtime models, no swap to OpenClaw, no swap to a different LLM. OpenClaw and friends are inspiration for the transport and the loop, not the brain.
- The shape we're building
- OpenClaw + reference voice loops, briefly
- What Cosmo already has
- The unified sessions question
- Making it feel like a call (PWA, audio, UI)
- Picking the loop (STT, brain, TTS, transport)
- Initiation surfaces — Telegram, Claude Code, later
- What v1 looks like end-to-end
- What's locked, what's open
- Decisions you need to make
- Phase 2 — phone number (note only)
- References
1. The shape we're building
One sentence: a /call command mints a tokenised URL → you open it in a browser → it feels like a call → Cosmo on the other end is the same Cosmo you already talk to in Telegram, but voice-tuned.
```mermaid
flowchart LR
  A["Telegram or Claude Code"] --> B[mint token]
  B --> C[("sessions doc<br/>holds context")]
  B --> D["https://cosmo-call.../c/&lt;token&gt;"]
  D --> E["browser opens<br/>call screen"]
  E --> F["mic + WebRTC/WS"]
  F --> G[STT stream]
  G --> H["Cosmo agent<br/>same brain"]
  H --> I[TTS stream]
  I --> E
  C --> H
```
The interesting word here is token. The token doesn't just authenticate the call — it resolves to a session server-side. The session knows which Telegram chat (or which Claude Code cwd + branch) the call was initiated from, the recent turns, the project context, the memory scope. So when you `/call` from your H2OS chat mid-conversation about a dispenser, Cosmo on the call already knows.
This means a separate decision is forced into the open: what is a "session" in Cosmo, exactly? The codebase already overloads the word three ways. We'll get to it (§4).
2. OpenClaw + reference voice loops, briefly
OpenClaw the project doesn't ship a voice mode in core. The community add-on is Purple-Horizons/openclaw-voice. Worth knowing the shape:
| Reference loop | Transport | STT | TTS | VAD / barge-in | Notable |
|---|---|---|---|---|---|
| openclaw-voice | WebSocket | Whisper (local, faster-whisper) | ElevenLabs Turbo v2.5, sentence streaming | Silero VAD server-side; barge-in not documented | "Voice never leaves your machine." Closest reference architecture. |
| Anthropic cookbook | WS end-to-end | ElevenLabs Scribe | ElevenLabs Flash WS (text-stream-in, audio-stream-out) | Sentence-aligned chunks | Recommends WebSocket TTS over HTTP TTS for prosodic continuity. ~31% TTFT reduction streaming Haiku. |
| LiveKit Agents + Anthropic plugin | WebRTC | Pluggable (Deepgram default) | Pluggable (Cartesia default) | Framework-handled. Real barge-in. | Production-grade. Decouples STT/LLM/TTS as plugins. Closest to "what if we wanted it bulletproof." |
| Pipecat | WebRTC or WS | Pluggable | Pluggable | Silero + their SmartTurn turn-taking model | Frame-based. Best-in-class turn detection. |
What's worth borrowing from any of these:
- Silero VAD in the browser via `@ricky0123/vad-web`. Runs as ONNX in an AudioWorklet, ~1ms per frame. `onSpeechStart` is the barge-in signal; `onSpeechEnd` finalises the user turn.
- WebSocket TTS over HTTP TTS when the provider supports it — text-stream in, audio-stream out, no sentence buffering, prosody preserved across chunks.
- Sentence-boundary streaming from LLM tokens to TTS, so the user starts hearing Cosmo before Cosmo is done thinking.
- Filler-while-tool-runs — when a tool call begins mid-turn, emit a short "looking at your calendar..." either via the prompt or via pre-baked audio.
What we explicitly are not borrowing:
- OpenClaw's brain. Cosmo's brain stays.
- OpenClaw's local Whisper. We already pay for OpenAI; we're not running a Whisper server. (Streaming STT provider is a separate decision — see §6.)
- LiveKit / Pipecat as a runtime dependency. We're a single-user agent; the framework overhead doesn't earn its keep at our scale. We copy the patterns, we don't adopt the framework.
3. What Cosmo already has
This is important to document because it changes what's "new build" vs "wire it up differently."
The existing voice path
- `/voicemode` (bot.js:1763-1773) — toggles a per-user in-memory flag. Not persisted. Only changes the output mode.
- Inbound voice notes (bot.js:1300-1342) — Telegram .ogg → OpenAI Whisper API → text → existing request queue.
- Outbound voice replies (bot.js:1058-1087) — full text → OpenAI `tts-1` with voice `nova` → .opus → Telegram `sendVoice`.
The existing path is file-based and half-duplex. Whole utterance up, whole utterance down. Zero streaming. Whisper API is batch-only — no streaming endpoint exists. This is fine for voice notes, fatal for a call.
One bit of leftover code worth noting: src/voice/ is a half-built Python "Hey Cosmo" wake-word menu-bar daemon. Parked, never wired up. Different problem (text injection into focused app), not relevant to /call.
What's reusable for v1
| Existing piece | Reusable in /call? | Why |
|---|---|---|
| `transcribeVoice()` (whole-file Whisper) | no | Whisper API has no streaming mode. Need a different STT. |
| `textToSpeech()` (whole-text OpenAI TTS) | partially | OpenAI `tts-1` has HTTP chunked streaming but no WebSocket. Acceptable for v1, suboptimal vs ElevenLabs WS. |
| `voiceMode` Map | no | In-memory per-user flag is the wrong shape — sessions are server-side, multi-surface. |
| `OPENAI_API_KEY` | yes | Already provisioned for Whisper + TTS. |
| `executeClaudeCode()` agent loop | yes — unchanged | The constraint. Add `includePartialMessages: true` for streaming, prepend voice-mode system prompt, otherwise untouched. |
| Firestore request queue | conceptually | The onSnapshot-listener pattern works for asynchronous queues, not for a live audio loop. The call needs a tighter direct path. |
| `buildTurnTopicsContext(message)` (memory router) | yes | Same memory loading per turn. |
4. The unified sessions question
The token-resolves-to-a-session premise forces a question we've been avoiding: what is a session? The codebase overloads the word at least three ways already:
- Claude Agent SDK session — the `sessionId` string returned by `query()`. Stored as `claudeSessionId` on the chat doc.
- Telegram chat state — the `chats` Firestore collection, with backwards-compat aliases `getSession` / `saveSession` in db.js:69-72.
- Claude Code transcripts — JSONL files in `~/.claude/projects/<slug>/`, each with their own `sessionId`.
A web call has nowhere to live in any of these. It needs at minimum a doc somewhere that holds: which originating context, when it started, which Cosmo session ID is active, optionally a name.
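A minimal sketch of what such a doc could hold; the collection and field names are illustrative, not decided:

```js
// Hypothetical session/conversation doc. Every field name is a placeholder.
const sessionDoc = {
  surface: "call",                      // 'telegram' | 'claude-code' | 'call'
  parentSessionId: "tg-someChatKey",    // originating context, if any
  claudeSessionId: null,                // filled in once the first agent turn runs
  origin: {                             // whatever the initiating surface knows
    telegramChatKey: "…",               // or cwd + gitBranch + transcriptPath for Claude Code
    project: "…",
  },
  name: null,                           // user-set or auto-named later (see below)
  startedAt: Date.now(),
  endedAt: null,
};
```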
Two paths fall out of the research:
Additive: a new `sessions` collection that references the existing things
- New collection. Surface enum: `telegram | claude-code | call`.
- Telegram sessions reference a `chats` doc by chatKey. No data moved.
- Claude Code sessions reference a transcript path + cwd + branch.
- Call sessions stand alone, optionally with a `parent_session_id` pointing back to the originator.
- Existing code paths untouched. Messages still keyed by chatId.

Migration: merge `chats` + `chat_contexts` into `sessions`
- One collection, one concept.
- Touches the messages-by-`chatId` query in db.js:99-103 and the `chatKey` compat logic.
- Real migration of historical messages and `chatIdMigration.js` compatibility paths.
- Cleaner long-term but a meaningful side-quest. Don't bundle it with voice v1 unless it's already on the roadmap.
And there's the naming problem. Adding sessions as a fourth meaning of "session" is asking for grief. Calling them conversations or threads would disambiguate. The user's call: do we colonise the word back, or pick a new one?
Naming the conversations
Whatever we call them, they need human-readable names. Sidebar showing abc123 · 2 days ago is unsearchable. Showing "Memory v2 dashboard work · 2 days ago" is not. Three options:
- User-set — `/name <text>`. Same pattern as `/project`. Always available, always overrideable. Default fallback to the auto-name if the user hasn't set one.
- Auto-named — Cosmo titles the conversation after the first 2-3 turns. Same trick ChatGPT does.
- Hybrid — auto-suggest after first few turns, user can override anytime. Probably the right answer.
The Claude Code session-start hook
Claude Code SessionStart hooks receive a JSON payload with session_id, transcript_path, cwd, source (startup | resume | clear | compact), and the project dir as $CLAUDE_PROJECT_DIR. Hooks can write context back into the conversation via stdout or via { hookSpecificOutput: { additionalContext: "..." } }. So a hook can:
- POST `{ surface: 'claude-code', cwd, branch, transcriptPath, claudeSessionId }` to a local Cosmo endpoint.
- Endpoint mints / fetches the session doc, returns its ID.
- Hook prints `"Cosmo session: <id>. To call, run /call."` back to the prompt context.
That gives Claude Code first-class membership in the same session universe Telegram has. The slash command /call then just shells out, looks at the most recent transcript line for the current sessionId, posts to /call/mint with that as the parent session, and prints the URL.
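A sketch of that hook script, assuming a local registration endpoint; the URL, port, and response shape are placeholders, not existing code:

```js
#!/usr/bin/env node
// SessionStart hook sketch. Reads the hook payload from stdin, registers the
// Claude Code session with a (hypothetical) local Cosmo endpoint, and feeds a
// one-liner back into the conversation context.
let raw = "";
process.stdin.on("data", (chunk) => (raw += chunk));
process.stdin.on("end", async () => {
  const { session_id, transcript_path, cwd, source } = JSON.parse(raw);

  const res = await fetch("http://localhost:8788/sessions/register", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      surface: "claude-code",
      claudeSessionId: session_id,
      transcriptPath: transcript_path,
      cwd,
      source,                                 // startup | resume | clear | compact
    }),
  });
  const { sessionId } = await res.json();     // response shape is an assumption

  console.log(
    JSON.stringify({
      hookSpecificOutput: {
        hookEventName: "SessionStart",
        additionalContext: `Cosmo session: ${sessionId}. To call, run /call.`,
      },
    })
  );
});
```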
5. Making it feel like a call
This is where the user wanted depth. The request was: even on web, make it feel like FaceTime / Google Meet, not like a Zoom side-tab.
PWA capabilities — what works in May 2026
| Feature | Status | Notes |
|---|---|---|
| Add-to-home-screen | stable | iOS Safari 16.4+. iOS 26 makes it default for installed sites. |
display: "standalone" (no browser chrome) | stable | Required for the native look. Note: "fullscreen" is not supported on iOS Safari. |
navigator.wakeLock.request("screen") | stable since iOS 18.4 | Was broken in installed PWAs on iOS until 18.4. Now reliable. Lock the screen on during a call. |
| Background audio when screen locks | broken on iOS PWAs | WebKit bug 198277. Audio cuts when the phone backgrounds or screen locks. Real limitation for v1. |
| Media Session API | stable | iOS Safari 14.5+. Lock-screen controls + metadata. Doesn't fix the background-audio gap. |
| Theme color (`<meta name="theme-color">`) | stable | Update dynamically to tint the iOS status bar during a call. |
| Safe-area insets | stable | Needed so end-call doesn't sit under the home indicator. |
| Haptic feedback (`navigator.vibrate`) | never iOS | Android-only. Mute-tap haptic only works on Android web. |
| Web Push (incoming-call notification) | PWA-only | iOS 16.4+ for installed PWAs. Required for "Cosmo is calling you." |
Manifest pattern:
{
"name": "Cosmo Calls",
"short_name": "Cosmo",
"display": "standalone",
"orientation": "portrait",
"theme_color": "#0f1115",
"background_color": "#0f1115",
"start_url": "/?source=pwa",
"icons": [{ "src": "/icon-512.png", "sizes": "512x512", "type": "image/png" }]
}
The audio loop — WebRTC vs WebSocket+Opus
The honest take from the research: for a single-user agent on a good network, WebSocket + Opus-in-AudioWorklet works. WebRTC is the better tech, but the transport-latency gap it closes (WS at ~200-500ms vs WebRTC at ~60-120ms) is dwarfed by LLM TTFT (~700ms+). Most production voice agents are migrating to WebRTC anyway, but it's a real infra step (STUN/TURN, optionally an SFU).
WebSocket + Opus:
- Browser `WebSocket` + `AudioWorklet`. No STUN/TURN, no SFU.
- Mic → AudioWorklet → Opus encode → WS → server (capture sketch below).
- TTS chunks → WS → AudioWorklet ring buffer → speakers.
- 200-500ms transport latency, dwarfed by LLM.
- Works fine for one user. Falls over on lossy mobile.

WebRTC:
- `RTCPeerConnection`, mandatory Opus, built-in jitter buffer + AEC + noise suppression.
- 60-120ms transport latency.
- Needs ICE/STUN/TURN; can fall back to TURN-over-TCP/443.
- Production posts split: some swore the move to WebRTC was worth it, others switched back to WS for simplicity at single-user scale.
- If we're going to be on bad mobile networks regularly (in cars, on the go), this is the answer.
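A sketch of the capture half of the WebSocket option. It ships raw 16-bit PCM for simplicity; an Opus encoder (a WASM build or MediaRecorder) would slot in just before the send. The helper name and framing are ours:

```js
// Registers an AudioWorklet that taps the mic and posts each 128-sample frame
// to the main thread, which converts Float32 → Int16 and ships it on the WS.
const workletSrc = `
  class MicTap extends AudioWorkletProcessor {
    process(inputs) {
      const ch = inputs[0][0];                 // mono input, 128-sample quantum
      if (ch) this.port.postMessage(ch.slice(0));
      return true;
    }
  }
  registerProcessor("mic-tap", MicTap);
`;

async function startMicStream(ws, stream) {
  const ctx = new AudioContext();
  const moduleUrl = URL.createObjectURL(
    new Blob([workletSrc], { type: "application/javascript" })
  );
  await ctx.audioWorklet.addModule(moduleUrl);

  const source = ctx.createMediaStreamSource(stream);
  const tap = new AudioWorkletNode(ctx, "mic-tap");
  tap.port.onmessage = ({ data: floats }) => {
    const pcm = new Int16Array(floats.length);      // Float32 [-1,1] → Int16
    for (let i = 0; i < floats.length; i++) {
      pcm[i] = Math.max(-1, Math.min(1, floats[i])) * 0x7fff;
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(pcm.buffer);
  };
  source.connect(tap).connect(ctx.destination);     // pull the graph; tap outputs silence
  // ctx.sampleRate is device-dependent (often 48kHz): resample to the STT
  // provider's expected rate server-side, or negotiate it on connect.
}
```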
VAD — barge-in is what makes it feel like a call
The single biggest "this feels alive" feature is interruption. You start to talk, Cosmo stops talking immediately. Without that, every call feels like a walkie-talkie.
The solved pattern is browser-side Silero VAD via @ricky0123/vad-web:
const myvad = await vad.MicVAD.new({
onSpeechStart: () => { /* user started — clear TTS ring buffer */ },
onSpeechEnd: (audio) => { /* Float32Array @ 16kHz — finalise STT */ },
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35,
minSpeechFrames: 3,
redemptionFrames: 8,
frameSamples: 1536,
});
~1ms per frame, runs as ONNX in an AudioWorklet thread, doesn't block the UI.
Audio playback — the AudioWorklet ring buffer pattern
For low-latency streaming TTS chunks, the right pattern is Web Audio AudioWorklet with a ring buffer. PCM chunks come in over the WS, get pushed into a SharedArrayBuffer, the worklet drains it sample-by-sample. Cancellation = drain the buffer, one message. Same pattern OpenAI Realtime, LiveKit, and Pipecat browser clients all use.
The legacy options (HTMLAudioElement per chunk, MSE) all have audible gaps or codec-prefix requirements that don't suit our use.
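A simplified sketch of that worklet, using a message-passing queue rather than a SharedArrayBuffer (the SAB variant is the lower-overhead version production clients use); the file and processor names are ours:

```js
// playback-worklet.js — runs in the AudioWorklet global scope.
// Main thread posts Float32Array PCM chunks; posting "clear" is the barge-in.
class PcmPlayer extends AudioWorkletProcessor {
  constructor() {
    super();
    this.queue = [];   // decoded PCM chunks waiting to be played
    this.offset = 0;   // read position inside queue[0]
    this.port.onmessage = ({ data }) => {
      if (data === "clear") { this.queue = []; this.offset = 0; }  // cancel playback
      else this.queue.push(data);                                  // Float32Array chunk
    };
  }
  process(_inputs, outputs) {
    const out = outputs[0][0];               // mono output, 128 samples per call
    for (let i = 0; i < out.length; i++) {
      const head = this.queue[0];
      if (!head) { out[i] = 0; continue; }   // underrun → silence, no crackle
      out[i] = head[this.offset++];
      if (this.offset >= head.length) { this.queue.shift(); this.offset = 0; }
    }
    return true;
  }
}
registerProcessor("pcm-player", PcmPlayer);
```

The main thread decodes incoming WS chunks to Float32 PCM, posts them with `node.port.postMessage(chunk)`, and posts `"clear"` when VAD fires `onSpeechStart`.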
The visual call screen
The bits that make a web page feel like a call screen, not like a settings page:
- Big avatar with audio-driven pulse. Route the TTS playback stream through an `AnalyserNode`, compute RMS in `requestAnimationFrame`, scale a CSS transform on the avatar. The avatar pulses when Cosmo is talking; same for the user's mic ring when they're talking. The single highest-impact visual (sketched after this list).
- Connection-state stages. `dialing → ringing → connected` with a subtle animation per stage. Fake it slightly if needed — even 400ms of "connecting" feels honest.
- Bright red end-call button, large, bottom-right. Material 3 style: oversized, unmistakable. Far from mute (which goes left).
- Mute button with strong visual feedback. No haptic on iOS, so the visual must carry it.
- Persistent call timer in MM:SS. Tiny touch, big effect.
- Wake lock on the whole time the call is active. Screen never dims.
- Status-bar tinted via `theme-color` so the iOS bar matches the call screen instead of clashing.
- Safe-area padding on the bottom so end-call sits above the home indicator.
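A sketch of the avatar pulse, assuming the TTS playback node is reachable to tap and an avatar element exists (both names are ours):

```js
// Scales the avatar with the RMS of whatever plays through sourceNode
// (e.g. the pcm-player worklet node). Tapping via an AnalyserNode does not
// affect the audible path.
function attachPulse(ctx, sourceNode, el = document.getElementById("avatar")) {
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 512;
  sourceNode.connect(analyser);
  const buf = new Float32Array(analyser.fftSize);
  (function tick() {
    analyser.getFloatTimeDomainData(buf);
    let sum = 0;
    for (const s of buf) sum += s * s;
    const rms = Math.sqrt(sum / buf.length);               // ~0 silent, ~0.3+ loud
    el.style.transform = `scale(${1 + Math.min(rms * 2, 0.25)})`;
    requestAnimationFrame(tick);
  })();
}
```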
Mic permissions UX
Mic permissions are fragile, especially on iOS Safari. Two non-negotiable rules:
- HTTPS only. `getUserMedia` doesn't prompt on `http://` except `localhost`.
- Synchronous user-gesture. On iOS, `getUserMedia` must be called synchronously inside a tap handler. Don't `await` anything before it. Pattern: tap button → call `getUserMedia` first → `await` everything else after.
Best practice: a "Tap to start the call" splash that explicitly says mic access will be requested. No auto-prompt on page load — looks dodgy and gets denied.
6. Picking the loop
STT — streaming, not Whisper
Whisper API is batch-only. For a real call we need streaming. The realistic options:
| Provider | Endpoint | TTFT P50 | Pricing | Notes |
|---|---|---|---|---|
| Deepgram Nova-3 / Flux | WS | <300ms (Flux: <150ms) | $0.0077/min PAYG | Workhorse. Industry default. Workers AI also exposes it. |
| AssemblyAI Universal-3 Pro Streaming | WS | ~150ms | $0.0025/min | Best entity accuracy. Cheapest realistic option. |
| OpenAI Whisper | HTTPS (batch) | ~1s+ for 5s clip | $0.006/min | What Cosmo uses today. No streaming. Fatal for a call. |
| OpenAI Realtime API | WS or WebRTC | ~500ms | Token-based audio | Different model. Speech-to-speech, conflicts with "brain stays as-is." |
Streaming STT is non-negotiable for the call to feel right. Deepgram is the safe pick; AssemblyAI is cheaper. Whisper is out for the call path (still fine for voice notes).
TTS — the ElevenLabs vs OpenAI fork
This is the most interesting tradeoff in the stack.
OpenAI `tts-1` with voice "nova" (continuity)
- Same voice the existing Telegram voice notes use. Cosmo sounds like Cosmo across surfaces.
- HTTP chunked streaming only — no WebSocket TTS. Confirmed via OpenAI docs and community thread.
- ~500ms TTFB; sentence-buffering at our end to keep playback smooth.
- Already on the bill. Already provisioned. Zero new vendor.
- To get true WebSocket TTS from OpenAI you have to use the full Realtime speech-to-speech API, which conflicts with the brain-stays-as-is rule.
ElevenLabs Flash v2.5 WS (latency)
- True bidirectional WebSocket: text-stream in, audio-stream out. Prosody preserved across chunks.
- ~50-75ms model TTFB. ~400-500ms total with network in real conditions.
- Recommended by Anthropic's own cookbook for low-latency Claude voice.
- New voice for Cosmo. Different sound across surfaces. Voice cloning is an option (Cosmo could sound like a custom voice we pick).
- ~$0.30/1k chars on Pro tier. New monthly bill.
The decision is: continuity vs latency vs voice quality vs new bill. There isn't a wrong answer; there's a preference.
Side note: Cartesia Sonic Turbo has the lowest model TTFB in the comparison set (~40ms) and is a strong technical choice, but the user explicitly named ElevenLabs and OpenAI as the two providers in scope, so it's a footnote, not a contender.
Brain — keeping it unchanged, with one switch
The Agent SDK supports token-level streaming via includePartialMessages: true. Today's executeClaudeCode() doesn't set it. With it on, the iterator additionally yields stream_event chunks containing raw Anthropic API streaming events. Per the docs:
for await (const message of query({ prompt, options: { includePartialMessages: true, ...existing } })) {
if (message.type === "stream_event") {
const event = message.event;
if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
// ship event.delta.text to TTS sentence buffer
}
if (event.type === "content_block_start" && event.content_block.type === "tool_use") {
// emit "looking at your <tool name>..." filler audio
}
}
// existing AssistantMessage and result handling unchanged
}
Telegram path is untouched (it ignores stream_event); call path adds the new branch. Per-turn voice-mode system prompt overlay is one extra append on the system prompt when the surface is call.
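A sketch of the "TTS sentence buffer" the comment above refers to. The boundary regex is deliberately naive (abbreviations, decimals, and ellipses would need real handling), and `sendToTTS` is our name for whatever feeds the TTS stream:

```js
// Accumulates streamed text deltas and flushes complete sentences to TTS so
// playback starts before the turn finishes.
function makeSentenceBuffer(sendToTTS) {
  let pending = "";
  return {
    push(delta) {
      pending += delta;
      let m;
      // Flush every "text + sentence-ending punctuation + whitespace" prefix we have.
      while ((m = pending.match(/^[\s\S]*?[.!?]\s+/))) {
        sendToTTS(m[0].trim());
        pending = pending.slice(m[0].length);
      }
    },
    flush() {
      // End of turn: ship whatever is left, punctuation or not.
      if (pending.trim()) sendToTTS(pending.trim());
      pending = "";
    },
  };
}
```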
StreamEvent messages are not emitted when explicit maxThinkingTokens is set. Cosmo currently uses thinking: { type: 'adaptive' }. The docs further say "thinking is disabled by default in the SDK, so streaming works unless you enable it" — which suggests adaptive thinking may suppress stream events. This needs a 30-min spike to verify before we commit to token-level streaming. If incompatible, voice-mode either disables adaptive thinking on call turns, or accepts whole-block streaming with sentence-buffering at TTS time (slightly higher TTFB, still works).
Voice-mode system prompt
From the prompting research (ElevenLabs guide, LiveKit prompting voice agents, Vapi guide), voice-tuned prompts produce responses 60-70% shorter than text equivalents. The overlay we'd append:
- You're on a phone call. Speak in 1-3 sentence turns. One question per turn.
- No markdown — no `**bold**`, no asterisks, no bullets, no headers, no code blocks.
- Spell out numbers when natural ("twenty-three dollars", not "$23") — small numbers, time-of-day, money.
- Don't narrate the user's question back to them.
- Before any tool call, emit a short ack first ("looking now..." or similar), then call the tool.
- Use natural fillers ("hmm", "let me see") only when they emerge organically. Don't force it.
- If interrupted, stop cleanly and listen.
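Wiring the overlay is one conditional append at the point where the system prompt is assembled (the getSystemPrompt() call site noted in §9); the constant and wrapper names here are ours:

```js
// Appended only when the turn comes from the call surface; every other surface
// gets the existing prompt untouched.
const VOICE_OVERLAY = [
  "You're on a phone call. Speak in 1-3 sentence turns. One question per turn.",
  "No markdown: no asterisks, bullets, headers, or code blocks.",
  "Spell out numbers when natural. Acknowledge briefly before any tool call.",
  "If interrupted, stop cleanly and listen.",
].join("\n");

function systemPromptFor(surface) {
  const base = getSystemPrompt();                    // existing builder, unchanged
  return surface === "call" ? `${base}\n\n${VOICE_OVERLAY}` : base;
}
```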
Aborting mid-stream
Today's interrupt path goes through Firestore (cancelRequest(), polled by the agent). For a call, the abort needs to be sub-100ms — too slow if it round-trips Firestore. The Agent SDK accepts an AbortController; the call path should hold it locally and abort directly when VAD fires onSpeechStart. Worth confirming in the SDK source before locking it in.
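A sketch of what that could look like in cosmo-voice, assuming the SDK option is spelled `abortController` (the very thing the paragraph above says to confirm) and a barge-in message type of our own design:

```js
// One AbortController per in-flight turn, held next to the call's WebSocket.
let activeTurn = null;

ws.on("message", (raw) => {
  const msg = JSON.parse(raw);
  if (msg.type === "barge-in" && activeTurn) {   // sent from VAD onSpeechStart
    activeTurn.abort();                          // stop the agent stream directly
    ttsStream?.close();                          // stop synthesising the reply (our handle)
  }
});

async function runTurn(prompt, options) {
  activeTurn = new AbortController();
  try {
    for await (const message of query({
      prompt,
      options: { ...options, abortController: activeTurn, includePartialMessages: true },
    })) {
      // …stream_event / result handling as in the §6 snippet…
    }
  } finally {
    activeTurn = null;
  }
}
```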
7. Initiation surfaces
The same minting endpoint serves all surfaces. The only thing that varies is what context the surface ships in the mint request.
Telegram /call
- User types `/call` in any chat.
- Bot reads chat context (existing `getChatContextWithMigration`), POSTs `{ surface: 'telegram', telegramChatKey, recentTurns, project }` to `/call/mint`.
- Endpoint returns a single-use URL (TTL: 5 min to claim, then the call session lives for the call). A mint-handler sketch follows this list.
- Bot replies with the URL as a clickable link.
- User taps → browser opens the call screen.
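A sketch of the mint handler referenced above. The route, token format, TTL field, and domain are placeholders (the real hostname is elided throughout this doc), and `db` is assumed to be a firebase-admin Firestore handle:

```js
import { randomBytes } from "node:crypto";

// POST /call/mint: body is { surface, ...context } as described above.
// Creates the session doc keyed by token and returns the single-use URL.
async function mintCall(req, res, db) {
  const { surface, ...context } = req.body;
  const token = randomBytes(24).toString("base64url");
  await db.collection("sessions").doc(token).set({
    surface,                                   // 'telegram' | 'claude-code' | 'call'
    context,                                   // chatKey / recentTurns / cwd, per surface
    status: "minted",
    createdAt: Date.now(),
    claimBy: Date.now() + 5 * 60 * 1000,       // 5-minute window to open the URL
  });
  res.json({ token, url: `https://cosmo-call.example/c/${token}` });
}
```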
Claude Code /call
- Slash command (`~/.claude/commands/call.md`) shells out to a node script.
- Script reads the most recent transcript line for the current `sessionId`, `cwd`, `gitBranch`.
- POSTs `{ surface: 'claude-code', claudeSessionId, cwd, gitBranch, transcriptPath }` to `/call/mint`.
- Prints the URL into the response. User cmd-clicks.
An optional SessionStart hook (off by default until we've been running it a while) can pre-mint a session doc on Claude Code launch so even if you don't /call, your CC sessions are first-class in the same Firestore collection — useful for cross-session memory and "what was I doing yesterday" queries.
Mac menu-bar / native (later)
A small menu-bar app could open a fixed URL like cosmo-call.../c/menubar?fresh=1 to start an "ambient" call with no parent session — pure new conversation, no context inherited. Or it can read the system clipboard for a token URL the user just got and open straight to it. Out of scope for v1, easy to add later.
8. What v1 looks like end-to-end
```mermaid
sequenceDiagram
  participant User
  participant Telegram
  participant Bot
  participant Mint
  participant FS as Firestore (sessions)
  participant Browser
  participant Voice as cosmo-voice (new process)
  participant STT as Streaming STT
  participant Brain as cosmo-agent (unchanged)
  participant TTS as Streaming TTS
  User->>Telegram: /call
  Telegram->>Bot: command
  Bot->>Mint: POST { surface, context }
  Mint->>FS: create session doc
  Mint-->>Bot: { url, token }
  Bot-->>User: clickable URL
  User->>Browser: tap URL
  Browser->>Voice: WS connect with token
  Voice->>FS: resolve session
  loop per turn
    User->>Browser: speak
    Browser->>Voice: PCM chunks (WS)
    Voice->>STT: stream
    STT-->>Voice: transcript deltas
    Voice->>Brain: query() with includePartialMessages
    Brain-->>Voice: token deltas
    Voice->>TTS: text stream
    TTS-->>Voice: audio chunks
    Voice-->>Browser: audio chunks (WS)
    Browser->>User: speakers
    User->>Browser: barge-in (VAD onSpeechStart)
    Browser->>Voice: cancel
    Voice->>Brain: AbortController.abort()
    Voice->>TTS: close
    Voice->>Browser: clear ring buffer
  end
  User->>Browser: end call
  Browser->>Voice: close WS
  Voice->>FS: session ended, transcript saved
```
New components
- `cosmo-voice` — new PM2 process. Holds the WS connections. Runs the per-call loop. Talks to the existing `cosmo-agent` in-process (Node import) or over a tighter local IPC than Firestore (decision below).
- Mint endpoint — small HTTP API. Lives in `cosmo-voice` or `cosmo-web`.
- Web app — static site (Cloudflare Pages or served from `cosmo-web`). Manifest, service worker, the call-screen UI, the AudioWorklet, the VAD setup.
- `sessions` Firestore collection (or whatever we end up calling it).
- `/call` in bot.js — registered command, mint + reply.
- `/call` Claude Code slash command — under `~/.claude/commands/call.md`.
What stays exactly the same
- `executeClaudeCode()`. Adds `includePartialMessages: true` when called from the voice surface. Prepends a voice-mode system prompt block when surface is `call`. Otherwise unchanged.
- Memory router (`buildTurnTopicsContext`). Same.
- All tools, all integrations, all skills.
- Telegram voice notes. The existing `/voicemode` path is untouched.
9. What's locked, what's open
| Decision | Status | Notes |
|---|---|---|
| Brain stays as Cosmo agent | locked | The constraint that defines the whole shape of v1. |
| Web is the v1 transport | locked | Telephony is phase 2. |
| `/call` in Telegram + Claude Code | locked | Token-bearing URL pattern. Same mint endpoint serves both. |
| Tokens carry session context | locked | That's the whole point of the URL pattern. |
| Sessions collection introduced | locked in principle | Additive vs migration is open. Naming is open. |
| Browser-side Silero VAD for barge-in | locked | @ricky0123/vad-web. The pattern is too solved to relitigate. |
| AudioWorklet ring-buffer playback | locked | Same. The right pattern for low-latency streaming TTS. |
| PWA call screen with wake lock + theme color + safe areas | locked | The "feels like a call" tier. Background-audio gap on iOS is accepted (call works while screen on). |
| STT: streaming required (Whisper out) | locked | Whisper is batch only. Provider choice (Deepgram vs AssemblyAI) is open. |
| Voice-mode system prompt overlay | locked | Adds at getSystemPrompt() call site when surface = call. |
| TTS provider | open | OpenAI tts-1 nova (continuity) vs ElevenLabs Flash v2.5 WS (latency). |
| STT provider | open | Deepgram Nova-3 vs AssemblyAI Universal-3. |
| Adaptive thinking + streaming compatibility | open | Needs a 30-min spike. Real spec-blocker. |
| Transport: WebSocket vs WebRTC | open | WS is fine for v1; WebRTC is better for mobile / lossy networks. |
| Sessions: additive vs migration | open | Additive is recommended for v1; migration as separate later project. |
| Sessions naming | open | sessions overloads existing terms. conversations or threads disambiguate. |
| Auto-naming approach | open | User-set / auto / hybrid. Hybrid is probably right. |
| Hosting (cosmo-voice on Mac via tunnel vs CF Workers vs hybrid) | open | Mac-via-tunnel is the lowest-friction path that mirrors existing patterns. |
| Audio retention (privacy) | open | Transcripts go to messages. Raw audio: keep / drop / how long? |
| Concurrent call cap | open | One at a time is fine for v1; architecture should not preclude many. |
10. Decisions you need to make
In rough order of impact:
tts-1 with "nova" — Cosmo sounds the same across surfaces. HTTP chunked streaming, ~500ms TTFB. Already paid for.ElevenLabs Flash v2.5 WS — different (probably better) voice. True streaming. ~75ms model TTFB. New monthly bill (~$0.30/1k chars).
The cookbook recommends ElevenLabs. The continuity argument favours OpenAI. Your call.
AssemblyAI Universal-3: ~150ms TTFT, $0.0025/min (cheaper), best entity accuracy.
No wrong answer. AssemblyAI is the pragmatic pick if cost matters; Deepgram has more developer mindshare.
sessions overloads three existing meanings. conversations or threads disambiguate. Pick one.
thinking: { type: 'adaptive' } disables StreamEvent emission. Needs a 30-min spike before we commit to token-streaming. If incompatible: either turn off adaptive thinking on call turns (small quality dent) or accept whole-block streaming with sentence-buffering (slightly higher TTFB).Should we just do this spike now — before any of the above decisions are real? Yes.
(a) live on the Mac as a new PM2 process (
cosmo-voice), exposed via Cloudflare Tunnel — same pattern as deep-link redirector. Lowest friction.(b) live on Cloudflare Workers + Durable Objects, relaying via tunnel back to the Mac for the brain call. More moving parts but edge-native.
Recommend (a) for v1.
messages collection naturally. Do we keep raw mic audio? For how long? For what purpose? Privacy decision — your call.
11. Phase 2 — phone number (note only)
When we add a real phone number later, the cleanest path is Twilio ConversationRelay, not raw Media Streams:
- Twilio handles STT and TTS server-side. Connects to our WebSocket as text in / text out.
- Our WS receives `{ type: "prompt", voicePrompt: "<user transcript>" }` and replies with streamed `{ type: "text", token: "...", last: false }` chunks.
- Interruption arrives as `{ type: "interrupt", utteranceUntilInterrupt: "..." }` so our local conversation history can be rewound to exactly what the user heard.
- Architecturally identical to web-v1 from the server's perspective: the same stream-of-tokens-from-Claude → stream-of-text-tokens-to-the-other-side loop. Twilio CR replaces the browser as the audio peer.
What would paint v1 into a corner:
- Doing TTS server-side and sending only audio chunks to the browser. Twilio CR doesn't accept audio in that mode — it wants text. Keep the agent's text-token stream as the canonical output, with server-side TTS as a parallel adapter for browser audio, and both transports plug in cleanly later.
- Doing STT only in the browser. Twilio sends transcripts as text — server-side ingest path needs to exist regardless.
v1 should structure the agent-facing interface as (input: text-stream | audio-stream) → agent → (output: text-stream) so swapping browser ↔ Twilio is a transport adapter, not an agent change.
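A sketch of that interface: the agent side only ever emits a text-token stream, and each transport supplies an `onTextToken` adapter (the browser path feeds it into server-side TTS; the Twilio path maps it to `{ type: "text" }` frames). The function and callback names are ours:

```js
// One turn, transport-agnostic: text in, token stream out.
async function runCallTurn({ userText, options, onTextToken }) {
  for await (const message of query({ prompt: userText, options })) {
    if (
      message.type === "stream_event" &&
      message.event.type === "content_block_delta" &&
      message.event.delta.type === "text_delta"
    ) {
      onTextToken(message.event.delta.text, /* last */ false);
    }
  }
  onTextToken("", /* last */ true); // the Twilio adapter maps this to { last: true }
}
```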
12. References
Full research findings at specs/research/voice-call-findings.md. Key external sources:
- openclaw-voice — community Claude voice chat add-on
- Anthropic cookbook — low-latency Claude voice with ElevenLabs
- LiveKit Agents — Anthropic plugin
- LiveKit — WebRTC vs WebSocket for voice agents
- Pipecat — Python voice agent framework
- Claude Agent SDK — streaming output docs
- Claude Code hooks reference
- `@ricky0123/vad-web` — Silero VAD in browser
- ElevenLabs WebSocket TTS
- ElevenLabs prompting guide for voice agents
- LiveKit — prompting voice agents to sound more realistic
- OpenAI Realtime WebRTC (for reference, not adoption)
- Twilio ConversationRelay — Anthropic streaming pattern
- WebKit bug 198277 — iOS PWA background audio
- WebKit bug 254545 — iOS PWA wake lock (resolved)