Cosmo voice calls — research + early plan

Findings, not yet a build brief. Sun 3 May 2026.

What this doc is. The user wants to talk to Cosmo — real, two-way, conversational, distinct from today's voice-note ping-pong. We agreed v1 is a web-based call initiated from a slash command (in Telegram and in Claude Code), opening a unique URL whose token carries the originating session's context. Telephony (Twilio) is later. The brain stays exactly as it is — same Claude Agent SDK loop, same tools, same memory. We just put a voice transport in front of it. This doc captures what the deep research turned up, the decisions that fell out cleanly, and the forks the user still needs to call.
Hard rule. Cosmo's brain stays as-is. executeClaudeCode() in src/agent.js is not modified beyond the bare minimum to (a) optionally enable token-level streaming for TTS and (b) accept a per-turn voice-mode system prompt overlay. No swap to Realtime models, no swap to OpenClaw, no swap to a different LLM. OpenClaw and friends are inspiration for the transport and the loop, not the brain.
Contents
  1. The shape we're building
  2. OpenClaw + reference voice loops, briefly
  3. What Cosmo already has
  4. The unified sessions question
  5. Making it feel like a call (PWA, audio, UI)
  6. Picking the loop (STT, brain, TTS, transport)
  7. Initiation surfaces — Telegram, Claude Code, later
  8. What v1 looks like end-to-end
  9. What's locked, what's open
  10. Decisions you need to make
  11. Phase 2 — phone number (note only)
  12. References

1. The shape we're building

One sentence: a /call command mints a tokenised URL → you open it in a browser → it feels like a call → Cosmo on the other end is the same Cosmo you already talk to in Telegram, but voice-tuned.

flowchart LR A["/call from
Telegram or Claude Code"] --> B[mint token] B --> C[(sessions doc
holds context)] B --> D[https://cosmo-call.../c/<token>] D --> E[browser opens
call screen] E --> F[mic + WebRTC/WS] F --> G[STT stream] G --> H[Cosmo agent
same brain] H --> I[TTS stream] I --> E C --> H

The interesting word here is token. The token doesn't just authenticate the call — it resolves to a session server-side. The session knows which Telegram chat (or which Claude Code cwd + branch) the call was initiated from, the recent turns, the project context, the memory scope. So when you `/call` from your H2OS chat mid-conversation about a dispenser, Cosmo on the call already knows.
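In code terms, the claim step is just a lookup plus an expiry check. A minimal sketch, assuming a Firestore sessions collection; every collection and field name here is a placeholder, not a final schema:

// Hypothetical claim/resolve step for /c/<token>. Names are placeholders.
async function resolveCallToken(db, token) {
  const snap = await db.collection('sessions').where('token', '==', token).limit(1).get();
  if (snap.empty) throw new Error('unknown token');
  const doc = snap.docs[0];
  const session = doc.data();
  // Enforce the single-use claim window; the session itself lives on for the call.
  if (session.claimedAt || session.tokenExpiresAt < Date.now()) throw new Error('token expired');
  await doc.ref.update({ claimedAt: Date.now() });
  return session; // carries the originating chat / cwd context for the first turn
}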

This means a separate decision is forced into the open: what is a "session" in Cosmo, exactly? The codebase already overloads the word three ways. We'll get to it (§4).

2. OpenClaw + reference voice loops, briefly

OpenClaw the project doesn't ship a voice mode in core. The community add-on is Purple-Horizons/openclaw-voice. Worth knowing the shape:

| Reference loop | Transport | STT | TTS | VAD / barge-in | Notable |
| --- | --- | --- | --- | --- | --- |
| openclaw-voice | WebSocket | Whisper (local, faster-whisper) | ElevenLabs Turbo v2.5, sentence streaming | Silero VAD server-side; barge-in not documented | "Voice never leaves your machine." Closest reference architecture. |
| Anthropic cookbook (ElevenLabs low-latency) | WS end-to-end | ElevenLabs Scribe | ElevenLabs Flash WS (text-stream-in, audio-stream-out) | Sentence-aligned chunks | Recommends WebSocket TTS over HTTP TTS for prosodic continuity. ~31% TTFT reduction streaming Haiku. |
| LiveKit Agents + Anthropic plugin | WebRTC | Pluggable (Deepgram default) | Pluggable (Cartesia default) | Framework-handled; real barge-in | Production-grade. Decouples STT/LLM/TTS as plugins. Closest to "what if we wanted it bulletproof." |
| Pipecat | WebRTC or WS | Pluggable | Pluggable | Silero + their SmartTurn turn-taking model | Frame-based. Best-in-class turn detection. |

What's worth borrowing from any of these:

What we explicitly are not borrowing:

3. What Cosmo already has

This is important to document because it changes what's "new build" vs "wire it up differently."

The existing voice path

The existing path is file-based and half-duplex. Whole utterance up, whole utterance down. Zero streaming. Whisper API is batch-only — no streaming endpoint exists. This is fine for voice notes, fatal for a call.

One bit of leftover code worth noting: src/voice/ is a half-built Python "Hey Cosmo" wake-word menu-bar daemon. Parked, never wired up. Different problem (text injection into focused app), not relevant to /call.

What's reusable for v1

| Existing piece | Reusable in /call? | Why |
| --- | --- | --- |
| transcribeVoice() (whole-file Whisper) | no | Whisper API has no streaming mode. Need a different STT. |
| textToSpeech() (whole-text OpenAI TTS) | partially | OpenAI tts-1 has HTTP chunked streaming but no WebSocket. Acceptable for v1, suboptimal vs ElevenLabs WS. |
| voiceMode Map | no | In-memory per-user flag is the wrong shape; sessions are server-side and multi-surface. |
| OPENAI_API_KEY | yes | Already provisioned for Whisper + TTS. |
| executeClaudeCode() agent loop | yes, unchanged | The constraint. Add includePartialMessages: true for streaming, prepend the voice-mode system prompt, otherwise untouched. |
| Firestore request queue | conceptually | The onSnapshot-listener pattern works for asynchronous queues, not for a live audio loop. The call needs a tighter direct path. |
| buildTurnTopicsContext(message) (memory router) | yes | Same memory loading per turn. |

4. The unified sessions question

The token-resolves-to-a-session premise forces a question we've been avoiding: what is a session? The codebase overloads the word at least three ways already:

  1. Claude Agent SDK session — the sessionId string returned by query(). Stored as claudeSessionId on the chat doc.
  2. Telegram chat state — the chats Firestore collection, with backwards-compat aliases getSession/saveSession in db.js:69-72.
  3. Claude Code transcripts — JSONL files in ~/.claude/projects/<slug>/, each with their own sessionId.

A web call has nowhere to live in any of these. It needs, at minimum, a doc somewhere that holds the originating context, when the call started, which Cosmo session ID is active, and optionally a name.

Two paths fall out of the research:

Additive — new sessions collection that references the existing things
Lower risk. Fast.
  • New collection. Surface enum: telegram | claude-code | call.
  • Telegram sessions reference a chats doc by chatKey. No data moved.
  • Claude Code sessions reference a transcript path + cwd + branch.
  • Call sessions stand alone, optionally with a parent_session_id pointing back to the originator.
  • Existing code paths untouched. Messages still keyed by chatId.
Migration — fold chats + chat_contexts into sessions
Higher value. Higher cost.
  • One collection, one concept.
  • Touches the messages-by-chatId query in db.js:99-103 and the chatKey compat logic.
  • Real migration of historical messages and chatIdMigration.js compatibility paths.
  • Cleaner long-term but a meaningful side-quest. Don't bundle it with voice v1 unless it's already on the roadmap.
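Back to the additive path: for concreteness, a sketch of what its session doc might hold. Every field name here is illustrative, not a schema proposal:

// Illustrative additive session doc; all field names are assumptions.
const exampleSession = {
  surface: 'call',                      // 'telegram' | 'claude-code' | 'call'
  parent_session_id: 'sess_abc123',     // originator, when started from another surface
  telegramChatKey: null,                // set when surface === 'telegram' (references the chats doc)
  cwd: null,                            // set when surface === 'claude-code'
  transcriptPath: null,                 //   "
  gitBranch: null,                      //   "
  claudeSessionId: 'sdk-session-id',    // the active Agent SDK session
  name: 'Memory v2 dashboard work',     // human-readable, see "Naming the conversations" below
  startedAt: Date.now(),
};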

And there's the naming problem. Adding sessions as a fourth meaning of "session" is asking for grief. Calling them conversations or threads would disambiguate. The user's call: do we colonise the word back, or pick a new one?

Naming the conversations

Whatever we call them, they need human-readable names. Sidebar showing abc123 · 2 days ago is unsearchable. Showing "Memory v2 dashboard work · 2 days ago" is not. Three options:

The Claude Code session-start hook

Claude Code SessionStart hooks receive a JSON payload with session_id, transcript_path, cwd, source (startup | resume | clear | compact), and the project dir as $CLAUDE_PROJECT_DIR. Hooks can write context back into the conversation via stdout or via { hookSpecificOutput: { additionalContext: "..." } }. So a hook can:

  1. POST { surface: 'claude-code', cwd, branch, transcriptPath, claudeSessionId } to a local Cosmo endpoint.
  2. Endpoint mints / fetches the session doc, returns its ID.
  3. Hook prints "Cosmo session: <id>. To call, run /call." back to the prompt context.

That gives Claude Code first-class membership in the same session universe Telegram has. The slash command /call then just shells out, looks at the most recent transcript line for the current sessionId, posts to /call/mint with that as the parent session, and prints the URL.
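A minimal sketch of that hook, assuming a local mint endpoint at /sessions; the port, path, and response shape are all placeholders:

#!/usr/bin/env node
// Hypothetical SessionStart hook. Reads the payload Claude Code pipes to stdin,
// registers the session with Cosmo, and hands a breadcrumb back as additionalContext.
let raw = '';
process.stdin.on('data', (d) => (raw += d));
process.stdin.on('end', async () => {
  const { session_id, transcript_path, cwd } = JSON.parse(raw);
  const res = await fetch('http://localhost:3030/sessions', {  // assumed local endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ surface: 'claude-code', claudeSessionId: session_id, transcriptPath: transcript_path, cwd }),
  });
  const { id } = await res.json();                             // assumed response shape
  console.log(JSON.stringify({
    hookSpecificOutput: { hookEventName: 'SessionStart', additionalContext: `Cosmo session: ${id}. To call, run /call.` },
  }));
});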

5. Making it feel like a call

This is where the user wanted depth. The request was: even on web, make it feel like FaceTime / Google Meet, not like a Zoom side-tab.

PWA capabilities — what works in May 2026

| Feature | Status | Notes |
| --- | --- | --- |
| Add-to-home-screen | stable | iOS Safari 16.4+. iOS 26 makes it default for installed sites. |
| display: "standalone" (no browser chrome) | stable | Required for the native look. Note: "fullscreen" is not supported on iOS Safari. |
| navigator.wakeLock.request("screen") | stable since iOS 18.4 | Was broken in installed PWAs on iOS until 18.4. Now reliable. Keeps the screen on during a call. |
| Background audio when screen locks | broken on iOS PWAs | WebKit bug 198277. Audio cuts when the phone backgrounds or the screen locks. Real limitation for v1. |
| Media Session API | stable | iOS Safari 14.5+. Lock-screen controls + metadata. Doesn't fix the background-audio gap. |
| Theme color (<meta name="theme-color">) | stable | Update dynamically to tint the iOS status bar during a call. |
| Safe-area insets | stable | Needed so the end-call button doesn't sit under the home indicator. |
| Haptic feedback (navigator.vibrate) | never on iOS | Android-only. Mute-tap haptics work only on Android web. |
| Web Push (incoming-call notification) | PWA-only on iOS | iOS 16.4+ for installed PWAs. Required for "Cosmo is calling you." |
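The wake-lock and theme-color rows translate to a few lines at call start. A sketch (re-acquisition on visibilitychange is needed because the lock is released whenever the page backgrounds):

// Sketch: hold the screen on and tint the status bar for the duration of a call.
let wakeLock = null;
async function enterCallChrome() {
  try { wakeLock = await navigator.wakeLock.request('screen'); } catch {} // iOS 18.4+ per the table
  document.querySelector('meta[name="theme-color"]')?.setAttribute('content', '#0f1115');
}
document.addEventListener('visibilitychange', async () => {
  // The lock was released when the page backgrounded; re-acquire on return.
  if (document.visibilityState === 'visible' && wakeLock !== null) await enterCallChrome();
});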

Manifest pattern:

{
  "name": "Cosmo Calls",
  "short_name": "Cosmo",
  "display": "standalone",
  "orientation": "portrait",
  "theme_color": "#0f1115",
  "background_color": "#0f1115",
  "start_url": "/?source=pwa",
  "icons": [{ "src": "/icon-512.png", "sizes": "512x512", "type": "image/png" }]
}

The audio loop — WebRTC vs WebSocket+Opus

The honest take from the research is that for a single-user agent on a good network, WebSocket + Opus-in-AudioWorklet works. WebRTC is the better tech, but the transport latency it saves (WS sits around 200-500ms, WebRTC around 60-120ms) is dwarfed by LLM TTFT (~700ms+). Most production voice agents are migrating to WebRTC anyway, but it's a real infra step (STUN/TURN, optionally an SFU).

WebSocket (lean v1)
  • Browser WebSocket + AudioWorklet. No STUN/TURN, no SFU.
  • Mic → AudioWorklet → Opus encode → WS → server.
  • TTS chunks → WS → AudioWorklet ring buffer → speakers.
  • 200-500ms transport latency, dwarfed by LLM.
  • Works fine for one user. Falls over on lossy mobile.
WebRTC (production grade)
  • RTCPeerConnection, mandatory Opus, built-in jitter buffer + AEC + noise suppression.
  • 60-120ms transport latency.
  • Needs ICE/STUN/TURN; can fall back to TURN-over-TCP/443.
  • Production posts split: some swore the move to WebRTC was worth it, others switched back to WS for simplicity at single-user scale.
  • If we're going to be on bad mobile networks regularly (in cars, on the go), this is the answer.
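For the lean-v1 column, the capture side is small. A sketch that ships raw PCM frames; the Opus encode step is elided for brevity, and 'pcm-capture' is a worklet module we would write ourselves:

// Mic → AudioWorklet → WS capture sketch. Raw PCM shown; a real build would Opus-encode first.
const ws = new WebSocket('wss://example.invalid/call');         // placeholder URL
const ctx = new AudioContext({ sampleRate: 16000 });
await ctx.audioWorklet.addModule('/pcm-capture.js');            // hypothetical worklet module
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true },    // software AEC; WebRTC does this natively
});
const source = ctx.createMediaStreamSource(stream);
const capture = new AudioWorkletNode(ctx, 'pcm-capture');
capture.port.onmessage = ({ data }) => ws.send(data);           // PCM frames → server
source.connect(capture);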

VAD — barge-in is what makes it feel like a call

The single biggest "this feels alive" feature is interruption. You start to talk, Cosmo stops talking immediately. Without that, every call feels like a walkie-talkie.

The solved pattern is browser-side Silero VAD via @ricky0123/vad-web:

const myvad = await vad.MicVAD.new({
  onSpeechStart: () => { /* user started — clear TTS ring buffer */ },
  onSpeechEnd:   (audio) => { /* Float32Array @ 16kHz — finalise STT */ },
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  minSpeechFrames: 3,
  redemptionFrames: 8,
  frameSamples: 1536,
});
myvad.start(); // begins listening; neither callback fires until this is called

~1ms per frame, runs as ONNX in an AudioWorklet thread, doesn't block the UI.

Audio playback — the AudioWorklet ring buffer pattern

For low-latency streaming TTS chunks, the right pattern is Web Audio AudioWorklet with a ring buffer. PCM chunks come in over the WS, get pushed into a SharedArrayBuffer, the worklet drains it sample-by-sample. Cancellation = drain the buffer, one message. Same pattern OpenAI Realtime, LiveKit, and Pipecat browser clients all use.

The legacy options (HTMLAudioElement per chunk, MSE) all have audible gaps or codec-prefix requirements that don't suit our use.
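A compressed sketch of the worklet side. It passes chunks over the worklet port rather than a SharedArrayBuffer to keep the example short, but the drain-and-cancel shape is the same; all names here are ours:

// ring-buffer-player.js — illustrative AudioWorkletProcessor.
class RingBufferPlayer extends AudioWorkletProcessor {
  constructor() {
    super();
    this.buf = new Float32Array(sampleRate * 10); // ~10s of headroom at the context rate
    this.read = 0;
    this.write = 0;
    this.port.onmessage = ({ data }) => {
      if (data === 'clear') { this.read = this.write; return; } // barge-in: one message drains everything
      for (const s of data) {                                   // data: Float32Array PCM chunk from the WS
        this.buf[this.write] = s;
        this.write = (this.write + 1) % this.buf.length;
      }
    };
  }
  process(_inputs, outputs) {
    const out = outputs[0][0];
    for (let i = 0; i < out.length; i++) {
      if (this.read === this.write) { out[i] = 0; continue; }   // underrun: emit silence, no crackle
      out[i] = this.buf[this.read];
      this.read = (this.read + 1) % this.buf.length;
    }
    return true; // keep the processor alive for the whole call
  }
}
registerProcessor('ring-buffer-player', RingBufferPlayer);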

The visual call screen

The bits that make a web page feel like a call screen, not like a settings page:

Mic permissions UX

Mic permissions are fragile, especially on iOS Safari. Two non-negotiable rules:

Best practice: a "Tap to start the call" splash that explicitly says mic access will be requested. No auto-prompt on page load — looks dodgy and gets denied.

6. Picking the loop

STT — streaming, not Whisper

Whisper API is batch-only. For a real call we need streaming. The realistic options:

| Provider | Endpoint | TTFT P50 | Pricing | Notes |
| --- | --- | --- | --- | --- |
| Deepgram Nova-3 / Flux | WS | <300ms (Flux: <150ms) | $0.0077/min PAYG | Workhorse. Industry default. Workers AI also exposes it. |
| AssemblyAI Universal-3 Pro Streaming | WS | ~150ms | $0.0025/min | Best entity accuracy. Cheapest realistic option. |
| OpenAI Whisper | HTTPS (batch) | ~1s+ for a 5s clip | $0.006/min | What Cosmo uses today. No streaming. Fatal for a call. |
| OpenAI Realtime API | WS or WebRTC | ~500ms | Token-based audio pricing | Different model. Speech-to-speech; conflicts with "brain stays as-is." |

Streaming STT is non-negotiable for the call to feel right. Deepgram is the safe pick; AssemblyAI is cheaper. Whisper is out for the call path (still fine for voice notes).
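To make the WS shape concrete, a server-side sketch against Deepgram's live endpoint. Query params and message shape follow their docs as of the research; verify before build, and the two handler functions are hypothetical:

// Sketch: relay PCM from the browser into Deepgram live STT (Node, `ws` package).
import WebSocket from 'ws';

const dg = new WebSocket(
  'wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true',
  { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } },
);
dg.on('message', (raw) => {
  const msg = JSON.parse(raw);
  const alt = msg.channel?.alternatives?.[0];
  if (!alt?.transcript) return;
  if (msg.is_final) finaliseUtterance(alt.transcript); // hypothetical handler → agent turn
  else showInterim(alt.transcript);                    // hypothetical handler → UI captions
});
// elsewhere: browserWs.on('message', (pcm) => dg.send(pcm));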

TTS — the ElevenLabs vs OpenAI fork

This is the most interesting tradeoff in the stack.

OpenAI tts-1 with voice "nova" — continuity
  • Same voice the existing Telegram voice notes use. Cosmo sounds like Cosmo across surfaces.
  • HTTP chunked streaming only — no WebSocket TTS. Confirmed via OpenAI docs and community thread.
  • ~500ms TTFB; sentence-buffering at our end to keep playback smooth.
  • Already on the bill. Already provisioned. Zero new vendor.
  • To get true WebSocket TTS from OpenAI you have to use the full Realtime speech-to-speech API, which conflicts with the brain-stays-as-is rule.
ElevenLabs Flash v2.5 over WebSocket — latency
  • True bidirectional WebSocket: text-stream in, audio-stream out. Prosody preserved across chunks.
  • ~50-75ms model TTFB. ~400-500ms total with network in real conditions.
  • Recommended by Anthropic's own cookbook for low-latency Claude voice.
  • New voice for Cosmo. Different sound across surfaces. Voice cloning is an option (Cosmo could sound like a custom voice we pick).
  • ~$0.30/1k chars on Pro tier. New monthly bill.

The decision is: continuity vs latency vs voice quality vs new bill. There isn't a wrong answer; there's a preference.

Side note: Cartesia Sonic Turbo has the lowest model TTFB in the comparison set (~40ms) and is a strong technical choice, but the user explicitly named ElevenLabs and OpenAI as the two providers in scope, so it's a footnote, not a contender.
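If the fork lands on ElevenLabs, the stream-input socket looks roughly like this. Endpoint and message shape follow their docs as of the research; treat it as a sketch to verify, and pushToRingBuffer is a hypothetical playback handoff:

// Sketch: ElevenLabs Flash v2.5 over the stream-input WebSocket (Node, `ws` package).
import WebSocket from 'ws';

const VOICE_ID = 'your-voice-id'; // placeholder
const el = new WebSocket(
  `wss://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream-input?model_id=eleven_flash_v2_5`,
);
el.on('open', () => {
  el.send(JSON.stringify({ text: ' ', xi_api_key: process.env.ELEVENLABS_API_KEY })); // handshake
});
el.on('message', (raw) => {
  const msg = JSON.parse(raw);
  if (msg.audio) pushToRingBuffer(Buffer.from(msg.audio, 'base64'));
});
// per sentence from the LLM stream:
//   el.send(JSON.stringify({ text: sentence + ' ' }));
// end of turn:
//   el.send(JSON.stringify({ text: '' }));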

Brain — keeping it unchanged, with one switch

The Agent SDK supports token-level streaming via includePartialMessages: true. Today's executeClaudeCode() doesn't set it. With it on, the iterator additionally yields stream_event chunks containing raw Anthropic API streaming events. Per the docs:

for await (const message of query({ prompt, options: { includePartialMessages: true, ...existing } })) {
  if (message.type === "stream_event") {
    const event = message.event;
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      // ship event.delta.text to TTS sentence buffer
    }
    if (event.type === "content_block_start" && event.content_block.type === "tool_use") {
      // emit "looking at your <tool name>..." filler audio
    }
  }
  // existing AssistantMessage and result handling unchanged
}

Telegram path is untouched (it ignores stream_event); call path adds the new branch. Per-turn voice-mode system prompt overlay is one extra append on the system prompt when the surface is call.
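The "TTS sentence buffer" in the snippet above is small enough to show. A sketch; the boundary regex is naive on purpose, since abbreviations and decimal numbers need real handling:

// Accumulate token deltas, flush whole sentences to TTS so prosody stays intact.
function makeSentenceBuffer(onSentence) {
  let buf = '';
  return {
    push(delta) {
      buf += delta;
      let m;
      while ((m = buf.match(/[.!?]["')\]]*\s/)) !== null) {
        const end = m.index + m[0].length;
        onSentence(buf.slice(0, end).trim());
        buf = buf.slice(end);
      }
    },
    flush() { // call at turn end for any trailing fragment
      if (buf.trim()) onSentence(buf.trim());
      buf = '';
    },
  };
}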

Real spec-blocker discovered. The docs say StreamEvent messages are not emitted when explicit maxThinkingTokens is set. Cosmo currently uses thinking: { type: 'adaptive' }. The docs further say "thinking is disabled by default in the SDK, so streaming works unless you enable it" — which suggests adaptive thinking may suppress stream events. This needs a 30-min spike to verify before we commit to token-level streaming. If incompatible, voice-mode either disables adaptive thinking on call turns, or accepts whole-block streaming with sentence-buffering at TTS time (slightly higher TTFB, still works).

Voice-mode system prompt

From the prompting research (ElevenLabs guide, LiveKit prompting voice agents, Vapi guide), voice-tuned prompts produce responses 60-70% shorter than text equivalents. The overlay we'd append:

Aborting mid-stream

Today's interrupt path goes through Firestore (cancelRequest(), polled by the agent). For a call, the abort needs to be sub-100ms — too slow if it round-trips Firestore. The Agent SDK accepts an AbortController; the call path should hold it locally and abort directly when VAD fires onSpeechStart. Worth confirming in the SDK source before locking it in.
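Assuming the SDK's abortController option behaves as documented, the call path would look like this sketch; the browser-to-server message shape is ours:

// Per-turn abort held in process memory; no Firestore round-trip.
const controller = new AbortController();
const turn = query({ prompt, options: { abortController: controller, ...voiceOptions } }); // voiceOptions: hypothetical
// iterate `turn` exactly as in the streaming snippet above

browserWs.on('message', (raw) => {
  const msg = JSON.parse(raw);
  if (msg.type === 'barge-in') { // sent when VAD fires onSpeechStart in the browser
    controller.abort();          // stop the agent turn
    ttsSocket.close();           // stop audio generation mid-sentence
    // the browser clears its own ring buffer locally, so audio stops in <100ms
  }
});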

7. Initiation surfaces

The same minting endpoint serves all surfaces. The only thing that varies is what context the surface ships in the mint request.

Telegram /call

  1. User types /call in any chat.
  2. Bot reads chat context (existing getChatContextWithMigration), POSTs { surface: 'telegram', telegramChatKey, recentTurns, project } to /call/mint.
  3. Endpoint returns a single-use URL (TTL: 5 min to claim; once claimed, the session lives for the duration of the call).
  4. Bot replies with the URL as a clickable link.
  5. User taps → browser opens the call screen.
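The mint endpoint behind step 2 is a dozen lines. A sketch in Express style; the domain, port, and field names are placeholders:

// Hypothetical /call/mint handler shared by all surfaces.
import crypto from 'node:crypto';

app.post('/call/mint', async (req, res) => {
  const token = crypto.randomBytes(16).toString('base64url');
  await db.collection('sessions').add({
    ...req.body,                                // surface + whatever context the surface ships
    token,
    tokenExpiresAt: Date.now() + 5 * 60 * 1000, // 5-minute claim window from step 3
    status: 'minted',
    createdAt: Date.now(),
  });
  res.json({ token, url: `https://cosmo-call.example/c/${token}` }); // placeholder domain
});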

Claude Code /call

  1. Slash command (~/.claude/commands/call.md) shells out to a node script.
  2. Script reads the most recent transcript line for current sessionId, cwd, gitBranch.
  3. POSTs { surface: 'claude-code', claudeSessionId, cwd, gitBranch, transcriptPath } to /call/mint.
  4. Prints the URL into the response. User cmd-clicks.

An optional SessionStart hook (off by default until we've been running it a while) can pre-mint a session doc on Claude Code launch so even if you don't /call, your CC sessions are first-class in the same Firestore collection — useful for cross-session memory and "what was I doing yesterday" queries.

Mac menu-bar / native (later)

A small menu-bar app could open a fixed URL like cosmo-call.../c/menubar?fresh=1 to start an "ambient" call with no parent session — pure new conversation, no context inherited. Or it can read the system clipboard for a token URL the user just got and open straight to it. Out of scope for v1, easy to add later.

8. What v1 looks like end-to-end

sequenceDiagram
  participant User
  participant Telegram
  participant Bot as cosmo-bot
  participant Mint as /call/mint
  participant FS as Firestore (sessions)
  participant Browser
  participant Voice as cosmo-voice (new process)
  participant STT as Streaming STT
  participant Brain as cosmo-agent (unchanged)
  participant TTS as Streaming TTS
  User->>Telegram: /call
  Telegram->>Bot: command
  Bot->>Mint: POST { surface, context }
  Mint->>FS: create session doc
  Mint-->>Bot: { url, token }
  Bot-->>User: clickable URL
  User->>Browser: tap URL
  Browser->>Voice: WS connect with token
  Voice->>FS: resolve session
  loop per turn
    User->>Browser: speak
    Browser->>Voice: PCM chunks (WS)
    Voice->>STT: stream
    STT-->>Voice: transcript deltas
    Voice->>Brain: query() with includePartialMessages
    Brain-->>Voice: token deltas
    Voice->>TTS: text stream
    TTS-->>Voice: audio chunks
    Voice-->>Browser: audio chunks (WS)
    Browser->>User: speakers
    User->>Browser: barge-in (VAD onSpeechStart)
    Browser->>Voice: cancel
    Voice->>Brain: AbortController.abort()
    Voice->>TTS: close
    Voice->>Browser: clear ring buffer
  end
  User->>Browser: end call
  Browser->>Voice: close WS
  Voice->>FS: session ended, transcript saved

New components

What stays exactly the same

9. What's locked, what's open

| Decision | Status | Notes |
| --- | --- | --- |
| Brain stays as Cosmo agent | locked | The constraint that defines the whole shape of v1. |
| Web is the v1 transport | locked | Telephony is phase 2. |
| /call in Telegram + Claude Code | locked | Token-bearing URL pattern. Same mint endpoint serves both. |
| Tokens carry session context | locked | That's the whole point of the URL pattern. |
| Sessions collection introduced | locked in principle | Additive vs migration is open. Naming is open. |
| Browser-side Silero VAD for barge-in | locked | @ricky0123/vad-web. The pattern is too solved to relitigate. |
| AudioWorklet ring-buffer playback | locked | Same. The right pattern for low-latency streaming TTS. |
| PWA call screen with wake lock + theme color + safe areas | locked | The "feels like a call" tier. Background-audio gap on iOS is accepted (calls work while the screen is on). |
| STT: streaming required (Whisper out) | locked | Whisper is batch-only. Provider choice (Deepgram vs AssemblyAI) is open. |
| Voice-mode system prompt overlay | locked | Added at the getSystemPrompt() call site when surface = call. |
| TTS provider | open | OpenAI tts-1 nova (continuity) vs ElevenLabs Flash v2.5 WS (latency). |
| STT provider | open | Deepgram Nova-3 vs AssemblyAI Universal-3. |
| Adaptive thinking + streaming compatibility | open | Needs a 30-min spike. Real spec-blocker. |
| Transport: WebSocket vs WebRTC | open | WS is fine for v1; WebRTC is better for mobile / lossy networks. |
| Sessions: additive vs migration | open | Additive is recommended for v1; migration as a separate later project. |
| Sessions naming | open | sessions overloads existing terms. conversations or threads disambiguate. |
| Auto-naming approach | open | User-set / auto / hybrid. Hybrid is probably right. |
| Hosting (cosmo-voice on Mac via tunnel vs CF Workers vs hybrid) | open | Mac-via-tunnel is the lowest-friction path that mirrors existing patterns. |
| Audio retention (privacy) | open | Transcripts go to messages. Raw audio: keep / drop / how long? |
| Concurrent call cap | open | One at a time is fine for v1; the architecture should not preclude many. |

10. Decisions you need to make

In rough order of impact:

Q1. TTS provider — continuity or latency?
OpenAI tts-1 with "nova" — Cosmo sounds the same across surfaces. HTTP chunked streaming, ~500ms TTFB. Already paid for.

ElevenLabs Flash v2.5 WS — different (probably better) voice. True streaming. ~75ms model TTFB. New monthly bill (~$0.30/1k chars).

The cookbook recommends ElevenLabs. The continuity argument favours OpenAI. Your call.
Q2. STT provider — Deepgram or AssemblyAI?
Deepgram Nova-3 / Flux: industry default, <300ms TTFT, $0.0077/min.
AssemblyAI Universal-3: ~150ms TTFT, $0.0025/min (cheaper), best entity accuracy.

No wrong answer. AssemblyAI is the pragmatic pick if cost matters; Deepgram has more developer mindshare.
Q3. Sessions — collection name and scope?
Two questions in one: (a) additive or migration? Recommend additive for v1, migration later if at all. (b) name? sessions overloads three existing meanings. conversations or threads disambiguate. Pick one.
Q4. Transport — WebSocket or WebRTC?
WS is simpler and fine on good networks. WebRTC is the production path with proper jitter buffer and AEC, important for mobile / lossy networks. For "feels like a call" on a phone in the kitchen with average wifi, WS is good enough. For taking calls in the car on cell, WebRTC earns its keep.
Q5. Adaptive thinking — does it suppress streaming?
Real risk that thinking: { type: 'adaptive' } disables StreamEvent emission. Needs a 30-min spike before we commit to token-streaming. If incompatible: either turn off adaptive thinking on call turns (small quality dent) or accept whole-block streaming with sentence-buffering (slightly higher TTFB).

Should we just do this spike now — before any of the above decisions are real? Yes.
Q6. Hosting — Mac via tunnel, or pure Cloudflare?
Cosmo's brain runs on the Mac. The voice loop can either:
(a) live on the Mac as a new PM2 process (cosmo-voice), exposed via Cloudflare Tunnel — same pattern as deep-link redirector. Lowest friction.
(b) live on Cloudflare Workers + Durable Objects, relaying via tunnel back to the Mac for the brain call. More moving parts but edge-native.
Recommend (a) for v1.
Q7. Audio retention
STT transcripts will land in the existing messages collection naturally. Do we keep raw mic audio? For how long? For what purpose? Privacy decision — your call.
Q8. iOS background-audio limitation — accept or hold for native?
PWA audio dies when iOS screen locks (WebKit bug 198277). v1 either accepts that calls only work screen-on (with wake lock holding the screen on the whole time), or we wrap in a native shell later. Accepting it is fine for "talk to Cosmo at the desk / on a call screen actively in use." Calls in your pocket need the native shell.
Q9. Concurrent calls — one at a time is fine for v1?
A single user almost never has two calls open. Locking it to one keeps the architecture simple. The session-doc design doesn't preclude many; it's just one fewer thing to think about.

11. Phase 2 — phone number (note only)

When we add a real phone number later, the cleanest path is Twilio ConversationRelay, not raw Media Streams:

What would paint v1 into a corner:

v1 should structure the agent-facing interface as (input: text-stream | audio-stream) → agent → (output: text-stream) so swapping browser ↔ Twilio is a transport adapter, not an agent change.
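That seam can be stated as one function signature now, so phase 2 becomes a new transport rather than a rework. A sketch with names of our own invention:

// The transport owns audio; the loop only ever sees text. Browser and Twilio
// become interchangeable implementations of `transcriptIn` / `speakOut`.
export async function runCallLoop({ transcriptIn, speakOut }) {
  // transcriptIn: async iterable of finalised user utterances (strings)
  // speakOut: async (textStream) => void, sentence-buffered into whichever TTS the transport owns
  for await (const utterance of transcriptIn) {
    const textStream = runAgentTurn(utterance); // hypothetical wrapper around unchanged executeClaudeCode()
    await speakOut(textStream);
  }
}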

12. References

Full research findings at specs/research/voice-call-findings.md. Key external sources:

Doc lives at plans/voice-call.html. Generated Sun 3 May 2026 from research findings.
