Cosmo voice calls — research + early plan

Findings, not yet a build brief. Sun 3 May 2026.

What this doc is. The user wants to talk to Cosmo — real, two-way, conversational, distinct from today's voice-note ping-pong. We agreed v1 is a web-based call initiated from a slash command (in Telegram and in Claude Code), opening a unique URL whose token carries the originating session's context. Telephony (Twilio) is later. The brain stays exactly as it is — same Claude Agent SDK loop, same tools, same memory. We just put a voice transport in front of it. This doc captures what the deep research turned up, the decisions that fell out cleanly, and the forks the user still needs to call.
Hard rule. Cosmo's brain stays as-is. executeClaudeCode() in src/agent.js is not modified beyond the bare minimum to (a) optionally enable token-level streaming for TTS and (b) accept a per-turn voice-mode system prompt overlay. No swap to Realtime models, no swap to OpenClaw, no swap to a different LLM. OpenClaw and friends are inspiration for the transport and the loop, not the brain.
Contents
  1. The shape we're building
  2. OpenClaw + reference voice loops, briefly
  3. What Cosmo already has
  4. The unified sessions question
  5. Making it feel like a call (PWA, audio, UI)
  6. Picking the loop (STT, brain, TTS, transport)
  7. Initiation surfaces — Telegram, Claude Code, later
  8. What v1 looks like end-to-end
  9. What's locked, what's open
  10. Decisions you need to make
  11. Phase 2 — phone number (note only)
  12. References

1. The shape we're building

One sentence: a /call command mints a tokenised URL → you open it in a browser → it feels like a call → Cosmo on the other end is the same Cosmo you already talk to in Telegram, but voice-tuned.

flowchart LR A["/call from
Telegram or Claude Code"] --> B[mint token] B --> C[(sessions doc
holds context)] B --> D[https://cosmo-call.../c/<token>] D --> E[browser opens
call screen] E --> F[mic + WebRTC/WS] F --> G[STT stream] G --> H[Cosmo agent
same brain] H --> I[TTS stream] I --> E C --> H

The interesting word here is token. The token doesn't just authenticate the call — it resolves to a session server-side. The session knows which Telegram chat (or which Claude Code cwd + branch) the call was initiated from, the recent turns, the project context, the memory scope. So when you `/call` from your H2OS chat mid-conversation about a dispenser, Cosmo on the call already knows.
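In code terms, the claim step is just a lookup plus an expiry check. A minimal sketch, assuming a Firestore sessions collection; every collection and field name here is a placeholder, not a final schema:

// Hypothetical claim/resolve step for /c/<token>. Names are placeholders.
async function resolveCallToken(db, token) {
  const snap = await db.collection('sessions').where('token', '==', token).limit(1).get();
  if (snap.empty) throw new Error('unknown token');
  const doc = snap.docs[0];
  const session = doc.data();
  // Enforce the single-use claim window; the session itself lives on for the call.
  if (session.claimedAt || session.tokenExpiresAt < Date.now()) throw new Error('token expired');
  await doc.ref.update({ claimedAt: Date.now() });
  return session; // carries the originating chat / cwd context for the first turn
}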

This means a separate decision is forced into the open: what is a "session" in Cosmo, exactly? The codebase already overloads the word three ways. We'll get to it (§4).

2. OpenClaw + reference voice loops, briefly

OpenClaw the project doesn't ship a voice mode in core. The community add-on is Purple-Horizons/openclaw-voice. Worth knowing the shape:

| Reference loop | Transport | STT | TTS | VAD / barge-in | Notable |
| --- | --- | --- | --- | --- | --- |
| openclaw-voice | WebSocket | Whisper (local, faster-whisper) | ElevenLabs Turbo v2.5, sentence streaming | Silero VAD server-side; barge-in not documented | "Voice never leaves your machine." Closest reference architecture. |
| Anthropic cookbook (ElevenLabs low-latency) | WS end-to-end | ElevenLabs Scribe | ElevenLabs Flash WS (text-stream-in, audio-stream-out) | Sentence-aligned chunks | Recommends WebSocket TTS over HTTP TTS for prosodic continuity. ~31% TTFT reduction streaming Haiku. |
| LiveKit Agents + Anthropic plugin | WebRTC | Pluggable (Deepgram default) | Pluggable (Cartesia default) | Framework-handled; real barge-in | Production-grade. Decouples STT/LLM/TTS as plugins. Closest to "what if we wanted it bulletproof." |
| Pipecat | WebRTC or WS | Pluggable | Pluggable | Silero + their SmartTurn turn-taking model | Frame-based. Best-in-class turn detection. |

What's worth borrowing from any of these:

What we explicitly are not borrowing:

3. What Cosmo already has

This is important to document because it changes what's "new build" vs "wire it up differently."

The existing voice path

The existing path is file-based and half-duplex. Whole utterance up, whole utterance down. Zero streaming. Whisper API is batch-only — no streaming endpoint exists. This is fine for voice notes, fatal for a call.

One bit of leftover code worth noting: src/voice/ is a half-built Python "Hey Cosmo" wake-word menu-bar daemon. Parked, never wired up. Different problem (text injection into focused app), not relevant to /call.

What's reusable for v1

| Existing piece | Reusable in /call? | Why |
| --- | --- | --- |
| transcribeVoice() (whole-file Whisper) | no | Whisper API has no streaming mode. Need a different STT. |
| textToSpeech() (whole-text OpenAI TTS) | partially | OpenAI tts-1 has HTTP chunked streaming but no WebSocket. Acceptable for v1, suboptimal vs ElevenLabs WS. |
| voiceMode Map | no | In-memory per-user flag is the wrong shape; sessions are server-side and multi-surface. |
| OPENAI_API_KEY | yes | Already provisioned for Whisper + TTS. |
| executeClaudeCode() agent loop | yes, unchanged | The constraint. Add includePartialMessages: true for streaming, prepend the voice-mode system prompt, otherwise untouched. |
| Firestore request queue | conceptually | The onSnapshot-listener pattern works for asynchronous queues, not for a live audio loop. The call needs a tighter direct path. |
| buildTurnTopicsContext(message) (memory router) | yes | Same memory loading per turn. |

4. The unified sessions question

The token-resolves-to-a-session premise forces a question we've been avoiding: what is a session? The codebase overloads the word at least three ways already:

  1. Claude Agent SDK session — the sessionId string returned by query(). Stored as claudeSessionId on the chat doc.
  2. Telegram chat state — the chats Firestore collection, with backwards-compat aliases getSession/saveSession in db.js:69-72.
  3. Claude Code transcripts — JSONL files in ~/.claude/projects/<slug>/, each with their own sessionId.

A web call has nowhere to live in any of these. It needs, at minimum, a doc somewhere that holds the originating context, when the call started, which Cosmo session ID is active, and optionally a name.

Two paths fall out of the research:

Additive — new sessions collection that references the existing things
Lower risk. Fast.
  • New collection. Surface enum: telegram | claude-code | call.
  • Telegram sessions reference a chats doc by chatKey. No data moved.
  • Claude Code sessions reference a transcript path + cwd + branch.
  • Call sessions stand alone, optionally with a parent_session_id pointing back to the originator.
  • Existing code paths untouched. Messages still keyed by chatId.
Migration — fold chats + chat_contexts into sessions
Higher value. Higher cost.
  • One collection, one concept.
  • Touches the messages-by-chatId query in db.js:99-103 and the chatKey compat logic.
  • Real migration of historical messages and chatIdMigration.js compatibility paths.
  • Cleaner long-term but a meaningful side-quest. Don't bundle it with voice v1 unless it's already on the roadmap.
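Back to the additive path: for concreteness, a sketch of what its session doc might hold. Every field name here is illustrative, not a schema proposal:

// Illustrative additive session doc; all field names are assumptions.
const exampleSession = {
  surface: 'call',                      // 'telegram' | 'claude-code' | 'call'
  parent_session_id: 'sess_abc123',     // originator, when started from another surface
  telegramChatKey: null,                // set when surface === 'telegram' (references the chats doc)
  cwd: null,                            // set when surface === 'claude-code'
  transcriptPath: null,                 //   "
  gitBranch: null,                      //   "
  claudeSessionId: 'sdk-session-id',    // the active Agent SDK session
  name: 'Memory v2 dashboard work',     // human-readable, see "Naming the conversations" below
  startedAt: Date.now(),
};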

And there's the naming problem. Adding sessions as a fourth meaning of "session" is asking for grief. Calling them conversations or threads would disambiguate. The user's call: do we colonise the word back, or pick a new one?

Naming the conversations

Whatever we call them, they need human-readable names. Sidebar showing abc123 · 2 days ago is unsearchable. Showing "Memory v2 dashboard work · 2 days ago" is not. Three options:

The Claude Code session-start hook

Claude Code SessionStart hooks receive a JSON payload with session_id, transcript_path, cwd, source (startup | resume | clear | compact), and the project dir as $CLAUDE_PROJECT_DIR. Hooks can write context back into the conversation via stdout or via { hookSpecificOutput: { additionalContext: "..." } }. So a hook can:

  1. POST { surface: 'claude-code', cwd, branch, transcriptPath, claudeSessionId } to a local Cosmo endpoint.
  2. Endpoint mints / fetches the session doc, returns its ID.
  3. Hook prints "Cosmo session: <id>. To call, run /call." back to the prompt context.

That gives Claude Code first-class membership in the same session universe Telegram has. The slash command /call then just shells out, looks at the most recent transcript line for the current sessionId, posts to /call/mint with that as the parent session, and prints the URL.
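A minimal sketch of that hook, assuming a local mint endpoint at /sessions; the port, path, and response shape are all placeholders:

#!/usr/bin/env node
// Hypothetical SessionStart hook. Reads the payload Claude Code pipes to stdin,
// registers the session with Cosmo, and hands a breadcrumb back as additionalContext.
let raw = '';
process.stdin.on('data', (d) => (raw += d));
process.stdin.on('end', async () => {
  const { session_id, transcript_path, cwd } = JSON.parse(raw);
  const res = await fetch('http://localhost:3030/sessions', {  // assumed local endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ surface: 'claude-code', claudeSessionId: session_id, transcriptPath: transcript_path, cwd }),
  });
  const { id } = await res.json();                             // assumed response shape
  console.log(JSON.stringify({
    hookSpecificOutput: { hookEventName: 'SessionStart', additionalContext: `Cosmo session: ${id}. To call, run /call.` },
  }));
});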

5. Making it feel like a call

This is where the user wanted depth. The request was: even on web, make it feel like FaceTime / Google Meet, not like a Zoom side-tab.

PWA capabilities — what works in May 2026

| Feature | Status | Notes |
| --- | --- | --- |
| Add-to-home-screen | stable | iOS Safari 16.4+. iOS 26 makes it default for installed sites. |
| display: "standalone" (no browser chrome) | stable | Required for the native look. Note: "fullscreen" is not supported on iOS Safari. |
| navigator.wakeLock.request("screen") | stable since iOS 18.4 | Was broken in installed PWAs on iOS until 18.4. Now reliable. Keeps the screen on during a call. |
| Background audio when screen locks | broken on iOS PWAs | WebKit bug 198277. Audio cuts when the phone backgrounds or the screen locks. Real limitation for v1. |
| Media Session API | stable | iOS Safari 14.5+. Lock-screen controls + metadata. Doesn't fix the background-audio gap. |
| Theme color (<meta name="theme-color">) | stable | Update dynamically to tint the iOS status bar during a call. |
| Safe-area insets | stable | Needed so the end-call button doesn't sit under the home indicator. |
| Haptic feedback (navigator.vibrate) | never on iOS | Android-only. Mute-tap haptics work only on Android web. |
| Web Push (incoming-call notification) | PWA-only on iOS | iOS 16.4+ for installed PWAs. Required for "Cosmo is calling you." |
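The wake-lock and theme-color rows translate to a few lines at call start. A sketch (re-acquisition on visibilitychange is needed because the lock is released whenever the page backgrounds):

// Sketch: hold the screen on and tint the status bar for the duration of a call.
let wakeLock = null;
async function enterCallChrome() {
  try { wakeLock = await navigator.wakeLock.request('screen'); } catch {} // iOS 18.4+ per the table
  document.querySelector('meta[name="theme-color"]')?.setAttribute('content', '#0f1115');
}
document.addEventListener('visibilitychange', async () => {
  // The lock was released when the page backgrounded; re-acquire on return.
  if (document.visibilityState === 'visible' && wakeLock !== null) await enterCallChrome();
});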

Manifest pattern:

{
  "name": "Cosmo Calls",
  "short_name": "Cosmo",
  "display": "standalone",
  "orientation": "portrait",
  "theme_color": "#0f1115",
  "background_color": "#0f1115",
  "start_url": "/?source=pwa",
  "icons": [{ "src": "/icon-512.png", "sizes": "512x512", "type": "image/png" }]
}

The audio loop — WebRTC vs WebSocket+Opus

The honest take from the research is that for a single-user agent on a good network, WebSocket + Opus-in-AudioWorklet works. WebRTC is the better tech, but the transport latency it saves (WS sits around 200-500ms, WebRTC around 60-120ms) is dwarfed by LLM TTFT (~700ms+). Most production voice agents are migrating to WebRTC anyway, but it's a real infra step (STUN/TURN, optionally an SFU).

WebSocket (lean v1)
  • Browser WebSocket + AudioWorklet. No STUN/TURN, no SFU.
  • Mic → AudioWorklet → Opus encode → WS → server.
  • TTS chunks → WS → AudioWorklet ring buffer → speakers.
  • 200-500ms transport latency, dwarfed by LLM.
  • Works fine for one user. Falls over on lossy mobile.
WebRTC (production grade)
  • RTCPeerConnection, mandatory Opus, built-in jitter buffer + AEC + noise suppression.
  • 60-120ms transport latency.
  • Needs ICE/STUN/TURN; can fall back to TURN-over-TCP/443.
  • Production posts split: some swore the move to WebRTC was worth it, others switched back to WS for simplicity at single-user scale.
  • If we're going to be on bad mobile networks regularly (in cars, on the go), this is the answer.
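For the lean-v1 column, the capture side is small. A sketch that ships raw PCM frames; the Opus encode step is elided for brevity, and 'pcm-capture' is a worklet module we would write ourselves:

// Mic → AudioWorklet → WS capture sketch. Raw PCM shown; a real build would Opus-encode first.
const ws = new WebSocket('wss://example.invalid/call');         // placeholder URL
const ctx = new AudioContext({ sampleRate: 16000 });
await ctx.audioWorklet.addModule('/pcm-capture.js');            // hypothetical worklet module
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true },    // software AEC; WebRTC does this natively
});
const source = ctx.createMediaStreamSource(stream);
const capture = new AudioWorkletNode(ctx, 'pcm-capture');
capture.port.onmessage = ({ data }) => ws.send(data);           // PCM frames → server
source.connect(capture);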

VAD — barge-in is what makes it feel like a call

The single biggest "this feels alive" feature is interruption. You start to talk, Cosmo stops talking immediately. Without that, every call feels like a walkie-talkie.

The solved pattern is browser-side Silero VAD via @ricky0123/vad-web:

const myvad = await vad.MicVAD.new({
  onSpeechStart: () => { /* user started — clear TTS ring buffer */ },
  onSpeechEnd:   (audio) => { /* Float32Array @ 16kHz — finalise STT */ },
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  minSpeechFrames: 3,
  redemptionFrames: 8,
  frameSamples: 1536,
});
myvad.start(); // begins listening; neither callback fires until this is called

~1ms per frame, runs as ONNX in an AudioWorklet thread, doesn't block the UI.

Audio playback — the AudioWorklet ring buffer pattern

For low-latency streaming TTS chunks, the right pattern is Web Audio AudioWorklet with a ring buffer. PCM chunks come in over the WS, get pushed into a SharedArrayBuffer, the worklet drains it sample-by-sample. Cancellation = drain the buffer, one message. Same pattern OpenAI Realtime, LiveKit, and Pipecat browser clients all use.

The legacy options (HTMLAudioElement per chunk, MSE) all have audible gaps or codec-prefix requirements that don't suit our use.
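A compressed sketch of the worklet side. It passes chunks over the worklet port rather than a SharedArrayBuffer to keep the example short, but the drain-and-cancel shape is the same; all names here are ours:

// ring-buffer-player.js — illustrative AudioWorkletProcessor.
class RingBufferPlayer extends AudioWorkletProcessor {
  constructor() {
    super();
    this.buf = new Float32Array(sampleRate * 10); // ~10s of headroom at the context rate
    this.read = 0;
    this.write = 0;
    this.port.onmessage = ({ data }) => {
      if (data === 'clear') { this.read = this.write; return; } // barge-in: one message drains everything
      for (const s of data) {                                   // data: Float32Array PCM chunk from the WS
        this.buf[this.write] = s;
        this.write = (this.write + 1) % this.buf.length;
      }
    };
  }
  process(_inputs, outputs) {
    const out = outputs[0][0];
    for (let i = 0; i < out.length; i++) {
      if (this.read === this.write) { out[i] = 0; continue; }   // underrun: emit silence, no crackle
      out[i] = this.buf[this.read];
      this.read = (this.read + 1) % this.buf.length;
    }
    return true; // keep the processor alive for the whole call
  }
}
registerProcessor('ring-buffer-player', RingBufferPlayer);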

The visual call screen

The bits that make a web page feel like a call screen, not like a settings page:

Mic permissions UX

Mic permissions are fragile, especially on iOS Safari. Two non-negotiable rules:

Best practice: a "Tap to start the call" splash that explicitly says mic access will be requested. No auto-prompt on page load — looks dodgy and gets denied.

6. Picking the loop

STT — streaming, not Whisper

Whisper API is batch-only. For a real call we need streaming. The realistic options:

| Provider | Endpoint | TTFT P50 | Pricing | Notes |
| --- | --- | --- | --- | --- |
| Deepgram Nova-3 / Flux | WS | <300ms (Flux: <150ms) | $0.0077/min PAYG | Workhorse. Industry default. Workers AI also exposes it. |
| AssemblyAI Universal-3 Pro Streaming | WS | ~150ms | $0.0025/min | Best entity accuracy. Cheapest realistic option. |
| OpenAI Whisper | HTTPS (batch) | ~1s+ for a 5s clip | $0.006/min | What Cosmo uses today. No streaming. Fatal for a call. |
| OpenAI Realtime API | WS or WebRTC | ~500ms | Token-based audio pricing | Different model. Speech-to-speech; conflicts with "brain stays as-is." |

Streaming STT is non-negotiable for the call to feel right. Deepgram is the safe pick; AssemblyAI is cheaper. Whisper is out for the call path (still fine for voice notes).
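To make the WS shape concrete, a server-side sketch against Deepgram's live endpoint. Query params and message shape follow their docs as of the research; verify before build, and the two handler functions are hypothetical:

// Sketch: relay PCM from the browser into Deepgram live STT (Node, `ws` package).
import WebSocket from 'ws';

const dg = new WebSocket(
  'wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true',
  { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } },
);
dg.on('message', (raw) => {
  const msg = JSON.parse(raw);
  const alt = msg.channel?.alternatives?.[0];
  if (!alt?.transcript) return;
  if (msg.is_final) finaliseUtterance(alt.transcript); // hypothetical handler → agent turn
  else showInterim(alt.transcript);                    // hypothetical handler → UI captions
});
// elsewhere: browserWs.on('message', (pcm) => dg.send(pcm));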

TTS — the ElevenLabs vs OpenAI fork

This is the most interesting tradeoff in the stack.

OpenAI tts-1 with voice "nova" — continuity
  • Same voice the existing Telegram voice notes use. Cosmo sounds like Cosmo across surfaces.
  • HTTP chunked streaming only — no WebSocket TTS. Confirmed via OpenAI docs and community thread.
  • ~500ms TTFB; sentence-buffering at our end to keep playback smooth.
  • Already on the bill. Already provisioned. Zero new vendor.
  • To get true WebSocket TTS from OpenAI you have to use the full Realtime speech-to-speech API, which conflicts with the brain-stays-as-is rule.
ElevenLabs Flash v2.5 over WebSocket — latency
  • True bidirectional WebSocket: text-stream in, audio-stream out. Prosody preserved across chunks.
  • ~50-75ms model TTFB. ~400-500ms total with network in real conditions.
  • Recommended by Anthropic's own cookbook for low-latency Claude voice.
  • New voice for Cosmo. Different sound across surfaces. Voice cloning is an option (Cosmo could sound like a custom voice we pick).
  • ~$0.30/1k chars on Pro tier. New monthly bill.

The decision is: continuity vs latency vs voice quality vs new bill. There isn't a wrong answer; there's a preference.

Side note: Cartesia Sonic Turbo has the lowest model TTFB in the comparison set (~40ms) and is a strong technical choice, but the user explicitly named ElevenLabs and OpenAI as the two providers in scope, so it's a footnote, not a contender.
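If the fork lands on ElevenLabs, the stream-input socket looks roughly like this. Endpoint and message shape follow their docs as of the research; treat it as a sketch to verify, and pushToRingBuffer is a hypothetical playback handoff:

// Sketch: ElevenLabs Flash v2.5 over the stream-input WebSocket (Node, `ws` package).
import WebSocket from 'ws';

const VOICE_ID = 'your-voice-id'; // placeholder
const el = new WebSocket(
  `wss://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream-input?model_id=eleven_flash_v2_5`,
);
el.on('open', () => {
  el.send(JSON.stringify({ text: ' ', xi_api_key: process.env.ELEVENLABS_API_KEY })); // handshake
});
el.on('message', (raw) => {
  const msg = JSON.parse(raw);
  if (msg.audio) pushToRingBuffer(Buffer.from(msg.audio, 'base64'));
});
// per sentence from the LLM stream:
//   el.send(JSON.stringify({ text: sentence + ' ' }));
// end of turn:
//   el.send(JSON.stringify({ text: '' }));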

Brain — keeping it unchanged, with one switch

The Agent SDK supports token-level streaming via includePartialMessages: true. Today's executeClaudeCode() doesn't set it. With it on, the iterator additionally yields stream_event chunks containing raw Anthropic API streaming events. Per the docs:

for await (const message of query({ prompt, options: { includePartialMessages: true, ...existing } })) {
  if (message.type === "stream_event") {
    const event = message.event;
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      // ship event.delta.text to TTS sentence buffer
    }
    if (event.type === "content_block_start" && event.content_block.type === "tool_use") {
      // emit "looking at your <tool name>..." filler audio
    }
  }
  // existing AssistantMessage and result handling unchanged
}

Telegram path is untouched (it ignores stream_event); call path adds the new branch. Per-turn voice-mode system prompt overlay is one extra append on the system prompt when the surface is call.
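The "TTS sentence buffer" in the snippet above is small enough to show. A sketch; the boundary regex is naive on purpose, since abbreviations and decimal numbers need real handling:

// Accumulate token deltas, flush whole sentences to TTS so prosody stays intact.
function makeSentenceBuffer(onSentence) {
  let buf = '';
  return {
    push(delta) {
      buf += delta;
      let m;
      while ((m = buf.match(/[.!?]["')\]]*\s/)) !== null) {
        const end = m.index + m[0].length;
        onSentence(buf.slice(0, end).trim());
        buf = buf.slice(end);
      }
    },
    flush() { // call at turn end for any trailing fragment
      if (buf.trim()) onSentence(buf.trim());
      buf = '';
    },
  };
}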

Real spec-blocker discovered. The docs say StreamEvent messages are not emitted when explicit maxThinkingTokens is set. Cosmo currently uses thinking: { type: 'adaptive' }. The docs further say "thinking is disabled by default in the SDK, so streaming works unless you enable it" — which suggests adaptive thinking may suppress stream events. This needs a 30-min spike to verify before we commit to token-level streaming. If incompatible, voice-mode either disables adaptive thinking on call turns, or accepts whole-block streaming with sentence-buffering at TTS time (slightly higher TTFB, still works).

Voice-mode system prompt

From the prompting research (ElevenLabs guide, LiveKit prompting voice agents, Vapi guide), voice-tuned prompts produce responses 60-70% shorter than text equivalents. The overlay we'd append:

Aborting mid-stream

Today's interrupt path goes through Firestore (cancelRequest(), polled by the agent). For a call, the abort needs to be sub-100ms — too slow if it round-trips Firestore. The Agent SDK accepts an AbortController; the call path should hold it locally and abort directly when VAD fires onSpeechStart. Worth confirming in the SDK source before locking it in.
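Assuming the SDK's abortController option behaves as documented, the call path would look like this sketch; the browser-to-server message shape is ours:

// Per-turn abort held in process memory; no Firestore round-trip.
const controller = new AbortController();
const turn = query({ prompt, options: { abortController: controller, ...voiceOptions } }); // voiceOptions: hypothetical
// iterate `turn` exactly as in the streaming snippet above

browserWs.on('message', (raw) => {
  const msg = JSON.parse(raw);
  if (msg.type === 'barge-in') { // sent when VAD fires onSpeechStart in the browser
    controller.abort();          // stop the agent turn
    ttsSocket.close();           // stop audio generation mid-sentence
    // the browser clears its own ring buffer locally, so audio stops in <100ms
  }
});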

7. Initiation surfaces

The same minting endpoint serves all surfaces. The only thing that varies is what context the surface ships in the mint request.

Telegram /call

  1. User types /call in any chat.
  2. Bot reads chat context (existing getChatContextWithMigration), POSTs { surface: 'telegram', telegramChatKey, recentTurns, project } to /call/mint.
  3. Endpoint returns a single-use URL (TTL: 5 min to claim; once claimed, the session lives for the duration of the call).
  4. Bot replies with the URL as a clickable link.
  5. User taps → browser opens the call screen.
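The mint endpoint behind step 2 is a dozen lines. A sketch in Express style; the domain, port, and field names are placeholders:

// Hypothetical /call/mint handler shared by all surfaces.
import crypto from 'node:crypto';

app.post('/call/mint', async (req, res) => {
  const token = crypto.randomBytes(16).toString('base64url');
  await db.collection('sessions').add({
    ...req.body,                                // surface + whatever context the surface ships
    token,
    tokenExpiresAt: Date.now() + 5 * 60 * 1000, // 5-minute claim window from step 3
    status: 'minted',
    createdAt: Date.now(),
  });
  res.json({ token, url: `https://cosmo-call.example/c/${token}` }); // placeholder domain
});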

Claude Code /call

  1. Slash command (~/.claude/commands/call.md) shells out to a node script.
  2. Script reads the most recent transcript line for current sessionId, cwd, gitBranch.
  3. POSTs { surface: 'claude-code', claudeSessionId, cwd, gitBranch, transcriptPath } to /call/mint.
  4. Prints the URL into the response. User cmd-clicks.

An optional SessionStart hook (off by default until we've been running it a while) can pre-mint a session doc on Claude Code launch so even if you don't /call, your CC sessions are first-class in the same Firestore collection — useful for cross-session memory and "what was I doing yesterday" queries.

Mac menu-bar / native (later)

A small menu-bar app could open a fixed URL like cosmo-call.../c/menubar?fresh=1 to start an "ambient" call with no parent session — pure new conversation, no context inherited. Or it can read the system clipboard for a token URL the user just got and open straight to it. Out of scope for v1, easy to add later.

8. What v1 looks like end-to-end

sequenceDiagram
  participant User
  participant Telegram
  participant Bot as cosmo-bot
  participant Mint as /call/mint
  participant FS as Firestore (sessions)
  participant Browser
  participant Voice as cosmo-voice (new process)
  participant STT as Streaming STT
  participant Brain as cosmo-agent (unchanged)
  participant TTS as Streaming TTS
  User->>Telegram: /call
  Telegram->>Bot: command
  Bot->>Mint: POST { surface, context }
  Mint->>FS: create session doc
  Mint-->>Bot: { url, token }
  Bot-->>User: clickable URL
  User->>Browser: tap URL
  Browser->>Voice: WS connect with token
  Voice->>FS: resolve session
  loop per turn
    User->>Browser: speak
    Browser->>Voice: PCM chunks (WS)
    Voice->>STT: stream
    STT-->>Voice: transcript deltas
    Voice->>Brain: query() with includePartialMessages
    Brain-->>Voice: token deltas
    Voice->>TTS: text stream
    TTS-->>Voice: audio chunks
    Voice-->>Browser: audio chunks (WS)
    Browser->>User: speakers
    User->>Browser: barge-in (VAD onSpeechStart)
    Browser->>Voice: cancel
    Voice->>Brain: AbortController.abort()
    Voice->>TTS: close
    Voice->>Browser: clear ring buffer
  end
  User->>Browser: end call
  Browser->>Voice: close WS
  Voice->>FS: session ended, transcript saved

New components

What stays exactly the same

9. What's locked, what's open

| Decision | Status | Notes |
| --- | --- | --- |
| Brain stays as Cosmo agent | locked | The constraint that defines the whole shape of v1. |
| Web is the v1 transport | locked | Telephony is phase 2. |
| /call in Telegram + Claude Code | locked | Token-bearing URL pattern. Same mint endpoint serves both. |
| Tokens carry session context | locked | That's the whole point of the URL pattern. |
| Sessions collection introduced | locked in principle | Additive vs migration is open. Naming is open. |
| Browser-side Silero VAD for barge-in | locked | @ricky0123/vad-web. The pattern is too solved to relitigate. |
| AudioWorklet ring-buffer playback | locked | Same. The right pattern for low-latency streaming TTS. |
| PWA call screen with wake lock + theme color + safe areas | locked | The "feels like a call" tier. Background-audio gap on iOS is accepted (calls work while the screen is on). |
| STT: streaming required (Whisper out) | locked | Whisper is batch-only. Provider choice (Deepgram vs AssemblyAI) is open. |
| Voice-mode system prompt overlay | locked | Added at the getSystemPrompt() call site when surface = call. |
| TTS provider | open | OpenAI tts-1 nova (continuity) vs ElevenLabs Flash v2.5 WS (latency). |
| STT provider | open | Deepgram Nova-3 vs AssemblyAI Universal-3. |
| Adaptive thinking + streaming compatibility | open | Needs a 30-min spike. Real spec-blocker. |
| Transport: WebSocket vs WebRTC | open | WS is fine for v1; WebRTC is better for mobile / lossy networks. |
| Sessions: additive vs migration | open | Additive is recommended for v1; migration as a separate later project. |
| Sessions naming | open | sessions overloads existing terms. conversations or threads disambiguate. |
| Auto-naming approach | open | User-set / auto / hybrid. Hybrid is probably right. |
| Hosting (cosmo-voice on Mac via tunnel vs CF Workers vs hybrid) | open | Mac-via-tunnel is the lowest-friction path that mirrors existing patterns. |
| Audio retention (privacy) | open | Transcripts go to messages. Raw audio: keep / drop / how long? |
| Concurrent call cap | open | One at a time is fine for v1; the architecture should not preclude many. |

10. Decisions you need to make

In rough order of impact:

Q1. TTS provider — continuity or latency?
OpenAI tts-1 with "nova" — Cosmo sounds the same across surfaces. HTTP chunked streaming, ~500ms TTFB. Already paid for.

ElevenLabs Flash v2.5 WS — different (probably better) voice. True streaming. ~75ms model TTFB. New monthly bill (~$0.30/1k chars).

The cookbook recommends ElevenLabs. The continuity argument favours OpenAI. Your call.
Q2. STT provider — Deepgram or AssemblyAI?
Deepgram Nova-3 / Flux: industry default, <300ms TTFT, $0.0077/min.
AssemblyAI Universal-3: ~150ms TTFT, $0.0025/min (cheaper), best entity accuracy.

No wrong answer. AssemblyAI is the pragmatic pick if cost matters; Deepgram has more developer mindshare.
Q3. Sessions — collection name and scope?
Two questions in one: (a) additive or migration? Recommend additive for v1, migration later if at all. (b) name? sessions overloads three existing meanings. conversations or threads disambiguate. Pick one.
Q4. Transport — WebSocket or WebRTC?
WS is simpler and fine on good networks. WebRTC is the production path with proper jitter buffer and AEC, important for mobile / lossy networks. For "feels like a call" on a phone in the kitchen with average wifi, WS is good enough. For taking calls in the car on cell, WebRTC earns its keep.
Q5. Adaptive thinking — does it suppress streaming?
Real risk that thinking: { type: 'adaptive' } disables StreamEvent emission. Needs a 30-min spike before we commit to token-streaming. If incompatible: either turn off adaptive thinking on call turns (small quality dent) or accept whole-block streaming with sentence-buffering (slightly higher TTFB).

Should we just do this spike now — before any of the above decisions are real? Yes.
Q6. Hosting — Mac via tunnel, or pure Cloudflare?
Cosmo's brain runs on the Mac. The voice loop can either:
(a) live on the Mac as a new PM2 process (cosmo-voice), exposed via Cloudflare Tunnel — same pattern as deep-link redirector. Lowest friction.
(b) live on Cloudflare Workers + Durable Objects, relaying via tunnel back to the Mac for the brain call. More moving parts but edge-native.
Recommend (a) for v1.
Q7. Audio retention
STT transcripts will land in the existing messages collection naturally. Do we keep raw mic audio? For how long? For what purpose? Privacy decision — your call.
Q8. iOS background-audio limitation — accept or hold for native?
PWA audio dies when iOS screen locks (WebKit bug 198277). v1 either accepts that calls only work screen-on (with wake lock holding the screen on the whole time), or we wrap in a native shell later. Accepting it is fine for "talk to Cosmo at the desk / on a call screen actively in use." Calls in your pocket need the native shell.
Q9. Concurrent calls — one at a time is fine for v1?
A single user almost never has two calls open. Locking it to one keeps the architecture simple. The session-doc design doesn't preclude many; it's just one fewer thing to think about.

11. Phase 2 — phone number (note only)

When we add a real phone number later, the cleanest path is Twilio ConversationRelay, not raw Media Streams:

What would paint v1 into a corner:

v1 should structure the agent-facing interface as (input: text-stream | audio-stream) → agent → (output: text-stream) so swapping browser ↔ Twilio is a transport adapter, not an agent change.
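That seam can be stated as one function signature now, so phase 2 becomes a new transport rather than a rework. A sketch with names of our own invention:

// The transport owns audio; the loop only ever sees text. Browser and Twilio
// become interchangeable implementations of `transcriptIn` / `speakOut`.
export async function runCallLoop({ transcriptIn, speakOut }) {
  // transcriptIn: async iterable of finalised user utterances (strings)
  // speakOut: async (textStream) => void, sentence-buffered into whichever TTS the transport owns
  for await (const utterance of transcriptIn) {
    const textStream = runAgentTurn(utterance); // hypothetical wrapper around unchanged executeClaudeCode()
    await speakOut(textStream);
  }
}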

12. References

Full research findings at specs/research/voice-call-findings.md. Key external sources:

Doc lives at plans/voice-call.html. Generated Sun 3 May 2026 from research findings.
