
FEATURE-002: LiveKit Voice Call with Julia AI

Summary

A full-featured voice call with Julia AI via LiveKit Cloud. The user taps the "Start Voice Call" button, a phone-style call screen opens, and they can talk with Julia AI by voice.

Status: 🔴 Not Started (a complete rewrite is required)

Priority: Critical

Problem Statement

The current implementation has the following problems:

  1. STT (Speech-to-Text) is unreliable — the microphone is sometimes detected, sometimes not
  2. TTS works — Julia's voice is audible
  3. The code is complex and tangled — lots of legacy code, polyfills, and hacks
  4. No clear architecture — everything lives in one file, voice-call.tsx

Root Cause Analysis

Why the microphone is unreliable:

  1. iOS AudioSession — incorrect configuration or a race condition during setup
  2. registerGlobals() — WebRTC polyfills may not finish initializing in time
  3. Permissions — microphone access may be denied, or the mic may be in use by another process
  4. Event handling — LiveKit events may be dropped

What works:

  • LiveKit Cloud connection
  • Token generation
  • TTS (Deepgram Asteria)
  • Backend agent (Julia AI)

Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────┐
│                        WellNuo Lite App (iOS)                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────────┐    ┌──────────────────┐  │
│  │  Voice Tab   │───▶│  VoiceCallScreen │───▶│ LiveKit Room     │  │
│  │  (entry)     │    │  (fullscreen)    │    │ (WebRTC)         │  │
│  └──────────────┘    └──────────────────┘    └──────────────────┘  │
│                              │                        │             │
│                              ▼                        ▼             │
│                      ┌──────────────┐         ┌──────────────┐     │
│                      │useLiveKitRoom│         │ AudioSession │     │
│                      │   (hook)     │         │ (iOS native) │     │
│                      └──────────────┘         └──────────────┘     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ WebSocket + WebRTC
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        LiveKit Cloud                                 │
├─────────────────────────────────────────────────────────────────────┤
│  Room: wellnuo-{userId}-{timestamp}                                  │
│  Participants: user + julia-agent                                    │
│  Audio Tracks: bidirectional                                        │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Agent dispatch
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Julia AI Agent (Python)                         │
├─────────────────────────────────────────────────────────────────────┤
│  STT: Deepgram Nova-2                                               │
│  LLM: WellNuo voice_ask API                                         │
│  TTS: Deepgram Aura Asteria                                         │
│  Framework: LiveKit Agents SDK 1.3.11                               │
└─────────────────────────────────────────────────────────────────────┘

Data Flow

User speaks → iOS Mic → WebRTC → LiveKit Cloud → Agent → Deepgram STT
                                                            │
                                                            ▼
                                                    WellNuo API (LLM)
                                                            │
                                                            ▼
Agent receives text ← LiveKit Cloud ← WebRTC ← Deepgram TTS (audio)
                │
                ▼
        iOS Speaker → User hears Julia

Technical Requirements

Dependencies (package.json)

{
  "@livekit/react-native": "^2.x",
  "livekit-client": "^2.x",
  "expo-keep-awake": "^14.x"
}

iOS Permissions (app.json)

{
  "ios": {
    "infoPlist": {
      "NSMicrophoneUsageDescription": "WellNuo needs microphone access for voice calls with Julia AI",
      "UIBackgroundModes": ["audio", "voip"]
    }
  }
}

Token Server (already exists)

  • URL: https://wellnuo.smartlaunchhub.com/julia/token
  • Method: POST
  • Body: { "userId": "string" }
  • Response: { "success": true, "data": { "token", "roomName", "wsUrl" } }
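A minimal client for this endpoint might look like the following sketch. The request/response shapes are taken from the spec above; the error handling and function names are illustrative, not part of the existing services/livekitService.ts.

```typescript
// Hypothetical client for the token endpoint described above.
interface TokenData {
  token: string;
  roomName: string;
  wsUrl: string;
}

// Validates the { success, data } envelope described in the spec.
export function parseTokenResponse(json: any): TokenData {
  if (!json || json.success !== true || !json.data) {
    throw new Error('Token server returned an error response');
  }
  const { token, roomName, wsUrl } = json.data;
  return { token, roomName, wsUrl };
}

export async function fetchLiveKitToken(userId: string): Promise<TokenData> {
  const resp = await fetch('https://wellnuo.smartlaunchhub.com/julia/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ userId }),
  });
  if (!resp.ok) throw new Error(`Token request failed: ${resp.status}`);
  return parseTokenResponse(await resp.json());
}
```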

Implementation Steps

Phase 1: Cleanup (DELETE old code)

  • 1.1. Delete app/voice-call.tsx (current broken implementation)
  • 1.2. Keep app/(tabs)/voice.tsx (entry point) but simplify
  • 1.3. Keep services/livekitService.ts (token fetching)
  • 1.4. Keep contexts/VoiceTranscriptContext.tsx (transcript storage)
  • 1.5. Delete components/VoiceIndicator.tsx (unused)
  • 1.6. Delete polyfills/livekit-globals.ts (not needed with proper setup)

Phase 2: New Architecture

  • 2.1. Create hooks/useLiveKitRoom.ts — encapsulate all LiveKit logic
  • 2.2. Create app/voice-call.tsx — simple UI component using the hook
  • 2.3. Create utils/audioSession.ts — iOS AudioSession helper

Phase 3: useLiveKitRoom Hook

File: hooks/useLiveKitRoom.ts

interface UseLiveKitRoomOptions {
  userId: string;
  onTranscript?: (role: 'user' | 'assistant', text: string) => void;
}

interface UseLiveKitRoomReturn {
  // Connection state
  state: 'idle' | 'connecting' | 'connected' | 'reconnecting' | 'disconnected' | 'error';
  error: string | null;

  // Call info
  roomName: string | null;
  callDuration: number; // seconds

  // Audio state
  isMuted: boolean;
  isSpeaking: boolean; // agent is speaking

  // Actions
  connect: () => Promise<void>;
  disconnect: () => Promise<void>;
  toggleMute: () => void;
}
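Since the hook exposes a small state machine, it can help to make the legal transitions explicit and reject impossible ones (e.g. jumping from 'idle' straight to 'connected'). A sketch of such a guard — the transition table below is an assumption, not part of the spec:

```typescript
type CallState =
  | 'idle' | 'connecting' | 'connected'
  | 'reconnecting' | 'disconnected' | 'error';

// Hypothetical table of allowed transitions for the hook's state field.
const transitions: Record<CallState, CallState[]> = {
  idle: ['connecting'],
  connecting: ['connected', 'disconnected', 'error'],
  connected: ['reconnecting', 'disconnected', 'error'],
  reconnecting: ['connected', 'disconnected', 'error'],
  disconnected: ['connecting'],
  error: ['connecting', 'idle'],
};

// Throws on an illegal transition instead of silently corrupting state.
export function nextState(current: CallState, target: CallState): CallState {
  if (!transitions[current].includes(target)) {
    throw new Error(`Invalid transition: ${current} -> ${target}`);
  }
  return target;
}
```

Using this inside the hook's setState path makes sequencing bugs (a frequent cause of the old race conditions) fail loudly during development.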

Implementation requirements:

  1. MUST call registerGlobals() BEFORE importing livekit-client
  2. MUST configure iOS AudioSession BEFORE connecting to room
  3. MUST handle all RoomEvents properly
  4. MUST cleanup on unmount (disconnect, stop audio session)
  5. MUST handle background/foreground transitions

Phase 4: iOS AudioSession Configuration

Critical for microphone to work!

// utils/audioSession.ts
import { AudioSession } from '@livekit/react-native';
import { Platform } from 'react-native';

export async function configureAudioForVoiceCall(): Promise<void> {
  if (Platform.OS !== 'ios') return;

  // Step 1: Set Apple audio configuration
  await AudioSession.setAppleAudioConfiguration({
    audioCategory: 'playAndRecord',
    audioCategoryOptions: [
      'allowBluetooth',
      'allowBluetoothA2DP',
      'defaultToSpeaker',
      'mixWithOthers',
    ],
    audioMode: 'voiceChat',
  });

  // Step 2: Configure output
  await AudioSession.configureAudio({
    ios: {
      defaultOutput: 'speaker',
    },
  });

  // Step 3: Start session
  await AudioSession.startAudioSession();
}

export async function stopAudioSession(): Promise<void> {
  if (Platform.OS !== 'ios') return;
  await AudioSession.stopAudioSession();
}

Phase 5: Voice Call Screen UI

File: app/voice-call.tsx

Simple, clean UI:

  • Avatar with Julia "J" letter
  • Call duration timer
  • Status text (Connecting... / Connected / Julia is speaking...)
  • Mute button
  • End call button
  • Debug logs toggle (for development)

NO complex logic in this file — all LiveKit logic in the hook!
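In that spirit, even the call duration timer can stay logic-free in the component by delegating to a tiny formatter. A sketch — the m:ss display format is an assumption:

```typescript
// Hypothetical formatter for the call duration timer.
// Converts elapsed seconds (the hook's callDuration) into "m:ss".
export function formatDuration(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = Math.floor(totalSeconds % 60);
  return `${minutes}:${seconds.toString().padStart(2, '0')}`;
}
```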

Phase 6: Testing Checklist

  • 6.1. Fresh app launch → Start call → Can hear Julia greeting
  • 6.2. Speak → Julia responds → Conversation works
  • 6.3. Mute → Unmute → Still works
  • 6.4. End call → Clean disconnect
  • 6.5. App to background → Audio continues
  • 6.6. App to foreground → Still connected
  • 6.7. Multiple calls in a row → No memory leaks
  • 6.8. No microphone permission → Shows error

Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| hooks/useLiveKitRoom.ts | CREATE | Main LiveKit hook with all logic |
| utils/audioSession.ts | CREATE | iOS AudioSession helpers |
| app/voice-call.tsx | REPLACE | Simple UI using the hook |
| app/(tabs)/voice.tsx | SIMPLIFY | Just entry point, remove debug UI |
| services/livekitService.ts | KEEP | Token fetching (already works) |
| contexts/VoiceTranscriptContext.tsx | KEEP | Transcript storage |
| components/VoiceIndicator.tsx | DELETE | Not needed |
| polyfills/livekit-globals.ts | DELETE | Not needed |

Key Principles

1. Separation of Concerns

  • Hook handles ALL LiveKit/WebRTC logic
  • Screen only renders UI based on hook state
  • Utils for platform-specific code (AudioSession)

2. Proper Initialization Order

1. registerGlobals() — WebRTC polyfills
2. configureAudioForVoiceCall() — iOS audio
3. getToken() — fetch from server
4. room.connect() — connect to LiveKit
5. room.localParticipant.setMicrophoneEnabled(true) — enable mic
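Each step above must fully complete before the next one starts — overlapping them is exactly the race condition that broke the old implementation. One way to enforce this is a small sequential runner; the helper below is a sketch, and the step names passed to it would be placeholders for the real calls:

```typescript
interface Step {
  name: string;
  run: () => Promise<void>;
}

// Runs steps strictly in order; awaiting each one before starting the
// next prevents overlapping async initialization. A failure in any
// step aborts the rest of the sequence.
export async function runInOrder(
  steps: Step[],
  log: (msg: string) => void = () => {},
): Promise<void> {
  for (const step of steps) {
    log(`starting: ${step.name}`);
    await step.run();
    log(`done: ${step.name}`);
  }
}
```

The same runner works for the cleanup sequence, and the `log` callback gives the comprehensive per-step logging called for in the Notes.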

3. Proper Cleanup Order

1. room.disconnect() — leave room
2. stopAudioSession() — release iOS audio
3. Clear all refs and state

4. Error Handling

  • Every async operation wrapped in try/catch
  • User-friendly error messages
  • Automatic retry for network issues
  • Graceful degradation
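The "automatic retry for network issues" point can be sketched as a generic helper; attempt counts and backoff values below are illustrative defaults, not spec requirements:

```typescript
// Hypothetical retry wrapper for flaky network calls (e.g. token fetch).
// Retries with exponential backoff, rethrowing the last error if all
// attempts fail.
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Wait 1x, 2x, 4x... the base delay between attempts.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```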

Success Criteria

  1. User can start voice call and hear Julia greeting
  2. User can speak and Julia understands (STT works reliably)
  3. Julia responds with voice (TTS works)
  4. Conversation can continue back and forth
  5. Mute/unmute works
  6. End call cleanly disconnects
  7. No console errors or warnings
  8. Works on iOS device (not just simulator)


Notes

Why previous approach failed:

  1. Too much code in one file — voice-call.tsx had 900+ lines with all logic mixed
  2. Polyfills applied wrong — Event class polyfill was inside the component
  3. AudioSession configured too late — sometimes after connect() already started
  4. No proper error boundaries — errors silently failed
  5. Race conditions — multiple async operations without proper sequencing

What's different this time:

  1. Hook-based architecture — single source of truth for state
  2. Proper initialization sequence — documented and enforced
  3. Clean separation — UI knows nothing about WebRTC
  4. Comprehensive logging — every step logged for debugging
  5. Test-driven — write tests before implementation