
FEATURE-001: Voice Integration for Chat

Summary

Integrate voice communication with the AI in the chat screen: speech recognition for user input and text-to-speech for AI responses.

Status: 🟡 In Progress

Priority: High

Dependencies

  • expo-speech-recognition (STT)
  • expo-speech (fallback TTS)
  • react-native-sherpa-onnx-offline-tts (neural TTS - cross-platform iOS/Android)

Requirements

Functional

  1. Voice Input (STT)

    • Tap microphone button to start listening
    • Real-time transcript display
    • Auto-send when user stops speaking OR tap again to stop
    • Visual indicator when listening (pulsing animation)
  2. Voice Output (TTS)

    • AI responses are spoken automatically
    • Visual indicator when speaking
    • Stop button to interrupt speech
    • Multiple voice options (Lessac/Ryan/Alba)
  3. States & Indicators (see the sketch after this list)

    • isListening - microphone active, user speaking
    • isSpeaking - AI voice response playing
    • ttsInitialized - TTS engine ready
    • Animated pulse on microphone when listening
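
A minimal sketch of how these states might be wired in chat.tsx, using the names above. The hook import path is an assumption; everything else mirrors this spec.

```ts
// chat.tsx (sketch) - voice-related state, names per this spec
import { useRef, useState } from 'react';
import { Animated } from 'react-native';
import { useSpeechRecognition } from '../hooks/useSpeechRecognition'; // path assumed

function useVoiceState() {
  // isListening / recognizedText come from the STT hook
  const { isListening, recognizedText, startListening, stopListening } =
    useSpeechRecognition();

  const [isSpeaking, setIsSpeaking] = useState(false);         // AI voice playing
  const [ttsInitialized, setTtsInitialized] = useState(false); // TTS engine ready

  // Driven by Animated.loop while listening (see Phase 3)
  const pulseAnim = useRef(new Animated.Value(1)).current;

  return {
    isListening, recognizedText, startListening, stopListening,
    isSpeaking, setIsSpeaking, ttsInitialized, setTtsInitialized, pulseAnim,
  };
}
```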

Non-Functional

  • Works offline (SherpaTTS uses local neural models)
  • Cross-platform: iOS and Android
  • Low latency speech synthesis

Technical Design

Architecture

┌─────────────────────────────────────────────────────────┐
│                    chat.tsx                              │
├─────────────────────────────────────────────────────────┤
│  State:                                                  │
│  - isListening (from useSpeechRecognition)              │
│  - recognizedText (from useSpeechRecognition)           │
│  - isSpeaking                                            │
│  - ttsInitialized                                        │
│  - pulseAnim (Animated.Value)                           │
├─────────────────────────────────────────────────────────┤
│  Handlers:                                               │
│  - handleVoiceToggle() - start/stop listening           │
│  - handleVoiceSend() - send recognized text             │
│  - speakText(text) - speak AI response                  │
│  - stopSpeaking() - interrupt speech                    │
└─────────────────────────────────────────────────────────┘
           │                        │
           ▼                        ▼
┌─────────────────────┐  ┌─────────────────────────────────┐
│ useSpeechRecognition│  │       sherpaTTS.ts              │
│ (hooks/)            │  │       (services/)               │
├─────────────────────┤  ├─────────────────────────────────┤
│ - startListening()  │  │ - initializeSherpaTTS()        │
│ - stopListening()   │  │ - speak(text, options)         │
│ - recognizedText    │  │ - stop()                        │
│ - isListening       │  │ - isAvailable()                 │
│ - isAvailable       │  │ - setVoice(voiceId)            │
└─────────────────────┘  └─────────────────────────────────┘
           │                        │
           ▼                        ▼
┌─────────────────────┐  ┌─────────────────────────────────┐
│expo-speech-         │  │ react-native-sherpa-onnx-       │
│recognition          │  │ offline-tts (Piper VITS)        │
│(native module)      │  │ (native module)                 │
└─────────────────────┘  └─────────────────────────────────┘
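
A type-level sketch of the sherpaTTS.ts surface shown in the diagram. The option fields (rate, onDone, onError) are assumptions for illustration, not confirmed API of react-native-sherpa-onnx-offline-tts.

```ts
// services/sherpaTTS.ts (sketch) - surface matching the diagram above.
// Option fields below are assumptions; verify against the package docs.

export type VoiceId = 'lessac' | 'ryan' | 'alba';

export interface SpeakOptions {
  voice?: VoiceId;
  rate?: number;                 // speech rate, 1.0 = normal (assumed)
  onDone?: () => void;           // lets chat.tsx clear isSpeaking
  onError?: (e: Error) => void;
}

export declare function initializeSherpaTTS(): Promise<boolean>; // loads the ONNX model
export declare function speak(text: string, options?: SpeakOptions): Promise<void>;
export declare function stop(): void;                            // interrupt playback
export declare function isAvailable(): boolean;                  // true once model is loaded
export declare function setVoice(voiceId: VoiceId): void;
```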

Available Piper Voices

| ID     | Name   | Gender | Accent | Model               |
|--------|--------|--------|--------|---------------------|
| lessac | Lessac | Female | US     | en_US-lessac-medium |
| ryan   | Ryan   | Male   | US     | en_US-ryan-medium   |
| alba   | Alba   | Female | UK     | en_GB-alba-medium   |
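
In code, this table can live as a small registry; the sketch below simply mirrors the rows above.

```ts
// Voice registry mirroring the table above (sketch)
export const PIPER_VOICES = {
  lessac: { name: 'Lessac', gender: 'female', accent: 'US', model: 'en_US-lessac-medium' },
  ryan:   { name: 'Ryan',   gender: 'male',   accent: 'US', model: 'en_US-ryan-medium' },
  alba:   { name: 'Alba',   gender: 'female', accent: 'UK', model: 'en_GB-alba-medium' },
} as const;

export type VoiceId = keyof typeof PIPER_VOICES; // 'lessac' | 'ryan' | 'alba'
```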

Voice Flow

User taps mic button
        │
        ▼
  handleVoiceToggle()
        │
  ┌─────┴─────┐
  │ isListening?│
  └─────┬─────┘
        │
   NO   │   YES
   │    │    │
   │    │    ▼
   │    │ stopListening()
   │    │ handleVoiceSend()
   │    │
   ▼    │
startListening()
   │
   ▼
Speech Recognition active
(recognizedText updates)
   │
   ▼
User stops speaking / taps again
   │
   ▼
handleVoiceSend()
   │
   ▼
sendMessage(recognizedText)
   │
   ▼
AI responds
   │
   ▼
speakText(response)
   │
   ▼
SherpaTTS plays audio
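
The toggle and send steps of this flow, as a sketch. Handler and state names follow this spec; sendMessage is the chat screen's existing send function.

```ts
// chat.tsx (sketch) - the toggle/send path from the flow above
const handleVoiceToggle = async () => {
  if (isSpeaking) return;              // B1: block mic while AI is speaking

  if (isListening) {
    await stopListening();
    handleVoiceSend();                 // tap again = stop and send
  } else {
    await startListening();            // recognizedText updates in real time
  }
};

const handleVoiceSend = () => {
  const text = recognizedText.trim();
  if (!text) return;                   // B3/C4: nothing recognized -> don't send
  sendMessage(text);                   // AI response is then auto-spoken via speakText()
};
```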

Implementation Steps

Phase 1: Setup (DONE)

  • Add dependencies to package.json
  • Create sherpaTTS.ts service
  • Create useSpeechRecognition.ts hook
  • Add voice imports to chat.tsx
  • Add voice states (isListening, isSpeaking, ttsInitialized, pulseAnim)
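
A skeleton of the useSpeechRecognition hook matching the surface in the architecture diagram. The ExpoSpeechRecognitionModule and useSpeechRecognitionEvent names, and the event payload shape, are assumptions about expo-speech-recognition's API; check them against the package docs.

```ts
// hooks/useSpeechRecognition.ts (sketch)
// Module and event names below are assumed; verify before use.
import { useState } from 'react';
import {
  ExpoSpeechRecognitionModule,
  useSpeechRecognitionEvent,
} from 'expo-speech-recognition';

export function useSpeechRecognition() {
  const [isListening, setIsListening] = useState(false);
  const [recognizedText, setRecognizedText] = useState('');

  useSpeechRecognitionEvent('result', (event) => {
    // Real-time transcript display (functional requirement 1)
    setRecognizedText(event.results[0]?.transcript ?? '');
  });
  useSpeechRecognitionEvent('end', () => setIsListening(false));

  const startListening = async () => {
    setRecognizedText('');
    setIsListening(true);
    ExpoSpeechRecognitionModule.start({ lang: 'en-US', interimResults: true }); // C6
  };

  const stopListening = async () => {
    ExpoSpeechRecognitionModule.stop();
    setIsListening(false);
  };

  return { isListening, recognizedText, startListening, stopListening };
}
```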

Phase 2: Logic (DONE)

  • Implement handleVoiceToggle()
  • Implement handleVoiceSend()
  • Implement speakText()
  • Implement stopSpeaking()
  • TTS initialization on component mount
  • Auto-speak AI responses
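
A sketch of speakText()/stopSpeaking() with the expo-speech fallback planned in D1. The sherpaTTS import path and option shape follow the service sketch above; Speech.speak/Speech.stop are expo-speech's API.

```ts
// chat.tsx (sketch) - speak via SherpaTTS, fall back to expo-speech (D1)
import * as Speech from 'expo-speech';
import * as sherpaTTS from '../services/sherpaTTS'; // path assumed

const speakText = async (text: string) => {
  setIsSpeaking(true);
  try {
    if (ttsInitialized && sherpaTTS.isAvailable()) {
      await sherpaTTS.speak(text, { onDone: () => setIsSpeaking(false) });
    } else {
      // Fallback: platform TTS via expo-speech
      Speech.speak(text, {
        language: 'en-US',
        onDone: () => setIsSpeaking(false),
        onError: () => setIsSpeaking(false),
      });
    }
  } catch {
    setIsSpeaking(false); // D2: reset state on TTS failure
  }
};

const stopSpeaking = () => {
  sherpaTTS.stop();
  Speech.stop();
  setIsSpeaking(false);
};
```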

Phase 3: UI (DONE)

  • Add microphone button to input area
  • Add voice status indicator (Listening.../Speaking...)
  • Add stop button for speech
  • Add pulse animation for listening state
  • Add styles for voice UI elements
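
A sketch of the pulse animation using React Native's Animated API; the scale and timing values are placeholders.

```ts
// chat.tsx (sketch, inside the component) - pulse the mic while listening
import { useEffect } from 'react';
import { Animated, Easing } from 'react-native';

useEffect(() => {
  if (!isListening) {
    pulseAnim.setValue(1); // reset when not listening
    return;
  }
  const loop = Animated.loop(
    Animated.sequence([
      Animated.timing(pulseAnim, { toValue: 1.3, duration: 500, easing: Easing.ease, useNativeDriver: true }),
      Animated.timing(pulseAnim, { toValue: 1.0, duration: 500, easing: Easing.ease, useNativeDriver: true }),
    ]),
  );
  loop.start();
  return () => loop.stop();
}, [isListening]);

// Render: <Animated.View style={{ transform: [{ scale: pulseAnim }] }}> ...mic icon... </Animated.View>
```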

Phase 4: Build & Test (IN PROGRESS)

  • Run npm install
  • Run expo prebuild --clean
  • Build iOS (native modules required)
  • Test on iOS simulator
  • Test on Android (emulator or device)

Phase 5: Polish (TODO)

  • Handle permissions properly (microphone access)
  • Add voice picker UI
  • Add speech rate control
  • Test edge cases (no network, no mic permission)

Files Modified/Created

| File                          | Status   | Description                        |
|-------------------------------|----------|------------------------------------|
| package.json                  | Modified | Added voice dependencies           |
| services/sherpaTTS.ts         | Created  | SherpaTTS service for offline TTS  |
| hooks/useSpeechRecognition.ts | Created  | Speech recognition hook            |
| app/(tabs)/chat.tsx           | Modified | Voice integration in chat          |

Testing Checklist

Manual Testing

  • Tap mic button - starts listening
  • Speak - text appears in input field
  • Tap again - sends message
  • AI responds - voice speaks response
  • Tap stop - speech stops immediately
  • Mic button disabled during sending
  • Visual indicators show correct state

Edge Cases

  • No microphone permission - shows alert
  • TTS not available - falls back to expo-speech
  • Empty speech recognition - doesn't send
  • Long AI response - speech handles gracefully
  • Interrupt speech and start new input

Notes

SherpaTTS Cross-Platform Support

  • iOS: Uses native module via bridged ObjC/Swift
  • Android: Uses native module via JNI/Kotlin
  • Model files: Must be bundled in app (assets/tts-models/)
  • Size: ~20MB per voice model

Known Limitations

  • Speech recognition requires device microphone permission
  • SherpaTTS requires native build (not Expo Go)
  • Model download may be needed on first launch

Voice Interaction Scenarios (All Cases)

State Machine

┌─────────────────────────────────────────────────────────────┐
│                     VOICE STATE MACHINE                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│    ┌──────────┐                                             │
│    │   IDLE   │◄────────────────────────────────────┐       │
│    └────┬─────┘                                     │       │
│         │ tap mic                                   │       │
│         ▼                                           │       │
│    ┌──────────┐                                     │       │
│    │LISTENING │───── user stops / tap ─────────┐   │       │
│    └────┬─────┘                                 │   │       │
│         │ recognized text                      │   │       │
│         ▼                                      │   │       │
│    ┌──────────┐                                ▼   │       │
│    │PROCESSING│─────────────────────────► SENDING  │       │
│    └────┬─────┘                              │      │       │
│         │ AI responded                       │      │       │
│         ▼                                    │      │       │
│    ┌──────────┐                              │      │       │
│    │ SPEAKING │◄─────────────────────────────┘      │       │
│    └────┬─────┘                                     │       │
│         │ finished / user tap stop                  │       │
│         └───────────────────────────────────────────┘       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
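
The same machine as a TypeScript sketch; the LISTENING to IDLE edge encodes the cancel case from scenario B3 below.

```ts
// Sketch of the state machine above as a union type plus allowed transitions
type VoiceState = 'IDLE' | 'LISTENING' | 'PROCESSING' | 'SENDING' | 'SPEAKING';

const TRANSITIONS: Record<VoiceState, VoiceState[]> = {
  IDLE:       ['LISTENING'],                      // tap mic
  LISTENING:  ['PROCESSING', 'SENDING', 'IDLE'],  // recognized text / stop-tap / cancel (B3)
  PROCESSING: ['SENDING'],
  SENDING:    ['SPEAKING'],                       // AI responded
  SPEAKING:   ['IDLE'],                           // finished / user taps stop
};

function canTransition(from: VoiceState, to: VoiceState): boolean {
  return TRANSITIONS[from].includes(to);
}
```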

A. Happy Path Scenarios

| #  | Scenario                             | Expected Behavior                             | Status |
|----|--------------------------------------|-----------------------------------------------|--------|
| A1 | User taps mic → speaks → taps again  | Text recognized → sent → AI responds → spoken |        |
| A2 | User listens to full AI response     | TTS finishes → returns to IDLE                |        |
| A3 | User stops TTS with stop button      | TTS interrupted → can tap mic again           |        |
| A4 | User types text manually             | Message sent → AI responds → spoken           |        |

B. Interruptions & Conflicts

| #  | Scenario                           | Problem                        | Solution                                  | Status  |
|----|------------------------------------|--------------------------------|-------------------------------------------|---------|
| B1 | Tap mic while AI speaking          | Mic would hear TTS             | Block mic while isSpeaking                | DONE    |
| B2 | AI speaking, user wants to stop    | No way to interrupt            | Stop button (red)                         | DONE    |
| B3 | User speaking, changes mind        | Need to cancel without sending | Tap again = cancel (no text = don't send) | DONE    |
| B4 | AI speaking, user switches tab     | Should TTS stop?               | Stop TTS on blur                          | ⚠️ TODO |
| B5 | App goes to background during TTS  | TTS continues in background?   | Platform-specific behavior                | ⚠️ TODO |
| B6 | Double/triple tap on mic           | States get confused            | Debounce + transition lock                | ⚠️ TODO |
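
A sketch of the B6 fix: a transition lock with a short cooldown so double/triple taps cannot overlap state changes. The 300 ms value is an arbitrary placeholder.

```ts
// B6 sketch: serialize mic-button taps through a transition lock
let transitioning = false;

async function guarded(action: () => Promise<void>): Promise<void> {
  if (transitioning) return;          // drop taps that arrive mid-transition
  transitioning = true;
  try {
    await action();
  } finally {
    // Short cooldown also debounces rapid taps (300 ms is a placeholder)
    setTimeout(() => { transitioning = false; }, 300);
  }
}

// Usage: onPress={() => guarded(handleVoiceToggle)}
```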

C. Speech Recognition Errors (STT)

| #  | Scenario                          | Problem                   | Solution                               | Status  |
|----|-----------------------------------|---------------------------|----------------------------------------|---------|
| C1 | No microphone permission          | Speech recognition fails  | Show permission alert + Open Settings  | DONE    |
| C2 | Microphone busy (other app)       | Can't start recording     | Show "Microphone busy" error           | ⚠️ TODO |
| C3 | User silent for 5+ seconds        | No text to send           | Auto-cancel with hint                  | ⚠️ TODO |
| C4 | Speech recognition returns empty  | Nothing recognized        | Show "Didn't catch that" + auto-hide   | DONE    |
| C5 | Network unavailable (Android)     | Recognition doesn't work  | Expo STT needs network on Android      | ⚠️ NOTE |
| C6 | Unsupported language              | Recognition works poorly  | Hardcode 'en-US'                       | DONE    |

D. Text-to-Speech Errors (TTS)

| #  | Scenario                       | Problem                      | Solution                         | Status  |
|----|--------------------------------|------------------------------|----------------------------------|---------|
| D1 | SherpaTTS not initialized      | Model not loaded             | Fallback to expo-speech          | ⚠️ TODO |
| D2 | SherpaTTS crashes mid-playback | Speech interrupted           | Handle error, reset state        | ⚠️ TODO |
| D3 | Very long AI response          | TTS plays for 2+ minutes     | Show progress or split           | ⚠️ TODO |
| D4 | TTS model not downloaded       | First launch without network | Bundle model or pre-download     | ⚠️ NOTE |
| D5 | Voice sounds bad               | Model quality issue          | Voice picker (Lessac/Ryan/Alba)  | ⚠️ TODO |

E. UI Edge Cases

| #  | Scenario                          | Problem                          | Solution                      | Status  |
|----|-----------------------------------|----------------------------------|-------------------------------|---------|
| E1 | TextInput focused + tap mic       | Keyboard in the way              | Hide keyboard when listening  | ⚠️ TODO |
| E2 | User typing + taps mic            | What to do with typed text?      | Keep or replace?              | ⚠️ TODO |
| E3 | Scroll chat during TTS            | Unclear which message is playing | Highlight speaking message    | ⚠️ TODO |
| E4 | Multiple messages queued          | Which one to speak?              | Only latest AI message        | DONE    |
| E5 | AI responds in chunks (streaming) | When to start TTS?               | After full response           | DONE    |
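
A sketch of the E4/E5 rules inside the chat component. The message shape (id, role, content, isStreaming) is an assumption about the existing chat state.

```ts
// E4/E5 sketch: speak only the latest, fully received AI message
const lastSpokenId = useRef<string | null>(null); // useRef from 'react'

useEffect(() => {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'assistant') return;
  if (last.isStreaming) return;                   // E5: wait for the full response
  if (lastSpokenId.current === last.id) return;   // E4: never re-speak on re-render
  lastSpokenId.current = last.id;
  speakText(last.content);
}, [messages]);
```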

F. Permission Scenarios

| #  | Scenario                           | Action                    | Status  |
|----|------------------------------------|---------------------------|---------|
| F1 | First launch - no permission       | Show custom UI → request  | ⚠️ TODO |
| F2 | Permission denied before           | Open Settings app         | ⚠️ TODO |
| F3 | Permission "Ask Every Time" (iOS)  | Request each time         | ⚠️ TODO |
| F4 | Permission revoked during session  | Graceful degradation      | ⚠️ TODO |
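
A sketch for F1/F2. Alert and Linking.openSettings() are standard React Native; requestPermissionsAsync() on ExpoSpeechRecognitionModule is an assumption to verify against expo-speech-recognition's docs.

```ts
// F1/F2 sketch - request mic permission, or route the user to Settings
import { Alert, Linking } from 'react-native';
import { ExpoSpeechRecognitionModule } from 'expo-speech-recognition'; // name assumed

async function ensureMicPermission(): Promise<boolean> {
  const result = await ExpoSpeechRecognitionModule.requestPermissionsAsync(); // assumed API
  if (result.granted) return true;

  Alert.alert(
    'Microphone needed',
    'Enable microphone access in Settings to use voice input.',
    [
      { text: 'Cancel', style: 'cancel' },
      { text: 'Open Settings', onPress: () => Linking.openSettings() }, // F2
    ],
  );
  return false;
}
```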

Implementation Priority

🔴 Critical (voice won't work without these):

  • B1: Block mic during speaking DONE
  • B2: Stop button DONE
  • C1: Permission handling
  • D1: TTS fallback

🟡 Important (UX suffers without these):

  • B3: Cancel recording without sending
  • C3: Timeout on silence
  • C4: "Didn't catch that" feedback
  • E1: Hide keyboard
  • E3: Visual indicator for speaking message

🟢 Nice to have:

  • B4-B5: Background behavior
  • E5: Streaming TTS
  • Voice picker UI