# FEATURE-001: Voice Integration for Chat

## Summary

Integrate voice interaction with the AI in the chat screen: speech recognition for input and text-to-speech for AI responses.

## Status: 🟑 In Progress

## Priority: High

## Dependencies

- expo-speech-recognition (STT)
- expo-speech (fallback TTS)
- react-native-sherpa-onnx-offline-tts (neural TTS, cross-platform iOS/Android)

---

## Requirements

### Functional

1. **Voice Input (STT)**
   - Tap the microphone button to start listening
   - Real-time transcript display
   - Auto-send when the user stops speaking, OR tap again to stop
   - Visual indicator while listening (pulsing animation)

2. **Voice Output (TTS)**
   - AI responses are spoken automatically
   - Visual indicator while speaking
   - Stop button to interrupt speech
   - Multiple voice options (Lessac/Ryan/Alba)

3. **States & Indicators**
   - `isListening` - microphone active, user speaking
   - `isSpeaking` - AI voice response playing
   - `ttsInitialized` - TTS engine ready
   - Animated pulse on the microphone while listening

### Non-Functional

- Works offline (SherpaTTS uses local neural models)
- Cross-platform: iOS and Android
- Low-latency speech synthesis

---

## Technical Design

### Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ chat.tsx                                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ State:                                           β”‚
β”‚ - isListening (from useSpeechRecognition)        β”‚
β”‚ - recognizedText (from useSpeechRecognition)     β”‚
β”‚ - isSpeaking                                     β”‚
β”‚ - ttsInitialized                                 β”‚
β”‚ - pulseAnim (Animated.Value)                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ Handlers:                                        β”‚
β”‚ - handleVoiceToggle() - start/stop listening     β”‚
β”‚ - handleVoiceSend() - send recognized text       β”‚
β”‚ - speakText(text) - speak AI response            β”‚
β”‚ - stopSpeaking() - interrupt speech              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                             β”‚
           β–Ό                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ useSpeechRecognition β”‚   β”‚ sherpaTTS.ts             β”‚
β”‚ (hooks/)             β”‚   β”‚ (services/)              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ - startListening()   β”‚   β”‚ - initializeSherpaTTS()  β”‚
β”‚ - stopListening()    β”‚   β”‚ - speak(text, options)   β”‚
β”‚ - recognizedText     β”‚   β”‚ - stop()                 β”‚
β”‚ - isListening        β”‚   β”‚ - isAvailable()          β”‚
β”‚ - isAvailable        β”‚   β”‚ - setVoice(voiceId)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                             β”‚
           β–Ό                             β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ expo-speech-         β”‚   β”‚ react-native-sherpa-onnx-β”‚
β”‚ recognition          β”‚   β”‚ offline-tts (Piper VITS) β”‚
β”‚ (native module)      β”‚   β”‚ (native module)          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
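To make the wiring concrete, here is a minimal sketch of the chat.tsx handlers, assuming the `useSpeechRecognition` hook and `sherpaTTS` service expose exactly the members listed in the diagram. The import paths and the `sendMessage` callback are illustrative placeholders, not the app's actual API.

```tsx
// Sketch only: import paths and sendMessage are illustrative, not the real app code.
import { useState } from 'react';
import { useSpeechRecognition } from '../hooks/useSpeechRecognition';
import * as sherpaTTS from '../services/sherpaTTS';

export function useVoiceChat(sendMessage: (text: string) => Promise<string>) {
  const { isListening, recognizedText, startListening, stopListening } =
    useSpeechRecognition();
  const [isSpeaking, setIsSpeaking] = useState(false);

  // Speak an AI response through the SherpaTTS service.
  const speakText = async (text: string) => {
    setIsSpeaking(true);
    try {
      await sherpaTTS.speak(text, {});
    } finally {
      setIsSpeaking(false);
    }
  };

  // Red stop button: interrupt playback immediately.
  const stopSpeaking = async () => {
    await sherpaTTS.stop();
    setIsSpeaking(false);
  };

  // Send whatever was recognized; an empty transcript is treated as a cancel.
  const handleVoiceSend = async () => {
    const text = recognizedText.trim();
    if (!text) return;
    const response = await sendMessage(text);
    await speakText(response);
  };

  // Mic button: start listening, or stop and send; ignored while TTS is playing.
  const handleVoiceToggle = async () => {
    if (isSpeaking) return;
    if (isListening) {
      await stopListening();
      await handleVoiceSend();
    } else {
      await startListening();
    }
  };

  return { isListening, isSpeaking, handleVoiceToggle, handleVoiceSend, speakText, stopSpeaking };
}
```

If the existing `sendMessage` does not return the assistant's text directly, `speakText` can instead be triggered from an effect that watches the message list, matching the "auto-speak AI responses" step in Phase 2.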
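The pulsing microphone indicator can be driven with the standard React Native `Animated` API. The hook below is a sketch of that idea; in chat.tsx the `pulseAnim` value lives directly in the component state listed above rather than in a separate hook.

```tsx
// Sketch: loop a scale animation on the mic button while isListening is true.
import { useEffect, useRef } from 'react';
import { Animated, Easing } from 'react-native';

export function useListeningPulse(isListening: boolean) {
  const pulseAnim = useRef(new Animated.Value(1)).current;

  useEffect(() => {
    if (!isListening) {
      pulseAnim.setValue(1); // reset to normal size when the mic is off
      return;
    }
    const loop = Animated.loop(
      Animated.sequence([
        Animated.timing(pulseAnim, {
          toValue: 1.3,
          duration: 500,
          easing: Easing.inOut(Easing.ease),
          useNativeDriver: true,
        }),
        Animated.timing(pulseAnim, {
          toValue: 1,
          duration: 500,
          easing: Easing.inOut(Easing.ease),
          useNativeDriver: true,
        }),
      ]),
    );
    loop.start();
    return () => loop.stop();
  }, [isListening, pulseAnim]);

  return pulseAnim; // use as style={{ transform: [{ scale: pulseAnim }] }}
}
```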
### Available Piper Voices

| ID | Name | Gender | Accent | Model |
|----|------|--------|--------|-------|
| lessac | Lessac | Female | US | en_US-lessac-medium |
| ryan | Ryan | Male | US | en_US-ryan-medium |
| alba | Alba | Female | UK | en_GB-alba-medium |

### Voice Flow

```
User taps mic button
          β”‚
          β–Ό
 handleVoiceToggle()
          β”‚
    β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ isListening? β”‚
    β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      NO  β”‚  YES
      β”‚       β”‚
      β”‚       β–Ό
      β”‚   stopListening()
      β”‚   handleVoiceSend()
      β–Ό
startListening()
      β”‚
      β–Ό
Speech Recognition active (recognizedText updates)
      β”‚
      β–Ό
User stops speaking / taps again
      β”‚
      β–Ό
handleVoiceSend()
      β”‚
      β–Ό
sendMessage(recognizedText)
      β”‚
      β–Ό
AI responds
      β”‚
      β–Ό
speakText(response)
      β”‚
      β–Ό
SherpaTTS plays audio
```
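A sketch of what the `sherpaTTS.ts` service could look like is below. The voice IDs and model names come straight from the table above, and `Speech.speak` / `Speech.stop` are real expo-speech calls used for the fallback path; the neural-synthesis steps are left as placeholder comments because the exact `react-native-sherpa-onnx-offline-tts` API is not documented here.

```ts
// services/sherpaTTS.ts - sketch only; the native sherpa-onnx calls are placeholders.
import * as Speech from 'expo-speech';

// Voice IDs map to the bundled Piper models from the table above.
const VOICES = {
  lessac: 'en_US-lessac-medium',
  ryan: 'en_US-ryan-medium',
  alba: 'en_GB-alba-medium',
} as const;

type VoiceId = keyof typeof VOICES;

let initialized = false;
let currentVoice: VoiceId = 'lessac';

export async function initializeSherpaTTS(): Promise<boolean> {
  try {
    // Placeholder: load the ONNX model for VOICES[currentVoice] via the native module.
    initialized = true;
  } catch {
    initialized = false; // keep expo-speech as the fallback engine
  }
  return initialized;
}

export function isAvailable(): boolean {
  return initialized;
}

export function setVoice(voiceId: VoiceId): void {
  currentVoice = voiceId;
}

export async function speak(text: string, options: { rate?: number } = {}): Promise<void> {
  if (initialized) {
    // Placeholder: synthesize with the neural model and resolve when playback ends.
    return;
  }
  // Fallback path: system TTS via expo-speech, resolving when playback finishes.
  await new Promise<void>((resolve, reject) => {
    Speech.speak(text, {
      rate: options.rate,
      onDone: () => resolve(),
      onError: () => reject(new Error('expo-speech playback failed')),
    });
  });
}

export async function stop(): Promise<void> {
  if (!initialized) {
    Speech.stop();
    return;
  }
  // Placeholder: stop native playback.
}
```

chat.tsx calls `initializeSherpaTTS()` on mount and can check `isAvailable()` to know which engine will actually play.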
---

## Implementation Steps

### Phase 1: Setup (DONE)
- [x] Add dependencies to package.json
- [x] Create sherpaTTS.ts service
- [x] Create useSpeechRecognition.ts hook
- [x] Add voice imports to chat.tsx
- [x] Add voice states (isListening, isSpeaking, ttsInitialized, pulseAnim)

### Phase 2: Logic (DONE)
- [x] Implement handleVoiceToggle()
- [x] Implement handleVoiceSend()
- [x] Implement speakText()
- [x] Implement stopSpeaking()
- [x] TTS initialization on component mount
- [x] Auto-speak AI responses

### Phase 3: UI (DONE)
- [x] Add microphone button to input area
- [x] Add voice status indicator (Listening.../Speaking...)
- [x] Add stop button for speech
- [x] Add pulse animation for listening state
- [x] Add styles for voice UI elements

### Phase 4: Build & Test (IN PROGRESS)
- [ ] Run npm install
- [ ] Run expo prebuild --clean
- [ ] Build iOS (native modules required)
- [ ] Test on iOS simulator
- [ ] Test on Android (emulator or device)

### Phase 5: Polish (TODO)
- [ ] Handle permissions properly (microphone access)
- [ ] Add voice picker UI
- [ ] Add speech rate control
- [ ] Test edge cases (no network, no mic permission)

---

## Files Modified/Created

| File | Status | Description |
|------|--------|-------------|
| `package.json` | Modified | Added voice dependencies |
| `services/sherpaTTS.ts` | Created | SherpaTTS service for offline TTS |
| `hooks/useSpeechRecognition.ts` | Created | Speech recognition hook |
| `app/(tabs)/chat.tsx` | Modified | Voice integration in chat |

---

## Testing Checklist

### Manual Testing
- [ ] Tap mic button - starts listening
- [ ] Speak - text appears in input field
- [ ] Tap again - sends message
- [ ] AI responds - voice speaks response
- [ ] Tap stop - speech stops immediately
- [ ] Mic button disabled during sending
- [ ] Visual indicators show correct state

### Edge Cases
- [ ] No microphone permission - shows alert
- [ ] TTS not available - falls back to expo-speech
- [ ] Empty speech recognition - doesn't send
- [ ] Long AI response - speech handles gracefully
- [ ] Interrupt speech and start new input

---

## Notes

### SherpaTTS Cross-Platform Support
- **iOS**: Uses a native module via bridged ObjC/Swift
- **Android**: Uses a native module via JNI/Kotlin
- **Model files**: Must be bundled in the app (assets/tts-models/)
- **Size**: ~20MB per voice model

### Known Limitations
- Speech recognition requires device microphone permission
- SherpaTTS requires a native build (not Expo Go)
- Model download may be needed on first launch

---

## Voice Interaction Scenarios (All Cases)

### State Machine

```
                   VOICE STATE MACHINE

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚   IDLE   │◄──────────────────────────────────────┐
  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                                       β”‚
       β”‚ tap mic                                      β”‚
       β–Ό                                              β”‚
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                        β”‚
  β”‚LISTENING │───── user stops / tap ──────┐          β”‚
  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                             β”‚          β”‚
       β”‚ recognized text                   β”‚          β”‚
       β–Ό                                   β”‚          β”‚
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β–Ό          β”‚
  β”‚PROCESSING│────────────────────────► SENDING       β”‚
  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                             β”‚          β”‚
       β”‚ AI responded                      β”‚          β”‚
       β–Ό                                   β”‚          β”‚
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚          β”‚
  β”‚ SPEAKING β”‚β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                                        β”‚
       β”‚ finished / user tap stop                     β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
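One way to make this diagram enforceable in code is an explicit transition table, so illegal moves are simply ignored. The sketch below illustrates that idea only; the current implementation tracks separate `isListening` / `isSpeaking` flags rather than a single state value.

```ts
// Sketch: encode the state machine above so invalid transitions are ignored.
export type VoiceState = 'IDLE' | 'LISTENING' | 'PROCESSING' | 'SENDING' | 'SPEAKING';

const TRANSITIONS: Record<VoiceState, VoiceState[]> = {
  IDLE: ['LISTENING'],                          // tap mic
  LISTENING: ['PROCESSING', 'SENDING', 'IDLE'], // recognized text / user stops / cancel
  PROCESSING: ['SENDING'],
  SENDING: ['SPEAKING', 'IDLE'],                // AI responded / (assumed) request failed
  SPEAKING: ['IDLE'],                           // finished or user taps stop
};

export function canTransition(from: VoiceState, to: VoiceState): boolean {
  return TRANSITIONS[from].includes(to);
}

// A double-tap on the mic while a transition is already in flight (scenario B6
// below) keeps the current state instead of corrupting it.
export function nextState(current: VoiceState, requested: VoiceState): VoiceState {
  return canTransition(current, requested) ? requested : current;
}
```

Collapsing the booleans into one `VoiceState` value would also make scenarios like B1 and B6 below easy to reject in a single place.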
### A. Happy Path Scenarios

| # | Scenario | Expected Behavior | Status |
|---|----------|-------------------|--------|
| A1 | User taps mic β†’ speaks β†’ taps again | Text recognized β†’ sent β†’ AI responds β†’ spoken | βœ… |
| A2 | User listens to full AI response | TTS finishes β†’ returns to IDLE | βœ… |
| A3 | User stops TTS with stop button | TTS interrupted β†’ can tap mic again | βœ… |
| A4 | User types text manually | Message sent β†’ AI responds β†’ spoken | βœ… |

### B. Interruptions & Conflicts

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| B1 | Tap mic while AI is speaking | Mic would hear the TTS | Block mic while `isSpeaking` | βœ… DONE |
| B2 | AI speaking, user wants to stop | No way to interrupt | Stop button (red) | βœ… DONE |
| B3 | User speaking, changes mind | Need to cancel without sending | Tap again = cancel (no text = don't send) | βœ… DONE |
| B4 | AI speaking, user switches tab | Should TTS stop? | Stop TTS on blur | ⚠️ TODO |
| B5 | App goes to background during TTS | Does TTS continue in background? | Platform-specific behavior | ⚠️ TODO |
| B6 | Double/triple tap on mic | States get confused | Debounce + transition lock | ⚠️ TODO |

### C. Speech Recognition Errors (STT)

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| C1 | No microphone permission | Speech recognition fails | Show permission alert + Open Settings | βœ… DONE |
| C2 | Microphone busy (other app) | Can't start recording | Show "Microphone busy" error | ⚠️ TODO |
| C3 | User silent for 5+ seconds | No text to send | Auto-cancel with hint | ⚠️ TODO |
| C4 | Speech recognition returns empty | Nothing recognized | Show "Didn't catch that" + auto-hide | βœ… DONE |
| C5 | Network unavailable (Android) | Recognition doesn't work | Expo STT needs network on Android | ⚠️ NOTE |
| C6 | Unsupported language | Recognition works poorly | Hardcode 'en-US' | βœ… DONE |

### D. Text-to-Speech Errors (TTS)

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| D1 | SherpaTTS not initialized | Model not loaded | Fall back to expo-speech | ⚠️ TODO |
| D2 | SherpaTTS crashes mid-playback | Speech interrupted | Handle error, reset state | ⚠️ TODO |
| D3 | Very long AI response | TTS plays for 2+ minutes | Show progress or split | ⚠️ TODO |
| D4 | TTS model not downloaded | First launch without network | Bundle model or pre-download | ⚠️ NOTE |
| D5 | Voice sounds bad | Model quality issue | Voice picker (Lessac/Ryan/Alba) | ⚠️ TODO |

### E. UI Edge Cases

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| E1 | TextInput focused + tap mic | Keyboard is in the way | Hide keyboard when listening | ⚠️ TODO |
| E2 | User typing + taps mic | What to do with typed text? | Keep or replace? | ⚠️ TODO |
| E3 | Scroll chat during TTS | Unclear which message is playing | Highlight the speaking message | ⚠️ TODO |
| E4 | Multiple messages queued | Which one to speak? | Only the latest AI message | βœ… DONE |
| E5 | AI responds in chunks (streaming) | When to start TTS? | After the full response | βœ… DONE |

### F. Permission Scenarios

| # | Scenario | Action | Status |
|---|----------|--------|--------|
| F1 | First launch - no permission | Show custom UI β†’ request | ⚠️ TODO |
| F2 | Permission denied before | Open Settings app | ⚠️ TODO |
| F3 | Permission "Ask Every Time" (iOS) | Request each time | ⚠️ TODO |
| F4 | Permission revoked during session | Graceful degradation | ⚠️ TODO |

### Implementation Priority

**πŸ”΄ Critical (voice won't work without these):**
- B1: Block mic during speaking βœ… DONE
- B2: Stop button βœ… DONE
- C1: Permission handling
- D1: TTS fallback

**🟑 Important (UX suffers without these):**
- B3: Cancel recording without sending
- C3: Timeout on silence
- C4: "Didn't catch that" feedback
- E1: Hide keyboard
- E3: Visual indicator for speaking message

**🟒 Nice to have:**
- B4-B5: Background behavior
- E5: Streaming TTS
- Voice picker UI

---

## Related

- Main WellNuo voice.tsx (reference implementation)
- [expo-speech-recognition docs](https://docs.expo.dev/versions/latest/sdk/speech-recognition/)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)