FEATURE-001: Voice Integration for Chat
Summary
Integrate voice interaction with the AI in the chat screen: speech recognition (STT) for user input and text-to-speech (TTS) for AI responses.
Status: 🟡 In Progress
Priority: High
Dependencies
- expo-speech-recognition (STT)
- expo-speech (fallback TTS)
- react-native-sherpa-onnx-offline-tts (neural TTS - cross-platform iOS/Android)
Requirements
Functional
- Voice Input (STT)
  - Tap microphone button to start listening
  - Real-time transcript display
  - Auto-send when the user stops speaking, or tap again to stop
  - Visual indicator when listening (pulsing animation)
- Voice Output (TTS)
  - AI responses are spoken automatically
  - Visual indicator when speaking
  - Stop button to interrupt speech
  - Multiple voice options (Lessac/Ryan/Alba)
- States & Indicators
  - isListening - microphone active, user speaking
  - isSpeaking - AI voice response playing
  - ttsInitialized - TTS engine ready
  - Animated pulse on microphone when listening
Non-Functional
- Works offline (SherpaTTS uses local neural models)
- Cross-platform: iOS and Android
- Low latency speech synthesis
Technical Design
Architecture
┌─────────────────────────────────────────────────────────┐
│ chat.tsx │
├─────────────────────────────────────────────────────────┤
│ State: │
│ - isListening (from useSpeechRecognition) │
│ - recognizedText (from useSpeechRecognition) │
│ - isSpeaking │
│ - ttsInitialized │
│ - pulseAnim (Animated.Value) │
├─────────────────────────────────────────────────────────┤
│ Handlers: │
│ - handleVoiceToggle() - start/stop listening │
│ - handleVoiceSend() - send recognized text │
│ - speakText(text) - speak AI response │
│ - stopSpeaking() - interrupt speech │
└─────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────────────────┐
│ useSpeechRecognition│ │ sherpaTTS.ts │
│ (hooks/) │ │ (services/) │
├─────────────────────┤ ├─────────────────────────────────┤
│ - startListening() │ │ - initializeSherpaTTS() │
│ - stopListening() │ │ - speak(text, options) │
│ - recognizedText │ │ - stop() │
│ - isListening │ │ - isAvailable() │
│ - isAvailable │ │ - setVoice(voiceId) │
└─────────────────────┘ └─────────────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────────────────┐
│expo-speech- │ │ react-native-sherpa-onnx- │
│recognition │ │ offline-tts (Piper VITS) │
│(native module) │ │ (native module) │
└─────────────────────┘ └─────────────────────────────────┘
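The module surfaces in the diagram can be written down as TypeScript interfaces. The member names come straight from the boxes above; the parameter and return types are assumptions until checked against the actual `hooks/useSpeechRecognition.ts` and `services/sherpaTTS.ts` files.

```typescript
// Hypothetical shapes for the two voice modules in the diagram.
// Member names mirror the architecture boxes; exact types are assumed.
interface SpeechRecognitionHook {
  startListening(): void;
  stopListening(): void;
  recognizedText: string;
  isListening: boolean;
  isAvailable: boolean;
}

interface SherpaTTSService {
  initializeSherpaTTS(): Promise<boolean>;
  speak(text: string, options?: { rate?: number }): Promise<void>;
  stop(): void;
  isAvailable(): boolean;
  setVoice(voiceId: "lessac" | "ryan" | "alba"): void;
}

// Minimal in-memory stub, useful for unit-testing chat.tsx logic
// without building the native modules.
class FakeTTS implements SherpaTTSService {
  voice = "lessac";
  spoken: string[] = [];
  async initializeSherpaTTS() { return true; }
  async speak(text: string) { this.spoken.push(text); }
  stop() {}
  isAvailable() { return true; }
  setVoice(voiceId: "lessac" | "ryan" | "alba") { this.voice = voiceId; }
}
```

Typing the seams this way lets Phase 4 testing run against `FakeTTS` on any platform before the native build is ready.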
Available Piper Voices
| ID | Name | Gender | Accent | Model |
|---|---|---|---|---|
| lessac | Lessac | Female | US | en_US-lessac-medium |
| ryan | Ryan | Male | US | en_US-ryan-medium |
| alba | Alba | Female | UK | en_GB-alba-medium |
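The voice table maps naturally to a lookup used by `setVoice(voiceId)`. The bundle location `assets/tts-models/` is named later in this spec; the `.onnx` file-name pattern is an assumption based on standard Piper model naming.

```typescript
// Voice picker IDs from the table, mapped to their Piper models.
type VoiceId = "lessac" | "ryan" | "alba";

const PIPER_VOICES: Record<VoiceId, { model: string; gender: "F" | "M"; accent: "US" | "UK" }> = {
  lessac: { model: "en_US-lessac-medium", gender: "F", accent: "US" },
  ryan:   { model: "en_US-ryan-medium",   gender: "M", accent: "US" },
  alba:   { model: "en_GB-alba-medium",   gender: "F", accent: "UK" },
};

// Resolve the bundled model path for a picked voice
// (assets/tts-models/ per the Notes section; extension is assumed).
function modelPathFor(id: VoiceId): string {
  return `assets/tts-models/${PIPER_VOICES[id].model}.onnx`;
}
```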
Voice Flow
User taps mic button
│
▼
handleVoiceToggle()
│
┌─────┴─────┐
│ isListening?│
└─────┬─────┘
│
NO │ YES
│ │ │
│ │ ▼
│ │ stopListening()
│ │ handleVoiceSend()
│ │
▼ │
startListening()
│
▼
Speech Recognition active
(recognizedText updates)
│
▼
User stops speaking / taps again
│
▼
handleVoiceSend()
│
▼
sendMessage(recognizedText)
│
▼
AI responds
│
▼
speakText(response)
│
▼
SherpaTTS plays audio
Implementation Steps
Phase 1: Setup (DONE)
- Add dependencies to package.json
- Create sherpaTTS.ts service
- Create useSpeechRecognition.ts hook
- Add voice imports to chat.tsx
- Add voice states (isListening, isSpeaking, ttsInitialized, pulseAnim)
Phase 2: Logic (DONE)
- Implement handleVoiceToggle()
- Implement handleVoiceSend()
- Implement speakText()
- Implement stopSpeaking()
- TTS initialization on component mount
- Auto-speak AI responses
Phase 3: UI (DONE)
- Add microphone button to input area
- Add voice status indicator (Listening.../Speaking...)
- Add stop button for speech
- Add pulse animation for listening state
- Add styles for voice UI elements
Phase 4: Build & Test (IN PROGRESS)
- Run npm install
- Run expo prebuild --clean
- Build iOS (native modules required)
- Test on iOS simulator
- Test on Android (emulator or device)
Phase 5: Polish (TODO)
- Handle permissions properly (microphone access)
- Add voice picker UI
- Add speech rate control
- Test edge cases (no network, no mic permission)
Files Modified/Created
| File | Status | Description |
|---|---|---|
| package.json | Modified | Added voice dependencies |
| services/sherpaTTS.ts | Created | SherpaTTS service for offline TTS |
| hooks/useSpeechRecognition.ts | Created | Speech recognition hook |
| app/(tabs)/chat.tsx | Modified | Voice integration in chat |
Testing Checklist
Manual Testing
- Tap mic button - starts listening
- Speak - text appears in input field
- Tap again - sends message
- AI responds - voice speaks response
- Tap stop - speech stops immediately
- Mic button disabled during sending
- Visual indicators show correct state
Edge Cases
- No microphone permission - shows alert
- TTS not available - falls back to expo-speech
- Empty speech recognition - doesn't send
- Long AI response - speech handles gracefully
- Interrupt speech and start new input
Notes
SherpaTTS Cross-Platform Support
- iOS: Uses native module via bridged ObjC/Swift
- Android: Uses native module via JNI/Kotlin
- Model files: Must be bundled in app (assets/tts-models/)
- Size: ~20MB per voice model
Known Limitations
- Speech recognition requires device microphone permission
- SherpaTTS requires native build (not Expo Go)
- Model download may be needed on first launch
Voice Interaction Scenarios (All Cases)
State Machine
┌─────────────────────────────────────────────────────────────┐
│ VOICE STATE MACHINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ │
│ │ IDLE │◄────────────────────────────────────┐ │
│ └────┬─────┘ │ │
│ │ tap mic │ │
│ ▼ │ │
│ ┌──────────┐ │ │
│ │LISTENING │───── user stops / tap ─────────┐ │ │
│ └────┬─────┘ │ │ │
│ │ recognized text │ │ │
│ ▼ │ │ │
│ ┌──────────┐ ▼ │ │
│ │PROCESSING│─────────────────────────► SENDING │ │
│ └────┬─────┘ │ │ │
│ │ AI responded │ │ │
│ ▼ │ │ │
│ ┌──────────┐ │ │ │
│ │ SPEAKING │◄─────────────────────────────┘ │ │
│ └────┬─────┘ │ │
│ │ finished / user tap stop │ │
│ └───────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
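The diagram above can be sketched as a pure transition function, which makes the scenario tables below directly unit-testable. This is a simplification: SENDING and PROCESSING are collapsed into one state here, and the event names are assumptions, not identifiers from chat.tsx.

```typescript
// Pure state machine for the voice flow; no timers or native calls,
// so it can be tested in isolation.
type VoiceState = "IDLE" | "LISTENING" | "PROCESSING" | "SPEAKING";
type VoiceEvent =
  | "TAP_MIC"       // user taps the microphone button
  | "RECOGNIZED"    // STT finished with text (user stopped speaking)
  | "AI_RESPONDED"  // response arrived, TTS starts
  | "TTS_DONE"      // playback finished naturally
  | "TAP_STOP";     // user taps the stop button

function transition(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (state) {
    case "IDLE":
      return event === "TAP_MIC" ? "LISTENING" : state;
    case "LISTENING":
      // tap-again and end-of-speech both hand off to sending/processing
      return event === "TAP_MIC" || event === "RECOGNIZED" ? "PROCESSING" : state;
    case "PROCESSING":
      return event === "AI_RESPONDED" ? "SPEAKING" : state;
    case "SPEAKING":
      // B1: TAP_MIC is deliberately ignored while the AI is speaking
      return event === "TTS_DONE" || event === "TAP_STOP" ? "IDLE" : state;
  }
}
```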
A. Happy Path Scenarios
| # | Scenario | Expected Behavior | Status |
|---|---|---|---|
| A1 | User taps mic → speaks → taps again | Text recognized → sent → AI responds → spoken | ✅ |
| A2 | User listens to full AI response | TTS finishes → returns to IDLE | ✅ |
| A3 | User stops TTS with stop button | TTS interrupted → can tap mic again | ✅ |
| A4 | User types text manually | Message sent → AI responds → spoken | ✅ |
B. Interruptions & Conflicts
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| B1 | Tap mic while AI speaking | Mic would hear TTS | Block mic while isSpeaking | ✅ DONE |
| B2 | AI speaking, user wants to stop | No way to interrupt | Stop button (red) | ✅ DONE |
| B3 | User speaking, changes mind | Need to cancel without sending | Tap again = cancel (no text = don't send) | ✅ DONE |
| B4 | AI speaking, user switches tab | Should TTS stop? | Stop TTS on blur | ⚠️ TODO |
| B5 | App goes to background during TTS | TTS continues in background? | Platform-specific behavior | ⚠️ TODO |
| B6 | Double/triple tap on mic | States get confused | Debounce + transition lock | ⚠️ TODO |
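The B6 debounce + transition lock could be as small as a tap gate that drops presses arriving inside a cooldown window, so rapid taps can never race the state machine. The 400 ms window is an assumption to tune on device.

```typescript
// B6 sketch: reject taps that arrive within cooldownMs of the last
// accepted tap. handleVoiceToggle() would call the gate first and
// return early when it yields false.
function makeTapGate(cooldownMs = 400) {
  let lastAccepted = -Infinity;
  return (nowMs: number): boolean => {
    if (nowMs - lastAccepted < cooldownMs) return false; // rapid tap: ignore
    lastAccepted = nowMs;
    return true;
  };
}
```

Passing the timestamp in (rather than reading `Date.now()` inside) keeps the gate deterministic and testable.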
C. Speech Recognition Errors (STT)
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| C1 | No microphone permission | Speech recognition fails | Show permission alert + Open Settings | ✅ DONE |
| C2 | Microphone busy (other app) | Can't start recording | Show "Microphone busy" error | ⚠️ TODO |
| C3 | User silent for 5+ seconds | No text to send | Auto-cancel with hint | ⚠️ TODO |
| C4 | Speech recognition returns empty | Nothing recognized | Show "Didn't catch that" + auto-hide | ✅ DONE |
| C5 | Network unavailable (Android) | Recognition doesn't work | Expo STT needs network on Android | ⚠️ NOTE |
| C6 | Unsupported language | Recognition works poorly | Hardcode 'en-US' | ✅ DONE |
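For C3, the silence timeout reduces to a pure check that any timer can run: auto-cancel when the transcript has not changed for the limit. The 5000 ms default matches the scenario; wiring it to the recognizer's events is left to the implementation.

```typescript
// C3 sketch: true when the user has been silent (no transcript change)
// for at least silenceLimitMs. The caller records the timestamp of the
// last recognizedText update and polls this from an interval.
function shouldAutoCancel(
  lastTranscriptChangeMs: number,
  nowMs: number,
  silenceLimitMs = 5000,
): boolean {
  return nowMs - lastTranscriptChangeMs >= silenceLimitMs;
}
```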
D. Text-to-Speech Errors (TTS)
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| D1 | SherpaTTS not initialized | Model not loaded | Fallback to expo-speech | ⚠️ TODO |
| D2 | SherpaTTS crashes mid-playback | Speech interrupted | Handle error, reset state | ⚠️ TODO |
| D3 | Very long AI response | TTS plays for 2+ minutes | Show progress or split | ⚠️ TODO |
| D4 | TTS model not downloaded | First launch without network | Bundle model or pre-download | ⚠️ NOTE |
| D5 | Voice sounds bad | Model quality issue | Voice picker (Lessac/Ryan/Alba) | ⚠️ TODO |
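The D1 fallback can be a small chain: try SherpaTTS first, degrade to expo-speech when it is uninitialized or throws. Both engines are injected so the chain is testable without native modules; the real call sites in `speakText()` are assumptions.

```typescript
// D1 sketch: graceful degradation from the neural engine to the
// platform engine. Returns which engine actually spoke, so the UI
// can surface reduced quality if desired.
type SpeakFn = (text: string) => Promise<void>;

async function speakWithFallback(
  text: string,
  sherpaSpeak: SpeakFn,
  expoSpeak: SpeakFn,
): Promise<"sherpa" | "expo-speech"> {
  try {
    await sherpaSpeak(text);
    return "sherpa";
  } catch {
    await expoSpeak(text); // SherpaTTS unavailable or crashed (D1/D2)
    return "expo-speech";
  }
}
```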
E. UI Edge Cases
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| E1 | TextInput focused + tap mic | Keyboard in the way | Hide keyboard when listening | ⚠️ TODO |
| E2 | User typing + taps mic | What to do with typed text? | Keep or replace? | ⚠️ TODO |
| E3 | Scroll chat during TTS | Unclear which message is playing | Highlight speaking message | ⚠️ TODO |
| E4 | Multiple messages queued | Which one to speak? | Only latest AI message | ✅ DONE |
| E5 | AI responds in chunks (streaming) | When to start TTS? | After full response | ✅ DONE |
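The E4/E5 rule ("only the latest AI message, after the full response") can be isolated as a pure selector over the chat history. The `ChatMessage` shape is an assumption for the sketch, not the type used in chat.tsx.

```typescript
// E4 sketch: speak only when the newest entry in the history is an
// AI message; anything earlier is never (re-)spoken.
type ChatMessage = { role: "user" | "assistant"; text: string };

function messageToSpeak(messages: ChatMessage[]): string | null {
  const last = messages[messages.length - 1];
  return last && last.role === "assistant" ? last.text : null;
}
```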
F. Permission Scenarios
| # | Scenario | Action | Status |
|---|---|---|---|
| F1 | First launch - no permission | Show custom UI → request | ⚠️ TODO |
| F2 | Permission denied before | Open Settings app | ⚠️ TODO |
| F3 | Permission "Ask Every Time" (iOS) | Request each time | ⚠️ TODO |
| F4 | Permission revoked during session | Graceful degradation | ⚠️ TODO |
Implementation Priority
🔴 Critical (voice won't work without these):
- B1: Block mic during speaking ✅ DONE
- B2: Stop button ✅ DONE
- C1: Permission handling
- D1: TTS fallback
🟡 Important (UX suffers without these):
- B3: Cancel recording without sending
- C3: Timeout on silence
- C4: "Didn't catch that" feedback
- E1: Hide keyboard
- E3: Visual indicator for speaking message
🟢 Nice to have:
- B4-B5: Background behavior
- E5: Streaming TTS
- Voice picker UI
Related
- Main WellNuo voice.tsx (reference implementation)
- expo-speech-recognition docs
- sherpa-onnx