# FEATURE-001: Voice Integration for Chat
## Summary

Integrate voice communication with the AI into the chat screen: speech recognition for input, text-to-speech for AI responses.

## Status: 🟡 In Progress

## Priority: High

## Dependencies

- expo-speech-recognition (STT)
- expo-speech (fallback TTS)
- react-native-sherpa-onnx-offline-tts (neural TTS, cross-platform iOS/Android)

---

## Requirements

### Functional

1. **Voice Input (STT)**
   - Tap the microphone button to start listening
   - Real-time transcript display
   - Auto-send when the user stops speaking, OR tap again to stop
   - Visual indicator while listening (pulsing animation)

2. **Voice Output (TTS)**
   - AI responses are spoken automatically
   - Visual indicator while speaking
   - Stop button to interrupt speech
   - Multiple voice options (Lessac/Ryan/Alba)

3. **States & Indicators**
   - `isListening` - microphone active, user speaking
   - `isSpeaking` - AI voice response playing
   - `ttsInitialized` - TTS engine ready
   - Animated pulse on the microphone while listening

### Non-Functional

- Works offline (SherpaTTS uses local neural models)
- Cross-platform: iOS and Android
- Low-latency speech synthesis

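The three state flags reduce to a single derived status label for the UI. A hypothetical TypeScript sketch — the type name `VoiceUiState`, the helper, and the label strings are illustrative, not taken from the codebase:

```typescript
// Hypothetical model of the three voice flags from the requirements,
// plus the single status label derived from them. Names and label
// strings are illustrative assumptions, not the shipped implementation.
type VoiceUiState = {
  isListening: boolean;    // microphone active, user speaking
  isSpeaking: boolean;     // AI voice response playing
  ttsInitialized: boolean; // TTS engine ready
};

// Derive the status text shown near the input field.
// Listening takes priority: mic and TTS should never be active together.
function voiceStatusLabel(s: VoiceUiState): string {
  if (s.isListening) return "Listening...";
  if (s.isSpeaking) return "Speaking...";
  if (!s.ttsInitialized) return "Preparing voice...";
  return "";
}
```

Deriving the label from the flags (rather than storing it) keeps the UI from ever showing a stale "Listening..." after the mic has stopped.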
---

## Technical Design

### Architecture

```
┌─────────────────────────────────────────────────────────┐
│                        chat.tsx                         │
├─────────────────────────────────────────────────────────┤
│ State:                                                  │
│ - isListening (from useSpeechRecognition)               │
│ - recognizedText (from useSpeechRecognition)            │
│ - isSpeaking                                            │
│ - ttsInitialized                                        │
│ - pulseAnim (Animated.Value)                            │
├─────────────────────────────────────────────────────────┤
│ Handlers:                                               │
│ - handleVoiceToggle() - start/stop listening            │
│ - handleVoiceSend() - send recognized text              │
│ - speakText(text) - speak AI response                   │
│ - stopSpeaking() - interrupt speech                     │
└─────────────────────────────────────────────────────────┘
           │                               │
           ▼                               ▼
┌─────────────────────┐   ┌─────────────────────────────────┐
│ useSpeechRecognition│   │          sherpaTTS.ts           │
│      (hooks/)       │   │           (services/)           │
├─────────────────────┤   ├─────────────────────────────────┤
│ - startListening()  │   │ - initializeSherpaTTS()         │
│ - stopListening()   │   │ - speak(text, options)          │
│ - recognizedText    │   │ - stop()                        │
│ - isListening       │   │ - isAvailable()                 │
│ - isAvailable       │   │ - setVoice(voiceId)             │
└─────────────────────┘   └─────────────────────────────────┘
           │                               │
           ▼                               ▼
┌─────────────────────┐   ┌─────────────────────────────────┐
│ expo-speech-        │   │ react-native-sherpa-onnx-       │
│ recognition         │   │ offline-tts (Piper VITS)        │
│ (native module)     │   │ (native module)                 │
└─────────────────────┘   └─────────────────────────────────┘
```

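The two module surfaces in the diagram can be written down as TypeScript interfaces, together with an in-memory fake that lets chat logic be unit-tested without native modules. The method names mirror the diagram; the option shape and the fake itself are assumptions, not the actual module APIs:

```typescript
// Sketch of the two module surfaces from the diagram. Method names
// follow the diagram; parameter shapes are illustrative assumptions.
interface SpeechRecognitionApi {
  startListening(): Promise<void>;
  stopListening(): Promise<void>;
  recognizedText: string;
  isListening: boolean;
  isAvailable: boolean;
}

interface SherpaTTSApi {
  initializeSherpaTTS(): Promise<boolean>;
  speak(text: string, options?: { rate?: number }): Promise<void>;
  stop(): void;
  isAvailable(): boolean;
  setVoice(voiceId: string): void;
}

// In-memory fake: records what was "spoken" so chat logic can be
// tested without a device or native build.
class FakeSherpaTTS implements SherpaTTSApi {
  spoken: string[] = [];
  voice = "lessac";
  private ready = false;
  async initializeSherpaTTS() { this.ready = true; return true; }
  async speak(text: string) {
    if (!this.ready) throw new Error("TTS not initialized");
    this.spoken.push(text);
  }
  stop() { /* would interrupt native playback */ }
  isAvailable() { return this.ready; }
  setVoice(voiceId: string) { this.voice = voiceId; }
}
```

Keeping chat.tsx coded against the interface (rather than the concrete module) also makes the D1 expo-speech fallback a drop-in swap.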
### Available Piper Voices

| ID | Name | Gender | Accent | Model |
|----|------|--------|--------|-------|
| lessac | Lessac | Female | US | en_US-lessac-medium |
| ryan | Ryan | Male | US | en_US-ryan-medium |
| alba | Alba | Female | UK | en_GB-alba-medium |

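The voice table can live as one typed lookup shared by the voice picker and `setVoice()`. A sketch — the structure and helper name are illustrative:

```typescript
// The Piper voice table as a typed lookup (illustrative structure).
type PiperVoice = {
  id: string;
  name: string;
  gender: "Female" | "Male";
  accent: "US" | "UK";
  model: string;
};

const PIPER_VOICES: PiperVoice[] = [
  { id: "lessac", name: "Lessac", gender: "Female", accent: "US", model: "en_US-lessac-medium" },
  { id: "ryan",   name: "Ryan",   gender: "Male",   accent: "US", model: "en_US-ryan-medium" },
  { id: "alba",   name: "Alba",   gender: "Female", accent: "UK", model: "en_GB-alba-medium" },
];

// Resolve a voice id to its model name, falling back to the default
// (Lessac) for unknown ids instead of crashing the TTS service.
function modelForVoice(id: string): string {
  const v = PIPER_VOICES.find((voice) => voice.id === id) ?? PIPER_VOICES[0];
  return v.model;
}
```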
### Voice Flow

```
User taps mic button
          │
          ▼
 handleVoiceToggle()
          │
    ┌─────┴──────┐
    │isListening?│
    └─────┬──────┘
          │
     NO ──┴── YES
     │         │
     │         ▼
     │   stopListening()
     │   handleVoiceSend()
     ▼
startListening()
     │
     ▼
Speech Recognition active
(recognizedText updates)
     │
     ▼
User stops speaking / taps again
     │
     ▼
handleVoiceSend()
     │
     ▼
sendMessage(recognizedText)
     │
     ▼
AI responds
     │
     ▼
speakText(response)
     │
     ▼
SherpaTTS plays audio
```

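The tap branching above reduces to a pure decision function, which keeps `handleVoiceToggle()` trivially unit-testable. A sketch under assumed names — the action set is illustrative, not the shipped code:

```typescript
// Pure decision for a mic tap, mirroring the flow diagram. The handler
// in chat.tsx would perform the returned action (start/stop the native
// recognizer, send the message). Names here are illustrative.
type MicTapAction =
  | { kind: "ignore" }            // e.g. while AI audio is playing
  | { kind: "start-listening" }
  | { kind: "stop-and-send"; text: string }
  | { kind: "stop-and-discard" }; // nothing recognized - don't send

function onMicTap(opts: {
  isListening: boolean;
  isSpeaking: boolean;
  recognizedText: string;
}): MicTapAction {
  if (opts.isSpeaking) return { kind: "ignore" }; // mic would hear the TTS
  if (!opts.isListening) return { kind: "start-listening" };
  const text = opts.recognizedText.trim();
  return text.length > 0
    ? { kind: "stop-and-send", text }
    : { kind: "stop-and-discard" };
}
```

Note this one function also encodes scenarios B1 (ignore taps while speaking) and C4 (empty recognition result is discarded, not sent).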
---

## Implementation Steps

### Phase 1: Setup (DONE)
- [x] Add dependencies to package.json
- [x] Create sherpaTTS.ts service
- [x] Create useSpeechRecognition.ts hook
- [x] Add voice imports to chat.tsx
- [x] Add voice states (isListening, isSpeaking, ttsInitialized, pulseAnim)

### Phase 2: Logic (DONE)
- [x] Implement handleVoiceToggle()
- [x] Implement handleVoiceSend()
- [x] Implement speakText()
- [x] Implement stopSpeaking()
- [x] Initialize TTS on component mount
- [x] Auto-speak AI responses

### Phase 3: UI (DONE)
- [x] Add microphone button to the input area
- [x] Add voice status indicator (Listening.../Speaking...)
- [x] Add stop button for speech
- [x] Add pulse animation for the listening state
- [x] Add styles for voice UI elements

### Phase 4: Build & Test (IN PROGRESS)
- [ ] Run `npm install`
- [ ] Run `expo prebuild --clean`
- [ ] Build iOS (native modules required)
- [ ] Test on iOS simulator
- [ ] Test on Android (emulator or device)

### Phase 5: Polish (TODO)
- [ ] Handle permissions properly (microphone access)
- [ ] Add voice picker UI
- [ ] Add speech-rate control
- [ ] Test edge cases (no network, no mic permission)

---

## Files Modified/Created

| File | Status | Description |
|------|--------|-------------|
| `package.json` | Modified | Added voice dependencies |
| `services/sherpaTTS.ts` | Created | SherpaTTS service for offline TTS |
| `hooks/useSpeechRecognition.ts` | Created | Speech recognition hook |
| `app/(tabs)/chat.tsx` | Modified | Voice integration in chat |

---

## Testing Checklist

### Manual Testing
- [ ] Tap mic button - starts listening
- [ ] Speak - text appears in the input field
- [ ] Tap again - sends the message
- [ ] AI responds - voice speaks the response
- [ ] Tap stop - speech stops immediately
- [ ] Mic button disabled while sending
- [ ] Visual indicators show the correct state

### Edge Cases
- [ ] No microphone permission - shows an alert
- [ ] TTS not available - falls back to expo-speech
- [ ] Empty speech recognition result - doesn't send
- [ ] Long AI response - speech handles it gracefully
- [ ] Interrupt speech and start new input

---

## Notes

### SherpaTTS Cross-Platform Support
- **iOS**: Uses a native module via bridged ObjC/Swift
- **Android**: Uses a native module via JNI/Kotlin
- **Model files**: Must be bundled in the app (assets/tts-models/)
- **Size**: ~20MB per voice model

### Known Limitations
- Speech recognition requires device microphone permission
- SherpaTTS requires a native build (not Expo Go)
- Model download may be needed on first launch

---

## Voice Interaction Scenarios (All Cases)

### State Machine

```
┌─────────────────────────────────────────────────────────────┐
│                     VOICE STATE MACHINE                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐                                              │
│   │   IDLE   │◄────────────────────────────────────┐        │
│   └────┬─────┘                                     │        │
│        │ tap mic                                   │        │
│        ▼                                           │        │
│   ┌──────────┐                                     │        │
│   │LISTENING │───── user stops / tap ─────────┐   │        │
│   └────┬─────┘                                │   │        │
│        │ recognized text                      │   │        │
│        ▼                                      │   │        │
│   ┌──────────┐                                ▼   │        │
│   │PROCESSING│─────────────────────────► SENDING  │        │
│   └────┬─────┘                              │     │        │
│        │ AI responded                       │     │        │
│        ▼                                    │     │        │
│   ┌──────────┐                              │     │        │
│   │ SPEAKING │◄────────────────────────────┘      │        │
│   └────┬─────┘                                    │        │
│        │ finished / user tap stop                 │        │
│        └──────────────────────────────────────────┘        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

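The diagram above can be made executable as a table-driven transition function; unknown events leave the state unchanged, which is exactly the guard behavior the B-scenarios below rely on. State and event names follow the diagram; the code itself is an illustrative sketch, not the shipped implementation:

```typescript
// Table-driven version of the voice state machine above. Events not
// listed for a state are ignored (state is returned unchanged).
type VoiceState = "IDLE" | "LISTENING" | "PROCESSING" | "SENDING" | "SPEAKING";
type VoiceEvent =
  | "TAP_MIC" | "RECOGNIZED_TEXT" | "USER_STOPS"
  | "SENT" | "AI_RESPONDED" | "FINISHED" | "TAP_STOP";

const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE:       { TAP_MIC: "LISTENING" },
  LISTENING:  { RECOGNIZED_TEXT: "PROCESSING", USER_STOPS: "SENDING", TAP_MIC: "SENDING" },
  PROCESSING: { SENT: "SENDING", AI_RESPONDED: "SPEAKING" },
  SENDING:    { AI_RESPONDED: "SPEAKING" },
  SPEAKING:   { FINISHED: "IDLE", TAP_STOP: "IDLE" },
};

function transition(state: VoiceState, event: VoiceEvent): VoiceState {
  return TRANSITIONS[state][event] ?? state;
}
```

Because `transition` silently drops illegal events, a mic tap during SPEAKING simply does nothing — scenario B1 falls out of the table rather than needing an ad-hoc flag check.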
### A. Happy Path Scenarios

| # | Scenario | Expected Behavior | Status |
|---|----------|-------------------|--------|
| A1 | User taps mic → speaks → taps again | Text recognized → sent → AI responds → spoken | ✅ |
| A2 | User listens to full AI response | TTS finishes → returns to IDLE | ✅ |
| A3 | User stops TTS with stop button | TTS interrupted → can tap mic again | ✅ |
| A4 | User types text manually | Message sent → AI responds → spoken | ✅ |

### B. Interruptions & Conflicts

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| B1 | Tap mic while AI speaking | Mic would hear the TTS | Block mic while `isSpeaking` | ✅ DONE |
| B2 | AI speaking, user wants to stop | No way to interrupt | Stop button (red) | ✅ DONE |
| B3 | User speaking, changes mind | Need to cancel without sending | Tap again = cancel (no text = don't send) | ✅ DONE |
| B4 | AI speaking, user switches tab | Should TTS stop? | Stop TTS on blur | ⚠️ TODO |
| B5 | App goes to background during TTS | Does TTS continue in the background? | Platform-specific behavior | ⚠️ TODO |
| B6 | Double/triple tap on mic | States get confused | Debounce + transition lock | ⚠️ TODO |

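One way to implement the B6 fix ("debounce + transition lock") is a small time-window lock: a second tap that arrives while a start/stop transition is still in flight is simply ignored. A sketch with an injected clock for testability — the class name and window length are assumptions:

```typescript
// Time-window lock for rapid mic taps (scenario B6). A tap acquires the
// lock for `windowMs`; taps inside that window are rejected. The clock
// is injectable so the behavior can be tested deterministically.
class TransitionLock {
  private lockedUntil = 0;
  constructor(
    private windowMs = 400,            // assumed debounce window
    private now: () => number = Date.now,
  ) {}

  // Returns true if the caller may proceed; false if tapped too soon.
  // A rejected tap does NOT extend the lock window.
  tryAcquire(): boolean {
    const t = this.now();
    if (t < this.lockedUntil) return false;
    this.lockedUntil = t + this.windowMs;
    return true;
  }
}
```

In the handler this becomes a one-line guard: `if (!lock.tryAcquire()) return;` at the top of `handleVoiceToggle()`.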
### C. Speech Recognition Errors (STT)

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| C1 | No microphone permission | Speech recognition fails | Show permission alert + Open Settings | ✅ DONE |
| C2 | Microphone busy (other app) | Can't start recording | Show "Microphone busy" error | ⚠️ TODO |
| C3 | User silent for 5+ seconds | No text to send | Auto-cancel with a hint | ⚠️ TODO |
| C4 | Speech recognition returns empty | Nothing recognized | Show "Didn't catch that" + auto-hide | ✅ DONE |
| C5 | Network unavailable (Android) | Recognition doesn't work | Expo STT needs network on Android | ⚠️ NOTE |
| C6 | Unsupported language | Recognition works poorly | Hardcode 'en-US' | ✅ DONE |

### D. Text-to-Speech Errors (TTS)

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| D1 | SherpaTTS not initialized | Model not loaded | Fall back to expo-speech | ⚠️ TODO |
| D2 | SherpaTTS crashes mid-playback | Speech interrupted | Handle the error, reset state | ⚠️ TODO |
| D3 | Very long AI response | TTS plays for 2+ minutes | Show progress or split the text | ⚠️ TODO |
| D4 | TTS model not downloaded | First launch without network | Bundle the model or pre-download | ⚠️ NOTE |
| D5 | Voice sounds bad | Model quality issue | Voice picker (Lessac/Ryan/Alba) | ⚠️ TODO |

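The D1/D2 fallback policy (try SherpaTTS, degrade to the system engine — expo-speech in this app — when it is missing or throws) can be isolated into one function with injected engines, so the policy is testable without native modules. The `TTSEngine` interface and function name are assumptions for illustration:

```typescript
// Fallback policy for scenarios D1/D2: prefer the neural engine, fall
// back to the system engine when it is unavailable or throws mid-call.
// Engines are injected; this is a sketch, not the actual module API.
interface TTSEngine {
  speak(text: string): Promise<void>;
}

async function speakWithFallback(
  text: string,
  primary: TTSEngine | null, // SherpaTTS; null when not initialized (D1)
  fallback: TTSEngine,       // expo-speech wrapper
): Promise<"primary" | "fallback"> {
  if (primary) {
    try {
      await primary.speak(text);
      return "primary";
    } catch {
      // D2: SherpaTTS failed mid-call - degrade gracefully below.
    }
  }
  await fallback.speak(text);
  return "fallback";
}
```

Returning which engine actually spoke makes it easy to log fallback rates during testing.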
### E. UI Edge Cases

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| E1 | TextInput focused + tap mic | Keyboard in the way | Hide keyboard when listening | ⚠️ TODO |
| E2 | User typing + taps mic | What to do with typed text? | Keep or replace? | ⚠️ TODO |
| E3 | Scroll chat during TTS | Unclear which message is playing | Highlight the speaking message | ⚠️ TODO |
| E4 | Multiple messages queued | Which one to speak? | Only the latest AI message | ✅ DONE |
| E5 | AI responds in chunks (streaming) | When to start TTS? | After the full response | ✅ DONE |

### F. Permission Scenarios

| # | Scenario | Action | Status |
|---|----------|--------|--------|
| F1 | First launch - no permission | Show custom UI → request | ⚠️ TODO |
| F2 | Permission denied previously | Open the Settings app | ⚠️ TODO |
| F3 | Permission "Ask Every Time" (iOS) | Request each time | ⚠️ TODO |
| F4 | Permission revoked during session | Graceful degradation | ⚠️ TODO |

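The F-table collapses to a pure mapping from permission status to UI action. The status values mirror the common Expo permission shape (`granted` / `denied` / `undetermined` with a `canAskAgain` flag); the action names are made up for illustration:

```typescript
// Map a microphone permission status to the UI action from the F-table.
// Status values follow the common Expo permission shape; the action
// names are illustrative assumptions.
type PermissionStatus = "granted" | "denied" | "undetermined";
type PermissionAction =
  | { kind: "proceed" }        // permission already granted
  | { kind: "request" }        // F1/F3: ask via the system dialog
  | { kind: "open-settings" }; // F2: hard-denied, must change in Settings

function micPermissionAction(
  status: PermissionStatus,
  canAskAgain: boolean,
): PermissionAction {
  if (status === "granted") return { kind: "proceed" };
  if (status === "denied" && !canAskAgain) return { kind: "open-settings" };
  return { kind: "request" };
}
```

F4 (revoked mid-session) is then just this function re-run before each `startListening()` call.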
### Implementation Priority

**🔴 Critical (voice won't work without these):**
- B1: Block mic during speaking ✅ DONE
- B2: Stop button ✅ DONE
- C1: Permission handling
- D1: TTS fallback

**🟡 Important (UX suffers without these):**
- B3: Cancel recording without sending
- C3: Timeout on silence
- C4: "Didn't catch that" feedback
- E1: Hide keyboard
- E3: Visual indicator for the speaking message

**🟢 Nice to have:**
- B4-B5: Background behavior
- E5: Streaming TTS
- Voice picker UI

---

## Related

- Main WellNuo voice.tsx (reference implementation)
- [expo-speech-recognition docs](https://docs.expo.dev/versions/latest/sdk/speech-recognition/)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)