# FEATURE-001: Voice Integration for Chat
## Summary

Integrate voice communication with the AI into the chat screen: speech recognition for input, text-to-speech for AI responses.

## Status: 🟡 In Progress

## Priority: High

## Dependencies

- expo-speech-recognition (STT)
- expo-speech (fallback TTS)
- react-native-sherpa-onnx-offline-tts (neural TTS, cross-platform iOS/Android)

---

## Requirements

### Functional

1. **Voice Input (STT)**
   - Tap the microphone button to start listening
   - Real-time transcript display
   - Auto-send when the user stops speaking, OR tap again to stop
   - Visual indicator while listening (pulsing animation)

2. **Voice Output (TTS)**
   - AI responses are spoken automatically
   - Visual indicator while speaking
   - Stop button to interrupt speech
   - Multiple voice options (Lessac/Ryan/Alba)

3. **States & Indicators**
   - `isListening` - microphone active, user speaking
   - `isSpeaking` - AI voice response playing
   - `ttsInitialized` - TTS engine ready
   - Animated pulse on the microphone while listening

### Non-Functional

- Works offline (SherpaTTS uses local neural models)
- Cross-platform: iOS and Android
- Low-latency speech synthesis

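The three state flags reduce to a single derived status label for the UI. A hypothetical TypeScript sketch — the type name `VoiceUiState`, the helper, and the label strings are illustrative, not taken from the codebase:

```typescript
// Hypothetical model of the three voice flags from the requirements,
// plus the single status label derived from them. Names and label
// strings are illustrative assumptions, not the shipped implementation.
type VoiceUiState = {
  isListening: boolean;    // microphone active, user speaking
  isSpeaking: boolean;     // AI voice response playing
  ttsInitialized: boolean; // TTS engine ready
};

// Derive the status text shown near the input field.
// Listening takes priority: mic and TTS should never be active together.
function voiceStatusLabel(s: VoiceUiState): string {
  if (s.isListening) return "Listening...";
  if (s.isSpeaking) return "Speaking...";
  if (!s.ttsInitialized) return "Preparing voice...";
  return "";
}
```

Deriving the label from the flags (rather than storing it) keeps the UI from ever showing a stale "Listening..." after the mic has stopped.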
---

## Technical Design

### Architecture

```
┌─────────────────────────────────────────────────────────┐
│                        chat.tsx                         │
├─────────────────────────────────────────────────────────┤
│ State:                                                  │
│ - isListening (from useSpeechRecognition)               │
│ - recognizedText (from useSpeechRecognition)            │
│ - isSpeaking                                            │
│ - ttsInitialized                                        │
│ - pulseAnim (Animated.Value)                            │
├─────────────────────────────────────────────────────────┤
│ Handlers:                                               │
│ - handleVoiceToggle() - start/stop listening            │
│ - handleVoiceSend() - send recognized text              │
│ - speakText(text) - speak AI response                   │
│ - stopSpeaking() - interrupt speech                     │
└─────────────────────────────────────────────────────────┘
           │                               │
           ▼                               ▼
┌─────────────────────┐   ┌─────────────────────────────────┐
│ useSpeechRecognition│   │          sherpaTTS.ts           │
│      (hooks/)       │   │           (services/)           │
├─────────────────────┤   ├─────────────────────────────────┤
│ - startListening()  │   │ - initializeSherpaTTS()         │
│ - stopListening()   │   │ - speak(text, options)          │
│ - recognizedText    │   │ - stop()                        │
│ - isListening       │   │ - isAvailable()                 │
│ - isAvailable       │   │ - setVoice(voiceId)             │
└─────────────────────┘   └─────────────────────────────────┘
           │                               │
           ▼                               ▼
┌─────────────────────┐   ┌─────────────────────────────────┐
│ expo-speech-        │   │ react-native-sherpa-onnx-       │
│ recognition         │   │ offline-tts (Piper VITS)        │
│ (native module)     │   │ (native module)                 │
└─────────────────────┘   └─────────────────────────────────┘
```

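The two module surfaces in the diagram can be written down as TypeScript interfaces, together with an in-memory fake that lets chat logic be unit-tested without native modules. The method names mirror the diagram; the option shape and the fake itself are assumptions, not the actual module APIs:

```typescript
// Sketch of the two module surfaces from the diagram. Method names
// follow the diagram; parameter shapes are illustrative assumptions.
interface SpeechRecognitionApi {
  startListening(): Promise<void>;
  stopListening(): Promise<void>;
  recognizedText: string;
  isListening: boolean;
  isAvailable: boolean;
}

interface SherpaTTSApi {
  initializeSherpaTTS(): Promise<boolean>;
  speak(text: string, options?: { rate?: number }): Promise<void>;
  stop(): void;
  isAvailable(): boolean;
  setVoice(voiceId: string): void;
}

// In-memory fake: records what was "spoken" so chat logic can be
// tested without a device or native build.
class FakeSherpaTTS implements SherpaTTSApi {
  spoken: string[] = [];
  voice = "lessac";
  private ready = false;
  async initializeSherpaTTS() { this.ready = true; return true; }
  async speak(text: string) {
    if (!this.ready) throw new Error("TTS not initialized");
    this.spoken.push(text);
  }
  stop() { /* would interrupt native playback */ }
  isAvailable() { return this.ready; }
  setVoice(voiceId: string) { this.voice = voiceId; }
}
```

Keeping chat.tsx coded against the interface (rather than the concrete module) also makes the D1 expo-speech fallback a drop-in swap.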
### Available Piper Voices

| ID | Name | Gender | Accent | Model |
|----|------|--------|--------|-------|
| lessac | Lessac | Female | US | en_US-lessac-medium |
| ryan | Ryan | Male | US | en_US-ryan-medium |
| alba | Alba | Female | UK | en_GB-alba-medium |

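The voice table can live as one typed lookup shared by the voice picker and `setVoice()`. A sketch — the structure and helper name are illustrative:

```typescript
// The Piper voice table as a typed lookup (illustrative structure).
type PiperVoice = {
  id: string;
  name: string;
  gender: "Female" | "Male";
  accent: "US" | "UK";
  model: string;
};

const PIPER_VOICES: PiperVoice[] = [
  { id: "lessac", name: "Lessac", gender: "Female", accent: "US", model: "en_US-lessac-medium" },
  { id: "ryan",   name: "Ryan",   gender: "Male",   accent: "US", model: "en_US-ryan-medium" },
  { id: "alba",   name: "Alba",   gender: "Female", accent: "UK", model: "en_GB-alba-medium" },
];

// Resolve a voice id to its model name, falling back to the default
// (Lessac) for unknown ids instead of crashing the TTS service.
function modelForVoice(id: string): string {
  const v = PIPER_VOICES.find((voice) => voice.id === id) ?? PIPER_VOICES[0];
  return v.model;
}
```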
### Voice Flow

```
User taps mic button
          │
          ▼
 handleVoiceToggle()
          │
    ┌─────┴──────┐
    │isListening?│
    └─────┬──────┘
          │
     NO ──┴── YES
     │         │
     │         ▼
     │   stopListening()
     │   handleVoiceSend()
     ▼
startListening()
     │
     ▼
Speech Recognition active
(recognizedText updates)
     │
     ▼
User stops speaking / taps again
     │
     ▼
handleVoiceSend()
     │
     ▼
sendMessage(recognizedText)
     │
     ▼
AI responds
     │
     ▼
speakText(response)
     │
     ▼
SherpaTTS plays audio
```

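The tap branching above reduces to a pure decision function, which keeps `handleVoiceToggle()` trivially unit-testable. A sketch under assumed names — the action set is illustrative, not the shipped code:

```typescript
// Pure decision for a mic tap, mirroring the flow diagram. The handler
// in chat.tsx would perform the returned action (start/stop the native
// recognizer, send the message). Names here are illustrative.
type MicTapAction =
  | { kind: "ignore" }            // e.g. while AI audio is playing
  | { kind: "start-listening" }
  | { kind: "stop-and-send"; text: string }
  | { kind: "stop-and-discard" }; // nothing recognized - don't send

function onMicTap(opts: {
  isListening: boolean;
  isSpeaking: boolean;
  recognizedText: string;
}): MicTapAction {
  if (opts.isSpeaking) return { kind: "ignore" }; // mic would hear the TTS
  if (!opts.isListening) return { kind: "start-listening" };
  const text = opts.recognizedText.trim();
  return text.length > 0
    ? { kind: "stop-and-send", text }
    : { kind: "stop-and-discard" };
}
```

Note this one function also encodes scenarios B1 (ignore taps while speaking) and C4 (empty recognition result is discarded, not sent).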
---

## Implementation Steps

### Phase 1: Setup (DONE)
- [x] Add dependencies to package.json
- [x] Create sherpaTTS.ts service
- [x] Create useSpeechRecognition.ts hook
- [x] Add voice imports to chat.tsx
- [x] Add voice states (isListening, isSpeaking, ttsInitialized, pulseAnim)

### Phase 2: Logic (DONE)
- [x] Implement handleVoiceToggle()
- [x] Implement handleVoiceSend()
- [x] Implement speakText()
- [x] Implement stopSpeaking()
- [x] Initialize TTS on component mount
- [x] Auto-speak AI responses

### Phase 3: UI (DONE)
- [x] Add microphone button to the input area
- [x] Add voice status indicator (Listening.../Speaking...)
- [x] Add stop button for speech
- [x] Add pulse animation for the listening state
- [x] Add styles for voice UI elements

### Phase 4: Build & Test (IN PROGRESS)
- [ ] Run `npm install`
- [ ] Run `expo prebuild --clean`
- [ ] Build iOS (native modules required)
- [ ] Test on iOS simulator
- [ ] Test on Android (emulator or device)

### Phase 5: Polish (TODO)
- [ ] Handle permissions properly (microphone access)
- [ ] Add voice picker UI
- [ ] Add speech-rate control
- [ ] Test edge cases (no network, no mic permission)

---

## Files Modified/Created

| File | Status | Description |
|------|--------|-------------|
| `package.json` | Modified | Added voice dependencies |
| `services/sherpaTTS.ts` | Created | SherpaTTS service for offline TTS |
| `hooks/useSpeechRecognition.ts` | Created | Speech recognition hook |
| `app/(tabs)/chat.tsx` | Modified | Voice integration in chat |

---

## Testing Checklist

### Manual Testing
- [ ] Tap mic button - starts listening
- [ ] Speak - text appears in the input field
- [ ] Tap again - sends the message
- [ ] AI responds - voice speaks the response
- [ ] Tap stop - speech stops immediately
- [ ] Mic button disabled while sending
- [ ] Visual indicators show the correct state

### Edge Cases
- [ ] No microphone permission - shows an alert
- [ ] TTS not available - falls back to expo-speech
- [ ] Empty speech recognition result - doesn't send
- [ ] Long AI response - speech handles it gracefully
- [ ] Interrupt speech and start new input

---

## Notes

### SherpaTTS Cross-Platform Support
- **iOS**: Uses a native module via bridged ObjC/Swift
- **Android**: Uses a native module via JNI/Kotlin
- **Model files**: Must be bundled in the app (assets/tts-models/)
- **Size**: ~20MB per voice model

### Known Limitations
- Speech recognition requires device microphone permission
- SherpaTTS requires a native build (not Expo Go)
- Model download may be needed on first launch

---

## Voice Interaction Scenarios (All Cases)

### State Machine

```
┌─────────────────────────────────────────────────────────────┐
│                     VOICE STATE MACHINE                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌──────────┐                                              │
│   │   IDLE   │◄────────────────────────────────────┐        │
│   └────┬─────┘                                     │        │
│        │ tap mic                                   │        │
│        ▼                                           │        │
│   ┌──────────┐                                     │        │
│   │LISTENING │───── user stops / tap ─────────┐   │        │
│   └────┬─────┘                                │   │        │
│        │ recognized text                      │   │        │
│        ▼                                      │   │        │
│   ┌──────────┐                                ▼   │        │
│   │PROCESSING│─────────────────────────► SENDING  │        │
│   └────┬─────┘                              │     │        │
│        │ AI responded                       │     │        │
│        ▼                                    │     │        │
│   ┌──────────┐                              │     │        │
│   │ SPEAKING │◄────────────────────────────┘      │        │
│   └────┬─────┘                                    │        │
│        │ finished / user tap stop                 │        │
│        └──────────────────────────────────────────┘        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

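The diagram above can be made executable as a table-driven transition function; unknown events leave the state unchanged, which is exactly the guard behavior the B-scenarios below rely on. State and event names follow the diagram; the code itself is an illustrative sketch, not the shipped implementation:

```typescript
// Table-driven version of the voice state machine above. Events not
// listed for a state are ignored (state is returned unchanged).
type VoiceState = "IDLE" | "LISTENING" | "PROCESSING" | "SENDING" | "SPEAKING";
type VoiceEvent =
  | "TAP_MIC" | "RECOGNIZED_TEXT" | "USER_STOPS"
  | "SENT" | "AI_RESPONDED" | "FINISHED" | "TAP_STOP";

const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE:       { TAP_MIC: "LISTENING" },
  LISTENING:  { RECOGNIZED_TEXT: "PROCESSING", USER_STOPS: "SENDING", TAP_MIC: "SENDING" },
  PROCESSING: { SENT: "SENDING", AI_RESPONDED: "SPEAKING" },
  SENDING:    { AI_RESPONDED: "SPEAKING" },
  SPEAKING:   { FINISHED: "IDLE", TAP_STOP: "IDLE" },
};

function transition(state: VoiceState, event: VoiceEvent): VoiceState {
  return TRANSITIONS[state][event] ?? state;
}
```

Because `transition` silently drops illegal events, a mic tap during SPEAKING simply does nothing — scenario B1 falls out of the table rather than needing an ad-hoc flag check.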
### A. Happy Path Scenarios

| # | Scenario | Expected Behavior | Status |
|---|----------|-------------------|--------|
| A1 | User taps mic → speaks → taps again | Text recognized → sent → AI responds → spoken | ✅ |
| A2 | User listens to full AI response | TTS finishes → returns to IDLE | ✅ |
| A3 | User stops TTS with stop button | TTS interrupted → can tap mic again | ✅ |
| A4 | User types text manually | Message sent → AI responds → spoken | ✅ |

### B. Interruptions & Conflicts

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| B1 | Tap mic while AI speaking | Mic would hear the TTS | Block mic while `isSpeaking` | ✅ DONE |
| B2 | AI speaking, user wants to stop | No way to interrupt | Stop button (red) | ✅ DONE |
| B3 | User speaking, changes mind | Need to cancel without sending | Tap again = cancel (no text = don't send) | ✅ DONE |
| B4 | AI speaking, user switches tab | Should TTS stop? | Stop TTS on blur | ⚠️ TODO |
| B5 | App goes to background during TTS | Does TTS continue in the background? | Platform-specific behavior | ⚠️ TODO |
| B6 | Double/triple tap on mic | States get confused | Debounce + transition lock | ⚠️ TODO |

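One way to implement the B6 fix ("debounce + transition lock") is a small time-window lock: a second tap that arrives while a start/stop transition is still in flight is simply ignored. A sketch with an injected clock for testability — the class name and window length are assumptions:

```typescript
// Time-window lock for rapid mic taps (scenario B6). A tap acquires the
// lock for `windowMs`; taps inside that window are rejected. The clock
// is injectable so the behavior can be tested deterministically.
class TransitionLock {
  private lockedUntil = 0;
  constructor(
    private windowMs = 400,            // assumed debounce window
    private now: () => number = Date.now,
  ) {}

  // Returns true if the caller may proceed; false if tapped too soon.
  // A rejected tap does NOT extend the lock window.
  tryAcquire(): boolean {
    const t = this.now();
    if (t < this.lockedUntil) return false;
    this.lockedUntil = t + this.windowMs;
    return true;
  }
}
```

In the handler this becomes a one-line guard: `if (!lock.tryAcquire()) return;` at the top of `handleVoiceToggle()`.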
### C. Speech Recognition Errors (STT)

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| C1 | No microphone permission | Speech recognition fails | Show permission alert + Open Settings | ✅ DONE |
| C2 | Microphone busy (other app) | Can't start recording | Show "Microphone busy" error | ⚠️ TODO |
| C3 | User silent for 5+ seconds | No text to send | Auto-cancel with a hint | ⚠️ TODO |
| C4 | Speech recognition returns empty | Nothing recognized | Show "Didn't catch that" + auto-hide | ✅ DONE |
| C5 | Network unavailable (Android) | Recognition doesn't work | Expo STT needs network on Android | ⚠️ NOTE |
| C6 | Unsupported language | Recognition works poorly | Hardcode 'en-US' | ✅ DONE |

### D. Text-to-Speech Errors (TTS)

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| D1 | SherpaTTS not initialized | Model not loaded | Fall back to expo-speech | ⚠️ TODO |
| D2 | SherpaTTS crashes mid-playback | Speech interrupted | Handle the error, reset state | ⚠️ TODO |
| D3 | Very long AI response | TTS plays for 2+ minutes | Show progress or split the text | ⚠️ TODO |
| D4 | TTS model not downloaded | First launch without network | Bundle the model or pre-download | ⚠️ NOTE |
| D5 | Voice sounds bad | Model quality issue | Voice picker (Lessac/Ryan/Alba) | ⚠️ TODO |

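The D1/D2 fallback policy (try SherpaTTS, degrade to the system engine — expo-speech in this app — when it is missing or throws) can be isolated into one function with injected engines, so the policy is testable without native modules. The `TTSEngine` interface and function name are assumptions for illustration:

```typescript
// Fallback policy for scenarios D1/D2: prefer the neural engine, fall
// back to the system engine when it is unavailable or throws mid-call.
// Engines are injected; this is a sketch, not the actual module API.
interface TTSEngine {
  speak(text: string): Promise<void>;
}

async function speakWithFallback(
  text: string,
  primary: TTSEngine | null, // SherpaTTS; null when not initialized (D1)
  fallback: TTSEngine,       // expo-speech wrapper
): Promise<"primary" | "fallback"> {
  if (primary) {
    try {
      await primary.speak(text);
      return "primary";
    } catch {
      // D2: SherpaTTS failed mid-call - degrade gracefully below.
    }
  }
  await fallback.speak(text);
  return "fallback";
}
```

Returning which engine actually spoke makes it easy to log fallback rates during testing.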
### E. UI Edge Cases

| # | Scenario | Problem | Solution | Status |
|---|----------|---------|----------|--------|
| E1 | TextInput focused + tap mic | Keyboard in the way | Hide keyboard when listening | ⚠️ TODO |
| E2 | User typing + taps mic | What to do with typed text? | Keep or replace? | ⚠️ TODO |
| E3 | Scroll chat during TTS | Unclear which message is playing | Highlight the speaking message | ⚠️ TODO |
| E4 | Multiple messages queued | Which one to speak? | Only the latest AI message | ✅ DONE |
| E5 | AI responds in chunks (streaming) | When to start TTS? | After the full response | ✅ DONE |

### F. Permission Scenarios

| # | Scenario | Action | Status |
|---|----------|--------|--------|
| F1 | First launch - no permission | Show custom UI → request | ⚠️ TODO |
| F2 | Permission denied previously | Open the Settings app | ⚠️ TODO |
| F3 | Permission "Ask Every Time" (iOS) | Request each time | ⚠️ TODO |
| F4 | Permission revoked during session | Graceful degradation | ⚠️ TODO |

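The F-table collapses to a pure mapping from permission status to UI action. The status values mirror the common Expo permission shape (`granted` / `denied` / `undetermined` with a `canAskAgain` flag); the action names are made up for illustration:

```typescript
// Map a microphone permission status to the UI action from the F-table.
// Status values follow the common Expo permission shape; the action
// names are illustrative assumptions.
type PermissionStatus = "granted" | "denied" | "undetermined";
type PermissionAction =
  | { kind: "proceed" }        // permission already granted
  | { kind: "request" }        // F1/F3: ask via the system dialog
  | { kind: "open-settings" }; // F2: hard-denied, must change in Settings

function micPermissionAction(
  status: PermissionStatus,
  canAskAgain: boolean,
): PermissionAction {
  if (status === "granted") return { kind: "proceed" };
  if (status === "denied" && !canAskAgain) return { kind: "open-settings" };
  return { kind: "request" };
}
```

F4 (revoked mid-session) is then just this function re-run before each `startListening()` call.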
### Implementation Priority

**🔴 Critical (voice won't work without these):**
- B1: Block mic during speaking ✅ DONE
- B2: Stop button ✅ DONE
- C1: Permission handling
- D1: TTS fallback

**🟡 Important (UX suffers without these):**
- B3: Cancel recording without sending
- C3: Timeout on silence
- C4: "Didn't catch that" feedback
- E1: Hide keyboard
- E3: Visual indicator for the speaking message

**🟢 Nice to have:**
- B4-B5: Background behavior
- E5: Streaming TTS
- Voice picker UI

---

## Related

- Main WellNuo voice.tsx (reference implementation)
- [expo-speech-recognition docs](https://docs.expo.dev/versions/latest/sdk/speech-recognition/)
- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)