FEATURE-001: Voice Integration for Chat
Summary
Integrate voice interaction with the AI in the chat screen: speech recognition (STT) for user input and text-to-speech (TTS) for AI responses.
Status: 🟡 In Progress
Priority: High
Dependencies
- expo-speech-recognition (STT)
- expo-speech (fallback TTS)
- react-native-sherpa-onnx-offline-tts (neural TTS - cross-platform iOS/Android)
Requirements
Functional
- Voice Input (STT)
  - Tap microphone button to start listening
  - Real-time transcript display
  - Auto-send when the user stops speaking, or tap again to stop
  - Visual indicator when listening (pulsing animation)
- Voice Output (TTS)
  - AI responses are spoken automatically
  - Visual indicator when speaking
  - Stop button to interrupt speech
  - Multiple voice options (Lessac/Ryan/Alba)
- States & Indicators
  - isListening - microphone active, user speaking
  - isSpeaking - AI voice response playing
  - ttsInitialized - TTS engine ready
  - Animated pulse on microphone when listening
Non-Functional
- Works offline (SherpaTTS uses local neural models)
- Cross-platform: iOS and Android
- Low latency speech synthesis
Technical Design
Architecture
┌─────────────────────────────────────────────────────────┐
│ chat.tsx │
├─────────────────────────────────────────────────────────┤
│ State: │
│ - isListening (from useSpeechRecognition) │
│ - recognizedText (from useSpeechRecognition) │
│ - isSpeaking │
│ - ttsInitialized │
│ - pulseAnim (Animated.Value) │
├─────────────────────────────────────────────────────────┤
│ Handlers: │
│ - handleVoiceToggle() - start/stop listening │
│ - handleVoiceSend() - send recognized text │
│ - speakText(text) - speak AI response │
│ - stopSpeaking() - interrupt speech │
└─────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────────────────┐
│ useSpeechRecognition│ │ sherpaTTS.ts │
│ (hooks/) │ │ (services/) │
├─────────────────────┤ ├─────────────────────────────────┤
│ - startListening() │ │ - initializeSherpaTTS() │
│ - stopListening() │ │ - speak(text, options) │
│ - recognizedText │ │ - stop() │
│ - isListening │ │ - isAvailable() │
│ - isAvailable │ │ - setVoice(voiceId) │
└─────────────────────┘ └─────────────────────────────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────────────────┐
│expo-speech- │ │ react-native-sherpa-onnx- │
│recognition │ │ offline-tts (Piper VITS) │
│(native module) │ │ (native module) │
└─────────────────────┘ └─────────────────────────────────┘
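The module surfaces in the diagram can be written down as TypeScript interfaces. The member names come straight from the boxes above; the parameter and return types are assumptions until checked against the actual `hooks/useSpeechRecognition.ts` and `services/sherpaTTS.ts` files.

```typescript
// Hypothetical shapes for the two voice modules in the diagram.
// Member names mirror the architecture boxes; exact types are assumed.
interface SpeechRecognitionHook {
  startListening(): void;
  stopListening(): void;
  recognizedText: string;
  isListening: boolean;
  isAvailable: boolean;
}

interface SherpaTTSService {
  initializeSherpaTTS(): Promise<boolean>;
  speak(text: string, options?: { rate?: number }): Promise<void>;
  stop(): void;
  isAvailable(): boolean;
  setVoice(voiceId: "lessac" | "ryan" | "alba"): void;
}

// Minimal in-memory stub, useful for unit-testing chat.tsx logic
// without building the native modules.
class FakeTTS implements SherpaTTSService {
  voice = "lessac";
  spoken: string[] = [];
  async initializeSherpaTTS() { return true; }
  async speak(text: string) { this.spoken.push(text); }
  stop() {}
  isAvailable() { return true; }
  setVoice(voiceId: "lessac" | "ryan" | "alba") { this.voice = voiceId; }
}
```

Typing the seams this way lets Phase 4 testing run against `FakeTTS` on any platform before the native build is ready.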
Available Piper Voices
| ID | Name | Gender | Accent | Model |
|---|---|---|---|---|
| lessac | Lessac | Female | US | en_US-lessac-medium |
| ryan | Ryan | Male | US | en_US-ryan-medium |
| alba | Alba | Female | UK | en_GB-alba-medium |
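The voice table maps naturally to a lookup used by `setVoice(voiceId)`. The bundle location `assets/tts-models/` is named later in this spec; the `.onnx` file-name pattern is an assumption based on standard Piper model naming.

```typescript
// Voice picker IDs from the table, mapped to their Piper models.
type VoiceId = "lessac" | "ryan" | "alba";

const PIPER_VOICES: Record<VoiceId, { model: string; gender: "F" | "M"; accent: "US" | "UK" }> = {
  lessac: { model: "en_US-lessac-medium", gender: "F", accent: "US" },
  ryan:   { model: "en_US-ryan-medium",   gender: "M", accent: "US" },
  alba:   { model: "en_GB-alba-medium",   gender: "F", accent: "UK" },
};

// Resolve the bundled model path for a picked voice
// (assets/tts-models/ per the Notes section; extension is assumed).
function modelPathFor(id: VoiceId): string {
  return `assets/tts-models/${PIPER_VOICES[id].model}.onnx`;
}
```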
Voice Flow
User taps mic button
│
▼
handleVoiceToggle()
│
┌─────┴─────┐
│ isListening?│
└─────┬─────┘
│
NO │ YES
│ │ │
│ │ ▼
│ │ stopListening()
│ │ handleVoiceSend()
│ │
▼ │
startListening()
│
▼
Speech Recognition active
(recognizedText updates)
│
▼
User stops speaking / taps again
│
▼
handleVoiceSend()
│
▼
sendMessage(recognizedText)
│
▼
AI responds
│
▼
speakText(response)
│
▼
SherpaTTS plays audio
Implementation Steps
Phase 1: Setup (DONE)
- Add dependencies to package.json
- Create sherpaTTS.ts service
- Create useSpeechRecognition.ts hook
- Add voice imports to chat.tsx
- Add voice states (isListening, isSpeaking, ttsInitialized, pulseAnim)
Phase 2: Logic (DONE)
- Implement handleVoiceToggle()
- Implement handleVoiceSend()
- Implement speakText()
- Implement stopSpeaking()
- TTS initialization on component mount
- Auto-speak AI responses
Phase 3: UI (DONE)
- Add microphone button to input area
- Add voice status indicator (Listening.../Speaking...)
- Add stop button for speech
- Add pulse animation for listening state
- Add styles for voice UI elements
Phase 4: Build & Test (IN PROGRESS)
- Run npm install
- Run expo prebuild --clean
- Build iOS (native modules required)
- Test on iOS simulator
- Test on Android (emulator or device)
Phase 5: Polish (TODO)
- Handle permissions properly (microphone access)
- Add voice picker UI
- Add speech rate control
- Test edge cases (no network, no mic permission)
Files Modified/Created
| File | Status | Description |
|---|---|---|
| package.json | Modified | Added voice dependencies |
| services/sherpaTTS.ts | Created | SherpaTTS service for offline TTS |
| hooks/useSpeechRecognition.ts | Created | Speech recognition hook |
| app/(tabs)/chat.tsx | Modified | Voice integration in chat |
Testing Checklist
Manual Testing
- Tap mic button - starts listening
- Speak - text appears in input field
- Tap again - sends message
- AI responds - voice speaks response
- Tap stop - speech stops immediately
- Mic button disabled during sending
- Visual indicators show correct state
Edge Cases
- No microphone permission - shows alert
- TTS not available - falls back to expo-speech
- Empty speech recognition - doesn't send
- Long AI response - speech handles gracefully
- Interrupt speech and start new input
Notes
SherpaTTS Cross-Platform Support
- iOS: Uses native module via bridged ObjC/Swift
- Android: Uses native module via JNI/Kotlin
- Model files: Must be bundled in app (assets/tts-models/)
- Size: ~20MB per voice model
Known Limitations
- Speech recognition requires device microphone permission
- SherpaTTS requires native build (not Expo Go)
- Model download may be needed on first launch
Voice Interaction Scenarios (All Cases)
State Machine
┌─────────────────────────────────────────────────────────────┐
│ VOICE STATE MACHINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ │
│ │ IDLE │◄────────────────────────────────────┐ │
│ └────┬─────┘ │ │
│ │ tap mic │ │
│ ▼ │ │
│ ┌──────────┐ │ │
│ │LISTENING │───── user stops / tap ─────────┐ │ │
│ └────┬─────┘ │ │ │
│ │ recognized text │ │ │
│ ▼ │ │ │
│ ┌──────────┐ ▼ │ │
│ │PROCESSING│─────────────────────────► SENDING │ │
│ └────┬─────┘ │ │ │
│ │ AI responded │ │ │
│ ▼ │ │ │
│ ┌──────────┐ │ │ │
│ │ SPEAKING │◄─────────────────────────────┘ │ │
│ └────┬─────┘ │ │
│ │ finished / user tap stop │ │
│ └───────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
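The diagram above can be sketched as a pure transition function, which makes the scenario tables below directly unit-testable. This is a simplification: SENDING and PROCESSING are collapsed into one state here, and the event names are assumptions, not identifiers from chat.tsx.

```typescript
// Pure state machine for the voice flow; no timers or native calls,
// so it can be tested in isolation.
type VoiceState = "IDLE" | "LISTENING" | "PROCESSING" | "SPEAKING";
type VoiceEvent =
  | "TAP_MIC"       // user taps the microphone button
  | "RECOGNIZED"    // STT finished with text (user stopped speaking)
  | "AI_RESPONDED"  // response arrived, TTS starts
  | "TTS_DONE"      // playback finished naturally
  | "TAP_STOP";     // user taps the stop button

function transition(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (state) {
    case "IDLE":
      return event === "TAP_MIC" ? "LISTENING" : state;
    case "LISTENING":
      // tap-again and end-of-speech both hand off to sending/processing
      return event === "TAP_MIC" || event === "RECOGNIZED" ? "PROCESSING" : state;
    case "PROCESSING":
      return event === "AI_RESPONDED" ? "SPEAKING" : state;
    case "SPEAKING":
      // B1: TAP_MIC is deliberately ignored while the AI is speaking
      return event === "TTS_DONE" || event === "TAP_STOP" ? "IDLE" : state;
  }
}
```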
A. Happy Path Scenarios
| # | Scenario | Expected Behavior | Status |
|---|---|---|---|
| A1 | User taps mic → speaks → taps again | Text recognized → sent → AI responds → spoken | ✅ |
| A2 | User listens to full AI response | TTS finishes → returns to IDLE | ✅ |
| A3 | User stops TTS with stop button | TTS interrupted → can tap mic again | ✅ |
| A4 | User types text manually | Message sent → AI responds → spoken | ✅ |
B. Interruptions & Conflicts
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| B1 | Tap mic while AI speaking | Mic would hear TTS | Block mic while isSpeaking | ✅ DONE |
| B2 | AI speaking, user wants to stop | No way to interrupt | Stop button (red) | ✅ DONE |
| B3 | User speaking, changes mind | Need to cancel without sending | Tap again = cancel (no text = don't send) | ✅ DONE |
| B4 | AI speaking, user switches tab | Should TTS stop? | Stop TTS on blur | ⚠️ TODO |
| B5 | App goes to background during TTS | TTS continues in background? | Platform-specific behavior | ⚠️ TODO |
| B6 | Double/triple tap on mic | States get confused | Debounce + transition lock | ⚠️ TODO |
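The B6 debounce + transition lock could be as small as a tap gate that drops presses arriving inside a cooldown window, so rapid taps can never race the state machine. The 400 ms window is an assumption to tune on device.

```typescript
// B6 sketch: reject taps that arrive within cooldownMs of the last
// accepted tap. handleVoiceToggle() would call the gate first and
// return early when it yields false.
function makeTapGate(cooldownMs = 400) {
  let lastAccepted = -Infinity;
  return (nowMs: number): boolean => {
    if (nowMs - lastAccepted < cooldownMs) return false; // rapid tap: ignore
    lastAccepted = nowMs;
    return true;
  };
}
```

Passing the timestamp in (rather than reading `Date.now()` inside) keeps the gate deterministic and testable.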
C. Speech Recognition Errors (STT)
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| C1 | No microphone permission | Speech recognition fails | Show permission alert + Open Settings | ✅ DONE |
| C2 | Microphone busy (other app) | Can't start recording | Show "Microphone busy" error | ⚠️ TODO |
| C3 | User silent for 5+ seconds | No text to send | Auto-cancel with hint | ⚠️ TODO |
| C4 | Speech recognition returns empty | Nothing recognized | Show "Didn't catch that" + auto-hide | ✅ DONE |
| C5 | Network unavailable (Android) | Recognition doesn't work | Expo STT needs network on Android | ⚠️ NOTE |
| C6 | Unsupported language | Recognition works poorly | Hardcode 'en-US' | ✅ DONE |
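For C3, the silence timeout reduces to a pure check that any timer can run: auto-cancel when the transcript has not changed for the limit. The 5000 ms default matches the scenario; wiring it to the recognizer's events is left to the implementation.

```typescript
// C3 sketch: true when the user has been silent (no transcript change)
// for at least silenceLimitMs. The caller records the timestamp of the
// last recognizedText update and polls this from an interval.
function shouldAutoCancel(
  lastTranscriptChangeMs: number,
  nowMs: number,
  silenceLimitMs = 5000,
): boolean {
  return nowMs - lastTranscriptChangeMs >= silenceLimitMs;
}
```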
D. Text-to-Speech Errors (TTS)
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| D1 | SherpaTTS not initialized | Model not loaded | Fallback to expo-speech | ⚠️ TODO |
| D2 | SherpaTTS crashes mid-playback | Speech interrupted | Handle error, reset state | ⚠️ TODO |
| D3 | Very long AI response | TTS plays for 2+ minutes | Show progress or split | ⚠️ TODO |
| D4 | TTS model not downloaded | First launch without network | Bundle model or pre-download | ⚠️ NOTE |
| D5 | Voice sounds bad | Model quality issue | Voice picker (Lessac/Ryan/Alba) | ⚠️ TODO |
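The D1 fallback can be a small chain: try SherpaTTS first, degrade to expo-speech when it is uninitialized or throws. Both engines are injected so the chain is testable without native modules; the real call sites in `speakText()` are assumptions.

```typescript
// D1 sketch: graceful degradation from the neural engine to the
// platform engine. Returns which engine actually spoke, so the UI
// can surface reduced quality if desired.
type SpeakFn = (text: string) => Promise<void>;

async function speakWithFallback(
  text: string,
  sherpaSpeak: SpeakFn,
  expoSpeak: SpeakFn,
): Promise<"sherpa" | "expo-speech"> {
  try {
    await sherpaSpeak(text);
    return "sherpa";
  } catch {
    await expoSpeak(text); // SherpaTTS unavailable or crashed (D1/D2)
    return "expo-speech";
  }
}
```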
E. UI Edge Cases
| # | Scenario | Problem | Solution | Status |
|---|---|---|---|---|
| E1 | TextInput focused + tap mic | Keyboard in the way | Hide keyboard when listening | ⚠️ TODO |
| E2 | User typing + taps mic | What to do with typed text? | Keep or replace? | ⚠️ TODO |
| E3 | Scroll chat during TTS | Unclear which message is playing | Highlight speaking message | ⚠️ TODO |
| E4 | Multiple messages queued | Which one to speak? | Only latest AI message | ✅ DONE |
| E5 | AI responds in chunks (streaming) | When to start TTS? | After full response | ✅ DONE |
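The E4/E5 rule ("only the latest AI message, after the full response") can be isolated as a pure selector over the chat history. The `ChatMessage` shape is an assumption for the sketch, not the type used in chat.tsx.

```typescript
// E4 sketch: speak only when the newest entry in the history is an
// AI message; anything earlier is never (re-)spoken.
type ChatMessage = { role: "user" | "assistant"; text: string };

function messageToSpeak(messages: ChatMessage[]): string | null {
  const last = messages[messages.length - 1];
  return last && last.role === "assistant" ? last.text : null;
}
```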
F. Permission Scenarios
| # | Scenario | Action | Status |
|---|---|---|---|
| F1 | First launch - no permission | Show custom UI → request | ⚠️ TODO |
| F2 | Permission denied before | Open Settings app | ⚠️ TODO |
| F3 | Permission "Ask Every Time" (iOS) | Request each time | ⚠️ TODO |
| F4 | Permission revoked during session | Graceful degradation | ⚠️ TODO |
Implementation Priority
🔴 Critical (voice won't work without these):
- B1: Block mic during speaking ✅ DONE
- B2: Stop button ✅ DONE
- C1: Permission handling
- D1: TTS fallback
🟡 Important (UX suffers without these):
- B3: Cancel recording without sending
- C3: Timeout on silence
- C4: "Didn't catch that" feedback
- E1: Hide keyboard
- E3: Visual indicator for speaking message
🟢 Nice to have:
- B4-B5: Background behavior
- E5: Streaming TTS
- Voice picker UI
Related
- Main WellNuo voice.tsx (reference implementation)
- expo-speech-recognition docs
- sherpa-onnx