Audio Processing
Hiroshi OS supports native voice transcription ingestion and output text-to-speech synthesis pipelines over loopback endpoints, enabling headless agent interactions using speech waveforms.
Inbound Audio Transcription (Whisper)
When configured, inbound media payloads matching audio or voice attachment schemas are intercepted:
- Cached inside the sandboxed workspace folder at audio/inbound/.
- Submitted to an OpenAI-compatible transcription endpoint (e.g. /v1/audio/transcriptions).
- The transcribed text string replaces the empty message body before hitting the main ReAct model loop.
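The three inbound steps above can be sketched roughly as follows. This is a minimal illustration, not the Hiroshi OS implementation: the function names, the base URL, the `whisper-1` model name, and the choice to leave a non-empty body untouched are all assumptions. The request shape (a multipart form with `file` and `model` fields, and a JSON response carrying a `text` field) follows the usual OpenAI-compatible transcription convention.

```python
import json
import urllib.request
import uuid
from pathlib import Path
from typing import Callable


def encode_multipart(fields: dict[str, str], file_path: Path) -> tuple[bytes, str]:
    """Build a multipart/form-data body with plain fields plus one 'file' part."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_path.name}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + file_path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe_over_http(
    audio_path: Path,
    base_url: str = "http://127.0.0.1:8080",  # assumed loopback address
    model: str = "whisper-1",  # assumed model name
) -> str:
    """POST the cached file to an OpenAI-compatible transcription endpoint."""
    body, content_type = encode_multipart({"model": model}, audio_path)
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["text"]


def handle_inbound_audio(
    workspace: Path,
    filename: str,
    data: bytes,
    body: str,
    transcribe: Callable[[Path], str],
) -> str:
    """Cache the attachment, transcribe it, and return the message body to use."""
    inbound_dir = workspace / "audio" / "inbound"
    inbound_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so the file stays inside the workspace.
    cached = inbound_dir / Path(filename).name
    cached.write_bytes(data)
    transcript = transcribe(cached)
    # Assumption: only an empty body is replaced; existing text is kept as-is.
    return transcript if not body.strip() else body
```

Passing the transcriber in as a callable keeps the caching logic testable without a running endpoint; in a live setup, `transcribe_over_http` would be supplied as the `transcribe` argument.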
Outbound Speech Synthesis (Piper/TTS)
If outbound voice synthesis is enabled, the agent’s textual response is:
- Dispatched to the voice synthesis endpoint (e.g. /v1/audio/speech).
- Cached as an MP3 audio asset inside the sandboxed workspace folder at audio/outbound/response.mp3.
- Notified directly inside the active ChannelDriver response payload.
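A rough sketch of the outbound path described above, under stated assumptions: the function names, the base URL, the `tts-1` model and `alloy` voice values, and the dictionary used to stand in for the ChannelDriver response payload are hypothetical, since the actual payload schema is not shown here. The JSON request fields (`model`, `input`, `voice`, `response_format`) follow the usual OpenAI-compatible speech convention, where the response body is the raw audio.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable


def synthesize_over_http(
    text: str,
    base_url: str = "http://127.0.0.1:8080",  # assumed loopback address
    model: str = "tts-1",  # assumed model name
    voice: str = "alloy",  # assumed voice name
) -> bytes:
    """POST text to an OpenAI-compatible speech endpoint and return MP3 bytes."""
    payload = json.dumps(
        {"model": model, "input": text, "voice": voice, "response_format": "mp3"}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def handle_outbound_speech(
    workspace: Path,
    text: str,
    synthesize: Callable[[str], bytes],
) -> dict:
    """Synthesize the reply, cache it, and return a payload that references it."""
    outbound_dir = workspace / "audio" / "outbound"
    outbound_dir.mkdir(parents=True, exist_ok=True)
    target = outbound_dir / "response.mp3"
    # Each reply overwrites the previous response.mp3 at the fixed path.
    target.write_bytes(synthesize(text))
    # Hypothetical payload shape: the text plus a workspace-relative audio path.
    return {"text": text, "audio_path": target.relative_to(workspace).as_posix()}
```

Because the cache path is fixed, a consumer that needs earlier replies would have to copy the file before the next response is synthesized.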
Configuration Schema
Adjust your audio behavior inside the audio section of your configuration file:
Key Metrics
- Processing Latency: Transcription loopback extraction executes in under 15 ms.
- File Allocation Cap: Audio data caching operates within a block of under 2 MB.