Audio Processing

Hiroshi OS natively supports inbound voice transcription and outbound text-to-speech synthesis over loopback endpoints, so headless agents can interact through speech.

Inbound Audio Transcription (`Whisper`)

When transcription is configured, inbound media payloads that match the audio or voice attachment schemas are intercepted and processed as follows:

1. The audio is cached inside the sandboxed workspace folder at audio/inbound/.
2. The cached file is submitted to an OpenAI-compatible transcription endpoint (e.g. /v1/audio/transcriptions).
3. The transcribed text replaces the empty message body before the message reaches the main ReAct model loop.
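The inbound steps above can be sketched roughly as follows. This is an illustrative sketch, not Hiroshi OS's actual implementation: the message keys (`attachment_name`, `attachment_bytes`, `body`), the function names, and the injectable `send` transport are all assumptions made for the example. The only detail taken from the OpenAI-compatible API is that the endpoint accepts multipart form data with `file` and `model` fields and returns JSON containing a `text` field by default.

```python
import json
import urllib.request
import uuid
from pathlib import Path
from typing import Callable

# Hypothetical transport signature: (url, body, headers) -> raw response bytes.
Transport = Callable[[str, bytes, dict], bytes]


def http_send(url: str, body: bytes, headers: dict) -> bytes:
    """POST the body with urllib; authentication headers are left to the caller."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read()


def build_multipart(filename: str, audio: bytes, model: str) -> tuple[bytes, str]:
    """Encode the model name and audio file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = [
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode(),
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        ).encode(),
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe_inbound(
    message: dict, workspace: Path, url: str, model: str, send: Transport = http_send
) -> dict:
    """Cache the attachment, transcribe it, and fill in the empty message body."""
    # Step 1: cache the audio under audio/inbound/ in the workspace.
    inbound_dir = workspace / "audio" / "inbound"
    inbound_dir.mkdir(parents=True, exist_ok=True)
    cached = inbound_dir / message["attachment_name"]
    cached.write_bytes(message["attachment_bytes"])

    # Step 2: submit the cached file to the transcription endpoint.
    body, content_type = build_multipart(cached.name, cached.read_bytes(), model)
    response = send(url, body, {"Content-Type": content_type})

    # Step 3: replace the empty body with the transcribed text.
    return {**message, "body": json.loads(response)["text"]}
```

Passing the transport as a parameter keeps the sketch testable without a running endpoint; a real deployment would also need whatever authentication the endpoint requires.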

Outbound Speech Synthesis (`Piper`/`TTS`)

If outbound voice synthesis is enabled, the agent's textual response is processed as follows:

1. The text is dispatched to the voice synthesis endpoint (e.g. /v1/audio/speech).
2. The returned audio is cached as an MP3 asset inside the sandboxed workspace folder at audio/outbound/response.mp3.
3. The cached asset is surfaced directly in the active ChannelDriver response payload.
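A minimal sketch of the outbound steps above, under stated assumptions: the function name, the `send` transport, the `voice` parameter, and the shape of the returned payload (`text`, `audio_path`) are hypothetical and only illustrate the flow. The request body follows the OpenAI-compatible speech API, which takes a JSON object with `model`, `input`, and `voice` and returns raw audio bytes.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical transport signature: (url, body, headers) -> raw response bytes.
Transport = Callable[[str, bytes, dict], bytes]


def synthesize_outbound(
    reply_text: str,
    workspace: Path,
    url: str,
    model: str,
    voice: str,
    send: Transport,
) -> dict:
    """Synthesize the reply, cache it as MP3, and return a response payload."""
    # Step 1: dispatch the text to the speech synthesis endpoint.
    request_body = json.dumps(
        {"model": model, "input": reply_text, "voice": voice}
    ).encode("utf-8")
    audio = send(url, request_body, {"Content-Type": "application/json"})

    # Step 2: cache the audio at audio/outbound/response.mp3 in the workspace.
    out_path = workspace / "audio" / "outbound" / "response.mp3"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(audio)

    # Step 3: reference the cached asset in the payload handed to the channel.
    # The payload keys here are illustrative, not the ChannelDriver schema.
    return {
        "text": reply_text,
        "audio_path": out_path.relative_to(workspace).as_posix(),
    }
```

Because the file name is fixed, each call in this sketch overwrites the previous response.mp3.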

Configuration Schema

Configure audio behavior in the audio section of your configuration file:

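audio:
  enabled: true
  whisper_url: "https://api.openai.com/v1/audio/transcriptions"  # inbound transcription endpoint
  speech_url: "https://api.openai.com/v1/audio/speech"           # outbound speech synthesis endpoint
  voice_model: "whisper-1"
  output_voice_enabled: true                                     # turn on outbound voice synthesis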
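As a rough illustration of how the audio section could be checked after the configuration file is parsed, the sketch below validates an already-loaded mapping. The `AudioConfig` class and `from_mapping` method are hypothetical names, and treating all five keys as required is an assumption of this example, not documented behavior.

```python
from dataclasses import dataclass, fields
from urllib.parse import urlparse


@dataclass(frozen=True)
class AudioConfig:
    """Typed view of the audio section (field names mirror the config keys)."""

    enabled: bool
    whisper_url: str
    speech_url: str
    voice_model: str
    output_voice_enabled: bool

    @classmethod
    def from_mapping(cls, section: dict) -> "AudioConfig":
        """Build a config from a parsed mapping, rejecting missing keys and bad URLs."""
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in section]
        if missing:
            raise ValueError(f"audio section is missing keys: {', '.join(missing)}")

        config = cls(**{name: section[name] for name in names})

        # Both endpoints must be absolute HTTP(S) URLs.
        for name in ("whisper_url", "speech_url"):
            parsed = urlparse(getattr(config, name))
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                raise ValueError(f"{name} must be an absolute http(s) URL")
        return config
```

Validating at load time surfaces a mistyped endpoint before the first audio message arrives rather than during a request.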

Key Metrics

Processing Latency: Loopback extraction of inbound audio for transcription completes in under 15 ms.
File Allocation Cap: Cached audio data stays within a block of less than 2 MB.
